1.
General-purpose processors (GPPs) are an important platform for the fast Fourier transform (FFT) because of their flexibility, reliability, and practicality. Because the FFT is a representative application that is intensive in both computation and memory access, optimizing the FFT performance of a GPP also benefits many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP and also guides architecture optimization for GPPs. Based on the model, two theorems on the optimization of architecture parameters are deduced, concerning the lower bounds of the register count and the memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model. These investigations were adopted in the development of Godson-3B, an industrial GPP. The optimization techniques deduced from the performance model improve FFT performance by about 40% while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 W of power consumption and, as far as we know, has the highest performance-per-watt in complex FFT among processors. This work could benefit the optimization of other GPPs as well.
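To put the quoted figures in perspective, the sketch below converts the reported runtime into a throughput number. It assumes the conventional flop estimate of 5·N·log2(N) for a complex FFT of size N, which is a common benchmarking convention and not something the abstract itself states. The function name `fft_flops` and the variable names are illustrative only.

```python
import math


def fft_flops(n: int) -> float:
    """Nominal flop count for a complex FFT of size n.

    Uses the common 5 * n * log2(n) benchmarking convention (an assumption,
    not a figure taken from the abstract).
    """
    return 5.0 * n * math.log2(n)


# Figures quoted in the abstract.
n_points = 1024
runtime_s = 0.368e-6   # 0.368 microseconds
power_w = 40.0         # "about 40 Watt"

flops = fft_flops(n_points)
gflops = flops / runtime_s / 1e9
gflops_per_watt = gflops / power_w

print(f"nominal flops per transform: {flops:.0f}")
print(f"throughput: {gflops:.1f} GFLOPS")
print(f"efficiency: {gflops_per_watt:.2f} GFLOPS/W")
```

Under this convention the transform costs 51,200 nominal flops, so the quoted runtime corresponds to roughly 139 GFLOPS, or about 3.5 GFLOPS per watt.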
2.
The Complexity of Poor Man's Logic (total citations: 1, self-citations: 0, citations by others: 1)
3.
4.
5.
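With the spread of high-performance embedded processors and the steady decline in hardware costs, embedded systems are becoming increasingly capable. The widespread use of embedded technology in production has heightened practical demand. This paper introduces a cost-effective Blackfin embedded processor and presents two implementations of a TFT LCD driver based on it.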
6.
7.
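GPGPU Performance Model and Analysis of Application Examples (total citations: 1, self-citations: 1, citations by others: 1)
The high performance of modern graphics processing units (GPUs) has attracted a large number of non-graphics applications. To enable effective performance prediction and optimization, this paper proposes a performance model for general-purpose computation on GPUs. By analyzing the parallel architecture and operating principles of modern GPUs, the model divides general-purpose GPU computation into four concurrent stages: data fetch, computation, output, and transfer. Each stage is analyzed quantitatively from the program's characteristics and the hardware specifications to produce a performance prediction. Experimental analysis identifies two main performance factors, computational intensity and access density, which serve as the basic criteria for performance optimization. The model is used to analyze GPU implementations of several common image and video processing algorithms, including Gaussian convolution, the discrete cosine transform, and motion estimation. Experimental results show that, by increasing computational intensity and access density, the proposed optimizations significantly reduce execution time on the GPU and improve computational efficiency by 4 to 10 times, demonstrating the model's effectiveness for performance prediction and optimization.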
8.
9.
A concise, complete and reliable algorithm for triangulating an arbitrary polygon is presented as a powerful core processor of the PLAN-I (Production Layout Automation Nucleus) system. A polygon with inner loops is converted into a non-self-intersecting polygon (NIP) by adding 'Bridge Edges'. A triangular splitting algorithm for NIPs is described in detail and successfully implemented. The program is about 300 lines of FORTRAN-77. Sweeping and local operations on the triangular-faceted boundary representation are designed with the algorithm to improve the solid-modeling functions of PLAN-I. The triangulation algorithm can also be applied to Boolean operations and other issues.
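The abstract does not describe the paper's triangular splitting algorithm, so the sketch below substitutes a different, well-known technique, ear clipping, purely to illustrate what triangulating a non-self-intersecting polygon involves. It assumes a simple polygon with no holes and no duplicated vertices (so it does not handle the bridge-edge construction), and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def signed_area(poly: Sequence[Point]) -> float:
    """Shoelace formula; positive when vertices run counter-clockwise."""
    n = len(poly)
    return 0.5 * sum(
        poly[i][0] * poly[(i + 1) % n][1] - poly[(i + 1) % n][0] * poly[i][1]
        for i in range(n)
    )


def in_triangle(p: Point, a: Point, b: Point, c: Point) -> bool:
    """True if p lies inside or on the boundary of CCW triangle (a, b, c)."""
    return cross(a, b, p) >= 0 and cross(b, c, p) >= 0 and cross(c, a, p) >= 0


def triangulate(poly: Sequence[Point]) -> List[Tuple[int, int, int]]:
    """Ear clipping: return triangles as index triples into poly."""
    idx = list(range(len(poly)))
    if signed_area(poly) < 0:
        idx.reverse()  # work in counter-clockwise order
    triangles = []
    while len(idx) > 3:
        for k in range(len(idx)):
            i, j, l = idx[k - 1], idx[k], idx[(k + 1) % len(idx)]
            a, b, c = poly[i], poly[j], poly[l]
            if cross(a, b, c) <= 0:
                continue  # reflex or degenerate corner, cannot be an ear
            if any(in_triangle(poly[m], a, b, c) for m in idx if m not in (i, j, l)):
                continue  # another vertex blocks this candidate ear
            triangles.append((i, j, l))
            del idx[k]  # clip the ear tip
            break
        else:
            raise ValueError("no ear found; polygon may be degenerate or self-intersecting")
    triangles.append((idx[0], idx[1], idx[2]))
    return triangles


# Usage: an L-shaped hexagon with area 3.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
tris = triangulate(l_shape)
total_area = sum(abs(cross(l_shape[i], l_shape[j], l_shape[k])) / 2 for i, j, k in tris)
print(len(tris), total_area)
```

A polygon with n vertices always yields n - 2 triangles, and their areas sum to the polygon's area, which gives a quick sanity check on any triangulation routine.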
10.
11.
12.
IEEE Micro, 2006, 26(5): 42-51
A software-configurable processor combines a traditional RISC processor with a field-programmable instruction extension unit that lets the system designer tailor the processor to a particular application. To add application-specific instructions to the processor, the programmer adds a pragma before a C or C++ function declaration, and the compiler then turns the function into a single instruction.
13.
Switzerland's first 16-bit processor design may lead to a system capable of easing HLL implementation. It has already led to a better understanding of complex ICs.
14.
A 5-GHz Mesh Interconnect for a Teraflops Processor (total citations: 3, self-citations: 0, citations by others: 3)
A multicore processor in 65-nm technology with 80 single-precision, floating-point cores delivers performance in excess of a teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric to connect the cores. The network delivers a bisection bandwidth of 2.56 terabits per second and a per-hop fall-through latency of 1 nanosecond.
15.
An innovative algorithm for syntactic analysis could be the first step toward placing grammar on a chip.
16.
Customizable FPGA IP Core Implementation of a General-Purpose Genetic Algorithm Engine (total citations: 2, self-citations: 0, citations by others: 2)
Fernando P. R., Katkoori S., Keymeulen D., Zebulum R., Stoica A. IEEE Transactions on Evolutionary Computation, 2010, 14(1): 133-149
17.
A General-Purpose Many-Accelerator Architecture Based on Dataflow Graph Clustering of Applications
The combination of growing transistor counts and a limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"): only a small fraction of the chip can run at full speed at any given time. Designing accelerators for specific applications or algorithms is considered one of the most promising approaches to improving energy efficiency. However, most current accelerator design methods are dedicated to certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. Firstly, we propose to cluster the dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerator design, because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 hotspot DFGs using the LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Secondly, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks by 6.2x on average.
18.
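This paper presents a general-purpose neural network simulation system. Computation in the system runs on a ring of multiple Transputer processors. The matrix-operation algorithm is based on the work of S. Y. Kung, with improvements. To fully exploit the computing power of the Transputers, data is transferred between processors in blocks and the number of communications is kept to a minimum. The system also provides a friendly human-machine interface, so users with no knowledge of Transputer systems can use it directly in a DOS environment. The system is therefore not only very fast but also easy to use.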
19.
Simple and efficient, this algorithm for Boolean shape operations makes use of the continuity of a shape. It is fastest when special hardware performs the repetitious geometric computations.