Similar articles
 20 similar articles found
1.
General-purpose processors (GPPs) are an important platform for the fast Fourier transform (FFT) because of their flexibility, reliability and practicality. FFT is a representative application that is intensive in both computation and memory access, so optimizing the FFT performance of a GPP also benefits many other applications. To facilitate the analysis of FFT, this paper proposes a theoretical model of FFT processing. The model gives a tight lower bound on the runtime of FFT on a GPP and also guides architecture optimization for GPPs. Based on the model, two theorems on the optimization of architecture parameters are deduced, which give lower bounds on the number of registers and on memory bandwidth. Experimental results on different processor architectures (including Intel Core i7 and Godson-3B) validate the performance model. These investigations were adopted in the development of Godson-3B, an industrial GPP. The optimization techniques deduced from the performance model improve FFT performance by about 40% while incurring only 0.8% additional area cost. Consequently, Godson-3B solves the 1024-point single-precision complex FFT in 0.368 μs with about 40 W power consumption and, as far as we know, has the highest performance-per-watt in complex FFT among processors. This work could benefit the optimization of other GPPs as well.
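The runtime and power figures quoted in this abstract can be turned into a rough throughput estimate. The sketch below assumes the common nominal operation count of 5·N·log2(N) floating-point operations for an N-point complex FFT; that convention is an assumption made here for illustration and is not stated in the abstract, so the resulting numbers are indicative only.

```python
import math


def nominal_fft_flops(n):
    """Nominal flop count for an n-point complex FFT.

    Assumption: the widely used 5 * n * log2(n) convention,
    which the abstract itself does not state.
    """
    return 5 * n * math.log2(n)


# Figures taken from the abstract.
FFT_POINTS = 1024          # 1024-point single-precision complex FFT
FFT_RUNTIME_S = 0.368e-6   # 0.368 microseconds
FFT_POWER_W = 40.0         # "about 40 W"

fft_flops = nominal_fft_flops(FFT_POINTS)
fft_gflops = fft_flops / FFT_RUNTIME_S / 1e9
fft_gflops_per_watt = fft_gflops / FFT_POWER_W

print(f"nominal flops per transform: {fft_flops:.0f}")
print(f"sustained throughput: {fft_gflops:.1f} GFLOPS")
print(f"efficiency: {fft_gflops_per_watt:.2f} GFLOPS/W")
```

Under that assumed convention, the quoted runtime corresponds to roughly 139 GFLOPS, or about 3.5 GFLOPS per watt.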

2.
The Complexity of Poor Man's Logic   Cited by: 1 (self-citations: 0, citations by others: 1)

3.
4.
5.
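As high-performance embedded processors become widespread and hardware costs keep falling, embedded systems are becoming increasingly capable. The wide use of embedded technology in production has increased real-world demand. This paper introduces a cost-effective Blackfin embedded processor and presents two implementation schemes for a TFT LCD driver based on this processor.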

6.
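Design of a General-Purpose Report Generation Tool   Cited by: 1 (self-citations: 0, citations by others: 1)
From the perspective of program processing, this paper defines report generation as (Sm,En)-T and describes a design method for a general-purpose report generation tool built on this definition.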

7.
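GPGPU Performance Model and Analysis of Application Examples   Cited by: 1 (self-citations: 1, citations by others: 1)
The high performance of modern graphics processing units (GPUs) has attracted a large number of non-graphics applications. To support effective performance prediction and optimization, this paper proposes a performance model for general-purpose computation on GPUs. By analyzing the parallel architecture and operating principles of modern GPUs, the model divides general-purpose GPU computation into four parallel stages: data fetch, computation, output and transfer. Each stage is analyzed quantitatively using program characteristics and hardware specifications to produce a performance prediction. Experimental analysis identifies two main performance factors, computational intensity and access density, which serve as the basic guidelines for optimization. The model is used to analyze GPU implementations of several common image and video processing algorithms, including Gaussian convolution, the discrete cosine transform and motion estimation. Experimental results show that, by increasing computational intensity and access density, the proposed optimizations significantly reduce execution time on the GPU and improve computational efficiency by a factor of 4 to 10, which demonstrates the model's effectiveness for performance prediction and optimization.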

8.
9.
A concise, complete and reliable algorithm for triangulating an arbitrary polygon is presented as a powerful core processor of the PLAN-I (Production Layout Automation Nucleus) system. A polygon with inner loops is converted into a non-self-intersecting polygon (NIP) by adding 'Bridge Edges'. A triangular splitting algorithm for NIPs is described in detail and successfully implemented. The program is about 300 lines of FORTRAN-77. Sweeping and local operations on the triangular-faceted boundary representation are designed with the algorithm to improve the solid modeling functions of PLAN-I. The triangulation algorithm can also be applied to Boolean operations and other problems.
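The abstract does not spell out its triangular splitting algorithm, so the sketch below swaps in a different, well-known technique, ear clipping, purely to illustrate what triangulating a non-self-intersecting polygon involves. It is not the PLAN-I method. It assumes a simple polygon with no holes and no repeated vertices (so it does not cover the bridge-edge step), and all function names are hypothetical.

```python
def signed_area(poly):
    """Shoelace formula; positive when vertices are in counter-clockwise order."""
    closed = poly[1:] + poly[:1]
    return 0.5 * sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(poly, closed))


def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def in_triangle(p, a, b, c):
    """True if p lies inside or on the boundary of CCW triangle (a, b, c)."""
    return cross(a, b, p) >= 0 and cross(b, c, p) >= 0 and cross(c, a, p) >= 0


def triangulate(poly):
    """Ear clipping for a simple polygon given as a list of (x, y) tuples."""
    pts = list(poly)
    if signed_area(pts) < 0:
        pts.reverse()  # normalize to counter-clockwise order
    triangles = []
    while len(pts) > 3:
        n = len(pts)
        for i in range(n):
            a, b, c = pts[i - 1], pts[i], pts[(i + 1) % n]
            if cross(a, b, c) <= 0:
                continue  # reflex or degenerate corner, cannot be an ear
            if any(in_triangle(p, a, b, c) for p in pts if p not in (a, b, c)):
                continue  # another vertex blocks this candidate ear
            triangles.append((a, b, c))
            del pts[i]  # clip the ear and rescan the smaller polygon
            break
        else:
            raise ValueError("no ear found; input may be self-intersecting")
    triangles.append(tuple(pts))
    return triangles


# Hypothetical test shape: an L-shaped (concave) hexagon of area 3.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
l_triangles = triangulate(l_shape)
l_area = sum(abs(signed_area(list(t))) for t in l_triangles)
print(len(l_triangles), l_area)
```

A simple polygon with n vertices always yields n - 2 triangles, and the triangle areas should sum to the polygon's area, which gives a quick sanity check on any triangulation routine.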

10.
程锦松 《微机发展》1996,6(4):35-37
This paper discusses a processor scheduling algorithm for distributed systems in which adjacent processors cannot work at the same time.

11.
12.
《Micro, IEEE》2006,26(5):42-51
A software-configurable processor combines a traditional RISC processor with a field-programmable instruction extension unit that lets the system designer tailor the processor to a particular application. To add application-specific instructions to the processor, the programmer adds a pragma before a C or C++ function declaration, and the compiler then turns the function into a single instruction.

13.
Sanchez  E. Sommer  P. Menu  J. Iseli  C. 《Micro, IEEE》1987,7(6):29-40
Switzerland's first 16-bit processor design may lead to a system capable of easing HLL implementation. It has already led to a better understanding of complex ICs.

14.
A 5-GHz Mesh Interconnect for a Teraflops Processor   Cited by: 3 (self-citations: 0, citations by others: 3)
A multicore processor in 65-nm technology with 80 single-precision floating-point cores delivers performance in excess of a teraflops while consuming less than 100 W. A 2D on-die mesh interconnection network operating at 5 GHz provides the high-performance communication fabric that connects the cores. The network delivers a bisection bandwidth of 2.56 terabits per second and a per-hop fall-through latency of 1 nanosecond.
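The headline figures in this abstract imply a few derived quantities that are easy to check by hand. The sketch below only rearranges the quoted numbers; because the abstract gives performance as a lower bound (more than one teraflops) and power as an upper bound (less than 100 W), the per-core and per-watt results are lower bounds as well. The variable names are illustrative.

```python
# Figures taken from the abstract.
CORES = 80
PEAK_FLOPS = 1.0e12             # "in excess of a teraflops" (lower bound)
CHIP_POWER_W = 100.0            # "less than 100 W" (upper bound)
MESH_CLOCK_HZ = 5.0e9           # mesh network clock
HOP_LATENCY_S = 1.0e-9          # per-hop fall-through latency
BISECTION_BITS_PER_S = 2.56e12  # bisection bandwidth

# Derived quantities.
per_core_gflops = PEAK_FLOPS / CORES / 1e9                   # at least this much per core
chip_gflops_per_watt = PEAK_FLOPS / CHIP_POWER_W / 1e9       # at least this efficient
cycles_per_hop = HOP_LATENCY_S * MESH_CLOCK_HZ               # network cycles per hop
bisection_bits_per_cycle = BISECTION_BITS_PER_S / MESH_CLOCK_HZ

print(f"per-core throughput: at least {per_core_gflops:.1f} GFLOPS")
print(f"efficiency: at least {chip_gflops_per_watt:.1f} GFLOPS/W")
print(f"hop latency: about {cycles_per_hop:.0f} cycles at 5 GHz")
print(f"bisection traffic: about {bisection_bits_per_cycle:.0f} bits per cycle")
```

In other words, each core contributes at least 12.5 GFLOPS, the chip achieves at least 10 GFLOPS per watt, one hop takes about five network cycles, and the bisection carries about 512 bits per cycle.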

15.
An innovative algorithm for syntactic analysis could be the first step toward placing grammar on a chip.

16.
Hardware implementation of genetic algorithms (GAs) is gaining importance because of their proven effectiveness as optimization engines for real-time applications (e.g., evolvable hardware). Earlier hardware implementations suffer from major drawbacks such as the absence of GA parameter programmability, a rigid predefined system architecture, and a lack of support for multiple fitness functions. In this paper, we report the design of an IP core that implements a general-purpose GA engine and addresses these problems. Specifically, the proposed GA IP core can be customized in terms of the population size, number of generations, crossover and mutation rates, random number generator seed, and the fitness function. It has been successfully synthesized and verified on a Xilinx Virtex II Pro field-programmable gate array device (xc2vp30-7ff896) with only 13% logic slice utilization, 1% block memory utilization for GA memory, and a clock speed of 50 MHz. The GA core has been used as a search engine for real-time adaptive healing but can be tailored to any given application by interfacing with the appropriate application-specific fitness evaluation module and the required storage memory, and by programming the values of the desired GA parameters. The core is soft in nature, i.e., a gate-level netlist is provided that can be readily integrated with the user's system. The performance of the GA core was tested using standard optimization test functions. In the hardware experiments, the proposed core either found the globally optimal solution or found a solution that was within 3.7% of the value of the globally optimal solution. The experimental test setup including the GA core achieved a speedup of around 5.16× over an analogous software implementation.

17.
The combination of growing transistor counts and a limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"): only a small fraction of a chip can run at full speed at any given time. Designing accelerators for specific applications or algorithms is considered one of the most promising approaches to improving energy efficiency. However, most current accelerator design methods are dedicated to particular applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. First, we propose clustering the dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerator design, because a DFG is the largest program unit that is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 DFG hotspots using the LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Second, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how applications can use them. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks by 6.2X on average.

18.
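This paper presents a general-purpose neural network simulation system. Computation in the system runs on a ring of multiple Transputer processors. The matrix algorithms are based on the work of S. Y. Kung, with improvements. To fully exploit the computing power of the Transputers, data is transferred between processors in blocks and the number of communications is kept to a minimum. The system also provides a friendly human-machine interface, so users with no knowledge of Transputer system concepts can use it directly under DOS. As a result, the system is not only very fast but also easy to use.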

19.
Simple and efficient, this algorithm for Boolean shape operations makes use of the continuity of a shape. It is fastest when special hardware performs the repetitious geometric computations.

20.
This paper mainly discusses the software and hardware design of an embedded microprocessor data acquisition system.
