首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 648 毫秒
1.
Some implementations of a power-of-two one-dimensional fast Fourier transform (FFT) on vector computers use radix-4 Stockham autosort kernels with a separate transpose step. This paper describes an algorithm that performs well on a Convex C4/XA vector supercomputer on large FFTs by using higher-radix kernels and moving the transpose step into the computational steps. For short transforms a different algorithm is used that calculates the FFT without storing any intermediate results to memory. Performance results using these techniques are given.  相似文献   

2.
The CRAY-2 is considered to be one of the most powerful supercomputers. Its state-of-the-art technology features a faster clock and more memory than any other supercomputer available today. In this report the single processor performance of the CRAY-2 is compared with the older, more mature CRAY X-MP. Benchmark results are included for both the slow and the fast memory DRAM MOS CRAY-2. Our comparison is based on a kernel benchmark set aimed at evaluating the performance of these two machines on some standard tasks in scientific computing. Particular emphasis is placed on evaluating the impact of the availability of large real memory on the CRAY-2 versus fast secondary memory on the CRAY X-MP with SSD. Our benchmark includes large linear equation solvers and FFT routines, which test the capabilities of the different approaches to providing large memory. We find that in spite of its higher processor speed the CRAY-2 does not perform as well as the CRAY X-MP on the Fortran kernel benchmark. We also find that for large-scale applications, which have regular and predictable memory access patterns, a high-speed secondary memory device such as the SSD can provide performance equal to the large real memory of the CRAY-2.The author is an employee of SCA Division of Boeing Computer Services.  相似文献   

3.
FFTs in external or hierarchical memory   总被引:2,自引:0,他引:2  
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2 m -point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation.Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops.A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89.  相似文献   

4.
Floating-point fast Fourier transform (FFT) has been widely expected in scientific computing and high-resolution imaging applications due to the wide dynamic range and high processing precision. However, it suffers high area and energy overhead problems in comparison to fixed-point implementations. To address these issues, this paper presents an area- and energy-efficient hybrid architecture for floating-point FFT computations. It minimizes the required arithmetic units and reduces the memory usage significantly by combining three different parts. The serial radix-4 butterfly (SR4BF) is used in the single-path delay commutator (SDC) part to minimize the required arithmetic units with 100% adder utilization ratio obtained. A modified single-path delay feedback (MSDF) architecture is proposed to achieve a tradeoff between arithmetic resources and memory usage by using the new half radix-4 butterfly (HR4BF) with 50% adder utilization ratio obtained. The intermediate caching buffer is modified accordingly in the MSDF part. By combining both the advantages on arithmetic units reducing and memory usage optimization in different parts, the optimized area and power are obtained without throughput loss. The logic synthesis results in a 65 nm CMOS technology show that the energy per FFT is about 331.5 nJ for 1024-point FFT computations at 400 MHz. The total hardware overhead is equivalent to 460k NAND2 gates.  相似文献   

5.
One of the many interesting architectural features of the CRAY-2 supercomputer is that each processor has access to 16K 64-bit words of local memory. This is in addition to the extremely large, 268-million-word common memory that is accessible by all four processors. By using local memory judiciously, it is possible to achieve increased performance on the CRAY-2. This is partly because accesses to local memory can be done simultaneously with accesses to common memory and other operations. Also, it is slightly faster to start up a vector access to local memory, and a processor does not have to compete with other processors when accessing its local memory. In this paper, we present an algorithm for computing the fast Fourier transform that takes advantage of the CRAY-2's local memory. It operates by solving subproblems, which are themselves Fourier transforms, entirely within local memory. By doing so it achieves a performance increase of between 25 and 40 percent over an equivalent algorithm that uses only common memory, and for some input sizes is able to outperform the CRAY-2 library FFT.  相似文献   

6.
An Efficient Two-Dimensional FFT Algorithm   总被引:1,自引:0,他引:1  
A new version of the radix-2 row-column method for computing two-dimensional fast Fourier transforms is proposed. It uses a ``multiple vector' FFT algorithm to compute the transforms of all the columns in an array simultaneously while avoiding all trivial multiplications. The minicomputer implementation of the algorithm runs faster than the 2 × 2 vector radix FFT algorithm. Analysis of the numbers of complex additions and multiplications required indicate that implementations of the radix-4 row-column FFT and 4 × 4 vector radix FFT on the same minicomputer would run slower than the multiple vector implementation.  相似文献   

7.
FFT处理器无冲突地址生成方法   总被引:8,自引:2,他引:6  
马余泰 《计算机学报》1995,18(11):875-880
本文提出了一种新的无冲突地址生成方法,使蝶式运算单元在一个周期内能够同时读取两个操作数。由于取消了地址奇偶判别电路,简化了存储体控制逻辑,同 时也加快了输入/输出地址生成,该方法还同样适用于基-4FFT处理器。  相似文献   

8.
基于FPGA的1024点高性能FFT处理器的设计   总被引:1,自引:0,他引:1  
为了提高FFT(Fast Fourier Transformation)处理数据的实时性,本文研究了16位1024点FFT并提出了几种有效的优化方案。在Xilinx公司Virtex-E系列FPGA上实现了工作频率50MHz以上、流水线型、基22单路径反馈结构(R22SDF)FFT处理器。仿真和性能评估结果表明本FFT处理器的有较高的性能。  相似文献   

9.
针对基-2 FFT 处理算法,采用分块存储思想,将存储器、处理机数据交换网络模型进行优化。优化后的网络模型数据通路数仅为20,降低为原来的4%以下,且不随 FFT 计算点数增多而增加。整个设计在 Virtex 系统芯片 XCV800上实现。  相似文献   

10.
一种改进FFT算法在DSP上的实现   总被引:3,自引:0,他引:3  
快速傅里叶变换(FFT)是数字信号处理中最为重要的工具之一。而在具体硬件实现中,如何减少内存引用次数,以降低功耗具有更重要的意义。论文以基2按时间抽取FFT为例,在深入分析旋转因子性质的基础上,提出了一种改进FFT算法可以减少旋转因子的引用次数,消除冗余的内存引用,并给出了在DSPVC5402平台上的实验数据。表明了该算法是切实有效的。  相似文献   

11.
基2与混合基快速Fourier变换算法性能比较   总被引:1,自引:0,他引:1  
目前快速Fourier变换算法主要有两大类,一类是针对点数为2的整数次幂,一类对应点数为其他长度的情况。在介绍基2和混合基的FFT算法原理的基础上,通过仿真数据对两种FFT算法的性能进行了比较分析。验证结果表明,基2算法在计算速度方面要占有优势,但在整周期截断的情况下,混合基快速算法却在频谱效果方面占有优势。  相似文献   

12.
高性能基4快速傅里叶变换处理器的设计   总被引:4,自引:1,他引:3       下载免费PDF全文
段小东  顾立志 《计算机工程》2008,34(24):238-240
研究并设计高性能基4快速傅里叶变换(FFT)处理器。采用基4算法、流水线结构的蝶形运算单元,提高了处理速度,使芯片能在更高的时钟频率上工作。运用溢出检测状态机对每个蝶形运算单元输出的数据进行块浮点检查,确保对溢出情况进行正确判断。验证与性能评估结果表明,该FFT处理器具有较高性能。  相似文献   

13.
管道腐蚀内检测中超声回波信号具有周期性特点,功率谱估计是重要的数据处理方法之一。基于分裂基的FFT算法具有较小的乘法次数和加法次数,且算法结构较好。采用频率抽取分裂基2/4 FFT算法对管道腐蚀超声内检测回波信号进行了处理.得到管道壁厚数据,经分裂基FFT算法和基2 FFT算法比较,分裂基FFT算法明显减少了数据处理时间,提高了检测速度。理论分析和实验结果表明,该分裂基算法精度高,数据处理速度快,满足管道腐蚀内检测的实时性要求。  相似文献   

14.
基于基2-FFT的GPS信号捕获算法研究   总被引:1,自引:0,他引:1  
讨论了各种GPS信号捕获算法,针对采用圆周相关法进行GPS信号捕获中FFT点数为非2的整数次幂的问题,分析了传统补零法和内插法的缺陷,并提出了一种新的基2-FFT捕获算法,该算法通过将采样数据和本地参考码均补零到大于自身长度两倍后做基2-FFT进行圆周相关,理论分析和仿真表明,该算法可以得到与非2的整数次幂点数的圆周相关一样的结果。  相似文献   

15.
程一飞  陈文莉 《微机发展》2007,17(10):155-157
椭圆曲线标量乘是椭圆曲线密码系统中最关键、最耗时的运算,因此如何快速高效实现标量乘运算是研究的重点。目前常见的标量乘算法有:double-and-add算法,NAF算法,MOF算法等,但它们都是基于radix-2编码表示的,无论采用何种编码,倍点运算的次数都不变,减少的只是点加(或点减)运算的次数。提出一个基于radix-8表示的新的编码方法,及一个基于radix-8表示的标量乘算法,通过用八倍点运算代替倍点运算,且编码是从左到右(即从最高位向最低位)进行,编码和主计算可以合并,提高实现效率并节省内存空间。实验结果表明,该算法较经典的double-and-add算法能够提高效率30%以上。  相似文献   

16.
针对地面数字视频广播(DVB-T)系统中高速FFT处理器的设计要求,提出了一种新的基16/8混合基算法及其实现结构。采用单个基16/8复用的蝶形运算单元顺序处理,并通过减少乘法器数目,有效降低了硬件消耗;运算单元内部采用“基4+基4/2”级联流水线方式,大大加快了运算速度;此外,应用对称乒乓RAM结构提高了蝶算单元的连续运算能力;并且使用改进的块浮点防溢出机制,以保证运算精度。仿真和实现结果表明该设计具有良好的性能,完全满足实际应用要求。  相似文献   

17.
Various scientific codes were benchmarked on three vector computers: the CRAY X-MP/48 and CRAY-2 supercomputers and the SCS-40/XM minisupercomputer. On the X-MP, two Fortran compilers were also compared. The benchmarks, which were initially all in Fortran, consisted of six research codes from Caltech, the 24 Livermore loops, and two cases from the LINPACK benchmark. As a corollary effort, the effect of manual optimization on the Caltech codes was also considered, including the selected use of assembly-language math routines.On each machine the ratio of the maximum to the minimum speeds for the various benchmarks was more than a factor of 50, even though the study was restricted to unitasked (i.e., single CPU) runs. The maximum speed for all-Fortran codes was more than 80% of the peak speed on the X-MP and SCS, but less than 40% of the peak speed on the CRAY-2.Despite having a clock that is 2.3 times faster, the CRAY-2 generally runs slower than the X-MP, typically by a factor of 1.3 for scalar code and even slower for moderately vectorized code. Only for highly vectorized codes does the CRAY-2 marginally outperform the X-MP, at least for in-core benchmarks. The poorer performance of the CRAY-2 is due to its slower scalar speed, its lack of chaining, its single port between each CPU and memory, and its relatively slow memory.The SCS runs slower than the X-MP by a factor of 2.6 in the scalar limit and by a factor of 4.7 (the clock ratio) in the vector limit when the same CFT compiler is used on both machines. Use of the newer CFT77 compiler on the X-MP negates the relative enhancement of the SCS scalar performance.On the X-MP, the CFT77 3.0 compiler produces significantly faster code than CFT 1.14, typically by a factor of 1.4. This is obtained, however, at the expense of compilation times that are three to five times longer. Regardless of the compiler, manual optimization is still worthwhile. For three of the six Caltech codes compiled with CFT77, run time speedups of 2, 4, and 16 were achieved due to Fortran optimization only.  相似文献   

18.
We study the performance and the use of vector computers for the solution of combinatorial optimization problems, particularly dynamic programming and shortest path problems. A general model for performance evaluation and vector implementations for the problems described above are studied. These implementations were done on a CRAY-1 vector computer and the computational results obtained show (i) the adequacy of the performance evaluation model and (ii) very important gains concerning computing times, showing that vector computers will be of great importance in the field of combinatorial optimization.  相似文献   

19.
We describe work done to improve the performance of NAG Library routines for nonlinear equations and nonlinear least squares problems on vector-processing machines. Calls to the Level 2 BLAS routines for matrix-vector operations were introduced wherever possible, so that further efforts to tune the code could be concentrated within the Level 2 BLAS, and advantage can be taken of optimized implementations of the Level 2 BLAS when they become available. Performance measurements from a CRAY-1S, an AMDAHL VP1100, and a CDC CYBER 205 are presented to illustrate the effectiveness of this strategy.  相似文献   

20.
Most implementations of a radix-2 fast Fourier transform on large scientific computers use algorithms that involve memory accesses whose strides are powers of two. (The term stride means the memory increment between successive elements stored or fetched.) Such strides are unacceptable for recently developed supercomputers, particularly the Cray-2, because of serious difficulties with memory bank conflicts.This article describes an algorithm for evaluating the fast Fourier transform that avoids this difficulty and thus could provide the basis for implementations that more fully utilize the power of the Cray-2. A Fortran program implementing this algorithm is included, and timing comparisons with the Cray assembly-coded library subroutine are shown.The author is with the Numerical Aerodynamic Simulation Systems Division at NASA Ames Research Center.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号