首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
The fast Fourier transform (FFT) is undoubtedly an essential primitive that has been applied in various fields of science and engineering. In this paper, we present a decomposition method for the parallelization of multi-dimensional FFTs with the smallest communication amounts for all ranges of the number of processes compared to previously proposed methods. This is achieved by two distinguishing features: adaptive decomposition and transpose order awareness. In the proposed method, the FFT data is decomposed based on a row-wise basis that maps the multi-dimensional data into one-dimensional data, and translates the corresponding coordinates from multi-dimensions into one dimension so that the one-dimensional data can be divided and allocated equally to the processes using a block distribution. As a result and different from previous works that have the dimensions of decomposition pre-defined, our method can adaptively decompose the FFT data on the lowest possible dimensions depending on the number of processes. In addition, this row-wise decomposition provides plenty of alternatives in data transpose, and different transpose order results in different amounts of communication. We identify the best transpose orders with the smallest communication amounts for the 3-D, 4-D, and 5-D FFTs by analyzing all possible cases. We also develop a general parallel software package for the most popular 3-D FFT based on our method using the 2-D domain decomposition. Numerical results show good performance and scaling properties of our implementation in comparison with other parallel packages. Given both communication efficiency and scalability, our method is promising in the development of highly efficient parallel packages for the FFT.  相似文献   

2.
An Efficient Two-Dimensional FFT Algorithm   总被引:1,自引:0,他引:1  
A new version of the radix-2 row-column method for computing two-dimensional fast Fourier transforms is proposed. It uses a ``multiple vector' FFT algorithm to compute the transforms of all the columns in an array simultaneously while avoiding all trivial multiplications. The minicomputer implementation of the algorithm runs faster than the 2 × 2 vector radix FFT algorithm. Analysis of the numbers of complex additions and multiplications required indicate that implementations of the radix-4 row-column FFT and 4 × 4 vector radix FFT on the same minicomputer would run slower than the multiple vector implementation.  相似文献   

3.
基于软硬件的协同支持在众核上对1-DFFT算法的优化研究   总被引:2,自引:0,他引:2  
随着高性能计算需求的日益增加,片上众核(many-core)处理器成为未来处理器架构的发展方向.快速傅立叶变换(FFT)作为高性能计算中的重要应用,对计算能力和通信带宽都有较高的要求.因此基于众核处理器平台,实现高效、可扩展的FFT算法是算法和体系结构设计者共同面临的挑战.文中在众核处理器Godson-T平台上对1-D FFT算法进行了优化和评估,在节省几乎三分之一L2 Cache存储开销的情况下,通过隐藏矩阵转置,计算与通信重叠等优化策略,使得优化后的1-D FFT算法达到3倍以上的性能提升.并通过片上网络拥塞状况的实验分析,发现对于像FFT这样访存带宽受限的应用,增加L2 Cache的访问带宽,可以缓解因为爆发式读写带给片上网络和L2 Cache的压力,进一步提高程序的性能和扩展性.  相似文献   

4.
The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed memory Beowulf cluster is examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on communication-intensive transpose operation involving P processors. Communication is based upon the standard portable message passing interface (MPI). We show that 1/P scaling for execution time at fixed problem size N3 (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged for non-overlapping pairwise communication between processors, thus eliminating blocking when standard fast ethernet interconnects are employed. This method provides the basis for implementation of scalable and efficient spectral method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.  相似文献   

5.
徐妮妮  于海艳  肖志涛 《计算机应用》2010,30(10):2777-2780
给出了频域抽取(DIF)多维向量基快速傅里叶变换(FFT)算法。对多维频域信号的每一维,采用向量基2频域抽取法,导出了快速算法蝶形运算的一般形式。该FFT算法适合于维数为任意整数的情况,当维数为1时,算法退化为著名的频域抽取向量基2 FFT算法。为了便于编程实现,以频域抽取3维向量基FFT算法为例,给出了快速算法实现流程,该流程易于向任意整数维推广。计算量比较结果显示,频域抽取多维向量基FFT算法比多维分离式FFT算法计算量低。  相似文献   

6.
快速傅里叶变换(Fast Fourier Transform,FFT)是最重要的基础算法之一,在科学计算、信号处理、图像处理等领域都有着广泛的应用。随着这些应用领域对实时性需求的进一步提高,FFT算法面临着越来越高的性能要求。在现有的FFT算法库中,FFT算法的求解速度和计算精度受到一定程度的限制,而且也少有研究者对偶数基Cooley-Tukey FFT的高性能实现提出相应的优化策略并对技术进行深入研究。基于此,文中提出了一套针对偶数基的Cooley-Tukey FFT的优化策略和方法。首先构建一个SIMD(Single Instruction Multiple Data)友好、支持混合基的蝶形网络,然后根据偶数基旋转因子特性最大限度地降低蝶形计算的复杂度,接着通过SIMD汇编优化、汇编指令重排及选择、寄存器分配策略制定、高性能矩阵转置算法等方法来优化应用,最后实现一个高性能的FFT算法库。目前,最流行、应用最广的FFT有FFTW和Intel MKL。实验结果表明,在X86计算平台上,新提出的这套针对偶数基Cooley-Tukey FFT的技术所实现的FFT算法库的性能全面优于MKL和FFTW。所提出的这套高性能算法优化和实现技术体系,可推广到除偶数基以外的其他基的实现和优化上,为进一步的研究开发工作奠定一定的基础,进而突破FFT算法在硬件平台上的性能瓶颈,实现一套针对特定平台的高性能FFT算法库。  相似文献   

7.
三维向量基快速傅立叶算法   总被引:1,自引:1,他引:0  
给出了三维向量基快速傅立叶变换(3-D Vector Radix FFT)算法。对三维信号采用基2时域抽取,导出了该算法蝶形运算的一般形式。计算量比较结果显示,三维向量基FFT算法比基于行列分解的三维FFT算法计算量低,计算效率高。  相似文献   

8.
提出了一种适用于短突发信号的低复杂度基于FFT的信道盲辨识算法。算法基于MCR(minimum cross-relation)算法只需最小冗余度信息可求解出信道向量的特性,通过将其建立的线性方程经过FFT变换后再求解信道向量,并结合算法中的矩阵 的秩信息提出了一种快速阶数估计算法。仿真表明,该算法既克服了传统辨识算法对于短突发观测数据下性能不佳的缺点,又降低了原有基于FFT的信道盲辨识算法的复杂度和提升其对阶数的鲁棒性。  相似文献   

9.
图像定位是图形图像学研究的重要方面,然而较慢的定位速度一直制约图像定位的实时应用。文章探讨了一种图像投影的快速定位算法,将二维图像信息的特征压缩成一个特征向量,将该特征向量作为定位的参数进行图像定位,大大提高了定位的速度。该算法比一般的相关算法、快速傅立叶算法具有非常明显的速度优势;并且将基于该算法的图像定位系统嵌入到二维移动工作台进行实时实验,取得了很好的实验结果。  相似文献   

10.
针对聚类算法需要处理数据集的规模越来越大、时效性要求越来越高,对算法的大数据适应能力和性能要求更高的问题,提出一种在Spark分布式内存计算平台下的模糊C均值(FCM)算法Spark-FCM。首先对矩阵通过水平分割实现分布式存储,不同向量存储在不同节点;然后基于FCM算法的计算特点,设计了分布式和缓存敏感的常用矩阵操作,包括乘法、转置和加法等;最后基于矩阵操作和Spark平台特点,设计了Spark-FCM算法,主要数据结构采用分布式矩阵存储,具有节点间数据移动少和每个步骤分布式计算特点。通过在单机和集群环境下测试,算法具有良好的可扩展性,并可以适应大规模数据集,算法性能与数据量成线性关系,集群环境下性能比单机提高2~3倍。  相似文献   

11.
实时SAR成像系统矩阵转置方法研究与实现   总被引:1,自引:0,他引:1       下载免费PDF全文
合成孔径雷达(SAR)是一种高分辨率成像雷达,而矩阵转置是实时SAR成像信号处理中一个很重要的操作,矩阵转置的效率高低将直接决定整个SAR成像信号处理系统的性能。对于矩阵转置,可采用行进列出或列进行出、两页式或三页式转置等方法进行处理,但这些方法处理时间较长,转置效率较低。在现有矩阵转置方法的基础上,提出了一种新的矩阵转置方法。在实际硬件平台上利用提出的矩阵转置方法进行了实时SAR成像处理,所得结果的矩阵转置效率为78%,整个SAR成像处理时间为10秒。测试结果表明,该方法对解决矩阵转置问题是行之有效的。  相似文献   

12.
提出了一种特征加权的核学习方法,其主要为了解决当前核方法在分类任务中对所有数据特征的同等对待的不足。在分类任务中,数据样本的每个特征所起的作用并不是相同的,有些特征对分类任务有促进作用,应该给予更多的关注。提出的算法集成了多核学习的优势,以加权的方式组合不同的核函数,但所需的计算复杂度更低。实验结果证明,提出的算法与支持向量机、多核学习算法相比,分类准确度优于支持向量机和多核学习算法,在计算复杂度上略高于支持向量机,但远远低于多核学习算法。  相似文献   

13.
We address the problem of probability density function estimation using a Gaussian mixture model updated with the expectation-maximization (EM) algorithm. To deal with the case of an unknown number of mixing kernels, we define a new measure for Gaussian mixtures, called total kurtosis, which is based on the weighted sample kurtoses of the kernels. This measure provides an indication of how well the Gaussian mixture fits the data. Then we propose a new dynamic algorithm for Gaussian mixture density estimation which monitors the total kurtosis at each step of the EM algorithm in order to decide dynamically on the correct number of kernels and possibly escape from local maxima. We show the potential of our technique in approximating unknown densities through a series of examples with several density estimation problems  相似文献   

14.
为了获得更加理想的网络流量预测结果,准确刻画网络流量的变化趋势,提出一种基于布谷鸟搜索算法优化组合核相关向量机的网络流量预测模型(CS-HRVM)。首先针对网络流量的混沌特性,采用相空间理论建立网络流量的多维学习样本,并采用组合核函数构建相关向量机,然后将学习样本输入到相关向量机中进行训练,并采用布谷鸟搜索算法对模型参数进行优化,从而建立网络流量预测模型,最后采用仿真实验对模型性能进行仿真对比实验。结果表明,CS-HRVM获得了比其他网络流量预模型更高的预测精度,而且可以对含噪网络流量进行准确预测。  相似文献   

15.
In this paper a set of techniques for improving the performance of the fast Fourier transform (FFT) algorithm on modern vector-oriented supercomputers is presented. Single-processor FFT implementations based on these techniques are developed for the CRAY-2 and the CRAY Y-MP, and it is shown that they achieve higher performance than previously measured on these machines. The techniques include (1) using gather/scatter operations to maintain optimum length vectors throughout all stages of small-to medium-sized FFTs, (2) using efficient radix-8 and radix-16 inner loops, which allow a large number of vector loads/stores to be overlapped, and (3) prefetching twiddle factors as vectors so that on the CRAY-2 they can later be fetched from local memory in parallel with common memory accesses. Performance results for Fortran implementations using these techniques demonstrate that they are faster than Cray's library FFT routine CFFT2. The actual speedups obtained, which depend on the size of the FFT being computed and the supercomputer being used, range from about 5 to over 300%.  相似文献   

16.
We give practical algorithms, complexity analysis and implementation for one-to-all broadcasting, all-to-all personalized communication and matrix transpose (with two-dimensional partitioning of the matrix) on hypercubes. We assume the following communication characteristics: circuit-switched, e-cube routing and one-port communication model. For one-to-all broadcasting, we give an algorithm that combines the well-known recursive doubling algorithm[1] and the algorithm based on edgedisjoint spanning trees[2]. The measured times of the combined algorithm are always superior to those of the edge-disjoint spanning tree algorithm and outperform the recursive doubling algorithm. For all-to-all personalized communication we propose a hybrid algorithm that combines the well-known recursive doubling algorithm[3,4] and the recently proposed direct-route algorithm[5,6] Our hybrid algorithm balances between data transfer time and start-up time of these two algorithms, and its communication complexity is estimated to be better than the two previous algorithms for a range of machine parameters. For matrix transpose with two-dimensional partitioning of the matrix, we relate a two-phase algorithm to the previous result in Reference 7. The algorithm is predicted to be better than the recursive transpose algorithm[8] by n nearest-neighbor communications[4]. It takes advantage of circuit-switched routing and is congestion-free within each phase. We also suggest a way of storing the matrix such that the transpose operation can be realized in one phase without congestion.  相似文献   

17.
提出一种新的稀疏贝叶斯回归算法.基于相关向量机,首先通过尺度核和小波核构造完备基以提高预测精度;然后利用保局投影对输入矩阵的列进行主成分提取以减少训练时间,从而形成算法的初步模型.为进一步减小较大规模训练数据集的回归时间压力,算法对训练数据集的分层采样建立了初步模型,进而产生实际较小规模的训练数据集.实验结果表明,算法在预测精度和鲁棒性上优于传统支持向量机和相关向量机,且其训练时间较相关向量机少.  相似文献   

18.
本文对片上网络中的确定性XY路由算法和基于拐弯模型的4种自适应路由算法进行分析,并采用Noxim模拟器在6种合成通信模式下对5种路由算法的性能进行评估。实验结果表明,在均匀随机通信模式下,XY路由算法的性能优于自适应路由算法;在置换1和混洗通信模式下,奇偶路由算法的性能优于其他路由算法;在置换2、位反和蝶形通信模式下,负向优先路由算法的性能优于其他路由算法。  相似文献   

19.
Chen  Zhengbo  Wang  Di  Yu  Qi  Zheng  Fang  Guo  Feng  Chen  Zuoning 《The Journal of supercomputing》2022,78(7):9456-9474

Matrix multiplication is widely used in a variety of application domains. When the input matrices and the product differ in the memory format, matrix transpose is required. The efficiency of matrix transpose has a non-negligible impact on performance. However, the state-of-the-art software solution and its optimizations suffer from low efficiency due to frequent interference to main pipeline and their inability to achieve parallel matrix transpose and multiplication. To address this issue, we propose AMT, an asynchronous and in-place matrix transpose mechanism based on C2R algorithm, to efficiently perform matrix transpose. AMT performs matrix transpose in an asynchronous processing module and uses two customized asynchronous matrix transpose instructions to facilitate processing. We implement the logic design of AMT using RTL and verify its correctness. Simulation results show that AMT achieves an average of 1.27x (up to 1.48x) speedup over a state-of-the-art software baseline, and is within 95.4% of an ideal method. Overhead analysis shows that AMT only incurs small area overhead and power consumption.

  相似文献   

20.
基于LSSVM的静态手势识别   总被引:2,自引:0,他引:2  
段洪伟  陈一民  林锋 《计算机工程与设计》2004,25(12):2352-2353,2368
支持向量机(Support Vector Machine,简称SVM),是基于统计学习理论的一种新的模式识别方法,较好地解决了小样本学习问题。通过使非线性空间变换为线性空间,降低了算法的复杂性。LSSVM(Least Squares Support Vector Machine)由于使用线性等式代替了标准的SVM算法中的线性不等式,进一步降低了运算量。利用傅立叶描述子获取静态手势特征向量,通过LSSVM大尺度算法求解方程组来得到LSSVM分类器,进行静态手势识别,取得了较高的识别率。说明如何把静态手势识别结果应用到机器人远程控制中,提高人机交互的友好性。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号