Similar Documents
20 similar documents were found for this query.
1.
Based on the architectural characteristics of the Sunway 26010 many-core processor, a many-core parallel algorithm for the one-dimensional FFT built on a two-level decomposition is proposed. Using the iterative Stockham FFT framework and the Cooley-Tukey FFT algorithm, the method decomposes a large-scale FFT into a series of small-scale FFTs, and improves FFT performance through platform-specific optimizations such as a well-designed task partitioning scheme, register communication, double buffering, and SIMD vectorization. Performance tests of the proposed algorithm show an average speedup of 44.53x over the FFTW 3.3.4 library running on a single master core, with a maximum speedup of 56.33x and a peak bandwidth utilization of 83.45%.
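To make the two-level decomposition concrete, the sketch below shows the core index algebra in plain NumPy: a length-n1*n2 FFT is computed as n1 small FFTs, a twiddle-factor multiplication, and n2 more small FFTs. This is only an illustration of the Cooley-Tukey splitting the abstract refers to; the task partitioning, register communication, double buffering and SIMD details of the SW26010 implementation are not modelled, and the function name `fft_two_level` is ours.

```python
import numpy as np

def fft_two_level(x, n1, n2):
    """1-D DFT of length n1*n2 computed as a series of small FFTs
    (Cooley-Tukey / four-step style two-level decomposition).
    Minimal serial sketch of the decomposition idea only."""
    assert x.size == n1 * n2
    m = x.reshape(n2, n1)                  # m[j, i] = x[i + n1*j]
    a = np.fft.fft(m, axis=0)              # n1 independent FFTs of length n2
    idx1, idx2 = np.arange(n1), np.arange(n2)
    a *= np.exp(-2j * np.pi * np.outer(idx2, idx1) / (n1 * n2))  # twiddle factors
    b = np.fft.fft(a, axis=1)              # n2 independent FFTs of length n1
    return b.T.reshape(-1)                 # output index k = k2 + n2*k1

x = np.random.rand(1024) + 1j * np.random.rand(1024)
print(np.allclose(fft_two_level(x, 32, 32), np.fft.fft(x)))   # expected: True
```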

2.
In this paper, we propose high-performance radix-2, 3 and 5 parallel 1-D complex FFT algorithms for distributed-memory parallel computers. We use the four-step or six-step FFT algorithms to implement the radix-2, 3 and 5 parallel 1-D complex FFT algorithms. In our parallel FFT algorithms, since we use cyclic distribution, all-to-all communication takes place only once. Moreover, the input data and output data are both in natural order. We also show that the suitability of a parallel FFT algorithm is machine-dependent because of the differences in the architecture of the processor elements in distributed-memory parallel computers. Experimental results of 2^p 3^q 5^r-point FFTs on the distributed-memory parallel computers HITACHI SR2201 and IBM SP2 are reported. We achieved performance of about 130 GFLOPS on a 1024-PE HITACHI SR2201 and about 1.25 GFLOPS on a 32-PE IBM SP2.
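As a rough illustration of the six-step framework mentioned here, the serial NumPy sketch below spells out its six stages; in a distributed implementation the marked matrix transposes are where inter-processor data exchange happens, and with the cyclic distribution used in the paper this reduces to a single all-to-all. This is our own simplified sketch, not the authors' code, and the function name is hypothetical.

```python
import numpy as np

def fft_six_step(x, n1, n2):
    """Six-step 1-D FFT of length n1*n2 (serial sketch).
    Steps 1, 4 and 6 are data transposes; steps 2 and 5 are batches of
    small, independent FFTs that stay local in a parallel implementation."""
    a = x.reshape(n1, n2).T.copy()            # step 1: transpose
    b = np.fft.fft(a, axis=1)                 # step 2: n2 FFTs of length n1
    j, k1 = np.arange(n2)[:, None], np.arange(n1)[None, :]
    b *= np.exp(-2j * np.pi * j * k1 / (n1 * n2))   # step 3: twiddle factors
    c = b.T.copy()                            # step 4: transpose
    d = np.fft.fft(c, axis=1)                 # step 5: n1 FFTs of length n2
    return d.T.reshape(-1)                    # step 6: transpose back

x = np.random.rand(60) + 1j * np.random.rand(60)      # 60 = 2^2 * 3 * 5
print(np.allclose(fft_six_step(x, 12, 5), np.fft.fft(x)))   # expected: True
```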

3.
Some implementations of a power-of-two one-dimensional fast Fourier transform (FFT) on vector computers use radix-4 Stockham autosort kernels with a separate transpose step. This paper describes an algorithm that performs well on a Convex C4/XA vector supercomputer on large FFTs by using higher-radix kernels and moving the transpose step into the computational steps. For short transforms a different algorithm is used that calculates the FFT without storing any intermediate results to memory. Performance results using these techniques are given.

4.
With tens of petaflops supercomputers already in operation and exaflops machines expected to appear within the next 10 years, efficient parallel computational methods are required to take advantage of such extreme-scale machines. In this paper, we present a three-dimensional domain decomposition scheme for enabling large-scale electronic structure calculations based on density functional theory (DFT) on massively parallel computers. It is composed of two methods: (i) the atom decomposition method and (ii) the grid decomposition method. In the former method, we develop a modified recursive bisection method based on the moment of inertia tensor to reorder the atoms along a principal axis so that atoms that are close in real space are also close on the axis to ensure data locality. The atoms are then divided into sub-domains depending on their projections onto the principal axis in a balanced way among the processes. In the latter method, we define four data structures for the partitioning of grid points that are carefully constructed to make data locality consistent with that of the clustered atoms for minimizing data communications between the processes. We also propose a decomposition method for solving the Poisson equation using the three-dimensional FFT in the Hartree potential calculation, which is shown to be better in terms of communication efficiency than a previously proposed parallelization method based on a two-dimensional decomposition. For evaluation, we perform benchmark calculations with our open-source DFT code, OpenMX, paying particular attention to the O(N) Krylov subspace method. The results show that our scheme exhibits good strong and weak scaling properties, with the parallel efficiency at 131,072 cores being 67.7% compared to the baseline of 16,384 cores with 131,072 atoms of the diamond structure on the K computer.
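A minimal sketch of the atom-decomposition idea (our own illustration, not OpenMX code): compute the moment-of-inertia tensor of the atoms, project them onto the principal axis of largest spatial extent, and recursively bisect the sorted projections into balanced groups. The helper names and the unit-mass assumption are ours; the paper's "modified" recursive bisection adds reordering and weighting details not reproduced here.

```python
import numpy as np

def inertia_principal_axis(coords):
    """Principal axis (smallest-moment eigenvector) of the inertia tensor
    of a set of unit-mass points; mass is most spread out along this axis."""
    r = coords - coords.mean(axis=0)
    inertia = np.eye(3) * (r * r).sum() - r.T @ r
    w, v = np.linalg.eigh(inertia)          # eigenvalues in ascending order
    return v[:, 0]

def recursive_bisection(coords, ids, nparts):
    """Recursively split atoms into nparts balanced groups by sorting their
    projections on the local principal axis and cutting proportionally."""
    if nparts == 1:
        return [ids]
    order = np.argsort(coords @ inertia_principal_axis(coords))
    nleft = len(ids) * (nparts // 2) // nparts      # balanced split, odd nparts too
    left, right = order[:nleft], order[nleft:]
    return (recursive_bisection(coords[left], ids[left], nparts // 2) +
            recursive_bisection(coords[right], ids[right], nparts - nparts // 2))

coords = np.random.rand(1000, 3)
parts = recursive_bisection(coords, np.arange(1000), 8)
print([len(p) for p in parts])    # roughly equal-sized sub-domains
```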

5.
Particle-in-cell (PIC) simulation is widely used in many branches of physics and engineering. In this paper, we give an analysis of the particle-field decomposition method and the domain decomposition method in parallel particle-in-cell beam dynamics simulation. The parallel performance of the two decomposition methods was studied on the Cray XT4 and the IBM Blue Gene/P computers. The domain decomposition method shows better scalability but is slower than the particle-field decomposition in most cases (up to a few thousand processors) for macroparticle dominant applications. The particle-field decomposition method also shows less memory usage than the domain decomposition method due to its use of perfect static load balance. For applications with a smaller ratio of macroparticles to grid points, the domain decomposition method exhibits better scalability and faster speed. Application of the particle-field decomposition scheme to high-resolution macroparticle-dominant parallel beam dynamics simulation for a future light source linear accelerator is presented as an example.

6.
The implementations of the domain decomposition, SOR, multigrid and conjugate gradient methods on the CM-5 and Cray C-90 are described for Laplace's equation on the unit square and an L-shaped region. The domain decomposition method uses the Schwarz alternating method. In each domain we take the one-dimensional FFT to convert the problem into tridiagonal systems, which are solved by the scientific libraries installed on the CM-5 and the Cray C-90. On the CM-5, the V-cycle multigrid with symmetric smoothings on P-1 finite element spaces is run with red/black Gauss-Seidel relaxation. Multigrid with natural-order Gauss-Seidel relaxation is used on the Cray C-90. While natural-order SOR is used on the Cray C-90, red/black SOR is performed on the CM-5. Multigrid is the fastest method on the CM-5, and the three methods other than SOR give similar performance on the Cray C-90. This research was partially supported by the National Science Foundation under Grant No. CDA-9024618.
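The "transform to tridiagonal systems" step can be illustrated with a few lines of SciPy: a discrete sine transform in one direction diagonalizes that part of the 5-point Laplacian, leaving one independent tridiagonal system per Fourier mode in the other direction. This is a generic serial sketch under a zero-Dirichlet assumption (a DST is used here as the fast transform), not the CM-5/Cray library code of the paper; all names are ours.

```python
import numpy as np
from scipy.fft import dst, idst
from scipy.linalg import solve_banded

def poisson_fft_tridiag(f, h):
    """Solve laplace(u) = f with zero Dirichlet boundaries on a uniform
    interior grid: sine-transform along axis 1, one tridiagonal solve per
    mode along axis 0, then back-transform."""
    n, m = f.shape
    fhat = dst(f, type=1, axis=1)
    lam = 2.0 * np.cos(np.pi * np.arange(1, m + 1) / (m + 1)) - 2.0
    uhat = np.empty_like(fhat)
    ab = np.ones((3, n))                     # tridiagonal in banded storage
    for k in range(m):
        ab[1, :] = lam[k] - 2.0              # main diagonal; off-diagonals are 1
        uhat[:, k] = solve_banded((1, 1), ab, h * h * fhat[:, k])
    return idst(uhat, type=1, axis=1)

# quick check against a manufactured solution u = sin(pi x) sin(pi y)
n = 63; h = 1.0 / (n + 1)
x = np.arange(1, n + 1) * h
X, Y = np.meshgrid(x, x, indexing="ij")
u_exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
f = -2.0 * np.pi ** 2 * u_exact              # laplace(u_exact) = f
print(np.max(np.abs(poisson_fft_tridiag(f, h) - u_exact)))   # small O(h^2) error
```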

7.
The parallel version of the sheet metal forming semi-implicit finite element code ITAS3D has been developed using the domain decomposition method (DDM) and direct solution methods at both the subdomain and interface levels. The IBM Message Passing Library is used for data communication between tasks of the parallel code. Solutions of some sheet metal forming problems on an IBM SP2 computer show that the adopted DDM algorithm with the direct solver provides acceptable parallel efficiency using a moderate number of processors. A speedup of 6.7 is achieved for a problem with 20000 degrees of freedom on the 8-processor configuration.

8.
For parallel computers with distributed-memory architectures, partitioning computation and data appropriately to increase data locality and reduce inter-processor communication is the key to improving parallel performance. During data partitioning, however, redistribution communication is sometimes unavoidable; how to partition data and computation so as to reduce communication while exploiting as much of a program's parallelism as possible is therefore an important problem in parallelizing compilation. This paper discusses an algorithm for automatic computation and data partitioning that supports data redistribution.

9.
QCDOC is a massively parallel supercomputer with tens of thousands of nodes distributed on a six-dimensional torus network. The 6D structure of the network provides the communication resources needed by many communication-intensive applications. In this paper, we present a parallel algorithm for the three-dimensional Fast Fourier Transform and its implementation on a 4096-node QCDOC prototype. Two techniques have been used to increase its parallel performance: simultaneous multi-dimensional communication and communication-and-computation overlapping. Benchmarking experiments suggest that 3D FFTs of size 128×128×128 can scale well on such platforms up to 4096 nodes. Our performance results suggest stronger scalability on QCDOC than on the IBM BlueGene/L supercomputer.

10.
FFTs in external or hierarchical memory
Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason is that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit-stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main-memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main-memory routine runs at nearly two gigaflops. A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89.

11.
For parallel computers with distributed-memory architectures, partitioning computation and data appropriately to increase data locality and reduce inter-processor communication is the key to improving parallel performance. This paper discusses a static algorithm for automatic computation and data partitioning without data reorganization.

12.
The implementation and performance of the multidimensional Fast Fourier Transform (FFT) on a distributed-memory Beowulf cluster is examined. We focus on the three-dimensional (3D) real transform, an essential computational component of Galerkin and pseudo-spectral codes. The approach studied is a 1D domain decomposition algorithm that relies on a communication-intensive transpose operation involving P processors. Communication is based upon the standard portable message passing interface (MPI). We show that 1/P scaling of the execution time at fixed problem size N^3 (i.e., linear speedup) can be obtained provided that (1) the transpose algorithm is optimized for simultaneous block communication by all processors; and (2) communication is arranged as non-overlapping pairwise exchanges between processors, thus eliminating blocking when standard Fast Ethernet interconnects are employed. This method provides the basis for scalable and efficient spectral-method computations of hydrodynamic and magneto-hydrodynamic turbulence on Beowulf clusters assembled from standard commodity components. An example is presented using a 3D passive scalar code.
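The structure of the 1D (slab) decomposition can be emulated serially: each "process" transforms the two directions that are complete within its slab, a global transpose regroups the data, and the remaining direction is transformed. In the sketch below the transpose stands in for the MPI all-to-all; a complex transform is used for brevity although the paper targets the real transform, and all names are ours.

```python
import numpy as np

def fft3d_slab(a, p):
    """3-D FFT organized the way a slab decomposition over p processes
    would compute it (serial emulation; np.swapaxes plays the role of the
    communication-intensive transpose)."""
    n = a.shape[0]
    assert n % p == 0
    # phase 1: per-slab 2-D FFTs over the two locally complete axes (y, z)
    slabs = [np.fft.fftn(a[r * n // p:(r + 1) * n // p], axes=(1, 2)) for r in range(p)]
    work = np.concatenate(slabs, axis=0)
    # phase 2: global transpose x <-> y (the all-to-all step in the MPI code)
    work = np.swapaxes(work, 0, 1).copy()
    # phase 3: per-slab 1-D FFTs along the remaining (originally x) direction
    slabs = [np.fft.fft(work[r * n // p:(r + 1) * n // p], axis=1) for r in range(p)]
    # transpose back so the result is in natural order
    return np.swapaxes(np.concatenate(slabs, axis=0), 0, 1).copy()

a = np.random.rand(32, 32, 32) + 1j * np.random.rand(32, 32, 32)
print(np.allclose(fft3d_slab(a, 4), np.fft.fftn(a)))   # expected: True
```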

13.
A hybrid message passing and shared memory parallelization technique is presented for improving the scalability of the adaptive integral method (AIM), an FFT-based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary regular grid and the associated AIM calculations: if there are M processors and T cores per processor, the scheme (i) divides the regular grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the solution of the combined-field integral equation pertinent to the analysis of time-harmonic electromagnetic scattering from perfectly conducting surfaces. The scalability of the scheme is investigated theoretically and verified on a state-of-the-art multi-core cluster for benchmark scattering problems. Timing and speedup results on up to 1024 quad-core processors show that the hybrid MPI/OpenMP parallelization of AIM exhibits better strong scalability (fixed-problem-size speedup) than a pure MPI parallelization when multiple cores are used on each processor.

14.
A linear data and computation partitioning algorithm for parallelizing compilers
For high-performance parallel computers, finding a good computation and data partitioning that increases data locality and reduces inter-processor communication is the key to improving parallel performance. This paper discusses a linear algorithm for automatic computation and data partitioning without data reorganization.

15.
Equipped with 512-bit wide SIMD instructions and large numbers of computing cores, the emerging x86-based Intel(R) Many Integrated Core (MIC) Architecture offers not only high floating-point performance, but also substantial off-chip memory bandwidth. The 3D FFT (three-dimensional fast Fourier transform) is a widely-studied algorithm; however, the conventional algorithm needs to traverse the data set three times. In each pass, it computes multiple 1D FFTs along one of the three dimensions, giving rise to plenty of strided memory accesses. In this paper, we propose a two-pass 3D FFT algorithm, which mainly aims to reduce the amount of explicit data transfer between the memory and the on-chip cache. The main idea is to split one dimension into two sub-dimensions, and then combine the transform along each sub-dimension with one of the remaining dimensions respectively. The difference in the amount of TLB misses resulting from decomposition along different dimensions is analyzed in detail. Multi-level parallelism is leveraged on the many-core system for a high degree of parallelism and better data reuse. On top of this, a number of optimization techniques, such as memory padding, loop transformation and vectorization, are employed in our implementation to further enhance the performance. We evaluate the algorithm on the Intel(R) Xeon Phi(TM) coprocessor 7110P, and achieve a maximum performance of 136 Gflops with 240 threads in offload mode, which outperforms the vendor-specific Intel(R) MKL library by a factor of up to 2.22X.
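The splitting idea can be written down compactly in NumPy: the z dimension of length n1*n2 is viewed as two sub-dimensions, pass one transforms y together with the inner sub-dimension, a twiddle multiplication stitches the split back together, and pass two transforms x together with the outer sub-dimension. This sketch (our own, with hypothetical names) only demonstrates the index algebra; it says nothing about the cache/TLB behaviour the paper optimizes.

```python
import numpy as np

def fft3d_two_pass(a, n1, n2):
    """Two-pass 3-D FFT sketch: the z axis (length n1*n2) is split into
    sub-dimensions z1, z2, and each pass combines one sub-dimension with
    one of the remaining full dimensions, traversing the array only twice."""
    nx, ny, nz = a.shape
    assert nz == n1 * n2
    b = a.reshape(nx, ny, n2, n1)                   # z index = z1 + n1*z2
    b = np.fft.fftn(b, axes=(1, 2))                 # pass 1: y together with z2
    z1, kz2 = np.arange(n1), np.arange(n2)
    b = b * np.exp(-2j * np.pi * np.outer(kz2, z1) / nz)   # twiddle factors
    c = np.fft.fftn(b, axes=(0, 3))                 # pass 2: x together with z1
    return c.swapaxes(2, 3).reshape(nx, ny, nz)     # output kz = kz2 + n2*kz1

a = np.random.rand(16, 16, 64) + 1j * np.random.rand(16, 16, 64)
print(np.allclose(fft3d_two_pass(a, 8, 8), np.fft.fftn(a)))   # expected: True
```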

16.
Research on the parallelization of numerical simulation over a multi-component phase space
1 Introduction. The gas-kinetic numerical simulation of three-dimensional flow around bodies is based on a unified gas-kinetic numerical algorithm covering regimes from rarefied to continuum flow, a new algorithm developed under the guidance of Academician Zhang Hanxin at the China Aerodynamics Research and Development Center; its mathematical model and computational method are discussed in detail in Ref. [1]. Taking this algorithm as an example, this paper focuses on parallelization techniques over a multi-component phase space. The computational space of the simulation is a six-dimensional space composed of the discrete velocity coordinate space and the position space, forming a two-component phase space. The algorithm is divided into two main parts: one part, at each discrete velocity point, is based on time and ...

17.
Optimization of the 1-D FFT algorithm on a many-core processor based on hardware/software cooperative support
With the ever-increasing demand for high-performance computing, on-chip many-core processors have become the direction of future processor architectures. The fast Fourier transform (FFT), an important application in high-performance computing, places high demands on both computing power and communication bandwidth. Implementing efficient and scalable FFT algorithms on many-core platforms is therefore a challenge faced jointly by algorithm and architecture designers. This paper optimizes and evaluates the 1-D FFT algorithm on the Godson-T many-core processor platform. While saving almost one third of the L2 cache storage overhead, optimization strategies such as hiding the matrix transpose and overlapping computation with communication bring more than a 3x performance improvement to the optimized 1-D FFT. An experimental analysis of on-chip network congestion further shows that, for memory-bandwidth-bound applications such as the FFT, increasing the L2 cache access bandwidth can relieve the pressure that bursty reads and writes place on the on-chip network and the L2 cache, further improving the program's performance and scalability.

18.
A parallel finite element procedure for contact-impact problems
An efficient parallel finite element procedure for contact-impact problems is presented within the framework of explicit finite element analysis with the penalty method. The procedure includes a parallel Belytschko-Lin-Tsay shell element generation algorithm and a parallel contact-impact algorithm based on the master-slave slideline algorithm. An element-wise domain decomposition strategy and a communication minimization strategy are featured to achieve almost perfect load balancing among processors and to show scalability of the parallel performance. Throughout this work, a prototype code, named GT-PARADYN, was developed on the IBM SP2 to implement the procedure under the message-passing paradigm. Some examples are provided to demonstrate the timing results of the algorithms, discussing the accuracy and efficiency of the code.

19.
The one-dimensional fast Fourier transform (FFT) is the most popular tool for calculating the multidimensional Fourier transform. As a rule, to compute the n-dimensional FFT, a standard method of combining one-dimensional FFTs, the so-called "by rows and columns" algorithm, is used in the literature. For fast calculations, researchers try to use parallel computation tools, the most successful being algorithms tailored to the architecture of the computing device: cluster, video card, GPU, etc. [1, 2]. This paper studies the possibility of parallelizing another algorithm for FFT calculation, an n-dimensional analog of the Cooley-Tukey algorithm [3, 4]. The focus is on this analog of the Cooley-Tukey algorithm because the number of operations it requires to calculate the n-dimensional FFT is considerably smaller than in the conventional approach: $nN^n \log_2 N$ addition operations and $\frac{2^n - 1}{2^n} N^n \log_2 N$ multiplication operations, against $N^{n+1} \log_2 N$ addition operations and $\frac{1}{2} N^{n+1} \log_2 N$ multiplication operations when combining one-dimensional FFTs.

20.
Xu Nini, Yu Haiyan, Xiao Zhitao. Journal of Computer Applications (计算机应用), 2010, 30(10): 2777-2780
A decimation-in-frequency (DIF) multidimensional vector-radix fast Fourier transform (FFT) algorithm is presented. For each dimension of a multidimensional frequency-domain signal, vector-radix-2 decimation in frequency is applied, and a general form of the butterfly operation of the fast algorithm is derived. The FFT algorithm applies to any integer number of dimensions; when the number of dimensions is 1, it reduces to the well-known decimation-in-frequency radix-2 FFT. To facilitate programming, an implementation flow of the fast algorithm is given using the three-dimensional DIF vector-radix FFT as an example, and the flow generalizes readily to an arbitrary integer number of dimensions. A comparison of computational cost shows that the DIF multidimensional vector-radix FFT requires fewer operations than the multidimensional row-column (separable) FFT algorithm.
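For the two-dimensional case, the decimation-in-frequency vector-radix-2 butterfly described above can be sketched as follows: each recursion level performs one 2×2 butterfly per point, applies twiddle factors to three of the four quarter-size blocks, and recurses, scattering results onto the even/odd output grid. This is our own minimal recursive illustration for square power-of-two inputs, not the general n-dimensional algorithm of the paper.

```python
import numpy as np

def vr_fft2_dif(x):
    """2-D decimation-in-frequency vector-radix-2 FFT (recursive sketch)."""
    n = x.shape[0]
    if n == 1:
        return x.astype(complex)
    h = n // 2
    q00, q01 = x[:h, :h], x[:h, h:]
    q10, q11 = x[h:, :h], x[h:, h:]
    w = np.exp(-2j * np.pi * np.arange(h) / n)      # twiddles W_n^m, m = 0..h-1
    w1, w2 = w[:, None], w[None, :]
    y00 = q00 + q01 + q10 + q11                      # even rows, even columns
    y01 = (q00 - q01 + q10 - q11) * w2               # even rows, odd columns
    y10 = (q00 + q01 - q10 - q11) * w1               # odd rows, even columns
    y11 = (q00 - q01 - q10 + q11) * w1 * w2          # odd rows, odd columns
    out = np.empty((n, n), dtype=complex)
    out[0::2, 0::2] = vr_fft2_dif(y00)
    out[0::2, 1::2] = vr_fft2_dif(y01)
    out[1::2, 0::2] = vr_fft2_dif(y10)
    out[1::2, 1::2] = vr_fft2_dif(y11)
    return out

x = np.random.rand(32, 32)
print(np.allclose(vr_fft2_dif(x), np.fft.fft2(x)))   # expected: True
```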
