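Implementation and Optimization of 1-D FFT on the Sunway 26010 Many-Core Processor
Cite this article:ZHAO Yu-Wen, AO Yu-Long, YANG Chao, LIU Fang-Fang, YIN Wan-Wang, LIN Rong-Fen. Implementation and Optimization of 1-D FFT on the Sunway 26010 Many-Core Processor[J]. Journal of Software, 2020, 31(10): 3184-3196
Authors:ZHAO Yu-Wen  AO Yu-Long  YANG Chao  LIU Fang-Fang  YIN Wan-Wang  LIN Rong-Fen
Affiliation:Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; School of Mathematical Sciences, Peking University, Beijing 100871; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190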
Funding:National Key Research and Development Program of China (2016YFB0200603); Beijing Natural Science Foundation (JQ18001)
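Abstract:Based on the characteristics of the Sunway 26010 many-core processor, a many-core parallel 1-D FFT algorithm using a two-layer decomposition is proposed. Built on the iterative Stockham FFT computation framework and the Cooley-Tukey FFT algorithm, it decomposes a large-scale FFT into a series of small-scale FFTs, and improves FFT performance through platform-specific optimizations such as a well-designed task partitioning scheme, register communication, double buffering, and SIMD vectorization. Finally, the performance of the proposed algorithm is tested: compared with the FFTW 3.3.4 library running on a single management core (MPE), it achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with bandwidth utilization reaching up to 83.45%.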

Keywords:Sunway 26010 processor  1-D FFT  two-layer decomposition  Cooley-Tukey  many-core parallel
Received:2018-01-22
Revised:2018-09-20

General Implementation of 1-D FFT on the Sunway 26010 Processor
ZHAO Yu-Wen, AO Yu-Long, YANG Chao, LIU Fang-Fang, YIN Wan-Wang, LIN Rong-Fen. General Implementation of 1-D FFT on the Sunway 26010 Processor[J]. Journal of Software, 2020, 31(10): 3184-3196
Authors:ZHAO Yu-Wen  AO Yu-Long  YANG Chao  LIU Fang-Fang  YIN Wan-Wang  LIN Rong-Fen
Affiliation:Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190, China; National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China
Abstract:A many-core parallel 1-D FFT algorithm based on a two-layer decomposition is proposed according to the characteristics of the Sunway 26010 processor. Built on the iterative Stockham FFT framework and the Cooley-Tukey FFT algorithm, it decomposes a large-scale FFT into a series of small-scale FFTs. Performance is improved through platform-specific optimizations: a well-designed task partitioning scheme, register communication, double buffering, and SIMD vectorization. In performance tests against the FFTW 3.3.4 library running on a single MPE, the algorithm achieves an average speedup of 44.53x and a maximum speedup of 56.33x, with bandwidth utilization reaching up to 83.45%.
Keywords:Sunway 26010 processor  1-D FFT  two-layer decomposition  Cooley-Tukey  many-core parallel
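
Illustrative note: the abstract's central idea, decomposing a large FFT into a series of small FFTs via Cooley-Tukey, can be sketched in a few lines of plain Python. This is only a minimal single-level sketch under assumed names (`dft`, `cooley_tukey_split`, and the factor sizes `n1`, `n2` are hypothetical); it is not the paper's two-layer scheme and includes none of its Sunway-specific optimizations (register communication, double buffering, SIMD).

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT; stands in for a small-scale FFT kernel."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def cooley_tukey_split(x, n1, n2):
    """Compute a length n1*n2 DFT from n1 DFTs of length n2 and n2 DFTs of length n1.

    Index mapping (assumed for this sketch): j = j1 + n1*j2, k = n2*k1 + k2.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Step 1: n1 small DFTs of length n2 over the strided subsequences.
    inner = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]

    # Step 2: multiply by the twiddle factors W_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            inner[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Step 3: n2 small DFTs of length n1 across the first index.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([inner[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = col[k1]
    return out
```

Calling `cooley_tukey_split(x, 3, 4)` on a 12-element sequence should agree with `dft(x)` up to floating-point rounding. In a two-layer variant, each small DFT would itself be split again in the same way.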
This article is indexed in Wanfang Data and other databases.