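Implementation and optimization of the fast multipole method on Sunway manycore processors
Cite this article: WANG Wu, WANG Shu yang, JIANG Jin rong, MENG Hong song. Implementation and optimization of the fast multipole method on Sunway manycore processors[J]. Computer Engineering & Science, 2019, 41(7): 1161-1167.
Authors: WANG Wu  WANG Shu yang  JIANG Jin rong  MENG Hong song
Affiliations: Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; National Supercomputing Center in Wuxi, Wuxi, Jiangsu 214072
Funding: National Key Research and Development Program of China (2017YFB0203303); Informatization Application Project of the Chinese Academy of Sciences under the 13th Five-Year Plan (XXH13506-405)
Abstract: The fast multipole method (FMM) is a fast and efficient numerical algorithm for solving the N-body problem, and it is widely used in simulations in cosmology, molecular dynamics, and related fields. The Sunway SW26010 is a domestically developed heterogeneous manycore processor with 260 cores (4 core groups). An FMM is designed and implemented for the manycore architecture of the SW26010, and its kernel functions (in particular the most time-consuming one, the particle pair interaction) are systematically optimized using asynchronous DMA, SIMD vectorization, loop unrolling, and inline assembly instruction tuning. For the particle pair interaction, the optimized code runs at about 400 times the speed of the original code running on the host core, and the floating-point performance on each core group reaches 250 GFLOPS, which is 32.5% of the theoretical peak performance.

Keywords: fast multipole method  heterogeneous manycore processor  N-body problem  performance optimization
Received: 2018-10-25
Revised: 2019-07-25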

Implementation and optimization of fast multipole method on Sunway manycore processors
WANG Wu,WANG Shu yang,JIANG Jin rong,MENG Hong song.Implementation and optimization of fast multipole method on Sunway manycore processors[J].Computer Engineering & Science,2019,41(7):1161-1167.
Authors:WANG Wu  WANG Shu yang  JIANG Jin rong  MENG Hong song
Affiliation:(1.Computer Network Information Center,Chinese Academy of Sciences,Beijing 100190; 2.University of Chinese Academy of Sciences,Beijing 100049; 3.National Supercomputing Center in Wuxi,Wuxi 214072,China)  
Abstract: The fast multipole method (FMM) is a fast and efficient numerical algorithm for solving the N-body problem and has many applications in cosmology and molecular dynamics. The Sunway SW26010 is a heterogeneous manycore processor developed independently by China, with 260 cores (4 core groups). We design and implement an FMM on the SW26010 manycore architecture. We also systematically optimize the performance of the kernel functions (especially the most time-consuming one, the particle pair interaction) using asynchronous direct memory access (DMA), SIMD vectorization, loop unrolling, and inline assembly tuning. For the particle pair interaction kernel, the optimized code runs at about 400 times the speed of the original code running on the host core, and the floating-point performance on each core group reaches 250 GFLOPS, which is 32.5% of the theoretical peak performance.
Keywords:fast multipole method (FMM)  heterogeneous manycore processor  N-body problem  performance optimization  
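
The particle pair interaction that the abstract identifies as the most time-consuming kernel can be illustrated with a minimal direct-summation sketch. The abstract does not give the form of the kernel, so this sketch assumes a Newtonian gravitational interaction with G = 1; the function name `p2p_direct` and the softening parameter `eps` are hypothetical. It is plain Python for readability and does not reproduce the authors' Sunway code or any of the DMA, SIMD, or assembly optimizations.

```python
import math

def p2p_direct(targets, sources, eps=0.0):
    """Direct particle pair interaction (assumed Newtonian gravity, G = 1).

    Each particle is a tuple (x, y, z, mass). Returns, for every target,
    a tuple (potential, ax, ay, az) summed over all sources.
    `eps` is a hypothetical softening length, not taken from the paper.
    """
    results = []
    for xi, yi, zi, _mi in targets:
        phi = ax = ay = az = 0.0
        for xj, yj, zj, mj in sources:
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps * eps
            if r2 == 0.0:
                continue  # skip the self-interaction term
            inv_r = 1.0 / math.sqrt(r2)
            inv_r3 = inv_r * inv_r * inv_r
            phi -= mj * inv_r          # potential contribution
            ax += mj * dx * inv_r3     # acceleration components
            ay += mj * dy * inv_r3
            az += mj * dz * inv_r3
        results.append((phi, ax, ay, az))
    return results

# Two particles on the x-axis, unit separation, masses 1 and 2.
particles = [(0.0, 0.0, 0.0, 1.0), (1.0, 0.0, 0.0, 2.0)]
out = p2p_direct(particles, particles)
```

The inner loop performs one reciprocal square root and a handful of multiply-adds per pair, with no dependence between pairs other than the running sums, which is the kind of loop structure that vectorization and unrolling are typically applied to.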
This article is indexed in Wanfang Data and other databases.