
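SpMV Implementation and Optimization for the Domestic Sunway 26010 Many-Core Processor
Cite this article: LIU Fang-Fang, YANG Chao, YUAN Xin-Hui, WU Chang-Mao, AO Yu-Long. SpMV implementation and optimization for the domestic Sunway 26010 many-core processor [J]. Journal of Software (软件学报), 2018, 29(12): 3921-3932.
Authors: LIU Fang-Fang  YANG Chao  YUAN Xin-Hui  WU Chang-Mao  AO Yu-Long
Affiliations: Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049, Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190; State Key Laboratory of Computer Science (Institute of Software, Chinese Academy of Sciences), Beijing 100190; School of Mathematical Sciences, Peking University, Beijing 100871, National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190, Laboratory of Parallel Software and Computational Science, Institute of Software, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100049; School of Mathematical Sciences, Peking University, Beijing 100871
Funding: National Key Research and Development Program of China (2016YFB0200603); National Natural Science Foundation of China (91530323)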
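Abstract: Sunway TaihuLight, the world's first supercomputer with a peak performance above 100 PFlops, has been completed. It uses the domestic Sunway heterogeneous many-core processor, which differs from existing pure-CPU, CPU-MIC, and CPU-GPU architectures in that it adopts a master-slave core architecture; a single processor has a peak performance of 3 TFlops and a memory bandwidth of 130 GB/s. Sparse matrix-vector multiplication (SpMV) is a very important kernel in scientific and engineering computing. It is well known to be bandwidth-bound and to involve indirect memory accesses, so the Sunway processor poses a considerable challenge to implementing it efficiently. For the Sunway processor, this paper proposes a general heterogeneous many-core parallel algorithm for SpMV in CSR format. The algorithm is carefully designed with respect to task partitioning and LDM space partitioning, introduces a caching mechanism with dynamic and static buffers to raise the hit rate of accesses to the vector x, and introduces a dynamic-static task scheduling method to achieve load balance. The paper also analyzes several key factors that affect SpMV performance in this algorithm and applies adaptive optimization to improve performance further. Tests on 16 representative sparse matrices from the Matrix Market collection show a speedup of up to about 10x over the master-core version, with an average speedup of 6.51. When analyzed using the memory traffic of the master-core CSR SpMV, the test matrices reach up to 86% of the processor's measured bandwidth, and 47% on average.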
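For readers unfamiliar with the CSR format mentioned in the abstract, the following is a minimal serial sketch of CSR SpMV in Python. It is not the paper's many-core algorithm; it only illustrates the kernel being parallelized and where the indirect access to x arises. The array names `row_ptr`, `col_idx`, and `values` are conventional choices assumed here, not taken from the paper.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr, col_idx, values are assumed names for the three CSR arrays.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # x is read through col_idx: this is the indirect memory access.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Small illustrative matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]
y = csr_spmv(row_ptr, col_idx, values, x)
```

Each nonzero contributes one multiply-add but requires loading a value, a column index, and an element of x, which is why the kernel tends to be limited by memory bandwidth rather than by arithmetic.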

Keywords: sparse matrix-vector multiplication  SpMV  Sunway 26010 processor  heterogeneous many-core parallelism  adaptive optimization
Received: 2017/1/11

General SpMV Implementation in Many-Core Domestic Sunway 26010 Processor
LIU Fang-Fang,YANG Chao,YUAN Xin-Hui,WU Chang-Mao and AO Yu-Long.General SpMV Implementation in Many-Core Domestic Sunway 26010 Processor[J].Journal of Software,2018,29(12):3921-3932.
Authors:LIU Fang-Fang  YANG Chao  YUAN Xin-Hui  WU Chang-Mao and AO Yu-Long
Affiliation: Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; State Key Laboratory of Computer Science (Institute of Software, The Chinese Academy of Sciences), Beijing 100190, China; School of Mathematical Sciences, Peking University, Beijing 100871, China, National Research Center of Parallel Computer Engineering and Technology, Beijing 100190, China, Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China and Institute of Software, The Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; School of Mathematical Sciences, Peking University, Beijing 100871, China
Abstract: Sunway TaihuLight, the first supercomputer in the world with a peak performance of more than 100 PFlops, has been released. It uses heterogeneous many-core processors whose design differs from the existing pure CPU, CPU-MIC, and CPU-GPU architectures. Each processor has 4 core groups (CGs), each consisting of one management processing element (MPE) and one computing processing element (CPE) cluster of 64 CPEs. The peak performance of a single processor is 3 TFlops, and its memory bandwidth is 130 GB/s. Sparse matrix-vector multiplication (SpMV) is a very important kernel in scientific and engineering computing; it is bandwidth-limited and subject to indirect memory access, so implementing an efficient SpMV kernel on the Sunway processor is a big challenge. This paper proposes a general heterogeneous many-core SpMV algorithm for the traditional sparse matrix storage format CSR. The algorithm partitions the tasks and the LDM space in detail, uses a cache mechanism of dynamic and static buffers to improve the hit rate of accesses to vector x, and uses a dynamic-static task scheduling method to achieve load balance. In addition, several key factors affecting the performance of SpMV are analyzed, and adaptive optimization is carried out to further enhance the performance. Finally, 16 representative matrices from the Matrix Market collection are used for testing. The experimental results show that the algorithm reaches up to 86% of the processor's measured bandwidth, with an average bandwidth utilization of 47%. Compared with the MPE-only implementation, the speedup is up to about 10x, and the average speedup is 6.51x.
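The bandwidth utilization figures above are derived from the memory traffic of a CSR SpMV. The abstract does not give the exact traffic model, so the Python sketch below uses a simple, commonly used estimate as an assumption: 8-byte values, 4-byte indices, one read of x per nonzero (no reuse), and one write per element of y. The function names and the sample numbers are hypothetical and are not results from the paper.

```python
def csr_spmv_bytes(n_rows, nnz, value_bytes=8, index_bytes=4):
    """Estimate bytes moved by one CSR SpMV (assumed model, see lead-in)."""
    # Matrix data: values and column indices per nonzero, plus row pointers.
    matrix = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes
    # Assumption: x is read once per nonzero, with no cache reuse.
    x_reads = nnz * value_bytes
    # y is written once per row.
    y_writes = n_rows * value_bytes
    return matrix + x_reads + y_writes


def bandwidth_utilization(n_rows, nnz, seconds, measured_bw_bytes_per_s):
    """Fraction of the measured bandwidth achieved by one SpMV run."""
    achieved = csr_spmv_bytes(n_rows, nnz) / seconds
    return achieved / measured_bw_bytes_per_s


# Hypothetical inputs chosen only to exercise the formula.
traffic = csr_spmv_bytes(n_rows=1000, nnz=10000)
util = bandwidth_utilization(1000, 10000, seconds=1e-5,
                             measured_bw_bytes_per_s=100e9)
```

Under a different assumption about x (for example, each element of x loaded only once), the estimated traffic and therefore the reported utilization would be lower, so the model should be stated alongside any such percentage.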
Keywords:sparse matrix-vector multiplication  SpMV  Sunway 26010 processor  heterogeneous many-core  adaptive optimization