New 2.5D Parallel Matrix Multiplication Algorithm Based on BLACS
LIAO Xia, LI Sheng-Guo, LU Yu-Tong, YANG Can-Qun. New 2.5D Parallel Matrix Multiplication Algorithm Based on BLACS [J]. Chinese Journal of Computers, 2021, 44(5): 1037-1050.
Authors: LIAO Xia  LI Sheng-Guo  LU Yu-Tong  YANG Can-Qun
Affiliation: College of Computer Science, National University of Defense Technology, Changsha 410073; National Supercomputer Center in Guangzhou, Sun Yat-sen University, Guangzhou 510006; Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073
Funding: Supported by the Key R&D Program of the Ministry of Science and Technology (2018YFB0204301); the National Key R&D Program of China (2018YFB0204303); the National Natural Science Foundation of China (61872392, U1811461); the National Numerical Windtunnel Project (NNW2019ZT6-B20, NNW2019ZT6-B21, NNW2019ZT5-A10); the Natural Science Foundation of Guangdong Province (2018B030312002); the Pearl River Talent Program of Guangdong Province for introduced innovative and entrepreneurial teams (2016ZT06D211); the Natural Science Foundation of Hunan Province (2019JJ40339); and a university research project (ZK18-03-01).


Abstract: Matrix-matrix multiplication is one of the most important basic operations in linear algebra and a building block of many scientific applications. As HPC (High-Performance Computing) moves toward exascale, the degree of parallelism of HPC systems keeps increasing, and the communication cost of parallel matrix multiplication becomes more and more significant. How to reduce the communication cost, improve the scalability of parallel matrix multiplication, and make full use of the computing resources of supercomputers are well-studied subjects. In this paper, a novel distributed parallel dense matrix multiplication algorithm is proposed, which can be regarded as a 2.5D PUMMA (Parallel Universal Matrix Multiplication Algorithm). The underlying idea of this implementation is to improve scalability by decreasing the volume of data transferred between processors at the cost of extra memory usage. The 2.5D matrix multiplication algorithm in this paper first divides the initial processes into c groups; utilizing the extra memory on the compute nodes, it stores copies of matrices A and B on each process group, performs 1/c of the PUMMA algorithm simultaneously on each group, and finally obtains the result of the matrix multiplication through MPI communications. Based on BLACS (Basic Linear Algebra Communication Subprograms), this paper implements a new data redistribution algorithm from 2D to 2.5D layouts; combined with the PUMMA algorithm, this yields the 2.5D PUMMA algorithm, which can directly replace PDGEMM (Parallel Double-precision General Matrix-matrix Multiplication). Compared with classic 2D algorithms such as PDGEMM in ScaLAPACK (Scalable Linear Algebra PACKage), the proposed algorithm reduces the number of communications, improves data locality, and has better scalability. The performance experiments are carried out on a supercomputer system with 300 compute nodes, each consisting of two 10-core Xeon E5-2692 v3 CPUs; the nodes are interconnected by the Tianhe-2 interconnection network (TH-Express 2) with a fat-tree topology. The effectiveness and efficiency of the 2.5D PUMMA algorithm are evaluated for various matrix sizes, blocking factors, stack sizes (the duplication factor c), and process counts. When the number of processes is large, such as 4096 processes, system tests show that the speedup over PDGEMM reaches 2.20 to 2.93. When the number of processes is small, such as 64 or 256 processes, the 2.5D PUMMA algorithm may perform worse than PDGEMM: besides the additional overhead of data redistribution, computation rather than communication is then the dominant cost. Furthermore, the proposed 2.5D PUMMA algorithm is applied to symmetric eigenvalue problems, where it accelerates the tridiagonal eigenvalue decomposition step; the speedup over the original ScaLAPACK implementation exceeds 1.2. Future work includes studying the 2.5D PUMMA algorithm for special matrices, such as Cauchy, Toeplitz, and Vandermonde matrices, which are widely used in scientific, industrial, and clinical fields. In this paper, the performance of the 2.5D PUMMA algorithm is analyzed through a large number of numerical experiments, practical suggestions are given, and future work is summarized.
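The replicate-and-reduce idea described in the abstract (split the processes into c groups, let each group hold copies of A and B and do 1/c of the multiplication, then combine the partial results with a reduction) can be illustrated with a short serial sketch. This is only an illustration of the arithmetic, not the paper's BLACS/MPI implementation: the function name is invented here, and the assumption that the 1/c of work corresponds to a slice of the inner (k) dimension is ours.

```python
import numpy as np

def matmul_2p5d_sketch(A, B, c):
    """Serial sketch of the 2.5D replicate-and-reduce scheme.

    The inner dimension of A @ B is split into c slices, one per process
    group.  Each group computes a partial product independently from its
    replicated copies of A and B; a final reduction (a sum here, an
    MPI reduction in the distributed algorithm) combines the c partials.
    """
    n = A.shape[1]                                   # inner (k) dimension
    bounds = np.linspace(0, n, c + 1, dtype=int)     # slice boundaries
    partials = []
    for g in range(c):                               # concurrent in the real algorithm
        lo, hi = bounds[g], bounds[g + 1]
        partials.append(A[:, lo:hi] @ B[lo:hi, :])   # 1/c of the flops
    return sum(partials)                             # the reduction step

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 12))
B = rng.standard_normal((12, 6))
C = matmul_2p5d_sketch(A, B, c=3)
assert np.allclose(C, A @ B)
```

Each group does the same total number of floating-point operations as 1/c of a 2D algorithm, which is why the extra memory (c replicas) buys reduced communication rather than reduced computation.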
Keywords: 2.5D parallel matrix multiplication algorithm; ScaLAPACK; parallel universal matrix multiplication algorithm (PUMMA); scalable universal matrix multiplication algorithm (SUMMA); distributed parallelism
This article is indexed in databases including VIP (Weipu) and Wanfang Data.