Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster

Cite this article:	LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster[J]. Computer Science, 2020, 47(4): 6-12.

Authors:	LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng

Affiliation:	Computer Network Information Center, Chinese Academy of Sciences, Beijing 100080, China; University of Chinese Academy of Sciences, Beijing 100080, China

Funding:	National Key Research and Development Program of China; Strategic Priority Research Program of the Chinese Academy of Sciences

Abstract:	Solving the generalized dense symmetric eigenproblem is a central task in many areas of applied science and engineering, and an important part of computations in electromagnetics, electronic structure, finite element modeling and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is a key computational step in solving it. This paper presents an MPI+CUDA implementation of the blocked standardization algorithm for the generalized dense symmetric eigenproblem on GPU clusters. To suit the architecture of the GPU cluster, the standardization algorithm combines the Cholesky decomposition of the positive definite matrix with the traditional blocked standardization algorithm, which reduces unnecessary communication overhead and increases the parallelism of the algorithm. In the MPI+CUDA implementation, data transfers between the GPU and the CPU are used to mask data copy operations within the GPU, which eliminates the time spent on copying and thereby improves the performance of the program. The paper also presents a fully parallel point-to-point transposition algorithm between the row communication domain and the column communication domain of the two-dimensional communication grid, and a parallel blocked algorithm based on MPI+CUDA for solving the triangular matrix equation BX=A with multiple right-hand sides. Tests were run on the supercomputer system "Era" at the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors, using up to 32 GPUs and matrices of different sizes. The MPI+CUDA standardization algorithm achieved good speedup and performance, and showed good scalability. With 32 GPUs and a matrix of order 50000×50000, the peak performance reached approximately 9.21 Tflops.

Keywords:	Generalized symmetric eigenproblem standardization blocked algorithm; GPU cluster; Cholesky decomposition; Transpose algorithm; Triangular matrix equations
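
As a rough illustration of the standardization step described in the abstract, the sketch below reduces the generalized problem A x = λ B x to the standard problem C y = λ y, where B = L L^T is the Cholesky factorization, C = L^{-1} A L^{-T} and y = L^T x. This is a serial, unblocked, pure-Python sketch and not the paper's blocked MPI+CUDA algorithm; the function names (`cholesky`, `solve_lower`, `transpose`, `standardize`) and the 2×2 example matrices are assumptions made purely for illustration. The forward substitution processes all right-hand-side columns together, loosely mirroring the idea of a triangular matrix equation with multiple right-hand sides.

```python
import math


def cholesky(B):
    """Return lower-triangular L with B = L L^T (B symmetric positive definite)."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = B[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("B is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (B[i][j] - s) / L[j][j]
    return L


def solve_lower(L, A):
    """Solve L X = A by forward substitution for every column of A."""
    n, m = len(L), len(A[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            s = sum(L[i][k] * X[k][c] for k in range(i))
            X[i][c] = (A[i][c] - s) / L[i][i]
    return X


def transpose(M):
    """Return the transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*M)]


def standardize(A, B):
    """Return (C, L) with C = L^{-1} A L^{-T} and B = L L^T."""
    L = cholesky(B)
    Y = solve_lower(L, A)              # Y = L^{-1} A
    Z = solve_lower(L, transpose(Y))   # Z = L^{-1} Y^T = C^T
    return transpose(Z), L


# Hypothetical 2x2 example: A symmetric, B symmetric positive definite.
A = [[2.0, 1.0], [1.0, 3.0]]
B = [[4.0, 2.0], [2.0, 5.0]]
C, L = standardize(A, B)
print(C)  # [[0.5, 0.0], [0.0, 0.625]]
```

For this example C happens to be diagonal, so its entries 0.5 and 0.625 are the eigenvalues; they agree with the roots of det(A − λB) = 16λ² − 18λ + 5. In a distributed setting the two triangular solves and the transpose are the steps that would be blocked and spread over the process grid.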
This paper is indexed in VIP, Wanfang Data and other databases.
