广义稠密对称特征问题标准化算法在GPU集群上的有效实现
引用本文:刘世芳,赵永华,于天禹,黄荣锋.广义稠密对称特征问题标准化算法在GPU集群上的有效实现[J].计算机科学,2020,47(4):6-12.
作者姓名:刘世芳  赵永华  于天禹  黄荣锋
作者单位:中国科学院计算机网络信息中心 北京 100080;中国科学院大学 北京 100080;中国科学院计算机网络信息中心 北京 100080
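Funding:National Key Research and Development Program of China; Strategic Priority Research Program of the Chinese Academy of Sciences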
摘    要:广义稠密对称特征问题的求解是许多应用科学和工程的主要任务,并且是计算电磁学、电子结构、有限元模型和量子化学等计算中的重要部分。将广义对称特征问题转化为标准对称特征问题是求解广义稠密对称特征问题的关键计算步骤。针对GPU集群,文中给出了广义稠密对称特征问题标准化块算法在GPU集群上基于MPI+CUDA的实现。为了适应GPU集群的架构,广义对称特征问题标准化算法将正定矩阵的Cholesky分解与传统的广义特征问题标准化块算法相结合,降低了标准化算法中不必要的通信开销,并且增强了算法的并行性。在基于MPI+CUDA的标准化算法中,GPU与CPU之间的数据传输操作被用来掩盖GPU内的数据拷贝操作,这消除了拷贝所花费的时间,进而提高了程序的性能。同时,文中还给出了矩阵在二维通信网格中行通信域和列通信域之间完全并行的点对点的转置算法和基于MPI+CUDA的具有多个右端项的三角矩阵方程BX=A求解的并行块算法。在中科院计算机网络信息中心的超级计算机系统“元”上,每个计算节点配置2块Nvidia Tesla K20 GPGPU卡及2颗Intel E5-2680 V2处理器,使用多达32个GPU对不同规模矩阵的基于MPI+CUDA的广义对称特征问题标准化算法进行测试,取得了较好的加速效果与性能,并且具有良好的可扩展性。当使用32个GPU对50000×50000阶的矩阵进行测试时,峰值性能达到了约9.21 Tflops。

关 键 词:广义对称特征问题标准化算法  GPU集群  CHOLESKY分解  转置算法  三角矩阵方程

Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster
LIU Shi-fang, ZHAO Yong-hua, YU Tian-yu, HUANG Rong-feng. Efficient Implementation of Generalized Dense Symmetric Eigenproblem Standardization Algorithm on GPU Cluster[J]. Computer Science, 2020, 47(4): 6-12.
Authors:LIU Shi-fang  ZHAO Yong-hua  YU Tian-yu  HUANG Rong-feng
Affiliation:(Computer Network Information Center, Chinese Academy of Sciences, Beijing 100080, China; University of Chinese Academy of Sciences, Beijing 100080, China)
Abstract:Solving the generalized dense symmetric eigenproblem is a central task in many areas of applied science and engineering, and is an important part of computations in electromagnetics, electronic structure, finite element modeling and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is a key computational step in solving it. This paper presents an MPI+CUDA implementation of a blocked standardization algorithm for the generalized dense symmetric eigenproblem on GPU clusters. To suit the GPU cluster architecture, the standardization algorithm combines the Cholesky decomposition of the positive definite matrix with the traditional blocked standardization algorithm, which reduces unnecessary communication overhead and increases the parallelism of the algorithm. In the MPI+CUDA implementation, data transfers between the GPU and the CPU are used to mask data copies within the GPU, which hides the time spent on copying and thereby improves the performance of the program. The paper also presents a fully parallel point-to-point transposition algorithm between the row communication domain and the column communication domain of the two-dimensional communication grid, and a parallel blocked algorithm based on MPI+CUDA for solving the triangular matrix equation BX=A with multiple right-hand sides. The implementation was tested on the supercomputer system "Era" of the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node is configured with 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors. Tests on matrices of different sizes using up to 32 GPUs show good speedup and performance, as well as good scalability. For a 50000×50000 matrix on 32 GPUs, the peak performance reaches approximately 9.21 Tflops.
Keywords:Generalized symmetric eigenproblem standardization blocked algorithm  GPU cluster  Cholesky decomposition  Transpose algorithm  Triangular matrix equations
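
The standardization step described in the abstract can be illustrated with a small serial sketch. Assuming the usual formulation, the generalized problem Ax = λBx with A symmetric and B symmetric positive definite is reduced via the Cholesky factorization B = LL^T to the standard problem Cy = λy, where C = L^{-1}AL^{-T} and y = L^T x. The pure-Python code below is only an illustration under that assumption, not the paper's blocked MPI+CUDA algorithm; the function names (`cholesky`, `solve_lower`, `standardize`) are hypothetical. It forms C with two triangular solves separated by a transpose, which hints at why a triangular solver with multiple right-hand sides and a transposition routine appear together in the abstract.

```python
import math


def cholesky(B):
    """Return lower-triangular L with B = L L^T (B symmetric positive definite)."""
    n = len(B)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = B[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = B[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def solve_lower(L, A):
    """Solve L X = A by forward substitution; each column of A is one right-hand side."""
    n, m = len(L), len(A[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            s = A[i][c] - sum(L[i][k] * X[k][c] for k in range(i))
            X[i][c] = s / L[i][i]
    return X


def transpose(M):
    """Return the transpose of a dense matrix stored as a list of rows."""
    return [list(row) for row in zip(*M)]


def standardize(A, B):
    """Return (C, L) with B = L L^T and C = L^{-1} A L^{-T}; A must be symmetric."""
    L = cholesky(B)
    Y = solve_lower(L, A)             # Y = L^{-1} A
    C = solve_lower(L, transpose(Y))  # L^{-1} Y^T = L^{-1} A L^{-T} since A = A^T
    return C, L


if __name__ == "__main__":
    # Illustrative 2x2 input chosen for this sketch (not data from the paper).
    A = [[4.0, 1.0], [1.0, 3.0]]
    B = [[2.0, 1.0], [1.0, 2.0]]
    C, L = standardize(A, B)
    for row in C:
        print(row)
```

The eigenvalues of C equal the generalized eigenvalues of the pair (A, B), and an eigenvector x of the original problem is recovered from an eigenvector y of C by solving L^T x = y. A distributed implementation would replace each loop with blocked operations on a two-dimensional process grid; that part is not shown here.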
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).