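Optimization and Analysis of HPL on Domestic Heterogeneous System
Cite this article: SHUI Chao-Yang, YU Xian-Zhi, WANG Yin-Shan, TAN Guang-Ming. Optimization and Analysis of HPL on Domestic Heterogeneous System[J]. Journal of Software, 2021, 32(8): 2319-2328
Authors: SHUI Chao-Yang  YU Xian-Zhi  WANG Yin-Shan  TAN Guang-Ming
Affiliation: Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; University of Chinese Academy of Sciences, Beijing 100190
Funding: National Key Research and Development Program of China (2018YFB0204400, 2016YFB0201305, 2016YFB0200803, 2016YFB0200300); Strategic Priority Research Program of the Chinese Academy of Sciences (Category C) (XDC01030000); National Natural Science Foundation of China (61972377, 61432018, 61702483); Key Research Program of Frontier Sciences, Chinese Academy of Sciences (QYZDJ-SSW-JSC035)
Abstract (translated from the Chinese): As heterogeneous systems become an important choice for building supercomputers, getting the CPU and the accelerator to work in concert so that the system's computing performance is fully exploited is of great significance. HPL is the most important benchmark in high-performance computing. The traditional approach, which takes the HPL algorithm designed for CPU-only systems and merely uses the accelerator to speed up matrix multiplication, can no longer achieve good performance. To address this problem, an HPL performance model and a multithreaded fine... are proposed for a heterogeneous system built from a domestic processor and a domestic accelerator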

Keywords: HPL  heterogeneous system  cross-platform  performance modeling  exascale computing
Received: 2019-08-16
Revised: 2019-12-05

Optimization and Analysis of HPL on Domestic Heterogeneous System
SHUI Chao-Yang,YU Xian-Zhi,WANG Yin-Shan,TAN Guang-Ming. Optimization and Analysis of HPL on Domestic Heterogeneous System[J]. Journal of Software, 2021, 32(8): 2319-2328
Authors:SHUI Chao-Yang  YU Xian-Zhi  WANG Yin-Shan  TAN Guang-Ming
Affiliation:Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;University of Chinese Academy of Sciences, Beijing 100190, China
Abstract: As heterogeneous systems become one of the most important choices for building supercomputers, orchestrating the CPU and the accelerator to exploit the full computing capability of such systems is of great significance. HPL is the most important benchmark in the HPC field. The traditional HPL algorithm, which targets CPU-only systems, cannot achieve high performance on heterogeneous systems by merely offloading the matrix multiplication workload to accelerators. To solve this problem, this work proposes an HPL performance model and a multithreaded fine-grained pipelining algorithm for a heterogeneous system built from a domestic processor and a domestic accelerator. In addition, a lightweight cross-platform heterogeneous framework is implemented to support a cross-platform HPL algorithm. The proposed performance model accurately predicts HPL performance on similar heterogeneous systems. On an NVIDIA platform, the proposed HPL algorithm outperforms NVIDIA's proprietary counterpart by 9%. On the domestic-processor, domestic-accelerator platform, the final optimized Linpack program achieves 2.3 PFLOPS on 512 nodes, with a floating-point efficiency of 71.1%.
Keywords:HPL  heterogeneous system  cross-platform  performance modeling  exascale computing
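
Note on the reported figures: HPL floating-point efficiency is conventionally the ratio of achieved performance to theoretical peak. Under that assumption, the minimal sketch below backs out the peak and per-node figures implied by the numbers in the abstract (2.3 PFLOPS, 512 nodes, 71.1%). The function name is hypothetical, and because the abstract rounds its inputs, the derived values are approximate and are not figures reported by the authors.

```python
def implied_peak_pflops(achieved_pflops: float, efficiency: float) -> float:
    """Return the theoretical peak implied by efficiency = achieved / peak."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return achieved_pflops / efficiency


# Inputs taken from the abstract (rounded values).
achieved = 2.3      # PFLOPS achieved on the full run
efficiency = 0.711  # floating-point efficiency (71.1%)
nodes = 512         # number of nodes

# Derived values (assumption: efficiency is achieved / theoretical peak).
peak = implied_peak_pflops(achieved, efficiency)
per_node_achieved_tflops = achieved * 1000 / nodes
per_node_peak_tflops = peak * 1000 / nodes

print(f"implied system peak: {peak:.2f} PFLOPS")
print(f"achieved per node:   {per_node_achieved_tflops:.2f} TFLOPS")
print(f"implied peak/node:   {per_node_peak_tflops:.2f} TFLOPS")
```

With these inputs the implied system peak comes to roughly 3.2 PFLOPS, or about 6.3 TFLOPS per node, against about 4.5 TFLOPS achieved per node.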