Optimization and analysis of HPL on a domestic (Chinese) heterogeneous-architecture system

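Cite this article:	SHUI Chao-Yang, YU Xian-Zhi, WANG Yin-Shan, TAN GuangMing. Optimization and analysis of HPL on a domestic (Chinese) heterogeneous-architecture system [J]. Journal of Software (软件学报), 2020, 31(7).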

Authors:	SHUI Chao-Yang, YU Xian-Zhi, WANG Yin-Shan, TAN GuangMing

Affiliation:	Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190

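Abstract:	As heterogeneous systems become an important choice for building supercomputers, coordinating the CPU and the accelerator so that the full computing performance of the system is exploited is of great significance. HPL is the most important benchmark in high-performance computing. The traditional approach, in which an HPL algorithm designed for CPU-only systems merely uses accelerators to speed up matrix multiplication, can no longer achieve good performance. To address this problem, this paper proposes a new HPL performance model based on a new domestic-processor/domestic-accelerator heterogeneous system and designs a novel multithreaded, fine-grained heterogeneous HPL algorithm. We built HPCX, a lightweight cross-platform heterogeneous acceleration framework, to implement the cross-platform HPL algorithm. Our performance model accurately predicts HPL performance on similar heterogeneous systems. On the NVIDIA GPU platform, our multithreaded fine-grained heterogeneous HPL algorithm outperforms nvhpl, NVIDIA's official closed-source program and currently the best-performing HPL on that platform, by 9%. At a scale of 512 nodes on the domestic-processor/domestic-accelerator platform, our new HPL algorithm achieves a measured peak performance of 2.3 PFLOPS and a floating-point efficiency of 71.1%.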
Keywords:	HPL; heterogeneous parallelism; cross-platform; performance modeling; exascale computing
Received:	2019/8/16
Revised:	2019/12/5
Optimization and analysis of HPL on China heterogeneous system

SHUI Chao-Yang,YU Xian-Zhi,WANG Yin-Shan,TAN GuangMing.Optimization and analysis of HPL on China heterogeneous system[J].Journal of Software,2020,31(7).

Authors:	SHUI Chao-Yang YU Xian-Zhi WANG Yin-Shan TAN GuangMing

Affiliation:	Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100190, China

Abstract:	As heterogeneous systems become one of the most important choices for building supercomputers, orchestrating the CPU and the accelerator to exploit the full computing capability of such systems is of great significance. HPL is the most important benchmark in the HPC field, but the traditional HPL algorithm, which targets CPU-only systems, cannot achieve high performance by offloading only the matrix multiplication workload to accelerators. To solve this problem, this work proposes a new HPL performance model and a multithreaded fine-grained pipelining algorithm for a China Processor-China Accelerator heterogeneous system. We also implement a lightweight cross-platform heterogeneous framework on which the cross-platform HPL algorithm is built. Our performance model accurately predicts HPL performance on similar heterogeneous systems. On the NVIDIA platform, our new HPL algorithm outperforms NVIDIA's proprietary counterpart by 9%. On the China Processor-China Accelerator platform, the fully optimized Linpack program achieves 2.3 PFLOPS on 512 nodes, with a floating-point efficiency of 71.1%.

Keywords:	HPL; heterogeneous architecture; parallel; cross-platform; performance modeling; exascale computing
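
The abstract reports a measured 2.3 PFLOPS at 71.1% floating-point efficiency on 512 nodes. HPL efficiency is conventionally the ratio of measured performance to theoretical peak, so those two numbers imply a peak figure. The sketch below works through that arithmetic. The function name and the per-node breakdown are illustrative assumptions, not values stated by the authors, and because 2.3 PFLOPS is itself rounded, the results are approximate.

```python
def implied_peak_pflops(measured_pflops: float, efficiency: float) -> float:
    """Theoretical peak implied by a measured HPL result and its efficiency.

    Assumes the conventional definition: efficiency = measured / peak.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return measured_pflops / efficiency


# Figures quoted in the abstract.
measured = 2.3       # PFLOPS, measured HPL performance
efficiency = 0.711   # 71.1% floating-point efficiency
nodes = 512

# Derived (hypothetical) quantities; none of these appear in the abstract.
peak = implied_peak_pflops(measured, efficiency)
per_node_measured_tflops = measured * 1000 / nodes
per_node_peak_tflops = peak * 1000 / nodes

print(f"implied system peak: {peak:.2f} PFLOPS")
print(f"measured per node:   {per_node_measured_tflops:.2f} TFLOPS")
print(f"implied peak/node:   {per_node_peak_tflops:.2f} TFLOPS")
```

Under these assumptions the implied system peak is roughly 3.2 PFLOPS, or about 6.3 TFLOPS per node, of which about 4.5 TFLOPS per node is sustained by HPL.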
