
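CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm
Cite this article:CAI Yu,SUN Cheng-Guo,DU Zhao-Hui,LIU Zi-Xing,KANG Meng-Bo,LI Shuang-Shuang.CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm[J].Journal of Software (软件学报),2021,32(8):2289-2306.
Authors:CAI Yu  SUN Cheng-Guo  DU Zhao-Hui  LIU Zi-Xing  KANG Meng-Bo  LI Shuang-Shuang
Affiliation:CPU Architecture Design Department, Chengdu Hygon IC Design Co., Ltd., Suzhou 215000, Jiangsu, China
Abstract:Improving the efficiency of heterogeneous HPL (high-performance Linpack) requires making full use of the computing power of both the accelerators and the general-purpose CPU. The accelerators integrate more compute cores and handle most of the computation, while the general-purpose CPU handles task scheduling and also takes part in the computation. Given a reasonable partition of tasks and a balanced load, optimizing CPU-side computing performance is particularly important for overall efficiency. Optimizing BLAS (basic linear algebra subprograms) functions for the architectural characteristics of a specific platform often makes fuller use of the general-purpose CPU and improves overall system efficiency. The BLIS (BLAS-like library instantiation software) library is an open-source BLAS framework that is easy to develop for, easy to port, and modular. Based on the architecture of the heterogeneous platform and the characteristics of the HPL algorithm, the BLAS functions at each level called on the CPU side are optimized using the three-level cache, vector instructions, and multi-threaded parallelism, and auto-tuning is applied to optimize the matrix blocking parameters, resulting in the HygonBLIS library. Compared with MKL, overall HPL performance in the heterogeneous environment improves by 11.8%.

Keywords:BLAS  genetic algorithm auto-tuning  vector instructions  data prefetching  multi-threaded parallelism
Received:2019/7/25
Revised:2020/3/19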

CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm
CAI Yu,SUN Cheng-Guo,DU Zhao-Hui,LIU Zi-Xing,KANG Meng-Bo,LI Shuang-Shuang.CPU-side High Performance BLAS Library Optimization in Heterogeneous HPL Algorithm[J].Journal of Software,2021,32(8):2289-2306.
Authors:CAI Yu  SUN Cheng-Guo  DU Zhao-Hui  LIU Zi-Xing  KANG Meng-Bo  LI Shuang-Shuang
Affiliation:CPU Architecture Design Department, Chengdu Hygon IC Design Co., Ltd., Suzhou 215000, China
Abstract:Improving the efficiency of heterogeneous HPL (high-performance Linpack) requires fully utilizing the computing power of both the acceleration components and the general-purpose CPU. The acceleration components integrate more computing cores and are responsible for the main calculation, while the general-purpose CPU is responsible for task scheduling and also participates in the calculation. Provided that tasks are divided reasonably and the load is balanced, optimizing CPU-side computing performance is particularly important for improving overall efficiency. Optimizing the basic linear algebra subprograms (BLAS) functions for the architectural characteristics of a specific platform can often make fuller use of general-purpose CPU computing capability and improve overall system efficiency. The BLAS-like Library Instantiation Software (BLIS) library is an open-source BLAS framework that offers ease of development, portability, and modularity. Based on the architecture of the heterogeneous platform and the characteristics of the HPL algorithm, this study uses the three-level cache, vector instructions, and multi-threaded parallelism to optimize the BLAS functions of each level called on the CPU side, and applies auto-tuning to optimize the matrix blocking parameters, resulting in the HygonBLIS library. Compared with MKL, the overall performance of HPL using HygonBLIS improves by 11.8% in the heterogeneous environment.
Keywords:BLAS  genetic algorithm auto-tuning  vectorization instruction  data prefetching  multi-threading parallelization
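
The abstract names two ideas that are easier to see in code: splitting a matrix product into cache-sized blocks, and auto-tuning the block sizes with a genetic search. The following is a minimal sketch of both, not the HygonBLIS implementation, which the abstract does not describe in detail. The function names, the block parameters `mc`, `kc`, `nc`, the candidate sizes, and the use of wall-clock time as the fitness measure are all assumptions made for illustration.

```python
import random
import time


def gemm_naive(a, b):
    """Reference product C = A * B on lists of lists."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)]
            for row in a]


def gemm_blocked(a, b, mc, kc, nc):
    """C = A * B computed block by block; mc, kc, nc are assumed block sizes."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for jc in range(0, n, nc):          # column panel of B and C
        for pc in range(0, k, kc):      # depth slice shared by A and B
            for ic in range(0, m, mc):  # row panel of A and C
                for i in range(ic, min(ic + mc, m)):
                    row_a, row_c = a[i], c[i]
                    for p in range(pc, min(pc + kc, k)):
                        a_ip, row_b = row_a[p], b[p]
                        for j in range(jc, min(jc + nc, n)):
                            row_c[j] += a_ip * row_b[j]
    return c


def time_blocks(a, b, blocks):
    """Fitness: wall-clock time of one blocked product (lower is better)."""
    start = time.perf_counter()
    gemm_blocked(a, b, *blocks)
    return time.perf_counter() - start


def tune_blocks(a, b, choices, generations=3, population=6, seed=0):
    """Toy genetic search over (mc, kc, nc) drawn from `choices`."""
    rng = random.Random(seed)
    pop = [tuple(rng.choice(choices) for _ in range(3))
           for _ in range(population)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda blk: time_blocks(a, b, blk))
        parents = ranked[: population // 2]       # selection: keep fastest half
        children = []
        while len(parents) + len(children) < population:
            p1, p2 = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # crossover
            if rng.random() < 0.3:                              # mutation
                child[rng.randrange(3)] = rng.choice(choices)
            children.append(tuple(child))
        pop = parents + children
    return min(pop, key=lambda blk: time_blocks(a, b, blk))
```

In an optimized library the innermost loops would be a vectorized micro-kernel and the block sizes would be chosen to match the cache hierarchy, so timings from this pure-Python version say nothing about real hardware. The sketch only shows the shape of the search: each candidate triple is scored by running the blocked product, and the faster triples are recombined and mutated.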