Research on the Optimization of BLAS Level 1 and 2 Functions on the Shenwei Many-Core Processor
Citation: SUN Jia-Dong, SUN Qiao, DENG Pan, YANG Chao. Research on the Optimization of BLAS Level 1 and 2 Functions on Shenwei Many-Core Processor[J]. Computer Systems & Applications, 2017, 26(11): 101-108.
Authors: SUN Jia-Dong, SUN Qiao, DENG Pan, YANG Chao
Affiliation: SUN Jia-Dong: Institute of Software, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China. SUN Qiao, DENG Pan, YANG Chao: Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
Funding: Major Research Plan Integrated Project of the National Natural Science Foundation of China (91530323); National High Technology Research and Development Program of China (863 Program) (2015AA01A302)
Abstract: BLAS (Basic Linear Algebra Subprograms) is a specification that prescribes a set of low-level routines for common linear algebra operations on vectors and matrices, such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication. Its functions are divided into three levels, which provide basic vector-vector (level 1), matrix-vector (level 2), and matrix-matrix (level 3) operations, respectively. In this paper, we study the parallel implementation of BLAS level 1 and level 2 functions on the Shenwei many-core processor, exploit the characteristics of the platform to tune their performance in depth, and summarize techniques for parallelizing and optimizing programs on the Shenwei platform. The Shenwei 26010 CPU adopts a heterogeneous many-core architecture; its many computing cores provide large-scale parallel processing capability, giving a single chip a double-precision floating-point performance of 3 TFLOPS. The experimental results show that the BLAS level 1 and level 2 functions achieve average speedups of 11.x and 6.x, respectively, over the GotoBLAS reference implementation, and that each optimization technique yields a clear performance gain.
Keywords: BLAS; heterogeneous many-core; task parallelism; SIMD vectorization
Received: 2017/2/21
Revised: 2017/3/9
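To make the level 1 and level 2 distinction in the abstract concrete, the following is a minimal, unoptimized C sketch of one routine from each level. It is not the authors' Shenwei implementation: the names ref_daxpy and ref_dgemv are hypothetical, the matrix is assumed to be stored contiguously in row-major order, and the stride and transpose arguments of the standard BLAS interfaces are omitted for brevity.

```c
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha * x + y.
 * Hypothetical reference routine; unit stride is assumed. */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha * A * x + beta * y.
 * A is m-by-n, assumed contiguous and row-major (an assumption of
 * this sketch; standard BLAS uses column-major storage). */
void ref_dgemv(size_t m, size_t n, double alpha, const double *a,
               const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;               /* dot product of row i with x */
        for (size_t j = 0; j < n; ++j)
            acc += a[i * n + j] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}
```

The loops are deliberately serial and scalar. The task parallelism and SIMD vectorization named in the keywords would be applied on top of loops like these, and how the paper does so is not shown here.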