Design and Implementation of Batched GEMM for Deep Learning
Citation: HUANG Chun, JIANG Hao, QUAN Zhe, ZUO Ke, HE Nan, LIU Wen-Chao. Design and Implementation of Batched GEMM for Deep Learning[J]. Chinese Journal of Computers, 2022, 45(2): 225-239.
Authors: HUANG Chun  JIANG Hao  QUAN Zhe  ZUO Ke  HE Nan  LIU Wen-Chao
Affiliation: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073; College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082
Fund: National Key Research and Development Program of China (2020YFA0709803); Natural Science Foundation of Hunan Province (2018JJ3616); National Natural Science Foundation of China (61907034)
Abstract: In this paper, we present the design and implementation of batched general matrix-matrix multiplication (GEMM) for deep learning under a unified framework. We analyze in detail how the convolution kernels and the input and output feature maps are arranged as matrix data under the NCHW and NHWC storage formats when convolution is implemented with GEMM, and we show how these formats correspond to row-major and column-major matrix layouts. On this basis, to better reuse the shared convolution-kernel data, we propose converting a batch of input feature maps into a single matrix for computation. We design a unified batched blocked-GEMM framework that multiplies one matrix by several different matrices and can accept and produce matrix data in any of these storage formats. We optimize the blocked GEMM implementation by scheduling the computation according to the characteristics of the input parameters and by reusing the core computing module through a matrix-transposition trick, without introducing any additional data-rearrangement operations. Numerical experiments show that, for small and medium-sized matrices, our batched single-precision GEMM outperforms repeated calls to the original single-precision GEMM on four different processor platforms by up to 4.80%, 26.57%, 29.27%, and 25.55%, and by 2.37%, 14.37%, 9.89%, and 15.72% on average, respectively.
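The paper's actual interface is not reproduced on this page. The following C sketch only illustrates the kind of unified batched routine the abstract describes: one shared matrix A (the convolution kernels) multiplies a batch of matrices B[b] (the per-image feature-map matrices), and every operand carries its own row-/column-major flag. All names here (batched_sgemm, layout_t, idx) are hypothetical, and the naive triple loop stands in for the paper's blocked implementation; it fixes only the semantics of such an interface.

#include <stddef.h>

/* Hypothetical layout flag; not taken from the paper's API. */
typedef enum { ROW_MAJOR, COL_MAJOR } layout_t;

/* Offset of element (i, j) in a matrix stored with leading dimension ld. */
static size_t idx(layout_t layout, size_t i, size_t j, size_t ld)
{
    return layout == ROW_MAJOR ? i * ld + j : j * ld + i;
}

/*
 * Reference semantics of a unified batched SGEMM:
 *   C[b] = alpha * A * B[b] + beta * C[b],  b = 0 .. batch-1,
 * where the single m x k matrix A is shared by the whole batch (so an
 * optimized implementation packs it once and reuses it), and each k x n
 * B[b] and m x n C[b] may use either storage layout.
 */
void batched_sgemm(layout_t la, layout_t lb, layout_t lc,
                   size_t m, size_t n, size_t k, size_t batch,
                   float alpha, const float *A, size_t lda,
                   const float *const *B, size_t ldb,
                   float beta, float *const *C, size_t ldc)
{
    for (size_t b = 0; b < batch; ++b)
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t p = 0; p < k; ++p)
                    acc += A[idx(la, i, p, lda)] * B[b][idx(lb, p, j, ldb)];
                C[b][idx(lc, i, j, ldc)] =
                    alpha * acc + beta * C[b][idx(lc, i, j, ldc)];
            }
}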

Keywords: batched GEMM  convolution  block algorithm  deep learning  data layout
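As a concrete illustration of the layout correspondence mentioned in the abstract, consider the 1x1-convolution special case, where the im2col transform is the identity: an NHWC image buffer can be read in place as a row-major (H*W) x C matrix, while the same image in NCHW reads as a row-major C x (H*W) matrix, i.e., the column-major view of the former. The sketch below is our own illustration, not code from the paper; the general mapping for larger kernels is derived in the paper itself.

#include <stddef.h>

/*
 * 1x1 convolution over one NHWC image as a single row-major GEMM:
 *   y (H*W x Cout) = x (H*W x C) * w (C x Cout).
 * The NHWC buffer is used as a matrix in place; with NCHW the same GEMM
 * applies after flipping the row-/column-major interpretation, so neither
 * layout needs a data-rearrangement pass.
 */
static void conv1x1_nhwc(const float *x, const float *w, float *y,
                         size_t hw, size_t c, size_t cout)
{
    for (size_t i = 0; i < hw; ++i)
        for (size_t j = 0; j < cout; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < c; ++p)
                acc += x[i * c + p] * w[p * cout + j];
            y[i * cout + j] = acc;
        }
}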

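The matrix-transposition trick mentioned in the abstract can be made concrete as follows (again a sketch under our own naming, with a naive loop standing in for the optimized core kernel): a column-major m x n matrix is bit-identical to a row-major n x m matrix holding its transpose, and (A*B)^T = B^T * A^T, so a column-major product is obtained from the same row-major core kernel by swapping the operands and exchanging the dimensions, with no physical transpose or repacking.

#include <stddef.h>

/* Naive stand-in for the optimized row-major core kernel:
   C (m x n) = A (m x k) * B (k x n), all row-major. */
static void sgemm_rowmajor(size_t m, size_t n, size_t k,
                           const float *A, size_t lda,
                           const float *B, size_t ldb,
                           float *C, size_t ldc)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i * lda + p] * B[p * ldb + j];
            C[i * ldc + j] = acc;
        }
}

/* Column-major C = A * B via the row-major kernel: compute
   C^T (n x m, row-major) = B^T (n x k) * A^T (k x m), reusing the
   column-major buffers in place, since X column-major == X^T row-major. */
static void sgemm_colmajor(size_t m, size_t n, size_t k,
                           const float *A, size_t lda,
                           const float *B, size_t ldb,
                           float *C, size_t ldc)
{
    sgemm_rowmajor(n, m, k, B, ldb, A, lda, C, ldc);
}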
This article is indexed by VIP, Wanfang Data, and other databases.