Exploring fine-grained task-parallel GEMM on single- and multi-GPU systems
Cite this article: ZHANG Shuai, LI Tao, WANG Yi-feng, JIAO Xiao-fan, YANG Yu-lu. Exploring fine-grained task-parallel GEMM on single- and multi-GPU systems[J]. Computer Engineering & Science, 2015, 37(5): 847-856.
Authors: ZHANG Shuai  LI Tao  WANG Yi-feng  JIAO Xiao-fan  YANG Yu-lu
Affiliation: College of Computer and Control Engineering, Nankai University, Tianjin 300071, China
Abstract: Dense linear algebra (DLA) operations are critical to many real-world applications such as pattern recognition and bioinformatics, and the general matrix-matrix multiplication (GEMM) routine lies at their foundation. In cuBLAS and MAGMA, GEMM is implemented as a set of kernel functions and achieves very high performance for large GEMM computations. However, the existing implementations are far less effective for batches of small GEMMs, and they cannot automatically scale with load balancing across multiple GPUs of different performance. We propose task-parallel GEMM (TPGEMM), which implements batched GEMM and multi-GPU GEMM with fine-grained task parallelism. The work of one or more GEMMs is decomposed into tasks that are dynamically scheduled onto one or more GPUs. TPGEMM avoids the overhead of launching multiple kernel functions for batched GEMM and achieves significantly higher performance than cuBLAS and MAGMA on batched small GEMMs. Built on low-overhead fine-grained task scheduling, TPGEMM supports automatic parallelization of a single GEMM across multiple GPUs and achieves close to 100% scaling efficiency on a workstation with four GPUs of different performance.

Keywords: GEMM  persistent kernel  task parallelism  load balancing
Received: 2014-10-15
Revised: 2014-12-20
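The core idea described in the abstract — decomposing one or more GEMMs into fine-grained output-tile tasks that persistent workers pull dynamically from a shared queue — can be sketched on the CPU with threads standing in for GPU workers. This is a minimal illustration, not the paper's CUDA implementation; the function name `task_parallel_gemm`, the tile size, and the worker count are assumptions made for the sketch.

```python
import itertools
import threading

import numpy as np


def task_parallel_gemm(pairs, tile=64, workers=4):
    """Compute C_k = A_k @ B_k for every (A_k, B_k) in `pairs` by splitting
    each GEMM into independent output tiles and letting persistent worker
    threads pull tile tasks from one shared counter (a sketch of
    TPGEMM-style fine-grained task scheduling, not the paper's CUDA code)."""
    outs = [np.zeros((a.shape[0], b.shape[1])) for a, b in pairs]

    # One flat task list over ALL GEMMs in the batch:
    # a task is (gemm index, row-tile origin, column-tile origin).
    tasks = [(k, i, j)
             for k, (a, b) in enumerate(pairs)
             for i in range(0, a.shape[0], tile)
             for j in range(0, b.shape[1], tile)]

    counter = itertools.count()  # shared task counter (atomic under CPython's GIL)

    def worker():
        # Persistent worker: loop until the task queue is exhausted,
        # instead of launching one "kernel" per GEMM.
        while True:
            t = next(counter)        # dynamically grab the next task id
            if t >= len(tasks):
                return               # no tasks left: worker retires
            k, i, j = tasks[t]
            a, b = pairs[k]
            # Each task fills one disjoint output tile:
            # C[i:i+T, j:j+T] = A[i:i+T, :] @ B[:, j:j+T]
            outs[k][i:i + tile, j:j + tile] = a[i:i + tile, :] @ b[:, j:j + tile]

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return outs
```

Because every tile is an independent task drawn from one shared counter, a fast worker simply consumes more tiles than a slow one — the same property that lets the paper's scheduler balance load across GPUs of different performance without any static partitioning.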

Indexed by: Wanfang Data and other databases.
