Highly parallel GEMV with register blocking method on GPU architecture
Affiliation:1. Department of Computer, Shandong University, Weihai, China;2. Key Lab. of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;3. State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China;4. Institute of Microelectronics, Tsinghua University, Beijing, China;5. State Grid Information & Communication Company of Hunan EPC, China
Abstract:GPUs provide powerful computing capability, especially for data-parallel applications such as video/image processing. However, the complexity of the GPU architecture makes optimizing even a simple algorithm difficult, and different optimization strategies on a GPU often lead to very different performance. The matrix–vector multiplication routine for general dense matrices (GEMV) is an important kernel in video/image processing applications. We find that the GEMV implementations in CUBLAS and MAGMA are not efficient, especially for small or fat matrices. In this paper, we propose a novel register blocking method to optimize GEMV on the GPU architecture. The new method has three advantages. First, instead of using only one thread, we use a whole warp to compute each element of the vector y, so the method exploits the highly parallel GPU architecture. Second, register blocking is used to reduce the demand on off-chip memory bandwidth: the vector x is loaded once and reused across a block of rows held in registers. Third, the memory access order of the threads within a warp is carefully arranged so that coalesced memory access is guaranteed. The proposed optimizations for GEMV are comprehensively evaluated on matrices of different sizes, and the performance of the register blocking method is measured for different block sizes. Experimental results show that the new method achieves very high speedup over CUBLAS and MAGMA for small square matrices and fat matrices, and also achieves higher performance for large square matrices.
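To make the three ideas in the abstract concrete, the following is a minimal CUDA sketch of a warp-per-row GEMV kernel with register blocking. It is not the authors' implementation: the kernel name gemv_rb, the block size ROWS_PER_WARP, and the row-major layout are assumptions made for illustration. Each warp computes a block of R elements of y; each thread loads an element of x once into a register and reuses it for all R rows, and lanes stride over columns so that global loads coalesce.

```cuda
// Hypothetical sketch of register-blocked GEMV on a GPU; names and the
// block size ROWS_PER_WARP are illustrative assumptions, not the paper's code.
#include <cuda_runtime.h>

constexpr int WARP_SIZE     = 32;
constexpr int ROWS_PER_WARP = 4;   // register-block size R (assumed)

// Computes y = A * x for a row-major M x N dense matrix A.
__global__ void gemv_rb(const float* __restrict__ A,
                        const float* __restrict__ x,
                        float* __restrict__ y, int M, int N)
{
    int lane   = threadIdx.x % WARP_SIZE;
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int row0   = warpId * ROWS_PER_WARP;     // first row of this warp's block
    if (row0 >= M) return;

    float acc[ROWS_PER_WARP] = {0.0f};       // partial sums kept in registers

    // Consecutive lanes read consecutive addresses of A and x,
    // so all global memory loads coalesce.
    for (int j = lane; j < N; j += WARP_SIZE) {
        float xj = x[j];                      // x loaded once, reused R times
        #pragma unroll
        for (int r = 0; r < ROWS_PER_WARP; ++r) {
            int row = row0 + r;
            if (row < M)
                acc[r] += A[row * N + j] * xj;
        }
    }

    // Warp-level tree reduction of each row's partial sum.
    #pragma unroll
    for (int r = 0; r < ROWS_PER_WARP; ++r) {
        float v = acc[r];
        for (int off = WARP_SIZE / 2; off > 0; off /= 2)
            v += __shfl_down_sync(0xffffffffu, v, off);
        if (lane == 0 && row0 + r < M)
            y[row0 + r] = v;
    }
}
```

With, say, 128 threads per block (4 warps), a grid of ceil(M / (4 * ROWS_PER_WARP)) blocks covers all rows. The register-blocking payoff is that each element of x is fetched from off-chip memory once per warp instead of R times, which matters most for the fat matrices (large N, small M) where the abstract reports the largest speedups.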
Keywords:GEMV  Register blocking  Data reuse  Memory bandwidth  GPU  Many-core  Parallelization  CUDA
This article has been indexed by ScienceDirect and other databases.