Efficient automatic parallelization of a single GPU program for a multiple GPU system
Affiliation: 1. Computer Science Department, University of Southern California, Los Angeles, USA; 2. Computer Engineering Department, University of Jordan, Amman, Jordan; 3. Electrical Engineering Department, University of Southern California, Los Angeles, USA
Abstract: Single-GPU scaling is unable to keep pace with the soaring demand for high-throughput computing. As such, executing an application on multiple GPUs connected through an off-chip interconnect is becoming an attractive option to explore. However, much existing code is written for a single-GPU system. Porting such code for execution on multiple GPUs is a difficult task. In particular, it requires programmer effort to determine how data is partitioned across the GPU cards and then to launch on each card the thread blocks that mostly access the data local to that card; otherwise, expensive cross-card data movement is incurred. In this work we explore hardware support to efficiently parallelize a single-GPU code for execution on multiple GPUs. In particular, our approach focuses on minimizing the number of remote memory accesses across the off-chip network without burdening the programmer with data partitioning and workload assignment. We propose a data-location-aware thread block scheduler that schedules each thread block on the GPU that holds most of its input data. The scheduler exploits the well-known observation that GPU workloads tend to launch a kernel multiple times iteratively to process large volumes of data, and that the memory accesses of a thread block are correlated across different iterations of a kernel launch. Our data-location-aware scheduler exploits this predictability by tracking the memory access affinity of each thread block to a specific GPU card and storing this information to make scheduling decisions for future iterations. To further reduce the number of remote accesses, we propose a hybrid mechanism that migrates or copies pages between the memories of the GPUs based on their access behavior, so that most memory accesses are served from local GPU memory. On an architecture consisting of two GPUs, our proposed schemes improve performance by 1.55× compared to single-GPU execution across the widely used Rodinia [17], Parboil [18], and Graph [23] benchmarks.
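The scheduling idea described in the abstract can be illustrated with a small host-side C++ sketch. This is a minimal simulation under assumptions of my own, not the paper's hardware mechanism: the page-to-GPU ownership map (ownerGpu), the per-block counters (BlockAffinity), and the synthetic access stream are all hypothetical stand-ins. The sketch records, for one kernel iteration, how many of each thread block's accesses land on pages owned by each GPU, then places the block on the GPU holding most of its data for the next iteration.

// Sketch only: profiles per-block access affinity in iteration i and uses it
// to pick a GPU for iteration i+1; in the paper this tracking is done in hardware.
#include <array>
#include <cstdint>
#include <iostream>
#include <vector>

constexpr int kNumGpus = 2;              // two-GPU system, as evaluated in the paper
constexpr std::size_t kPageSize = 4096;  // assumed page granularity

// Hypothetical page-to-GPU ownership lookup.
int ownerGpu(std::uintptr_t addr, const std::vector<int>& pageOwner) {
  return pageOwner[(addr / kPageSize) % pageOwner.size()];
}

struct BlockAffinity {
  std::array<std::uint64_t, kNumGpus> accesses{};  // per-GPU access counters
  int preferredGpu() const {
    int best = 0;
    for (int g = 1; g < kNumGpus; ++g)
      if (accesses[g] > accesses[best]) best = g;
    return best;
  }
};

int main() {
  // Toy setup: 8 thread blocks, 16 pages striped half-and-half across the two GPUs.
  std::vector<int> pageOwner(16);
  for (std::size_t p = 0; p < pageOwner.size(); ++p) pageOwner[p] = p < 8 ? 0 : 1;
  std::vector<BlockAffinity> affinity(8);

  // Iteration i: record which GPU owns the pages each block touches (synthetic addresses).
  for (int block = 0; block < 8; ++block) {
    for (int k = 0; k < 100; ++k) {
      std::uintptr_t addr = (block * 2 + (k % 2)) * kPageSize;  // stand-in access stream
      ++affinity[block].accesses[ownerGpu(addr, pageOwner)];
    }
  }

  // Iteration i+1: schedule each block on the GPU that held most of its data.
  for (int block = 0; block < 8; ++block)
    std::cout << "block " << block << " -> GPU " << affinity[block].preferredGpu() << "\n";
}

In this toy run blocks 0-3 end up on GPU 0 and blocks 4-7 on GPU 1, mirroring the intended effect of the proposed scheduler: repeated kernel launches reuse the observed affinity so that most accesses stay local. The paper's hybrid page migration/copy mechanism would additionally move or replicate pages whose accesses remain remote.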
Keywords: Multi-GPU; Automatic parallelization; Data movement
This article is indexed in ScienceDirect and other databases.