首页 | 本学科首页   官方微博 | 高级检索  
     


A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems
Authors:Santosh Pande  Tareq Bali
Affiliation:Compiler Research Laboratory, Department of ECECS, P.O. Box 210030, University of Cincinnati, Cincinnati, Ohio, 45221-0030, f1
Abstract:Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new larger partition owns rule to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller noncompute intensive partition. A partition interaction graph is then constructed which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.
Keywords:parallelizing compilers  distributed memory systems  data partitioning  communication optimizations  loop partitioning  and scheduling
本文献已被 ScienceDirect 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号