共查询到10条相似文献,搜索用时 171 毫秒
1.
《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences these will have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. 相似文献
2.
3.
Statement-Level Communication-Free Partitioning Techniques for Parallelizing Compilers 总被引:3,自引:3,他引:0
This paper addresses the problem of communication-free partition of iteration spaces and data spaces along hyperplanes. To finding more possible communication-free hyperplane partitions, we treat statements within a loop body as separate schedulable units. Instead of using the information about data dependence distance or direction vectors, our technique explicitly formulates array references as transformations from statement-iteration spaces to data spaces. Based on these transformations, the necessary and sufficient conditions for communication-free partition along hyperplanes to be feasible have been proposed. This approach can be applied to all programs with an imperfectly nested loop or sequences of imperfectly nested loops, whose array references are affine functions of outer loop indices or loop invariant variables. The proposed approach is more practical than existing methods in finding the data and computation distribution patterns that can cause the processor to execute fully-parallel on multicomputers without any interprocessor communication. 相似文献
4.
5.
Gary W. Elsesser Viet N. Ngo Sourav Bhattacharya Wei -Tek Tsai 《The Journal of supercomputing》1994,8(2):135-161
Load balance is important because it may affect the speedup attained through the concurrent execution of loop iterations on a parallel processor. We study loop load balance in the context of the well-known Perfect benchmarks. Several static and dynamic characteristics of the Perfect benchmark DOALL loops are observed and interpreted. Thelate arrival of processors is noted as a major source of load imbalance. This observation suggested the idea ofprocessor preallocation. An analytic cost model is presented and the advantages of processor preallocation are demonstrated by experimental evaluation on a CRAY Y-MP8 under the Unicos operating system. 相似文献
6.
Agarwal A. Kranz D.A. Natarajan V. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(9):943-962
Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor 相似文献
7.
《International Journal of Parallel, Emergent and Distributed Systems》2012,27(3):227-248
Abstract This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation. 相似文献
8.
9.
10.
Inter‐iteration dependences in loops can hinder loop‐level parallelism. For some loops, existing thread‐level speculation techniques fail to expose their inherent loop‐level parallelism, because some inter‐iteration dependences are too costly to synchronize, predict, pre‐compute and isolate. This paper presents a compiler technique called loop recreation to change the nature of some dependences (by turning some inter‐iteration dependences into intra‐iteration ones and vice versa) in a loop so that the inter‐iteration dependences in the transformed loop are less costly to enforce at runtime than those in the original loop. We present an algorithm for finding an optimal loop recreation transformation with respect to a simple misspeculation cost model and demonstrate the performance advantages of loop recreation over two recent techniques for multicore systems running nine representative irregular applications. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献