共查询到20条相似文献,搜索用时 953 毫秒
1.
Goumas G. Athanasaki M. Koziris N. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(10):1021-1034
This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code. 相似文献
2.
Hogstedt K. Carter L. Ferrante J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(3):307-321
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape. 相似文献
3.
Kazuaki Ishizaki Hideaki Komatsu Toshio Nakatani 《International journal of parallel programming》2000,28(2):135-154
Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization. 相似文献
4.
This paper applies unimodular transformations and tiling to improve data locality of a loop nest. Due to data dependences and reuse information, not all dimensions of the iteration space will and can be tiled. By using cones to represent data dependences and vector spaces to quantify data reuse in the program, a reuse-driven transformational approach is presented, which aims at maximizing the amount of data reuse carried in the tiled dimensions of the iteration space while keeping the number of tiled dimensions to a minimum (to reduce loop control overhead). In the special case of one single fully permutable loop nest, an algorithm is presented that tiles the program optimally so that all data reuse is carried in the tiled dimensions. In the general case of multiple fully permutable loop nests, data dependences can prevent all data reuse to be carried in the tiled dimensions. An algorithm is presented that aims at localizing data reuse in the tiled dimensions so that the reuse space localized has the largest dimensionality possible. 相似文献
5.
Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests 总被引:2,自引:0,他引:2
Nawaaz Ahmed Nikolay Mateev Keshav Pingali 《International journal of parallel programming》2001,29(5):493-544
Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks. 相似文献
6.
《Journal of Parallel and Distributed Computing》1995,29(1):43-59
This paper deals with parallel scheduling techniques for uniform and affine loop nests. We deal with affine-by-statement scheduling, a powerful extension of Lamport′s hyperplane method where each statement within the loop nest is scheduled by a different timing function. We present a new, constructive and efficient method to determine the optimal (i.e., with smallest latency) affine-by-statement scheduling. We also consider parametric loop nests, where loop limits (in addition to being affine functions of outer loops) involve program variables whose values may not be known at compile-time (but are runtime constants). We then derive parameter-independent affine-by-statement schedules, and we show that these schedules are asymptotically as efficient as parameter-dependent solutions while much more regular. This theoretical result is of importance in practice, as regularity is a key factor for loop rewriting and code generation. 相似文献
7.
Yeong-Sheng Chen Sheng-De Wang Chien-Min Wang 《Journal of Parallel and Distributed Computing》1996,35(2):123
In this paper, an approach to tiling nested loops for maximizing parallelism is proposed. The proposed method aims at aggregating independent computations of a loop nest into rectangular blocks and maximizing the block sizes for maximizing parallelism. At first, all the independent computations that can be executed in the first time unit are identified. These computations are called the initially independent computations. Then it is shown that all of them can be collected as a union of rectangular blocks. So, based on these, the entire iteration space of the loops is partitioned into rectangular blocks for maximizing parallelism. The proposed method is formulated as systematic procedures which can easily be implemented in a parallelizing compiler. It is shown that when the wavefront transformation is combined with the proposed method, the loops can always be tiled so that the tile size is greater than one. In comparison with previous work on tiling, the proposed method is shown to have several advantages as summarized in the conclusions of this paper. 相似文献
8.
An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five 相似文献
9.
This paper presents a new compiler approach to minimizing the number of barriers executed in parallelized programs. A simple procedure is developed to reduce the complexity of barrier placement by eliminating certain data dependences, without affecting optimality. An algorithm is presented which, provably, places the minimal number of barriers in perfect loop nests and in certain imperfect loop nest structures. This scheme is generalized to accept entire, well-structured control-flow programs containing arbitrary nesting of IF constructs, loops, and subroutines. It has been implemented in a prototype parallelizing compiler and applied to several well-known benchmarks where it has been shown to place significantly fewer synchronization points than existing techniques. Experiments indicate that on average the number of barriers executed is reduced by 70 percent and there is a three fold improvement in execution time when evaluated on a 32-processor SGI Origin 2000 相似文献
10.
Fernandez A. Llaberia J.M. Valero-Garcia M. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(8):832-840
Linear transformations are widely used to vectorize and parallelize loops. A subset of these transformations are unimodular transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix 相似文献
11.
12.
《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences these will have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. 相似文献
13.
循环分块是一种提高循环Cache命中率的循环变换技术,循环分块的大小是决定循环分块效率的关键因素,CME(cache miss equations)是一种精确分析程序中循环Cache命中率的数学模型,从CME理论模型出发,通过比较循环分块前后CME的变化,结合PADDING技术可以得出一个循环分块算法。实验表明,通过该算法计算出的块大小较之经典的LRW循环分块算法,在确保完全消除循环中数且引用数据访问Cache自冲突的同时,可以获得更大的分块,从而提高了循环分块的分块效率。 相似文献
14.
Thomas Fahringer 《Concurrency and Computation》1996,8(4):261-282
In order to improve a parallel program's performance it is critical to evaluate how even the work contained in a program is distributed over all processors dedicated to the computation. Traditional work distribution analysis is commonly performed at the machine level. The disadvantage of this method is that it cannot identify whether the processors are performing useful or redundant (replicated) work. The paper describes a novel method of statically estimating the useful work distribution of distributed-memory parallel programs at the program level, which carefully distinguishes between useful and redundant work. The amount of work contained in a parallel program, which correlates with the number of loop iterations to be executed by each processor, is estimated by accurately modeling loop iteration spaces, array access patterns and data distributions. A cost function defines the useful work distribution of loops, procedures and the entire program. Lower and upper bounds of the described parameter are presented. The computational complexity of the cost function is independent of the program's problem size, statement execution and loop iteration counts. As a consequence, estimating the work distribution based on the described method is considerably faster than simulating or actually compiling and executing the program. Automatically estimating the useful work distribution is fully implemented as part of P3T, which is a static parameter based performance prediction tool under the Vienna Fortran Compilation System (VFCS). The Lawrence Livermore Loops are used as a test case to verify the approach. 相似文献
15.
多核处理器已广泛应用于高性能计算领域,如何有效地将传统串行程序转换为并行代码并减少程序中嵌套循环所占用时间仍是该领域的挑战性问题。本文首先基于多面体模型对嵌套循环进行依赖特征分析并实现瓦片分割,据此自动生成粗粒度并行代码。针对多核阵列处理器的结构特点,采用遗传算法生成通信优化的瓦片任务序列,在此基础上建立了有效的任务调度模型。最后将上述方法应用于LU分解,结果表明该方法与传统调度算法相比,在增加数据局部性、实现负载平衡方面具有更好效果。 相似文献
16.
17.
Constructive methods for scheduling uniform loop nests 总被引:1,自引:0,他引:1
This paper surveys scheduling techniques for loop nests with uniform dependences. First, we introduce the hyperplane method and related variants. Then we extend it by using a different affine scheduling for each statement within the nest. In both cases, we present a new, constructive, and efficient method to determine optimal solutions, i.e., schedules whose total execution time is minimum 相似文献
18.
19.
Dror E. Maydan John L. Hennessy Monica S. Lam 《International journal of parallel programming》1995,23(1):63-81
Data dependence testing is the basic step in detecting loop level parallelism in numerical programs. The problem is undecidable
in the general case. Therefore, work has been concentrated on a simplified problem, affine memory disambiguation. In this
simpler domain, array references and loops bounds are assumed to be linear integer functions of loop variables. Dataflow information
is ignored. For this domain, we have shown that in practice the problem can be solved accurately and efficiently.(1) This paper studies empirically the effectiveness of this domain restriction, how many real references are affine and flow
insensitive. We use Larus's llpp system(2) to find all the data dependences dynamically. We compare these to the results given by our affine memory disambiguation system.
This system is exact for all the cases we see in practice. We show that while the affine approximation is reasonable, memory
disambiguation is not a sufficient approximation for data dependence analysis. We propose extensions to improve the analysis.
This research was supported in part by a fellowship from AT & T Bell Laboratories and by DARPA contract N00014-87-K-0828. 相似文献
20.
《Journal of Parallel and Distributed Computing》1999,58(2):190-235
Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependencies, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general nonsingular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach. 相似文献