1.

An efficient code generation technique for tiled iteration spaces

Goumas G. Athanasaki M. Koziris N. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(10):1021-1034

This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.
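
The two-phase idea in this abstract (enumerate the tiles, then sweep the points inside each tile) can be sketched by brute force. This is only an illustration under stated assumptions, not the paper's code generation method, which derives loop bounds analytically. The tiling matrix `H`, the tile edge vectors, and the 4 x 4 iteration space are hypothetical examples.

```python
from collections import defaultdict
from fractions import Fraction
from math import floor

# Hypothetical tiling matrix H = P^-1, where the columns of P = [[2, 1], [0, 2]]
# are the tile edge vectors (an assumed example, not taken from the paper).
H = ((Fraction(1, 2), Fraction(-1, 4)),
     (Fraction(0), Fraction(1, 2)))


def tile_of(point, H):
    """Tile index of an iteration point: floor(H * point), componentwise."""
    return tuple(floor(sum(h * x for h, x in zip(row, point))) for row in H)


def tiled_order(points, H):
    """Phase 1: collect the non-empty tiles; phase 2: sweep each tile's points."""
    tiles = defaultdict(list)
    for p in points:
        tiles[tile_of(p, H)].append(p)
    # Tiles and the points inside them are visited in lexicographic order.
    return [(t, sorted(tiles[t])) for t in sorted(tiles)]


# A 4 x 4 rectangular iteration space, used only for illustration.
space = [(i, j) for i in range(4) for j in range(4)]
schedule = tiled_order(space, H)
```

With integer edge vectors, a full parallelepiped tile holds |det P| = 4 points here; tiles cut by the boundary of the iteration space hold fewer. Whether a given tile shape is legal for a particular loop depends on its dependences, which this sketch does not check.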

2.

On the parallel execution time of tiled loops

Hogstedt K. Carter L. Ferrante J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(3):307-321

Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: it handles only nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimize the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.
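
The "longest path of dependent tiles" can be illustrated on a small rectangular grid of tiles. The sketch below uses plain dynamic programming rather than the linear programming model or the rise-based formula of the paper, and it assumes that the tile-level dependence vectors are given and lexicographically positive. The function name and the example inputs are hypothetical.

```python
from itertools import product


def longest_tile_path(shape, deps):
    """Number of tiles on the longest dependence chain in a rectangular tile grid.

    shape: number of tiles along each dimension.
    deps:  tile-level dependence vectors, assumed lexicographically positive,
           so every predecessor is visited before its successor below.
    """
    length = {}
    # product() yields tile coordinates in lexicographic order.
    for t in product(*(range(n) for n in shape)):
        best = 0
        for d in deps:
            pred = tuple(a - b for a, b in zip(t, d))
            if pred in length:  # predecessors outside the grid are skipped
                best = max(best, length[pred])
        length[t] = best + 1
    return max(length.values())
```

For a 4 x 3 grid with dependences (1, 0) and (0, 1), the longest chain visits 4 + 3 - 1 = 6 tiles, which is the number of sequential steps needed even with unlimited processors.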

3.

A Loop Transformation Algorithm for Communication Overlapping

Kazuaki Ishizaki Hideaki Komatsu Toshio Nakatani 《International journal of parallel programming》2000,28(2):135-154

Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization.

4.

Reuse-Driven Tiling for Improving Data Locality

Jingling Xue Chua-Huang Huang 《International journal of parallel programming》1998,26(6):671-696

This paper applies unimodular transformations and tiling to improve data locality of a loop nest. Due to data dependences and reuse information, not all dimensions of the iteration space can or will be tiled. By using cones to represent data dependences and vector spaces to quantify data reuse in the program, a reuse-driven transformational approach is presented, which aims at maximizing the amount of data reuse carried in the tiled dimensions of the iteration space while keeping the number of tiled dimensions to a minimum (to reduce loop control overhead). In the special case of one single fully permutable loop nest, an algorithm is presented that tiles the program optimally so that all data reuse is carried in the tiled dimensions. In the general case of multiple fully permutable loop nests, data dependences can prevent all data reuse from being carried in the tiled dimensions. An algorithm is presented that aims at localizing data reuse in the tiled dimensions so that the reuse space localized has the largest dimensionality possible.

5.

Synthesizing Transformations for Locality Enhancement of Imperfectly-Nested Loop Nests (total citations: 2; self-citations: 0; citations by others: 2)

Nawaaz Ahmed Nikolay Mateev Keshav Pingali 《International journal of parallel programming》2001,29(5):493-544

Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques, however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.

6.

Affine-by-Statement Scheduling of Uniform and Affine Loop Nests over Parametric Domains

《Journal of Parallel and Distributed Computing》1995,29(1):43-59

This paper deals with parallel scheduling techniques for uniform and affine loop nests. We deal with affine-by-statement scheduling, a powerful extension of Lamport's hyperplane method where each statement within the loop nest is scheduled by a different timing function. We present a new, constructive and efficient method to determine the optimal (i.e., with smallest latency) affine-by-statement scheduling. We also consider parametric loop nests, where loop limits (in addition to being affine functions of outer loops) involve program variables whose values may not be known at compile-time (but are runtime constants). We then derive parameter-independent affine-by-statement schedules, and we show that these schedules are asymptotically as efficient as parameter-dependent solutions while much more regular. This theoretical result is of importance in practice, as regularity is a key factor for loop rewriting and code generation.

7.

Tiling Nested Loops into Maximal Rectangular Blocks

Yeong-Sheng Chen Sheng-De Wang Chien-Min Wang 《Journal of Parallel and Distributed Computing》1996,35(2):123

In this paper, an approach to tiling nested loops for maximizing parallelism is proposed. The proposed method aims at aggregating independent computations of a loop nest into rectangular blocks and maximizing the block sizes for maximizing parallelism. First, all the independent computations that can be executed in the first time unit are identified. These computations are called the initially independent computations. Then it is shown that all of them can be collected as a union of rectangular blocks. Based on these, the entire iteration space of the loops is partitioned into rectangular blocks for maximizing parallelism. The proposed method is formulated as systematic procedures which can easily be implemented in a parallelizing compiler. It is shown that when the wavefront transformation is combined with the proposed method, the loops can always be tiled so that the tile size is greater than one. In comparison with previous work on tiling, the proposed method is shown to have several advantages as summarized in the conclusions of this paper.

8.

A loop transformation theory and an algorithm to maximize parallelism

Wolf M.E. Lam M.S. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(4):452-471

An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. Efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.
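
The legality test stated in this abstract is easy to sketch for the simplest case. The code below assumes that every dependence is a constant distance vector and that the transformation is a linear map given as an integer matrix; dependence vectors with direction components, which the paper's theory also covers, are not handled. All names are hypothetical.

```python
def lex_positive(v):
    """True if the first non-zero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def apply(T, d):
    """Matrix-vector product T * d for an integer matrix T given as rows."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def is_legal(T, deps):
    """A transformation is legal if every transformed dependence stays lex-positive."""
    return all(lex_positive(apply(T, d)) for d in deps)


INTERCHANGE = ((0, 1), (1, 0))       # swap the two loops
SKEW_THEN_SWAP = ((1, 1), (1, 0))    # skew the inner loop by the outer, then swap
```

For a single dependence (1, -1), plain interchange yields (-1, 1) and is rejected, while the compound skew-then-interchange matrix yields (0, 1) and is accepted.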

9.

Compile time barrier synchronization minimization

O'Boyle M. Stohr E. 《Parallel and Distributed Systems, IEEE Transactions on》2002,13(6):529-543

This paper presents a new compiler approach to minimizing the number of barriers executed in parallelized programs. A simple procedure is developed to reduce the complexity of barrier placement by eliminating certain data dependences, without affecting optimality. An algorithm is presented which, provably, places the minimal number of barriers in perfect loop nests and in certain imperfect loop nest structures. This scheme is generalized to accept entire, well-structured control-flow programs containing arbitrary nesting of IF constructs, loops, and subroutines. It has been implemented in a prototype parallelizing compiler and applied to several well-known benchmarks where it has been shown to place significantly fewer synchronization points than existing techniques. Experiments indicate that on average the number of barriers executed is reduced by 70 percent and there is a threefold improvement in execution time when evaluated on a 32-processor SGI Origin 2000.

10.

Loop transformation using nonunimodular matrices

Fernandez A. Llaberia J.M. Valero-Garcia M. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(8):832-840

Linear transformations are widely used to vectorize and parallelize loops. A subset of these transformations are unimodular transformations. When a unimodular transformation is used, the exact bounds of the transformed loop nest are easily computed and the steps of the loops are equal to 1. Unimodular loop transformations have been widely used since they permit the implementation of many useful loop transformations. Recently, nonunimodular transformations have been proposed to reduce communication requirements or to use the memory hierarchy efficiently. The methods used for unimodular transformations do not work in the case of nonunimodular transformations, since they do not produce the exact bounds of the transformed loop nest. In this paper, we present a method for nested loop transformation which gives the exact bounds for both unimodular and nonunimodular transformations. The basic idea is to use the Hermite Normal Form (HNF) of the transformation matrix.
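
A small example shows why unit-step loops over the bounding box are wrong for a nonunimodular transformation. The matrix `T` below (determinant -2) and the lower-triangular matrix `H` are assumptions chosen for illustration; `H` was worked out by hand as a Hermite normal form of `T` under the column convention, and the sketch does not reproduce the paper's algorithm.

```python
def transform(T, p):
    """Image of the iteration point p under the integer matrix T (given as rows)."""
    return tuple(sum(t * x for t, x in zip(row, p)) for row in T)


def in_lattice(p, H):
    """True if p is an integer combination of the columns of a 2 x 2 lower-triangular H."""
    (h11, _), (h21, h22) = H
    u, v = p
    if u % h11:
        return False
    a = u // h11
    return (v - h21 * a) % h22 == 0


T = ((1, 1), (1, -1))   # (i, j) -> (i + j, i - j), determinant -2
H = ((1, 0), (1, 2))    # hand-computed HNF: columns (1, 1) and (0, 2)

source = [(i, j) for i in range(3) for j in range(3)]
image = {transform(T, p) for p in source}

# Bounding box of the image, scanned with unit steps.
us = [u for u, _ in image]
vs = [v for _, v in image]
box = [(u, v) for u in range(min(us), max(us) + 1)
              for v in range(min(vs), max(vs) + 1)]
holes = [q for q in box if q not in image]
```

The 9 source iterations map into a 5 x 5 bounding box, so a unit-step scan would visit 16 points that correspond to no iteration. The diagonal of `H` gives the loop steps (1 for the outer loop, 2 for the inner one), and the off-diagonal entry gives the offset of the inner loop as a function of the outer index.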

11.

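Global Automatic Data Partitioning for SIMD Machines

Lin Jin, Zhu Ningning, Zhang Zhaoqing, Qiao Ruliang 《计算机学报》 (Chinese Journal of Computers) 1999,22(6):596-602

This paper proposes a global automatic data partitioning algorithm for SIMD machines. The algorithm handles multiple imperfectly nested loop nests in which array subscripts are linear expressions of the loop variables. First, the communication patterns in the computation are abstracted through data and iteration mappings. Next, formal conditions for recognizing regular communication patterns are given. A data-iteration graph containing alignment information and the corresponding communication costs is then built, and on the basis of this graph a heuristic algorithm is proposed that computes good data and iteration distributions in order to reduce the communication overhead among processing elements. By analyzing the multiple array mappings involved in multiple loop nests ...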

12.

Compile Time Partitioning of Nested Loop Iteration Spaces with Non-Uniform Dependences

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141

In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike nested loops with uniform dependences, these have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques.

13.

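A New Loop Tiling Algorithm

Shu Hui, Kang Fei 《计算机研究与发展》 (Journal of Computer Research and Development) 2002,39(10):1303-1306

Loop tiling is a loop transformation technique that improves the cache hit rate of loops, and the tile size is the key factor that determines how effective tiling is. Cache miss equations (CME) are a mathematical model for precisely analyzing the cache hit rate of loops in a program. Starting from the CME model, a loop tiling algorithm is derived by comparing the CMEs before and after tiling and combining the result with padding. Experiments show that, compared with the classical LRW loop tiling algorithm, the tile sizes computed by this algorithm are larger while cache self-interference among the array references in the loop is still completely eliminated, which improves the efficiency of loop tiling.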

14.

On estimating the useful work distribution of parallel programs under P3T: a static performance estimator

Thomas Fahringer 《Concurrency and Computation》1996,8(4):261-282

In order to improve a parallel program's performance it is critical to evaluate how evenly the work contained in a program is distributed over all processors dedicated to the computation. Traditional work distribution analysis is commonly performed at the machine level. The disadvantage of this method is that it cannot identify whether the processors are performing useful or redundant (replicated) work. The paper describes a novel method of statically estimating the useful work distribution of distributed-memory parallel programs at the program level, which carefully distinguishes between useful and redundant work. The amount of work contained in a parallel program, which correlates with the number of loop iterations to be executed by each processor, is estimated by accurately modeling loop iteration spaces, array access patterns and data distributions. A cost function defines the useful work distribution of loops, procedures and the entire program. Lower and upper bounds of the described parameter are presented. The computational complexity of the cost function is independent of the program's problem size, statement execution and loop iteration counts. As a consequence, estimating the work distribution based on the described method is considerably faster than simulating or actually compiling and executing the program. Automatically estimating the useful work distribution is fully implemented as part of P³T, which is a static parameter based performance prediction tool under the Vienna Fortran Compilation System (VFCS). The Lawrence Livermore Loops are used as a test case to verify the approach.

15.

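Parallel Optimization of Nested Loops Based on a Multi-Core Array Architecture

Yang Ziyu, Yan Ming, Zhao Peng 《计算机工程与科学》 (Computer Engineering & Science) 2009,31(Z1)

Multi-core processors are widely used in high-performance computing, yet effectively converting traditional sequential programs into parallel code and reducing the time spent in nested loops remains a challenging problem in the field. This paper first uses the polyhedral model to analyze the dependence characteristics of nested loops and to tile them, and from this it automatically generates coarse-grained parallel code. Targeting the structural characteristics of multi-core array processors, a genetic algorithm is used to generate a communication-optimized sequence of tile tasks, on which an effective task scheduling model is built. Finally, the method is applied to LU decomposition. The results show that, compared with traditional scheduling algorithms, it achieves better data locality and load balance.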

16.

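Redundant Computation Partitioning for Communication Optimization on Distributed-Memory Systems (total citations: 1; self-citations: 0; citations by others: 1)

Chen Li, Zhang Zhaoqing, Feng Xiaobing 《计算机学报》 (Chinese Journal of Computers) 2003,26(2):180-187

For sequences of parallel loop nests, this paper proposes a communication optimization method based on redundant computation partitioning. Using data-flow analysis, it gives a general method for determining the amount of redundant computation in each loop nest, and on that basis presents the implementation of redundant computation partitioning and the test for its applicability. For programs with regular dependences, it also proposes an efficient implementation of redundant computation partitioning. The technique has been implemented in a parallelizing compiler, and experimental results show that it clearly outperforms traditional communication optimization techniques.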

17.

Constructive methods for scheduling uniform loop nests (total citations: 1; self-citations: 0; citations by others: 1)

Darte A. Robert Y. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):814-822

This paper surveys scheduling techniques for loop nests with uniform dependences. First, we introduce the hyperplane method and related variants. Then we extend it by using a different affine scheduling for each statement within the nest. In both cases, we present a new, constructive, and efficient method to determine optimal solutions, i.e., schedules whose total execution time is minimum.
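
The hyperplane method mentioned here assigns iteration x the time step pi . x, where the vector pi must satisfy pi . d >= 1 for every uniform dependence d. The sketch below finds a minimum-latency pi by exhaustive search over small integer coefficients on a rectangular domain. It is a brute-force illustration, not the constructive method of the paper, and the search bound, function names, and domain shape are assumptions.

```python
from itertools import product


def latency(pi, extents):
    """Number of time steps of schedule t(x) = pi . x on the box 0 <= x_k < extents[k]."""
    hi = sum(c * (n - 1) for c, n in zip(pi, extents) if c > 0)
    lo = sum(c * (n - 1) for c, n in zip(pi, extents) if c < 0)
    return hi - lo + 1


def best_hyperplane(deps, extents, bound=3):
    """Search coefficients in [-bound, bound] for a valid pi with the smallest latency.

    Returns (latency, pi), or None if no valid pi exists within the search box.
    """
    best = None
    for pi in product(range(-bound, bound + 1), repeat=len(extents)):
        # Validity: every dependence must advance time by at least one step.
        if all(sum(c * x for c, x in zip(pi, d)) >= 1 for d in deps):
            cand = (latency(pi, extents), pi)
            if best is None or cand < best:
                best = cand
    return best
```

For dependences (1, 0) and (0, 1) on a 4 x 5 domain the search returns pi = (1, 1) with 8 time steps, the familiar diagonal wavefront. The result is optimal only among linear schedules inside the searched coefficient range.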

18.

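Array Last-Write Relation Analysis in Automatic Program Parallelization (total citations: 1; self-citations: 0; citations by others: 1)

Luo Yong, Zhang Ping, Gong Xuerong 《计算机工程》 (Computer Engineering) 2008,34(16):95-97

In automatic program parallelization, the data collection phase may generate redundant communication. This paper uses array last-write relation analysis to eliminate such redundant communication: it quickly computes the last-write relations for array data in nested loops and passes the results to the compiler back end, which generates exact data collection code. The paper describes the goals and scope of last-write relation analysis, classifies the nested loops it handles according to their structural features, and presents the steps of the algorithm. Test results confirm the correctness and efficiency of the algorithm. The exact data collection code it produces effectively eliminates part of the redundant communication and thereby improves the performance of the parallelized programs.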

19.

Effectiveness of data dependence analysis

Dror E. Maydan John L. Hennessy Monica S. Lam 《International journal of parallel programming》1995,23(1):63-81

Data dependence testing is the basic step in detecting loop level parallelism in numerical programs. The problem is undecidable in the general case. Therefore, work has been concentrated on a simplified problem, affine memory disambiguation. In this simpler domain, array references and loop bounds are assumed to be linear integer functions of loop variables. Dataflow information is ignored. For this domain, we have shown that in practice the problem can be solved accurately and efficiently.⁽¹⁾ This paper studies empirically the effectiveness of this domain restriction, that is, how many real references are affine and flow insensitive. We use Larus's llpp system⁽²⁾ to find all the data dependences dynamically. We compare these to the results given by our affine memory disambiguation system. This system is exact for all the cases we see in practice. We show that while the affine approximation is reasonable, memory disambiguation is not a sufficient approximation for data dependence analysis. We propose extensions to improve the analysis. This research was supported in part by a fellowship from AT&T Bell Laboratories and by DARPA contract N00014-87-K-0828.
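
As a concrete instance of affine memory disambiguation, the classical GCD test decides whether two one-dimensional affine references can ever touch the same array element. It is shown here as a well-known textbook technique, not as the system evaluated in the paper. The test ignores loop bounds, so a "may depend" answer is conservative. The function name and parameter layout are hypothetical.

```python
from math import gcd


def gcd_test_may_depend(c1, k1, c2, k2):
    """GCD test for references a[c1*i + k1] and a[c2*j + k2].

    The equation c1*i + k1 == c2*j + k2 has an integer solution (i, j)
    exactly when gcd(c1, c2) divides k2 - k1. Loop bounds are ignored,
    so True means "may depend" and False means "provably independent".
    """
    g = gcd(c1, c2)
    if g == 0:                 # both subscripts are constants
        return k1 == k2
    return (k2 - k1) % g == 0
```

For example, `a[2*i]` and `a[2*j + 1]` are reported independent because 2 does not divide 1, while `a[2*i]` and `a[4*j + 2]` may overlap.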

20.

A Matrix-Based Approach to Global Locality Optimization

《Journal of Parallel and Distributed Computing》1999,58(2):190-235

Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependencies, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general nonsingular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach.