Similar documents
Found 20 similar documents; search took 78 ms
1.
This paper presents an approach to modeling loop transformations using linear algebra. Compound transformations are modeled as integer matrices. The nonsingular linear transformations presented here subsume the class of unimodular transformations. The loop transformations included are the unimodular transformations (reversal, skewing, and permutation) and a new transformation, namely stretching. Nonunimodular transformations (whose determinant has absolute value greater than 1) create holes in the transformed iteration space, rendering code generation difficult. We solve this problem by suitably changing the step size of loops in order to skip these holes when traversing the transformed iteration space. For the class of nonunimodular loop transformations, we present algorithms for deriving the loop bounds, the array access expressions, and the step sizes of loops in the nest. To derive the step sizes, we compute the Hermite normal form of the transformation matrix; the step sizes are the entries on the diagonal of this matrix. We then use the theory of Hessenberg matrices in the derivation of exact loop bounds for nonunimodular transformations. We illustrate the use of this approach in several problems such as the generation of tile sets and distributed-memory code generation. This approach provides a framework for optimizing programs for a variety of architectures. Supported in part by an NSF Young Investigator Award CCR-9457768, an NSF grant CCR-9210422, and by the Louisiana Board of Regents through contract LEQSF (1991–94)-RD-A-09.
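
The step-size idea in entry 1 can be illustrated with a small sketch. It rests on assumptions the abstract does not state: the transformation acts on column index vectors, the Hermite normal form is taken in its lower-triangular, column-style form, and the name `hermite_diagonal` and the example matrix `T` are hypothetical. It is not the authors' algorithm, only one way to obtain the diagonal.

```python
def hermite_diagonal(matrix):
    """Diagonal of a lower-triangular, column-style Hermite normal form.

    Only unimodular column operations are used (adding an integer multiple
    of one column to another, swapping columns, negating a column), so the
    lattice spanned by the columns is unchanged.
    """
    h = [list(row) for row in matrix]
    n = len(h)
    for i in range(n):
        while True:
            nonzero = [j for j in range(i, n) if h[i][j] != 0]
            if len(nonzero) <= 1:
                break
            # Euclid's algorithm on row i: reduce every entry modulo the
            # entry of smallest magnitude.
            pivot = min(nonzero, key=lambda j: abs(h[i][j]))
            for j in nonzero:
                if j != pivot:
                    q = h[i][j] // h[i][pivot]
                    for r in range(n):
                        h[r][j] -= q * h[r][pivot]
        if not nonzero:
            raise ValueError("matrix is singular")
        p = nonzero[0]
        if p != i:  # move the surviving entry onto the diagonal
            for r in range(n):
                h[r][i], h[r][p] = h[r][p], h[r][i]
        if h[i][i] < 0:  # make the diagonal entry positive
            for r in range(n):
                h[r][i] = -h[r][i]
    return [h[i][i] for i in range(n)]


# Hypothetical nonunimodular transformation with determinant -2:
# (i, j) -> (u, v) = (i + j, i - j)
T = [[1, 1], [1, -1]]
steps = hermite_diagonal(T)

# Image of a small 3 x 3 iteration space. u and v always share parity,
# so for a fixed u every second value of v is a hole.
image = {(i + j, i - j) for i in range(3) for j in range(3)}
```

For this `T` the diagonal is `[1, 2]`: under the stated assumptions the outer transformed loop keeps step 1, while the inner loop can advance by 2 and so skip the holes.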

2.
This paper presents a novel approach to the problem of generating tiled code for nested for-loops transformed by a tiling transformation. Tiling, or supernode transformation, has been widely used to improve locality in multilevel memory hierarchies, as well as to execute loops efficiently on parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler task, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.

3.
Parallel Computing, 1997, 22(12): 1621-1645
A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces restructured code that is free of if statements by construction.

4.
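Applying automatic unimodular transformations faces two major difficulties: how to automatically find a suitable unimodular transformation matrix that parallelizes a loop nest, and how to handle the non-constant reduction dependence distances that hinder the computation of unimodular matrices. To address them, this paper first proposes a technique that, given a constant distance matrix, automatically finds a suitable unimodular transformation matrix that parallelizes the loops. It then proposes representing array reduction dependences as minimal constant distance vectors, so that loop nests with reduction dependences can also use automatic unimodular transformation. This provides a theoretical basis for making automatic unimodular transformation practical.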

5.
An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. Efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five.
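
The legality test described in entry 5 can be sketched in a few lines. The sketch assumes that every dependence is a constant distance vector and that the compound transformation is an integer matrix applied to it. The helper names and the example dependences are hypothetical and are not taken from the paper.

```python
def lex_positive(vector):
    """True if the first nonzero component of the vector is positive."""
    for component in vector:
        if component > 0:
            return True
        if component < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def transform(matrix, vector):
    """Matrix-vector product over the integers."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def is_legal(matrix, dependences):
    """Legal if every transformed dependence stays lexicographically positive."""
    return all(lex_positive(transform(matrix, d)) for d in dependences)


# Hypothetical distance vectors of a doubly nested loop.
dependences = [(1, -1), (0, 1)]

interchange = [[0, 1], [1, 0]]            # swap the two loops
skew = [[1, 0], [1, 1]]                   # skew the inner loop by the outer
skew_then_interchange = [[1, 1], [1, 0]]  # interchange applied after skew
```

With these dependences, plain interchange maps (1, -1) to (-1, 1) and is rejected, whereas skewing first makes the interchange acceptable. This is the kind of compound reasoning the matrix formulation makes mechanical.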

6.
A general method for the identification of the independent subsets in loops with constant dependence vectors is presented. It is shown that the dependence relation remains invariant under a unimodular transformation. Then a unimodular transformation is used to bring the dependence matrix into a form where the independent subsets are obtained by a direct and inexpensive partitioning algorithm. This leads to a procedure for the automatic conversion of a serial loop into a nest of parallel DO-ALL loops. Another unimodular transformation results in an algorithm to label the dependent iterations of an n-fold nested loop in O(n^2) time. This provides a multithreaded dynamic scheduling scheme requiring only one fork and one join primitive.

7.
A 3D iteration space visualizer (ISV) is presented to analyze the parallelism in loops and to find loop transformations which enhance the parallelism. Using automatic program instrumentation, the iteration space dependency graph (ISDG) is constructed, which shows the exact data dependencies of arbitrarily nested loops. Various graphical operations, such as rotation, zooming, clipping, coloring, and filtering, permit a detailed examination of the dependence relations. Furthermore, an animated dataflow execution shows the maximal parallelism, and the parallel loops are indicated automatically by an embedded data dependence analysis. In addition, the user may discover and indicate additional parallelism for which a suitable unimodular loop transformation is calculated and verified. The ISV has been applied to parallelize algorithmic kernel programs, a computational fluid dynamics (CFD) simulation code, the detection of statement-level parallelism and loop variable privatization. The applications show that the visualizer is a versatile and easy-to-use tool for the high-performance application programmer.

8.
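Automatically finding unimodular transformations that parallelize serial loop nests   Total citations: 2, self-citations: 0, citations by others: 2
For serial loop nests with a known n-dimensional distance vector matrix, earlier research on parallelizing compilers lacked a feasible algorithm for finding a unimodular matrix that parallelizes the outer loops. This paper describes a unimodular transformation method for parallelizing serial loop nests. It proves that a legal unimodular matrix satisfying the outer-loop parallelization requirement exists and, through a constructive proof, gives a feasible algorithm for computing that matrix. It also explores effective ways to extend the method to imperfectly nested loops and to loops with non-constant dependence distances.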

9.
This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest having affine functions as bounds (nonrectangular loop nest). Traditionally, exact loop bounds computation has not been performed because its complexity is doubly exponential in the number of loops in the multilevel tiled code and, therefore, for certain classes of loops (i.e., nonrectangular loop nests), can be extremely time consuming. Although computation of exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals simultaneously with all levels to be tiled, rather than applying tiling level by level as is usually done. For loop nests having very simple affine functions as bounds, results show that our method is between 15 and 28 times faster than conventional techniques. For loop nests having less simple bounds, we have measured speedups as high as 2,300. Additionally, our technique allows eliminating redundant bounds efficiently. Results show that eliminating redundant bounds in our method is between 11 and 22 times faster than in conventional techniques for typical linear algebra programs.
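
To make "exact loop bounds" in entry 9 concrete, here is a small sketch of single-level tiling for a triangular (nonrectangular) nest. It is only an illustration under simplifying assumptions: one tiling level, square tiles of size `b`, and bounds derived by hand. It does not reproduce the paper's multilevel algorithm, and the function names are hypothetical.

```python
def triangular_points(n):
    """Original nest: for i in [0, n), for j in [0, i]."""
    return [(i, j) for i in range(n) for j in range(i + 1)]


def tiled_triangular_points(n, b):
    """Same nest tiled with b x b tiles, using exact bounds at every level."""
    points = []
    for ii in range(0, n, b):            # tile origin along i
        for jj in range(0, ii + 1, b):   # only tiles that intersect j <= i
            for i in range(ii, min(ii + b, n)):
                for j in range(jj, min(jj + b, i + 1)):
                    points.append((i, j))
    return points
```

Because the tile loops never enumerate a tile lying entirely above the diagonal, and the inner bounds clip against both the tile edge and `j <= i`, no guard is needed inside the loop body. Both functions visit the same set of points, each exactly once, only in a different order.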

10.
Loop transformations, such as loop interchange, reversal and skewing, have been unified under linear matrix transformations. A legal transformation matrix is usually generated based upon distance vectors or direction vectors. Unfortunately, for some nested loops, distance vectors may not be computable and direction vectors, on the other hand, may not contain useful information. We propose the use of linear equations or inequalities of distance vectors to approximate data dependence. This approach is advantageous since (1) many loops having no constant distance vectors have very simple equations of distance vectors; (2) these equations contain more information than direction vectors do, thus the chance of exploiting potential parallelism is improved. In general, the equations or inequalities that approximate the data dependence of a given nested loop are not unique, hence classification is discussed for the purpose of loop transformation. Efficient algorithms are developed to generate all kinds of linear equations of distance vectors for a given nested loop. The issue of how to obtain a desired transformation matrix from those equations is also addressed.

11.
Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques, however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.

12.
Lamport's parallelization algorithm (cf. [7]) is generalized to a broader class of loops, and the complexity of the transformation process has been estimated. It is shown that every loop can be parallelized using methods similar to those in [7]; moreover, they also have the property that all their inner loops are devoid of data dependencies, and so are fully parallelizable. Unfortunately, without restricting the nature of the loop to be parallelized, the negative solution to Hilbert's tenth problem (cf. [3]) can be applied to show that the parallelizing transformations are not computable. The class of affine loops was therefore introduced. This class is more general than that considered by Lamport, and it is shown that parallelizing transformations for affine loops are computable. In general, however, the complexity estimates for finding such loops suggest that the parallelization procedure will take longer than executing the original loop sequentially. It is further shown that, if the loop satisfies an additional, nondegeneracy condition, then the loop can be efficiently transformed.

Finally, although more generally applicable, these methods are best applied to vectorization problems.


13.
Many abstractions of program dependences have already been proposed, such as the Dependence Distance, the Dependence Direction Vector, the Dependence Level or the Dependence Cone. These different abstractions have different precisions. The minimal abstraction associated with a transformation is the abstraction that contains the minimal amount of information necessary to decide when such a transformation is legal. Minimal abstractions for loop reordering and unimodular transformations are presented. As an example, the dependence cone, which approximates dependences by a convex cone of the dependence distance vectors, is the minimal abstraction for unimodular transformations. It also contains enough information for legally applying all loop reordering transformations and finding the same set of valid mono- and multi-dimensional linear schedules as the dependence distance set.

14.
Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependencies, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general nonsingular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach.

15.
Barrier MIMDs are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem.
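
For readers unfamiliar with loop coalescing as used in entry 15, the sketch below shows the textbook way to recover the original indices from a coalesced index when all loop bounds are constant. It uses plain division and remainder. It is not the more efficient scheme the abstract mentions, whose details are not given here, and the function name `coalesced_to_indices` is hypothetical.

```python
def coalesced_to_indices(k, extents):
    """Map a coalesced index k to the indices of the original nest.

    `extents` lists the constant trip counts from outermost to innermost
    loop; every original index is assumed to start at 0.
    """
    indices = []
    for extent in reversed(extents):  # peel off the innermost index first
        k, remainder = divmod(k, extent)
        indices.append(remainder)
    return tuple(reversed(indices))


# A 2 x 3 nest coalesced into a single loop of 6 iterations.
extents = (2, 3)
visited = [coalesced_to_indices(k, extents) for k in range(2 * 3)]
```

The single loop over `k` visits the iterations in the same order as the original nest. Each iteration can therefore be handed to a processor knowing only its value of `k`.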

16.
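SLP-oriented vectorization of loop nests   Total citations: 1, self-citations: 0, citations by others: 1
Wei Shuai, Zhao Rongcai, Yao Yuan. Journal of Software (软件学报), 2012, 23(7): 1717-1728
More and more processors now integrate SIMD (single instruction multiple data) extensions, and most existing compilers implement automatic vectorization, but they generally vectorize only the innermost loop and lack a general, easily applied vectorization method for loop nests. This paper therefore proposes an SLP (superword level parallelism)-oriented vectorization method for loop nests. It analyzes each loop level in turn, from the outermost inward, and collects for each level the attribute values that affect how well vectorization works: chiefly whether the loop can be directly unrolled and jammed, how many array references are contiguous with respect to the loop index, and the region the loop covers. Based on these attributes it decides at which loop levels to apply direct unroll-and-jam, and finally vectorizes the statements in the loop through SLP. Experimental results show that, relative to inner-loop vectorization and simple outer-loop vectorization, the algorithm improves the average speedup by 2.13 and 1.41 respectively, and achieves speedups of up to 5.3 on some commonly used kernel loops.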

17.
Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as are available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators.
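
The effect of skewing followed by interchange in entry 17 can be seen on a small example. The recurrence below is a hypothetical one chosen for illustration, not taken from the paper. Each element depends on its upper and left neighbours, so neither original loop is parallel. After the change of variable t = i + j, all iterations that share a value of t are independent.

```python
def sequential(n, m):
    """Reference order: row by row, left to right."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def wavefront(n, m):
    """Skewed and interchanged order: the outer loop walks anti-diagonals."""
    a = [[1] * m for _ in range(n)]
    for t in range(2, n + m - 1):  # t = i + j
        # Bounds of i follow from 1 <= i <= n - 1 and 1 <= t - i <= m - 1.
        for i in range(max(1, t - (m - 1)), min(n - 1, t - 1) + 1):
            j = t - i
            # Reads only values from wavefront t - 1, so the iterations of
            # this inner loop could run in parallel or in vector mode.
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

The inner loop here still runs sequentially in Python. The point is only that its iterations carry no dependence on one another, which is what makes the wavefront form attractive on parallel and vector hardware.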

18.
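This paper describes a transformation technique for out-of-core programs. It combines two compiler optimizations, loop transformation and data transformation, to enhance program locality and improve data access efficiency. The method can optimize a single loop nest and can also handle multiple loop nests together. Experimental results show that it significantly improves the performance of out-of-core computation.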

19.
20.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler, which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
