首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
循环倾斜是程序优化中一种循环变换的手段,它改变空间迭代形式,将循环存在的跨迭代的并行用传统的并行标识出来,使得循环可以并行执行。但是循环倾斜后,并行执行的数据在内存中是离散的,而且每次迭代执行的次数是不一致的。为了更有效地利用SIMD,本文提出一种基于全局数据重组的循环倾斜优化方法。首先分析循环倾斜优化,针对数据离散的问题实现全局数据重组,改善数据局部性,循环易于向量化操作;针对迭代执行次数不一致问题,实现非满载向量操作,使尾循环得以向量执行。最后选择wavefront程序进行测试,优化后,程序计算可以获得平均10.73倍的加速效果。  相似文献   

2.
Loop unwinding is a known technique for reducing loop overhead, exposing parallelism, and increasing the efficiency of pipelining. Traditional loop unwinding is limited to the innermost loop in a group of nested loops and the amount of unwinding either is fixed or must be specified by the user, on a case by case basis. In this paper we present a general technique for automatically unwinding multiply nested loops, explain its advantages over other transformation techniques, and illustrate its practical effectiveness. Loop Quantization could be beneficial by itself or coupled with other loop transformations (e.g., Do-across).  相似文献   

3.
Loop tiling is an efficient loop transformation, mainly applied to detect coarse-grained parallelism in loops. It is a difficult task to apply n-dimensional non-rectangular tiles to generate parallel loops. This paper offers an efficient scheme to apply non-rectangular n-dimensional tiles in non-rectangular iteration spaces, to generate parallel loops. In order to exploit wavefront parallelism efficiently, all the tiles with equal sum of coordinates are assumed to reside on the same wavefront. Also, in order to assign parallelepiped tiles on each wavefront to different processors, an improved block scheduling strategy is offered in this paper.  相似文献   

4.
In this paper, an approach to tiling nested loops for maximizing parallelism is proposed. The proposed method aims at aggregating independent computations of a loop nest into rectangular blocks and maximizing the block sizes for maximizing parallelism. At first, all the independent computations that can be executed in the first time unit are identified. These computations are called the initially independent computations. Then it is shown that all of them can be collected as a union of rectangular blocks. So, based on these, the entire iteration space of the loops is partitioned into rectangular blocks for maximizing parallelism. The proposed method is formulated as systematic procedures which can easily be implemented in a parallelizing compiler. It is shown that when the wavefront transformation is combined with the proposed method, the loops can always be tiled so that the tile size is greater than one. In comparison with previous work on tiling, the proposed method is shown to have several advantages as summarized in the conclusions of this paper.  相似文献   

5.
Loop transformations,such as loop interchange,reversal and skewing,have been unified under linear matrix transformations.A legal transformation matrix is usually generated based upon distance vectors or direction vectors.Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors, Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors,on the other hand,may not contain useful information.We propose the use of linear equations or inequalities of distance vectors to approximate data dependence.This approach is advantageous since(1) many loops having no constant distance vectors have very simple equations of distance vectors;(2) these equations contain more information than direction vectors do,thus the chance of exploiting potential parallelism is improved.In general,the equations or inequalities that approximate the data dependence of a given nested loop is not unique,hence classification is discussed for the purpose of loop transformationEfficient algorithms are developed to generate all kinds of linear equations of distance vectors for a given nested loop.The issue of how to obtain a desired transformation matrix from those equations is also addressed.  相似文献   

6.
Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel wit low run-time overhead is thus very important for achieving high performance in paralleo processing systems.However,in parallel processing systems with caches of local memories in memory hierarchies,“thrashing problemmay” may arise when data move back and forth frequently between the caches or local memories in different processors.The techniques associated with parallel compiler to solve the problem are not completely developed.In this paper,we present two restructuring techniques called loopg staggering,loop staggering and compacting,with which we can not only eliminate the cache or local memory thrashing phemomena significantly,but also exploit the potential parallelism existing in outer serial loop.Loop staggering benefits the dynamic loop scheduling strategies,whereas loop staggering and compacting is good for static loop scheduling strategies,Our method especially benefits parallel programs,in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop.  相似文献   

7.
在并行编译中,循环变换是开发程序并行度的主要方法,但存在复杂控制流的非紧密嵌套循环往往无法得到有效的并行化。文章结合分析Benchmark和实现自动并行化系统AFT中复杂非紧密嵌套循环变换的经验,给出复杂非紧密嵌套循环变换的特点及其在并行编译中的应用。  相似文献   

8.
循环是程序中蕴含并行性最为丰富的一种结构,因此成为并行化编译最主要的对象.但循环内的过程调用严重妨碍了循环的数据相关性分析,使得循环语句潜在的大量并行性得不到开发.本文提出的循环嵌入方法使部分含过程调用循环语句的并行化成为可能,对部分用其它过程间分析技术也能开发其并行性的这一类循环语句采用循环嵌入方法,并行化开销低,并且分析更精确.采用循环嵌入方法还可降低程序由于多次过程调用带来的调度开销.这一方法在作者开发的自动并行化编译系统AFT(automaticPortrantransformer)中得到了实现,对Spec92测试程序包的试验结果表明了本文提出的方法是行之有效的.  相似文献   

9.
Loop unrolling is a well known loop transformation that has been used in optimizing compilers for over three decades. In this paper, we address the problems of automatically selecting unroll factors for perfectly nested loops, and generating compact code for the selected unroll factors. Compared to past work, the contributions of our work include (i) a more detailed cost model that includes register locality, instruction-level parallelism and instruction-cache considerations; (ii) a new code generation algorithm that generates more compact code than the unroll-and-jam transformation; and (iii) a new algorithm for efficiently enumerating feasible unroll vectors. Our experimental results confirm the wide applicability of our approach by showing a 2.2× speedup on matrix multiply, and an average 1.08× speedup on seven of the SPEC95fp benchmarks (with a 1.2× speedup for two benchmarks). Larger performance improvements can be expected on processors that have larger numbers of registers and larger degrees of instruction-level parallelism than the processor used for our measurements (PowerPC 604).  相似文献   

10.
Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms  相似文献   

11.
《Parallel Computing》1997,22(12):1621-1645
A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces the restructured code that is free of if statements by construction.  相似文献   

12.
A 3D iteration space visualizer (ISV) is presented to analyze the parallelism in loops and to find loop transformations which enhance the parallelism. Using automatic program instrumentation, the iteration space dependency graph (ISDG) is constructed, which shows the exact data dependencies of arbitrarily nested loops. Various graphical operations such as rotation, zooming, clipping, coloring and filtering, permit a detailed examination of the dependence relations. Furthermore, an animated dataflow execution shows the maximal parallelism and the parallel loops are indicated automatically by an embedded data dependence analysis. In addition, the user may discover and indicate additional parallelism for which a suitable unimodular loop transformation is calculated and verified. The ISV has been applied to parallelize algorithmic kernel programs, a computational fluid dynamics (CFD) simulation code, the detection of statement-level parallelism and loop variable privatization. The applications show that the visualizer is a versatile and easy to use tool for the high-performance application programmer.  相似文献   

13.
赵捷  赵荣彩  丁锐  黄品丰 《软件学报》2012,23(10):2695-2704
传统的分布存储并行编译系统大多是在共享存储并行编译系统的基础上开发的.共享存储并行编译系统的并行识别技术适合OpenMP代码生成,实现方式是将所有嵌套循环都按照相同的识别方法进行处理,用于分布存储并行编译系统必然会导致无法高效发掘程序的并行性.分布存储并行编译系统应根据嵌套循环结构的特点进行分类处理,提出适合MPI代码生成的并行识别技术.为解决上述问题,根据嵌套循环的结构和MPI并行程序的特点,提出了一种新的嵌套循环分类方法,并针对不同的嵌套循环分别提出了相应的并行识别技术.实验结果表明,与采用传统并行识别技术的分布存储并行编译系统相比,按照所提方法对嵌套循环进行分类,采用相应并行识别技术的编译系统能够更高效地识别基准程序中的并行循环,自动生成的MPI并行代码其性能加速比提高了20%以上.  相似文献   

14.
Loop parallelization is an important issue in the acceleration of the execution of scientific programs. To exploit parallelism in loops a system of equations representing the dependencies between the loop iterations and a system of non-equations indicating the loop boundary conditions has to be solved. This is a NP-Complete problem. Our major contribution in this paper has been to apply genetic algorithm to solve system of equation and non-equation resulted from loop dependency analysis techniques to find two dependent loop iterations. We use distance vector to find the rest of dependencies.  相似文献   

15.
This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. One technique is called selective cycle shrinking, and the other is called true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops and how true dependence shrinking is related to conflict-free mappings of higher dimensional algorithms into lower dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes  相似文献   

16.
SIMD扩展部件是集成到通用处理器中的加速部件,旨在发掘多媒体和科学计算等领域程序的数据级并行.当前两种基本的向量发掘方法分别是发掘迭代间并行的Loop-based方法和发掘迭代内并行的SLP方法.Loop-aware方法是对SLP方法的改进,其思想是首先通过循环展开将迭代间并行转换为迭代内并行,使循环体内的同构语句条数足够多,再利用SLP方法进行向量发掘.但当循环展开不合法或者并行度低于向量化因子时,Loop-aware方法无法实现程序向量并行性的发掘.因此提出了向量并行度指导的循环向量化方法,依据迭代间并行度、迭代内并行度和向量化因子,构建循环向量化方法选择方案,同时提出不充分向量化方法发掘并行度低于向量化因子的循环向量并行性,最后依据向量并行度对生成的向量循环进行展开.经过标准测试集测试,向量并行度指导的循环SIMD向量化方法比Loop-aware方法识别率提升107.5%,性能提升12.1%.  相似文献   

17.
Abstract

This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.  相似文献   

18.
GCC编译器是一种受广大研究者青睐的开源优化编译器,但它仅仅能够对完美嵌套循环进行依赖分析。为了更好地挖掘嵌套循环粗粒度的并行,深入研究了GCC5.1数据依赖分析过程,提出了一种能够处理分支嵌套循环的依赖测试方法。首先识别出分支嵌套循环,然后分析数组下标与分支嵌套循环外层索引变量的关系,最后计算出外层循环索引变量的距离向量,并通过检测距离向量判断循环是否存在依赖。实验结果表明,该方法能够正确、有效地分析出分支嵌套循环的依赖关系。  相似文献   

19.
循环是程序中的热代码,对循环进行有效的优化可以显著缩短程序的执行时间。软件流水是一种开发循环体指令级并行的细粒度循环优化技术,它通过调度循环中连续迭代之间的指令使其并行执行,从而提高了循环的执行效率。实验数据表明,用Cerngoop程序包进行测试,循环优化效果明显。  相似文献   

20.
The parallelism of loop nests with non-uniform dependences is difficult to extract and ineffectively explored by the existing parallelization schemes. In this paper, we propose new efficient techniques in extracting parallelism of loop nests with non-uniform dependences using their irregularity. By this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. These four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore parallelisms of special parallel patterns for nested loops with non-uniform dependences. The loop transformations used in uniform loops are also applied in non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also proposed an algorithm that combines above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号