Similar Articles
20 similar articles found.
1.
Cross-iteration data dependences in DOACROSS loops must be enforced by explicit data synchronization. However, the composite effect of some data synchronizations may cover other dependences and make the enforcement of those covered dependences redundant. In this paper, we propose an efficient and general algorithm to identify redundant synchronizations in multiply nested DOACROSS loops, which may have multiple statements and loop-exit control branches. Eliminating redundant synchronizations allows more efficient execution of such loops. We also address the issue of enforcing data synchronizations in iterations near the boundary of the iteration space. Because some dependences may not exist in those boundary iterations, determining the redundant synchronizations for those iterations is more complex. The necessary and sufficient condition under which a synchronization is uniformly redundant is also studied. These results allow a parallelizing compiler to generate efficient data synchronization instructions for DOACROSS loops.
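
As a minimal sketch of the covering idea (not the paper's general algorithm, which handles multiple nesting and loop-exit branches), the following checks, for a singly nested DOACROSS loop, whether the synchronization for one dependence distance is made redundant by the composite effect of the others; the distances are hypothetical.

    # A synchronization for distance d is redundant if d can be composed as
    # a sum of other enforced distances. Note: removing several "redundant"
    # distances at once requires care, since they may cover each other.

    def covered(d, enforced):
        """True if d is reachable as a sum of distances from `enforced`."""
        reachable = {0}
        for _ in range(d):   # all distances are >= 1, so d steps suffice
            reachable |= {r + e for r in reachable for e in enforced if r + e <= d}
        return d in reachable

    # Dependences at distances 1, 2, 3: enforcing distance 1 alone covers
    # 2 (= 1 + 1) and 3 (= 1 + 1 + 1), so those synchronizations are redundant.
    distances = [1, 2, 3]
    for d in distances:
        others = [e for e in distances if e != d]
        status = "redundant" if covered(d, others) else "must be enforced"
        print(f"distance {d}: {status}")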

2.
The author presents strategies for static loop decomposition and scheduling, as well as computer-assisted run-time scheduling, that take into account not only the cost of performing operations but also the overhead costs associated with a decomposition and schedule. An algorithm for static decomposition of multidimensional loops based on operation execution costs, communication costs, and synchronization costs is discussed. Synchronization instructions are introduced to ensure correct program execution following decomposition. An algorithm for determining the explicit synchronization instructions that should be introduced to ensure correct execution of a program with arbitrarily nested loops is presented. Techniques for reducing the run-time scheduling, communication, and synchronization costs incurred by self-scheduling of multidimensional loops are also presented. Experiments performed on the Encore multiprocessor system demonstrate that the developed techniques can reduce overhead costs.
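
A toy cost model in this spirit (the parameters and formulas below are assumptions, not the paper's model): pick the block size for a self-scheduled loop that minimizes estimated execution time, trading per-iteration work against per-block scheduling and synchronization overhead.

    import math

    def est_time(n_iters, n_procs, block, t_iter, t_overhead):
        # each scheduling round dispatches one block per processor
        n_blocks = math.ceil(n_iters / block)
        rounds = math.ceil(n_blocks / n_procs)
        return rounds * (block * t_iter + t_overhead)

    N, P, T_ITER, T_OVH = 10_000, 8, 1.0, 200.0   # hypothetical costs
    best = min(range(1, N + 1), key=lambda b: est_time(N, P, b, T_ITER, T_OVH))
    print("best block size:", best,
          "estimated time:", est_time(N, P, best, T_ITER, T_OVH))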

3.
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most perform poorly when parallelizing nested loops with irregular dependences. Unlike nested loops with uniform dependences, these loops have a complicated dependence pattern that forms a non-uniform dependence vector set. We apply results from classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances, the iteration space can be tiled. Iterations within a tile can be executed in parallel, and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than existing techniques.
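
A small worked example of minimum-dependence-distance tiling (the access pattern is hypothetical, and brute-force enumeration stands in for the paper's IDCH machinery): for a loop whose body is A[2*i + 3] = g(A[i]), a flow dependence runs from iteration i1 (write) to iteration i2 (read) whenever 2*i1 + 3 == i2.

    N = 64
    dists = [i2 - i1
             for i1 in range(N) for i2 in range(N)
             if i2 == 2 * i1 + 3 and i2 > i1]   # irregular: distance grows with i1
    min_dist = min(dists)                       # minimum dependence distance (3)
    print("min dependence distance:", min_dist)

    # Tiles of size min_dist: no dependence is shorter than min_dist, so the
    # iterations inside a tile are independent and may run in parallel, while
    # successive tiles execute with synchronization between them.
    tiles = [range(lo, min(lo + min_dist, N)) for lo in range(0, N, min_dist)]
    print("first three tiles:", [list(t) for t in tiles[:3]])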

4.
This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.
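
The method starts from the spectrum of the dependence matrix; the sketch below only shows that first step for a hypothetical affine dependence d(i) = D @ i + c (the interpretation of the spectrum and the sequential/parallel partitioning are the paper's contribution and are not reproduced here).

    import numpy as np

    D = np.array([[1.0, 1.0],
                  [0.0, 0.0]])      # hypothetical dependence matrix
    c = np.array([0.0, 1.0])        # hypothetical constant part

    vals, vecs = np.linalg.eig(D)   # eigenvectors are the columns of vecs
    for lam, v in zip(vals, vecs.T):
        print(f"eigenvalue {lam:+.2f}  eigenvector {np.round(v, 3)}")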

5.
Barrier MIMDs are asynchronous multiple-instruction-stream, multiple-data-stream architectures capable of parallel execution of instructions with variable execution times and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem.
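
A minimal sketch of loop coalescing for the constant-bounds case: a doubly nested loop over i in [0, N) and j in [0, M) becomes a single loop over k in [0, N*M), and the original indices are recovered from the coalesced index. (For a triangular nest with j in [0, i], recovery instead inverts the triangular-number formula, e.g. i = (math.isqrt(8*k + 1) - 1) // 2.)

    N, M = 4, 6
    coalesced, nested = [], []

    for k in range(N * M):
        i, j = divmod(k, M)         # recover original indices from k
        coalesced.append((i, j))

    for i in range(N):              # original nest, for comparison
        for j in range(M):
            nested.append((i, j))

    assert coalesced == nested
    print("coalesced", N * M, "iterations; index recovery matches the nest")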

6.
As multicore processors become the new trend in processor development, applications must be executed in parallel to keep improving program performance. Traditional automatic parallelization handles the regular loops of scientific computing applications well, but it cannot yet effectively parallelize irregular programs that contain many function calls and pointer references. To address this, the paper proposes a method for identifying likely parallel regions, based on average region execution time and data dependence information, so that some irregular programs can be parallelized efficiently. The main contributions are: (1) automatic recognition of several kinds of parallelism in a program, covering not only the fine-grained parallelism between loop iterations handled by traditional parallelism analysis, but also the coarse-grained parallelism between loop bodies and function call sites that traditional analysis cannot yet handle effectively; among the many parallelization opportunities in a program, a profitability analysis based on average region execution time selects suitable regions for parallelization; (2) automatic identification of the number and types of data dependences between likely parallel regions, as well as the program variables that induce them. Based on these analysis results, the authors used a behavior-oriented speculative parallelization system (behavior oriented parallelism) to parallelize four benchmarks from SPEC2006. The parallelized programs achieved average speedups of 300% and 260% on Intel and AMD multicore processors, respectively.
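
A toy illustration of the profitability step (all numbers, names, and the threshold are hypothetical; the paper's analysis is based on profiled average region execution times): a region is selected when its average time amortizes the parallelization overhead, and dependences between selected regions are then handled speculatively at run time.

    OVERHEAD_US = 50.0                        # assumed cost of spawning a task
    regions = [("loop_body_A", 400.0), ("call_site_B", 30.0),
               ("loop_body_C", 900.0)]        # (region, avg execution time, us)

    # require the region to be at least 5x the overhead (assumed threshold)
    chosen = [name for name, t in regions if t > 5 * OVERHEAD_US]
    print("regions selected for speculative parallelization:", chosen)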

7.
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance losses result from cache conflicts in fused loops. In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on an 18-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that careful evaluation of the profitability of fusion is necessary as more processors are used.
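
An illustrative sketch of fusion with a fusion-preventing dependence (not the paper's algorithm): fusing "for i: a[i] = f(i)" with "for i: b[i] = a[i+1]" naively is illegal, because the fused iteration i would read a[i+1] before it is written. Shifting (aligning) the second loop by one iteration makes fusion legal, and the fused loop carries no dependence on a, so its iterations could run in parallel.

    N = 10
    f = lambda i: i * i

    # reference: the two original loops
    a = [0] * (N + 1)
    for i in range(N + 1):
        a[i] = f(i)
    b_ref = [a[i + 1] for i in range(N)]

    # fused-and-shifted version: iteration i computes a[i], then b[i - 1]
    a2, b2 = [0] * (N + 1), [0] * N
    for i in range(N + 1):
        a2[i] = f(i)
        if i >= 1:
            b2[i - 1] = a2[i]   # reads a value produced in the same iteration

    assert b2 == b_ref
    print("fusion with a one-iteration shift preserves the result")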

8.
Presents the hierarchical task graph (HTG) as an intermediate parallel program representation that encapsulates minimal data and control dependences and can be used for the extraction and exploitation of functional, or task-level, parallelism. The hierarchical nature of the HTG facilitates efficient task-granularity control during code generation, and thus applicability to a variety of parallel architectures. The construction of the HTG at a given hierarchy level, the derivation of the execution conditions of tasks that maximize task-level parallelism, and the optimization of these conditions to reduce the synchronization overhead imposed by data and control dependences are emphasized. Algorithms for the formation of tasks and their execution conditions based on data and control dependence constraints are presented. The issue of optimizing such conditions is discussed, and optimization algorithms are proposed. The HTG is used as the intermediate representation of parallel Fortran and C programs for generating parallel source code as well as parallel machine code.
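
A toy task graph with simple execution conditions (a flat simplification of the HTG idea; the task names and dependences are hypothetical): each task lists the tasks whose completion its execution condition requires, and all tasks whose conditions hold may run concurrently.

    deps = {                         # task -> set of tasks it must wait for
        "t1": set(), "t2": set(),    # independent: task-level parallelism
        "t3": {"t1", "t2"},          # data dependences on t1 and t2
        "t4": {"t1"},                # e.g. a control dependence on t1
    }

    done, waves = set(), []
    while len(done) < len(deps):
        ready = {t for t, d in deps.items() if t not in done and d <= done}
        waves.append(sorted(ready))  # tasks in one wave may run in parallel
        done |= ready

    print("parallel execution waves:", waves)   # [['t1', 't2'], ['t3', 't4']]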

9.
The performance of applications executing on processors with instruction-level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. The utility of these techniques can be demonstrated in an increasingly broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction-level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
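
A source-level sketch of the control-height-reduction idea for a loop with a data-dependent exit (illustrative only; compilers apply this at the instruction level). The exit test forms a serial control recurrence; evaluating the tests for a block of iterations together lets several iterations retire at once, at the cost of some redundant work.

    data = [3, 7, 2, 9, 4, 8, 1, 5]
    target, BLOCK = 9, 4

    def find_serial(xs, t):
        """Original loop: one exit test per iteration (serial recurrence)."""
        for i, x in enumerate(xs):
            if x == t:
                return i
        return -1

    def find_blocked(xs, t):
        """Blocked version: BLOCK exit conditions evaluated together."""
        for base in range(0, len(xs), BLOCK):
            hits = [i for i in range(base, min(base + BLOCK, len(xs)))
                    if xs[i] == t]
            if hits:
                return hits[0]
        return -1

    assert find_serial(data, target) == find_blocked(data, target) == 3
    print("blocked exit testing retires up to", BLOCK, "iterations per step")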

10.
Partial redundancy elimination (PRE) is a program transformation that identifies and eliminates expressions that are redundant on at least one (but not necessarily all) execution paths of a program without increasing any path length. Chow, Kennedy and co‐workers devised an algorithm (SSAPRE) for performing partial redundancy elimination on intermediate representations in static single assignment (SSA) form. The practicality of that algorithm is limited by the following concerns: (1) it makes assumptions about the namespace that are stronger than those of SSA form and that may not be valid if other optimizations have already been performed on the program; (2) if redundancies occur in nested expressions, the algorithm may expose but not eliminate them (requiring a second pass of the algorithm); (3) it misses cases covered by the state of the art in PRE; and (4) it is difficult to understand and implement. We present an algorithm (A‐SSAPRE) structurally similar to SSAPRE that uses anticipation rather than availability; this algorithm is simpler than SSAPRE, covers more cases, eliminates nested redundancies on a single pass, and makes no assumptions about the namespace.
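
A before/after sketch of partial redundancy elimination itself (illustrative; SSAPRE and A-SSAPRE operate on SSA-form IR, not source). In `before`, a + b is redundant on the path through the taken branch but not on the other path, so it is only partially redundant; PRE inserts the computation where it is missing and reuses a temporary, so no path computes a + b twice and no path gets longer.

    def before(a, b, flag):
        if flag:
            x = a + b        # a + b computed here ...
        else:
            x = 0
        y = a + b            # ... and recomputed on the flag-taken path
        return x + y

    def after(a, b, flag):
        if flag:
            t = a + b        # compute once on this path
            x = t
        else:
            t = a + b        # inserted: makes the later use fully redundant
            x = 0
        y = t                # reuse; no recomputation on any path
        return x + y

    assert before(2, 3, True) == after(2, 3, True)
    assert before(2, 3, False) == after(2, 3, False)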

11.
This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. Two of these techniques are selective cycle shrinking and true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops, and how true dependence shrinking is related to conflict-free mappings of higher-dimensional algorithms into lower-dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes.
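
A sketch of basic cycle shrinking (the simple ancestor of the selective and true dependence variants described above; values are hypothetical): if every loop-carried dependence in "for i: a[i] = a[i-k] + c" has distance at least k, each group of k consecutive iterations is dependence-free and may run in parallel, while the groups execute serially.

    N, k, c = 12, 3, 1
    a_ref = list(range(N))
    for i in range(k, N):
        a_ref[i] = a_ref[i - k] + c        # original serial loop

    a = list(range(N))
    for base in range(k, N, k):            # serial over groups (shrunk cycle)
        group = range(base, min(base + k, N))
        # iterations in `group` only read a[i - k], written in an earlier
        # group, so they could execute in parallel
        for i in group:
            a[i] = a[i - k] + c

    assert a == a_ref
    print("groups of", k, "iterations may execute in parallel")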

12.
The parallelism of loop nests with non-uniform dependences is difficult to extract and is not effectively exploited by existing parallelization schemes. In this paper, we propose new, efficient techniques for extracting the parallelism of loop nests with non-uniform dependences by exploiting their irregularity. In this way, current highly parallel multiprocessor systems, such as multithreaded and clustered multiprocessor systems, can be fully utilized. The four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange, and (d) growing-pattern detection. They exploit the parallelism of special parallel patterns in nested loops with non-uniform dependences. The loop transformations used for uniform loops are also applied to non-uniform dependence loops after legality tests. We apply results from classical convex theory and detect special parallel patterns of dependence vectors. We also propose an algorithm that combines the above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than existing techniques, and these enhancements encourage us to pursue further development.

13.
Curare, the program restructurer described in this paper, automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor. Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result. Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity, Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops, since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke. Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part. This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics).
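
A loose sketch of the server-process idea in Python rather than Lisp (illustrative only; Curare's servers are far cheaper to invoke than OS threads): iterations of a loop that is unconstrained by dependences are handed to a pool of workers.

    from concurrent.futures import ThreadPoolExecutor

    def body(i):
        """One loop iteration; assumed free of cross-iteration conflicts."""
        return i * i

    # a pool of "server" workers, each executing single loop iterations
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(body, range(16)))
    print(results)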

14.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in their memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases involving one linear function in a perfectly nested loop. In this paper, we present a parallel program optimization technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program's parallelism.
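
A sketch of plain loop interchange for locality, a simpler relative of HLI (the array and bounds are hypothetical): traversing a row-major 2-D array column-by-column strides through memory, while interchanging the loops makes the inner loop touch consecutive elements, the kind of access pattern change that avoids thrashing between processors' caches. Interchange legality in general requires a dependence test; this example carries no dependences.

    N = 512
    a = [[0] * N for _ in range(N)]
    b = [[0] * N for _ in range(N)]

    # before: inner loop varies the row index -> stride-N accesses
    for j in range(N):
        for i in range(N):
            a[i][j] = i + j

    # after interchange: inner loop varies the column index -> unit stride
    for i in range(N):
        for j in range(N):
            b[i][j] = i + j

    assert a == b
    print("interchange preserves the result while improving locality")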

15.
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For supernode transformations on algorithms with perfectly nested loops and uniform dependencies, we prove the optimality of a constant linear schedule vector and give a necessary and sufficient condition for optimal relative side lengths. We also prove that the total running time is minimized by a cutting hyperplane direction matrix from a particular subset of all valid directions and we discuss the cases where this subset is unique. The results are derived in continuous space and should be considered approximate. Our model does not include cache effects and assumes an unbounded number of available processors, the communication cost approximated by a constant, uniform dependences, and loop bounds known at compile time. A comprehensive example is discussed with an application of the results to the Jacobi algorithm.
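
A toy version of the supernode-size trade-off (the wavefront cost model below is an assumption, not the paper's): with N iterations per dimension, tile side s, per-iteration compute time t, and a constant per-tile communication cost c, larger tiles amortize c but lengthen the critical path, so an interior minimum exists.

    def T(s, N=256, t=1.0, c=400.0):
        # 2-D diagonal wavefront: roughly 2*(N/s) - 1 tiles on the critical
        # path, each costing compute (s*s*t) plus communication (c)
        tiles_on_path = 2 * (N // s) - 1
        return tiles_on_path * (s * s * t + c)

    best = min(range(1, 65), key=T)
    print("model-optimal tile side:", best, "estimated time:", round(T(best), 1))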

16.
Partitioning and mapping of nested loops for linear array multicomputers
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.
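
A sketch of the projection step under assumed choices (the projection direction and block mapping here are hypothetical, not the paper's strategy): a 2-D iteration space is projected along direction (0, 1) onto a line, so all iterations (i, j) with equal i form one node of the 1-D projected structure; nodes are then mapped block-wise onto P linearly connected processors, balancing load while keeping neighbouring nodes mostly on the same processor.

    N, P = 12, 4
    nodes = {i: [(i, j) for j in range(N)] for i in range(N)}   # projection
    block = N // P
    placement = {i: min(i // block, P - 1) for i in nodes}      # block map

    for p in range(P):
        work = sum(len(nodes[i]) for i in nodes if placement[i] == p)
        print(f"processor {p}: {work} iterations")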

17.
Automatic coarse-grained parallelization of program loops is of great importance for parallel computing systems. This paper presents the theory of Iteration Space Slicing, aimed at extracting the synchronization-free parallelism available in arbitrarily nested program loops. We demonstrate that Iteration Space Slicing algorithms permit extracting more coarse-grained parallelism than the Affine Transformation Framework, provided that we are able to calculate the transitive closure of the union of relations describing all dependences in the affine loop. Experimental results show that by means of Iteration Space Slicing algorithms we are able to extract coarse-grained parallelism for many loops of the NAS and UTDSP benchmarks. Problems to be resolved in order to enhance the theory of Iteration Space Slicing are discussed.
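
A small sketch of the slicing idea (connected components stand in for the paper's transitive-closure machinery, and the loop is hypothetical): the dependence a[i] = a[i-2] + 1 splits the iterations into two independent chains (even i and odd i); each chain is a synchronization-free slice that can run on its own thread.

    N = 10
    edges = [(i - 2, i) for i in range(2, N)]   # iteration i depends on i-2

    parent = list(range(N))                     # union-find over iterations
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)

    slices = {}
    for i in range(N):
        slices.setdefault(find(i), []).append(i)

    # each slice is internally ordered but needs no synchronization with others
    print("synchronization-free slices:", list(slices.values()))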

18.
An Effective Parallelism Analysis Algorithm
Parallelism analysis plays a very important role in a parallelizing compiler system; its quality directly affects whether the compiler succeeds. With the development of cluster systems and their parallel programming environments, most parallel systems can support the execution of multiple levels of parallel loops. For programming systems that support only one level of parallel loop, choosing the loop with the highest parallel execution efficiency is also very important. To this end, this paper proposes an effective loop parallelism analysis scheme that not only reports the parallelism of multi-level loops but also handles the great majority of parallelism problems arising in practical applications. The paper modifies the traditional parallelism analysis algorithm and presents an effective parallelism analysis algorithm.

19.
An experimental evaluation of data dependence analysis techniques
Optimizing compilers rely upon program analysis techniques to detect data dependences between program statements. Data dependence information captures the essential ordering constraints of the statements in a program that need to be preserved in order to produce valid optimized and parallel code. Data dependence testing is very important for automatic parallelization, vectorization, and any other code transformation. In this paper, we examine the impact of data dependence analysis in practice. A number of data dependence tests have been proposed in the literature. In each test, there are different trade-offs between accuracy and efficiency. We present an experimental evaluation of several data dependence tests, including the Banerjee test, the I-Test, and the Omega test. We compare these tests in terms of data dependence accuracy, compilation efficiency, effectiveness in parallelization, and program execution performance. We analyze the reasons why a data dependence test can be inexact and we explain how the examined tests handle such cases. We run various experiments using the Perfect Club Benchmarks and the scientific library Lapack. We present the measured accuracy of each test and the reasons for any approximation. We compare these tests in terms of efficiency and analyze the trade-offs between accuracy and efficiency. We also determine the impact of each data dependence test on the total compilation time. Finally, we measure the number of loops parallelized by each test and compare the execution performance of each benchmark on a multiprocessor. Our results indicate that the Omega test is more accurate, but also very inefficient in the cases where the other two tests are inaccurate. In general, the cost of the Omega test is high and consumes a significant percentage of the total compilation time. Furthermore, the additional accuracy of the Omega test over the Banerjee test and the I-Test does not improve parallelization or program execution performance.
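
For flavor, here is the classic GCD test, a building block simpler than the Banerjee, I-, and Omega tests compared above: for a write A[a1*i + a0] and a read A[b1*i + b0] in a loop over i, a dependence requires an integer solution of a1*x - b1*y = b0 - a0, which exists only if gcd(a1, b1) divides b0 - a0. The test ignores loop bounds, which is exactly the kind of conservative inexactness the paper analyzes.

    from math import gcd

    def gcd_test(a1, a0, b1, b0):
        """False = provably no dependence; True = dependence possible."""
        return (b0 - a0) % gcd(a1, b1) == 0

    # a[2*i] vs a[2*i + 1]: subscripts differ in parity -> independent
    print(gcd_test(2, 0, 2, 1))   # False: no dependence
    # a[2*i] vs a[4*i + 2]: gcd(2, 4) = 2 divides 2 -> dependence possible
    print(gcd_test(2, 0, 4, 2))   # True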

20.
Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross‐node communication. Synchronization requires fine‐tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self‐scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case.
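
A toy model in the same spirit (the formulas and constants below are assumptions, not the paper's): synchronizing every h iterations adds overhead proportional to the number of synchronization points, but smaller h releases dependent chunks to other processors sooner; the sketch picks the h minimizing the estimated time.

    def est_time(h, n=100_000, p=8, t_iter=1.0, t_sync=500.0):
        syncs = n // h                 # number of synchronization points
        fill = (p - 1) * h * t_iter    # pipeline fill before all p are busy
        return fill + n * t_iter / p + syncs * t_sync

    best_h = min(range(1, 5001), key=est_time)
    print("model-optimal synchronization granularity:", best_h)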
