Similar Articles
20 similar articles found.
1.
Cross-iteration data dependences in DOACROSS loops must be enforced by explicit data synchronization. However, the composite effect of some data synchronizations may cover other dependences and make the enforcement of those covered dependences redundant. In this paper, we propose an efficient and general algorithm to identify redundant synchronizations in multiply nested DOACROSS loops, which may have multiple statements and loop-exit control branches. Eliminating redundant synchronizations allows more efficient execution of such loops. We also address the issue of enforcing data synchronizations in iterations near the boundary of the iteration space. Because some dependences may not exist in those boundary iterations, determining the redundant synchronizations for those iterations is more complex. The necessary and sufficient condition under which a synchronization is uniformly redundant is also studied. These results allow a parallelizing compiler to generate efficient data synchronization instructions for DOACROSS loops.
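
As a minimal sketch of the covering idea (not the paper's general algorithm, which handles multiple nesting and loop-exit branches), the following checks, for a singly nested DOACROSS loop, whether the synchronization for one dependence distance is made redundant by the composite effect of the others; the distances are hypothetical.

    # A synchronization for distance d is redundant if d can be composed as
    # a sum of other enforced distances. Note: removing several "redundant"
    # distances at once requires care, since they may cover each other.

    def covered(d, enforced):
        """True if d is reachable as a sum of distances from `enforced`."""
        reachable = {0}
        for _ in range(d):   # all distances are >= 1, so d steps suffice
            reachable |= {r + e for r in reachable for e in enforced if r + e <= d}
        return d in reachable

    # Dependences at distances 1, 2, 3: enforcing distance 1 alone covers
    # 2 (= 1 + 1) and 3 (= 1 + 1 + 1), so those synchronizations are redundant.
    distances = [1, 2, 3]
    for d in distances:
        others = [e for e in distances if e != d]
        status = "redundant" if covered(d, others) else "must be enforced"
        print(f"distance {d}: {status}")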

2.
The author presents strategies for static loop decomposition and scheduling, as well as computer-assisted run-time scheduling, that take into account not only the cost of performing operations but also the overhead costs associated with a decomposition and schedule. An algorithm for static decomposition of multidimensional loops based on operation execution costs, communication costs, and synchronization costs is discussed. Synchronization instructions are introduced to ensure correct program execution following decomposition. An algorithm for determining the explicit synchronization instructions that should be introduced to ensure correct execution of a program with arbitrarily nested loops is presented. Techniques for reducing the run-time scheduling, communication, and synchronization costs incurred by self-scheduling of multidimensional loops are also presented. Experiments performed on the Encore multiprocessor system demonstrate that the developed techniques can reduce overhead costs.
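
A toy cost model in this spirit (the parameters and formulas below are assumptions, not the paper's model): pick the block size for a self-scheduled loop that minimizes estimated execution time, trading per-iteration work against per-block scheduling and synchronization overhead.

    import math

    def est_time(n_iters, n_procs, block, t_iter, t_overhead):
        # each scheduling round dispatches one block per processor
        n_blocks = math.ceil(n_iters / block)
        rounds = math.ceil(n_blocks / n_procs)
        return rounds * (block * t_iter + t_overhead)

    N, P, T_ITER, T_OVH = 10_000, 8, 1.0, 200.0   # hypothetical costs
    best = min(range(1, N + 1), key=lambda b: est_time(N, P, b, T_ITER, T_OVH))
    print("best block size:", best,
          "estimated time:", est_time(N, P, best, T_ITER, T_OVH))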

3.
In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most perform poorly when parallelizing nested loops with irregular dependences. Unlike nested loops with uniform dependences, these loops have a complicated dependence pattern that forms a non-uniform dependence vector set. We apply results from classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances, the iteration space can be tiled. Iterations within a tile can be executed in parallel, and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than existing techniques.
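
A small worked example of minimum-dependence-distance tiling (the access pattern is hypothetical, and brute-force enumeration stands in for the paper's IDCH machinery): for a loop whose body is A[2*i + 3] = g(A[i]), a flow dependence runs from iteration i1 (write) to iteration i2 (read) whenever 2*i1 + 3 == i2.

    N = 64
    dists = [i2 - i1
             for i1 in range(N) for i2 in range(N)
             if i2 == 2 * i1 + 3 and i2 > i1]   # irregular: distance grows with i1
    min_dist = min(dists)                       # minimum dependence distance (3)
    print("min dependence distance:", min_dist)

    # Tiles of size min_dist: no dependence is shorter than min_dist, so the
    # iterations inside a tile are independent and may run in parallel, while
    # successive tiles execute with synchronization between them.
    tiles = [range(lo, min(lo + min_dist, N)) for lo in range(0, N, min_dist)]
    print("first three tiles:", [list(t) for t in tiles[:3]])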

4.
This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation.
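
The method starts from the spectrum of the dependence matrix; the sketch below only shows that first step for a hypothetical affine dependence d(i) = D @ i + c (the interpretation of the spectrum and the sequential/parallel partitioning are the paper's contribution and are not reproduced here).

    import numpy as np

    D = np.array([[1.0, 1.0],
                  [0.0, 0.0]])      # hypothetical dependence matrix
    c = np.array([0.0, 1.0])        # hypothetical constant part

    vals, vecs = np.linalg.eig(D)   # eigenvectors are the columns of vecs
    for lam, v in zip(vals, vecs.T):
        print(f"eigenvalue {lam:+.2f}  eigenvector {np.round(v, 3)}")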

5.
Barrier MIMDs are asynchronous multiple-instruction-stream, multiple-data-stream architectures capable of parallel execution of instructions with variable execution times and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem.
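
A minimal sketch of loop coalescing for the constant-bounds case: a doubly nested loop over i in [0, N) and j in [0, M) becomes a single loop over k in [0, N*M), and the original indices are recovered from the coalesced index. (For a triangular nest with j in [0, i], recovery instead inverts the triangular-number formula, e.g. i = (math.isqrt(8*k + 1) - 1) // 2.)

    N, M = 4, 6
    coalesced, nested = [], []

    for k in range(N * M):
        i, j = divmod(k, M)         # recover original indices from k
        coalesced.append((i, j))

    for i in range(N):              # original nest, for comparison
        for j in range(M):
            nested.append((i, j))

    assert coalesced == nested
    print("coalesced", N * M, "iterations; index recovery matches the nest")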

6.
As multicore processors become the new trend in processor development, applications must be executed in parallel to keep improving program performance. Traditional automatic parallelization handles the regular loops of scientific computing applications well, but it cannot yet effectively parallelize irregular programs that contain many function calls and pointer references. To address this, the paper proposes a method for identifying likely parallel regions, based on average region execution time and data dependence information, so that some irregular programs can be parallelized efficiently. The main contributions are: (1) automatic recognition of several kinds of parallelism in a program, covering not only the fine-grained parallelism between loop iterations handled by traditional parallelism analysis, but also the coarse-grained parallelism between loop bodies and function call sites that traditional analysis cannot yet handle effectively; among the many parallelization opportunities in a program, a profitability analysis based on average region execution time selects suitable regions for parallelization; (2) automatic identification of the number and types of data dependences between likely parallel regions, as well as the program variables that induce them. Based on these analysis results, the authors used a behavior-oriented speculative parallelization system (behavior oriented parallelism) to parallelize four benchmarks from SPEC2006. The parallelized programs achieved average speedups of 300% and 260% on Intel and AMD multicore processors, respectively.
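
A toy illustration of the profitability step (all numbers, names, and the threshold are hypothetical; the paper's analysis is based on profiled average region execution times): a region is selected when its average time amortizes the parallelization overhead, and dependences between selected regions are then handled speculatively at run time.

    OVERHEAD_US = 50.0                        # assumed cost of spawning a task
    regions = [("loop_body_A", 400.0), ("call_site_B", 30.0),
               ("loop_body_C", 900.0)]        # (region, avg execution time, us)

    # require the region to be at least 5x the overhead (assumed threshold)
    chosen = [name for name, t in regions if t > 5 * OVERHEAD_US]
    print("regions selected for speculative parallelization:", chosen)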

7.
Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which prevent parallelism. In addition, performance losses result from cache conflicts in fused loops. In this paper, we present new techniques to: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) maintain parallelism and allow the parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We describe algorithms for implementing these techniques in compilers. The techniques are evaluated on a 56-processor KSR2 multiprocessor and on an 18-processor Convex SPP-1000 multiprocessor. The results demonstrate performance improvements for both kernels and complete applications. The results also indicate that careful evaluation of the profitability of fusion is necessary as more processors are used.
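
An illustrative sketch of fusion with a fusion-preventing dependence (not the paper's algorithm): fusing "for i: a[i] = f(i)" with "for i: b[i] = a[i+1]" naively is illegal, because the fused iteration i would read a[i+1] before it is written. Shifting (aligning) the second loop by one iteration makes fusion legal, and the fused loop carries no dependence on a, so its iterations could run in parallel.

    N = 10
    f = lambda i: i * i

    # reference: the two original loops
    a = [0] * (N + 1)
    for i in range(N + 1):
        a[i] = f(i)
    b_ref = [a[i + 1] for i in range(N)]

    # fused-and-shifted version: iteration i computes a[i], then b[i - 1]
    a2, b2 = [0] * (N + 1), [0] * N
    for i in range(N + 1):
        a2[i] = f(i)
        if i >= 1:
            b2[i - 1] = a2[i]   # reads a value produced in the same iteration

    assert b2 == b_ref
    print("fusion with a one-iteration shift preserves the result")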

8.
Presents the hierarchical task graph (HTG) as an intermediate parallel program representation that encapsulates minimal data and control dependences and can be used for the extraction and exploitation of functional, or task-level, parallelism. The hierarchical nature of the HTG facilitates efficient task-granularity control during code generation, and thus applicability to a variety of parallel architectures. The construction of the HTG at a given hierarchy level, the derivation of the execution conditions of tasks that maximize task-level parallelism, and the optimization of these conditions to reduce the synchronization overhead imposed by data and control dependences are emphasized. Algorithms for the formation of tasks and their execution conditions based on data and control dependence constraints are presented. The issue of optimizing such conditions is discussed, and optimization algorithms are proposed. The HTG is used as the intermediate representation of parallel Fortran and C programs for generating parallel source code as well as parallel machine code.
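
A toy task graph with simple execution conditions (a flat simplification of the HTG idea; the task names and dependences are hypothetical): each task lists the tasks whose completion its execution condition requires, and all tasks whose conditions hold may run concurrently.

    deps = {                         # task -> set of tasks it must wait for
        "t1": set(), "t2": set(),    # independent: task-level parallelism
        "t3": {"t1", "t2"},          # data dependences on t1 and t2
        "t4": {"t1"},                # e.g. a control dependence on t1
    }

    done, waves = set(), []
    while len(done) < len(deps):
        ready = {t for t, d in deps.items() if t not in done and d <= done}
        waves.append(sorted(ready))  # tasks in one wave may run in parallel
        done |= ready

    print("parallel execution waves:", waves)   # [['t1', 't2'], ['t3', 't4']]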

9.
The performance of applications executing on processors with instruction-level parallelism is often limited by control and data dependences. Performance bottlenecks caused by dependences can frequently be eliminated through transformations which reduce the height of critical paths through the program. The utility of these techniques can be demonstrated in an increasingly broad range of important situations. This paper focuses on the height reduction of control recurrences within loops with data-dependent exits. Loops with exits are transformed so as to alleviate performance bottlenecks resulting from control dependences. A compilation approach to effect these transformations is described. The techniques presented in this paper, used in combination with prior work on reducing the height of data dependences, provide a comprehensive approach to accelerating loops with conditional exits. In many cases, loops with conditional exits provide a degree of parallelism traditionally associated with vectorization. Multiple iterations of a loop can be retired in a single cycle on a processor with adequate instruction-level parallelism, with no cost in code redundancy. In more difficult cases, height reduction requires redundant computation or may not be feasible.
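
A source-level sketch of the control-height-reduction idea for a loop with a data-dependent exit (illustrative only; compilers apply this at the instruction level). The exit test forms a serial control recurrence; evaluating the tests for a block of iterations together lets several iterations retire at once, at the cost of some redundant work.

    data = [3, 7, 2, 9, 4, 8, 1, 5]
    target, BLOCK = 9, 4

    def find_serial(xs, t):
        """Original loop: one exit test per iteration (serial recurrence)."""
        for i, x in enumerate(xs):
            if x == t:
                return i
        return -1

    def find_blocked(xs, t):
        """Blocked version: BLOCK exit conditions evaluated together."""
        for base in range(0, len(xs), BLOCK):
            hits = [i for i in range(base, min(base + BLOCK, len(xs)))
                    if xs[i] == t]
            if hits:
                return hits[0]
        return -1

    assert find_serial(data, target) == find_blocked(data, target) == 3
    print("blocked exit testing retires up to", BLOCK, "iterations per step")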

10.
Partial redundancy elimination (PRE) is a program transformation that identifies and eliminates expressions that are redundant on at least one (but not necessarily all) execution paths of a program without increasing any path length. Chow, Kennedy and co‐workers devised an algorithm (SSAPRE) for performing partial redundancy elimination on intermediate representations in static single assignment (SSA) form. The practicality of that algorithm is limited by the following concerns: (1) it makes assumptions about the namespace that are stronger than those of SSA form and that may not be valid if other optimizations have already been performed on the program; (2) if redundancies occur in nested expressions, the algorithm may expose but not eliminate them (requiring a second pass of the algorithm); (3) it misses cases covered by the state of the art in PRE; and (4) it is difficult to understand and implement. We present an algorithm (A‐SSAPRE) structurally similar to SSAPRE that uses anticipation rather than availability; this algorithm is simpler than SSAPRE, covers more cases, eliminates nested redundancies on a single pass, and makes no assumptions about the namespace.
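
A before/after sketch of partial redundancy elimination itself (illustrative; SSAPRE and A-SSAPRE operate on SSA-form IR, not source). In `before`, a + b is redundant on the path through the taken branch but not on the other path, so it is only partially redundant; PRE inserts the computation where it is missing and reuses a temporary, so no path computes a + b twice and no path gets longer.

    def before(a, b, flag):
        if flag:
            x = a + b        # a + b computed here ...
        else:
            x = 0
        y = a + b            # ... and recomputed on the flag-taken path
        return x + y

    def after(a, b, flag):
        if flag:
            t = a + b        # compute once on this path
            x = t
        else:
            t = a + b        # inserted: makes the later use fully redundant
            x = 0
        y = t                # reuse; no recomputation on any path
        return x + y

    assert before(2, 3, True) == after(2, 3, True)
    assert before(2, 3, False) == after(2, 3, False)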

11.
This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. Two of these techniques are selective cycle shrinking and true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops, and how true dependence shrinking is related to conflict-free mappings of higher-dimensional algorithms into lower-dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes.
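
A sketch of basic cycle shrinking (the simple ancestor of the selective and true dependence variants described above; values are hypothetical): if every loop-carried dependence in "for i: a[i] = a[i-k] + c" has distance at least k, each group of k consecutive iterations is dependence-free and may run in parallel, while the groups execute serially.

    N, k, c = 12, 3, 1
    a_ref = list(range(N))
    for i in range(k, N):
        a_ref[i] = a_ref[i - k] + c        # original serial loop

    a = list(range(N))
    for base in range(k, N, k):            # serial over groups (shrunk cycle)
        group = range(base, min(base + k, N))
        # iterations in `group` only read a[i - k], written in an earlier
        # group, so they could execute in parallel
        for i in group:
            a[i] = a[i - k] + c

    assert a == a_ref
    print("groups of", k, "iterations may execute in parallel")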

12.
The parallelism of loop nests with non-uniform dependences is difficult to extract and is not effectively exploited by existing parallelization schemes. In this paper, we propose new, efficient techniques for extracting the parallelism of loop nests with non-uniform dependences by exploiting their irregularity. In this way, current highly parallel multiprocessor systems, such as multithreaded and clustered multiprocessor systems, can be fully utilized. The four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange, and (d) growing-pattern detection. They exploit the parallelism of special parallel patterns in nested loops with non-uniform dependences. The loop transformations used for uniform loops are also applied to non-uniform dependence loops after legality tests. We apply results from classical convex theory and detect special parallel patterns of dependence vectors. We also propose an algorithm that combines the above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than existing techniques, and these enhancements encourage us to pursue further development.

13.
Curare, the program restructurer described in this paper, automatically transforms a sequential Lisp program into an equivalent concurrent program that runs on a multiprocessor. Data dependences constrain the program's concurrent execution because, in general, two conflicting statements cannot execute in a different order without affecting the program's result. Not all dependences are essential to produce the program's result. Curare attempts to transform the program so it computes its result with fewer conflicts. An optimized program will execute with less synchronization and more concurrency. Curare then examines loops in a program to find those that are unconstrained or lightly constrained by dependences. By necessity, Curare treats recursive functions as loops and does not limit itself to explicit program loops. Recursive functions offer several advantages over explicit loops, since they provide a convenient framework for inserting locks and handling the dynamic behavior of symbolic programs. Loops that are suitable for concurrent execution are changed to execute on a set of concurrent server processes. These servers execute single loop iterations and therefore need to be extremely inexpensive to invoke. Restructured programs execute significantly faster than the original sequential programs. This improvement is large enough to attract programmers to a multiprocessor, particularly since it requires little effort on their part. This research was funded by DARPA contract numbers N00039-85-C-0269 (SPUR) and N00039-84-C-0089 (XCS) and by an NSF Presidential Young Investigator award to Paul N. Hilfinger. Additional funding came from the California MICRO program (in conjunction with Texas Instruments, Xerox, Honeywell, and Phillips/Signetics).
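
A loose sketch of the server-process idea in Python rather than Lisp (illustrative only; Curare's servers are far cheaper to invoke than OS threads): iterations of a loop that is unconstrained by dependences are handed to a pool of workers.

    from concurrent.futures import ThreadPoolExecutor

    def body(i):
        """One loop iteration; assumed free of cross-iteration conflicts."""
        return i * i

    # a pool of "server" workers, each executing single loop iterations
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(body, range(16)))
    print(results)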

14.
Parallel loops account for the greatest amount of parallelism in numerical programs. Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems. However, in parallel processing systems with caches or local memories in their memory hierarchies, a "thrashing problem" may arise whenever data move back and forth between the caches or local memories of different processors. Previous techniques can only deal with the rather simple cases involving one linear function in a perfectly nested loop. In this paper, we present a parallel program optimization technique called hybrid loop interchange (HLI) for cases with multiple linear functions and loop-carried data dependences in the nested loop. With HLI we can easily eliminate or reduce the thrashing phenomena without reducing the program's parallelism.
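
A sketch of plain loop interchange for locality, a simpler relative of HLI (the array and bounds are hypothetical): traversing a row-major 2-D array column-by-column strides through memory, while interchanging the loops makes the inner loop touch consecutive elements, the kind of access pattern change that avoids thrashing between processors' caches. Interchange legality in general requires a dependence test; this example carries no dependences.

    N = 512
    a = [[0] * N for _ in range(N)]
    b = [[0] * N for _ in range(N)]

    # before: inner loop varies the row index -> stride-N accesses
    for j in range(N):
        for i in range(N):
            a[i][j] = i + j

    # after interchange: inner loop varies the column index -> unit stride
    for i in range(N):
        for j in range(N):
            b[i][j] = i + j

    assert a == b
    print("interchange preserves the result while improving locality")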

15.
With the objective of minimizing the total execution time of a parallel program on a distributed memory parallel computer, this paper discusses the selection of an optimal supernode shape of a supernode transformation (also known as tiling). We identify three parameters of a supernode transformation: supernode size, relative side lengths, and cutting hyperplane directions. For supernode transformations on algorithms with perfectly nested loops and uniform dependencies, we prove the optimality of a constant linear schedule vector and give a necessary and sufficient condition for optimal relative side lengths. We also prove that the total running time is minimized by a cutting hyperplane direction matrix from a particular subset of all valid directions and we discuss the cases where this subset is unique. The results are derived in continuous space and should be considered approximate. Our model does not include cache effects and assumes an unbounded number of available processors, the communication cost approximated by a constant, uniform dependences, and loop bounds known at compile time. A comprehensive example is discussed with an application of the results to the Jacobi algorithm.
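
A toy version of the supernode-size trade-off (the wavefront cost model below is an assumption, not the paper's): with N iterations per dimension, tile side s, per-iteration compute time t, and a constant per-tile communication cost c, larger tiles amortize c but lengthen the critical path, so an interior minimum exists.

    def T(s, N=256, t=1.0, c=400.0):
        # 2-D diagonal wavefront: roughly 2*(N/s) - 1 tiles on the critical
        # path, each costing compute (s*s*t) plus communication (c)
        tiles_on_path = 2 * (N // s) - 1
        return tiles_on_path * (s * s * t + c)

    best = min(range(1, 65), key=T)
    print("model-optimal tile side:", best, "estimated time:", round(T(best), 1))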

16.
Partitioning and mapping of nested loops for linear array multicomputers
In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project an n-dimensional computational structure, representing an n-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers.
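
A sketch of the projection step under assumed choices (the projection direction and block mapping here are hypothetical, not the paper's strategy): a 2-D iteration space is projected along direction (0, 1) onto a line, so all iterations (i, j) with equal i form one node of the 1-D projected structure; nodes are then mapped block-wise onto P linearly connected processors, balancing load while keeping neighbouring nodes mostly on the same processor.

    N, P = 12, 4
    nodes = {i: [(i, j) for j in range(N)] for i in range(N)}   # projection
    block = N // P
    placement = {i: min(i // block, P - 1) for i in nodes}      # block map

    for p in range(P):
        work = sum(len(nodes[i]) for i in nodes if placement[i] == p)
        print(f"processor {p}: {work} iterations")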

17.
Automatic coarse-grained parallelization of program loops is of great importance for parallel computing systems. This paper presents the theory of Iteration Space Slicing, aimed at extracting the synchronization-free parallelism available in arbitrarily nested program loops. We demonstrate that Iteration Space Slicing algorithms permit extracting more coarse-grained parallelism than the Affine Transformation Framework, provided that we are able to calculate the transitive closure of the union of relations describing all dependences in the affine loop. Experimental results show that by means of Iteration Space Slicing algorithms we are able to extract coarse-grained parallelism for many loops of the NAS and UTDSP benchmarks. Problems to be resolved in order to enhance the theory of Iteration Space Slicing are discussed.
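
A small sketch of the slicing idea (connected components stand in for the paper's transitive-closure machinery, and the loop is hypothetical): the dependence a[i] = a[i-2] + 1 splits the iterations into two independent chains (even i and odd i); each chain is a synchronization-free slice that can run on its own thread.

    N = 10
    edges = [(i - 2, i) for i in range(2, N)]   # iteration i depends on i-2

    parent = list(range(N))                     # union-find over iterations
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)

    slices = {}
    for i in range(N):
        slices.setdefault(find(i), []).append(i)

    # each slice is internally ordered but needs no synchronization with others
    print("synchronization-free slices:", list(slices.values()))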

18.
An Effective Parallelism Analysis Algorithm
Parallelism analysis plays a very important role in a parallelizing compiler system; its quality directly affects whether the compiler succeeds. With the development of cluster systems and their parallel programming environments, most parallel systems can support the execution of multiple levels of parallel loops. For programming systems that support only one level of parallel loop, choosing the loop with the highest parallel execution efficiency is also very important. To this end, this paper proposes an effective loop parallelism analysis scheme that not only reports the parallelism of multi-level loops but also handles the great majority of parallelism problems arising in practical applications. The paper modifies the traditional parallelism analysis algorithm and presents an effective parallelism analysis algorithm.

19.
An experimental evaluation of data dependence analysis techniques
Optimizing compilers rely upon program analysis techniques to detect data dependences between program statements. Data dependence information captures the essential ordering constraints of the statements in a program that need to be preserved in order to produce valid optimized and parallel code. Data dependence testing is very important for automatic parallelization, vectorization, and any other code transformation. In this paper, we examine the impact of data dependence analysis in practice. A number of data dependence tests have been proposed in the literature. In each test, there are different trade-offs between accuracy and efficiency. We present an experimental evaluation of several data dependence tests, including the Banerjee test, the I-Test, and the Omega test. We compare these tests in terms of data dependence accuracy, compilation efficiency, effectiveness in parallelization, and program execution performance. We analyze the reasons why a data dependence test can be inexact and we explain how the examined tests handle such cases. We run various experiments using the Perfect Club Benchmarks and the scientific library Lapack. We present the measured accuracy of each test and the reasons for any approximation. We compare these tests in terms of efficiency and analyze the trade-offs between accuracy and efficiency. We also determine the impact of each data dependence test on the total compilation time. Finally, we measure the number of loops parallelized by each test and compare the execution performance of each benchmark on a multiprocessor. Our results indicate that the Omega test is more accurate, but also very inefficient in the cases where the other two tests are inaccurate. In general, the cost of the Omega test is high and consumes a significant percentage of the total compilation time. Furthermore, the additional accuracy of the Omega test over the Banerjee test and the I-Test does not improve parallelization or program execution performance.
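
For flavor, here is the classic GCD test, a building block simpler than the Banerjee, I-, and Omega tests compared above: for a write A[a1*i + a0] and a read A[b1*i + b0] in a loop over i, a dependence requires an integer solution of a1*x - b1*y = b0 - a0, which exists only if gcd(a1, b1) divides b0 - a0. The test ignores loop bounds, which is exactly the kind of conservative inexactness the paper analyzes.

    from math import gcd

    def gcd_test(a1, a0, b1, b0):
        """False = provably no dependence; True = dependence possible."""
        return (b0 - a0) % gcd(a1, b1) == 0

    # a[2*i] vs a[2*i + 1]: subscripts differ in parity -> independent
    print(gcd_test(2, 0, 2, 1))   # False: no dependence
    # a[2*i] vs a[4*i + 2]: gcd(2, 4) = 2 divides 2 -> dependence possible
    print(gcd_test(2, 0, 4, 2))   # True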

20.
Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross‐node communication. Synchronization requires fine‐tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self‐scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case.
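
A toy model in the same spirit (the formulas and constants below are assumptions, not the paper's): synchronizing every h iterations adds overhead proportional to the number of synchronization points, but smaller h releases dependent chunks to other processors sooner; the sketch picks the h minimizing the estimated time.

    def est_time(h, n=100_000, p=8, t_iter=1.0, t_sync=500.0):
        syncs = n // h                 # number of synchronization points
        fill = (p - 1) * h * t_iter    # pipeline fill before all p are busy
        return fill + n * t_iter / p + syncs * t_sync

    best_h = min(range(1, 5001), key=est_time)
    print("model-optimal synchronization granularity:", best_h)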
