期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Loop coalescing and scheduling for barrier MIMD architectures

O'Keefe M.T. Dietz H.G. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(9):1060-1064

Barrier MIMD's are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMD's in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem 相似文献

2.

A Systematic Framework for the Dependence Cycle Removal in Practical Loops

《Journal of Parallel and Distributed Computing》1995,27(2):157-171

相似文献

3.

On the parallel execution time of tiled loops

Hogstedt K. Carter L. Ferrante J. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(3):307-321

Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape. 相似文献

4.

Efficient processor assignment algorithms and loop transformationsfor executing nested parallel loops on multiprocessors

Chien-Min Wang Sheng-De Wang 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(1):71-82

An important issue for the efficient use of multiprocessor systems is the assignment of parallel processors to nested parallel loops. It is desirable for a processor assignment algorithm to be fast and always generate an optimal processor assignment. The paper proposes two efficient algorithms to decide the optimal number of processors assigned to each individual loop. Efficient parallel counterparts of these two algorithms are also presented. These algorithms not only always generate an optimal processor assignment, but also are much faster than the exiting optimal algorithm in the literature. The paper discusses improving the performance of parallel execution by transforming a nested parallel loop into a semantically equivalent one. Three loop transformations are investigated. It is observed that, in most cases, the parallel execution time is improved after applying these transformations 相似文献

5.

Partitioning and mapping of nested loops for linear array multicomputers 总被引：1，自引：1，他引：0

Jang-Ping Sheu Tzung-Shi Chen 《The Journal of supercomputing》1995,9(1-2):183-202

In distributed-memory multicomputers, minimizing interprocessor communication is the key to the efficient execution of parallel programs. In order to reduce the amount of communication overhead, parallel programs on multicomputers must be carefully scheduled by parallelizing compilers. This paper proposes some compilation techniques for partitioning and mapping nested loops with constant data dependences onto linear array multicomputers. First, a systematic partition strategy is proposed to project ann-dimensional computational structure, representing ann-nested loop, onto a line to form a one-dimensional projected structure with low communication overhead. Then, a mapping algorithm is proposed for mapping the partitioned loops onto linear arrays in a way that balances the workload and minimizes the communication cost among processors. Finally, parallel execution codes can be automatically generated for such linear array multicomputers. 相似文献

6.

Finding the best compromise in compiling compound loops to Verilog

Yosi Ben-Asher Nadav Rotem Eddie Shochat 《Journal of Systems Architecture》2010,56(9):474-486

In this work we consider a special optimization problem involved with compiling compound loops (combining nested and consecutive sub-loops) to Verilog. Each sub-loop of the compound loop may require a different optimized hardware configuration (OHC) for optimized execution times. For example, one loop requires at least two memory ports and one multiplier for an optimized execution time, while another loop may require only one memory port but two multipliers, yet one OHC should be selected for both loops. The goal is to compute a minimal OHC which, based on the different heat levels (expected number of iterations) of the sub-loops, is a good compromise between all the conflicting requirements of each sub-loop. Though synthesis of nested loops has been implemented in quite a few systems this aspect has not been considered so far. We avoid the use of time consuming integer linear programming (ILP) techniques and instead use a fast space exploration technique combined with an efficient variant of list scheduling.Another novel aspect of the proposed system is the observation that the real latencies of the hardware units should be considered as variables of the OHC rather than fixed real values as is usually done in high-level synthesis systems. Experimental results show a significant improvement in the OHC without a significant increase in the execution time due to the use of this search procedure.¹ 相似文献

7.

Data-Centric Transformations for Locality Enhancement

Kodukula Induprakas Pingali Keshav 《International journal of parallel programming》2001,29(3):319-364

On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five. 相似文献

8.

Specification of software pipelining using petri nets 总被引：1，自引：0，他引：1

M. Rajagopalan V. H. Allan 《International journal of parallel programming》1994,22(3):273-301

This paper presents a flexible model for software pipelining using the petri nets. Our technique, called the Petri Net Pacemaker (PNP), can create near optimal pipelines with less algorithmic effort than other techniques. The pacemaker is a novel idea which exploits the cyclic behavior of petri nets to model the problem of scheduling operations of a loop body for software pipelining. A way of improving the performance of loops containing predicates is given. The PNP technique also shows how nested loops can be pipelined. A comparison with some of the other techniques is presented. THis work was partially supported by the National Science Foundation under grants CDA-9100788 and CDA-9200371. 相似文献

9.

Loops skewing: The wavefront method revisited

Michael Wolfe 《International journal of parallel programming》1986,15(4):279-293

Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as are available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators. 相似文献

10.

COMPILE TIME PARTITIONING OF NESTED LOOP ITERATION SPACES WITH NON-UNIFORM DEPENDENCES*

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141

In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences these will have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. 相似文献

11.

Achieving full parallelism using multidimensional retiming

Passos N.L. Sha E.H.-M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(11):1150-1163

Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms 相似文献

12.

Optimizing FORTRAN Programs for Hierarchical Memory Parallel Processing Systems

下载免费PDF全文

Jin Guohua Chen Fujie 《计算机科学技术学报》1993,8(3):19-30

Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel with low run-time overhead is thus very important for achieving high performance in parallel processing systems.However,in parallel processing systems with caches or local memories in memory hierarchies,“thrashing problemmay”may arise whenever data move back and forth between the caches or local memories in different processors.Previous techniques can only deal with the rather simple cases with one linear function in the perfactly nested loop.In this paper,we present a parallel program optimizing technique called hybri loop interchange(HLI)for the cases with multiple linear functions and loop-carried data dependences in the nested loop.With HLI we can easily eliminate or reduce the thrashing phenomena without reucing the program parallelism. 相似文献

13.

Removal of redundant dependences in DOACROSS loops with constantdependences

Krothapalli V.P. Sadayappan P. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(3):281-289

An efficient algorithm to remove redundant dependences in simple loops with constant dependences is presented. Dependences constrain the parallel execution of programs and are typically enforced by synchronization instructions. The synchronization instructions represent a significant part of the overhead in the parallel execution of a program. Some program dependences are redundant because they are covered by other dependences. It is shown that unlike with single loops, in the case of nested loops, a particular dependence may be redundant at some iterations but not redundant at others, so that the redundancy of a dependence may not be uniform over the entire iteration space. A sufficient condition for the uniformity of redundancy in a doubly nested loop is developed 相似文献

14.

A Direct Approach for Finding Loop Transformation Matrices

下载免费PDF全文

LIN Hua LU Mi Jesse Z.FANG 《计算机科学技术学报》1996,11(3):237-256

Loop transformations,such as loop interchange,reversal and skewing,have been unified under linear matrix transformations.A legal transformation matrix is usually generated based upon distance vectors or direction vectors.Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors, Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors,on the other hand,may not contain useful information.We propose the use of linear equations or inequalities of distance vectors to approximate data dependence.This approach is advantageous since(1) many loops having no constant distance vectors have very simple equations of distance vectors;(2) these equations contain more information than direction vectors do,thus the chance of exploiting potential parallelism is improved.In general,the equations or inequalities that approximate the data dependence of a given nested loop is not unique,hence classification is discussed for the purpose of loop transformationEfficient algorithms are developed to generate all kinds of linear equations of distance vectors for a given nested loop.The issue of how to obtain a desired transformation matrix from those equations is also addressed. 相似文献

15.

A Case Study of Implementing Supernode Transformations

Johann Steinbrecher Cesar J. Philippidis Weijia Shang 《International journal of parallel programming》2014,42(2):320-342

Supernode transformation is a technique to decrease the communication overhead by partitioning and scheduling a loop nest to a multi-processor system. This is achieved by grouping a number of iterations in a perfectly nested loop with regular dependences as a $supernode$ . Previous work has been focusing on finding the optimal supernode size and shape as well as an optimal execution schedule for multi-processor systems with unbounded resources. This paper emphasizes on the actual implementation strategies of supernode transformations on multi-core systems with limited resources. Using an example, the longest common subsequence (LCS) problem, we present and compare three different multithreading implementations. A formula for the total execution time of each method is presented. The techniques are benchmarked on a 12-core and a 4-core machine. On the 12-core machine our first technique, which yields increased data locality, speeds up the unaltered sequential loop nest 16.7 times. Combining this technique with skewing the loop by changing the linear schedule scores a 42.6 speedup. A more sophisticated method that executes entire rows of the loop nest in one thread scores a 59.5 speedup. Concepts presented and discussed in this paper on the LCS problem serve as basic foundation for implementations at regular dependence algorithms. 相似文献

16.

Exploitation of parallelism to nested loops with dependence cycles

Weng-Long Chih-Ping Michael 《Journal of Systems Architecture》2004,50(12):729-742

In this paper, we analyze the recurrences from the breakability of the dependence links formed in general multi-statements in a nested loop. The major findings include: (1) A sin k variable renaming technique, which can reposition an undesired anti-dependence and/or output-dependence link, is capable of breaking an anti-dependence and/or output-dependence link. (2) For recurrences connected by only true dependences, a dynamic dependence concept and the derived technique are powerful in terms of parallelism exploitation. (3) By the employment of global dependence testing, link-breaking strategy, Tarjan’s depth-first search algorithm, and a topological sorting, an algorithm for resolving a general multi-statement recurrence in a nested loop is proposed. Experiments with benchmark cited from Vector loops showed that among 134 subroutines tested, 3 had their parallelism exploitation amended by our proposed method. That is, our offered algorithm increased the rate of parallelism exploitation of Vector loops by approximately 2.24%. 相似文献

17.

常用循环摘要的自动生成方法及其应用

翟娟汤震浩李彬赵建华李宣东《软件学报》2017,28(5):1051-1069

采用形式化方法证明软件的正确性是保障软件可靠性的有效方法,而对循环语句的分析与验证是形式化证明中的关键,对循环语句的处理一直是程序分析与验证中的一个难点问题.本文提出使用循环语句修改的内存和这些内存中存放的新值来描述循环语句的执行效果,并将该执行效果定义为循环摘要.同时,本文提出了一种自动生成循环摘要的方法,可以为操作常用数据结构的循环自动生成循环摘要,包含嵌套循环.此外,基于循环摘要,我们可以自动生成循环语句的规约,包括循环不变式、循环的前置条件以及循环的后置条件.我们已经实现了自动生成循环摘要以及循环规约的方法,并将它们集成到验证工具Accumulator中,实验表明,我们的方法可以有效地生成循环摘要,并生成多种类型的规约,从而辅助软件程序的形式化证明,提高验证的自动化程度和效率,减轻验证人员的负担. 相似文献

18.

EFFECTIVE PARALLELIZATION TECHNIQUES FOR LOOP NESTS WITH NON-UNIFORM DEPENDENCES

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1):37-64

The parallelism of loop nests with non-uniform dependences is difficult to extract and ineffectively explored by the existing parallelization schemes. In this paper, we propose new efficient techniques in extracting parallelism of loop nests with non-uniform dependences using their irregularity. By this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. These four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore parallelisms of special parallel patterns for nested loops with non-uniform dependences. The loop transformations used in uniform loops are also applied in non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also proposed an algorithm that combines above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development. 相似文献

19.

Towards the optimal synchronization granularity for dynamic scheduling of pipelined computations on heterogeneous computing systems

I. Riakiotakis F. M. Ciorba T. Andronikos G. Papakonstantinou A. T. Chronopoulos 《Concurrency and Computation》2012,24(18):2302-2327

Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross‐node communication. Synchronization requires fine‐tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self‐scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

20.

Using knowledge-based techniques on loop parallelization for parallelizing compilers

《Parallel Computing》1997,23(3):291-309

In this paper we propose a knowledge-based approach for solving data dependence testing and loop scheduling problems. A rule-based system, called the K-Test, is developed by repertory grid and attribute ording table to construct the knowledge base. The K-Test chooses an appropriate testing algorithm according to some features of the input program by using knowledge-based techniques, and then applies the resulting test to detect data dependences for loop parallelization. Another rule-based system, called the KPLS, is also proposed to be able to choose an appropriate scheduling by inferring some features of loops and assign parallel loops on multiprocessors for achieving high speedup. The experimental results show that the graceful speedup obtained by our compiler is obvious. 相似文献