Lamport's parallelization algorithm (cf. [7]) is generalized to a broader class of loops, and the complexity of the transformation process is estimated. It is shown that every loop can be parallelized using methods similar to those in [7]; moreover, the resulting loops also have the property that all their inner loops are devoid of data dependencies, and so are fully parallelizable. Unfortunately, without restricting the nature of the loop to be parallelized, the negative solution to Hilbert's tenth problem (cf. [3]) can be applied to show that the parallelizing transformations are not computable. The class of affine loops was therefore introduced. This class is more general than that considered by Lamport, and it is shown that parallelizing transformations for affine loops are computable. In general, however, the complexity estimates for finding such loops suggest that the parallelization procedure will take longer than executing the original loop sequentially. It is further shown that, if the loop satisfies an additional nondegeneracy condition, then the loop can be efficiently transformed.

Finally, although more generally applicable, these methods are best applied to vectorization problems.

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well-behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (elementwise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and ... An abstract of this paper has been published in Ref. 1. Research supported in part by Army contract #DABT63-92-C-0033. This work is not necessarily representative of the positions or policies of the Army or the Government. Research supported in part by Intel and NASA Graduate Fellowships. Research supported in part by an AT&T Bell Laboratories Graduate Fellowship and by the International Computer Science Institute, Berkeley, California.
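The inspector/scheduler split described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the inspector here is sequential (the paper's is fully parallel), and the names `inspect_wavefronts` and `wavefronts` are hypothetical. Assuming each iteration's read and write sets are known at run time, every iteration is assigned to the earliest stage that respects flow, anti, and output dependences, and iterations in the same stage can then run concurrently.

```python
from collections import defaultdict


def inspect_wavefronts(reads, writes):
    """Return the stage (wavefront number) of each iteration.

    reads[i] / writes[i] list the elements iteration i reads / writes.
    Assumption: this sequential pass stands in for the inspector.
    """
    last_write = {}  # element -> stage of its most recent writer
    last_read = {}   # element -> latest stage of any reader so far
    stage = []
    for r_set, w_set in zip(reads, writes):
        wf = 0
        for e in r_set:  # flow dependence: read after an earlier write
            wf = max(wf, last_write.get(e, -1) + 1)
        for e in w_set:  # output and anti dependences
            wf = max(wf, last_write.get(e, -1) + 1, last_read.get(e, -1) + 1)
        stage.append(wf)
        for e in r_set:
            last_read[e] = max(last_read.get(e, -1), wf)
        for e in w_set:
            last_write[e] = wf
    return stage


def wavefronts(stage):
    """Group iteration numbers by stage, in execution order."""
    groups = defaultdict(list)
    for i, wf in enumerate(stage):
        groups[wf].append(i)
    return [groups[k] for k in sorted(groups)]


# Loop body a[w[i]] = a[r[i]] + 1 with run-time index arrays w and r.
w = [0, 1, 2, 0]
r = [4, 0, 5, 1]
stage = inspect_wavefronts([[x] for x in r], [[x] for x in w])
schedule = wavefronts(stage)
```

For these index arrays, iterations 0 and 2 touch disjoint data and share the first stage, while iterations 1 and 3 each wait for an earlier write.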

This paper presents the results of an experiment to measure empirically the remaining opportunities for exploiting loop-level parallelism that are missed by the Stanford SUIF compiler, a state-of-the-art automatic parallelization system targeting shared-memory multiprocessor architectures. For the purposes of this experiment, we have developed a run-time parallelization test called the Extended Lazy Privatizing Doall (ELPD) test, which is able to simultaneously test multiple loops in a loop nest. The ELPD test identifies a specific type of parallelism where each iteration of the loop being tested accesses independent data, possibly by making some of the data private to each processor. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input. The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results.
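The kind of parallelism the test looks for (iterations that access independent data, possibly after privatization) can be sketched as follows. This is a simplified illustration and not the ELPD test itself; the function name `doall_test` and the trace format are assumptions. An element that is read in one iteration before being written there, and is touched by any other iteration that writes it, carries a cross-iteration dependence. An element written by several iterations but never read before its write can be made private.

```python
def doall_test(trace):
    """Decide whether a loop could run as a doall for one recorded input.

    trace[i] is the ordered list of (op, element) accesses of iteration i,
    with op "R" or "W". Returns (is_parallel, privatizable_elements).
    Assumption: a simplified privatizing doall check, not the ELPD test.
    """
    writers = {}  # element -> iterations that write it
    exposed = {}  # element -> iterations that read it before writing it
    for i, accesses in enumerate(trace):
        written_here = set()
        for op, elem in accesses:
            if op == "W":
                written_here.add(elem)
                writers.setdefault(elem, set()).add(i)
            elif elem not in written_here:
                exposed.setdefault(elem, set()).add(i)

    private = set()
    for elem, ws in writers.items():
        readers = exposed.get(elem, set())
        if readers:
            # A value flows between different iterations: not a doall.
            if len(ws | readers) > 1:
                return False, set()
        elif len(ws) > 1:
            # Written before use in every iteration: privatize it
            # (the final value would need a last-value copy-out).
            private.add(elem)
    return True, private


# Temporary "t" is written then read in each iteration; a[i] is distinct.
ok_trace = [
    [("W", "t"), ("R", "t"), ("W", ("a", 0))],
    [("W", "t"), ("R", "t"), ("W", ("a", 1))],
]
# Iteration 1 reads what iteration 0 wrote.
bad_trace = [[("W", "x")], [("R", "x")]]
```

Such a check is only valid for the input that produced the trace, which matches the abstract's wording "for the given program input".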

Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.
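Statement sinking, one of the ad hoc techniques that the embedding is said to generalize, can be shown on a small example. The sketch below is illustrative only and does not implement the product-space approach; the function names are hypothetical. The initialization that sits between the two loop headers is moved into the inner loop under a guard, which leaves a perfectly-nested loop.

```python
def row_sums_imperfect(a):
    """Imperfectly-nested: one statement sits outside the inner loop."""
    sums = [0] * len(a)
    for i in range(len(a)):
        sums[i] = 0  # belongs to the i loop only
        for j in range(len(a[i])):
            sums[i] += a[i][j]
    return sums


def row_sums_sunk(a):
    """Perfectly-nested after sinking the initialization into the j loop.

    The guard runs the sunk statement once per i. If a row were empty the
    statement would never run; the result is still correct here only
    because sums is pre-filled with zeros.
    """
    sums = [0] * len(a)
    for i in range(len(a)):
        for j in range(len(a[i])):
            if j == 0:
                sums[i] = 0
            sums[i] += a[i][j]
    return sums


matrix = [[1, 2], [3, 4, 5]]
```

Once the nest is perfect, transformations such as interchange or tiling can be considered for the whole body, which is the situation the product space is meant to create systematically.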

In this paper we propose a knowledge-based approach for solving data dependence testing and loop scheduling problems. A rule-based system, called the K-Test, is developed using a repertory grid and an attribute ordering table to construct the knowledge base. The K-Test chooses an appropriate testing algorithm according to some features of the input program by using knowledge-based techniques, and then applies the resulting test to detect data dependences for loop parallelization. Another rule-based system, called the KPLS, is also proposed; it chooses an appropriate scheduling strategy by inferring some features of the loops and assigns parallel loops to multiprocessors to achieve high speedup. The experimental results show that our compiler obtains a clear speedup.

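Yang Xuelin, Yu Meng, Chen Daoxu, Xie Li. Journal of Software (《软件学报》), 2002, 13(8): 1718-1722
For loops of irregular sequential programs from which an inspector loop can be extracted, this paper addresses the analysis of dynamic dependences in a distributed environment and proposes a dynamic analysis algorithm based on the inspector/executor model. Its contributions are: (1) the algorithm can execute in parallel on a distributed system; (2) it directly analyzes loops with copy-in and last-value assignment operations; (3) it gives a method for parallelizing the loop; (4) it does not require the loop to be fully parallel, and also supports parallel execution of some partially parallel loops. Theoretical analysis and experiments show that, with a suitable number of processors, a good speedup is obtained when the loop can be parallelized, and the overhead added to sequential execution is small when it cannot. The algorithm thus provides a new means of exploiting more loop parallelism in distributed environments.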

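Traditional parallelizing compilation techniques can perform dependence analysis at compile time and parallelize loop programs effectively, but they can do nothing about parallelism that only becomes apparent at run time. Parallelizing compilation must therefore use run-time dependence analysis to exploit as much loop-level parallelism as possible. This paper proposes affine dependence relations, which eliminate loop-iteration dependences, and, based on the idea of speculative parallelism, proposes the SPAD method. Case analysis shows that SPAD is effective. Compared with the LRPD and SPNT methods, SPAD makes important improvements and is therefore a more general speculative parallelization scheme.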

Chemical processes with multiloop control configurations have a significant amount of control loop interaction due to tight mass and heat integration. A change in the set point and/or controller parameters of one control loop may affect the variables of other loops. The presence of loop interactions in a process plant can cause significant quality and production losses. It is challenging to measure the degree of interaction between control loops and rank the loops according to the extent of interactions. This paper presents two data-driven techniques to quantify control loop interactions and rank the loops according to their importance. In the first approach, a novel method based on canonical correlation analysis has been developed to calculate interaction among the loops, and normalization is then done with respect to the maximum canonical correlation to determine the rank of the loops. In the second approach, two indices have been developed using integral of absolute or squared error criteria to quantify loop interaction and determine the rank of the loops. Both methods require step test data of the plant. Simulation and experimental results show the validity and efficacy of the proposed methods.
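For reference, the two error criteria named above have the following standard textbook definitions. The symbols are assumptions: $e(t)$ is the control error of a loop, $r(t)$ its set point, $y(t)$ its measured output, and $T$ the duration of the step test. How the paper combines these criteria into its two interaction indices is not stated in the abstract.

```latex
\mathrm{IAE} = \int_{0}^{T} \lvert e(t) \rvert \, dt,
\qquad
\mathrm{ISE} = \int_{0}^{T} e(t)^{2} \, dt,
\qquad
e(t) = r(t) - y(t)
```

ISE penalizes large deviations more heavily than IAE because the error is squared before integration.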

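The loop portions of an application are often the main workload of compute-intensive applications. With the emergence of FPGA-based reconfigurable computing systems, static loop analysis can no longer meet the requirement that a reconfigurable computing system perform specific optimizations according to the program's current behavior. To address the problem that existing dynamic program analysis techniques cannot obtain dynamic loop information directly, a loop information analysis algorithm is proposed. It identifies loops in the control flow graph from dominance relations, derives key information such as the loop invocation count, the average loop iteration count, and the loop running time from edge profiling results, and is implemented on the LLVM (Low Level Virtual Machine) platform. Experimental results show that the algorithm automatically identifies all loop structures and analyzes the loop portions accurately. The analysis results provide fairly comprehensive and precise information for selecting the compute-intensive loops to be accelerated on a reconfigurable computing system, and are valuable for programmers studying hardware/software partitioning in reconfigurable systems.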
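The two steps named above (finding loops from dominance relations, then deriving counts from an edge profile) can be sketched on a plain dictionary CFG. This is an illustrative assumption, not the paper's LLVM implementation: the functions `dominators` and `loop_stats` are hypothetical, and the ratio of back-edge traversals to loop entries is used as a simple proxy for the average iteration count.

```python
def dominators(cfg, entry):
    """Iterative dominator sets; cfg maps every node to its successors."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for u, succs in cfg.items():
        for v in succs:
            preds[v].add(u)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def loop_stats(cfg, entry, edge_count):
    """Per loop header: back-edge count, invocations, and their ratio."""
    dom = dominators(cfg, entry)
    stats = {}
    for u, succs in cfg.items():
        for h in succs:
            if h in dom[u]:  # u -> h is a back edge, so h is a loop header
                s = stats.setdefault(h, {"back": 0})
                s["back"] += edge_count.get((u, h), 0)
    for h, s in stats.items():
        # Entries into the loop: edges into h that are not back edges.
        entries = sum(c for (u, v), c in edge_count.items()
                      if v == h and h not in dom[u])
        s["invocations"] = entries
        s["avg_back_edges"] = s["back"] / entries if entries else 0.0
    return stats


cfg = {"entry": ["header"], "header": ["body", "exit"],
       "body": ["header"], "exit": []}
profile = {("entry", "header"): 2, ("header", "body"): 10,
           ("body", "header"): 10, ("header", "exit"): 2}
result = loop_stats(cfg, "entry", profile)
```

In this example the loop is entered twice and its back edge is taken ten times, so the body runs five times per invocation on average.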

Barrier MIMDs are asynchronous multiple instruction stream, multiple data stream architectures capable of parallel execution of variable execution time instructions and arbitrary control flow (e.g., while loops and calls); however, they differ from conventional MIMDs in that the need for run-time synchronization is significantly reduced. The authors consider the problem of scheduling nested loop structures on a barrier MIMD. The basic approach employs loop coalescing, a technique for transforming a multiply-nested loop into a single loop. Loop coalescing is extended to nested triangular loops, in which inner loop bounds are functions of outer loop indices. In addition, a more efficient scheme to generate the original loop indices from the coalesced index is proposed for the case of constant loop bounds. These results are general, and can be applied to extend previous work using loop coalescing techniques. The authors concentrate on using loop coalescing for scheduling barrier MIMDs, and show how previous work in loop transformations and linear scheduling theory can be applied to this problem.
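Loop coalescing replaces a loop nest by a single loop over one index and recovers the original indices from it. The sketch below uses textbook index-recovery formulas as an assumption; the abstract does not give the authors' own, more efficient scheme. The function names `rect_indices` and `tri_indices` are hypothetical.

```python
import math


def rect_indices(k, m):
    """Recover (i, j) from coalesced index k for the rectangular nest
    `for i in range(n): for j in range(m)`."""
    return divmod(k, m)


def tri_indices(k):
    """Recover (i, j) from coalesced index k for the triangular nest
    `for i in range(n): for j in range(i + 1)`.

    Row i starts at k = i * (i + 1) // 2, so i is the largest integer
    with i * (i + 1) // 2 <= k; math.isqrt keeps the arithmetic exact.
    """
    i = (math.isqrt(8 * k + 1) - 1) // 2
    j = k - i * (i + 1) // 2
    return i, j


# A coalesced triangular loop over n = 4 rows: one flat index k replaces
# the two nested indices, so iterations can be dealt out to processors.
n = 4
visited = [tri_indices(k) for k in range(n * (n + 1) // 2)]
```

With a single flat index, the iteration space can be divided evenly among processors even when the inner loop bounds vary with the outer index.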

Speculative multithreading (SpMT) is a thread-level automatic parallelization technique, which partitions sequential programs into multiple threads to be executed in parallel. This paper presents different thread partitioning strategies for nonloops and loops. For nonloops, we propose a cost estimation based on the combined run-time effects of various speculation factors to predict the resulting performance of candidate threads and guide the thread partitioning. For loops, we parallelize all the profitable loops that can potentially offer additional performance benefits by multilevel spawning in loop bodies, loop iterations, and inner loops. Then we select a proper thread boundary located in front of the loop branch instruction to reduce invalid spawned threads that waste core resources. Experimental results show that the proposed approach can obtain a significant increase in speedup, and that the Olden benchmarks reach a performance improvement of 6.62% on average.

In a parallelizable task model, a task can be parallelized and the component tasks can be executed concurrently on multiple processors. We use this parallelism in tasks to meet their deadlines and also obtain better processor utilisation compared to non-parallelized tasks. Non-preemptive parallelizable task scheduling combines the advantages of higher schedulability and lower scheduling overhead offered by the preemptive and non-preemptive task scheduling models, respectively. We propose a new approach to maximize the benefits from task parallelization. It involves checking the schedulability of periodic tasks (if necessary, by parallelizing them) off-line and run-time scheduling of the schedulable periodic tasks together with dynamically arriving aperiodic tasks. To avoid the run-time anomaly that may occur when the actual computation time of a task is less than its worst case computation time, we propose efficient run-time mechanisms. We have carried out extensive simulation to study the effectiveness of the proposed approach by comparing the schedulability offered by it with that of dynamic scheduling using Earliest Deadline First (EDF), and by comparing its storage efficiency with that of the static table-driven approach. We found that the schedulability offered by parallelizable task scheduling is always higher than that of the EDF algorithm for a wide variety of task parameters and the storage overhead incurred by it is less than 3.6% of the static table-driven approach even under heavy task loads.

We present a novel Hybrid Analysis (HA) technology which can efficiently and seamlessly integrate all static and run-time analysis of memory references into a single framework that is capable of performing all data dependence analysis and can generate necessary information for most associated memory related optimizations. We use HA to perform automatic parallelization by extracting run-time assertions from any loop and generating appropriate run-time tests that range from a low cost scalar comparison to a full, reference by reference run-time analysis. Moreover, we can order the run-time tests in increasing order of complexity (overhead) and thus risk the minimum necessary overhead. We accomplish this by both extending compile time IP analysis techniques and by incorporating speculative run-time techniques when necessary. Our solution is to bridge free compile time techniques with exhaustive run-time techniques through a continuum of simple to complex solutions. We have implemented our framework in the Polaris compiler by introducing an innovative intermediate representation called RT_LMAD and a run-time library that can operate on it. Based on the experimental results obtained to date we hope to automatically parallelize most and possibly all PERFECT codes, a significant accomplishment.

Zero is an experimental statically typed, fully object-oriented reflective programming language. Reflective features cover introspection as well as structural and behavioural reflection. The reflective facilities include safe method and class replacements and detailed modification of methods. These enable Zero programs to adapt quickly to run-time requirements. Behavioural reflection is realised using handlers (hooks), which may be attached to all language constructs based on closures. Zero provides an efficient static typing system with run-time extensions. Methods are first class values and are represented as objects when such representation is required. By using such representation, Zero provides elegant use of statically typed higher-order methods.

By following a rely-guarantee style of reasoning, we present novel termination and cost analyses for concurrent programs that, in order to prove termination or infer the cost of a considered loop: (1) infer the termination/cost of each loop as if it were a sequential one, imposing assertions on how shared data is modified concurrently; and then (2) prove that these assertions cannot be violated infinitely many times and, for cost analysis, infer how many times they are violated. At the core of the analysis, we use a may-happen-in-parallel analysis to restrict the set of program points whose execution can interleave. Interestingly, the same kind of reasoning can be applied to prove termination and infer upper bounds on the number of iterations of loops with concurrent interleavings. To the best of our knowledge, this is the first method to automatically bound the cost of this kind of loop. We have implemented our analysis for an actor-based language, and shown its accuracy and efficiency by applying it to several typical applications for concurrent programs and to an industrial case study.

The author presents strategies for static loop decomposition and scheduling as well as computer-assisted run-time scheduling that take into account, in addition to the cost of performing operations, the overhead costs associated with a decomposition and schedule. An algorithm for static decomposition of multidimensional loops based on the operation execution costs, communication costs, and synchronization costs is discussed. Synchronization instructions are introduced to ensure correct program execution following program decomposition. An algorithm for determining the explicit synchronization instruction that should be introduced to ensure correct execution of a program with arbitrarily nested loops is presented. Techniques for reducing run-time scheduling and communication and synchronization costs due to self-scheduling of multidimensional loops are also presented. Experiments performed on the Encore multiprocessor system demonstrate that the techniques developed can reduce overhead costs.

Closed geodesics, or geodesic loops, are crucial to the study of differential topology and differential geometry. Although the existence and properties of closed geodesics on smooth surfaces have been widely studied in the mathematics community, relatively little progress has been made on how to compute them on polygonal surfaces. Most existing algorithms simply consider the mesh as a graph, so the resultant loops are restricted to mesh edges and are far from the actual geodesics. This paper is the first to prove the existence and uniqueness of a geodesic loop restricted to a closed face sequence; it also contributes an efficient algorithm that iteratively evolves an initial closed path on a given mesh into an exact geodesic loop within finitely many steps. Our proposed algorithm has O(k) space complexity and (experimentally) O(mk) time complexity, where m is the number of vertices in the region bounded by the initial loop and the resultant geodesic loop, and k is the average number of edges in the edge sequences that the evolving loop passes through. In contrast to the existing geodesic curvature flow methods, which compute an approximate geodesic loop within a predefined threshold, our method is exact and can be applied directly to triangular meshes without needing to solve any differential equation with a numerical solver; it can run at interactive speed, e.g., on the order of milliseconds, for a mesh with around 50K vertices, and hence significantly outperforms existing algorithms. In fact, our algorithm could run at interactive speed even for larger meshes. Besides the complexity of the input mesh, the geometric shape could also affect the number of evolving steps, i.e., the performance. We motivate our algorithm with an interactive shape segmentation example shown later in the paper.

