首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
An introduction to a formal theory of dependence analysis   总被引:3,自引:1,他引:2  
Dependence analysis is a very important part of any vectorizing or concurrentizing compiler. This paper is an introduction to a formal theory of dependence analysis. The emphasis here is on rigor —the subject matter is not new. The program model is a Fortran do loop consisting of loops and assignment statements. We carefully explain the key dependence concepts and indicate through examples how the dependence tests work. These ideas and methods can be readily extended to more general programs.  相似文献   

2.
Current parallelizing compilers cannot identify a significant fraction of parallelizable loops because they have complex or statically insufficiently defined access patterns. As parallelizable loops arise frequently in practice, we advocate a novel framework for their identification: speculatively execute the loop as a doall and apply a fully parallel data dependence test to determine if it had any cross-iteration dependences; if the test fails, then the loop is reexecuted serially. Since, from our experience, a significant amount of the available parallelism in Fortran programs can be exploited by loops transformed through privatization and reduction parallelization, our methods can speculatively apply these transformations and then check their validity at run-time. Another important contribution of this paper is a novel method for reduction recognition which goes beyond syntactic pattern matching: it detects at run-time if the values stored in an array participate in a reduction operation, even if they are transferred through private variables and/or are affected by statically unpredictable control flow. We present experimental results on loops from the PERFECT Benchmarks, which substantiate our claim that these techniques can yield significant speedups which are often superior to those obtainable by inspector/executor methods  相似文献   

3.
循环是程序中蕴含并行性最为丰富的一种结构,因此成为并行化编译最主要的对象.但循环内的过程调用严重妨碍了循环的数据相关性分析,使得循环语句潜在的大量并行性得不到开发.本文提出的循环嵌入方法使部分含过程调用循环语句的并行化成为可能,对部分用其它过程间分析技术也能开发其并行性的这一类循环语句采用循环嵌入方法,并行化开销低,并且分析更精确.采用循环嵌入方法还可降低程序由于多次过程调用带来的调度开销.这一方法在作者开发的自动并行化编译系统AFT(automaticPortrantransformer)中得到了实现,对Spec92测试程序包的试验结果表明了本文提出的方法是行之有效的.  相似文献   

4.
Loop transformations,such as loop interchange,reversal and skewing,have been unified under linear matrix transformations.A legal transformation matrix is usually generated based upon distance vectors or direction vectors.Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors, Unfortunately,for some nested loops,distance vectors may not be computable and direction vectors,on the other hand,may not contain useful information.We propose the use of linear equations or inequalities of distance vectors to approximate data dependence.This approach is advantageous since(1) many loops having no constant distance vectors have very simple equations of distance vectors;(2) these equations contain more information than direction vectors do,thus the chance of exploiting potential parallelism is improved.In general,the equations or inequalities that approximate the data dependence of a given nested loop is not unique,hence classification is discussed for the purpose of loop transformationEfficient algorithms are developed to generate all kinds of linear equations of distance vectors for a given nested loop.The issue of how to obtain a desired transformation matrix from those equations is also addressed.  相似文献   

5.
The parallelism of loop nests with non-uniform dependences is difficult to extract and ineffectively explored by the existing parallelization schemes. In this paper, we propose new efficient techniques in extracting parallelism of loop nests with non-uniform dependences using their irregularity. By this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. These four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore parallelisms of special parallel patterns for nested loops with non-uniform dependences. The loop transformations used in uniform loops are also applied in non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also proposed an algorithm that combines above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development.  相似文献   

6.
Linear loop transformations and tiling are known to be very effective for enhancing locality of reference in perfectly-nested loops. However, they cannot be applied directly to imperfectly-nested loops. Some compilers attempt to convert imperfectly-nested loops into perfectly-nested loops by using statement sinking, loop fusion, etc., and then apply locality enhancing transformations to the resulting perfectly-nested loops, but the approaches used are fairly ad hoc and may fail even for simple programs. In this paper, we present a systematic approach for synthesizing transformations to enhance locality in imperfectly-nested loops. The key idea is to embed the iteration space of each statement into a special iteration space called the product space. The product space can be viewed as a perfectly-nested loop nest, so embedding generalizes techniques like statement sinking and loop fusion which are used in ad hoc ways in current compilers to produce perfectly-nested loops from imperfectly-nested ones. In contrast to these ad hoc techniques however, our embeddings are chosen carefully to enhance locality. The product space can itself be transformed to increase locality further, after which fully permutable loops can be tiled. The final code generation step may produce imperfectly-nested loops as output if that is desirable. We present experimental evidence for the effectiveness of this approach, using dense numerical linear algebra benchmarks, relaxation codes, and the tomcatv code from the SPEC benchmarks.  相似文献   

7.
A Fortran preprocessor is described which maps VAX Fortran into standard Fortran 77. Supported extensions include long variable names, DO WHILE loops, ENDDO loop terminations, Vax Fortran tab convention, INCLUDE file processing, and a MODULE facility to limit the access of external programs to subprogram and common block names. The software is available in ‘standard’ Pascal and Software Tools Pascal form.  相似文献   

8.
Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependencies, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general nonsingular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach.  相似文献   

9.
Tiwari (2004) proved that the termination problem of a class of linear programs (loops with linear loop conditions and updates) over the reals is decidable through Jordan forms and eigenvector computation. Braverman (2006) proved that it is also decidable over the integers. Following their work, we consider the termination problems of three more general classes of programs which are loops with linear updates and three kinds of polynomial loop conditions, i.e., strict constraints, non-strict constraints and both strict and non-strict constraints, respectively. First, we prove that the termination problems of such loops over the integers are all undecidable. Then, for each class we provide an algorithm to decide the termination of such programs over the reals. The algorithms are complete for those programs satisfying a property, Non-Zero Minimum.  相似文献   

10.
As ubiquitous computing becomes a reality, its applications are increasingly being used in business-critical, mission-critical and even in safety-critical, areas. Such systems must demonstrate an assured level of correctness. One approach to the exhaustive analysis of the behaviour of systems is formal verification, whereby each important requirement is logically assessed against all possible system behaviours. While formal verification is often used in safety analysis, it has rarely been used in the analysis of deployed pervasive applications. Without such formality it is difficult to establish that the system will exhibit the correct behaviours in response to its inputs and environment. In this paper, we show how model-checking techniques can be applied to analyse the probabilistic behaviour of pervasive systems. As a case study we apply this technique to an existing pervasive message-forwarding system, Scatterbox. Scatterbox incorporates many typical characteristics of pervasive systems, such as dependence on sensor reliability and dependence on context. We assess the dynamic temporal behaviour of the system, including the analysis of probabilistic elements, allowing us to verify formal requirements even in the presence of uncertainty in sensors. We also draw some tentative conclusions concerning the use of formal verification for pervasive computing in general.  相似文献   

11.
Pipelining is now a standard technique for increasing the speed of computers, particularly for floating-point arithmetic. Single-chip, pipelined floating-point functional units are available as “off the shelf” components. Addressing arithmetic can be done concurrently with floating-point operations to construct a fast processor that can exploit fine-grain parallelism. This paper describes a metric to estimate the optimal execution time of DO loops on particular processors. This metric is parameterized by the memory bandwidth and peak floating-point rate of the processor, as well as the length of the pipelines used in the functional units. Data dependence analysis provides information about the execution order constraints of the operations in the DO loop and is used to estimate the amount of pipeline interlock required by a loop. Several transformations are investigated to determine their impact on loops under this metric.  相似文献   

12.
GCC4.1数据依赖分析器的分析与改进   总被引:1,自引:0,他引:1       下载免费PDF全文
本文深入分析了GCC4.1的数据依赖分析器,针对它在分析Fortran程序的线性化数组访问时的不足,给出了两点改进:一是初步实现了一个非仿射数组下标依赖分析算法;二是提出并实现了分裂递归链的仿射数组下标数据依赖分析方法。实验表明,这两点改进增强了GCC4.1的数据依赖分析能力,为进行循环变换如循环交换提供了更准确的数据依赖信息。  相似文献   

13.
This paper addresses the problem of communication-free partition of iteration spaces and data spaces along hyperplanes. To finding more possible communication-free hyperplane partitions, we treat statements within a loop body as separate schedulable units. Instead of using the information about data dependence distance or direction vectors, our technique explicitly formulates array references as transformations from statement-iteration spaces to data spaces. Based on these transformations, the necessary and sufficient conditions for communication-free partition along hyperplanes to be feasible have been proposed. This approach can be applied to all programs with an imperfectly nested loop or sequences of imperfectly nested loops, whose array references are affine functions of outer loop indices or loop invariant variables. The proposed approach is more practical than existing methods in finding the data and computation distribution patterns that can cause the processor to execute fully-parallel on multicomputers without any interprocessor communication.  相似文献   

14.
We present the first approach to deduce lower bounds for (worst-case) runtime complexity of term rewrite systems (TRSs) automatically. Inferring lower runtime bounds is useful to detect bugs and to complement existing methods that compute upper complexity bounds. Our approach is based on two techniques: the induction technique generates suitable families of rewrite sequences and uses induction proofs to find a relation between the length of a rewrite sequence and the size of the first term in the sequence. The loop detection technique searches for “decreasing loops”. Decreasing loops generalize the notion of loops for TRSs, and allow us to detect families of rewrite sequences with linear, exponential, or infinite length. We implemented our approach in the tool AProVE and evaluated it by extensive experiments.  相似文献   

15.
Over the past 20 years, increases in processor speed have dramatically outstripped performance increases for standard memory chips. To bridge this gap, compilers must optimize applications so that data fetched into caches are reused before being displaced. Existing compiler techniques can efficiently optimize simple loop structures such as sequences of perfectly nested loops. However, on more complicated structures, existing techniques are either ineffective or require too much computation time to be practical for a commercial compiler. To optimize complex loop structures both effectively and inexpensively, we present a novel loop transformation, dependence hoisting, for optimizing arbitrarily nested loops, and an efficient framework that applies the new technique to aggressively optimize benchmarks for better locality. Our technique is as inexpensive as the traditional unimodular loop transformation techniques and thus can be incorporated into commercial compilers. In addition, it is highly effective and is able to block several linear algebra kernels containing highly challenging loop structures, in particular, Cholesky, QR, LU factorization without pivoting, and LU with partial pivoting. The automatic blocking of QR and pivoting LU is a notable achievement—to our knowledge, few previous compiler techniques, including theoretically more general loop transformation frameworks [1, 21, 23, 27, 31], were able to completely automate the blocking of these kernels, and none has produced the same blocking as produced by our technique. These results indicate that with low compilation cost, our technique can in practice match the effectiveness of much more expensive frameworks that are theoretically more powerful.  相似文献   

16.
Loop unwinding is a known technique for reducing loop overhead, exposing parallelism, and increasing the efficiency of pipelining. Traditional loop unwinding is limited to the innermost loop in a group of nested loops and the amount of unwinding either is fixed or must be specified by the user, on a case by case basis. In this paper we present a general technique for automatically unwinding multiply nested loops, explain its advantages over other transformation techniques, and illustrate its practical effectiveness. Loop Quantization could be beneficial by itself or coupled with other loop transformations (e.g., Do-across).  相似文献   

17.
Modulo scheduling theory can be applied successfully to overlap Fortran DO loops on pipelined computers issuing multiple operations per cycle both with and without special loop architectural support. This paper shows that a broader class of loops—REPEAT-UNTIL, WHILE, and loops with more than one exit, in which the trip count is not known beforehand—can also be overlapped efficiently on multiple-issue pipelined machines. The approach is described with respect to a specific machine model, but it can be extended to other models. Special features in the architecture, as well as compiler representations for accelerating these loop constructs, are discussed. Performance results are presented for a few select examples.An earlier version of this paper was presented at Supercomputing '90.  相似文献   

18.
针对应用自动单模变换的两大困难:如何自动找出使多重循环并行化的恰当的单模变换矩阵以及如何解决妨碍单模矩阵计算的非常数归约相关距离,提出了如何对给定常数距离矩阵,自动找出使循环并行化的恰当的单模变换矩阵的技术;然后提出将数组归约相关表示为最小常数距离向量,从而使存在归约相关的多重循环也能够应用自动单模变换技术,为自动单模变换技术走向实用化提供了理论依据.  相似文献   

19.
An approach to transformations for general loops in which dependence vectors represent precedence constraints on the iterations of a loop is presented. Therefore, dependences extracted from a loop nest must be lexicographically positive. This leads to a simple test for legality of compound transformations: any code transformation that leaves the dependences lexicographically positive is legal. The loop transformation theory is applied to the problem of maximizing the degree of coarse- or fine-grain parallelism in a loop nest. It is shown that the maximum degree of parallelism can be achieved by transforming the loops into a nest of coarsest fully permutable loop nests and wavefronting the fully permutable nests. The canonical form of coarsest fully permutable nests can be transformed mechanically to yield maximum degrees of coarse- and/or fine-grain parallelism. The efficient heuristics can find the maximum degrees of parallelism for loops whose nesting level is less than five  相似文献   

20.
Modulo scheduling is an efficient technique for exploiting instruction level parallelism in a variety of loops, resulting in high performance code but increased register requirements. We present an approach that schedules the loop operations for minimum register requirements, given a modulo reservation table. Our method determines optimal register requirements for machines with finite resources and for general dependence graphs. Measurements on a benchmark suite of 1327 loops from the Perfect Club, SPEC-89, and the Livermore Fortran Kernels show that the register requirements decrease by 24.8% on average when applying the optimal stage scheduler to the MRT-schedules of a register-insensitive modulo scheduler.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号