Similar Documents
1.
Queue processors are a viable alternative for high-performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor reference their operands implicitly, which makes programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in level order to expose natural parallelism and for calculating the relative offset values with which instructions access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code, in terms of both parallelism and code size, when compared with code produced by a traditional compiler for a RISC processor.
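As a concrete illustration of the execution model such a compiler targets, the following C sketch (invented for illustration; it is not the paper's compiler or target ISA) evaluates (a + b) * (c - d) from a level-order instruction sequence: loads enqueue operands, and each arithmetic instruction implicitly dequeues its operands from the front of the operand queue and enqueues its result, so no instruction names a register.

    #include <stdio.h>

    /* A tiny operand queue: instructions take operands from the front
     * and append results to the back, so operands are implicit. */
    static double q[16];
    static int head = 0, tail = 0;

    static void   enq(double v) { q[tail++] = v; }
    static double deq(void)     { return q[head++]; }

    static void ld(double v) { enq(v); }                               /* leaf loads        */
    static void add(void) { double x = deq(), y = deq(); enq(x + y); } /* consumes 2, adds 1 */
    static void sub(void) { double x = deq(), y = deq(); enq(x - y); }
    static void mul(void) { double x = deq(), y = deq(); enq(x * y); }

    int main(void) {
        /* Level-order schedule for (a + b) * (c - d), deepest level first. */
        ld(1.0); ld(2.0); ld(5.0); ld(3.0);    /* a, b, c, d             */
        add();   sub();                        /* enqueue a+b, then c-d  */
        mul();                                 /* enqueue final result   */
        printf("%g\n", deq());                 /* prints 6               */
        return 0;
    }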

2.
3.
An automatic vectorizing compiler called V-Pascal is described in detail. The compiler has been designed and implemented with a view to vectorizing Pascal source programs. Using the mechanism of vector indirect addressing, it reduces multiply nested for loops to equivalent single loops, which are then executed in vector mode with sufficiently long vector lengths. The D matrix, an adjacency matrix giving dependences between intermediate-code nodes, plays an important role in the V-Pascal compiler. It is demonstrated that, in some cases, the V-Pascal compiler yields object code that runs faster than the Fortran counterpart. This paper mainly presents the basic constituents of Version 1 of the V-Pascal compiler. Version 2 includes higher functions such as vectorization of while-do loops and recursive procedures, vectorization of character-string manipulations and relational database operations (written in Pascal), and automatic parallel decomposition for multiprocessor environments.
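The following C sketch (a generic illustration, not V-Pascal output) shows the kind of loop collapsing the abstract describes: a triangular two-level nest is flattened into one long loop driven by precomputed index vectors, so the arrays are reached through vector indirect addressing (gather/scatter) and the vector length becomes N(N+1)/2 instead of at most N.

    #include <stdio.h>
    #define N 100

    int main(void) {
        static double a[N][N], b[N][N];
        static int row[N * (N + 1) / 2], col[N * (N + 1) / 2];
        int k = 0;

        /* Build index vectors once: they flatten the triangular iteration
         * space {(i, j) : j <= i} into a single dimension. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j <= i; j++) {
                row[k] = i;
                col[k] = j;
                k++;
            }

        /* Single collapsed loop: one long, vectorizable loop that accesses
         * the 2-D arrays indirectly through the index vectors. */
        for (int m = 0; m < k; m++)
            a[row[m]][col[m]] = 2.0 * b[row[m]][col[m]] + 1.0;

        printf("updated %d elements\n", k);
        return 0;
    }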

4.
Parallel languages allow the programmer to express parallelism at a high level. The management of parallelism and the generation of interprocessor communication is left to the compiler and the runtime system. This approach to parallel programming is particularly attractive if a suitable widely accepted parallel language is available. High Performance Fortran (HPF) has emerged as the first popular machine independent parallel language, and remarkable progress has been made towards compiling HPF efficiently. However, the performance of HPF programs is often poor and unpredictable, and obtaining adequate performance is a major stumbling block that must be overcome if HPF is to gain widespread acceptance. The programmer is often in the dark about how to improve the performance of an HPF program since poor performance can be attributed to a variety of reasons, including poor choice of algorithm, limited use of parallelism, or an inefficient data mapping. This paper presents a profiling tool that allows the programmer to identify the regions of the program that execute inefficiently, and to focus on the potential causes of poor performance. The central idea is to distinguish the code that is executing efficiently from the code that is executing poorly. Efficient code uses all processors of a parallel system to make progress, while inefficient code causes processors to wait, execute replicated code, idle, communicate, or perform compiler bookkeeping. We designate the latter code as non-scalable, since adding more processors generally does not lead to improved performance for such code. By analogy, the former code is called scalable. The tool presented here separates a program into scalable and non-scalable components and identifies the causes of non-scalability of different components. We show that compiler information is the key to dividing the execution times into logical categories that are meaningful to the programmer. We present the design and implementation of a profiler that is integrated with Fx, a compiler for a variant of HPF. The paper includes two examples that demonstrate how the data reported by the profiler are used to identify and resolve performance bugs in parallel programs. © 1997 John Wiley & Sons, Ltd.

5.
Linear recurrences are the most important class of nonvectorizable problems in typical scientific/engineering calculations. This work discusses high-performance methods for solving first-order linear recurrences on a vector computer, investigates automatic transformations, and develops compiling techniques for first-order linear recurrence problems. The results show that the improved vector code generated by the vectorizing compiler on the HITAC S-820 supercomputer runs at 150 MFLOPS (million floating-point operations per second) for moderate loop lengths (>1000) and over 200 MFLOPS for long loop lengths (>10000). Overall performance improvements of 69% on the 14 Lawrence Livermore Loops and 25% on the 24 Lawrence Livermore Loops, as measured by the harmonic mean, are also attained.
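To make the problem concrete, the sketch below illustrates the classic recursive-doubling formulation of a first-order linear recurrence x[i] = a[i]*x[i-1] + d[i] (an illustration of one standard technique, not necessarily the exact method used on the HITAC S-820): after each pass the remaining dependence distance doubles, and each pass reads only the previous arrays, so it carries no loop dependence and can run in vector mode.

    #include <stdio.h>
    #define N 8

    int main(void) {
        double a[N] = {0, 2, 3, 1, 2, 1, 3, 2};   /* a[0] is unused          */
        double d[N] = {1, 1, 1, 1, 1, 1, 1, 1};
        double A[N], D[N], An[N], Dn[N], x[N];

        /* Serial reference: x[i] = a[i]*x[i-1] + d[i], x[0] = d[0]. */
        x[0] = d[0];
        for (int i = 1; i < N; i++) x[i] = a[i] * x[i - 1] + d[i];

        /* Recursive doubling: after ceil(log2 N) vectorizable passes,
         * D[i] holds x[i]; A[i] tracks the remaining dependence on x[i-k]. */
        for (int i = 0; i < N; i++) { A[i] = a[i]; D[i] = d[i]; }
        A[0] = 0.0;                                /* x[0] has no predecessor */
        for (int k = 1; k < N; k *= 2) {
            for (int i = 0; i < N; i++) {          /* no carried dependence   */
                if (i >= k) {
                    Dn[i] = D[i] + A[i] * D[i - k];
                    An[i] = A[i] * A[i - k];
                } else {
                    Dn[i] = D[i];
                    An[i] = A[i];
                }
            }
            for (int i = 0; i < N; i++) { A[i] = An[i]; D[i] = Dn[i]; }
        }

        for (int i = 0; i < N; i++)
            printf("x[%d] = %g (reference %g)\n", i, D[i], x[i]);
        return 0;
    }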

6.
In this paper, we give an overview of the results of the CRAFT optimising compiler project (Fortran 90/HPF subset compilers). We start by describing the theoretical framework within which we designed program transformations for the optimization of inter- and intra-procedural data motion, as well as the optimizations for parallel loops; we then describe the implementation of the CRAFT compilers for Thinking Machines' CM-2 and CM-5. We report results from experiments on the Connection Machine CM-5, the IBM SP-2 and a network of UltraSparc workstations. The results demonstrate that these optimizations can achieve significant object code performance improvement. Copyright © 1999 John Wiley & Sons, Ltd.

7.
Many scientific codes can achieve significant performance improvement when executed on a computer equipped with a vector processor. Vector constructs in source code should be recognized by a vectorizing compiler or preprocessor. This paper discusses, from a general point of view, how a vectorizing compiler/preprocessor can be evaluated. The areas discussed include data dependence analysis, IF loop analysis, nested loops, loop interchanging, loop collapsing, indirect addressing, use of temporary storage, and order of arithmetic. The ideas presented are based on vectorization of over a million lines of production codes and an extensive test suite developed to evaluate preprocessors under varying degrees of code complexity. Areas for future research are also discussed.

8.
Translation validation is an approach for validating the output of optimizing compilers. Rather than verifying the compiler itself, translation validation mandates that every run of the compiler generate a formal proof that the produced target code is a correct implementation of the source code. Speculative loop optimizations are aggressive optimizations that are only correct under certain conditions which cannot be checked at compile time. We propose using an automatic theorem prover together with the translation validation framework to automatically generate run-time tests for such speculative optimizations. This run-time validation approach must not only detect the conditions under which an optimization generates incorrect code, but also provide a way to recover from the optimization without aborting the program or producing an incorrect result. In this paper, we apply the run-time validation technique to a class of speculative reordering transformations and give some initial results for run-time tests generated by the theorem prover CVC.
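A minimal C sketch of the kind of guarded code such a scheme could emit (the function and the specific test are invented for illustration, not output of the CVC-based tool): the speculatively reordered loop runs only when a run-time non-overlap test holds, and the original loop serves as the recovery path, so the program never aborts or produces a wrong result.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: reversing the copy order is only correct when
     * src and dst do not overlap, which cannot be proven at compile time. */
    void copy(double *dst, const double *src, size_t n) {
        if ((uintptr_t)(dst + n) <= (uintptr_t)src ||
            (uintptr_t)(src + n) <= (uintptr_t)dst) {
            /* Run-time test passed: use the speculatively reordered loop. */
            for (size_t i = n; i-- > 0; )
                dst[i] = src[i];
        } else {
            /* Recovery path: fall back to the original, unoptimized loop. */
            for (size_t i = 0; i < n; i++)
                dst[i] = src[i];
        }
    }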

9.
In this paper, we describe our experience of creating an OpenMP implementation of the SPICE3 circuit simulator. Given the irregular patterns of access to dynamic data structures in the SPICE code, parallelization using the current standard OpenMP directives is impossible without major rewriting of the original program. The aim of this work is to present a case study showing the development of a shared-memory parallel code with minimum effort. We present two implementations: one with minimal code modification, and one without modification to the original SPICE3 program using Intel’s taskq construct. We also discuss the results of the case study in terms of what future compiler tools may be needed to help OpenMP application developers with similar porting goals. Our experiments with SPICE3, based on an SRAM model simulation, used code built with the Sun compiler running on a 750 MHz UltraSPARC-III SunFire V880, and with the Intel icc compiler running on both a four-CPU IBM Itanium machine and a two-processor Intel Xeon machine. The results are promising.
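For readers unfamiliar with the work-queuing style the authors rely on, the sketch below shows the general pattern in C with the standard OpenMP task construct (Intel's taskq was a pre-standard equivalent); the linked-list layout and the evaluate_device function are hypothetical stand-ins for SPICE3's per-device model evaluation.

    #include <omp.h>

    typedef struct device {
        struct device *next;
        double conductance;            /* hypothetical per-device state      */
    } device;

    static void evaluate_device(device *d) {
        d->conductance *= 1.0001;      /* placeholder for model evaluation   */
    }

    void load_all_devices(device *head) {
        #pragma omp parallel
        #pragma omp single
        {
            /* One thread walks the irregular linked structure and spawns a
             * task per device; the whole team executes the tasks. */
            for (device *d = head; d != NULL; d = d->next) {
                #pragma omp task firstprivate(d)
                evaluate_device(d);
            }
        }   /* implicit barrier: all tasks complete before the region ends */
    }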

10.
Boyle J.M., Resler R.D., Winter V.L. Computer, 1999, 32(5): 65-73
As our society becomes more technologically complex, computer systems are finding an alarming number of uses in safety-critical applications. In many such systems, the software component's reliability is essential to the system's safe operation, so it becomes natural to ask, “How can software be made to behave correctly when executed?” Using program transformations to produce trusted software simplifies verification. Program transformations use proven laws to manipulate programs in a manner analogous to algebraic transformations. The authors sketch how a formal method based on program transformations can be used to construct a verified compiler. Such a compiler has been proved to correctly compile any correct program into assembly language. While the compiler itself may not execute efficiently (after all, you need only use the verified compiler the last time you compile a program), the transformational approach should enable the verified compiler to produce efficient assembly code.

11.
Kasi Anantha, Fred Long. Software, 1990, 20(6): 537-554
There are two principal methods used to exploit the parallelism available on a parallel machine: the program to be executed can be optimized by hand, or the program can be automatically converted to parallel machine code by a compiler. The first method usually derives parallelism at the procedure level; a parallel program is written in a high-level language and typically has various modules executing in parallel. By contrast, the compiler methodically transforms the program into parallel code using various transformations, such as code movement. The automatic conversion of a program to parallel code is called compaction or parallelization. This paper describes the evolution of a new compaction program and presents a new algorithm for determining legal code movements. A simulator of the target architecture was used to estimate the execution times of a sample suite of programs before and after compaction. The results verify that substantial advantages arise from applying this compaction technique.

12.
Atomic operations are a key primitive in parallel computing systems. The standard implementation mechanism for atomic operations uses mutual exclusion locks. In an object-based programming system, the natural granularity is to give each object its own lock. Each operation can then make its execution atomic by acquiring and releasing the lock for the object that it accesses. But this fine lock granularity may have high synchronization overhead because it maximizes the number of executed acquire and release constructs. To achieve good performance it may be necessary to reduce the overhead by coarsening the granularity at which the computation locks objects. In this article we describe a static analysis technique, lock coarsening, designed to automatically increase the lock granularity in object-based programs with atomic operations. We have implemented this technique in the context of a parallelizing compiler for irregular, object-based programs and used it to improve the generated parallel code. Experiments with two automatically parallelized applications show these algorithms to be effective in reducing the lock overhead to negligible levels. The results also show, however, that an overly aggressive lock coarsening algorithm may harm the overall parallel performance by serializing sections of the parallel computation. A successful compiler must therefore negotiate a trade-off between reducing lock overhead and increasing serialization.
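The following C/pthreads sketch (invented for illustration; it is not code from the paper's compiler) shows the effect of the transformation on a sequence of atomic operations on one object: the fine-grained version pays one acquire/release per operation, while the coarsened version holds the object's lock across the whole sequence, at the cost of excluding other threads for longer.

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;          /* assumed initialized with pthread_mutex_init */
        double balance;
    } account;

    /* Fine-grained: each atomic operation acquires and releases the lock. */
    void deposit(account *a, double amount) {
        pthread_mutex_lock(&a->lock);
        a->balance += amount;
        pthread_mutex_unlock(&a->lock);
    }

    void deposit_many_fine(account *a, const double *amounts, int n) {
        for (int i = 0; i < n; i++)
            deposit(a, amounts[i]);            /* n acquire/release pairs */
    }

    /* Coarsened: one acquire/release around the whole sequence.  Lower
     * synchronization overhead, but the lock is held longer, which can
     * serialize other threads that need the same object. */
    void deposit_many_coarse(account *a, const double *amounts, int n) {
        pthread_mutex_lock(&a->lock);
        for (int i = 0; i < n; i++)
            a->balance += amounts[i];
        pthread_mutex_unlock(&a->lock);
    }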

13.
Local CPS conversion is a compiler transformation for improving the code generated for nested loops by a direct-style compiler that uses recursive functions to represent loops. The transformation selectively applies CPS conversion at non-tail call sites, which allows the compiler to use a single machine procedure and stack frame for both the caller and callee. In this paper, we describe LCPS conversion, as well as a supporting analysis. We have implemented Local CPS conversion in the MOBY compiler and describe our implementation. In addition to improving the performance of loops, Local CPS conversion is also used to aid the optimization of non-local control flow by the MOBY compiler. We present results from preliminary experiments with our compiler that show significant reductions in loop overhead as a result of Local CPS conversion.

14.
On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
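As context for the comparison, here is a plain C sketch of the control-centric tiling that data-shackling is contrasted with (a generic illustration, not the SGI MIPSPro implementation): the loops of a matrix multiply are blocked so that each B-by-B tile of the arrays is reused while it is still resident in cache.

    #define N 512
    #define B 64        /* tile size chosen so the working set fits in cache */

    void matmul_tiled(double A[N][N], double Bm[N][N], double C[N][N]) {
        for (int ii = 0; ii < N; ii += B)
            for (int jj = 0; jj < N; jj += B)
                for (int kk = 0; kk < N; kk += B)
                    /* perfectly nested inner loops work on one tile at a time */
                    for (int i = ii; i < ii + B; i++)
                        for (int j = jj; j < jj + B; j++)
                            for (int k = kk; k < kk + B; k++)
                                C[i][j] += A[i][k] * Bm[k][j];
    }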

15.
The task of designing and implementing a compiler can be a difficult and error-prone process. In this paper, we present a new approach based on the use of higher-order abstract syntax and term rewriting in a logical framework. All program transformations, from parsing to code generation, are cleanly isolated and specified as term rewrites. This has several advantages. The correctness of the compiler depends solely on a small set of rewrite rules that are written in the language of formal mathematics. In addition, the logical framework guarantees the preservation of scoping, and it automates many frequently-occurring tasks including substitution and rewriting strategies. As we show, compiler development in a logical framework can be easier than in a general-purpose language like ML, in part because of automation, and also because the framework provides extensive support for examination, validation, and debugging of the compiler transformations. The paper is organized around a case study, using the MetaPRL logical framework to compile an ML-like language to Intel x86 assembly. We also present a scoped formalization of x86 assembly in which all registers are immutable.

16.
Exploiting the cache locality of parallel programs at runtime is a complementary approach to compiler optimization. This is particularly important for applications with dynamic memory access patterns. We propose a memory-layout oriented technique to exploit the cache locality of parallel loops at runtime on Symmetric Multiprocessor (SMP) systems. Guided by application-dependent and targeted architecture-dependent hints, our system, called Cacheminer, reorganizes and partitions a parallel loop using the memory-access space of its execution. Through effective runtime transformations, our system maximizes the data reuse in each partitioned data region assigned to a cache, and minimizes the data sharing among the partitioned data regions assigned to all caches. The execution of tasks in the partitions is scheduled in an adaptive, locality-preserving way to minimize the execution time of programs by trading off load balance and locality. We have implemented the Cacheminer runtime library on two commercial SMP servers and a SimCS-simulated SMP. Our simulation and measurement results show that our runtime approach can achieve performance comparable to compiler optimizations for programs with regular computation and memory-access patterns, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our experimental results also show that our approach is able to significantly improve the memory performance for applications with irregular computation and dynamic memory access patterns. These programs are usually hard to optimize with static compiler optimizations.

17.
W. M. Waite, L. R. Carter. Software, 1985, 15(3): 221-237
The ‘fact’ that compilers employing generated parsers suffer significant performance degradation vis-à-vis recursive descent compilers is entrenched in the folklore of computing. We give detailed measurements that support this belief when the entire compiler is written in Pascal. We then define a general interface for a parsing module that hides significantly more information than usual, simplifying the process of generation and integration. This interface makes an assembly-coded parse table interpreter feasible without changing the language used for the remainder of the compiler. When the parse table interpreter is written in assembly language, the costs of the generated parser are essentially the same as those for recursive descent. (A minor space/time trade-off is possible, with the recursive descent implementation being slightly bigger and slightly faster.)
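For readers unfamiliar with the alternative being measured against, here is a minimal recursive-descent expression parser in C (a generic illustration, unrelated to the paper's Pascal compiler): each grammar rule becomes a procedure, so parsing costs are ordinary calls, returns, and comparisons rather than table interpretation.

    #include <stdio.h>
    #include <ctype.h>

    /* Grammar:  expr -> term { '+' term }     term -> factor { '*' factor }
     *           factor -> digit | '(' expr ')'                              */
    static const char *p;                 /* cursor into the input string    */

    static int expr(void);

    static int factor(void) {
        if (*p == '(') { p++; int v = expr(); if (*p == ')') p++; return v; }
        return isdigit((unsigned char)*p) ? *p++ - '0' : 0;
    }

    static int term(void) {
        int v = factor();
        while (*p == '*') { p++; v *= factor(); }
        return v;
    }

    static int expr(void) {
        int v = term();
        while (*p == '+') { p++; v += term(); }
        return v;
    }

    int main(void) {
        p = "(1+2)*3+4";
        printf("%d\n", expr());           /* prints 13 */
        return 0;
    }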

18.
This paper presents new approaches to the validation of loop optimizations that compilers use to obtain the highest performance from modern architectures. Rather than verify the compiler, the approach of translation validation performs a validation check after every run of the compiler, producing a formal proof that the produced target code is a correct implementation of the source code. As part of an active and ongoing research project on translation validation, we have previously described approaches for validating optimizations that preserve the loop structure of the code and have presented a simulation-based general technique for validating such optimizations. In this paper, for more aggressive optimizations that alter the loop structure of the code, such as distribution, fusion, tiling, and interchange, we present a set of permutation rules which establish that the transformed code satisfies all the implied data dependencies necessary for the validity of the considered transformation. We describe the extensions to our tool voc-64 which are required to validate these structure-modifying optimizations. This paper also discusses preliminary work on run-time validation of speculative loop optimizations. This involves using run-time tests to ensure the correctness of loop optimizations whose correctness cannot be guaranteed at compile time. Unlike compiler validation, run-time validation must not only determine when an optimization has generated incorrect code, but also recover from the optimization without aborting the program or producing an incorrect result. This technique has been applied to several loop optimizations, including loop interchange and loop tiling, and appears to be quite promising. This research was supported in part by NSF grant CCR-0098299, ONR grant N00014-99-1-0131, and the John von Neumann Minerva Center for Verification of Reactive Systems.
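As a small, generic illustration of the dependence condition that permutation rules for loop interchange must establish (not the voc-64 formalism itself), consider the two C loop nests below: interchange is legal only when no dependence has direction (<, >), since interchange would turn it into (>, <), a backward dependence.

    #define N 100
    #define M 100
    static double a[N][M];

    void interchange_examples(void) {
        /* Nest 1: the dependence runs from the write of a[i-1][j] in
         * iteration (i-1, j) to its read in iteration (i, j): direction
         * (<, =).  After interchange it becomes (=, <), which is still
         * lexicographically positive, so interchanging i and j is legal. */
        for (int i = 1; i < N; i++)
            for (int j = 0; j < M; j++)
                a[i][j] = a[i - 1][j] + 1.0;

        /* Nest 2: the dependence runs from iteration (i-1, j+1) to (i, j):
         * direction (<, >).  Interchange would turn it into (>, <), i.e. a
         * value would be read before the iteration that writes it, so the
         * transformation is illegal here. */
        for (int i = 1; i < N; i++)
            for (int j = 0; j < M - 1; j++)
                a[i][j] = a[i - 1][j + 1] + 1.0;
    }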

19.
Data dependence and its application to parallel processing
Data dependence testing is required to detect parallelism in programs. In this paper data dependence concepts and data dependence direction vectors are reviewed. Data dependence computation in parallel and vector constructs as well as serial do loops is covered. Several transformations that require data dependence are given as examples, such as vectorization (translating serial code into vector code), concurrentization (translating serial code into concurrent code for multiprocessors), scalarization (translating vector or concurrent code into serial code for a scalar uniprocessor), loop interchanging and loop fusion. The details of data dependence testing, including several data dependence decision algorithms, are given.
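A small C illustration of the concepts being reviewed (invented for illustration, not taken from the paper): the dependence carried by a loop determines whether that loop can be translated into vector code.

    void dependence_examples(double *a, const double *b, int n) {
        /* Flow (true) dependence carried by the loop, direction "<":
         * iteration i reads the value written in iteration i-1, so this
         * loop cannot be vectorized directly. */
        for (int i = 1; i < n; i++)
            a[i] = a[i - 1] + b[i];

        /* Anti-dependence: each iteration reads a[i+1] before a later
         * iteration overwrites it.  Vector execution loads all of the old
         * a[i+1] values before any store, preserving the serial result, so
         * this loop can be translated into vector code. */
        for (int i = 0; i < n - 1; i++)
            a[i] = a[i + 1] + b[i];
    }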

20.
Optimizing compilers for data-parallel languages such as High Performance Fortran perform a complex sequence of transformations. However, the effects of many transformations are not independent, which makes it challenging to generate high quality code. In particular, some transformations introduce conditional control flow, while others make some conditionals unnecessary by refining program context. Eliminating unnecessary conditional control flow during compilation can reduce code size and remove a source of overhead in the generated code. This paper describes algorithms to compute symbolic constraints on the values of expressions used in control predicates and to use these constraints to identify and remove unnecessary conditional control flow. These algorithms have been implemented in the Rice dHPF compiler, and we show that they are effective in reducing the number of conditionals and the overall size of generated code. Finally, we describe a synergy between control flow simplification and data-parallel code generation based on loop splitting, which achieves the effects of narrower data-parallel compiler optimizations such as vector message pipelining and the use of overlap areas.
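A generic C illustration of the loop-splitting idea (not dHPF-generated code): when the compiler can bound the values a control predicate depends on, it can peel the iterations that satisfy the predicate and drop the conditional from the remaining loop.

    /* Before: the boundary test is evaluated on every iteration. */
    void smooth_guarded(double *a, const double *b, int n, double boundary) {
        for (int i = 0; i < n; i++) {
            if (i == 0)
                a[i] = boundary;                  /* boundary condition   */
            else
                a[i] = 0.5 * (b[i - 1] + b[i]);   /* interior computation */
        }
    }

    /* After loop splitting: i == 0 holds only for the first iteration, so
     * it is peeled off and the hot loop contains no conditional. */
    void smooth_split(double *a, const double *b, int n, double boundary) {
        if (n > 0)
            a[0] = boundary;
        for (int i = 1; i < n; i++)
            a[i] = 0.5 * (b[i - 1] + b[i]);
    }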
