期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

David R. Hanson 《Software》1983,13(8):745-763

Program optimization has received a great deal of attention for many years, which has resulted in numerous advances in compiler technology. The effectiveness of various simple optimizations has received comparably little attention during the same time period. The simplicity of most programs suggests that straightforward optimizations pay the greatest dividends. This paper describes three such optimizations suitable for one-pass compilers. The optimizations involve expression rearrangement, instruction selection, and the use of a cache for the allocation of resources. The cost of these optimizations is low; none require major changes to the size or structure of the compiler or reduce compilation speed by more than 10%. The benefits are high; each optimization results in at least a 10% average reduction in object code size and a corresponding reduction in execution time. Examples and implementation details are also described. 相似文献

2.

Microbenchmarks for determining branch predictor organization

Milena Milenkovic Aleksandar Milenkovic Jeffrey Kulick 《Software》2004,34(5):465-487

In order to achieve an optimum performance of a given application on a given computer platform, a program developer or compiler must be aware of computer architecture parameters, including those related to branch predictors. Although dynamic branch predictors are designed with the aim of automatically adapting to changes in branch behavior during program execution, code optimizations based on the information about predictor structure can greatly increase overall program performance. Yet, exact predictor implementations are seldom made public, even though processor manuals provide valuable optimization tips. This paper presents an experimental flow with a series of microbenchmarks that determine the organization and size of a branch predictor using on‐chip performance monitoring registers. Such knowledge can be used either for manual code optimization or for design of new, more architecture‐aware compilers. Three examples illustrate how insight into exact branch predictor organization can be directly applied to code optimization. The proposed experimental flow is illustrated with microbenchmarks tuned for Intel Pentium III and Pentium 4 processors, although they can easily be adapted for other architectures. The described approach can also be used during processor design for performance evaluation of various branch predictor organizations and for testing and validation during implementation. Copyright © 2004 John Wiley & Sons, Ltd. 相似文献

3.

基于GCC的VLIW编译系统研究 总被引：1，自引：1，他引：0

朱凯佳尹宝林《计算机工程与应用》2001,37(12):125-128

VLIW机器在单个机器周期中同时发射并执行多个的并行操作,从而获得较高的指令级并行度,这些操作之间的依赖分析和调度工作则被完全交给相应的编译器执行,因此VLIW的并行性能能否充分发挥取决于VLIW体系结构相关编译器的质量。GNU开发的GCC是被最广泛使用的编译系统之一,它具有多语言、多平台支持的能力和开放的结构,能够运用各种成熟的常规编译优化技术生成高效的代码。文章分析了VLIW及GCC的结构特点,提出了一种基于GCC的VLIW编译系统设计方案,利用GCC进行RTL中间代码一级的体系结构无关优化和少量体系结构相关优化,在汇编代码一级针对VLIW结构进行体系结构相关的优化,从而充分利用GCC的成熟编译技术快速开发高效的VLIW多语言编译系统。相似文献

4.

Using profile information to assist classic code optimizations

Pohua P. Chang Scott A. Mahlke Wen-Mei W. Hwu 《Software》1991,21(12):1301-1321

This paper describes the design and implementation of an optimizing compiler that automatically generates profile information to assist classic code optimizations. This compiler contains two new components, an execution profiler and a profile-based code optimizer, which are not commonly found in traditional optimizing compilers. The execution profiler inserts probes into the input program, executes the input program for several inputs, accumulates profile information and supplies this information to the optimizer. The profile-based code optimizer uses the profile information to expose new optimization opportunities that are not visible to traditional global optimization methods. Experimental results show that the profile-based code optimizer significantly improves the performance of production C programs that have already been optimized by a high-quality global code optimizer. 相似文献

5.

一种加速访存地址计算的编译优化

高秀武姜军白书敬黄亮明《计算机工程》2023,49(1):173-180

在国产申威高性能多核服务器系统中,基础编译系统对应用程序中访存操作进行代码生成时,没有考虑国产处理器指令特征,导致编译器生成的访存地址计算代码效率较低,影响国产高性能处理器的性能。为充分发挥国产处理器高性能计算能力,提出一种加速访存地址计算的编译优化方法。加速访存地址计算编译优化基于处理器支持带扩展因子的运算指令,在编译器后端内存地址表达式合法性检查中,添加针对乘加模式的地址计算表达式合法性检查算法,自动识别地址表达式中存在的乘加运算并进行合法性检验,对符合条件的地址表达式在代码生成阶段匹配生成带扩展因子的运算指令来快速计算访存地址,从而加快访存指令的发射与执行以及应用程序中的访存地址生成,提升访存效率。使用行业标准性能测试集SPEC CPU2006对优化效果进行评测,结果表明,相比优化前SPECspeed Integer与SPECspeed Float Point两个子集,该优化方法平均性能分别提高了2.53%与1.50%。相似文献

6.

Compilation techniques for parallel systems

《Parallel Computing》1999,25(13-14):1741-1783

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed. 相似文献

7.

Region-based compilation: Introduction, motivation, and initial experience 总被引：1，自引：0，他引：1

Richard E. Hank Wen-mei W. Hwu B. Ramakrishna Rau 《International journal of parallel programming》1997,25(2):113-146

The most important task of a compiler designed to exploit instruction-level parallelism (ILP) is instruction scheduling. If higher levels of ILP are to be achieved, the compiler must use, as the unit of scheduling, regions consisting of multiple basic blocks—preferably those that frequently execute consecutively, and which capture cycles in the program’s execution. Traditionally, compilers have been built using the function as the unit of compilation. In this framework, function boundaries often act as barriers to the formation of the most suitable scheduling regions. Function inlining may be used to circumvent this problem by assembling strongly coupled functions into the same compilation unit, but at the cost of very large function bodies. Consequently, global optimizations whose compile time and space requirements are superlinear in the size of the compilation unit, may be rendered prohibitively expensive. This paper introduces a new approach, called region-based compilation, wherein the compiler, after inlining, repartitions the program into more desirable compilation units, termed regions. Region-based compilation allows the compiler to control problem size and complexity while exposing inter-procedural scheduling, optimization and code motion opportunities. 相似文献

8.

跨文件编译模式与基于GCC的实现

下载免费PDF全文

郭学鹏赵克佳《计算机工程与科学》2007,29(4):111-115

有很多编译优化都与编译时的视野有关,较宽的视野能给编译器提供更详细的信息,从而能得到更好的优化效果。采用跨文件编译模式使编译器的视野扩大至整个程序将是未来的方向。本文总结了实现这种模式的一般流程以及所遇到的问题和解决方法,分析了三种已提出的跨文件过程间编译模式,最后给出了一个基于GCC3.4的跨文件编译框架的实现方法。相似文献

9.

基于NAS Benchmarks的ORC性能测试

林海波汤志忠《计算机科学》2003,30(3):141-145

1 引言安腾(Itanium)处理器是HP/Intel公司推出的第一代基于IA-64体系结构的处理器。IA-64体系结构是一种64位的支持显式指令级并行计算(Explicit Parallel Instruction Computing,EPIC)的体系结构,它实现了一系列新特性,支持开发更大的指令级并行性(Instruction Level Parallelism,ILP),突破了传统体系结构的性能限制。这些新特性包括:猜相似文献

10.

Design and implementation of a queue compiler

Arquimedes Canedo Ben A. Abderazek Masahiro Sowa 《Microprocessors and Microsystems》2009,33(2):129-138

Queue processors are a viable alternative for high performance embedded computing and parallel processing. We present the design and implementation of a compiler for a queue-based processor. Instructions of a queue processor implicitly reference their operands making the programs free of false dependencies. Compiling for a queue machine differs from traditional compilation methods for register machines. The queue compiler is responsible for scheduling the program in level-order manner to expose natural parallelism and calculating instructions relative offset values to access their operands. This paper describes the phases and data structures used in the queue compiler to compile C programs into assembly code for the QueueCore, an embedded queue processor. Experimental results demonstrate that our compiler produces good code in terms of parallelism and code size when compared to code produced by a traditional compiler for a RISC processor. 相似文献

11.

Alternatives of profile-guided code optimizations for one-stage compilation

O.?A.?Chetverina Email author 《Programming and Computer Software》2016,42(1):34-40

Optimizing compilers increase the resulting code performance by carrying out a number of code optimization techniques. Profile information assistance for code optimizations gives an opportunity to greatly increase the code performance in some cases. However, the impossibility to provide a representative training execution often leads to the decline in efficiency of profile-dependent code optimizations. This paper investigates the main causes of the performance loss for the one-stage optimization as compared to the profileguided optimization (PGO) and introduces some alternative compilation techniques to reduce this loss. The effectiveness of these techniques is evaluated for a VLIW-architecture Elbrus compiler. 相似文献

12.

The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication

David I. August Wen-Mei W. Hwu Scott A. Mahlke 《International journal of parallel programming》1999,27(5):381-423

Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply ifconversion effectively, one must address two major issues: what should be ifconverted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework based on partial reverse if-conversion that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time. 相似文献

13.

SIMD计算机的优化编译器设计 总被引：1，自引：1，他引：0

下载免费PDF全文

赵辉黄石《计算机工程》2009,35(1):201-203

利用处理器的相关资源,提高编译器优化性能和增强代码可适应性是SIMD处理器优化编译的关键。该文基于M语言和LSSIMD体系结构,结合现代编译器的编译技术,提出针对SIMD协处理器编译器的优化和实现方法,包括寄存器分配、单值合并、代码压缩等。实验结果表明,编译生成的目标代码准确、高效。相似文献

14.

Separate compilation of hierarchical real-time programs into linear-bounded Embedded Machine code

Arkadeb Ghosal Christoph M. Kirsch 《Science of Computer Programming》2012,77(2):96-112

Hierarchical Timing Language (HTL) is a coordination language for distributed, hard real-time applications. HTL is a hierarchical extension of Giotto and, like its predecessor, based on the logical execution time (LET) paradigm of real-time programming. Giotto is compiled into code for a virtual machine, called the Embedded Machine (or E machine). If HTL is targeted to the E machine, then the hierarchical program structure needs to be flattened; the flattening makes separate compilation difficult, and may result in E machine code of exponential size. In this paper, we propose a generalization of the E machine, which supports a hierarchical program structure at runtime through real-time trigger mechanisms that are arranged in a tree. We present the generalized E machine, and a modular compiler for HTL that generates code of linear size. The compiler may generate code for any part of a given HTL program separately in any order. 相似文献

15.

Validation of GCC optimizers through trace generation

Aditya Kanade Amitabha Sanyal Uday P. Khedker 《Software》2009,39(6):611-639

The translation validation approach involves establishing semantics preservation of individual compilations. In this paper, we present a novel framework for translation validation of optimizers. We identify a comprehensive set of primitive program transformations that are commonly used in many optimizations. For each primitive, we define soundness conditions that guarantee that the transformation is semantics preserving. This framework of transformations and soundness conditions is independent of any particular compiler implementation and is formalized in PVS. An optimizer is instrumented to generate the trace of an optimization run in terms of the predefined transformation primitives. The validation succeeds if (1) the trace conforms to the optimization and (2) the soundness conditions of the individual transformations in the trace are satisfied. The first step eliminates the need to trust the instrumentation. The soundness conditions are defined in a temporal logic and therefore the second step involves model checking. Thus the scheme is completely automatable. We have applied this approach to several intraprocedural optimizations of RTL intermediate code in GNU Compiler Collection (GCC) v4.1.0, namely, loop invariant code motion, partial redundancy elimination, lazy code motion, code hoisting, and copy and constant propagation for sample programs written in a subset of the C language. The validation does not require information about program analyses performed by GCC. Therefore even though the GCC code base is quite large and complex, instrumentation could be achieved easily. The framework requires an estimated 21 lines of instrumentation code and 140 lines of PVS specifications for every 1000 lines of the GCC code considered for validation. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献

16.

编译器中激进蝴蝶优化方法的研究与实现

朱广林吕方赖庆宽陈华英何先波《计算机工程与科学》2021,43(6):962-968

编译优化技术的目的是挖掘程序中的优化空间,提高程序编译或运行效率,无效代码删除优化是被广泛使用的编译优化技术之一,它旨在删除程序中不可达的代码,以提升程序的执行效率。许多应用程序的执行路径往往与运行时的输入参数值相关,并且在一些分支路径上与运行时参数值相结合,可能存在无效代码,通过现有的无效代码删除优化,很难做出优化处理。为此,提出一种依赖数据流分析的激进蝴蝶优化方法,利用SSA中间表示,根据动态运行时的参数可能值,自动为程序生成代码形状类似蝴蝶（butterfly）的分支代码,使编译器在程序编译阶段为相关优化提供可行的优化依据。最后通过实验验证了该方法的有效性和可行性。相似文献

17.

Analyses and Optimizations for Shared Address Space Programs

Arvind Krishnamurthy Katherine Yelick 《Journal of Parallel and Distributed Computing》1996,38(2):130

We present compiler analyses and optimizations for explicitly parallel programs that communicate through a shared address space. Any type of code motion on explicitly parallel programs requires a new kind of analysis to ensure that operations reordered on one processor cannot be observed by another. The analysis, calledcycle detection, is based on work by Shasha and Snir and checks for cycles among interfering accesses. We improve the accuracy of their analysis by using additional information fromsynchronization analysis, which handles post–wait synchronization, barriers, and locks. We also make the analysis efficient by exploiting the common code image property of SPMD programs. We make no assumptions on the use of synchronization constructs: our transformations preserve program meaning even in the presence of race conditions, user-defined spin locks, or other synchronization mechanisms built from shared memory. However, programs that use linguistic synchronization constructs rather than their user-defined shared memory counterparts will benefit from more accurate analysis and therefore better optimization. We demonstrate the use of this analysis for communication optimizations on distributed memory machines by automatically transforming programs written in a conventional shared memory style into a Split-C program, which has primitives for nonblocking memory operations and one-way communication. The optimizations includemessage pipelining, to allow multiple outstanding remote memory operations, conversion of two-way to one-way communication, and elimination of communication through data reuse. The performance improvements are as high as 20–35% for programs running on a CM-5 multiprocessor using the Split-C language as a global address layer. Even larger benefits can be expected on machines with higher communication latency relative to processor speed. 相似文献

18.

Proofs of numerical programs when the compiler optimizes

Sylvie Boldo Thi Minh Tuyen Nguyen 《Innovations in Systems and Software Engineering》2011,7(2):151-160

On certain recently developed architectures, a numerical program may give different answers depending on the execution hardware and the compilation. Our goal is to formally prove properties about numerical programs that are true for multiple architectures and compilers. We propose an approach that states the rounding error of each floating-point computation whatever the environment and the compiler choices. This approach is implemented in the Frama-C platform for static analysis of C code. Small case studies using this approach are entirely and automatically proved. 相似文献

19.

基于值剖视的编译优化

下载免费PDF全文

孔凡金黄春《计算机工程》2011,37(6):58-60

介绍在GCC编译器中利用值剖视识别收集变量的不变特征信息并指导代码优化工作的方法。NPB基准测试程序的测试结果表明，GCC基于值剖视的优化引入的开销小，与边剖视一起使用时能获得较好的优化效果，在不同程序间显示出一定的优化针对性和局限性，值剖视信息的类型与数量、优化种类等存在较大的改进空间。相似文献

20.

Compiler techniques for the distribution of data and computation

Navarro A. Zapata E. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):545-562

This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (ICG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines. 相似文献