1.

Fast, frequency‐based, integrated register allocation and instruction scheduling

Ioana Cutcutache Weng‐Fai Wong 《Software》2008,38(11):1105-1126

Instruction scheduling and register allocation are two of the most important optimization phases in modern compilers as they have a significant impact on the quality of the generated code. Unfortunately, the objectives of these two optimizations are in conflict with one another. The instruction scheduler attempts to exploit instruction‐level parallelism and requires many operands to be available in registers. On the other hand, the register allocator wants register pressure to be kept low so that the amount of spill code can be minimized. Currently these two phases are done separately, typically in three passes: prepass scheduling, register allocation and postpass scheduling. But this separation can lead to poor results. Previous works attempted to solve the phase‐ordering problem by combining the instruction scheduler with graph‐coloring‐based register allocators. The latter tend to be computationally expensive. Linear‐scan register allocators, on the other hand, are simple, fast and efficient. In this paper, we describe our effort to integrate instruction scheduling with a linear‐scan allocator. Furthermore, our integrated optimizer is able to take advantage of execution frequencies obtained through profiling. Our integrated register allocator and instruction scheduler achieved good code quality with significantly reduced compilation times. On the SPEC2000 benchmarks running on a 900 MHz ItaniumII, compared with OpenIMPACT, we halved the time spent in instruction scheduling and register allocation with negligible impact on execution times. Copyright © 2007 John Wiley & Sons, Ltd.
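For readers unfamiliar with the linear‐scan allocators mentioned in this abstract, the following is a minimal sketch of the classic linear‐scan idea (sort live intervals by start point, expire finished intervals, spill the interval that ends last when no register is free). It is not the authors' integrated scheduler and allocator; the function name `linear_scan`, the interval representation and the sample data are assumptions made for illustration.

```python
def linear_scan(intervals, num_regs):
    """Illustrative linear-scan allocation (hypothetical helper, not from the paper).

    intervals: dict mapping a variable name to its live interval (start, end).
    Returns (assignment, spilled): a dict variable -> register number, and the
    set of variables that were spilled to memory.
    """
    free = list(range(num_regs))   # registers currently unused
    active = []                    # (end, name) pairs of live intervals, sorted by end
    assignment = {}
    spilled = set()

    # Visit intervals in order of increasing start point.
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Expire intervals that ended before this one starts and free their registers.
        while active and active[0][0] < start:
            _, old = active.pop(0)
            free.append(assignment[old])

        if free:
            assignment[name] = free.pop()
            active.append((end, name))
            active.sort()
        else:
            # No register free: spill whichever live interval ends last.
            last_end, last_name = active[-1]
            if last_end > end:
                # Take over the register of the interval that ends last.
                assignment[name] = assignment.pop(last_name)
                spilled.add(last_name)
                active[-1] = (end, name)
                active.sort()
            else:
                spilled.add(name)
    return assignment, spilled


# Example: four overlapping live intervals competing for two registers.
sample = {"a": (0, 10), "b": (1, 3), "c": (2, 8), "d": (4, 6)}
assignment, spilled = linear_scan(sample, num_regs=2)
```

With two registers, the long‐lived variable `a` is the one chosen for spilling in this sample; a frequency‐based allocator of the kind the abstract describes would presumably also weigh how often each variable is executed, which this sketch does not model.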

2.

Register spilling via transformed interference equations for PAC DSP architecture

Chung‐Ju Wu Chia‐Han Lu Jenq Kuen Lee 《Concurrency and Computation》2014,26(3):779-799

Digital signal processors (DSPs) with very long instruction word (VLIW) data‐path architectures are increasingly being deployed on embedded devices for multimedia processing applications. To reduce the power consumption and design cost of VLIW DSP processors, distributed register files and multibank register architectures are being adopted to reduce the number of read and write ports associated with register files, which presents new challenges for devising compiler optimization schemes. This paper addresses the issues of reducing the spill code for a VLIW DSP with distributed register files. Spill code produced by register allocation is traditionally handled by memory spills, but the multibank register‐file architecture provides the opportunity to spill register values onto different register banks. We present a conceptual framework based on the universal and the proxy interference graphs to model the live ranges of registers for spilling to different register banks. Heuristic algorithms are then developed on the basis of this concept. By heuristically estimating the register pressure for each register file, we treat different register banks as optional spilling locations in addition to traditional spilling to memory. Experiments were performed on the parallel architecture core VLIW DSP with distributed register files by incorporating our proposed optimization schemes into an Open64‐based compiler. The experimental results show that, for the cases with spilling, our approach improves average performance on the DSPStone and MiBench benchmarks by 7.1% and 21.6%, respectively, compared with always handling spill code in memory. Copyright © 2013 John Wiley & Sons, Ltd.

3.

Improving on Linear Scan Register Allocation

Shahrzad Kananizadeh Kirill Kononenko 《International Journal of Automation and Computing》2018,15(2):228-238

Register allocation is a major step for all compilers. Various register allocation algorithms have been developed over the decades. This work describes a new class of rapid register allocation algorithms and presents experimental data on their behavior. Our research encourages the avoidance of graphing and graph-coloring based on the fact that precise graph-coloring is nondeterministic polynomial time-complete (NP-complete), which is not suitable for real-time tasks. In addition, practical graph-coloring algorithms tend to use polynomial-time heuristics. In dynamic compilation environments, their superlinear complexity makes them unsuitable for register allocation and code generation. Existing tools for code generation and register allocation do not completely fulfill the requirements of fast compilation. Existing approaches either do not allow for the optimization of register allocation to be achieved comprehensively with a sufficient degree of performance or they require an unjustifiable amount of time and/or resources. Therefore, we propose a new class of register allocation and code generation algorithms that can be performed in linear time. These algorithms are based on the mathematical foundations of abstract interpretation and the computation of the level of abstraction. They have been implemented in a specialized library for just-in-time compilation. The specialization of this library involves the execution of common intermediate language (CIL) and low level virtual machine (LLVM) with a focus on embedded systems.

4.

Pragmatic integrated scheduling for clustered VLIW architectures

Rahul Nagpal Y. N. Srikant 《Software》2008,38(3):227-257

Clustered architecture processors are preferred for embedded systems because centralized register file architectures scale poorly in terms of clock rate, chip area, and power consumption. Scheduling for clustered architectures involves spatial concerns (where to schedule) as well as temporal concerns (when to schedule). Various clustered VLIW configurations, connectivity types, and inter‐cluster communication models present different performance trade‐offs to a scheduler. The scheduler is responsible for resolving the conflicting requirements of exploiting the parallelism offered by the hardware and limiting the communication among clusters to achieve better performance. In this paper, we describe our experience with developing a pragmatic scheme and also a generic graph‐matching‐based framework for cluster scheduling based on a generic and realistic clustered machine model. The proposed scheme effectively utilizes the exact knowledge of available communication slots, functional units, and load on different clusters as well as future resource and communication requirements known only at schedule time. The proposed graph‐matching‐based framework for cluster scheduling resolves the phase‐ordering and fixed‐ordering problem associated with earlier schemes for scheduling clustered VLIW architectures. The experimental evaluation in the context of a state‐of‐the‐art commercial clustered architecture (using real‐world benchmark programs) reveals a significant performance improvement over the earlier proposals, which were mostly evaluated using compiled simulation of hypothetical clustered architectures. Our results clearly highlight the importance of considering the peculiarities of commercial clustered architectures and of hard‐nosed performance measurement. Copyright © 2007 John Wiley & Sons, Ltd.

5.

Experimental evaluation of various register‐pressure‐reduction heuristics

Ghassan Shobaki Laith Sakka Najm Eldeen Abu Rmaileh Hasan Al‐Hamash 《Software》2015,45(11):1497-1517

Minimizing the amount of spill code is still an open problem in code generation and optimization. The amount of spill code depends on both the register allocation algorithm and the pre‐allocation instruction scheduling algorithm that controls the register pressure. In this paper, we focus on the impact of pre‐allocation instruction scheduling on the amount of spill code. Many heuristic techniques have been proposed to do instruction scheduling with the objective of minimizing register pressure and consequently the amount of spill code. However, the performance of these heuristic techniques has not been studied relative to optimality on real large‐scale programs. In this paper, we present an experimental study that evaluates the performance of several pre‐allocation scheduling heuristics. The evaluation involves computing an experimental lower bound on the size of gap between each heuristic's performance and optimal performance. We also propose a simple heuristic technique based on a specific permutation of two basic priority schemes and experimentally evaluate the performance of this technique compared with other heuristics, including the heuristics implemented in the LLVM open‐source Compiler. The evaluation is carried out by running SPEC CPU2006 on real x86‐64 hardware and measuring both the amount of spill code and the execution time. The results of our study show that the proposed heuristic technique gives better overall performance than LLVM's best heuristic on x86‐64, although it produces slightly more spilling. The proposed heuristic has better overall performance, because it achieves a better balance between register pressure and instruction‐level parallelism (ILP). This result shows the importance of ILP in pre‐allocation scheduling even on out‐of‐order machines. 
Furthermore, the results of the study show that there is a large gap between the performance of any of the studied heuristics and optimal performance; even the best heuristic in the study produces significantly more spill code than the optimal amount. This experimental result quantifies the intuitive belief that it is unlikely to find a heuristic that works well in all cases, thus showing the need for more rigorous solutions using combinatorial approaches. The paper discusses the challenges and complexities that are involved in developing such rigorous solutions. Copyright © 2014 John Wiley & Sons, Ltd.

6.

Vega: Non‐Linear FEM Deformable Object Simulator

F. S. Sin D. Schroeder J. Barbič 《Computer Graphics Forum》2013,32(1):36-48

This practice and experience paper describes a robust C++ implementation of several non‐linear solid three‐dimensional deformable object strategies commonly employed in computer graphics, named the Vega finite element method (FEM) simulation library. Deformable models supported include co‐rotational linear FEM elasticity, Saint–Venant Kirchhoff FEM model, mass–spring system and invertible FEM models: neo‐Hookean, Saint–Venant Kirchhoff and Mooney–Rivlin. We provide several timestepping schemes, including implicit Newmark and backward Euler integrators, and explicit central differences. The implementation of material models is separated from integration, which makes it possible to employ our code not only for simulation, but also for deformable object control and shape modelling. We extensively compare the different material models and timestepping schemes. We provide practical experience and insight gained while using our code in several computer animation and simulation research projects.

7.

Global Register Allocation for SIMD Multiprocessors

Benjamin HAO David PEARSON Richard ZIPPEL 《Journal of Computer Science and Technology》1996,11(3):222-236

It is relatively clear how to map regular, repetitive or grid oriented computations onto SIMD architectures. It is not so clear, however, how to do this for irregular computations even though there may be significant amounts of intrinsic parallelism in branch free code. We study compilation techniques for this type of code when targeted to SIMD computers and illustrate their use on a simple model architecture. In this paper, we present one of the compilation techniques that we have developed for SIMD computers, global register allocation, and demonstrate that it can effectively allocate registers for parallelizing irregular computations in branch free code. This technique is an extension and a modification of the register allocation via graph coloring approach used by sequential compilers. Our performance results validate our method.
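As background for the "register allocation via graph coloring" approach of sequential compilers that this abstract builds on, here is a minimal simplify‐and‐select sketch in the style usually attributed to Chaitin. It does not model the SIMD extension described in the paper; the function name `color_interference_graph`, the spill choice (highest remaining degree) and the sample graph are assumptions made for illustration.

```python
def color_interference_graph(graph, k):
    """Illustrative simplify/select coloring (hypothetical helper, not from the paper).

    graph: dict mapping each variable to the set of variables it interferes with
           (the relation is assumed to be symmetric).
    k:     number of available registers (colors).
    Returns (colors, spilled): a dict variable -> register, and the list of
    variables chosen as spill candidates.
    """
    work = {n: set(adj) for n, adj in graph.items()}  # shrinking copy of the graph
    stack = []     # nodes removed with degree < k, in removal order
    spilled = []

    # Simplify: repeatedly remove a node with fewer than k remaining neighbours.
    while work:
        node = next((n for n in sorted(work) if len(work[n]) < k), None)
        if node is None:
            # Every remaining node has degree >= k: pick a spill candidate.
            node = max(sorted(work), key=lambda n: len(work[n]))
            spilled.append(node)
        else:
            stack.append(node)
        for neighbour in work.pop(node):
            work[neighbour].discard(node)

    # Select: pop nodes in reverse order and give each the lowest free color.
    colors = {}
    while stack:
        node = stack.pop()
        used = {colors[m] for m in graph[node] if m in colors}
        colors[node] = next(c for c in range(k) if c not in used)
    return colors, spilled


# Example: a triangle a-b-c plus d interfering with a, colored with two registers.
sample = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
colors, spilled = color_interference_graph(sample, k=2)
```

Because each node pushed on the stack had fewer than `k` neighbours left when it was removed, the select phase always finds a free color for it; only the spill candidates go without a register.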

8.

Solving the Register Allocation Problem for Embedded Systems Using a Hybrid Evolutionary Algorithm

Topcuoglu H.R. Demiroz B. Kandemir M. 《Evolutionary Computation, IEEE Transactions on》2007,11(5):620-634

Embedded systems are unique in the challenges they present to application programmers, such as power and memory space constraints. These characteristics make it imperative to design customized compiler passes. One of the important factors that shape runtime performance of a given embedded code is the register allocation phase of compilation. It is crucial to provide aggressive and sophisticated register allocators for embedded devices, where the excessive compilation time can be tolerated due to high demand on code quality. Failing to do a good job on allocating variables to registers (i.e., determining the set of variables to be stored in the limited number of registers) can have serious power, performance, and code size consequences. This paper explores the possibility of employing a hybrid evolutionary algorithm for the register allocation problem in embedded systems. The proposed solution combines genetic algorithms with a local search technique. The algorithm exploits a novel, highly specialized crossover operator that takes into account domain-specific information. The results from our implementation based on synthetic benchmarks and routines that are extracted from well-known benchmark suites clearly show that the proposed approach is very successful in allocating registers to variables. In addition, our experimental evaluation also indicates that it outperforms a state-of-the-art register allocation heuristic based on graph coloring in most of the cases tested.

9.

Reducing the overhead of dynamic compilation

Chandra J. Krintz David Grove Vivek Sarkar Brad Calder 《Software》2001,31(8):717-738

The execution model for mobile, dynamically‐linked, object‐oriented programs has evolved from fast interpretation to a mix of interpreted and dynamically compiled execution. The primary motivation for dynamic compilation is that compiled code executes significantly faster than interpreted code. However, dynamic compilation, which is performed while the application is running, introduces execution delay. In this paper we present two dynamic compilation techniques that enable high performance execution while reducing the effect of this compilation overhead. These techniques can be classified as (1) decreasing the amount of compilation performed, and (2) overlapping compilation with execution. We first present and evaluate lazy compilation, an approach used in most dynamic compilation systems in which individual methods are compiled on‐demand upon their first invocation. This is in contrast to eager compilation, in which all methods in a class are compiled when a new class is loaded. In this work, we describe our experience with eager compilation, as well as the implementation and transition to lazy compilation. We empirically detail the effectiveness of this decision. Our experimental results using the SpecJVM Java benchmarks and the Jalapeño JVM show that, compared to eager compilation, lazy compilation results in 57% fewer methods being compiled and reductions in total time of 14 to 26%. Total time in this context is compilation plus execution time. Next, we present profile‐driven, background compilation, a technique that augments lazy compilation by using idle cycles in multiprocessor systems to overlap compilation with application execution. With this approach, compilation occurs on a thread separate from that of application threads so as to reduce intermittent, and possibly substantial, delay in execution. Profile information is used to prioritize methods as candidates for background compilation. 
Methods are compiled according to this priority scheme so that performance‐critical methods are invoked using optimized code as soon as possible. Our results indicate that background compilation can achieve the performance of off‐line compiled applications and masks almost all compilation overhead. We show significant reductions in total time of 14 to 71% over lazy compilation. Copyright © 2001 John Wiley & Sons, Ltd.
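The difference between eager and lazy compilation described in this abstract can be sketched in a few lines. This is a toy model only, assuming a hypothetical `LazyClass` loader in which plain Python callables stand in for bytecode and "compiling" is simulated by a counter; it does not reflect the Jalapeño JVM implementation or the background‐compilation thread.

```python
class LazyClass:
    """Toy class loader (hypothetical) contrasting eager and lazy compilation."""

    def __init__(self, sources, eager=False):
        self.sources = dict(sources)  # method name -> callable standing in for bytecode
        self.compiled = {}            # method name -> "compiled" code
        self.compile_count = 0        # how many methods have been compiled so far
        if eager:
            # Eager: compile every method as soon as the class is loaded.
            for name in self.sources:
                self._compile(name)

    def _compile(self, name):
        # Stand-in for real code generation; only the count matters here.
        self.compile_count += 1
        self.compiled[name] = self.sources[name]

    def invoke(self, name, *args):
        # Lazy: compile on demand, upon the first invocation of the method.
        if name not in self.compiled:
            self._compile(name)
        return self.compiled[name](*args)


methods = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "never_called": lambda: None,
}

eager_cls = LazyClass(methods, eager=True)
lazy_cls = LazyClass(methods)
lazy_cls.invoke("add", 2, 3)
lazy_cls.invoke("add", 4, 5)  # second call reuses the already compiled method
```

In this toy run the eager loader compiles all three methods while the lazy one compiles only `add`, which is the effect behind the reduction in compiled methods that the abstract reports.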

10.

A register allocation algorithm for VLIW instruction compression

朱少波姚庆栋洪享史册《Computer Engineering》2003,29(20):154-156

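Targeting instruction compression methods for VLIW architectures, and drawing on an in-depth analysis and summary of the compiler's intermediate code, this paper proposes an improved register allocation algorithm. Building on linear scan, the algorithm adds constraints to register selection. Applying it keeps the register numbers used in the target code as close together as possible, which yields better compression.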

11.

Fast machine learning sensitive to spill code and memory access pressure

刘章林张兆庆《Application Research of Computers》2007,24(6):216-219

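Based on a genetic algorithm, this paper proposes a machine learning approach, sensitive to spill code and memory access pressure, for tuning the weight function of register allocation. Unlike previous work, which uses the running time of the target program as the fitness value, the fitness value is built by statically analyzing the spill code produced by register allocation and the memory access pressure in basic blocks, which reduces learning time. These analyses are restricted to hot functions, which further speeds up learning while preserving the accuracy of the fitness value. Experiments show that fast learning only needs to account for the compilation time of hot functions, and learning over the entire CPU2000 CINT test suite completes within 5 h. The performance of most CPU2000 CINT benchmarks improves, with perlbmk improving by up to 7.2%.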

12.

The effectiveness of producer‐side machine‐independent optimizations for mobile code

Wolfram Amme Jeffery von Ronne Philipp Adler Michael Franz 《Software》2009,39(10):923-946

Portable mobile code is often executed by a host virtual machine using just‐in‐time compilation. In this context, the compilation time in the host virtual machine is critical. This compilation time can be reduced if optimizations are performed ahead‐of‐time before distribution of the mobile code. Unfortunately, the portable nature of mobile code limits ahead‐of‐time optimizations to those that are machine‐independent. This work examines the effect of machine‐independent optimizations on the performance of mobile code applications. All experiments use the SafeTSA Format, a mobile code format that is based on Static Single Assignment Form (SSA Form). The experiments, which are performed on both the PowerPC and IA32 architectures, indicate that the effects of performing classical machine‐independent optimizations are, in fact, quite machine‐dependent. Nevertheless, the results demonstrate that applying such optimizations in a mobile code system can be beneficial. Copyright © 2009 John Wiley & Sons, Ltd.

13.

Region-based dual bank register allocation for reduced instruction encoding architectures

《Microprocessors and Microsystems》2017

In embedded systems, small code size is important due to memory constraints. One technique to achieve a small code size is reducing the instruction encoding from 32-bit to 16-bit, as in the ARM THUMB or MIPS-16 architectures. This half-size encoding leads to shorter register operands, making fewer registers available for register allocation and causing more spills, although invisible registers can be used as spill locations via copies. We propose reorganizing the original register file into dual banks, adding a bank toggle instruction for bank changes and inter-bank copies between the banks. We also propose an efficient dual-bank register allocation technique based on regions in the code to reduce spills. As a case study, we applied our banked register allocation model to the THUMB architecture. We found that the code size decreases by as much as 8% (5.8% on average) while the performance improves by as much as 11.1% (3.3% on average). Our results indicate that, for an embedded CPU that provides reduced encoding, organizing the register file into dual banks gives better register allocation quality than using the invisible registers for spills.

14.

The complementary relationship of interprocedural register allocation and inlining

Feipei Lai Yung-kuang Chao Chia-Jung Hsieh 《International journal of parallel programming》1994,22(4):409-434

Inline expansion and interprocedural register allocation are two general approaches used for interprocedural optimization. However, there are certain situations which prevent either of them from being applied smoothly to procedure calls. In particular, interactions between inlining and register allocation can cause an inlined version of a program to run more slowly than its noninlined counterpart. This paper describes a method of integrating inlining and interprocedural register allocation to reduce the procedure call overhead without this negative effect. We use profile information to identify the heavily called procedure regions and the register usage information of each code site to optimize the placement of the register save/restore code. This method also takes full advantage of free-use registers at each procedure call site. The average performance improvement is 1.21 compared with the previous schemes that performed either of them independently.

15.

A multi‐grained distributed implementation of the parallel Block Conjugate Gradient algorithm

A. Murli L. D'Amore G. Laccetti F. Gregoretti G. Oliva 《Concurrency and Computation》2010,22(15):2053-2072

The Block Conjugate Gradient algorithm (Block‐CG) was developed to solve sparse linear systems of equations that have multiple right‐hand sides. We have adapted it for use in heterogeneous, geographically distributed, parallel architectures. Once the main operations of the Block‐CG (Tasks) have been collected into smaller groups (subjobs), each subjob is matched by the middleware MJMS (MPI Jobs Management System) with a suitable resource selected among those which are available. Moreover, within each subjob, concurrency is introduced at two different levels and with two different granularities: the coarse‐grained parallelism to perform independent tasks and the fine‐grained parallelism within the execution of a task. We refer to this algorithm as the multi‐grained distributed implementation of the parallel Block‐CG. We compare the performance of a parallel implementation with that of the distributed implementation running on a variety of Grid computing environments. The middleware MJMS, developed by some of the authors and built on top of Globus Toolkit and Condor‐G, was used for co‐allocation, synchronization, scheduling and resource selection. Copyright © 2010 John Wiley & Sons, Ltd.

16.

KOALA: a co‐allocating grid scheduler

Hashim Mohamed Dick Epema 《Concurrency and Computation》2008,20(16):1851-1876

In multicluster systems, and more generally in grids, jobs may require co‐allocation, that is, the simultaneous or coordinated access of single applications to resources of possibly multiple types in multiple locations managed by different resource managers. Co‐allocation presents new challenges to resource management in grids, such as locating sufficient resources in geographically distributed sites, allocating and managing resources in multiple, possibly heterogeneous sites for single applications, and coordinating the execution of single jobs at multiple sites. Moreover, as single jobs now may have to rely on multiple resource managers, co‐allocation introduces reliability problems. In this paper, we present the design and implementation of a co‐allocating grid scheduler named KOALA that meets these co‐allocation challenges. In addition, we report on the results of an analysis of the performance in our multicluster testbed of the co‐allocation policies built into KOALA. We also include the results of a performance and reliability test of KOALA while our testbed was unstable. Copyright © 2007 John Wiley & Sons, Ltd.

17.

Incremental compiler transformations for multiple instruction retry

Shyh-Kwei Chen Neal J. Alewine W. Kent Fuchs Wen-Mei W. Hwu 《Software》1994,24(12):1179-1198

Previous work on compiler-based multiple instruction retry has utilized a series of compiler transformations, loop protection, node splitting, and loop expansion, to eliminate anti-dependencies of length ≤ N in the pseudo register, the machine register, and the post-pass resolver phases of compilation.¹ The results have provided a means of rapidly recovering from transient processor failures by rolling back N instructions. This paper presents techniques for improving compilation and run-time performance in compiler-based multiple instruction retry. Incremental updating enhances compilation time when new instructions are added to the program. Post-pass code rescheduling and spill register reassignment algorithms improve the run-time performance and decrease the code growth across the application programs studied. Branch hazards are shown to be resolvable by simple modifications to the incremental updating schemes during the pseudo register phase and to the spill register reassignment algorithm during the post-pass phase.

18.

A formal introduction to the compilation of Java

Stephan Diehl 《Software》1998,28(3):297-327

The term abstract machine is widely accepted to denote intermediate target languages and related architectures which serve as an intermediate stage in compiling programming languages. In this paper we explain how a considerable subset of Java is translated into Byte-Code for the Java Virtual Machine, an abstract machine used as a target for Java compilation. Using formal and precise notation we present the language concepts, the related byte-code instructions and the compilation schemes. Hitherto none of the existing literature on the JVM¹,² describes how compilation is done; it presents the JVM in isolation. © 1998 John Wiley & Sons, Ltd.

19.

Efficient Register Mapping and Allocation in LaTTe, an Open-Source Java Just-in-Time Compiler

Byung-Sun Yang Junpyo Lee SeungIl Lee Seongbae Park Yoo C. Chung Suhyun Kim Kemal Ebcioglu Erik Altman Soo-Mook Moon 《Parallel and Distributed Systems, IEEE Transactions on》2007,18(1):57-69

Java just-in-time (JIT) compilers improve the performance of a Java virtual machine (JVM) by translating Java bytecode into native machine code on demand. One important problem in Java JIT compilation is how to map stack entries and local variables to registers efficiently and quickly, since register-based computations are much faster than memory-based ones, while JIT compilation overhead is part of the whole running time. This paper introduces LaTTe, an open-source Java JIT compiler that performs fast generation of efficiently register-mapped RISC code. LaTTe first maps "all" local variables and stack entries into pseudoregisters, followed by real register allocation which also coalesces copies corresponding to pushes and pops between local variables and stack entries aggressively. Our experimental results indicate that LaTTe's sophisticated register mapping and allocation really pay off, achieving twice the performance of a naive JIT compiler that maps all local variables and stack entries to memory. It is also shown that LaTTe makes a reasonable trade-off between quality and speed of register mapping and allocation for the bytecode. We expect these results will also be beneficial to parallel and distributed Java computing: 1) by enhancing single-thread Java performance; and 2) by significantly reducing the number of memory accesses which the rest of the system must properly order to maintain coherence and keep threads synchronized.

20.

A Complete Characterization of Group-Strategyproof Mechanisms of Cost-Sharing

Emmanouil Pountourakis Angelina Vidali 《Algorithmica》2012,63(4):831-860

We study the problem of designing group-strategyproof cost-sharing mechanisms. The players report their bids for getting serviced and the mechanism decides a set of players that are going to be serviced and how much each one of them is going to pay. We determine three conditions: Fence Monotonicity, Stability of the allocation and Validity of the tie-breaking rule that are necessary and sufficient for group-strategyproofness, regardless of the cost function. Consequently, Fence Monotonicity characterizes group-strategyproof cost-sharing schemes closing an important open problem. Finally, we use our results to prove that there exist families of cost functions, where any group-strategyproof mechanism has arbitrarily poor budget balance.