首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 38 毫秒
1.
Instruction scheduling and register allocation are two of the most important optimization phases in modern compilers as they have a significant impact on the quality of the generated code. Unfortunately, the objectives of these two optimizations are in conflict with one another. The instruction scheduler attempts to exploit instruction‐level parallelism and requires many operands to be available in registers. On the other hand, the register allocator wants register pressure to be kept low so that the amount of spill code can be minimized. Currently these two phases are done separately, typically in three passes: prepass scheduling, register allocation and postpass scheduling. But this separation can lead to poor results. Previous works attempted to solve the phase‐ordering problem by combining the instruction scheduler with graph‐coloring‐based register allocators. The latter tend to be computationally expensive. Linear‐scan register allocators, on the other hand, are simple, fast and efficient. In this paper, we describe our effort to integrate instruction scheduling with a linear‐scan allocator. Furthermore, our integrated optimizer is able to take advantage of execution frequencies obtained through profiling. Our integrated register allocator and instruction scheduler achieved good code quality with significantly reduced compilation times. On the SPEC2000 benchmarks running on a 900 MHz ItaniumII, compared with OpenIMPACT, we halved the time spent in instruction scheduling and register allocation with negligible impact on execution times. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

2.
This paper presents a fundamentally new approach to global register allocation that optimally allocates registers and optimally places spill code, significantly decreasing spill code overhead compared with the traditional graph-coloring approach. The Optimal Register Allocation (ORA) approach formulates global register allocation as a 0–1 integer programming problem, incorporating all aspects of register allocation within a unified framework, including copy elimination, live range splitting, rematerialization, callee and caller register spilling, special instruction-operand requirements, and paired registers. A prototype ORA allocator is built into the Gnu C Compiler (GCC). For the SPEC92 integer benchmarks, the ORA allocator actually produces a net decrease of more than 100 million cycles across the entire benchmark set, because the dynamic copies the ORA allocator removes exceed the dynamic loads and stores that are inserted. In contrast, the GCC allocator and a Chaitin-style graph-coloring allocator each cause a net increase of more than 1 billion cycles. Because global register allocation is NP-complete, optimal register allocation has been considered intractable. However, the run-time complexity of the ORA approach is shown experimentally to be O(n3). A profile-guided hybrid allocation approach is proposed that uses the ORA allocator for the performance critical regions in the performance critical functions, while using a graph-coloring allocator for the non-critical functions and regions. An ORA-GCC hybrid allocator takes an average of 4.6 seconds per function to produce an allocation that is within 1% of optimal for 97% of the SPEC92 integer benchmark functions, showing that the hybrid allocator is practical as an advanced optimization for performance-critical codes.  相似文献   

3.
Compile-time reordering of low level instructions is successful in achieving large increases in performance of programs on fine grain parallel machines. However, because of the interdependences between instruction scheduling and register allocation, a lack of cooperation between the scheduler and register allocator can result in generating code that contains excess register spills and/or a lower degree of parallelism than actually achievable. This paper describes a strategy for providing cooperation between register allocation and both global and local instruction scheduling. We experimentally compare this strategy with other cooperative and uncooperative scenarios.  相似文献   

4.
一种基于寄存器压力的VLIW DSP分簇算法   总被引:1,自引:0,他引:1  
寄存器是程序运行时最宝贵的资源之一,软件流水在对VLIW DSP指令调度的同时,会显著增加寄存器的压力,从而导致寄存器溢出,软件流水中止。在以往的研究中,软件流水之前的指令分簇会更多地考虑指令并行性,往往会把寄存器的压力交给寄存器分配阶段,当物理寄存器不够分配时会造成寄存器溢出。通过考察指令运行时的寄存器压力情况对指令进行分簇,这样可根据各个簇的寄存器压力的动态信息减少寄存器的溢出,提高指令运行效率。  相似文献   

5.

As embedded memory technology evolves, the traditional Static Random Access Memory (SRAM) technology has reached the end of development. For deepening the manufacturing process technology, the next generation memory technology is highly required because of the exponentially increasing leakage current of SRAM. Non-volatile memories such as STT-MRAM (Spin Torque Transfer Magnetic Random Access Memory), PCM (Phase Change Memory) are good candidates for replacing SRAM technology in embedded memory systems. They have many advanced characteristics in the perspective of power consumption, leakage power, size (density) and latency. Nonetheless, nonvolatile memories have two major problems that hinder their use it the next-generation memory. First, the lifetime of the nonvolatile memory cell is limited by the number of write operations. Next, the write operation consumes more latency and power than the same size of the read operation. This study describes a compiler optimization technique to overcome such disadvantages of a nonvolatile memory component in hybrid cache memories. A hybrid cache is proposed to overcome the disadvantages using a compiler. Specifically, to minimize the number of write operations for nonvolatile memory, we present a data replacement technique that considers the locations of the register spill data. Many portions of the memory accesses are yielded by the spill data of a register allocator in an optimizing compiler. Such spill data can be partially removed using a recalculation method. Thus, we implemented an optimization technique that rearranges the data placement with recalculation to minimize the write instructions on the nonvolatile memory. Our experimental results show that the proposed technique can reduce the average number of spill codes by 20%, and improves the energy consumption by 20.2% on average.

  相似文献   

6.
Digital signal processors (DSPs) with very long instruction word (VLIW) data‐path architectures are increasingly being deployed on embedded devices for multimedia processing applications. To reduce the power consumption and design cost of VLIW DSP processors, distributed register files and multibank register architectures are being adopted to reduce the number of read and write ports associated with register files, which presents new challenges for devising compiler optimization schemes. This paper addresses the issues of reducing the spill code for a VLIW DSP with distributed register files. Spill code produced by register allocation is traditionally handled by memory spills, but the multibank register‐file architecture provides the opportunity to spill‐out register values onto different register banks. We present a conceptual framework based on the universal and the proxy interference graphs to model the live ranges of registers for spilling codes to different register banks. Heuristic algorithms are then developed on the basis of this concept. By heuristically estimating the register pressure for each register file, we treat different register banks as optional spilling locations in addition to traditional spilling to memory. Experiments were performed on the parallel architecture core VLIW DSP with distributed register files by incorporating our proposed optimization schemes into an Open64‐based compiler. The experimental results show that our approach can improve the performances on average for DSPStone and MiBench benchmarks with spilling cases by 7.1% and 21.6%, respectively, compared with the one always handling spill code in memory. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

7.
谓词执行能使分片式处理器充分利用众多的执行单元,开发指令级并行性.但因此形成的超块也使得分支误预测代价增大,所以提高分支预测器的性能至关重要.本文提出一种基于剖析信息决策的谓词执行技术,该技术利用剖析信息对谓词执行前后的执行周期进行估算,从而对分支的谓词执行进行决策.该技术使分支预测器的命中率提高了0.68%~3.50%,使系统性能提高了1.67%~8.33%.同时,利用select指令表示谓词化指令也消除了重命名阶段寄存器多定义问题.  相似文献   

8.
In embedded systems, small code size is important due to memory constraints. One technique to achieve a small code size is reducing the instruction encoding from 32-bit to 16-bit, such as the ARM THUMB or MIPS-16 architectures. This half-size encoding leads to shorter register operands, making fewer registers available for register allocation and causing more spills, although invisible registers can be used as spill locations via copies. We propose reconstructing the original register file into dual-banks, added with the bank toggle instruction for bank changes and the inter-bank copies between the banks. We also propose an efficient dual-bank register allocation technique based on regions in the code to reduce spills. As a case study, we applied our banked register allocation model for the THUMB architecture. We found that the code size decreases by as much as 8% (5.8% on average) while the performance improves by as much as 11.1% (3.3% on average). Our results indicate that we would better organize the register file of an embedded CPU that can provide reduced encoding into dual banks for better quality of register allocation, rather than using the invisible registers for spills.  相似文献   

9.
Previous work on compiler-based multiple instruction retry has utilized a series of compiler transformations, loop protection, node splitting, and loop expansion, to eliminate anti-dependencies of length ≤ N in the pseudo register, the machine register, and the post-pass resolver phases of compilation.1 The results have provided a means of rapidly recovering from transient processor failures by rolling back N instructions. This paper presents techniques for improving compilation and run-time performance in compiler-based multiple instruction retry. Incremental updating enhances compilation time when new instructions are added to the program. Post-pass code rescheduling and spill register reassignment algorithms improve the run-time performance and decrease the code growth across the application programs studied. Branch hazards are shown to be resolvable by simple modifications to the incremental updating schemes during the pseudo register phase and to the spill register reassignment algorithm during the post-pass phase.  相似文献   

10.
Minimizing the amount of spill code is still an open problem in code generation and optimization. The amount of spill code depends on both the register allocation algorithm and the pre‐allocation instruction scheduling algorithm that controls the register pressure. In this paper, we focus on the impact of pre‐allocation instruction scheduling on the amount of spill code. Many heuristic techniques have been proposed to do instruction scheduling with the objective of minimizing register pressure and consequently the amount of spill code. However, the performance of these heuristic techniques has not been studied relative to optimality on real large‐scale programs. In this paper, we present an experimental study that evaluates the performance of several pre‐allocation scheduling heuristics. The evaluation involves computing an experimental lower bound on the size of gap between each heuristic's performance and optimal performance. We also propose a simple heuristic technique based on a specific permutation of two basic priority schemes and experimentally evaluate the performance of this technique compared with other heuristics, including the heuristics implemented in the LLVM open‐source Compiler. The evaluation is carried out by running SPEC CPU2006 on real x86‐64 hardware and measuring both the amount of spill code and the execution time. The results of our study show that the proposed heuristic technique gives better overall performance than LLVM's best heuristic on x86‐64, although it produces slightly more spilling. The proposed heuristic has better overall performance, because it achieves a better balance between register pressure and instruction‐level parallelism (ILP). This result shows the importance of ILP in pre‐allocation scheduling even on out‐of‐order machines. Furthermore, the results of the study show that there is a large gap between the performance of any of the studied heuristics and optimal performance; even the best heuristic in the study produces significantly more spill code than the optimal amount. This experimental result quantifies the intuitive belief that it is unlikely to find a heuristic that works well in all cases, thus showing the need for more rigorous solutions using combinatorial approaches. The paper discusses the challenges and complexities that are involved in developing such rigorous solutions. Copyright © 2014 John Wiley & Sons, Ltd.  相似文献   

11.
This paper describes the design and implementation of a register allocator that performs the allocation over the Program Dependence Graph (PDG) representation of a routine. The PDG representation has been used successfully as the basis for various scalar optimizations, as well as for detecting and improving parallelization for vector machines, multiple processor machines, and architectures that exhibit instruction level parallelism. Variations of the PDG have also been used for debugging and integrating different versions of a program via program-slicing, and to enable translation of imperative programs for data-flow machines and demand-driven graph reducers. By basing register allocation on the PDG, the register allocation phase may be more easily integrated and intertwined with other optimization analyses and transformations. In addition, the advantages of a hierarchical approach to global register allocation can be attained without constructing an additional structure used solely for register allocation. © 1998 John Wiley & Sons, Ltd.  相似文献   

12.
多寄存器组网络处理器上的寄存器分配技术   总被引:1,自引:0,他引:1  
针对传统的图着色寄存器分配算法不能直接处理网络处理器的操作问题,提出了一种多寄存器组网络处理上的寄存器分配技术.在依次分析了一个符号寄存器可能位于哪些寄存器组?如果没有候选组,该如何解决这种冲突?如果有多个候选组,该选用哪个组等问题的基础上,通过将这些方法与图着色寄存器分配算法相融合,在IXP上实现了这种多寄存器组的寄存器分配,提高了它的可编程性.这种方法也可运用到其它具有类似寄存器结构的处理器上.  相似文献   

13.
VLIW DSP通过软件流水获得时间并行性,通过指令分簇获得空间并行性.指令的分簇本质上是资源分配问题.传统的指令分簇假设一条指令分到某一簇执行,而某些体系结构提供SIMD指令,传统的分簇算法对这类体系结构并不完全适用.提出的基于评估模型的分簇算法能对SIMD指令和普通指令进行合理的分簇.分簇之后,通过调度簇间传输指令,合成适当的簇间双字传输指令.由于SIMD和簇间双字传输的引入,以及较好的分簇决策,程序整体的调度延迟变短.对许多数字信号处理程序相对于没分簇的情况下的性能有2~3倍的性能提升,相对寄存器压力分簇算法有约7~10%性能的提升.  相似文献   

14.
Although technology advancement can pack more and more physical registers in processors, the numbers of architectural registers defined by the instruction set architectures (ISAs) remain relatively small on most modern processors. Exposing more architectural registers to compilers and programmers can improve the effectiveness of compiler optimization and the quality of code. However, increasing the number of architectural registers by simply adding extra bits to the register fields of instructions will expand the code size. Therefore, a better way of exposing more ISA registers without significantly expanding the code size is needed. This paper presents a new ISA called Floating Accumulator Architecture (FAA) that can expand the number of ISA registers without increasing the instruction length. Unlike the accumulator architecture whose accumulator is a fixed, special register, FAA dynamically chooses a register from the general-purpose register file as the accumulator. In other words, the accumulator in FAA is an alias to some register in the register file at any instruction, and the alias relation can be dynamically updated by FAA at any program points. Since the accumulator implicitly stores the result, the destination register field can be omitted from FAA instructions, resulting in a saving of 3 to 5 bits for each instruction. This new free instruction bit space can be utilized in two possible ways: doubling the number of ISA registers of modern 32-bit RISC processors or maintaining the number of ISA registers for 16-bit instructions on embedded processors. This paper presents the result of utilizing the free bit space to double the number of ISA registers from 16 to 32 on ARM processors, and experimental results show that performance can be improved by 7.6% on average for MediaBench benchmarks.  相似文献   

15.
基于遗传算法提出了溢出代码和访存压力敏感的机器学习来调试寄存器分配的权值函数。不同于以往采用目标程序的运行时间作为适应值,通过静态分析寄存器分配产生的溢出代码和基本块中的访存压力来构建适应值,以减少学习时间。这些分析被限定在热点函数中,在保证适应值精度的同时进一步加快了学习速度。实验表明,快速学习仅需要考虑热点函数的编译时间,整个CPU2000CINT测试集在5 h内即可学习完毕。大部分CPU2000CINT测试例子的性能得到了提高。其中perlbmk的性能提升最高可达到7.2%。  相似文献   

16.
针对X86系统仿真中基于静态寄存器分配的代码翻译机制导致的目标代码膨胀率高、翻译引擎和执行引擎间切换开销大两方面问题,提出了以寄存器映射、自定义指令和影子寄存器为基础的软硬协同优化方法。寄存器映射优化将对内存中模拟的源机器寄存器的操作转化为对本地机器寄存器操作,降低了翻译后目标代码膨胀率;自定义指令和影子寄存器优化将引擎切换时上下文的备份和恢复操作简化为2条自定义指令,提升了引擎切换效率。相比协同优化前,X86仿真系统Linux-0.2的翻译后目标代码膨胀率降低了21.9%,开关机时间获得了1.35的加速比。测试结果表明了该协同优化方法对于提升系统仿真效率具有可行性和有效性。  相似文献   

17.
王显著  李三立  黄震春 《计算机学报》1998,21(12):1112-1118
本文讨论了开发Java处理器的指令级并行性的策略,提出采用虚拟寄存器技术的Java处理器(VRJP)结构,并给出了判断相关性和管理虚拟寄存器的方法。分析和实验表明,VRJP能够有效地开发Java的指令级并行性,提高Java程序的执行效率。在VRJP中,大多数虚拟寄存器都不需要对应的物理寄存器,大大降低了物理寄存器的访问频率。  相似文献   

18.
Non-volatile memories are good candidates for DRAM replacement as main memory in embedded systems and they have many desirable characteristics. Nevertheless, the disadvantages of non-volatile memory co-exist with its advantages. First, the lifetime of some of the non-volatile memories is limited by the number of erase operations. Second, read and write operations have asymmetric speed or power consumption in non-volatile memory. This paper focuses on the embedded systems using non-volatile memory as main memory. We propose register allocation technique with re-computation to reduce the number of store instructions. When non-volatile memory is applied as the main memory, reducing store instructions will reduce write activities on non-volatile memory. To re-compute the spills effectively during register allocation, a novel potential spill selection strategy is proposed. During this process, live range splitting is utilized to split certain long live ranges such that they are more likely to be assigned into registers. In addition, techniques for re-computation overhead reduction is proposed on systems with multiple functional units. With the proposed approach, the lifetime of non-volatile memory is extended accordingly. The experimental results demonstrate that the proposed technique can efficiently reduce the number of store instructions on systems with non-volatile memory by 33% on average.  相似文献   

19.
In a recent study, we discovered that many single load/store operations in embedded applications can be parallelized and thus encoded simultaneously in a single‐instruction multiple‐data instruction, called the multiple load/store (MLS) instruction. In this work, we investigate the problem of utilizing MLS instructions to produce optimized machine code, and propose an effective approach to the problem. Specifically, we formalize the MLS problem, that is, the problem of maximizing the use of MLS instructions with an unlimited register file size. Based on this analysis, we show that we can solve the problem efficiently by translating it into a variant of the problem finding a maximum weighted path cover in a dynamic weighted graph. To handle a more realistic case of the finite size of the register file, our solution is then extended to take into account the constraints of register sequencing in MLS instructions and the limited register resource available in the target processor. We demonstrate the effectiveness of our approach experimentally by using a set of benchmark programs. In summary, our approach can reduce the number of loads/stores by 13.3% on average, compared with the code generated from existing compilers. The total code size reduction is 3.6%. This code size reduction comes at almost no cost because the overall increase in compilation time as a result of our technique remains quite minimal. Copyright © 2007 John Wiley & Sons, Ltd.  相似文献   

20.
Most performance enhancing mechanisms in current processors, such as branch predictors or prefetchers, rely on program characteristics monitored at the granularity of single instructions. However, many of these characteristics can be obtained at the basic block-level instead. The coarser granularity allows a larger portion of the code to be examined, enabling a more accurate profiling and a detailed analysis of the different types of instructions executed within a block. Therefore, block-level analysis can be advantageous for performance enhancing mechanisms, as it allows us to look at how the instructions influence each other, and thus detect complex behavior patterns.In this paper, we present the Dynamic Block-Level Execution Profiler (DBLEP), a basic block level online mechanism that profiles micro-architectural bottlenecks, such as delinquent memory loads, hard-to-predict branches and contention for functional units. DBLEP operates at the basic block level and provides information that can be used to reduce the impact of these bottlenecks. A prefetch dropping scheme and a memory controller policy were developed to use the code profiling information provided by DBLEP. By taking advantage of the high profiling accuracy, these mechanisms are able to improve the processor’s performance by up to 18.6% (5.3% on average). We show that our mechanism’s performance is comparable to mechanisms that work on single instruction granularity, using less hardware.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号