首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
田祖伟  孙光 《计算机科学》2010,37(5):130-133
程序中大量分支指令的存在,严重制约了体系结构和编译器开发并行性的能力。有效发掘指令级并行性的一个主要挑战是要克服分支指令带来的限制。利用谓词执行可有效地删除分支,将分支指令转换为谓词代码,从而扩大了指令调度的范围并且删除了分支误测带来的性能损失。阐述了基于谓词代码的指令调度、软件流水、寄存器分配、指令归并等编译优化技术。设计并实现了一个基于谓词代码的指令调度算法。实验表明,对谓词代码进行编译优化,能有效提高指令并行度,缩短代码执行时间,提高程序性能。  相似文献   

2.
Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply ifconversion effectively, one must address two major issues: what should be ifconverted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework based on partial reverse if-conversion that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.  相似文献   

3.
谓词执行能使分片式处理器充分利用众多的执行单元,开发指令级并行性.但因此形成的超块也使得分支误预测代价增大,所以提高分支预测器的性能至关重要.本文提出一种基于剖析信息决策的谓词执行技术,该技术利用剖析信息对谓词执行前后的执行周期进行估算,从而对分支的谓词执行进行决策.该技术使分支预测器的命中率提高了0.68%~3.50%,使系统性能提高了1.67%~8.33%.同时,利用select指令表示谓词化指令也消除了重命名阶段寄存器多定义问题.  相似文献   

4.
Instruction schedulers for superscalar and VLIW processors must expose sufficient instruction-level parallelism to the hardware in order to achieve high performance. Traditional compiler instruction scheduling techniques typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism for the frequent execution scenarios at the expense of the less freuent ones. Profile information identifies these important execution scenarios in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-assisted code scheduling techniques have been incorporated into the IMPACT-I compiler. These techniques are acyclic global scheduling and software pipelining. This paper describes the scheduling algorithms, highlights the modifications required to use profile information, and explains the hardware and compiler support for dealing with hazards that arise from aggressive use of profile information. The effectiveness of these profile-based scheduling techniques is evaluated for a range of superscalar and VLIW processors.  相似文献   

5.
支持有向有环图的微调度方法   总被引:1,自引:0,他引:1  
指令调度是编译器中的重要优化阶段.如何充分利用处理器结构相关的资源,发掘程序并行性,以提高编译优化性能和增强代码可适应性,一直是指令调度的研究难点之一.目前微调度已经取得了一定的效果,但对软件流水产生的有向有环图则未能提供支持.在ORC中提出并实现了一种基于IA-64体系结构的支持有向有环图的微调度方法,有效地减少了程序执行周期和流水线停顿,取得了较为满意的编译优化性能.  相似文献   

6.
Scalar performance on processors with instruction level parallelism (ILP) is often limited by control and data dependences. This paper describes a family of compiler techniques, called Critical Path Reduction (CPR) techniques, which reduce the length of critical paths through control and data dependences. Control CPR reduces the number of branches on the critical path and improves the performance of branch intensive codes on processors with inadequate branch throughput or excessive branch latency. Data CPR reduces the number of arithmetic operations on the critical path. Optimization and scheduling are adapted to support CPR.  相似文献   

7.
唐遇星  邓鹍  窦勇  周兴铭 《计算机学报》2007,30(11):1972-1981
分支指令与分支预测失败限制了处理器发掘指令级并行(ILP)的潜力.通过If-conversion或Predicated执行将程序中的控制相关转化为数据相关,能较好地降低分支预测开销.提出一种基于简化Trace结构的动态隐式断言执行机制(Dynamic Implicit Predication,DIP),而早期的相关研究主要集中于由编译器显式为宽发射处理器产生静态Predicated指令.无需编译器或者其他二进制工具的帮助,DIP可以在程序运行过程中识别可以进行断言变换的指令片断,完成指令转换与优化,并在以后的执行中使用优化后的指令Trace.基于SPEC2000模拟测试表明DIP可以有效避免错误的分支预测,提高并行度,单个程序的IPC平均提高10.3%,基准程序的平均加速比可达7.59%.  相似文献   

8.
To exploit larger amounts of instruction level parallelism, processors are being built with wider issue widths and larger numbers of functional units. Instruction fetch rate must also be increased in order to effectively exploit the performance potential of such processors. Block-structured ISAs provide an effective means of increasing the instruction fetch rate. We define an optimization, called block enlargement, that can be applied to a block-structured ISA to increase the instruction fetch rate of a processor that implements that ISA. We have constructed a compiler that generates block-structured ISA code, and a simulator that models the execution of that code on a block-structured ISA processor. We show that for the SPECint95 benchmarks, the block-structured ISA improves the performance of an aggressive wide issue, dynamically scheduled processor by 15% while using simpler microarchitectural mechanisms to support wide issue and dynamic scheduling.  相似文献   

9.
The superblock: An effective technique for VLIW and superscalar compilation   总被引:8,自引:1,他引:7  
A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called thesuperblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.  相似文献   

10.
The multiflow trace scheduling compiler   总被引:3,自引:0,他引:3  
The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.  相似文献   

11.
SMA:前瞻性多线程体系结构   总被引:4,自引:1,他引:3  
肖刚  周兴铭  徐明  邓鹍 《计算机学报》1999,22(6):582-590
提出了一种新的ILP处理器体系结构-前瞻性多线程体系的结构,简称SMA.它结合了前瞻性执行机制和多线程执行机制,以整个线程为长步进行前瞻性执行,多个线程并行执行并且共享处理器硬件资源,这样,处理器既通过组合每个线程的指令窗口形成一个大的动态指令窗口,开发出程序中更大的ILP,又利用多线程执行机制屏蔽各种长延迟操作,达到较高的资源利用率;介绍了SMA执行模型,并讨论了SMA处理器的实现和其中的关键技  相似文献   

12.
Speculative execution is the execution of instructions before it is known whether these instructions should be executed. In the speculative execution for instruction level parallelism (ILP) processors, the concept of shadow register provides a hardware solution to maintain semantics of a program from the pollution of boosted instructions that are incorrectly predicted. In a recent study, Chang and Lai proposed a special register file based on shadow register, named conjugate register file (CRF), to support multilevel boosting in speculative execution. They also proposed a scheduling heuristic named frequency-driven scheduling to incorporate with CRF for execution. However, the ability of boosting is still constrained since the concept of register pair will force the results produced speculatively be stored in dedicated locations. Moreover, when the parallelism potential increases to tens through the advancement of hardware techniques, the heavy demand on register usage and the complexity of register file may well become a serious bottleneck for the exploitation of ILP.In this paper, the algorithm of frequency-driven scheduling is modified by replacing the function of hardware CRF with the technique of variable renaming during compilation. The new scheduling technique, named LESS, can exploit the parallelism efficiently with limited number of registers. Moreover, since the technique can benefit ILP without any special hardware support, it can be incorporated with any other ILP architecture without changing its instruction set architecture (ISA).Simulation results show that the performance achievable by LESS is better than other existing methods. For example, under the ILP model with an issue rate of 8, the speculative execution can achieve an increase of 34% in parallelism, as compared to 18% in CRF scheme.  相似文献   

13.
新型体系结构概念—虚拟寄存器与并行的指令处理部件   总被引:4,自引:1,他引:3  
随着程序对地址空间的需求日益提高,研究者提出了虚拟存储器概念,使程序访问的地址空间免受物理存储器的限制。随着面向寄存器的RISC技术发展以及多发射结构中指令调度的日益重要,我们提出了虚拟寄存器的新概念,使寄存器空间不受物理寄存器堆大小的束缚,有利于指令调度和寄存器重新命名技术,提高指令级并行性ILP。此外,现代新型RISC处理机都着重于加强数据处理部件中的执行并行度,忽略了放在存储器中指令的处理。  相似文献   

14.
Region-based compilation: Introduction, motivation, and initial experience   总被引:1,自引:0,他引:1  
The most important task of a compiler designed to exploit instruction-level parallelism (ILP) is instruction scheduling. If higher levels of ILP are to be achieved, the compiler must use, as the unit of scheduling, regions consisting of multiple basic blocks—preferably those that frequently execute consecutively, and which capture cycles in the program’s execution. Traditionally, compilers have been built using the function as the unit of compilation. In this framework, function boundaries often act as barriers to the formation of the most suitable scheduling regions. Function inlining may be used to circumvent this problem by assembling strongly coupled functions into the same compilation unit, but at the cost of very large function bodies. Consequently, global optimizations whose compile time and space requirements are superlinear in the size of the compilation unit, may be rendered prohibitively expensive. This paper introduces a new approach, called region-based compilation, wherein the compiler, after inlining, repartitions the program into more desirable compilation units, termed regions. Region-based compilation allows the compiler to control problem size and complexity while exposing inter-procedural scheduling, optimization and code motion opportunities.  相似文献   

15.
In modern processors, instructions to perform operations are often produced before it becomes known that this is required. Such an expedient, which is called speculative execution, helps to reveal parallelism at the instruction level. In the EPIC architectures, the speculative execution is completely controlled by the compiler, which makes it possible to avoid using complex hardware mechanisms for supporting speculative instruction production. Moreover, the idea of the speculative execution can be used by the compiler in machine-independent optimizations. The paper describes a scheme of construction of the speculative optimization that is based on the selection of properties of the control flow and data flow that are important from the optimization standpoint and on the estimation of the probabilities of their fulfillment. The probabilities found are used for searching and constructing advantageous speculative and bookkeeping transformations. For optimizations that include only speculative movements of instructions upwards along the control flow graph, on the basis of the suggested scheme, a method has been developed that includes algorithms for finding probabilities of data and control dependences, for estimating benefit of speculative movements, and for constructing a recovery code. On the basis of this method, an algorithm for the speculative scheduling of instructions for the Intel Itanium architecture has been developed and implemented. Specific features of its implementation and experimental results are described.  相似文献   

16.
编译器提高程序并行性的主要障碍是:频繁的控制转移和模棱两可的内存访问。推断和推测是vliw处理器体系结构的新特点,为了消除分支或访存对指令级并行性识别的影响。指令调度是编译器挖掘程序指令级并行性的关键技术之一,本文论述了如何在指令调度中有效地利用推断和推测技术,提高程序的性能。  相似文献   

17.
In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility.In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant).This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.  相似文献   

18.
开发指令并行性的分支控制技术   总被引:1,自引:1,他引:0  
提高指令级并行性是现代计算机追求的目标之一,控制分支则为挖掘指令级并行提出了挑战性问题。为开发指令级并行性,现代计算机采用了两种分支控制技术即投机执行技术和判定执行技术。文章就这两种技术的实现进行了系统分析,并以Merced芯片的实现为例进行了说明。  相似文献   

19.
This work presents a static method implemented in a compiler for extracting high instruction level parallelism for the 32-bit QueueCore, a queue computation-based processor. The instructions of a queue processor implicitly read and write their operands, making instructions short and the programs free of false dependencies. This characteristic allows the exploitation of maximum parallelism and improves code density. Compiling for the QueueCore requires a new approach since the concept of registers disappears. We propose a new efficient code generation algorithm for the QueueCore. For a set of numerical benchmark programs, our compiler extracts more parallelism than the optimizing compiler for an RISC machine by a factor of 1.38. Through the use of QueueCore’s reduced instruction set, we are able to generate 20% and 26% denser code than two embedded RISC processors.  相似文献   

20.
基于GCC的IF转换算法的分析与改进   总被引:1,自引:0,他引:1  
分支指令是发掘指令级并行(ILP)的一个主要障碍。IF转换能够有效地删除指令流中的分支,通过删除程序中的莱些分支,将控制依赖转换为数据依赖。能够获得更好的调度效果。本文详细分析了GCC中基于IA-64谓词执行的IF转换算法,并改进了其算法。实验数据表明。优化效果明显。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号