首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 35 毫秒
1.
一种精确的分支预测微处理器模型   总被引:3,自引:0,他引:3  
在当今深流水宽发射的微处理器中,为实现高性能,精确的分支预测是不可缺少的关键技术.分支预测失效将浪费大量的时钟周期,无法发挥乱序执行的效能.宽发射微处理器的有效性能同时还依赖指令窗口的大小和指令预取宽度.提出了一种新的更精确的支持分支预测和分支误预测周期损失的微处理器模型.根据指令的执行带宽为指令窗口中可用指令数的平方根统计规律,给出了一个更为精确的描述微处理器取指带宽、分支预测精度、分支误预测周期损失、指令窗口大小和IPC之间关系的算法,并讨论了这些参数的综合权衡以及这些参数对程序IPC的影响.由此可以确定依赖多个微处理器参数的取指带宽阈值和微处理器中几个关键参数的选取.  相似文献   

2.
A deeply pipelined superscalar processor needs an accurate branch predictor in order to approach its performance potential. The 2-level branch predictors have been shown to achieve high prediction accuracy, yet they still suffer a significant number of mispredictions. It has been shown that a number of these mispredictions are due to interference in the pattern history tables. This paper details a method for reducing the amount of pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for these branches. We show that inhibiting the update in this manner reduces the amount of destructive interference in the global history variation of the 2-level branch predictor, resulting in significantly improved branch prediction accuracy for the SPEC 95 benchmarks. For example, for a 2 K Byte gshare predictor, we eliminate 38% of the mispredictions for the gcc benchmark.  相似文献   

3.
Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications.  相似文献   

4.
分支目标缓存(BTB)是高端嵌入式CPU的主要耗能部件之一。针对BTB访问中引入的冗余功耗问题,提出了一种循环体访问过滤机制消除循环体指令流中顺序指令对BTB的无效访问。进一步提出了一种分支跟踪方法补偿循环过滤机制对循环体中非循环类分支指令的错误过滤造成的性能损失,节省了循环体指令流中顺序指令访问BTB的大量冗余功耗。基于Powerstone基准程序的仿真实验表明,在128表项BTB配置下,二级循环过滤器和4表项分支踪迹表可以减少约71.9%的BTB功耗,而平均每条指令周期数(CPI)退化仅为0.66%。  相似文献   

5.
任建  安虹  路放  梁博 《计算机科学》2006,33(3):239-243
同时多线程处理器(SMT)每个周期能够从多个线程中发射指令执行,从而大大地提高了超标量微处理器的指令吞吐量,但多个线程的同时执行也带来了许多硬件资源的共享冲突问题.其中,多个线程共享分支预测硬件的方案会对分支预测精度产生较大的影响.研究SMT处理器中分支处理方案对于处理器整体性能的影响,对于指导SMT处理器的设计是十分重要的.本文利用SMT处理器模拟器,针对各线程运行独立应用的SMT结构实验评估了几种著名的分支预测方案;给出了在单线程和多线程情况下,分支预测方案对分支预测精度和处理器整体性能的影响的分析;总结出在这样的SMT结构中,各线程拥有独立的预测器是一种较好的选择,并且由于各独立预测器可以采用小而简单的结构,所以不会带来太多的硬件开销.  相似文献   

6.
In our previously published research we discovered some very difficult to predict branches, called unbiased branches. Since the overall performance of modern processors is seriously affected by misprediction recovery, especially these difficult branches represent a source of important performance penalties. Our statistics show that about 28% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and depend on critical Loads, too. In the same way, about 21% of branches depend on MUL/DIV instructions whereas 3.76% are unbiased and depend on MUL/DIV instructions. These dependences involve high-penalty mispredictions becoming serious performance obstacles and causing significant performance degradation in executing instructions from wrong paths. Therefore, the negative impact of (unbiased) branches over global performance should be seriously attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding instructions’ long latencies in a pipelined superscalar processor represents an important challenge itself. We developed a superscalar architecture that selectively anticipates the values produced by high-latency instructions. In this work we are focusing on multiply, division and loads with miss in L1 data cache, implementing a dynamic instruction reuse scheme for the MUL/DIV instructions and a simple last value predictor for the critical Load instructions. Our improved superscalar architecture achieves an average IPC speedup of 3.5% on the integer SPEC 2000 benchmarks, of 23.6% on the floating-point benchmarks, and an improvement in energy-delay product (EDP) of 6.2% and 34.5%, respectively. We also quantified the impact of our developed selective instruction reuse and value prediction techniques in a simultaneous multithreaded architecture (SMT) that implies per thread reuse buffers and load value prediction tables. Our simulation results showed that the best improvements on the SPEC integer applications have been obtained with 2 threads: an IPC speedup of 5.95% and an EDP gain of 10.44%. Although, on the SPEC floating-point programs, we obtained the highest improvements with the enhanced superscalar architecture, the SMT with 3 threads also provides an important IPC speedup of 16.51% and an EDP gain of 25.94%.  相似文献   

7.
Data value prediction has been widely accepted as an effective mechanism to break data hazards for high performance processor design. Several works have reported promising performance potential. However, there is hardly enough information that is presented in a clear way about performance comparison of these prediction mechanisms. This paper investigates the performance impact of four previously proposed value predictors, namely last value predictor, stride value predictor, two-level value predictor and hybrid (stride two-level) predictor. The impact of misprediction penalty, which has been frequently ignored, is discussed in detail. Several other implementation issues, including instruction window size, issue width and branch predictor are also addressed and simulated. Simulation results indicate that data value predictors act differently under different configurations. In some cases, simpler schemes may be more beneficial than complicated ones. In some particular cases, value prediction may have negative impact on performance.  相似文献   

8.
This paper introduces alloyed prediction, a new hardware-based two-level branch predictor organization that combines global and local history in the same structure, combining the advantages of current two-level predictors with those of hybrid predictors. The alloyed organization is motivated by measurements showing that wrong-history mispredictions are even more important than conflict-induced mispredictions. Wrong-history mispredictions arise because current two-level, history-based predictors provide only global or only local history. The contribution of wrong history to the overall misprediction rate is substantial because most programs have some branches that require global history and others that require local history. This paper explores several ways to implement alloyed prediction, including the previously proposed bi-mode organization. Simulations show that mshare is the best alloyed organization among those we examine, and that mshare gives reliably good prediction compared to bimodal (two-bit), two-level, and hybrid predictors. The robust performance of alloying across a range of predictor sizes stems from its ability to attack wrong-history mispredictions at even very small sizes without subdividing the branch prediction hardware into smaller and less effective components.  相似文献   

9.
一种有效的同时多线程处理器取指控制机制   总被引:1,自引:0,他引:1  
同时多线程处理器通过每时钟周期从多个运行的线程取指令执行,极大地提高了处理器的性能.分支预测器的预测精度和取指策略的效率是影响同时多线程处理器性能的关键.通过将一个基于值的分支预测器和一个基于线程推进速度的取指策略相结合,提出一种新的取指控制机制.该结构的硬件开销较小,实现复杂度较低.实验结果表明,该取指控制机制有效地提高了处理器的性能,其相对于传统取指控制机制的性能加速比为28%且该加速比也高于目前基于流缓冲区和基于分支分类器的取指控制机制.  相似文献   

10.
Predicting indirect-branch targets has become a performance bottleneck for many applications.Previous highperformance indirect-branch predictors usually require significant hardware storage or additional compiler support,which increases the complexity of the processor front-end or the compilers.This paper proposes a complexity-effective indirectbranch prediction mechanism,called the Set-Way Index Pointing (SWIP) prediction.It stores multiple indirect-branch targets in different branch target buffer (BTB) entries,whose set indices and way locations are treated as set-way index pointers.These pointers are stored in the existing branch-direction predictor.SWIP prediction reuses the branch direction predictor to provide such pointers,and then accesses the pointed BTB entries for the predicted indirect-branch target.Our evaluation shows that SWIP prediction could achieve attractive performance improvement without requiring large dedicated storage or additional compiler support.It improves the indirect-branch prediction accuracy by 36.5% compared to that of a commonly-used BTB,resulting in average performance improvement of 18.56%.Its energy consumption is also reduced by 14.34% over that of the baseline.  相似文献   

11.
熊振亚  林正浩  任浩琪 《计算机科学》2017,44(3):195-201, 214
现代计算机体系结构受两个方面的困扰:性能和能耗。为降低嵌入式处理器日益增长的功耗,提出基于跳转轨迹的分支目标缓冲结构(TG-BTB)。与传统分支目标缓冲每次提取指令时需要查询分支目标缓冲不同,TG-BTB只在执行轨迹预测为跳转时才查询分支目标缓冲。该结构通过在程序执行过程中动态分析跳转轨迹行为,可以实现只在轨迹跳转时查询分支目标缓冲,从而降低功耗。在动态分析过程中首先提取记录两条跳转分支指令之间的指令间隔,然后将提取的指令间隔存储在TG-BTB中,最后根据存储在TG-BTB中的指令间隔决定是否需要查询BTB。基于基准测试向量进行模型验证和性能测试,实验结果表明TG-BTB降低了81%的BTB查询能耗。  相似文献   

12.
Lalja  D. J. 《Computer》1988,21(7):47-55
A probabilistic model is developed to quantify the performance effects of the branch penalty in a typical pipeline. The branch penalty is analyzed as a function of the relative number of branch instructions executed and the probability that a branch is taken. The resulting model shows the fraction of maximum performance achievable under the given conditions. Techniques to reduce the branch penalty include static and dynamic branch prediction, the branch target buffer, the delayed branch, branch bypassing and multiple prefetching, branch folding, resolution of branch decision early in the pipeline, using multiple independent instruction streams in a shared pipeline, and the prepare-to-branch instruction  相似文献   

13.
The schedulability analysis of real-time embedded systems requires worst case execution time (WCET) analysis for the individual tasks. Bounding WCET involves not only language-level program path analysis, but also modeling the performance impact of complex micro-architectural features present in modern processors. In this paper, we statically analyze the execution time of embedded software on processors with speculative execution. The speculation of conditional branch outcomes (branch prediction) significantly improves a program's execution time. Thus, accurate modeling of control speculation is important for calculating tight WCET estimates. We present a parameterized framework to model the different branch prediction schemes. We further consider the complex interaction between speculative execution and instruction cache performance, that is, the fact that speculatively executed blocks can generate additional cache hits/misses. We extend our modeling to capture this effect of branch prediction on cache performance. Starting with the control flow graph of a program, our technique uses integer linear programming to estimate the program's WCET. The accuracy of our method is demonstrated by tight estimates obtained on realistic benchmarks.  相似文献   

14.
By executing two or more threads concurrently, Simultaneous MultiThreading (SMT) architectures are able to exploit both Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) from the increased number of in-flight instructions that are fetched from multiple threads. However, due to incorrect control speculations, a significant number of these in-flight instructions are discarded from the pipelines of SMT processors (which is a direct consequence of these pipelines getting wider and deeper). Although increasing the accuracy of branch predictors may reduce the number of instructions so discarded from the pipelines, the prediction accuracy cannot be easily scaled up since aggressive branch prediction schemes strongly depend on the particular predictability inherently to the application programs. In this paper, we present an efficient thread scheduling mechanism for SMT processors, called SAFE-T (Speculation-Aware Front-End Throttling): it is easy to implement and allows an SMT processor to selectively perform speculative execution of threads according to the confidence level on branch predictions, hence preventing wrong-path instructions from being fetched. SAFE-T provides an average reduction of 57.9% in the number of discarded instructions and improves the instructions per cycle (IPC) performance by 14.7% on average over the ICOUNT policy across the multi-programmed workloads we simulate. This paper is an extended version of the paper, “Speculation Control for Simultaneous Multithreading,” which appeared in the Proceedings of the 18th International Parallel and Distributed Processing Symposium, Santa Fe, New Mexico, April 2004.  相似文献   

15.
Microarchitects should consider power consumption, together with accuracy, when designing a branch predictor, especially in embedded processors. This paper proposes a power-aware branch predictor, which is based on the gshare predictor, by accessing the BTB (Branch Target Buffer) selectively. To enable the selective access to the BTB, the PHT (Pattern History Table) in the proposed branch predictor is accessed one cycle earlier than the traditional PHT if the program is executed sequentially without branch instructions. As a side effect, two predictions from the PHT are obtained through one access to the PHT, resulting in more power savings. In the proposed branch predictor, if the previous instruction was not a branch and the prediction from the PHT is untaken, the BTB is not accessed to reduce power consumption. If the previous instruction was a branch, the BTB is always accessed, regardless of the prediction from the PHT, to prevent the additional delay/accuracy decrease. The proposed branch predictor reduces the power consumption with little hardware overhead, not incurring additional delay and never harming prediction accuracy. The simulation results show that the proposed branch predictor reduces the power consumption by 29-47%.  相似文献   

16.
Conditional branches incur a severe performance penalty in wide-issue, deeply pipelined processors. Speculative execution(1, 2) and predicated execution(3–9) are two mechanisms that have been proposed for reducing this penalty. Speculative execution can completely eliminate the penalty associated with a particular branch, but requires accurate branch prediction to be effective. Predicated execution does not require accurate branch prediction to eliminate the branch penalty, but is not applicable to all branches and can increase the latencies within the program. This paper examines the performance benefit of using both mechanisms to reduce the branch execution penalty. Predicated execution is used to handle the hard-to-predict branches and speculative execution is used to handle the remaining branches. The hard-to-predict branches within the program are determined by profiling. We show that this approach can significantly reduce the branch execution penalty suffered by wide-issue processors.  相似文献   

17.
传统的分支目标缓冲器(BTB)每个取指周期都要进行访问,由于程序中的分支指令只占总指令数的20%左右,使得大约80%的BTB访问都是无效的.为此,利用程序控制流中分支指令间距固定的特性,提出一种对性能影响极小的BTB跳跃访问算法.在BTB中存储分支指令到运行路径中下一条分支指令的距离,BTB命中后,根据相应的分支距离来关闭当前分支指令与下一条分支指令之间的BTB访问,以有效地提高访问效率并降低动态功耗.该算法在嵌入式处理器中实现时只控制预测跳转分支指令的BTB跳跃访问,减少了硬件资源的开销.在硬件模型上进行模拟和综合后的结果表明,在128分支项的BTB中,采用文中算法可以降低72%的动态功耗,而性能损失仅为0.013%.  相似文献   

18.
As microprocessor designs move towards deeper pipelines and support for multiple instruction issue, steps must be taken to alleviate the negative impact of branch operations on processor performance. One approach is to use branch prediction hardware and perform speculative execution of the instructions following an unresolved branch. Another technique is to eliminate certain branch instructions altogether by translating the instructions following a forward branch into predicate form. Both these techniques are employed in many current processor designs. This paper investigates the relationship between branch prediction techniques and branch predication. In particular, we are interested in how using predication to remove a certain class of poorly predicted branches affects the prediction accuracy of the remaining branches. A variety of existing predication models for eliminating branch operations are presented, and the effect that eliminating branches has on branch prediction schemes ranging from simple prediction mechanisms to the newer more sophisticated branch predictors is studied. We also examine the impact of predication on basic block size, and how the two techniques used together affect overall processor performance.  相似文献   

19.
Kim  H. Joao  J.A. Mutlu  O. Patt  Y.N. 《Micro, IEEE》2007,27(1):94-104
The branch misprediction penalty is a major performance limiter and a major cause of wasted energy in high-performance processors. The diverge-merge processor reduces this penalty by dynamically predicating a wide range of hard-to-predict branches at runtime in an energy-efficient way that doesn't significantly increase hardware complexity or require major ISA changes  相似文献   

20.
针对超标量处理器中长周期执行指令延迟退休及持续译码导致的重排序缓存(ROB)阻塞问题,提出一种指令乱序提交机制。通过设计容量可配置的多缓存指令提交结构,实现存储器操作指令和ALU类型指令的分类退休,根据超标量处理器架构及性能需求对目标缓存和存储缓存容量进行参数化配置降低流水线阻塞风险,同时利用指令目的寄存器编码提交模式加快指令提交速率。实验结果表明,该机制提高了单次指令提交数量,基于该机制的超标量处理器相比传统基于ROB顺序提交机制的超标量处理器在减少硬件开销的情况下平均IPC指数提升46%,相比基于值预测、乱序退休和组提交的超标量处理器平均IPC指数增益为19%,综合性能更优。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号