首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到18条相似文献,搜索用时 125 毫秒
1.
分支目标缓存(BTB)是高端嵌入式CPU的主要耗能部件之一。针对BTB访问中引入的冗余功耗问题,提出了一种循环体访问过滤机制消除循环体指令流中顺序指令对BTB的无效访问。进一步提出了一种分支跟踪方法补偿循环过滤机制对循环体中非循环类分支指令的错误过滤造成的性能损失,节省了循环体指令流中顺序指令访问BTB的大量冗余功耗。基于Powerstone基准程序的仿真实验表明,在128表项BTB配置下,二级循环过滤器和4表项分支踪迹表可以减少约71.9%的BTB功耗,而平均每条指令周期数(CPI)退化仅为0.66%。  相似文献   

2.
分支误预测是影响高性能处理器性能进一步提升的一个主要因素.现代处理器采用分支目标缓存(branch target buffer,BTB)预测分支指令的目标地址,BTB的预测精度受限于其命中率.由于程序中分支指令的分布并不均匀,传统的BTB索引方式无法充分利用BTB资源,从而造成不必要的冲突缺失,影响分支目标地址的预测精度,采用散列索引方式优化访问映射关系是有效解决方法之一.当前大量文献研究了cache的访问方式,但对BTB的散列索引算法的专门探讨则显不足.为了消除分支指令的分布空洞,离散分支指令和BTB条目的固有映射关系,设计了用于BTB索引的XOR散列算法和优化的bit-select索引算法,使用概率方法对BTB单组最大映射数期望的上界作了估计,并对这两种散列索引算法的效果进行了模拟评估.实验结果表明,散列映射方式能够较好地避免BTB冲突缺失造成的预测失败,XOR散列算法的离散效果更好.  相似文献   

3.
熊振亚  林正浩  任浩琪 《计算机科学》2017,44(3):195-201, 214
现代计算机体系结构受两个方面的困扰:性能和能耗。为降低嵌入式处理器日益增长的功耗,提出基于跳转轨迹的分支目标缓冲结构(TG-BTB)。与传统分支目标缓冲每次提取指令时需要查询分支目标缓冲不同,TG-BTB只在执行轨迹预测为跳转时才查询分支目标缓冲。该结构通过在程序执行过程中动态分析跳转轨迹行为,可以实现只在轨迹跳转时查询分支目标缓冲,从而降低功耗。在动态分析过程中首先提取记录两条跳转分支指令之间的指令间隔,然后将提取的指令间隔存储在TG-BTB中,最后根据存储在TG-BTB中的指令间隔决定是否需要查询BTB。基于基准测试向量进行模型验证和性能测试,实验结果表明TG-BTB降低了81%的BTB查询能耗。  相似文献   

4.
嵌入式处理器动态分支预测机制研究与设计   总被引:2,自引:1,他引:1  
黄伟  王玉艳  章建雄 《计算机工程》2008,34(21):163-165
针对嵌入式处理器的特定应用环境,通过对传统神经网络算法的改进,结合定制的分支目标缓冲,提出一种复合式动态分支预测机制。该机制基于全局索引方式,对BTB结构进行定制设计,实现对循环逻辑中最后一条分支指令的精确预测。实验结果表明,该动态分支预测机制能降低硬件复杂度,提高预测精度。  相似文献   

5.
通过研究处理器动态分支预测器中预测效率与分支历史长度的关系,针对程序中各分支指令存在不同最优历史长度的规律,提出一种搜索各分支指令最佳历史长度的分支预测方法.该方法通过实时监测分支指令的预测准确率,在分支预测表硬件资源不变的情况下动态调整预测器的历史长度,以适应程序的动态运行特征.实验结果表明,在相同硬件资源下,文中方法相对于Gshare预测器错误率降低15.8%,相对于Bi-mode预测器预测错误率降低10.3%.  相似文献   

6.
一种低功耗动态可重构cache算法的研究   总被引:1,自引:0,他引:1  
动态可重构cache算法根据指令时间数监测程序段的变化,确定容量调整.在程序段内,状态机根据平均访问时间对cache的访问进行预判,然后根据预判的结果确定当前程序段的cache结构.实验结果表明,此算法比传统四路组相联cache功耗降低61%,而性能损失只有2%左右.与已有算法相比,功耗和性能都得到进一步的提高.  相似文献   

7.
高性能低功耗的容错编译技术:错误流压缩算法   总被引:1,自引:1,他引:1  
高珑  杨学军 《软件学报》2006,17(12):2425-2437
在许多关键应用中,计算机的高性能、低功耗和高可靠性是必须同时满足的要求.传统的软件容错技术频繁使用和比较分支指令检测错误,带来了巨大的性能和功耗的开销.提出了基于计算数据流模型的错误流模型,并设计了错误流压缩算法.在错误流压缩算法中,利用附加计算压缩了错误流的直径,显著减少了分支指令的数量,而总指令数不变.针对StreamIT提供的快速傅立叶变换测试程序,采用Wattch对错误流压缩算法进行模拟测试.实验结果表明,当循环参数n=225时,与传统的EDDI算法相比,使用错误流压缩算法可减少分支指令24%以上,IPC提高超过12%,同时,功耗减少了将近5%.给出的推算表明:在该实验中,如果内层循环体的存储指令数量为8,分支指令的减少可以达到43%以上.  相似文献   

8.
嵌入式处理器中Cache的应用极大地提高了处理器的性能,同时Cache,尤其是指令Cache功耗占据了处理器很大一部分功耗,关闭不必要的tag SRAM和data SRAM的访问,可以极大地降低功耗。提出了一种流水化的指令Cache访问机制,关闭不必要的data SRAM的访问;并且通过记录指令Cache行的信息和预测下一行的Cache形成一个Cache行滑动窗口,关闭不必要的tag SRAM访问。所提出的方法没有性能损失,在SMIC 90nm工艺下进行功耗分析,其指令访问的功耗降低50%。  相似文献   

9.
分支调度是一种有效消除分支指令延迟的指令调度技术,对于提升VLIW类处理器的性能非常重要。提出了一个针对分支延迟槽的指令调度优化算法。该算法面向VLIW体系结构,根据程序依赖图选择合适的候选指令序列;通过建立代价收益模型为分支延迟槽产生一个收益较大的指令调度序列。实验数据表明,分支调度算法可以平均提升12.9%的应用程序性能。  相似文献   

10.
混合Cache的低功耗设计方案   总被引:1,自引:0,他引:1       下载免费PDF全文
在嵌入式处理器中,Cache的功耗所占的比重越来越大。为降低嵌入式系统中混合Cache的功耗,引入一种基于程序段的重构算法——PPBRA,并提出一种新的基于分类访问的可重构混合Cache结构,该方案能够根据不同程序段对Cache容量的需求,动态地分配混合Cache的指令路数和数据路数,还能够对混合Cache进行分类访问,过滤对不必要路的访问,从而实现降低混合Cache的功耗的目的。Mibench仿真结果表明,该方案在有效降低Cache功耗的同时,还能提高Cache的综合性能。  相似文献   

11.
Predicting indirect-branch targets has become a performance bottleneck for many applications.Previous highperformance indirect-branch predictors usually require significant hardware storage or additional compiler support,which increases the complexity of the processor front-end or the compilers.This paper proposes a complexity-effective indirectbranch prediction mechanism,called the Set-Way Index Pointing (SWIP) prediction.It stores multiple indirect-branch targets in different branch target buffer (BTB) entries,whose set indices and way locations are treated as set-way index pointers.These pointers are stored in the existing branch-direction predictor.SWIP prediction reuses the branch direction predictor to provide such pointers,and then accesses the pointed BTB entries for the predicted indirect-branch target.Our evaluation shows that SWIP prediction could achieve attractive performance improvement without requiring large dedicated storage or additional compiler support.It improves the indirect-branch prediction accuracy by 36.5% compared to that of a commonly-used BTB,resulting in average performance improvement of 18.56%.Its energy consumption is also reduced by 14.34% over that of the baseline.  相似文献   

12.
Modern processors access the branch target buffer (BTB) every cycle to speculate branch target addresses. This aggressive approach improves performance as it results in early identification of target addresses. However, unfortunately, such accesses, quite often, are unnecessary as there is no control flow instruction among those fetched.In this work, we introduce speculative BTB access to address this design inefficiency. Our technique relies on a simple power efficient structure, referred to as the BLC-filter, to identify cycles where there is no control flow instruction among those fetched, at least one cycle in advance. By identifying such cycles and eliminating unnecessary BTB accesses we reduce BTB power dissipation (and therefore power density).  相似文献   

13.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. SLC has the features of low power consumption and low access delay, similar to a direct mapping cache, and a high cache hit rate similar to a two way-set associative cache by adding a new link field. In addition, we developed another design known as a multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, traditional cache, and filter cache in terms of the cache hit rate, power consumption, and execution time.  相似文献   

14.
In modern processors, deep pipelines couple with superscalar techniques to allow each pipe stage to process multiple instructions. When such a pipe must be flushed and refilled, as when predicted program flow beyond a branch is subsequently recognized as wrong, the temporary performance loss is significant. While modern branch target buffer (BTB) technology makes this flush/refill penalty fairly rare, the penalty that accrues from the remaining branch mispredictions is a serious impediment to even higher processor performance. Advanced mechanisms that can reduce this residual misprediction penalty can be of enormous value in future microprocessor designs. In this paper we describe the design and performance of a promising new mechanism called the Misprediction Recovery Cache (MRC). The key results of our study are. (1) Small, finite sized MRCs (16 to 256 entry) can effectively reduce branch penalty in deeply pipelined processors. (2) Commercial Benchmarks such as the Winstone benchmarks make better use of larger M RCs due to large number of unique branch instructions unlike the predominantly technical SPECint benchmarks. (3) The MRC hit rates increase with increasing BTB prediction accuracy (5-200% depending on MRC size) due to fewer residual mispredictions associated with better prediction. (4) For the processor architecture we studied, the M RC resulted in up to 20% improvement in cpi(cycles per instruction). (5) The incremental performance gain achievable by adding an MRC to a modern CISC processor (which uses a BTB with a two-level predictor) is two to three times of what was achievable by going from a one-level predictor to a two-level predictor.  相似文献   

15.
Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications.  相似文献   

16.
Execution along mispredicted paths may or may not affect the accuracy of subsequent branch predictions if recovery mechanisms are not provided to undo the erroneous information that is acquired by the branch prediction storage structures. In this paper, we study four elements of the Two-Level Branch Predictor: the Branch Target Buffer (BTB), the Branch History Register (BHR), the Pattern History Tables (PHTs), and the Return Address Stack (RAS). For each we determine whether a recovery mechanism is needed, and, if so, show how to design a cost-effective one. Using five benchmarks from the SPECint92 suite, we show that there is no need to provide recovery mechanisms for the BTB and the PHTs, but that performance is degraded by an average of 30% if recovery mechanisms are not provided for the BHR and RAS.  相似文献   

17.
Microarchitects should consider power consumption, together with accuracy, when designing a branch predictor, especially in embedded processors. This paper proposes a power-aware branch predictor, which is based on the gshare predictor, by accessing the BTB (Branch Target Buffer) selectively. To enable the selective access to the BTB, the PHT (Pattern History Table) in the proposed branch predictor is accessed one cycle earlier than the traditional PHT if the program is executed sequentially without branch instructions. As a side effect, two predictions from the PHT are obtained through one access to the PHT, resulting in more power savings. In the proposed branch predictor, if the previous instruction was not a branch and the prediction from the PHT is untaken, the BTB is not accessed to reduce power consumption. If the previous instruction was a branch, the BTB is always accessed, regardless of the prediction from the PHT, to prevent the additional delay/accuracy decrease. The proposed branch predictor reduces the power consumption with little hardware overhead, not incurring additional delay and never harming prediction accuracy. The simulation results show that the proposed branch predictor reduces the power consumption by 29-47%.  相似文献   

18.
We present a new memory access optimization for Java to perform aggressive code motion for speculatively optimizing memory accesses by applying partial redundancy elimination (PRE) techniques. First, to reduce as many barriers as possible and to enhance code motion, we perform alias analysis to identify all the regions in which each object reference is not aliased. Secondly, we find all the possible barriers. Finally, we perform code motions in three steps. For the first step, we apply a non‐speculative PRE algorithm to move load instructions and their following instructions in the backwards direction of the control flow graph. For the second step, we apply a speculative PRE algorithm to move some of them aggressively before the conditional branches. For the third step, we apply our modified version of a non‐speculative PRE algorithm to move store instructions in the forward direction of the control flow graph and to even move some of them after the merge points. We implemented our new algorithm in our production‐level Java just‐in‐time compiler. Our experimental results show that our speculative algorithm improves the average (maximum) performance by 13.1% (90.7%) for jBYTEmark and 1.4% (4.4%) for SPECjvm98 over the fastest algorithm previously described, while it increases the average (maximum) compilation time by 0.9% (2.9%) for both benchmark suites. Copyright © 2004 John Wiley & Sons, Ltd.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号