1.

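喻明艳, 张祥建, 杨兵 《计算机辅助设计与图形学学报》 (Journal of Computer-Aided Design & Computer Graphics) 2010, 22(4)

A conventional branch target buffer (BTB) is accessed in every fetch cycle. Because branch instructions make up only about 20% of all instructions in a program, roughly 80% of BTB accesses are wasted. To address this, the paper exploits the fact that the distance between branch instructions in a program's control flow is fixed, and proposes a BTB skip-access algorithm with minimal performance impact. The BTB stores the distance from each branch instruction to the next branch instruction on the execution path. After a BTB hit, the stored branch distance is used to disable BTB accesses between the current branch and the next one, which improves access efficiency and reduces dynamic power. When implemented in an embedded processor, the algorithm controls skip access only for predicted-taken branches, which reduces hardware overhead. Simulation and synthesis on a hardware model show that, for a 128-entry BTB, the algorithm reduces dynamic power by 72% with a performance loss of only 0.013%.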
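The skip-access idea in this abstract can be illustrated with a small trace-driven sketch. This is not the authors' hardware design: the trace format, the `BTBEntry` fields, the training rule, and the function name `count_btb_lookups` are all assumptions made for illustration, and only taken branches are stored, following the abstract's remark about the embedded implementation.

```python
from dataclasses import dataclass


@dataclass
class BTBEntry:
    target: int      # predicted target of the taken branch
    skip: int = 0    # fetches until the next taken branch; 0 = not yet learned


def count_btb_lookups(trace, use_skip):
    """Count BTB lookups for a trace of (pc, target) pairs.

    target is None unless the instruction at pc is a taken branch.
    """
    btb = {}            # pc -> BTBEntry (hypothetical fully associative BTB)
    lookups = 0
    skip_left = 0       # fetches during which the BTB stays disabled
    last_taken = None   # pc of the most recent taken branch
    gap = 0             # fetches seen since that branch
    for pc, target in trace:
        if use_skip and skip_left > 0:
            skip_left -= 1
            if target is not None:
                skip_left = 0        # a taken branch was skipped: re-enable
        else:
            lookups += 1
            entry = btb.get(pc)
            if entry is not None:
                skip_left = entry.skip   # disable the BTB until the next branch
        if target is not None:
            if last_taken is not None:
                btb[last_taken].skip = gap   # learn the branch distance
            btb.setdefault(pc, BTBEntry(target))
            last_taken, gap = pc, 0
        else:
            gap += 1
    return lookups


# A 5-instruction loop whose last instruction branches back to pc 0.
loop_trace = [(0, None), (1, None), (2, None), (3, None), (4, 0)] * 100
print(count_btb_lookups(loop_trace, use_skip=False))  # 500
print(count_btb_lookups(loop_trace, use_skip=True))   # 112
```

On this toy loop the lookup count drops once the distance has been learned. The figures say nothing about the 72% power saving reported in the abstract, which comes from the authors' hardware model.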

2.

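Research and Design of Hashing Algorithms for BTB Indexing

王国澎, 胡向东, 尹飞, 朱英 《计算机研究与发展》 (Journal of Computer Research and Development) 2014, 51(9)

Branch misprediction is a major factor limiting further performance gains in high-performance processors. Modern processors use a branch target buffer (BTB) to predict branch target addresses, and BTB prediction accuracy is bounded by its hit rate. Because branch instructions are unevenly distributed in programs, conventional BTB indexing cannot fully use the BTB's capacity, which causes unnecessary conflict misses and lowers target prediction accuracy. Using hashed indexing to optimize the access mapping is one effective remedy. Much of the existing literature studies cache access schemes, but hashed indexing specifically for BTBs has received little attention. To eliminate holes in the branch distribution and to scatter the inherent mapping between branch instructions and BTB entries, the paper designs an XOR hashing algorithm and an optimized bit-select indexing algorithm for BTB indexing, uses a probabilistic method to estimate an upper bound on the expected maximum number of branches mapped to a single BTB set, and evaluates both indexing algorithms by simulation. The results show that hashed mapping effectively avoids prediction failures caused by BTB conflict misses, and that the XOR hashing algorithm scatters branches better.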
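To make the contrast between the two indexing styles concrete, the sketch below compares plain bit-select indexing with one common XOR-folding hash. The abstract does not give the paper's actual hash function or its optimized bit-select scheme, so `xor_index`, the 64-set BTB, and the 4-byte instruction size are assumptions chosen for illustration.

```python
from collections import Counter

INDEX_BITS = 6                 # assumed BTB with 64 sets
NUM_SETS = 1 << INDEX_BITS


def bit_select_index(pc):
    """Conventional indexing: low-order bits of the word address."""
    return (pc >> 2) & (NUM_SETS - 1)


def xor_index(pc):
    """XOR-fold the next INDEX_BITS address bits into the index."""
    word = pc >> 2
    return (word ^ (word >> INDEX_BITS)) & (NUM_SETS - 1)


def max_set_load(pcs, index_fn):
    """Largest number of branches that land in any single set."""
    return max(Counter(index_fn(pc) for pc in pcs).values())


# Branches spaced exactly one "BTB span" apart: a worst case for bit-select.
branch_pcs = [k * NUM_SETS * 4 for k in range(NUM_SETS)]
print(max_set_load(branch_pcs, bit_select_index))  # 64
print(max_set_load(branch_pcs, xor_index))         # 1
```

With this deliberately adversarial stride, bit-select maps every branch to set 0 while the XOR fold spreads them one per set. Real programs fall between these extremes, which is why the paper evaluates the schemes by simulation.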

3.

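Research and Design of a Dynamic Branch Prediction Mechanism for Embedded Processors (cited by: 2; self-citations: 1; citations by others: 1)

黄伟, 王玉艳, 章建雄 《计算机工程》 (Computer Engineering) 2008, 34(21): 163-165

Targeting the specific application environment of embedded processors, the paper proposes a hybrid dynamic branch prediction mechanism that combines an improved version of the conventional neural network algorithm with a customized branch target buffer. The mechanism uses global indexing and a customized BTB structure to accurately predict the last branch instruction in loop logic. Experimental results show that the mechanism reduces hardware complexity and improves prediction accuracy.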

4.

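A Loop-Based Instruction Cache Access Prediction Method (cited by: 1; self-citations: 0; citations by others: 1)

梁静, 陈志坚, 孟建熠 《计算机应用研究》 (Application Research of Computers) 2012, 29(7): 2491-2493

To reduce cache access power, the paper proposes a loop-oriented instruction cache access prediction method based on historical access paths. The method makes the presence of a loop the precondition for enabling cache way prediction, and trains the predictor with the instruction cache's historical access paths. When a loop body is re-entered, the corresponding access path predictor is selected to obtain the way of the target instruction cache line, which reduces access power. A multi-path way prediction method is also proposed to achieve higher prediction accuracy. Experiments on the Powerstone benchmarks show that the method achieves 99% prediction accuracy. Compared with a conventional instruction cache, a cache using this method reduces access power by 65% on average while increasing the average instruction cache access cycles by only about 0.2%.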

5.

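Research on a Branch Target Buffer Based on Jump Traces

熊振亚, 林正浩, 任浩琪 《计算机科学》 (Computer Science) 2017, 44(3): 195-201, 214

Modern computer architecture faces two challenges: performance and energy consumption. To reduce the growing power consumption of embedded processors, the paper proposes a branch target buffer structure based on jump traces (TG-BTB). Unlike a conventional BTB, which is looked up on every instruction fetch, TG-BTB looks up the BTB only when the execution trace is predicted to jump. By dynamically analyzing jump-trace behavior during program execution, the structure can query the BTB only at trace jumps, which reduces power. The dynamic analysis first extracts and records the instruction interval between two taken branch instructions, then stores the interval in TG-BTB, and finally uses the stored interval to decide whether the BTB needs to be looked up. Model validation and performance testing on benchmark test vectors show that TG-BTB reduces BTB lookup energy by 81%.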

6.

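A High-Performance, Low-Power Fault-Tolerant Compilation Technique: The Error Flow Compression Algorithm (cited by: 1; self-citations: 1; citations by others: 1)

高珑, 杨学军 《软件学报》 (Journal of Software) 2006, 17(12): 2425-2437

In many critical applications, high performance, low power, and high reliability must all be satisfied at once. Conventional software fault-tolerance techniques frequently use compare and branch instructions to detect errors, which incurs large performance and power overheads. The paper proposes an error flow model based on the computational data flow model and designs an error flow compression algorithm. The algorithm uses additional computation to compress the diameter of the error flow, significantly reducing the number of branch instructions while leaving the total instruction count unchanged. The algorithm is evaluated by simulation with Wattch on the fast Fourier transform test program provided by StreamIT. The results show that with loop parameter n = 2²⁵, compared with the conventional EDDI algorithm, error flow compression reduces branch instructions by more than 24%, improves IPC by more than 12%, and reduces power by nearly 5%. An extrapolation given in the paper indicates that, in this experiment, if the inner loop body contains 8 store instructions, the reduction in branch instructions can exceed 43%.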

7.

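A Low-Power Instruction Cache Scheme Based on a Record Buffer (cited by: 1; self-citations: 1; citations by others: 1)

马志强, 季振洲, 胡铭曾 《计算机研究与发展》 (Journal of Computer Research and Development) 2006, 43(4): 744-751

Most modern microprocessors use on-chip caches to bridge the large speed gap between main memory and the central processing unit (CPU), but caches have become a major source of processor power consumption, and most of that power comes from the instruction cache. A buffer can filter out most instruction cache accesses and thus reduce power, but a considerable number of unnecessary memory bank accesses remain. The paper therefore proposes RBC, a low-power instruction cache structure based on a record buffer. Through the record buffer and modifications to the memory banks, RBC filters out most unnecessary bank accesses and effectively reduces cache power. Simulation of 10 SPEC2000 benchmark programs shows that, compared with a conventional buffer-based cache structure, the scheme saves 24.33% of instruction cache power at a cost of only 6.01% in processor performance and 3.75% in area.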

8.

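Design of an On-Chip Flash Acceleration Controller Based on Prefetching and Caching

蒋进松, 黄凯, 陈辰, 王钰博, 严晓浪 《计算机工程与科学》 (Computer Engineering & Science) 2016, 38(12): 2381-2391

To improve the read speed of on-chip Flash in embedded applications, the paper proposes an on-chip Flash acceleration controller based on prefetching and caching. The controller offers two acceleration schemes: a prefetch buffer and a cache. The prefetch buffer scheme uses bit-width expansion and prefetching to speed up reads of sequential instructions, and uses a branch buffer to store non-sequential instructions, which lowers the prefetch miss penalty caused by non-sequential instructions. The cache scheme uses set associativity and way prediction to raise the instruction reuse rate, reduce the number of Flash accesses, and lower system power. Depending on the application scenario, the two schemes can be switched statically through a register or switched dynamically and adaptively by the software flow, so as to obtain the best read speedup. Results on several benchmark programs demonstrate the feasibility and efficiency of the proposed controller in optimizing performance and power.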

9.

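Design and Implementation of Branch Prediction for an Embedded Microprocessor

陈海民, 李峥, 王瑞蛟 《计算机应用》 (Journal of Computer Applications) 2011, 31(7): 2004-2007

Targeting the specific application environment of a five-stage pipelined embedded microprocessor, the paper studies branch prediction in depth and proposes a new branch prediction scheme. The scheme is compatible with designs that include a cache. By extending the instruction bus, it predicts the branch direction and target address early, in the fetch stage, and saves the instructions and address pointers that might have executed but did not, so that they can be restored when a prediction fails. This reduces the misprediction penalty while ensuring correct execution of the instruction stream. The study shows that the scheme has low hardware overhead, high prediction efficiency, and a low misprediction penalty.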

10.

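Non-Sequential Instruction Prefetching at Basic-Block Granularity (cited by: 1; self-citations: 0; citations by others: 1)

沈立, 戴葵, 王志英 《计算机工程与科学》 (Computer Engineering & Science) 2003, 25(4): 94-98

Instruction fetch capability has a large effect on microprocessor performance. Instruction prefetching can effectively lower the instruction cache miss rate and improve fetch capability, and thereby improve microprocessor performance. The paper proposes a non-sequential instruction prefetching technique that is guided by branch instructions and works at basic-block granularity: each prefetch reads one complete basic block into the instruction cache. The method uses a static strategy to analyze program behavior, so the hardware needed to implement it has low complexity. Simulation results show that the method effectively improves the instruction cache hit rate.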

11.

A Power-Aware Branch Predictor by Accessing the BTB Selectively

Cheol Hong Kim, Sung Woo Chung, and Chu Shik Jhon 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2005, 20(5): 607-614

Microarchitects should consider power consumption, together with accuracy, when designing a branch predictor, especially in embedded processors. This paper proposes a power-aware branch predictor, which is based on the gshare predictor, by accessing the BTB (Branch Target Buffer) selectively. To enable the selective access to the BTB, the PHT (Pattern History Table) in the proposed branch predictor is accessed one cycle earlier than the traditional PHT if the program is executed sequentially without branch instructions. As a side effect, two predictions from the PHT are obtained through one access to the PHT, resulting in more power savings. In the proposed branch predictor, if the previous instruction was not a branch and the prediction from the PHT is untaken, the BTB is not accessed to reduce power consumption. If the previous instruction was a branch, the BTB is always accessed, regardless of the prediction from the PHT, to prevent the additional delay/accuracy decrease. The proposed branch predictor reduces the power consumption with little hardware overhead, not incurring additional delay and never harming prediction accuracy. The simulation results show that the proposed branch predictor reduces the power consumption by 29-47%.
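The BTB gating rule stated in this abstract is small enough to write down directly. The sketch below encodes only that rule; the function names and the example fetch stream are hypothetical, and the gshare PHT itself is not modeled.

```python
def should_access_btb(prev_was_branch: bool, pht_predicts_taken: bool) -> bool:
    """Return True when the BTB is read for the current fetch."""
    if prev_was_branch:
        # Per the abstract, the BTB is always read after a branch,
        # regardless of the PHT prediction.
        return True
    # Sequential code: the early PHT prediction gates the BTB read.
    return pht_predicts_taken


def btb_reads(stream):
    """Count BTB reads over (prev_was_branch, pht_predicts_taken) pairs."""
    return sum(should_access_btb(p, t) for p, t in stream)


# Hypothetical stream: eight sequential untaken fetches, one predicted-taken
# fetch, then one fetch that directly follows a branch.
stream = [(False, False)] * 8 + [(False, True), (True, False)]
print(btb_reads(stream), "BTB reads for", len(stream), "fetches")  # 2 of 10
```

A conventional predictor would read the BTB on all ten fetches. How much power that saves depends on the hardware, which this sketch does not attempt to model.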

12.

Linked instruction caches for enhancing power efficiency of embedded systems

Chang-Jung Ku, Ching-Wen Chen, An Hsia, Chun-Lin Chen 《Microprocessors and Microsystems》 2014

The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC combines the low power consumption and low access delay of a direct-mapped cache with a high cache hit rate similar to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, traditional cache, and filter cache in terms of the cache hit rate, power consumption, and execution time.

13.

The Precomputed-Branch architecture: Efficient branches with compiler support

《Journal of Systems Architecture》1999,45(9):651-679

Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem. Branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and can possibly increase the cycle time of a processor. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design. The compiler would pre-compute the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, and has significant hardware savings. This is particularly true for larger programs more representative of modern applications.
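The link-time variant described in this abstract can be pictured as a pass that rewrites PC-relative branch offsets into absolute destinations stored with each instruction. The sketch below is a toy illustration only: the tuple encoding, the opcode names in `DIRECT_BRANCHES`, the 4-byte instruction width, and the offset convention (relative to the branch's own address) are all assumptions, not the paper's instruction format.

```python
DIRECT_BRANCHES = {"beq", "bne", "jmp"}   # hypothetical direct-branch opcodes


def precompute_targets(program, base=0, width=4):
    """Replace PC-relative byte offsets with absolute destinations.

    program is a list of (opcode, operand) pairs laid out from address base.
    Indirect jumps (any opcode outside DIRECT_BRANCHES) are left untouched,
    since their targets are not known before run time.
    """
    resolved = []
    for i, (op, operand) in enumerate(program):
        pc = base + i * width
        if op in DIRECT_BRANCHES:
            resolved.append((op, pc + operand))   # absolute destination
        else:
            resolved.append((op, operand))
    return resolved


program = [("add", None), ("beq", 8), ("jr", "r1"), ("jmp", -12)]
for op, operand in precompute_targets(program, base=0x100):
    print(op, hex(operand) if isinstance(operand, int) else operand)
```

Here `beq` at 0x104 resolves to 0x10c and `jmp` at 0x10c resolves to 0x100, while `jr` keeps its register operand. That remaining indirect case is the one the abstract says a very small BTB is still useful for.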

14.

SWIP Prediction: Complexity-Effective Indirect-Branch Prediction Using Pointers

谢子超, 佟冬, 黄明凯, 史秦青, 程旭 《计算机科学技术学报》 (Journal of Computer Science and Technology) 2012, 27(4): 754-768

Predicting indirect-branch targets has become a performance bottleneck for many applications. Previous high-performance indirect-branch predictors usually require significant hardware storage or additional compiler support, which increases the complexity of the processor front-end or the compilers. This paper proposes a complexity-effective indirect-branch prediction mechanism, called the Set-Way Index Pointing (SWIP) prediction. It stores multiple indirect-branch targets in different branch target buffer (BTB) entries, whose set indices and way locations are treated as set-way index pointers. These pointers are stored in the existing branch-direction predictor. SWIP prediction reuses the branch-direction predictor to provide such pointers, and then accesses the pointed BTB entries for the predicted indirect-branch target. Our evaluation shows that SWIP prediction could achieve attractive performance improvement without requiring large dedicated storage or additional compiler support. It improves the indirect-branch prediction accuracy by 36.5% compared to that of a commonly-used BTB, resulting in average performance improvement of 18.56%. Its energy consumption is also reduced by 14.34% over that of the baseline.

15.

The Misprediction Recovery Cache

Ashwini K. Nanda, James O. Bondi, Simonjit Dutta 《International Journal of Parallel Programming》 1998, 26(4): 383-415

In modern processors, deep pipelines couple with superscalar techniques to allow each pipe stage to process multiple instructions. When such a pipe must be flushed and refilled, as when predicted program flow beyond a branch is subsequently recognized as wrong, the temporary performance loss is significant. While modern branch target buffer (BTB) technology makes this flush/refill penalty fairly rare, the penalty that accrues from the remaining branch mispredictions is a serious impediment to even higher processor performance. Advanced mechanisms that can reduce this residual misprediction penalty can be of enormous value in future microprocessor designs. In this paper we describe the design and performance of a promising new mechanism called the Misprediction Recovery Cache (MRC). The key results of our study are: (1) Small, finite-sized MRCs (16 to 256 entries) can effectively reduce branch penalty in deeply pipelined processors. (2) Commercial benchmarks such as the Winstone benchmarks make better use of larger MRCs, because they contain a large number of unique branch instructions, unlike the predominantly technical SPECint benchmarks. (3) The MRC hit rates increase with increasing BTB prediction accuracy (5-200% depending on MRC size) due to fewer residual mispredictions associated with better prediction. (4) For the processor architecture we studied, the MRC resulted in up to 20% improvement in CPI (cycles per instruction). (5) The incremental performance gain achievable by adding an MRC to a modern CISC processor (which uses a BTB with a two-level predictor) is two to three times what was achievable by going from a one-level predictor to a two-level predictor.

16.

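Design of a Two-Level Correlated Indirect Branch Predictor with a High Performance-to-Cost Ratio

袁楠, 范东睿 《计算机学报》 (Chinese Journal of Computers) 2008, 31(11)

With the widespread use of object-oriented programs, dynamic link libraries (DLLs), and the like, indirect branch instructions are used more and more frequently. Two-level correlated indirect branch predictors are highly accurate but expensive to implement in hardware, and are therefore impractical. The paper analyzes in depth the causes of mispredictions in two-level correlated indirect branch predictors and reduces the hardware cost through practical methods such as improved indexing and compressed storage. Evaluated on a set of SPEC CPU2000 benchmarks with a hardware storage budget of 133 Kbits, the improved predictor achieves an indirect branch misprediction rate of 9.6%, only 2.3% higher than the ideal misprediction rate of the two-level correlated predictor, whereas a 4-way set-associative BTB predictor has a misprediction rate of 31%.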