Similar Literature
 20 similar documents found (search time: 31 ms)
1.
To improve the read speed of on-chip Flash in embedded applications, an on-chip Flash acceleration controller based on prefetching and caching is proposed. The controller provides two acceleration schemes: a prefetch buffer and a cache. The prefetch-buffer scheme uses bit-width expansion and prefetching to speed up sequential instruction reads, and adds a branch buffer for non-sequential instructions to reduce the prefetch-miss penalty they cause; the cache scheme uses set associativity and way prediction to raise instruction reuse, reduce the number of Flash accesses, and lower system power. For different application scenarios, the two schemes can be switched statically through a register or adaptively at run time through the software flow, so as to obtain the best read-speed improvement. Results on multiple benchmark programs demonstrate the feasibility and efficiency of the proposed on-chip Flash acceleration controller in optimizing both performance and power.
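To make the two fetch paths concrete, here is a minimal C sketch of the prefetch-buffer scheme with a small branch buffer; the flash_read_line interface, the buffer sizes, and the demand-fill policy are illustrative assumptions, not the controller's actual design:

    #include <stdint.h>

    #define LINE_WORDS   4       /* bit-width expansion: one wide line per Flash read */
    #define BRANCH_SLOTS 8       /* small branch buffer for non-sequential targets */

    /* hypothetical wide Flash read: fills one line containing addr */
    extern void flash_read_line(uint32_t addr, uint32_t out[LINE_WORDS]);

    static uint32_t pf_tag = ~0u, pf_line[LINE_WORDS];           /* prefetch buffer */
    static uint32_t br_tag[BRANCH_SLOTS], br_word[BRANCH_SLOTS]; /* branch buffer */
    static uint8_t  br_valid[BRANCH_SLOTS];

    uint32_t fetch(uint32_t addr) {
        uint32_t tag = addr / (LINE_WORDS * 4);
        if (tag == pf_tag)                          /* sequential hit: no Flash access */
            return pf_line[(addr / 4) % LINE_WORDS];
        for (int i = 0; i < BRANCH_SLOTS; i++)      /* non-sequential: try branch buffer */
            if (br_valid[i] && br_tag[i] == addr)
                return br_word[i];
        flash_read_line(addr, pf_line);             /* demand fill of the wide line;   */
        pf_tag = tag;                               /* a real controller would also    */
        uint32_t w = pf_line[(addr / 4) % LINE_WORDS]; /* prefetch the next line       */
        static int victim;
        br_tag[victim] = addr; br_word[victim] = w; br_valid[victim] = 1;
        victim = (victim + 1) % BRANCH_SLOTS;       /* record target for later reuse */
        return w;
    }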

2.
In modern microprocessors, reading and comparing instruction-cache tags consumes a large share of the instruction cache's energy. A speculation-based low-energy instruction cache, the asymmetric instruction cache, is proposed. Exploiting the low proportion of taken branches, the design handles branch and sequential instructions differently and uses simplified tag-management bits that do not correspond one-to-one with the data. The design introduces two techniques: hit speculation, which removes most tag comparisons on instruction-cache hits, and variable-length instruction fetch, which raises the hit rate of sequential instruction blocks. Experimental results on selected SPEC2006 benchmarks show that the asymmetric instruction cache reduces fetch energy by 40%–60% compared with a conventional L1 instruction cache, and by 9% compared with the tagless instruction cache THIC; in fetch ED²P it improves on a conventional L1 instruction cache by about 50% and on THIC by about 17%.
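One plausible realization of hit speculation is to elide the tag comparison while fetch stays within a line already proven resident, re-checking only after a taken branch or a line crossing. This C sketch assumes hypothetical icache_lookup_full and icache_read_data_only interfaces and is not the paper's exact scheme:

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64

    /* hypothetical full probe: reads and compares tags (the expensive path) */
    extern bool icache_lookup_full(uint32_t pc, uint32_t *inst);
    /* hypothetical data-array-only read, safe once residency is proven (cheap) */
    extern uint32_t icache_read_data_only(uint32_t pc);

    static uint32_t last_line = ~0u;   /* last line proven resident by a tag check */

    uint32_t fetch_inst(uint32_t pc, bool after_taken_branch) {
        uint32_t line = pc / LINE_BYTES;
        if (!after_taken_branch && line == last_line)
            return icache_read_data_only(pc);   /* speculated hit: tag compare elided */
        uint32_t inst = 0;
        if (!icache_lookup_full(pc, &inst)) {
            /* miss handling (refill) elided in this sketch */
        }
        last_line = line;                       /* full check on branch or line change */
        return inst;
    }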

3.
The Synchronous Data Triggered Architecture (SDTA) refines traditional instruction-level parallelism into micro-operation-level parallelism and offers high data-processing capability, but its special instruction format and characteristics pose challenges for instruction cache access. Instruction prefetching can effectively lower the instruction cache miss rate, strengthen the processor's fetch capability, and improve performance. This paper analyzes the characteristics of the SDTA instruction set and proposes a hybrid hardware/software instruction prefetch mechanism suited to them, combining a hardware prefetch engine with software hints. The method effectively raises the instruction cache hit rate, is simple to implement, has a low useless-prefetch rate, and does not increase code size.

4.
In VLIW architectures, the static specification of operations executed in parallel using No Operations (NOPs) is another culprit behind code-size growth. Several alternatives in instruction encoding and the memory subsystem have been proposed to minimize the impact of NOPs on code size. One is the compressed cache, using a packed encoding scheme; the other is the decompressed cache, using an unpacked encoding scheme. The compressed cache shows high memory utilization but increases the pipeline branch penalty because it requires very complex fetch hardware. By contrast, fetch overhead is lower in the decompressed cache because the unpacked encoding allows an instruction to be issued to the pipeline without any recovery process. However, it has the shortcoming that memory utilization deteriorates, since memory is allocated irrespective of the number of useful operations. In this research, a new instruction encoding scheme called the semi-packed encoding scheme is proposed, together with the section cache, which enables effective storage and retrieval of semi-packed instructions. The partially fixed instruction length decreases both the hardware complexity of fetching an instruction and the memory space wasted on NOPs. The experimental results reveal that memory utilization in the section cache is 3.4 times higher than in the decompressed cache. A memory subsystem using the section cache can provide about 15% performance improvement with a moderate chip area.
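As a rough illustration of mask-based packing (in the spirit of the packed and semi-packed schemes, with the section-cache specifics omitted), here is a C sketch; the 4-slot width and the encoding are assumptions:

    #include <stdint.h>

    #define SLOTS 4   /* issue width of the hypothetical VLIW */
    #define NOP   0u

    /* Packed form: a slot mask followed by only the useful operations,
       so NOPs occupy no memory. */
    typedef struct { uint8_t mask; uint32_t ops[SLOTS]; } packed_insn;

    /* Expand a packed instruction into a fixed SLOTS-wide issue group,
       re-inserting NOPs for the empty slots at decompression time. */
    void unpack(const packed_insn *p, uint32_t group[SLOTS]) {
        int k = 0;
        for (int s = 0; s < SLOTS; s++)
            group[s] = (p->mask & (1u << s)) ? p->ops[k++] : NOP;
    }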

5.
The instruction fetch unit (IFU) usually dissipates a considerable portion of total chip power. In traditional IFU architectures, as soon as the fetch address is generated, it must be sent to the instruction cache and TLB arrays for instruction fetch. Since limited work can be done by the power-saving logic after fetch-address generation and before the instruction fetch, previous power-saving approaches usually suffer from the unnecessary restrictions of traditional IFU architectures. In this paper, we present CASA, a new power-aware IFU architecture, which effectively removes the unnecessary restrictions on power-saving approaches and provides sufficient time and information for the power-saving logic of both the instruction cache and the TLB. By analyzing, recording, and utilizing the key information of the dynamic instruction flow early in the front-end pipeline, CASA creates the opportunity to maximize power efficiency and minimize performance overhead. Compared to the baseline configuration, the leakage and dynamic power of the instruction cache are reduced by 89.7% and 64.1% respectively, and the dynamic power of the instruction TLB is reduced by 90.2%, while the worst-case performance degradation is only 0.63%. Compared to previous state-of-the-art power-saving approaches, the CASA-based approach saves IFU power more effectively, incurs less performance overhead, and achieves better scalability. CASA is a promising stimulus for further work on architectural solutions to power-efficient IFU designs.

6.
Energy consumption and power dissipation are important concerns in the design of embedded systems, and they will become even more crucial with finer process geometries, higher frequencies, deeper pipelines and wider issue designs. In particular, the instruction cache consumes more energy than any other processor module, especially in the commonly used highly associative CAM-based implementations. Two energy-efficient approaches for highly associative CAM-based instruction cache designs are presented, by means of a segmented wordline and a predictor-based instruction fetch mechanism. The latter is based on the fact that not all instructions in a given I-cache fetch are used, due to taken branches. The proposed Fetch Mask Predictor unit determines which instructions in a cache access will actually be used, to avoid fetching any of the other instructions. Both approaches are evaluated for an embedded 4-wide issue processor in 100 nm technology. Experimental results show average I-cache energy savings of 48% and overall processor energy savings of 19%.
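A minimal C sketch of a fetch-mask predictor of this kind; the table size, indexing, and training interface are illustrative assumptions rather than the paper's design:

    #include <stdint.h>

    #define FETCH_WIDTH  4
    #define PRED_ENTRIES 1024

    /* Hypothetical fetch-mask predictor: for each fetch PC, remember which
       of the FETCH_WIDTH slots were actually consumed last time, so the
       I-cache only drives the subarrays for those slots. */
    static uint8_t mask_table[PRED_ENTRIES];

    static inline unsigned idx(uint32_t fetch_pc) {
        return (fetch_pc / (FETCH_WIDTH * 4)) % PRED_ENTRIES;
    }

    uint8_t predict_mask(uint32_t fetch_pc) {
        uint8_t m = mask_table[idx(fetch_pc)];
        return m ? m : (uint8_t)((1u << FETCH_WIDTH) - 1); /* default: fetch all */
    }

    /* After the group resolves, record which slots were used before the
       first taken branch. */
    void train_mask(uint32_t fetch_pc, uint8_t used_slots) {
        mask_table[idx(fetch_pc)] = used_slots;
    }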

7.
A loop-based instruction cache access prediction method   Cited 1 time (0 self-citations, 1 by others)
To reduce cache access energy, a loop-oriented instruction-cache way-prediction method based on historical access paths is proposed. The method uses loops as the precondition for enabling way prediction and trains the predictor with the instruction cache's historical access paths. When a loop body is re-entered, the corresponding access-path predictor is selected and only the predicted way of the instruction cache is accessed, reducing access energy. A multi-path way-prediction extension is further proposed to achieve higher prediction accuracy. Experiments on the Powerstone benchmarks show that the method reaches 99% prediction accuracy. Compared with a conventional instruction cache, a cache using this method cuts access energy by 65% on average while adding only about 0.2% to the average instruction-cache access cycles.
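A simplified C sketch of loop-gated way prediction: only the predicted way is probed on the fast path, with a full probe and retraining on a way mispredict; the cache geometry and probe interface are assumed for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4
    #define SETS 128

    /* hypothetical single-way probe: reads one way's tag+data only */
    extern bool probe_way(unsigned set, unsigned way, uint32_t tag, uint32_t *inst);

    static uint8_t pred_way[SETS];   /* trained while the loop body executes */

    bool icache_access(uint32_t pc, uint32_t *inst, bool in_loop) {
        unsigned set = (pc >> 5) % SETS;
        uint32_t tag = pc >> 12;
        unsigned first = in_loop ? pred_way[set] : 0;   /* predict only inside loops */
        if (probe_way(set, first, tag, inst))
            return true;                                /* one-way access on hit */
        for (unsigned w = 0; w < WAYS; w++) {           /* fall back: remaining ways */
            if (w == first) continue;
            if (probe_way(set, w, tag, inst)) {
                pred_way[set] = (uint8_t)w;             /* retrain the predictor */
                return true;
            }
        }
        return false;                                   /* miss */
    }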

8.
In a simultaneous multithreading processor, raising fetch-unit throughput intensifies the cache contention among threads, and that contention in turn limits fetch throughput. Targeting the new features of current VLIW architectures, this paper proposes a method that raises the throughput of the fetch unit and of the processor at the same time. By invalidating useless addresses in the fetch pipeline as early as possible, it reduces the program-cache conflicts caused by useless fetches and improves overall processor performance. Experimental results show that the method raises both processor and fetch-unit throughput by 12%–23%, while the L1 program-cache miss rate increases only slightly or even decreases. In addition, it removes 10%–25% of L1 program-cache read accesses, lowering processor power.

9.
Radiation-induced soft error has become an emerging reliability threat to high-performance microprocessor design. As the size of on-chip cache memory has steadily increased over the past decades, techniques for making caches resilient to soft errors are becoming increasingly important for processor reliability. However, conventional soft-error resilience techniques significantly increase the access latency and energy consumption of cache memory, degrading performance and energy efficiency. The emerging 3D integration technology provides an attractive advantage: a 3D microarchitecture exhibits heterogeneous soft-error resilience because of the shielding effect of die stacking. Moreover, the 3D shielding effect yields several inner dies that are inherently invulnerable to soft error, as they are implicitly protected by the outer dies. To exploit this invulnerability, we propose a soft-error-resilient 3D cache architecture in which data blocks on the invulnerable dies carry no soft-error protection, so an access to a data block on an invulnerable die incurs considerably reduced latency and energy. Furthermore, we propose to maximize accesses to the invulnerable dies by dynamically moving data blocks among dies, achieving further performance and energy-efficiency improvement. Simulation results show that the proposed 3D cache architecture can reduce power consumption by up to 65% for the L1 instruction cache, 60% for the L1 data cache and 20% for the L2 cache, while overall IPC improves by 5% on average.

10.
The trace cache is a promising technique aimed at solving the instruction-fetch bandwidth problem, and the SimpleScalar simulator is an important software tool for simulating and studying CPU microarchitecture. After introducing CPU simulators and trace-cache technology, this paper proposes an improved trace cache built from basic blocks, implements it in the SimpleScalar simulator, and reports experimental results on that platform.
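A C sketch of basic-block-based trace construction: a trace line is identified by its start PC plus the branch path bits and is cut at a block or instruction limit. The limits and fields are illustrative assumptions, not the paper's parameters:

    #include <stdint.h>

    #define MAX_BLOCKS 3          /* basic blocks per trace line */
    #define MAX_INSTS  16         /* instructions per trace line */

    typedef struct {
        uint32_t start_pc;        /* trace identified by start PC ...            */
        uint8_t  branch_dirs;     /* ... plus the taken/not-taken path bits      */
        uint32_t insts[MAX_INSTS];
        int      n_insts, n_blocks;
    } trace_line;

    /* Fill-time construction: append retired instructions; the trace is cut
       at MAX_BLOCKS branches or MAX_INSTS instructions (simplified). */
    int trace_append(trace_line *t, uint32_t inst, int is_branch, int taken) {
        if (t->n_insts == MAX_INSTS) return 0;        /* full: start a new trace */
        t->insts[t->n_insts++] = inst;
        if (is_branch) {
            t->branch_dirs = (uint8_t)((t->branch_dirs << 1) | (taken & 1));
            if (++t->n_blocks == MAX_BLOCKS) return 0;
        }
        return 1;                                     /* keep filling this line */
    }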

11.
Accurate instruction fetch and branch prediction is increasingly important in today's superscalar architectures. Fetch prediction is the process of determining the next instruction to request from the memory subsystem; branch prediction is the process of predicting the likely outcome of branch instructions. A branch target buffer (BTB) is often used to provide target addresses for taken branches and to predict the destination of indirect jumps. Using a BTB avoids the delay needed to recalculate the destination address and reduces the misfetch penalty. However, an effective branch target buffer can be large and may increase the processor's cycle time. We propose that a design used in older computers, such as the PDP-8, be used in modern architectures instead of a BTB design: the compiler pre-computes the branch destination for most branch instructions, allowing the branch information to be stored with the instruction. We consider computing branch destinations at link time and as instructions are fetched into the instruction cache; both alternatives offer similar performance with different advantages. A very small BTB is still useful to predict indirect branches, which cannot be pre-computed. Our results show that the Precomputed-Branch architecture performs better than an architecture using only a BTB, with significant hardware savings. This is particularly true for larger programs more representative of modern applications.
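A C sketch of the precompute-at-fill alternative: PC-relative targets are computed once when the line enters the I-cache, so fetch reads the target directly without a BTB lookup. The instruction encoding here is hypothetical:

    #include <stdint.h>

    /* Each I-cache entry stores the instruction plus its precomputed target,
       filled when the line is brought into the cache (direct branches only). */
    typedef struct { uint32_t inst; uint32_t target; } icache_entry;

    #define OPCODE(i)  ((i) >> 26)
    #define OFFSET(i)  (((int32_t)((i) << 16)) >> 14)  /* sign-extended imm16 << 2 */
    #define OP_BRANCH  0x04u                           /* hypothetical encoding */

    /* At fill time: precompute the PC-relative target so fetch needs neither
       a BTB lookup nor a target adder on the critical path. */
    void fill_entry(icache_entry *e, uint32_t inst, uint32_t pc) {
        e->inst = inst;
        e->target = (OPCODE(inst) == OP_BRANCH) ? pc + 4 + OFFSET(inst) : pc + 4;
    }

    /* At fetch time: the next PC comes straight from the entry if predicted taken. */
    uint32_t next_pc(const icache_entry *e, uint32_t pc, int predict_taken) {
        return predict_taken ? e->target : pc + 4;
    }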

12.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed by a memory access is 10 times higher than that of a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC combines the low power consumption and low access delay of a direct-mapped cache with a hit rate close to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption of each cache access and to avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions, reducing the power consumed per access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB); by consulting the BTB, the processor can fetch the requested instruction directly from memory if it is not in the cache. In our simulations, this method performed better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.
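A C sketch of the link-field lookup: a direct-mapped fast path followed by one linked probe. The field widths and linking policy are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 256

    typedef struct {
        uint32_t tag;
        bool     valid;
        uint16_t link;     /* extra field: index of an alternate frame */
        uint32_t data;
    } slc_line;

    static slc_line cache[SETS];

    /* Direct-mapped probe first (fast path), then one linked probe: this gives
       direct-mapped access energy with a hit rate closer to 2-way associativity. */
    bool slc_lookup(uint32_t addr, uint32_t *data) {
        unsigned set = (addr >> 5) % SETS;
        uint32_t tag = addr >> 13;
        if (cache[set].valid && cache[set].tag == tag) {
            *data = cache[set].data;
            return true;
        }
        uint16_t l = cache[set].link;                 /* follow the link field */
        if (l < SETS && cache[l].valid && cache[l].tag == tag) {
            *data = cache[l].data;
            return true;
        }
        return false;                                 /* miss in both frames */
    }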

13.
An accurate microprocessor model with branch prediction   Cited 3 times (0 self-citations, 3 by others)
In today's deeply pipelined, wide-issue microprocessors, accurate branch prediction is an indispensable technique for achieving high performance: a branch misprediction wastes many clock cycles and defeats the benefit of out-of-order execution. The effective performance of a wide-issue microprocessor also depends on the instruction window size and the instruction fetch width. This paper proposes a new, more accurate microprocessor model that accounts for branch prediction and the cycle cost of mispredictions. Based on the statistical law that execution bandwidth equals the square root of the number of available instructions in the instruction window, it gives a more precise algorithm relating fetch bandwidth, branch prediction accuracy, misprediction penalty, instruction window size, and IPC, and discusses the trade-offs among these parameters and their effect on program IPC. From the model one can determine the fetch-bandwidth threshold, which depends on multiple processor parameters, and the choice of several key parameters.
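A first-order model in the spirit of the abstract (the symbols below are mine, not the paper's): with branch fraction b, prediction accuracy p, misprediction penalty c cycles, fetch bandwidth F, window size W, and execution bandwidth taken as k√W per the square-root law,

    \[
      L = \frac{1}{b\,(1-p)}, \qquad
      \mathrm{IPC} \;\approx\; \frac{L}{\dfrac{L}{\min\!\left(F,\; k\sqrt{W}\right)} + c}
    \]

where L is the mean number of instructions between mispredictions. Raising F stops paying off once it exceeds k√W, which is one way to read the fetch-bandwidth threshold mentioned above.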

14.
The demand for low-power embedded systems requires designers to tune processor parameters to avoid excessive energy wastage. Tuning on a per-application or per-application-phase basis allows a greater saving in energy consumption without a noticeable degradation in performance. On-chip caches often consume a significant fraction of the total energy budget and are therefore prime candidates for adaptation. Fixed-configuration caches must be designed to deliver low average memory access times across a wide range of potential applications, but this can lead to excessive energy consumption for applications that do not require the full capacity or associativity of the cache at all times. Furthermore, in systems where the clock period is constrained by the access times of level-1 caches, the clock frequency for all applications is effectively limited by the cache requirements of the most demanding phase of the most demanding application. The result is performance and energy efficiency at the lowest common denominator across the applications. In this work we present the Set and way Management cache Architecture for Run-Time reconfiguration (SMART cache), a cache architecture that allows reconfiguration of both its size and its associativity. Results show that the energy-delay of the SMART cache is on average 70% and 12% better than the baseline configuration for a two-core and a four-core system respectively, within 2% of the oracle result, with an overall performance degradation of less than 2% compared with a baseline statically configured cache.

15.
Design methods for reducing cache miss penalty in embedded processors   Cited 2 times (0 self-citations, 2 by others)
Taking the Godson-1 (龙芯1号) processor as the object of study, this paper explores design methods for reducing cache miss penalty in embedded processors. After analyzing the processor's structural features, we implemented a non-blocking data cache that allows one hit under one miss on top of critical-word-first refill, raising average processor performance by 3.9%. Exploiting locality, and building on the critical-word-first non-blocking data cache, we further propose a pseudo-non-blocking instruction cache design that lowers the instruction cache's miss penalty, improving average performance by a further 7.7% at small implementation cost. Together, these techniques reduce the miss penalties of both the instruction cache and the data cache and raise average processor performance by 11.6%.
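A C sketch of critical-word-first refill with early restart; the memory and pipeline interfaces are hypothetical:

    #include <stdint.h>

    #define LINE_WORDS 8

    /* hypothetical memory interface: returns one word per call */
    extern uint32_t mem_read_word(uint32_t word_addr);
    /* hypothetical pipeline restart: the processor resumes with the missed word */
    extern void deliver_critical_word(uint32_t word);

    /* Critical-word-first refill: fetch the requested word first so the
       pipeline restarts immediately, then wrap around to complete the line. */
    void refill_line(uint32_t *line, uint32_t miss_addr) {
        uint32_t base = miss_addr / 4 - (miss_addr / 4) % LINE_WORDS;
        unsigned  crit = (miss_addr / 4) % LINE_WORDS;
        line[crit] = mem_read_word(base + crit);
        deliver_critical_word(line[crit]);        /* hit-under-miss may now proceed */
        for (unsigned i = 1; i < LINE_WORDS; i++) /* wrap-around fill order */
            line[(crit + i) % LINE_WORDS] =
                mem_read_word(base + (crit + i) % LINE_WORDS);
    }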

16.
The trace cache is a promising technique for tackling instruction-fetch bandwidth. This paper examines the storage redundancy that is prominent in trace caches, analyzes its causes and consequences in detail, surveys schemes for reducing trace-cache redundancy, points out the problems that remain, and offers suggestions for improvement.

17.
Traditional cache replacement algorithms perform poorly because they cannot adapt to the streaming access behavior of applications. This paper designs a period-detection-based prediction method, analyzes the regularity of programs' memory reuse distances and the complexity of streaming accesses, and proposes RDP, a reuse-distance-prediction algorithm that adapts to both simple and complex streaming access patterns. The basic idea of RDP is to predict reuse distances and dynamically maintain reuse-distance counters, dynamically adjusting the replacement order of cached data, while stream sampling cuts the storage overhead. Experimental results show that RDP adapts well to the diverse streaming access patterns in programs; its overall performance exceeds that of the LRU and DIP algorithms, reducing cache misses by an average of 27.5% relative to traditional LRU on a 32 MB cache.
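A simplified C sketch of reuse-distance-predicting replacement: each line carries a predicted-distance counter that is aged on every set access, and the victim is the line expected to be reused farthest in the future. The PC-indexed predictor interface and the counter policy are assumptions, not RDP's exact mechanism:

    #include <stdint.h>

    #define WAYS 16

    typedef struct {
        uint32_t tag;
        int      rd;     /* predicted remaining reuse distance (counter) */
    } line_t;

    /* hypothetical predictor: PC-indexed table of observed reuse distances */
    extern int predict_reuse_distance(uint32_t pc);

    /* On a miss: age every line in the set (dynamic counter maintenance) and
       evict the one whose predicted remaining distance is largest. */
    int pick_victim(line_t set[WAYS]) {
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            set[w].rd--;
            if (set[w].rd > set[victim].rd)
                victim = w;
        }
        return victim;
    }

    /* On a hit: reload the line's counter from the predictor. */
    void on_hit(line_t *ln, uint32_t pc) { ln->rd = predict_reuse_distance(pc); }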

18.
Design and performance analysis of the Godson-2 processor   Cited 16 times (4 self-citations, 16 by others)
This paper presents the design of the Godson-2 (龙芯2号) processor and its measured performance. Godson-2 is a four-issue, superscalar, superpipelined design with 64 KB each of on-chip L1 instruction and data cache and up to 8 MB of off-chip L2 cache. To exploit the pipeline fully, Godson-2 implements advanced out-of-order execution techniques, including branch prediction, register renaming, and dynamic scheduling, as well as dynamic memory-access mechanisms such as non-blocking cache access and load speculation. Fabricated in a 0.18 μm CMOS process, the chip reaches a maximum clock frequency of 500 MHz at nominal voltage, with a measured power of 3–5 W at 500 MHz. Its peak floating-point performance is 2 GFLOPS in single precision and 1 GFLOPS in double precision; measured SPEC CPU2000 performance is 8–10 times that of Godson-1, and overall performance reaches the level of a Pentium III. The prototype chip smoothly runs a complete 64-bit Chinese Linux operating system, the full-featured Mozilla browser, media players, and the OpenOffice suite, meeting the needs of most desktop applications.

19.
Wide-issue and high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth, and various schemes for directing the memory references have been proposed to approach the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, and thus suffer from access conflicts within one cache and unbalanced loads between the caches. This paper observes that if the data references in a program can be grouped into several regions (access regions) that allow parallel accesses, providing a separate small cache – an access region cache – for each region may yield better performance. A register-guided memory-reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as the basic guide for instruction steering. After an initial assignment to a specific access region cache per base register name, a reassignment mechanism captures the access pattern as the program moves across its access regions. In addition, a distribution mechanism adaptively lets access regions extend or shrink among the physical caches to further reduce potential conflicts. Simulations of the SPEC CPU2000 benchmarks show that the semantic-based scheme reduces conflicts effectively and obtains considerable IPC improvement: with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is smaller for floating-point benchmark programs, at most 19%.
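A C sketch of register-name-guided steering with a remap hook; the table sizes and the remap policy are illustrative assumptions:

    #include <stdint.h>

    #define N_REGS   32
    #define N_CACHES 8

    /* Steering table indexed by the base *register name* (not its value):
       the initial assignment maps names round-robin; the reassignment
       mechanism can later move a register to another access-region cache. */
    static uint8_t steer[N_REGS];

    void steer_init(void) {
        for (int r = 0; r < N_REGS; r++)
            steer[r] = (uint8_t)(r % N_CACHES);       /* initial assignment */
    }

    /* At issue time, the base-register field of the memory instruction
       selects the access-region cache, enabling parallel region accesses. */
    unsigned pick_cache(unsigned base_reg) { return steer[base_reg % N_REGS]; }

    /* Reassignment: when conflicts in one cache exceed a threshold, move the
       offending base register to a less-loaded cache (policy simplified). */
    void remap(unsigned base_reg, unsigned new_cache) {
        steer[base_reg % N_REGS] = (uint8_t)(new_cache % N_CACHES);
    }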

20.
As the speed gap between processors and memory keeps widening, memory instructions, especially those that frequently miss in the cache, become a major performance bottleneck. Because the compiler cannot know how many cycles a memory instruction takes at run time, it typically assumes either the cache-hit or the cache-miss latency, which is inaccurate. We introduce cache profiling to collect each memory instruction's run-time hit/miss behavior and use this information to compute memory latencies. On an out-of-order machine, the hardware scheduler dynamically schedules the instructions within the issue window well, while the compiler has the advantage over longer instruction ranges. Once a cache miss occurs, the reorder buffer easily fills up and stalls the pipeline; scheduling miss-prone instructions so that they execute in parallel hides their long latencies and improves performance. We therefore target load instructions, adjusting the assumed latency of frequently missing loads and modifying the scheduling policy to raise memory-level parallelism. Experiments show that our scheduling improves bzip2 by up to 4.8% and art by 4%, with an overall average improvement of 1.5%.
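A minimal C sketch of how a profile could pick the latency the compile-time scheduler assumes for each load; the threshold and latency values are assumed, not taken from the paper:

    #include <stdint.h>

    #define HIT_LATENCY     3     /* cycles, assumed cache-hit latency  */
    #define MISS_LATENCY    100   /* cycles, assumed cache-miss latency */
    #define MISS_THRESHOLD  0.2   /* profile-derived cutoff (assumed)   */

    typedef struct {
        uint64_t execs, misses;   /* collected by cache profiling */
    } load_profile;

    /* Latency the list scheduler should assume for this load: frequently
       missing loads get the miss latency, so the scheduler hoists them
       early and overlaps them, raising memory-level parallelism. */
    int scheduled_latency(const load_profile *p) {
        double miss_ratio = p->execs ? (double)p->misses / (double)p->execs : 0.0;
        return miss_ratio > MISS_THRESHOLD ? MISS_LATENCY : HIT_LATENCY;
    }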

