Similar Literature
19 similar articles found.
1.
方娟  郭媚  杜文娟 《计算机科学》2013,40(8):34-37,42
To address the energy consumption of the shared L2 cache in chip multiprocessors, this paper proposes WPP-L2, a way-predicting cache structure based on cache partitioning. The structure first partitions the shared cache fairly among cores, then uses way prediction to reduce the energy cost of both prediction hits and mispredictions. Experiments show that, while largely preserving processor performance, WPP-L2 on an 8-core system lowers the energy-delay product (EDP) by 24.7% on average relative to a way-predicting L2 cache and by 66.1% on average relative to a conventional L2 cache, greatly reducing L2 cache power.
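
As a rough illustration of why way prediction cuts lookup energy, here is a minimal C sketch of a way-predicted set-associative lookup: the predicted way is probed first, and the remaining ways only on a way-mispredict. The sizes, indexing, and energy units are illustrative assumptions, not details from the paper.

#include <stdint.h>
#include <stdio.h>

#define SETS 256
#define WAYS 8

typedef struct {
    uint32_t tag[SETS][WAYS];
    int      valid[SETS][WAYS];
    int      pred_way[SETS];        /* last hit way, reused as the prediction */
} wpp_cache;

static long probe_energy = 0;       /* abstract per-way-probe energy units */

static int wpp_lookup(wpp_cache *c, uint32_t addr)
{
    uint32_t set = (addr >> 6) % SETS;
    uint32_t tag = addr >> 14;
    int p = c->pred_way[set];

    probe_energy++;                          /* probe the predicted way only */
    if (c->valid[set][p] && c->tag[set][p] == tag)
        return 1;                            /* prediction hit: single-way cost */

    for (int w = 0; w < WAYS; w++) {         /* mispredict: probe the rest */
        if (w == p) continue;
        probe_energy++;
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->pred_way[set] = w;            /* retrain the predictor */
            return 1;
        }
    }
    return 0;                                /* miss in all ways */
}

int main(void)
{
    static wpp_cache c;                      /* zero-initialized */
    c.valid[4][3] = 1;
    c.tag[4][3] = 1;
    wpp_lookup(&c, (1u << 14) | (4u << 6));  /* first access mistrains */
    wpp_lookup(&c, (1u << 14) | (4u << 6));  /* second hits the predicted way */
    printf("energy units: %ld\n", probe_energy);
    return 0;
}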

2.
In shared-memory multi-core processors, busy-waiting is commonly used to implement synchronization primitives such as locks and barriers. These mechanisms typically suffer from long synchronization latency and resource contention, scale poorly, and their repeated memory accesses interfere with normal memory traffic and inflate bandwidth demand. This paper proposes a synchronization technique, for multi-core processors with a data-triggered architecture, based on instruction-cache invalidation: at a synchronization point, the instruction cache line about to be executed is invalidated, forcing a fetch miss and a fetch request to the L2 cache; a filter in the L2 cache withholds service from cores whose synchronization condition is not yet met, stalling those cores until the condition holds. Tests show the method offers advantages in both scalability and synchronization performance.
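
A toy C model of the described barrier, with the L2 filter parking fetch requests until every core has arrived; the structures and names below are our simplification of the paper's mechanism, for illustration only.

#include <stdio.h>

#define NCORES 4

typedef struct {
    int arrived;               /* cores whose fetch requests are pending */
    int pending[NCORES];
} l2_filter;

static void core_reach_barrier(l2_filter *f, int core)
{
    /* invalidate own line -> fetch miss -> request parked in the filter */
    f->pending[core] = 1;
    f->arrived++;
    printf("core %d parked at barrier\n", core);
    if (f->arrived == NCORES) {           /* sync condition now satisfied */
        for (int c = 0; c < NCORES; c++)
            if (f->pending[c]) {
                f->pending[c] = 0;        /* L2 finally services the fetch */
                printf("core %d released\n", c);
            }
        f->arrived = 0;
    }
}

int main(void)
{
    l2_filter f = {0};
    for (int c = 0; c < NCORES; c++)      /* cores arrive one by one */
        core_reach_barrier(&f, c);
    return 0;
}

While a core's fetch request sits in the filter it issues no further memory traffic, which is where the bandwidth advantage over busy-waiting comes from.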

3.
Energy efficiency is a key metric in processor design. As the instruction-fetch unit occupies a growing share of chip area and consumes considerable power, researchers have proposed level-zero (L0) instruction caches. An L0 instruction cache is small and cheap to access, is tightly coupled with the pipeline, and can gate part of the front-end logic on a fetch hit, so it can effectively improve the energy efficiency of the fetch unit. This paper surveys existing L0 instruction cache organizations and the evolution and application of each, and outlines directions for future research on L0 instruction cache design.

4.
熊振亚  林正浩  任浩琪 《计算机科学》2017,44(3):195-201, 214
Modern computer architecture is constrained by two concerns: performance and energy. To reduce the growing power consumption of embedded processors, a taken-trace-based branch target buffer (TG-BTB) is proposed. Unlike a conventional BTB, which is queried on every instruction fetch, TG-BTB queries the BTB only when the execution trace is predicted to take a branch. By dynamically analyzing taken-branch behavior at run time, the design restricts BTB lookups to trace-taken points and thereby saves power. During dynamic analysis, it first extracts and records the instruction interval between two taken branch instructions, then stores that interval in the TG-BTB, and finally uses the stored interval to decide whether a BTB lookup is needed. Model validation and performance tests on benchmark suites show that TG-BTB reduces BTB lookup energy by 81%.
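
The gating idea fits in a few lines of C: a countdown holds the learned instruction interval to the next taken branch, and the BTB is queried only when it expires. Table size, indexing, and the trained trace below are assumptions.

#include <stdint.h>
#include <stdio.h>

#define ENTRIES 1024

static uint32_t gap_table[ENTRIES];   /* learned gaps, indexed by branch PC */
static long btb_queries = 0;

/* Returns 1 when the BTB was actually queried for this fetch. */
static int fetch(uint32_t pc, uint32_t *countdown)
{
    if (*countdown > 0) {             /* trace predicts fall-through: skip BTB */
        (*countdown)--;
        return 0;
    }
    btb_queries++;                    /* pay the BTB lookup energy only here */
    *countdown = gap_table[(pc >> 2) % ENTRIES];  /* reload the learned gap */
    return 1;
}

int main(void)
{
    /* Assume a trained entry: the taken branch at PC 0x40 repeats every
     * 8 instructions in this trace. */
    gap_table[(0x40 >> 2) % ENTRIES] = 8;

    uint32_t countdown = 0, fetched = 0;
    for (int i = 0; i < 9 * 10; i++) {         /* ten iterations of the loop */
        fetch(0x40 + 4 * (i % 9), &countdown);
        fetched++;
    }
    printf("fetches=%u btb_queries=%ld\n", fetched, btb_queries);
    return 0;
}

In this trace the BTB is consulted once per nine fetches instead of on every fetch, which is the source of the reported lookup-energy savings.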

5.
Instruction prefetching techniques designed for general-purpose systems cannot satisfy the needs of real-time systems. One important reason is that instruction-cache pollution caused by useless prefetches makes the estimated worst-case execution time (WCET) of real-time tasks imprecise, which degrades schedulability and seriously harms system efficiency. Aiming to simplify WCET analysis of real-time tasks and tighten their WCET estimates, this paper proposes an instruction prefetching method based on program basic blocks. Prefetching at basic-block granularity avoids the useless prefetches introduced by conventional prefetchers; by simplifying the worst-case hit/miss classification of instruction accesses, it streamlines WCET analysis and improves the WCET estimate. Evaluation on real-time benchmarks shows that, compared with no prefetching, the method lowers WCET estimates of real-time tasks by about 20% and improves average-case instruction cache performance by about 10%.
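
A minimal C sketch of prefetching at basic-block granularity, showing why the worst-case hit/miss classification collapses to one decision per block; the block table and execution path are hypothetical.

#include <stdio.h>

#define NBLOCKS 3
#define NLINES  64

typedef struct { int first_line, nlines; } bblock;

static bblock blocks[NBLOCKS] = { {0, 4}, {4, 2}, {6, 5} };
static int line_present[NLINES];
static int misses = 0;

static void enter_block(int b)
{
    /* On entry, the whole block is prefetched, so every access inside
     * the block is a guaranteed hit afterwards. */
    for (int i = 0; i < blocks[b].nlines; i++) {
        int line = blocks[b].first_line + i;
        if (!line_present[line]) {     /* at most one miss burst per block */
            line_present[line] = 1;
            misses++;
        }
    }
}

int main(void)
{
    int path[] = {0, 1, 2, 1};          /* an execution path over blocks */
    for (int i = 0; i < 4; i++)
        enter_block(path[i]);
    /* WCET analysis can bound misses statically: each block's lines miss
     * at most once on first entry along the path. */
    printf("worst-case line misses on this path: %d\n", misses);
    return 0;
}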

6.
With the growth of optical-disc resources on networks, disc servers have become an important technology for sharing optical discs over a network. To address the shortcomings of traditional disc servers, a new high-performance disc server, CDS (CDServer), was implemented. CDS uses a two-level cache (a client-side cache plus a server-side cache) to improve system performance. Exploiting the largely sequential access pattern of optical discs, the client cache uses a slow-growth, fast-decline prefetch algorithm, which improves performance while still bounding the response time when a prefetch misses. The server cache combines a hash table with balanced binary trees in a two-level organization, enabling fast cache lookup. The paper details the two-level cache algorithms of CDS and presents the corresponding experimental tests and performance analysis.
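
The client-side policy can be sketched as a read-ahead window that grows additively on sequential hits and shrinks multiplicatively on a miss; the constants in this C sketch are illustrative, not CDS's actual parameters.

#include <stdio.h>

#define WIN_MAX 64

static int window = 1;                 /* blocks to prefetch ahead */

static void on_access(int sequential_hit)
{
    if (sequential_hit) {
        if (window < WIN_MAX)
            window++;                  /* slow, additive growth */
    } else {
        window = window / 4;           /* fast, multiplicative decline */
        if (window < 1)
            window = 1;
    }
}

int main(void)
{
    for (int i = 0; i < 16; i++) on_access(1);   /* a sequential run */
    printf("window after 16 hits: %d\n", window);
    on_access(0);                                 /* a random seek */
    printf("window after a miss: %d\n", window);
    return 0;
}

Shrinking fast after a miss is what keeps the response time bounded: a mispredicted prefetch never leaves a large outstanding read-ahead to drain.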

7.
Advances in semiconductor process technology allow more processing cores to be integrated on a chip. For memory structures, which consume much of the area and power, reducing both effectively is an important direction in chip multiprocessor research. Software-managed instruction caching is an effective way to reduce the complexity and power of instruction storage. This paper compares hardware cache structures with software instruction caches in depth, analyzes two representative software instruction cache designs in detail, and summarizes their characteristics and the key problems they must solve, providing a reference for instruction-storage design in chip multiprocessors.

8.
Instruction compression overcomes the low density of long instruction words in the instruction cache of traditional very long instruction word (VLIW) architectures, packing the operations of each long instruction word tightly into cache lines. However, a long instruction word may then straddle two cache lines and cannot be fetched and issued in the same cycle, becoming a performance bottleneck. Software pipelining, the traditional technique for speeding up loops, also loses performance when long words are split across cache lines. A high-performance variable-length instruction issue window solves the fetch-and-issue problem caused by split instruction words and supplies the fetch pipeline with an efficient, continuous instruction stream; in particular, it buffers one loop iteration, supporting software pipelining in hardware and effectively strengthening the performance of VLIW digital signal processors (DSPs). A cycle-accurate processor simulation model was built and evaluated on a DSP/IMG library; with the two-level instruction issue window mechanism, average performance improves by about 21.89%.

9.
The fetch policy directly affects a processor's instruction throughput. Conventional fetch policies suffer from unbalanced use of fetch bandwidth and high instruction-queue conflict rates. This paper proposes IFSBSMT, a fetch policy for simultaneous multithreading (SMT) processors. Based on each thread's IPC, it selects the highest-priority threads for fetching, allocates fetch bandwidth through a per-thread budget of prefetched instructions, and distributes processor resources with a dual-priority dynamic allocation mechanism driven by thread IPC and L2 cache miss rate. Results show that IFSBSMT effectively resolves the fetch-bandwidth, instruction-queue-conflict, and resource-waste problems, further raises instruction throughput, and maintains good fetch fairness.
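
A hedged C sketch of the dual-priority pick: a thread's IPC promotes it, its L2 miss rate demotes it, and a prefetch budget caps its bandwidth. The scoring formula and the weights are our guess at one plausible realization, not the paper's exact mechanism.

#include <stdio.h>

#define NTHREADS 4

typedef struct {
    double ipc;            /* recent instructions per cycle */
    double l2_miss_rate;   /* fraction of L2 accesses that miss */
    int    budget;         /* remaining prefetched-instruction budget */
} thread_state;

static int pick_thread(thread_state *t)
{
    int best = -1;
    double best_score = -1.0;
    for (int i = 0; i < NTHREADS; i++) {
        if (t[i].budget <= 0)
            continue;                          /* budget exhausted this cycle */
        /* dual priority: IPC promotes, L2 miss rate demotes */
        double score = t[i].ipc * (1.0 - t[i].l2_miss_rate);
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;                               /* -1: no eligible thread */
}

int main(void)
{
    thread_state t[NTHREADS] = {
        {1.8, 0.05, 8}, {2.1, 0.40, 8}, {0.9, 0.02, 8}, {1.2, 0.10, 0}
    };
    printf("fetch from thread %d\n", pick_thread(t));  /* thread 0 wins */
    return 0;
}

Thread 1 has the highest raw IPC but its 40% L2 miss rate demotes it below thread 0, and thread 3 is skipped outright because its budget is spent.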

10.
Design of an L2 Cache Prefetch Structure for Multi-core Multi-threaded Processors
A well-designed L2 cache is an effective way to reduce memory access latency in multi-core multi-threaded processors. For existing multi-core multi-threaded processors, this paper discusses a hybrid prefetching design for the L2 cache. Detailed design and simulation analysis show that hybrid prefetching effectively improves overall processor performance; in particular, an L2 cache with miss-triggered hybrid prefetching performs best and is well suited to the needs of multi-core multi-threaded processors of this kind.

11.
Energy consumption and power dissipation are important concerns in the design of embedded systems and they will become even more crucial with finer process geometry, higher frequencies, deeper pipelines and wider issue designs. In particular, the instruction cache consumes more energy than any other processor module, especially with commonly used highly associative CAM-based implementations. Two energy-efficient approaches for highly associative CAM-based instruction cache designs are presented, by means of a segmented wordline and a predictor-based instruction fetch mechanism. The latter is based on the fact that not all instructions in a given I-cache fetch are used due to taken branches. The proposed Fetch Mask Predictor unit determines which instructions in a cache access will actually be used, to avoid fetching any of the other instructions. Both proposed approaches are evaluated for an embedded 4-wide issue processor in 100 nm technology. Experimental results show average I-cache energy savings of 48% and overall processor energy savings of 19%.
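
A toy C model of the Fetch Mask Predictor idea: a table indexed by the fetch-block PC stores a per-slot use mask, and only the enabled I-cache subarrays would be driven. Table size, indexing, and the training step are assumptions.

#include <stdint.h>
#include <stdio.h>

#define FMP_ENTRIES 512
#define FETCH_WIDTH 4

static uint8_t fmp[FMP_ENTRIES];      /* one use-bit per fetch slot */

static uint8_t predict_mask(uint32_t pc)
{
    uint8_t m = fmp[(pc >> 4) % FMP_ENTRIES];
    /* untrained entry: conservatively fetch every slot */
    return m ? m : (uint8_t)((1u << FETCH_WIDTH) - 1);
}

/* Training: after the fetch group retires, record which slots executed. */
static void train_mask(uint32_t pc, uint8_t used_slots)
{
    fmp[(pc >> 4) % FMP_ENTRIES] = used_slots;
}

int main(void)
{
    uint32_t pc = 0x1000;
    /* Suppose slot 1 held a taken branch, so slots 2-3 were wasted. */
    train_mask(pc, 0x3);               /* 0b0011: only slots 0 and 1 used */

    uint8_t m = predict_mask(pc);
    int enabled = 0;
    for (int s = 0; s < FETCH_WIDTH; s++)
        if (m & (1u << s))
            enabled++;                 /* subarrays that must be driven */
    printf("enable %d of %d fetch slots\n", enabled, FETCH_WIDTH);
    return 0;
}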

12.
The design of a high performance fetch architecture can be challenging due to poor interconnect scaling and energy concerns. Way prediction has been presented as one means of scaling the fetch engine to shorter cycle times, while providing energy efficient instruction cache accesses. However, way prediction requires additional complexity to handle mispredictions. In this paper, we examine a high-bandwidth fetch architecture augmented with an instruction cache way predictor. We compare the performance and energy efficiency of this architecture to both a serial access cache and a parallel access cache. Our results show that a serial fetch architecture achieves approximately the same energy reduction and performance as way prediction architectures, without the added structures and recovery complexity needed for way prediction.

13.
To speed up reads from on-chip Flash in embedded applications, an on-chip Flash acceleration controller based on prefetching and caching is proposed. The controller offers two acceleration schemes: a prefetch buffer and a cache. The prefetch-buffer scheme uses a widened datapath and prefetching to accelerate sequential instruction reads, and adds a branch buffer for non-sequential instructions to reduce the prefetch-miss penalty they cause. The cache scheme uses set associativity and way prediction to raise instruction reuse, reduce the number of Flash accesses, and lower system power. For different application scenarios, the two schemes can be switched statically via a register or adaptively and dynamically by software, yielding the best read speedup. Results on several benchmark programs demonstrate the feasibility and efficiency of the proposed on-chip Flash acceleration controller in improving performance and reducing power.
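
A minimal C sketch of the prefetch-buffer scheme, with a small branch buffer as a backstop for non-sequential fetches; the buffer sizes and the 4-instructions-per-wide-word ratio are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define LINE_WORDS   4     /* instructions per wide Flash word */
#define BRANCH_SLOTS 8

static uint32_t prefetch_base = ~0u;          /* wide word currently buffered */
static uint32_t branch_buf[BRANCH_SLOTS];     /* target lines of recent jumps */
static int      branch_valid[BRANCH_SLOTS];
static int      flash_reads = 0;

static void fetch_inst(uint32_t addr)
{
    uint32_t line = addr / LINE_WORDS;
    if (line == prefetch_base)
        return;                               /* sequential hit: no Flash read */
    for (int i = 0; i < BRANCH_SLOTS; i++)
        if (branch_valid[i] && branch_buf[i] == line) {
            prefetch_base = line;             /* non-sequential, branch-buffer hit */
            return;
        }
    flash_reads++;                            /* miss: pay the slow Flash access */
    prefetch_base = line;
    branch_buf[line % BRANCH_SLOTS] = line;   /* remember this jump target */
    branch_valid[line % BRANCH_SLOTS] = 1;
}

int main(void)
{
    for (int pass = 0; pass < 2; pass++)      /* run a small loop twice */
        for (uint32_t a = 0; a < 16; a++)
            fetch_inst(a);
    printf("flash reads for 32 fetches: %d\n", flash_reads);
    return 0;
}

On the second pass every wide word is found in the branch buffer, so only the four first-pass reads touch the Flash.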

14.
The static specification of operations executed in parallel using No Operations (NOPs) is another culprit behind code-size growth in VLIW architectures. Alternatives in instruction encoding and the memory subsystem have been proposed to minimize the impact of NOPs on code size. One is a compressed cache using a packed encoding scheme; the other is a decompressed cache using an unpacked encoding scheme. The compressed cache achieves high memory utilization but increases the pipeline branch penalty because it requires very complex fetch hardware. Conversely, fetch overhead is lower in the decompressed cache because the unpacked encoding scheme allows an instruction to be issued to the pipeline without any recovery process. Its shortcoming is that memory utilization deteriorates because memory is allocated irrespective of the number of useful operations. This research proposes a new instruction encoding scheme, called a semi-packed encoding scheme, and the section cache, which enables effective storage and retrieval of semi-packed instructions. The partially fixed instruction length reduces both the hardware complexity of instruction fetch and the memory space wasted on NOPs. Experimental results reveal that memory utilization in the section cache is 3.4 times higher than in the decompressed cache, and a memory subsystem using the section cache provides about 15% performance improvement with a moderate chip area.

15.
On-chip instruction caches are potentially power-hungry components in embedded systems due to their large chip area and high access frequency. To reduce the power consumption of the on-chip cache, we propose a Reduced One-Bit Tag Instruction Cache (ROBTIC), in which the cache size is judiciously reduced and the cache tag field contains only the least significant bit of the full tag. We develop a cache operational control scheme for ROBTIC so that, with the one-bit tag, program locality can still be exploited efficiently. For applications where most memory accesses are localized, our cache achieves performance similar to a traditional full-tag cache; however, its power consumption is significantly reduced thanks to the much smaller cache size, narrower tag array (just one bit), and tinier tag-comparison circuit. Experiments on a set of benchmarks implemented in a CMOS 180 nm process demonstrate that the proposed design reduces dynamic power consumption by up to 27.3% and area by 30.9% relative to a traditional cache with the size fixed at 32 instructions, outperforming the existing partial-tag-based cache design. With cache-size customization, a further 47.8% power saving can be achieved. Our results also show that in deep sub-micron technologies where leakage power is not negligible, the design remains efficient: a consistent power-saving trend (about 22%) is observed for technologies from 130 nm down to 65 nm.
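
One plausible way to make a one-bit tag safe is to confine hits to a single locality region whose upper tag bits live in a controller register; the C sketch below implements that simplified control scheme, which is our reading of the idea rather than the paper's exact design.

#include <stdint.h>
#include <stdio.h>

#define LINES 32

static uint32_t region_hi = ~0u;   /* upper tag bits shared by all lines */
static uint8_t  tag_lsb[LINES];    /* the one-bit tag actually stored */
static uint8_t  valid[LINES];

static int robtic_lookup(uint32_t addr)
{
    uint32_t idx = (addr >> 2) % LINES;
    uint32_t tag = addr >> 7;              /* 32 lines of 4-byte words */
    uint32_t hi  = tag >> 1;

    if (hi != region_hi) {                 /* left the locality region */
        for (int i = 0; i < LINES; i++)
            valid[i] = 0;                  /* flush, then adopt the new region */
        region_hi = hi;
    }
    if (valid[idx] && tag_lsb[idx] == (tag & 1))
        return 1;                          /* hit decided by one stored bit */
    valid[idx] = 1;                        /* refill on miss */
    tag_lsb[idx] = tag & 1;
    return 0;
}

int main(void)
{
    int hits = 0;
    for (int pass = 0; pass < 2; pass++)   /* a tight, localized loop */
        for (uint32_t a = 0; a < 64; a += 4)
            hits += robtic_lookup(a);
    printf("hits: %d of 32 accesses\n", hits);
    return 0;
}

Within a localized loop the comparison cost shrinks to a single bit, which is where the narrow tag array and tiny comparator savings come from.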

16.
Instruction caches account for a significant share of the power consumed by modern embedded processors. This paper proposes a low-power scheme for set-associative instruction caches based on way-access traces: an enhanced instruction cache and branch target buffer build and maintain, at run time, a trace of the ways the instruction cache accesses, eliminating hit-detection tag checks and accesses to irrelevant ways. It further proposes maintenance policies for the way-access-trace information, based on line-crossing-access predecessor pointers, branch predecessor state, branch predecessor pointers, and branch target indices, to reduce how often the information must be rebuilt and thus exploit established traces more effectively. Experimental results show that with the optimized way-access-trace scheme, tag-memory and data-memory accesses fall to 3.60% and 27.70%, respectively, of those of a conventional instruction cache.

17.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and a memory access consumes 10 times more power than a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increase the cache hit rate and reduce cache-access power by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC combines the low power consumption and low access delay of a direct-mapped cache with a hit rate close to that of a two-way set-associative cache. We also develop another design, known as multiple linked caches (MLC), to further reduce per-access power and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions, reducing the power consumed by each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of frequently executed blocks are recorded in the branch target buffer (BTB); by consulting the BTB, the processor can access memory to obtain the requested instruction directly when it is not in the cache. In simulation, our method performs better than selective compression, a traditional cache, and a filter cache in terms of cache hit rate, power consumption, and execution time.
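
As we read the SLC idea, a direct-mapped line gains a link field naming one alternate location, so a second cheap probe recovers conflict misses; the C sketch below is a hypothetical rendering with invented field names.

#include <stdint.h>
#include <stdio.h>

#define LINES 64

typedef struct {
    uint32_t tag;
    int      valid;
    int      link;        /* index of an alternate line, -1 if none */
} slc_line;

static slc_line slc[LINES];

static int slc_lookup(uint32_t addr)
{
    int idx = (addr >> 2) % LINES;
    uint32_t tag = addr >> 8;

    if (slc[idx].valid && slc[idx].tag == tag)
        return 1;                          /* primary, direct-mapped hit */
    int alt = slc[idx].link;
    if (alt >= 0 && slc[alt].valid && slc[alt].tag == tag)
        return 1;                          /* hit in the linked line */
    return 0;                              /* true miss */
}

int main(void)
{
    /* Two hot instructions that collide on line 5: keep one in line 5
     * and park the other in a victim line reached via the link field. */
    slc[5]  = (slc_line){ .tag = 1, .valid = 1, .link = 63 };
    slc[63] = (slc_line){ .tag = 2, .valid = 1, .link = -1 };

    uint32_t a1 = (1u << 8) | (5u << 2);   /* maps to line 5, tag 1 */
    uint32_t a2 = (2u << 8) | (5u << 2);   /* also line 5, tag 2 */
    printf("a1 hit=%d a2 hit=%d\n", slc_lookup(a1), slc_lookup(a2));
    return 0;
}

The common case still pays only a direct-mapped probe; the second probe is taken only when the first tag check fails, which is how SLC keeps access delay and energy close to a direct-mapped cache.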

18.
The instruction compression mechanism used to overcome the drawbacks of traditional very long instruction word (VLIW) architectures often leads to poor code density in the instruction cache, causing long instructions of irregular length to straddle cache lines. Such split long instructions cannot be fetched simultaneously, creating a bottleneck for VLIW architectures. This paper proposes a buffering mechanism that slides split long instructions back into contiguous form for more efficient instruction fetching. The approach preserves the behavior of software pipelining, which schedules iterative instructions to enhance the streaming performance of VLIW architectures. In the proposed mechanism, the instruction stream buffer stores a repeat block in its entirety and suspends cache accesses as far as possible to reduce access time. Issuing instructions repeatedly from the buffer while avoiding split long instructions substantially improves instruction-fetch performance. Simulation results show the mechanism improves instruction-level performance on the basic DSP/IMG library by 35% on average.

19.
Many-core processor design faces severe chip-area constraints, and how to devote the limited area to compute capability is a hot topic in many-core architecture research. Focusing on instruction-cache design for many-core processors, this work studies sharing an L1 instruction cache among multiple cores to improve the performance of the instruction subsystem and the processor pipeline. It presents the design of the shared instruction cache, performs cycle-accurate performance simulation of the structure, and obtains area overhead and timing figures by synthesizing RTL code. Test results show that the shared instruction cache reduces cache miss rates by 11% to 27% and improves pipeline performance by 4% to 7%.
