Similar Documents
17 similar documents found (search time: 265 ms).
1.
This paper proposes an algorithm for partitioning the last-level shared cache based on an improved LRU replacement policy. It isolates data conflicts between threads and implements the improved replacement policy, and by partitioning the last-level shared cache it also reduces memory access latency and raises system throughput.

2.
The LRU replacement algorithm is widely used in single-core processors, whereas multi-core systems commonly share the last-level cache (LLC) among cores. As LLC capacity and associativity grow and the working sets of multi-core workloads increase, the gap between LRU and the theoretically optimal replacement algorithm widens. This paper proposes a frequency-based replacement algorithm for multi-core shared caches under equal partitioning (ALRU-F), which keeps the currently needed portion of the working set in the cache and evicts useless blocks, as well as a frequency-based replacement algorithm under block-granularity dynamic partitioning (BLRU-F). Compared with conventional LRU, ALRU-F lowers the miss rate by 26.59% and raises IPC (Instructions Per Clock) by 13.59%; BLRU-F improves performance further, lowering the miss rate by 33.72% and raising IPC by 16.59%. Neither algorithm noticeably increases energy consumption.
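The abstract does not spell out ALRU-F's bookkeeping; purely as a loose illustration of the general idea it names (an LRU stack augmented with access frequency so that rarely reused blocks are evicted first), here is a minimal Python sketch. The class and the "evict the least-frequent block from the LRU half" rule are assumptions for illustration, not the paper's algorithm.

```python
from collections import OrderedDict

class FreqAwareLRU:
    """Toy LRU variant that picks eviction victims by access frequency
    among the least recently used blocks. Illustrative only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # tag -> access count, LRU order first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks[tag] += 1
            self.blocks.move_to_end(tag)           # becomes most recently used
            return True                            # hit
        if len(self.blocks) >= self.capacity:
            # Consider the LRU half; evict the least frequently used of them.
            lru_half = list(self.blocks.items())[: max(1, self.capacity // 2)]
            victim = min(lru_half, key=lambda kv: kv[1])[0]
            del self.blocks[victim]
        self.blocks[tag] = 1                       # insert as most recent
        return False                               # miss
```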

3.
Taking the V-Way Cache as its prototype, this paper proposes CMP-VH, a variable-associativity hybrid cache organization for CMPs. CMP-VH partitions the last-level on-chip cache into an optimized private/shared structure in which tags are private while data is partly private and partly shared. It adopts a replacement policy based on per-block reuse information and provides both explicit and implicit mechanisms for partitioning shared-data capacity among cores. Simulations with the SPLASH-2 parallel workloads...

4.
陈芳园, 张冬松, 王志英. 《电子学报》, 2012, 40(7): 1372-1378
In a multi-core processor with a shared cache, one thread's instructions may be evicted from the shared cache by the instructions of concurrently running threads, causing inter-thread interference in the shared cache. WCET estimation for multi-core architectures must therefore account for this interference. Targeting the typical multi-core architecture with a shared cache and a shared bus, this paper proposes an iterative WCET estimation method: it analyzes the timing impact of the shared bus on shared-cache accesses and, based on that timing, analyzes inter-thread interference in the shared cache to obtain a tighter WCET estimate. Theoretical analysis proves the method's validity, and experimental results show that it improves accuracy by 21% and 14%, respectively, over two current methods.

5.
This paper first analyzes the strengths and weaknesses of private and shared management of the L2 cache in multi-core systems, and on that basis examines existing optimization strategies built on the private and shared schemes; all of them seek a balance between cache access latency and cache hit rate by mixing private and shared organization.

6.
A Disk Cache Replacement Algorithm Combined with a Dynamic Write Policy. Cited: 1 (self: 0, others: 1)
Disk caching is a technique for improving I/O performance. By analyzing the impact of cache write policies and of the LRU and LFU replacement algorithms on disk cache performance, this paper introduces a dynamic write policy and an improved replacement algorithm that combines the frequency-based block replacement algorithm (FBR) with the dynamic write policy. The combination applies well to disk access: it exploits locality, improves I/O performance, and makes the disk perform better across a variety of workloads and cache sizes.
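The abstract gives neither the FBR parameters nor the switching criterion; purely as a sketch of what a dynamic write policy can look like (write-through under read-heavy load, deferred write-back under write-heavy load), here is a hedged Python illustration. The sliding window and threshold are assumptions, not the paper's design.

```python
class DynamicWritePolicy:
    """Toy disk-cache write-policy switch: write-through under read-heavy
    load, deferred write-back under write-heavy load. Illustrative only;
    the paper's criterion and its FBR integration are not in the abstract."""

    def __init__(self, window=100, threshold=0.5):
        self.window = window        # number of recent ops considered
        self.threshold = threshold  # write fraction that triggers write-back
        self.recent = []            # sliding window of 'r'/'w' markers
        self.dirty = set()          # blocks awaiting a deferred disk write

    def read(self, block):
        self.recent = (self.recent + ["r"])[-self.window:]

    def write(self, block, disk_write):
        self.recent = (self.recent + ["w"])[-self.window:]
        if self.recent.count("w") / len(self.recent) > self.threshold:
            self.dirty.add(block)   # write-back mode: defer the disk write
        else:
            disk_write(block)       # write-through mode: hit the disk now

    def flush(self, disk_write):
        for block in self.dirty:    # called on eviction or sync
            disk_write(block)
        self.dirty.clear()
```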

7.
In a multi-core environment, optimizing the shared L2 cache is especially important, because when a requested block is absent from the L2 cache (an L2 miss) the CPU pays the considerable cost of several hundred cycles to access main memory. The replacement algorithm is a key consideration in cache design and directly affects cache performance and overall system performance. Although LRU is widely used in on-chip caches, it has shortcomings: it is prone to conflict misses when the cache capacity is smaller than the program's working set, and it ignores how frequently blocks are accessed. This paper applies the bubble replacement algorithm to the multi-core shared cache, taking both access frequency and recency into account. Analysis of the experimental data shows that, compared with LRU, bubble replacement improves both MPKI (misses per kilo-instructions) and the L2 cache hit rate.
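Bubble replacement itself is simple enough to sketch: a hit swaps the block one position toward the protected end (so hot blocks gradually "bubble up", capturing frequency), and a miss evicts the block at the unprotected tail. The single-set Python model below illustrates that general policy, not the paper's multi-core shared-cache implementation.

```python
class BubbleCache:
    """One cache set under bubble replacement: index 0 is the most
    protected slot, the tail is the eviction candidate. A sketch of the
    general policy only."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = []                 # ordered from protected to victim

    def access(self, tag):
        if tag in self.blocks:
            i = self.blocks.index(tag)
            if i > 0:                    # bubble one slot toward the head
                self.blocks[i - 1], self.blocks[i] = (
                    self.blocks[i], self.blocks[i - 1])
            return True                  # hit
        if len(self.blocks) >= self.ways:
            self.blocks.pop()            # evict the tail block
        self.blocks.append(tag)          # new blocks start at the tail
        return False                     # miss
```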

8.
LRU-Assist: An Efficient Algorithm for Controlling Cache Leakage Power. Cited: 5 (self: 4, others: 1)
As integrated circuit fabrication enters the very deep sub-micron regime, leakage power accounts for an ever larger share of total microprocessor power; beyond developing new low-leakage processes and circuit techniques, controlling and optimizing leakage at the architecture level has become a research focus. The cache occupies the largest area in a microprocessor and is thus the first target for leakage control. LRU is the most common replacement algorithm for set-associative caches, and analysis shows that accesses rarely hit in the lower half of the LRU stack. Building on control schemes such as Drowsy Cache and Cache Decay, the LRU-Assist algorithm uses the existing LRU information to raise the cache turn-off ratio by an average of 15% without affecting processor performance, greatly reducing leakage power.
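The key observation, that hits rarely land in the lower half of the LRU stack, translates directly into a power-gating heuristic: keep only the MRU positions of each set fully powered. A minimal Python sketch of that mapping follows; the keep_awake parameter is an assumption, and the paper's integration with Drowsy Cache/Cache Decay is more involved.

```python
def drowsy_mask(lru_stack, keep_awake=2):
    """Given one set's ways ordered MRU -> LRU, return which ways to keep
    fully powered. Encodes the LRU-Assist observation that hits to the LRU
    half are rare; keep_awake is a free parameter, not the paper's value."""
    awake = set(lru_stack[:keep_awake])
    return {way: way in awake for way in lru_stack}

# Example: in a 4-way set, the two LRU ways can go drowsy/off.
print(drowsy_mask(["A", "B", "C", "D"]))
# {'A': True, 'B': True, 'C': False, 'D': False}
```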

9.
To further narrow the speed gap between external storage and the CPU and to meet ever-growing I/O request rates, a cache is introduced into disk array design and a cache management strategy suitable for RAID controllers is implemented. The work focuses on cache organization and management: it raises the cache hit rate with an optimized least recently used (LRU) algorithm and reduces the number of disk I/Os through a tree-structure transformation, improving overall system performance. Experimental results on a RAID controller prototype show that the strategy significantly reduces write disk I/Os.

10.
High-performance DSP devices face increasingly strict power requirements, and most of the power goes into memory accesses. This paper therefore proposes an improved cache power optimization strategy that implements phased access to the instruction cache, optimizing both dynamic power and static leakage power and remedying the notable performance penalty of the traditional non-phased on-demand wakeup policy NPOWP (Non-Phased Cache with On-Demand Wakeup Prediction). Applied to a 4-way set-associative drowsy instruction cache in a DSP design, the phased-access on-demand wakeup policy POWP (Phased Cache with On-Demand Wakeup Prediction) reduces instruction cache power by an average of 75.4% and total processor power by 6.7%, with a performance loss of only 0.77%.
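Phased access, which the POWP policy builds on, probes the tag arrays first and reads only the single matching data way, saving the dynamic energy of reading all ways in parallel at the cost of an extra access phase. A hedged Python sketch (function and parameter names are illustrative; the wakeup-prediction logic itself is not shown):

```python
def phased_access(set_tags, addr_tag, read_data_way):
    """Tag-first (phased) cache access: compare tags in phase 1, then read
    only the matching data way in phase 2, so a hit costs one data-way read
    and a miss costs none. Names are illustrative, not from the paper."""
    for way, tag in enumerate(set_tags):     # phase 1: tag arrays only
        if tag == addr_tag:
            return read_data_way(way)        # phase 2: single data-way read
    return None                              # miss: data arrays untouched

# Example: a 4-way set where way 2 matches.
data = ["d0", "d1", "d2", "d3"]
print(phased_access([10, 11, 12, 13], 12, lambda w: data[w]))  # 'd2'
```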

11.
Research and Implementation of an Improved Cache Scheme Based on the LRU Algorithm. Cited: 1 (self: 0, others: 1)
廖鑫. 《电子工程师》, 2008, 34(7): 46-48
The LRU (least recently used) replacement algorithm is widely used in many applications on single-processor architectures. In multiprocessor architectures, however, traditional LRU is not optimal for reducing the miss rate of a shared cache. This paper studies the basic cache block replacement algorithms and, building on an analysis of LRU, proposes an improved caching scheme based on LRU and access probability: it weighs both recency of use and access frequency when choosing the candidate victim block, making the replacement algorithm better suited to multiprocessors.
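As a rough illustration of combining recency with access probability when choosing a victim, the sketch below scores each block by a weighted mix of its LRU stack position and its observed access frequency. The scoring formula and the alpha weight are assumptions for illustration; the paper's actual scheme is not detailed in the abstract.

```python
def choose_victim(blocks, alpha=0.5):
    """Pick a victim by weighting recency against access probability.
    blocks: {tag: (lru_position, access_count)}; a higher lru_position
    means less recently used. Higher score = better eviction candidate."""
    total = sum(cnt for _, cnt in blocks.values()) or 1
    def score(tag):
        pos, cnt = blocks[tag]
        return alpha * pos - (1 - alpha) * (cnt / total)
    return max(blocks, key=score)

# Example: block 'c' is both old (position 3) and rarely used -> victim.
blocks = {"a": (0, 9), "b": (1, 4), "c": (3, 1), "d": (2, 6)}
print(choose_victim(blocks))  # 'c'
```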

12.
Design Space Exploration for 3-D Cache. Cited: 1 (self: 0, others: 1)
As technology scales, interconnects have become a major performance bottleneck and a major source of power consumption for sub-micron integrated circuit (IC) chips. One promising option for mitigating the interconnect challenges is 3D ICs, in which a stack of multiple device layers is put together on the same chip. In this paper, we explore the architectural design of cache memories using 3D circuits. We present 3D-Cacti, a delay and energy estimation tool for exploring different 3D design options for partitioning a cache. The tool allows partitioning of a cache across different device layers at various levels of granularity, and it has been validated by comparing its results with those obtained from circuit simulation of custom 3D layouts. We also explore the effects of various cache partitioning parameters and 3D technology parameters on delay and energy to demonstrate the utility of the tool.

13.
The system, circuit, layout, and device levels of an integrated cache memory (ICM), which includes a 32 kbyte DATA memory with a typical address-to-HIT delay of 18 ns and address-to-DATA delay of 23 ns, are described. The ICM offers the largest memory size and the fastest speed reported for a cache memory to date. The device integrates a 32 kbyte DATA/INSTRUCTION memory, a 34 kbit TAG memory, an 8 kbit VALID flag, a 2 kbit least recently used (LRU) flag, comparators, and CPU interface logic circuits on a single chip. The inclusion of the DATA memory is crucial to improving system cycle time. The device uses several novel circuit design techniques, including a double-word-line scheme, low-noise flush clear, a low-power comparator, noise-immune design, and directly testable memory design. Its newly proposed way-slice architecture increases both flexibility and expandability.

14.
Web caching has been the solution of choice to web latency problems. The efficiency of a web cache is strongly affected by the replacement algorithm used to decide which objects to evict once the cache is saturated. Numerous web cache replacement algorithms have appeared in the literature. Despite their diversity, a large number of them belong to a class known as stack-based algorithms. These algorithms are evaluated mainly via trace-driven simulation. The very few analytical models reported in the literature were targeted at one particular replacement algorithm, namely least recently used (LRU) or least frequently used (LFU); further, they provide a formula for evaluating the hit ratio only. The main contribution of this paper is an analytical model for the performance evaluation of any stack-based web cache replacement algorithm. The model provides formulae for predicting the object hit ratio, the byte hit ratio, and the delay saving ratio. The model is validated against extensive discrete-event trace-driven simulations of the three popular stack-based algorithms LRU, LFU, and SIZE, using NLANR and DEC traces. Results show that the analytical model achieves very good accuracy: the mean error deviation between analytical and simulation results is at most 6% for LRU, 6% for LFU, and 10% for the SIZE stack-based algorithm. Copyright © 2009 John Wiley & Sons, Ltd.
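The three metrics the model predicts are standard ratios; computed directly from a trace they look as follows, which is the kind of ground truth the analytical formulae are validated against. The field names and trace layout below are illustrative assumptions, not the paper's notation.

```python
def cache_metrics(trace, hits):
    """Compute hit ratio, byte hit ratio, and delay saving ratio from a
    trace. trace: list of (object_id, size_bytes, fetch_delay) requests;
    hits: set of request indices that hit in the cache."""
    n = len(trace)
    hit_ratio = len(hits) / n
    total_bytes = sum(size for _, size, _ in trace)
    hit_bytes = sum(trace[i][1] for i in hits)
    total_delay = sum(delay for _, _, delay in trace)
    saved_delay = sum(trace[i][2] for i in hits)
    return hit_ratio, hit_bytes / total_bytes, saved_delay / total_delay
```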

15.

Although multi-core processors enhance performance, estimating the Worst-Case Execution Time (WCET) of a task remains challenging in such systems because of interference in shared resources such as the last-level cache (LLC). Cache partitioning has been used to reduce the interference problem by isolating the shared cache for each thread, easing WCET estimation; however, it prevents information from being shared among parallel threads running on different cores. In the current work, we propose a sharing and reuse aware partitioned cache (SRCP) framework in which replication of shared information, data or instruction, across different partitions of the LLC is avoided. We further propose an enhancement to the existing cache replacement policy that avoids evicting cache blocks shared among multiple cores accessing the partitioned last-level cache. The proposed framework thereby ensures a tighter WCET as well as improved resource utilization. Experimental results show that SRCP significantly improves the cache hit rate for the PARSEC and SPLASH-2 benchmarks compared to the least recently used replacement policy, and outperforms EHC and TA-DRRIP, which are state-of-the-art replacement policies.
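The replacement enhancement can be caricatured as "prefer victims with the fewest sharers, breaking ties by recency". The Python sketch below shows that selection rule only; SRCP's real bookkeeping (per-block sharer tracking across partitions) is more involved, and the data layout here is an assumption.

```python
def srcp_style_victim(set_blocks):
    """Prefer evicting blocks used by a single core, protecting blocks
    shared across cores; among equally shared blocks, evict the least
    recently used. set_blocks: {tag: (sharer_count, lru_position)},
    where a higher lru_position means less recently used."""
    return max(set_blocks,
               key=lambda t: (-set_blocks[t][0], set_blocks[t][1]))

# Example: 'x' is shared by 3 cores, 'y' by 1 -> 'y' is evicted first.
print(srcp_style_victim({"x": (3, 2), "y": (1, 0)}))  # 'y'
```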


16.
Internet energy efficiency has recently received more and more attention, and new Internet architectures with better energy efficiency have been proposed to improve the scalability of energy consumption. Content-centric networking (CCN) introduced a content-centric paradigm that has been shown to have higher energy efficiency. Based on an energy optimization model of CCN with in-network caching, the authors derive expressions that trade off caching energy against transport energy, and then design a new energy-efficient cache scheme based on virtual round-trip time (EV) in CCN. Simulation results show that the EV scheme outperforms the least recently used (LRU) and popularity-based cache policies in average network energy consumption, and its average hop count is also much better than the LRU policy's.

17.
As the number of cores in chip multi-processor systems increases, contention over shared last-level cache (LLC) resources increases, making LLC optimization critical, especially for embedded systems with strict area/energy/power constraints. We propose cache partitioning with partial sharing (CaPPS), which reduces LLC contention using cache partitioning and improves utilization with sharing configuration. Sharing configuration enables partitions to be privately allocated to a single core, partially shared with a subset of cores, or fully shared with all cores, based on the co-executing applications' requirements. CaPPS imposes low hardware overhead and affords an extensive design space to increase optimization potential. To facilitate fast design space exploration, we develop an analytical model that quickly estimates the miss rates of all CaPPS configurations from the applications' isolated LLC access traces, predicting runtime LLC contention. Experimental results demonstrate that the analytical model estimates cache miss rates with an average error of only 0.73% and an average speedup of 3505× compared to a cycle-accurate simulator. Owing to CaPPS's extensive design space, CaPPS can reduce the average LLC miss rate by as much as 25% compared to baseline configurations and by 14-17% compared to prior works.
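A classic building block of such trace-driven analytical models is the stack (reuse) distance: under fully associative LRU, an access misses exactly when the number of distinct blocks touched since its previous use reaches the cache capacity. The sketch below shows that textbook estimator, not the CaPPS model itself, whose contention prediction across partitions goes further.

```python
def lru_miss_rate(trace, capacity):
    """Estimate the miss rate of a fully associative LRU cache from an
    access trace via stack distances: an access hits iff fewer than
    `capacity` distinct blocks were touched since its last use."""
    stack, misses = [], 0
    for block in trace:
        if block in stack:
            depth = stack.index(block)   # stack (reuse) distance
            stack.remove(block)
            if depth >= capacity:
                misses += 1              # would have been evicted already
        else:
            misses += 1                  # cold miss
        stack.insert(0, block)           # move to the MRU position
    return misses / len(trace)

# Example: with capacity 1, the pattern A B A misses on every access.
print(lru_miss_rate(["A", "B", "A"], 1))  # 1.0
```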
