Similar Documents
 20 similar documents found (search time: 296 ms)
1.
In recent years, power has become one of the key issues in processor design, and traditional countermeasures such as DVFS (Dynamic Voltage and Frequency Scaling) are now running into diminishing returns. As multi-core/many-core processors become mainstream, on-chip caches take up an ever larger share of CPU die area and power. To reduce the dynamic power of the cache, this paper proposes filtering out unnecessary cache-way accesses: an Invalid Filter eliminates accesses to ways holding invalid blocks; an I/D Filter eliminates accesses to ways whose blocks do not match the access type (instruction or data); and a Tag-2 Filter eliminates accesses to ways whose low-order tag bits do not match the request. The paper combines the three techniques into the Invalid+I/D+Tag-2 Filter for better results, and verifies their effectiveness and complementarity through analysis and experiments. The experiments also show that, compared with the Invalid+I/D Filter, the Invalid+I/D+Tag-2 Filter improves results by 19.6%-47.8% (34.3% on average) on a 64 KB 4-way set-associative cache and by 19.6%-55.2% (39.2% on average) on a 128 KB 8-way set-associative cache; compared with the Invalid+Tag-2 Filter, it improves results by 16.1%-27.7% (16.6% on average) on the 64 KB 4-way cache and by 6.9%-44.4% (25.0% on average) on the 128 KB 8-way cache.
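A minimal Python sketch of the combined way-filtering idea is given below; the field names, the two-bit tag fragment, and the data layout are illustrative assumptions, not the paper's implementation.

```python
# Before probing the ways of a set, drop any way whose valid bit is clear (Invalid Filter),
# whose block type does not match the access type (I/D Filter), or whose stored low-order
# tag bits differ from the request's (Tag-2 Filter). Only surviving ways are read.

from dataclasses import dataclass

@dataclass
class WayState:
    valid: bool            # valid bit of the block in this way
    is_instruction: bool   # block was filled by an instruction fetch
    tag_low2: int          # two low-order tag bits kept beside the way

def ways_to_probe(set_state, access_is_instruction, req_tag):
    """Return the indices of ways that still need a full tag compare / data read."""
    req_low2 = req_tag & 0b11
    survivors = []
    for idx, way in enumerate(set_state):
        if not way.valid:                                    # Invalid Filter
            continue
        if way.is_instruction != access_is_instruction:      # I/D Filter
            continue
        if way.tag_low2 != req_low2:                         # Tag-2 Filter
            continue
        survivors.append(idx)
    return survivors

# Example: a 4-way set where only way 2 can possibly hit a data access with tag 0x1A6.
ways = [WayState(False, False, 0b10), WayState(True, True, 0b10),
        WayState(True, False, 0b10), WayState(True, False, 0b01)]
print(ways_to_probe(ways, access_is_instruction=False, req_tag=0x1A6))  # -> [2]
```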

2.
As process technology advances, multi-core processors integrate more and more cores and larger on-chip caches, and a non-uniform cache architecture (NUCA) is used to cope with the growing wire delay in the cache system of chip multiprocessors. An efficient block-migration policy is critical to the whole cache system. The block-migration policy in current dynamic NUCA (D-NUCA) ignores a block's access history, causing blocks to ping-pong between banks and increasing access latency. This paper proposes a reuse-aware block migration (RABM) policy that uses a block's migration history to predict future migrations, improving D-NUCA performance and reducing the power of the whole cache system. Full-system simulation with the PARSEC benchmarks shows that, compared with D-NUCA, RABM-based D-NUCA improves instructions per cycle by 9.6% on average and reduces on-chip cache power by 14%.
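One plausible reading of a reuse-aware migration filter is sketched below as a toy model; the counter width, update rule, and interface are assumptions for illustration and do not reproduce the paper's RABM bookkeeping.

```python
# Each block keeps a 2-bit saturating counter that records whether its past migrations
# were followed by reuse at the destination bank; blocks whose migrations were not
# reused stay put, damping ping-pong between banks.

class BlockMigrationHistory:
    def __init__(self):
        self.counter = 2            # 2-bit saturating counter, weakly "migrate"
        self.migrated_last = False  # did the previous access trigger a migration?

    def should_migrate(self, hit_in_local_bank):
        # Learn from the previous decision: a migration followed by a local-bank hit
        # counts as useful; a migration followed by a remote hit counts as wasted.
        if self.migrated_last:
            if hit_in_local_bank and self.counter < 3:
                self.counter += 1
            elif not hit_in_local_bank and self.counter > 0:
                self.counter -= 1
        # Predict: only migrate when history says migrations tend to be reused.
        decision = (not hit_in_local_bank) and self.counter >= 2
        self.migrated_last = decision
        return decision

h = BlockMigrationHistory()
for local in [False, False, True, False]:
    print(h.should_migrate(hit_in_local_bank=local))
```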

3.
The instruction cache accounts for a significant share of power in modern embedded processors. This paper proposes a low-power scheme for set-associative instruction caches based on way-access traces: a modified instruction cache and branch target buffer build and maintain run-time way-access traces of the instruction cache, reducing hit detection and accesses to irrelevant ways. It further proposes a trace-maintenance policy based on cross-line predecessor pointers, branch predecessor state, branch predecessor pointers, and branch target indices, lowering the frequency of trace rebuilding and making better use of the traces already established. Experimental results show that, with the optimized way-access trace policy, tag-memory and data-memory accesses of the instruction cache drop to 3.60% and 27.70% of those of a conventional instruction cache, respectively.

4.
方娟  王帅  于璐 《计算机科学》2014,41(7):36-39,73
Improving the performance of multi-core processors and reducing their cache power have become research focuses for next-generation multi-core designs. To lower the power of chip multiprocessors, a new dynamic partitioning mechanism based on a way-adaptation algorithm can be used, consisting of a way-allocation module and a dynamic power-control module. During execution, the way-allocation module adjusts the cache ways assigned to each core according to the working-set size of the thread it runs. Exploiting the locality of program execution, the dynamic power-control module confines each thread's working space to a few cache ways and shuts down the remaining ways, thereby reducing cache power. The mechanism was simulated on the Simics full-system platform and evaluated with the SpecOMP benchmarks. Compared with a conventional L2 cache (C-L2), IPC improves by 9.27% and power drops by 10.95%.

5.
The branch target buffer (BTB) is one of the main power consumers in high-end embedded CPUs. To address the redundant power introduced by BTB accesses, a loop-filtering mechanism is proposed that eliminates useless BTB accesses by sequential instructions inside loop bodies. A branch-trace method is further proposed to compensate the performance loss caused by the loop filter wrongly filtering non-loop branch instructions inside the loop body, saving a large amount of redundant power spent on sequential instructions accessing the BTB. Simulations on the Powerstone benchmarks show that, with a 128-entry BTB, a two-level loop filter plus a 4-entry branch trace table reduces BTB power by about 71.9%, while the average cycles per instruction (CPI) degrades by only 0.66%.
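The loop-filtering idea can be sketched roughly as follows; the loop-detection trigger, table size, and interface are illustrative assumptions rather than the paper's hardware design.

```python
# Once a taken backward branch identifies a loop body, sequential PCs inside that body
# skip the BTB lookup; a small "branch trace" table remembers branches seen inside the
# loop so that they are still looked up and no targets are missed.

class LoopBTBFilter:
    def __init__(self, trace_entries=4):
        self.loop_start = self.loop_end = None    # current loop body [start, end]
        self.branch_trace = []                    # PCs of known branches inside the loop
        self.trace_entries = trace_entries

    def on_taken_backward_branch(self, branch_pc, target_pc):
        """A taken backward branch delimits a loop body."""
        self.loop_start, self.loop_end = target_pc, branch_pc
        self.branch_trace = []

    def record_branch(self, pc):
        """Remember a non-loop branch seen inside the loop body."""
        if pc not in self.branch_trace and len(self.branch_trace) < self.trace_entries:
            self.branch_trace.append(pc)

    def needs_btb_lookup(self, pc):
        in_loop = (self.loop_start is not None and self.loop_start <= pc <= self.loop_end)
        if not in_loop:
            return True                           # outside a loop: look up as usual
        return pc == self.loop_end or pc in self.branch_trace

f = LoopBTBFilter()
f.on_taken_backward_branch(branch_pc=0x120, target_pc=0x100)
f.record_branch(0x110)                            # an if-branch inside the loop
print([hex(pc) for pc in (0x104, 0x110, 0x120) if f.needs_btb_lookup(pc)])
# -> ['0x110', '0x120']: only real branches in the loop still probe the BTB
```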

6.
Energy efficiency is an important metric in processor design. Because the instruction-fetch path occupies an increasing share of die area and consumes a substantial share of chip power, researchers have proposed level-zero (L0) instruction cache designs. An L0 instruction cache is small and cheap to access, is tightly coupled with the pipeline, and can gate part of the pipeline logic on a fetch hit, so it can effectively improve the energy efficiency of the fetch path. This paper surveys the existing L0 instruction-cache organizations and their evolution and applications, and outlines future research directions for L0 instruction-cache design.

7.
Caching and prefetching are the main techniques currently used to reduce access latency and improve the user-perceived interaction performance of the Web. Existing cache replacement mechanisms mainly consider an object's access time and access frequency, yet Web objects also carry semantics. This paper first analyzes real Web access data and finds that the rate of change of the access interval reflects the hit rate more accurately; it then proposes an object-attribute-based cache replacement policy that computes each cached object's recent average access interval and combines it with the object's tag attributes to form the object's value in the cache. Experimental results show the policy improves the hit rate by 7%-10% over an Aging-based policy and a cooperative centralized-decision caching policy.
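A rough sketch of such an attribute-based replacement value follows; the weights, the attribute score, and the history length are invented for illustration and are not the paper's formula.

```python
# An object's value combines its recent average access interval (hotter objects have
# shorter intervals) with a score derived from its tag attributes; the lowest-valued
# object is the eviction victim.

import time

class CachedObject:
    def __init__(self, key, tag_score):
        self.key = key
        self.tag_score = tag_score        # value derived from the object's tag attributes
        self.access_times = []

    def touch(self, now=None):
        self.access_times.append(now if now is not None else time.time())
        self.access_times = self.access_times[-8:]    # keep a limited recent history

    def avg_interval(self):
        t = self.access_times
        if len(t) < 2:
            return float("inf")
        return (t[-1] - t[0]) / (len(t) - 1)

    def value(self, alpha=1.0, beta=0.5):
        # A shorter average interval and a higher tag score both raise the value.
        return alpha / (1.0 + self.avg_interval()) + beta * self.tag_score

def pick_victim(cache):
    return min(cache.values(), key=lambda obj: obj.value()).key

cache = {k: CachedObject(k, s) for k, s in [("a.html", 0.2), ("b.css", 0.9)]}
for t in (0, 1, 2):
    cache["a.html"].touch(now=t)          # "a.html" is accessed every second
cache["b.css"].touch(now=0)
print(pick_victim(cache))                  # -> 'b.css' (cold, despite its higher tag score)
```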

8.
To speed up reads from on-chip Flash in embedded applications, an on-chip Flash acceleration controller based on prefetching and caching is proposed. The controller provides two acceleration schemes: a prefetch buffer and a cache. The prefetch-buffer scheme uses width expansion and prefetching to accelerate sequential instruction reads, and a branch buffer to hold non-sequential instructions, reducing the prefetch-miss penalty they cause. The cache scheme uses set associativity and way prediction to raise instruction reuse, cut the number of Flash accesses, and lower system power. Depending on the application scenario, the two schemes can be switched statically through a register or adaptively at run time through the software flow to obtain the best read speedup. Results on several benchmark programs demonstrate the feasibility and efficiency of the proposed on-chip Flash acceleration controller in performance and power optimization.

9.
A Web Cache Replacement Algorithm Based on Limited-History Multiple LRU   Cited by: 3 (self-citations: 0, others: 3)
Cache content replacement is the core of Web caching. For dynamic and uncertain network environments, this paper proposes LH-MLRU, a limited-history multi-LRU Web cache replacement algorithm that is low-overhead, high-performance, and adaptive. LH-MLRU considers multiple factors and manages Web objects with several LRU queues, introducing an object's recent access history as a key factor in replacement to predict the probability that the object will be accessed again. Periodically retraining the parameters allows the algorithm to adapt to dynamic, uncertain network conditions. Trace-driven simulations show that LH-MLRU outperforms other algorithms on all metrics and significantly improves Web cache performance.
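A compact sketch of the multi-queue idea behind LH-MLRU is shown below; the number of queues, the promotion rule, and the capacity are illustrative, not the paper's tuned parameters.

```python
# Objects are kept in several LRU queues; an object whose limited access history shows
# re-reference is promoted to a higher queue, and victims are taken from the lowest
# non-empty queue, so one-touch objects are evicted before re-referenced ones.

from collections import OrderedDict

class MultiLRUCache:
    def __init__(self, capacity=4, levels=2):
        self.capacity = capacity
        self.queues = [OrderedDict() for _ in range(levels)]   # queues[0] = lowest level

    def _find(self, key):
        for lvl, q in enumerate(self.queues):
            if key in q:
                return lvl
        return None

    def access(self, key):
        lvl = self._find(key)
        if lvl is not None:                                    # hit: promote one level
            self.queues[lvl].pop(key)
            new_lvl = min(lvl + 1, len(self.queues) - 1)
            self.queues[new_lvl][key] = True
            return True
        if sum(len(q) for q in self.queues) >= self.capacity:  # miss: evict from lowest
            for q in self.queues:
                if q:
                    q.popitem(last=False)                      # LRU end of that queue
                    break
        self.queues[0][key] = True                             # new objects enter level 0
        return False

c = MultiLRUCache()
for k in ["a", "b", "a", "c", "d", "e"]:                       # "a" is re-referenced
    c.access(k)
print([list(q.keys()) for q in c.queues])                      # "a" survives in the upper queue
```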

10.
Based on the locality of memory accesses, a fast, low-power translation lookaside buffer (TLB) access mechanism built on a prediction cache is proposed. The mechanism uses single-port static RAM (SRAM) in place of the traditional content-addressable memory (CAM) structure and achieves fast fully-associative TLB access through match search; a configurable access-prediction cache between the two TLB levels dynamically predicts the access order of the second-level TLB, shortening its search-and-match latency and effectively reducing its access power. The mechanism clearly lowers the TLB miss penalty: on a first-level TLB miss, the average latency of accessing the second-level TLB is close to one clock cycle, about 20% of the original average latency, while the added area overhead is only …

11.
The L1 cache in today's high-performance processors accesses all ways of a selected set in parallel. This constitutes a major source of energy inefficiency: at most one of the N fetched blocks can be useful in an N-way set-associative cache. The other N-1 cachelines will all be tag mismatches and subsequently discarded. We propose to eliminate unnecessary associative fetches by exploiting certain software semantics in cache design, thus reducing dynamic power consumption. Specifically, we use memory region information to eliminate unnecessary fetches in the data cache, and ring level information to optimize fetches in the instruction cache. We present a design that is performance-neutral, transparent to applications, and incurs a space overhead of a mere 0.41% of the L1 cache. We show significantly reduced cache lookups with benchmarks including SPEC CPU, SPECjbb, SPECjAppServer, PARSEC, and Apache. For example, for SPEC CPU 2006, the proposed mechanism helps to reduce cache block fetches from the data and instruction caches by an average of 29% and 53% respectively, resulting in power savings of 17% and 35% in the caches, compared to the aggressively clock-gated baselines.

12.
A Low-Power Instruction Cache Scheme Based on a Record Buffer   Cited by: 1 (self-citations: 1, others: 1)
Most modern microprocessors use on-chip caches to bridge the huge speed gap between main memory and the central processing unit (CPU), but the cache has also become a major source of processor power, most of which comes from the instruction cache. A buffer can filter out most instruction-cache accesses and thus reduce power, but a considerable number of unnecessary accesses to the storage banks remain. This paper therefore proposes RBC, a low-power instruction cache organization based on a record buffer. With the record buffer and a modified bank organization, RBC filters out most unnecessary bank accesses and effectively reduces cache power. Simulations on ten SPEC2000 benchmarks show that, compared with a conventional buffer-based cache organization, the scheme saves 24.33% of instruction-cache power at the cost of only 6.01% of processor performance and 3.75% of area.

13.
The power consumed by memory systems accounts for 45% of the total power consumed by an embedded system, and the power consumed during a memory access is 10 times higher than during a cache access. Thus, increasing the cache hit rate can effectively reduce the power consumption of the memory system and improve system performance. In this study, we increased the cache hit rate and reduced the cache-access power consumption by developing a new cache architecture known as a single linked cache (SLC) that stores frequently executed instructions. By adding a new link field, SLC has the low power consumption and low access delay of a direct-mapped cache and a cache hit rate close to that of a two-way set-associative cache. In addition, we developed another design known as multiple linked caches (MLC) to further reduce the power consumption during each cache access and avoid unnecessary cache accesses when the requested data is absent from the cache. In MLC, the linked cache is split into several small linked caches that store frequently executed instructions to reduce the power consumption during each access. To avoid unnecessary cache accesses when a requested instruction is not in the linked caches, the addresses of the frequently executed blocks are recorded in the branch target buffer (BTB). By consulting the BTB, a processor can access the memory to obtain the requested instruction directly if the instruction is not in the cache. In the simulation results, our method performed better than selective compression, traditional cache, and filter cache in terms of the cache hit rate, power consumption, and execution time.
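A simplified sketch of the link-field lookup described above; the indexing, the victim choice, and the single-link layout are assumptions made for illustration, not the SLC hardware.

```python
# Each direct-mapped entry carries an optional link to one alternate entry, so a lookup
# probes at most two locations - keeping the access path of a direct-mapped cache while
# approaching two-way hit rates for conflicting blocks.

class LinkedCacheEntry:
    def __init__(self):
        self.tag = None
        self.link = None    # index of an alternate entry that may hold a conflicting tag

class SingleLinkedCache:
    def __init__(self, sets=8):
        self.entries = [LinkedCacheEntry() for _ in range(sets)]
        self.sets = sets

    def lookup(self, addr):
        idx, tag = addr % self.sets, addr // self.sets
        e = self.entries[idx]
        if e.tag == tag:
            return True                         # hit in the primary location
        if e.link is not None and self.entries[e.link].tag == tag:
            return True                         # hit by following the link field
        # Miss: fill the primary slot first; on a conflict, use an alternate slot and
        # remember it in the link field (victim choice here is purely illustrative).
        if e.tag is None:
            e.tag = tag
        else:
            victim = (idx + 1) % self.sets
            self.entries[victim].tag = tag
            e.link = victim
        return False

c = SingleLinkedCache()
print([c.lookup(a) for a in (3, 11, 3, 11)])    # 3 and 11 conflict on index 3
# -> [False, False, True, True]: both conflicting blocks hit once the link is set
```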

14.
The use of caches in embedded processors greatly improves performance, but caches, and especially the instruction cache, account for a large share of processor power; shutting off unnecessary accesses to the tag SRAM and data SRAM can greatly reduce it. This paper proposes a pipelined instruction-cache access mechanism that disables unnecessary data-SRAM accesses, and forms a sliding window of cache lines by recording information about the current instruction-cache line and predicting the next line, thereby disabling unnecessary tag-SRAM accesses. The proposed method incurs no performance loss; power analysis in a SMIC 90 nm process shows that instruction-fetch power is reduced by 50%.

15.
倪亚路  周晓方 《计算机工程》2011,37(22):231-233
Combining the advantages of utility-based optimal partitioning of a shared cache and the conventional LRU policy, a new dynamic shared-cache partitioning method is proposed. The method eliminates interference between threads in the shared cache, and when all co-running programs are sensitive to the number of ways they occupy in the shared cache, it avoids the performance loss incurred by pure utility-based partitioning. SPEC CPU2000 tests show that, compared with conventional LRU and utility-based optimal partitioning, the method improves overall system performance by 20.28% and 14.37% on average, respectively.
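A toy sketch of the hybrid partitioning idea follows; the sensitivity test, the threshold, and the greedy allocator are illustrative assumptions rather than the paper's algorithm.

```python
# Ways are assigned greedily by marginal utility (estimated hits per extra way), but when
# every co-running thread looks way-sensitive the cache simply falls back to unpartitioned
# LRU sharing instead of forcing a partition.

def partition_ways(hit_curves, total_ways, sensitivity_threshold=0.05):
    """hit_curves[t][w] = estimated hits for thread t when given w ways (w = 0..total_ways)."""
    n = len(hit_curves)

    def way_sensitive(curve):
        # A thread is "way sensitive" if it keeps gaining hits up to the full cache.
        return (curve[total_ways] - curve[total_ways // 2]) / max(curve[total_ways], 1) > sensitivity_threshold

    if all(way_sensitive(c) for c in hit_curves):
        return None                                # fall back to plain shared LRU

    alloc = [0] * n
    for _ in range(total_ways):                    # greedy: give each way to the thread
        gains = [hit_curves[t][alloc[t] + 1] - hit_curves[t][alloc[t]] for t in range(n)]
        winner = max(range(n), key=lambda t: gains[t])
        alloc[winner] += 1
    return alloc

# Thread 0 saturates after 2 ways, thread 1 keeps benefiting: partitioning applies.
curves = [[0, 80, 100, 101, 102, 102, 102, 103, 103],
          [0, 20, 40, 60, 80, 100, 120, 140, 160]]
print(partition_ways(curves, total_ways=8))        # -> [2, 6]
```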

16.
On-chip instruction cache is a potentially power-hungry component in embedded systems due to its large chip area and high access frequency. Aiming at reducing power consumption of the on-chip cache, we propose a Reduced One-Bit Tag Instruction Cache (ROBTIC), where the cache size is judiciously reduced and the cache tag field only contains the least significant bit of the full tag. We develop a cache operational control scheme for ROBTIC so that with the one-bit cache tag, the program locality can still be efficiently exploited. For applications where most of the memory accesses are localized, our cache can achieve similar performance to a traditional full-tag cache; however, the power consumption of the cache can be significantly reduced due to the much smaller cache size, narrower tag array (just one bit), and smaller tag comparison circuit. Experiments on a set of benchmarks implemented in a CMOS 180 nm process technology demonstrate that our proposed design can reduce up to 27.3% of the dynamic power consumption and 30.9% of the area of the traditional cache when the cache size is fixed at 32 instructions, which outperforms the existing partial-tag based cache design. With cache size customization, a further 47.8% power saving can be achieved. Our experimental results also show that when implemented in deep sub-micron technologies where the leakage power is not ignorable, our design remains efficient: a consistent power saving trend (about 22%) is observed for technologies from 130 nm down to 65 nm.
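A heavily simplified, assumed model of a one-bit-tag lookup in the spirit of ROBTIC; the two-region window and the flush-on-exit control are assumptions made so the sketch stays correct, and the paper's actual operational control scheme is more involved.

```python
# Blocks store only the least-significant tag bit, which is enough to tell apart two
# adjacent tag regions; when fetch moves outside that two-region window, the controller
# flushes the cache so the one-bit tags can never produce a false hit.

class OneBitTagCache:
    def __init__(self, lines=32):
        self.lines = lines
        self.data = [None] * lines      # stored one-bit tag per line; None = invalid
        self.window_base = None         # even full tag anchoring the current 2-tag window

    def fetch(self, addr):
        index, full_tag = addr % self.lines, addr // self.lines
        window = full_tag - (full_tag & 1)          # even base of the pair {tag, tag+1}
        if window != self.window_base:              # left the current region pair
            self.data = [None] * self.lines         # flush: one-bit tags are now ambiguous
            self.window_base = window
        hit = self.data[index] == (full_tag & 1)
        if not hit:
            self.data[index] = full_tag & 1         # refill with the one-bit tag only
        return hit

c = OneBitTagCache(lines=4)
print([c.fetch(a) for a in (0, 1, 0, 1, 40, 0)])
# -> [False, False, True, True, False, False]: hits within the window, flush on leaving it
```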

17.
The instruction fetch unit (IFU) usually dissipates a considerable portion of total chip power. In traditional IFU architectures, as soon as the fetch address is generated, it needs to be sent to the instruction cache and TLB arrays for instruction fetch. Since limited work can be done by the power-saving logic after the fetch address generation and before the instruction fetch, previous power-saving approaches usually suffer from unnecessary restrictions imposed by traditional IFU architectures. In this paper, we present CASA, a new power-aware IFU architecture, which effectively reduces the unnecessary restrictions on the power-saving approaches and provides sufficient time and information for the power-saving logic of both the instruction cache and the TLB. By analyzing, recording, and utilizing the key information of the dynamic instruction flow early in the front-end pipeline, CASA brings the opportunity to maximize power efficiency and minimize performance overhead. Compared to the baseline configuration, the leakage and dynamic power of the instruction cache is reduced by 89.7% and 64.1% respectively, and the dynamic power of the instruction TLB is reduced by 90.2%. Meanwhile, the performance degradation in the worst case is only 0.63%. Compared to previous state-of-the-art power-saving approaches, the CASA-based approach saves IFU power more effectively, incurs less performance overhead, and achieves better scalability. It is promising that CASA can stimulate further work on architectural solutions to power-efficient IFU designs.

18.
Modern Internet routers have to handle a large number of packet classification rules, which requires classification schemes to be scalable both in time and space. In this paper, we present a scalable packet classification algorithm that is developed by combining two new concepts with the well-known bit vector (BV) scheme. We propose a range search method based on a cache-aware tree (CATree) which makes full use of the processor's cache line to reduce the number of dynamic random access memory (DRAM) accesses. Theoretically, the number of DRAM accesses of CATree is about log(m+1) times lower than that of the widely used binary search algorithm, where m is the number of keys in a single cache line. Based on our computational results on a set of 1024 keys, the CATree algorithm is 36% faster than the binary search algorithm, and the performance is better when applied to a larger set of keys. In addition, we develop a rule re-arrangement algorithm to reduce the bitmap space of BV. With this re-arrangement, the rules for the same action may be assigned an identical priority. This reduces the number of priorities as well as the memory space of the bitmap. Furthermore, this also reduces the number of memory accesses and hence increases the CPU cache utilization. With CATree and rule re-arrangement, the cache-aware bit vector with rule re-arrangement algorithm achieves better performance in comparison with the regular BV scheme, both in space and time. In our experiments, the proposed algorithm reduces the bitmap memory space of a practical set of firewall rules by two orders of magnitude and is 91% faster than the regular BV.
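An illustrative sketch of the cache-aware range-search idea; the node layout, fan-out, and construction are simplified assumptions, not the paper's CATree data structure.

```python
# Each node packs m sorted keys into one cache-line-sized block, so a lookup touches
# roughly log_{m+1}(N) nodes (one line fetch per level) instead of the log_2(N) scattered
# probes of a plain binary search over the key array.

import bisect

class CacheAwareTree:
    def __init__(self, keys, m=15):               # 15 keys ~ one 64-byte line of 32-bit keys
        self.m = m
        self.root = self._build(sorted(keys))

    def _build(self, keys):
        if not keys:
            return None
        if len(keys) <= self.m:
            return {"keys": list(keys), "children": []}
        # Split into m+1 nearly equal child ranges with m separator keys between them.
        n = len(keys)
        bounds = [round(j * n / (self.m + 1)) for j in range(self.m + 2)]
        separators, children = [], []
        for j in range(self.m + 1):
            lo, hi = bounds[j], bounds[j + 1]
            if j < self.m:
                children.append(self._build(keys[lo:hi - 1]))
                separators.append(keys[hi - 1])
            else:
                children.append(self._build(keys[lo:hi]))
        return {"keys": separators, "children": children}

    def contains(self, key):
        node = self.root
        while node is not None:
            pos = bisect.bisect_left(node["keys"], key)   # search within one "cache line"
            if pos < len(node["keys"]) and node["keys"][pos] == key:
                return True
            if not node["children"]:
                return False
            node = node["children"][pos]                  # descend: one more line fetch
        return False

t = CacheAwareTree(range(0, 1024))
print(t.contains(513), t.contains(2048))                  # -> True False
```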

19.
Wide-issue and high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes on how to direct the memory references have been proposed in order to achieve a close match to the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration, and thus suffer from access conflicts within one cache and unbalanced loads between the caches. It is observed in this paper that if one can group data references defined in a program into several regions (access regions) to allow parallel accesses, providing separate small caches - access region caches - for these regions may prove to give better performance. A register-guided memory reference partitioning approach is proposed; it effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. With the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when the program is moving across its access regions. In addition, a distribution mechanism is introduced to adaptively enable access regions to extend or shrink among the physical caches to further reduce potential conflicts. The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25-33% higher IPC is achieved for integer benchmark programs than with a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most.

20.
The success of large-scale, hierarchical and distributed shared memory systems hinges on our ability to reduce delays resulting from remote accesses to shared data. To facilitate this, we present a compile-time algorithm for analyzing programs with doall-style parallelism to determine when read and write accesses to shared data are redundant (unnecessary). Once identified, redundant remote accesses can be replaced by local accesses or eliminated entirely. This optimization improves program performance in two ways. First, slow memory accesses are replaced by faster ones. Second, the time to perform other remote memory accesses may be reduced as a result of the decreased traffic level. We also show how the information obtained through redundancy analysis can be used for other compiler optimizations such as prefetching and cache management.
