Similar Documents
20 similar documents found.
1.
Finding new memory technologies to replace DRAM is an active research topic. Phase-change memory (PCM) has attracted wide attention for its low power consumption, high storage density, and non-volatility, but PCM endures only a limited number of writes, so using it as main memory requires reducing write traffic to it. One effective approach is to optimize the cache replacement policy so that fewer dirty blocks are evicted from the cache. Existing work protects dirty blocks mainly by giving them higher priority on insertion and on hits, but it no longer distinguishes dirty from clean blocks during demotion, so the cache may still evict a dirty block even when many clean blocks are available. This paper proposes a new cache replacement policy, MAC, which uses a multi-level structure to draw a hard boundary between dirty and clean blocks, giving dirty blocks stronger protection. Simulations show that, compared with LRU, MAC reduces memory writes by about 25.12% on average at low hardware cost, with almost no impact on program performance.
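The core idea, putting dirty blocks behind a boundary that clean blocks cannot cross, can be illustrated with a small sketch. This is a hypothetical single-set model, not the paper's MAC hardware: it keeps LRU order but always evicts the least recently used clean block first, touching a dirty block only when the whole set is dirty.

```python
from collections import OrderedDict

class DirtyAwareLRUSet:
    """One cache set; dirty blocks are evicted only if no clean block exists."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.blocks = OrderedDict()          # tag -> dirty flag; order = LRU..MRU

    def access(self, tag, is_write):
        if tag in self.blocks:
            dirty = self.blocks.pop(tag) or is_write
            self.blocks[tag] = dirty         # hit: move to MRU position
            return None
        evicted = None
        if len(self.blocks) >= self.num_ways:
            # Prefer the LRU clean block; fall back to the LRU dirty block.
            victim = next((t for t, d in self.blocks.items() if not d),
                          next(iter(self.blocks)))
            evicted = (victim, self.blocks.pop(victim))
        self.blocks[tag] = is_write
        return evicted                       # a dirty eviction costs a PCM write

s = DirtyAwareLRUSet(2)
s.access("A", True)                          # dirty block
s.access("B", False)                         # clean block
print(s.access("C", False))                  # evicts ("B", False)
```

Even though "A" is the least recently used block, the clean "B" is chosen as the victim; this is exactly the kind of write filtering the policy aims for.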

2.
《计算机工程》2017,(1):105-108
As bursty, concentrated access workloads on computer systems keep growing in scale, traditional cache replacement algorithms such as least recently used (LRU) and least frequently used (LFU) can no longer meet the requirements of high hit rate and low latency. Targeting the characteristics of the bursty concentrated access pattern and its effect on content popularity trends, this paper designs a replacement strategy for that pattern. The strategy periodically updates each item's replacement priority from cached metadata such as access count, access time, and predicted popularity, and it compares the factors that determine replacement priority under LRU, LFU, LIRS, and the new strategy in various scenarios. Simulation results on the SimpleScalar simulator show that the strategy outperforms traditional cache replacement policies under bursty concentrated access patterns.
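A minimal sketch of the periodic-priority idea follows; the priority formula and weights are assumptions for illustration, not the paper's. Each item tracks total accesses, last access time, and a per-window count whose change across windows approximates the popularity trend.

```python
class BurstAwareCache:
    """Toy cache: replacement priority blends frequency, recency, and trend."""

    def __init__(self, capacity, period=100):
        self.capacity, self.period = capacity, period
        self.clock = 0
        # key -> [total_count, last_time, current_window_count, prev_window_count]
        self.meta = {}

    def _priority(self, m):
        total, last, win, prev = m
        trend = win - prev                   # rising popularity protects the item
        recency = self.clock - last
        return total + 2 * trend - 0.01 * recency    # higher = keep longer

    def access(self, key):
        self.clock += 1
        m = self.meta.setdefault(key, [0, self.clock, 0, 0])
        m[0] += 1; m[1] = self.clock; m[2] += 1
        if len(self.meta) > self.capacity:   # evict the lowest-priority item
            victim = min(self.meta, key=lambda k: self._priority(self.meta[k]))
            del self.meta[victim]
        if self.clock % self.period == 0:    # periodic update: roll the window
            for v in self.meta.values():
                v[3], v[2] = v[2], 0
```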

3.
To provide fast data access, multicore processors often allocate L2 cache resources through a cache partitioning mechanism, but most traditional shared-cache partitioning algorithms target multiprogrammed workloads and ignore the different access patterns of shared and private data in multithreaded workloads, which lowers the efficiency of shared data. This paper proposes UPP, a cache management mechanism for multithreaded programs. It monitors the utility of shared and private data in the cache and allocates cache space to each thread and to the shared data so that the marginal utility of each is maximized, improving the overall performance of the workload. UPP also exploits the usage frequency and locality information of the program's data, filtering out low-reuse data through promotion and dynamic insertion policies so that frequently used blocks stay in the cache. Experiments show that its performance exceeds both a pure shared cache with LRU and fairness-based static cache partitioning.
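The marginal-utility allocation can be sketched as a greedy loop over per-partition hit curves (this is the standard utility-partitioning formulation; the curves and partition set here are made up for illustration):

```python
def partition_ways(hit_curves, total_ways):
    """hit_curves[p][w] = hits partition p would get with w ways."""
    alloc = [1] * len(hit_curves)            # every partition starts with one way
    for _ in range(total_ways - len(hit_curves)):
        def gain(p):                         # marginal utility of one more way
            w = alloc[p]
            if w + 1 < len(hit_curves[p]):
                return hit_curves[p][w + 1] - hit_curves[p][w]
            return 0
        alloc[max(range(len(hit_curves)), key=gain)] += 1
    return alloc

# Example: two threads plus a shared-data partition competing for 8 ways.
curves = [
    [0, 50, 90, 120, 140, 150, 155, 158, 160],   # thread 0
    [0, 30, 55, 70, 80, 85, 88, 90, 91],         # thread 1
    [0, 80, 100, 110, 115, 118, 120, 121, 122],  # shared data
]
print(partition_ways(curves, 8))             # [4, 2, 2]
```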

4.
This paper analyzes instruction cache replacement policies and organization through theoretical analysis and program simulation, studying them with a loop model of program behavior. It concludes that random replacement outperforms LRU and FIFO, and that under certain conditions direct-mapped and set-associative organizations outperform fully associative mapping. Analysis of instruction trace simulation results shows that the loop model is a good explanation of cache behavior.
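The loop-pattern result is easy to reproduce with a toy simulation: cycling through one more block than the cache holds makes LRU (and likewise FIFO) miss on every access after warm-up, while random replacement keeps part of the loop resident. A quick, admittedly simplified check:

```python
import random

def hit_rate(policy, capacity=8, loop_len=9, rounds=2000):
    cache, hits, refs = [], 0, 0
    for _ in range(rounds):
        for addr in range(loop_len):         # the loop pattern: 0,1,...,8 repeated
            refs += 1
            if addr in cache:
                hits += 1
                if policy == "lru":
                    cache.remove(addr); cache.append(addr)       # move to MRU
            else:
                if len(cache) >= capacity:
                    if policy == "lru":
                        cache.pop(0)                             # evict LRU block
                    else:
                        cache.pop(random.randrange(len(cache)))  # random victim
                cache.append(addr)
    return hits / refs

random.seed(0)
print("LRU:", hit_rate("lru"))        # ~0.0: every access misses
print("random:", hit_rate("random"))  # clearly above zero
```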

5.
Modern CMP processors usually design the shared last-level cache around the LRU replacement policy or one of its approximations. However, as LLC capacity and associativity grow, the performance gap between LRU and the theoretically optimal replacement algorithm keeps widening. Many cache management policies have been proposed to address this problem, but most of them target only a single type of memory access and pay little attention to cache access frequency, so their performance gains are quite limited...

6.
If the data the CPU is currently accessing resides in the cache, the known memory address must be translated into a cache address; how the translation works depends on the mapping scheme between memory and cache. For every mapping scheme, the key to the relationship between the two addresses is understanding the address structures of the cache and of memory.
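A worked example of those address structures, for a direct-mapped cache (the sizes are assumptions: 32-bit addresses, a 32 KiB cache, 64-byte blocks). The memory address splits into tag | index | offset; the index and offset locate the line and byte inside the cache, and the stored tag decides whether it is a hit:

```python
BLOCK_BITS = 6                            # 64-byte blocks  -> 6 offset bits
NUM_LINES  = 32 * 1024 // 64              # 512 lines       -> 9 index bits
INDEX_BITS = NUM_LINES.bit_length() - 1

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index  = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag    = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(f"tag={tag:#x} index={index} offset={offset}")
# The cache-side address is (index, offset); the access hits only when the
# tag stored at that line equals `tag`.
```

For set-associative mapping the index selects a set instead of a single line, and fully associative mapping drops the index field entirely.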

7.
An Efficient Streaming Media Proxy Cache Replacement Algorithm
王小燕 《计算机工程》2009,35(14):72-74
This paper proposes SCU-PFUT, a smallest-cache-utility replacement algorithm based on popularity and future access counts. It takes factors such as the byte utility of streaming media files and file block size into account, so the data blocks evicted from memory are chosen more sensibly, and it avoids the problem in LRU and LFU of media files being evicted back to back. Compared with LRU, LFU, and SCU-2, the algorithm achieves a higher cache hit rate, byte hit rate, and space utilization.
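A hedged sketch of the smallest-utility eviction: the utility formula below (popularity times predicted future uses, divided by segment size) is an illustrative stand-in for the paper's definition, but it shows how large, cold media segments get evicted first:

```python
def evict_candidate(segments):
    """segments: {segment_id: (popularity, predicted_future_uses, size_bytes)}"""
    def utility(seg_id):
        pop, future, size = segments[seg_id]
        return pop * future / size           # low utility = evict first
    return min(segments, key=utility)

segs = {"movie_a#3": (0.9, 12, 2 << 20),     # hot, small segment
        "movie_b#1": (0.2, 1,  8 << 20),     # cold, large segment
        "clip_c#0":  (0.7, 5,  1 << 20)}
print(evict_candidate(segs))                 # movie_b#1
```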

8.
Traditional cache replacement policies, such as the widely used LRU algorithm, cannot effectively exploit the reuse in streaming data when a program's working set exceeds cache capacity, leading to poor cache performance. This paper proposes SAGA, a stream-characteristic-guided cache allocation policy. It uses a stream detection engine to discover streaming behavior in the program and then, on each cache miss, dynamically decides whether to allocate a cache block for the missing data, ultimately improving data cache performance. Experiments show that on the SPEC2000FP benchmarks with a 1 MB cache, SAGA reduces cache misses by 31% on average and program CPI by 4% compared with LRU.
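The allocate-or-bypass decision can be sketched with a tiny stride-based stream detector (illustrative only, not the paper's engine): once an address sequence shows a stable stride, further misses from it bypass the cache instead of allocating blocks.

```python
class StreamDetector:
    def __init__(self):
        self.last_addr = None
        self.last_stride = None
        self.confidence = 0

    def observe(self, addr):
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                self.confidence = min(self.confidence + 1, 4)
            else:
                self.confidence = 0
            self.last_stride = stride
        self.last_addr = addr
        return self.confidence >= 2          # True -> looks like a stream

det = StreamDetector()
for addr in range(0, 64 * 8, 64):            # unit-stride stream of block addresses
    decision = "bypass" if det.observe(addr) else "allocate"
    print(hex(addr), decision)               # first few allocate, rest bypass
```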

9.
As feature sizes shrink, leakage power is becoming one of the main constraints on microprocessor design. Sleep Cache and Drowsy Cache are two important techniques for reducing cache leakage power. SB-CLPE, a statistics-based cache leakage power estimation method, estimates the leakage power of a Sleep Cache or Drowsy Cache; a cache architecture built on it can estimate leakage power in real time while a program runs. By collecting statistics on the access intervals of all cache blocks, SB-CLPE can estimate the cache's leakage power under different decay intervals and thus find the decay interval that minimizes leakage. Experiments show that for Sleep Cache, leakage estimates from SB-CLPE deviate from results obtained with the HotLeakage simulator by only 3.16% on average, and the derived optimal decay intervals also match well. A cache architecture using SB-CLPE can estimate the optimal decay interval in real time during execution and dynamically adjust the decay interval to achieve the best power reduction.
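The access-interval statistics translate into an energy estimate roughly as follows (a deliberately simplified model with made-up constants, not SB-CLPE itself): a line re-accessed within the decay interval D leaks for the whole gap, while a line that decays leaks only for D but pays a wake-up/refetch penalty.

```python
LEAK_PER_CYCLE = 1.0      # leakage energy of a powered-on line per cycle (a.u.)
WAKE_PENALTY   = 500.0    # energy cost of re-fetching a decayed line (a.u.)

def leakage_energy(intervals, decay):
    energy = 0.0
    for gap in intervals:                    # gap = cycles between two accesses
        if gap <= decay:
            energy += gap * LEAK_PER_CYCLE                   # line stayed on
        else:
            energy += decay * LEAK_PER_CYCLE + WAKE_PENALTY  # decayed, refetched
    return energy

intervals = [10, 50, 80, 120, 4000, 9000]    # sample access-gap statistics
candidates = (512, 1024, 2048, 4096, 8192)
print("best decay interval:",
      min(candidates, key=lambda d: leakage_energy(intervals, d)))
```

Evaluating every candidate decay interval over one set of interval statistics is what lets the estimate run online, without re-simulating the program for each interval.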

10.
This paper proposes a static cache hint generation method based on profiling with optimal cache replacement (OPT), and compares its performance against the LRU replacement policy by simulating the SPEC2000Int benchmarks.
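For reference, OPT (Belady's algorithm), which the profiling pass relies on, evicts the block whose next use lies farthest in the future. A minimal fully associative sketch:

```python
def opt_simulate(trace, capacity):
    misses, cache = 0, set()
    for i, addr in enumerate(trace):
        if addr in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(a):                 # position of the next reference of a
                try:
                    return trace.index(a, i + 1)
                except ValueError:
                    return float("inf")      # never used again: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return misses

print(opt_simulate(list("ABCABDAEAB"), 3))   # 5 misses
```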

11.
Billion-transistor processors will be much as they are today, just bigger, faster and wider (issuing more instructions at once). The authors describe the key problems (instruction supply, data memory supply and an implementable execution core) that prevent current superscalar computers from scaling up to 16 or 32 instructions per issue. They propose using out-of-order fetching, multi-hybrid branch predictors and trace caches to improve the instruction supply. They predict that replicated first-level caches, huge on-chip caches and data value speculation will enhance the data supply. To provide a high-speed, implementable execution core that is capable of sustaining the necessary instruction throughput, they advocate a large, out-of-order-issue instruction window (2,000 instructions), clustered (separated) banks of functional units and hierarchical scheduling of ready instructions. They contend that the current uniprocessor model can provide sufficient performance and use a billion transistors effectively without changing the programming model or discarding software compatibility.

12.
The memory-wall problem keeps cache research important. With on-chip cache capacities growing steadily, wire delay has become a major constraint on cache design. To provide a uniform access latency, traditional cache designs must settle for the access time of the cache bank farthest from the processor. Researchers therefore proposed the Non-Uniform Cache Architecture (NUCA), which has become almost a trend for large-capacity cache design in future processors. When the processor accesses a NUCA, a hit in a bank close to the processor means a short wait, while a hit in a distant bank means a long one. This paper surveys the motivation for and development of NUCA technology and the most representative current NUCA systems; it points out two multiprocessor memory-system techniques, NUMA and COMA, that NUCA research can draw on; and finally it identifies the key problems in NUCA research along with possible solutions.

13.
The Journal of Supercomputing - Spin-transfer torque random access memory (STT-RAM) is a suitable alternative to DRAM in the large last-level caches (L3Cs) on account of low leakage, the absence of...

14.
Wire delays and leakage energy consumption are both growing problems in the design of large on-chip caches built in deep submicron technologies. D-NUCA caches (Dynamic Non-Uniform Cache Architecture) exploit an aggressive subbanking of the cache and a migration mechanism that speeds up access to frequently accessed data, to limit the effect of wire delays on performance. Way Adaptable D-NUCA is a leakage power reduction technique specifically suited for D-NUCA caches. It dynamically varies the portion of the powered-on cache area based on the caching needs of the running workload, but it relies on application-dependent parameters that must be evaluated off-line. This limits the effectiveness of Way Adaptable D-NUCA in the general-purpose, multiprogrammed environment. In this paper, we propose a new power reduction technique for D-NUCA caches, which still adapts the powered-on cache area to the needs of the running workload but does not rely on application-dependent parameters. Results show that our proposal saves around 49% of total cache energy consumption in a single-core environment and 44% in a CMP environment. By adding a timer, it performs similarly to previously proposed leakage power reduction techniques, and outperforms them when they are applied in a workload-independent manner.

15.
This paper proposes a novel leakage management technique for applications with producer-consumer sharing patterns. Although previous research has proposed leakage management techniques by turning off inactive cache blocks, these techniques can be further improved by exploiting the various run-time characteristics of target applications in CMPs. By exploiting particular access sequences observed in producer-consumer sharing patterns and the spatial locality of shared buffers, our technique enables a more aggressive turn-off of L2 cache blocks of these buffers. Experimental results using a CMP simulator show that our proposed technique reduces the energy consumption of on-chip L2 caches, a shared bus, and off-chip memory by up to 31.3% over the existing cache leakage power management techniques with no significant performance loss.

16.
Scalable shared-memory multiprocessors are often slowed down by long-latency memory accesses. One way to cope with this problem is to use data forwarding to overlap memory accesses with computation. With data forwarding, when a processor produces a datum, in addition to updating its cache, it sends a copy of the datum to the caches of the processors that the compiler identified as consumers of it. As a result, when the consumer processors access the datum, they find it in their caches. This paper addresses two main issues. First, it presents a framework for a compiler algorithm for forwarding. Second, using address traces, it evaluates the performance impact of different levels of support for forwarding. Our simulations of a 32-processor machine show that an optimistic support for forwarding speeds up five applications by an average of 50% for large caches and 30% for small caches. For large caches, most sharing read misses are eliminated, while for small caches, forwarding does not increase the number of conflict misses significantly. Overall, support for forwarding in shared-memory multiprocessors promises to deliver good application speedups

17.
The growing influence of wire delay in cache design has meant that access latencies to last-level cache banks are no longer constant. Non-Uniform Cache Architectures (NUCAs) have been proposed to address this problem. Furthermore, an efficient last-level cache is crucial in chip multiprocessors (CMP) architectures to reduce requests to the offchip memory, because of the significant speed gap between processor and memory. Therefore, a bank replacement policy that efficiently manages the NUCA cache is desirable. However, the decentralized nature of NUCA has eliminated the effectiveness of replacement policies because banks operate independently of each other, and hence their replacement decisions are restricted to a single NUCA bank. In this paper, we propose three different techniques to deal with replacements in NUCA caches.

18.
Improving memory energy consumption of programs that manipulate arrays is an important problem as these codes spend large amounts of energy in accessing off-chip memory. We propose a data-driven strategy to optimize the memory energy consumption in a banked memory system. Our compiler-based strategy modifies the original execution order of loop iterations in array-dominated applications to increase the length of the time period(s) in which memory banks are idle (i.e., not accessed by any loop iteration). To achieve this, it first classifies loop iterations according to their bank accesses patterns and then, with the help of a polyhedral tool, tries to bring the iterations with similar bank access patterns close together. Increasing the idle periods of memory banks brings two major benefits: first, it allows us to place more memory banks into low-power operating modes and, second, it enables us to use a more aggressive (i.e., more energy saving) operating mode (hence, saving more energy) for a given bank (instead of a less aggressive mode). The proposed strategy can reduce memory energy consumption in both sequential and parallel applications. Our strategy has been implemented in an experimental compiler using a polyhedral tool and evaluated using nine array-dominated applications on both a cacheless system and a system with cache memory. Our experimental results indicate that the proposed strategy is very successful in reducing the memory system energy and improves the memory energy by as much as 36.8 percent over a strategy that uses low-power modes without optimizing data access pattern. Our results also show that optimizations that target reducing off-chip memory energy can generate very different results from those that target at improving only cache locality.
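The classification step can be pictured with a toy version (this stands in for the compiler's polyhedral machinery; the access pattern and bank geometry are invented): iterations are keyed by the set of banks they touch, and same-pattern groups are scheduled back to back so the untouched banks idle longer.

```python
from itertools import groupby

NUM_BANKS = 4
BANK_SIZE = 1024                             # array elements per bank (assumed)

def banks_touched(i):
    # Assume iteration i reads a[i] and b[2*i]; map each index to its bank.
    return frozenset((idx // BANK_SIZE) % NUM_BANKS for idx in (i, 2 * i))

iters = sorted(range(4096), key=lambda i: sorted(banks_touched(i)))
for pattern, group in groupby(iters, key=banks_touched):
    n = len(list(group))
    print(sorted(pattern), "->", n, "iterations scheduled contiguously")
```

While a group runs, every bank outside its pattern stays idle for the whole stretch, which is what permits the deeper, more aggressive low-power modes.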

19.
FinFETs (fin field-effect transistors) are projected as a favorable option for addressing the challenges of continued scaling. As nanometer process technologies advance, chip density and operating frequency have increased, making power consumption a significant concern in battery-operated portable devices. Even for non-portable devices, power consumption matters because of higher cooling and packaging costs and potential reliability issues. The steady miniaturization of the MOSFET with each new generation of CMOS (complementary metal-oxide-semiconductor) technology increases leakage currents due to short-channel effects, and leakage accounts for a growing share of total power consumption in deep-submicron designs. Various schemes have been proposed to reduce leakage power. Further, in system-on-chip (SoC) designs, caches occupy a significant amount of area in DSP systems, leading to an increase in leakage power; cache memories are also used to store filter coefficients. Because of their multiple gates, FinFET structures have better electrostatic control over short-channel effects and thus reduce leakage power effectively at the nano regime. In this paper, a cache memory and an FIR filter are designed using FinFETs at a 22-nanometer technology node in HSPICE. The experimental outcomes show that FinFET structures control leakage better than MOSFETs and offer better performance.

20.
As future superscalar processors employ higher issue widths, an increasing number of load/store instructions must be executed each cycle to sustain high performance. Multi-bank data caches attempt to address this issue in a cost-effective way. A multi-bank cache consists of multiple cache banks that each support one load/store instruction per clock cycle. The interleaving of cache blocks across the banks is of primary importance; two common choices are block interleaving and word interleaving. Although word interleaving leads to higher IPC, it is more expensive to implement than block interleaving, since it requires the tag array of the cache to be multi-ported. By swapping the bits in the effective address that are used by word interleaving with those used by block interleaving, it is possible to implement a word-interleaved cache with the same cost, cycle time, and power consumption as a block-interleaved cache. Because this makes the L1 data cache blocks sparse, additional costs are incurred at different levels of the memory hierarchy.
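The bit swap is concrete enough to demonstrate (field widths below are assumptions: 4 banks, 64-byte blocks, 4-byte words). Word interleaving picks the bank from the low-order word-select bits, block interleaving from the low-order block-select bits; exchanging those two 2-bit fields makes a block-interleaved bank decoder produce the word-interleaved bank choice:

```python
WORD_BITS, BLOCK_BITS, BANK_BITS = 2, 6, 2   # 4-byte words, 64-byte blocks, 4 banks
MASK = (1 << BANK_BITS) - 1

def bank_word_interleaved(addr):
    return (addr >> WORD_BITS) & MASK        # bank = word-select bits

def bank_block_interleaved(addr):
    return (addr >> BLOCK_BITS) & MASK       # bank = block-select bits

def swap_fields(addr):
    lo = (addr >> WORD_BITS) & MASK          # word-select field
    hi = (addr >> BLOCK_BITS) & MASK         # block-select field
    addr &= ~((MASK << WORD_BITS) | (MASK << BLOCK_BITS))
    return addr | (hi << WORD_BITS) | (lo << BLOCK_BITS)

a = 0b110110110100
assert bank_word_interleaved(a) == bank_block_interleaved(swap_fields(a))
print("banks match:", bank_word_interleaved(a))
```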
