首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, we propose a novel integrated circuit and architectural level technique to reduce leakage power consumption in high-performance cache memories using single V/sub t/ (transistor threshold voltage) process. We utilize the concept of gated-ground (nMOS transistor inserted between ground line and SRAM cell) to achieve a reduction in leakage energy without significantly affecting performance. Experimental results on gated-ground caches show that data is retained (DRG-Cache) even if the memory is put in the standby mode of operation. Data is restored when the gated-ground transistor is turned on. Turning off the gated-ground transistor in turn gives a large reduction in leakage power. This technique requires no extra circuitry; the row decoder itself can be used to control the gated-ground transistor. The technique is applicable to data and instruction caches as well as different levels of cache hierarchy, such as the L1, L2, or L3 caches. We fabricated a test chip in TSMC 0.25-/spl mu/m technology to show the data retention capability and the cell stability of the DRG-Cache. Our simulation results on 100-nm and 70-nm processes (Berkeley Predictive Technology Model) show 16.5% and 27% reduction in consumed energy in L1 cache and 50% and 47% reduction in L2 cache, respectively, with less than 5% impact on execution time and within 4% increase in area overhead.  相似文献   

2.
LRU-Assist:一种高效的Cache漏流功耗控制算法   总被引:5,自引:4,他引:1       下载免费PDF全文
随着集成电路制造工艺进入超深亚微米阶段,漏电流功耗在微处理器总功耗中所占的比例越来越大,在开发新的低漏流工艺和电路技术之外,如何在体系结构级控制和优化漏流功耗成为业界研究的热点.Cache在微处理器中面积最大,是进行漏流控制的首要部件.LRU是组相联Cache最常用的替换算法,而研究发现,访存操作命中LRU后半区的概率很低.LRU-Assist算法以Drowsy Cache、Cache Decay等控制策略为基础,在保证处理器性能不受影响的前提下,利用既有的LRU信息把Cache的关闭率平均提高了15%,大大降低了漏电流功耗.  相似文献   

3.
This paper describes a low power Intel Architecture (IA) processor specifically designed for Mobile Internet Devices (MID) with performance similar to mainstream Ultra-Mobile PCs. The design relies on high residency in a new low-power state in order to keep average power and idle power below 220 and 80 mW, respectively. The design consists of an in-order pipeline capable of issuing 2 instructions per cycle supporting 2 threads, 32 KB instruction and 24 KB data L1 caches, independent integer and floating point execution units, times86 front end execution unit, a 512 KB L2 cache and a 533 MT/s dual-mode (GTL and CMOS) front-side-bus (FSB). The design contains 47 million transistors in a die size under 25 mm2 manufactured in a 9-metal 45 nm CMOS process with optimized transistors for low leakage. Maximum thermal design power (TDP) consumption is measured at 2 W at 1.0 V, 90degC using a synthetic power-virus test at a frequency of 1.86 GHz.  相似文献   

4.
Static energy reduction techniques for microprocessor caches   总被引:1,自引:0,他引:1  
Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consumption in on-chip level-1 and level-2 caches. One technique employs low-leakage transistors in the memory cell. Another technique, power supply switching, can be used to turn off memory cells and discard their contents. A third alternative is dynamic threshold modulation, which places memory cells in a standby state that preserves cell contents. In our experiments, we explore the energy and performance tradeoffs of these techniques. We also investigate the sensitivity of microprocessor performance and energy consumption to additional cache latency caused by leakage-reduction techniques.  相似文献   

5.
Most microprocessors employ the on-chip caches to bridge the performance gap between the processor and the main memory. However, the cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", in this paper we propose a zero-aware SRAM cell with an asymmetric inverter pair, called ZA cell, to minimize the cache power consumption in writing "0". The ZA cell uses a circuit-level technique, which is software independent and orthogonal to other low-power techniques at architecture-level. Compared to the conventional SRAM cell, the experimental results based on the SPEC2000 and MediaBench traces show that without compromise of both performance and stability, the ZA cell can reduce the average cache write power consumption over 60% for both the baseline instruction and data caches. In particular, the ZA cell is attractive in the data caches, which reveal the high write-zero rate.  相似文献   

6.
An aggressive drowsy cache block management, where the cache block is forced into drowsy mode all the time except during write and read operations, is proposed. The word line (WL) is used to enable the normal supply voltage (V DD_high) to the cache line only when it is accessed for read or write whereas the drowsy supply voltage (V DD_low) is enabled to the cache cell otherwise. The proposed block management neither needs extra cycles nor extra control signals to wake the drowsy cache cell, thereby reducing the performance penalty associated with traditional drowsy caches. In fact, the proposed aggressive drowsy mode can reduce the total power consumption of the traditional drowsy mode by 13% or even more, depending on the cache access rate, access frequency and the CMOS technology used.  相似文献   

7.
This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm2 die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect  相似文献   

8.
On-chip L1 and L2 caches represent a sizeable fraction of the total power consumption of microprocessors. In nanometer-scale technology, the subthreshold leakage power is becoming one of the dominant total power consumption components of those caches. In this study, we present optimization techniques to reduce the subthreshold leakage power of on-chip caches assuming that there are multiple threshold voltages, V/sub T/'s, available. First, we show a cache leakage optimization technique that examines the tradeoff between access time and subthreshold leakage power by assigning distinct V/sub T/'s to each of the four main cache components-address bus drivers, data bus drivers, decoders, and static random access memory (SRAM) cell arrays with sense amplifiers. Second, we show optimization techniques to reduce the leakage power of L1 and L2 on-chip caches without affecting the average memory access time. The key results are: 1) two additional high V/sub T/'s are enough to minimize leakage in a single cache-3 V/sub T/'s if we include a nominal low V/sub T/ for microprocessor core logic; 2) if L1 size is fixed, increasing L2 size can result in much lower leakage without reducing average memory access time; 3) if L2 size is fixed, reducing L1 size may result in lower leakage without loss of the average memory access time for the SPEC2K benchmarks; and 4) smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting the average memory access time.  相似文献   

9.
In this paper, we make the case for building high-performance asymmetric-cell caches (ACCs) that employ recently-proposed asymmetric SRAMs to reduce leakage proportionally to the number of resident zero bits. Because ACCs target memory value content (independent of cell activity and access patterns), they complement prior proposals for reducing cache leakage that target memory access characteristics. Through detailed simulation and leakage estimation using a commercial 0.13-$mu$m CMOS process model, we show that: 1) on average 75% of resident data cache bits and 64% of resident instruction cache bits are zero; 2) while prior research carefully evaluated the fraction of accessed zero bytes, we show that a high fraction of accessed zero bytes is neither a necessary nor a sufficient condition for a high fraction of resident zero bits; 3) the zero-bit program behavior persists even when we restrict our attention to live data, thereby complementing prior leakage-saving techniques that target inactive cells; and 4) ACCs can reduce leakage on the average by 4.3$times$compared to a conventional data cache without any performance loss, and by 9$times$at the cost of a 5% increase in overall cache access latency.  相似文献   

10.
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in return, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache  相似文献   

11.
Power consumption is an increasingly pressing problem in modern processor design. Since the on-chip caches usually consume a significant amount of power, it is one of the most attractive targets for power reduction. This paper presents a two-level filter scheme, which consists of the L1 and L2 filters, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial unnecessary activities in conventional cache architecture. We use a single block buffer as the L1 filter to eliminate the unnecessary cache accesses. In the L2 filter, we then propose a new sentry-tag architecture to further filter out the unnecessary way activities in case of the L1 filter miss. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform the HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme can effectively reduce the cache power consumption by eliminating most unnecessary cache activities, while the compromise of system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the use of a two-level filter can result in roughly 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%.  相似文献   

12.
Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture.  相似文献   

13.
Tuning a configurable cache subsystem to an application can greatly reduce memory hierarchy energy consumption. Previous tuning methods use a level one configurable cache only, or a second level with separate instruction and data configurable caches. We instead use a commercially-common unified second level cache, a seemingly minor difference that actually expands the configuration space from 500 to about 20$thinspace$000. We develop additive way tuning for tuning a cache subsystem with this large space, yielding 61% energy savings and 9% performance improvements over a nonconfigurable cache, greatly outperforming an extension of a previous method.   相似文献   

14.
Soft errors issues in low-power caches   总被引:1,自引:0,他引:1  
As technology scales, reducing leakage power and improving reliability of data stored in memory cells is both important and challenging. While lower threshold voltages increase leakage, lower supply voltages and smaller nodal capacitances reduce energy consumption but increase soft errors rates. In this work, we present a comprehensive study of soft error rates on low-power cache design. First, we study the effect of circuit level techniques, used to reduce the leakage energy consumption, on soft error rates. Our results using custom designs show that many of these approaches may increase the soft error rates as compared to a standard 6T SRAM. We also validate the effects of voltage scaling on soft error rate by performing accelerated tests on off-the-shelf SRAM-based chips using a neutron beam source. Next, we study the impact of cache decay and drowsy cache, which are two commonly used architectural-level leakage reduction approaches, on the cache reliability. Our results indicate that the leakage optimization techniques change the reliability of cache memory. More importantly, we demonstrate that there is a tradeoff between optimizing for leakage power and improving the immunity to soft error. We also study the impact of error correcting codes on soft error rates. Based on this study, we propose an adaptive error correcting scheme to reduce the leakage energy consumption and improve reliability.  相似文献   

15.
Data stability of SRAM cells has become an important issue with the scaling of CMOS technology. Memory banks are also important sources of leakage since the majority of transistors are utilized for on-chip caches in today's high performance microprocessors. A new nine-transistor (9T) SRAM cell is proposed in this paper for simultaneously reducing leakage power and enhancing data stability. The proposed 9T SRAM cell completely isolates the data from the bit lines during a read operation. The read static-noise-margin of the proposed circuit is thereby enhanced by 2 X as compared to a conventional six-transistor (6T) SRAM cell. The idle 9T SRAM cells are placed into a super cutoff sleep mode, thereby reducing the leakage power consumption by 22.9% as compared to the standard 6T SRAM cells in a 65-nm CMOS technology. The leakage power reduction and read stability enhancement provided with the new circuit technique are also verified under process parameter variations.  相似文献   

16.
Deep-submicron CMOS designs maintain high transistor switching speeds by scaling down the supply voltage and proportionately reducing the transistor threshold voltage. Lowering the threshold voltage increases leakage energy dissipation due to subthreshold leakage current even when the transistor is not switching. Estimates suggest a five-fold increase in leakage energy in every future generation. In modern microarchitectures, much of the leakage energy is dissipated in large on-chip cache memory structures with high transistor densities. While cache utilization varies both within and across applications, modern cache designs are fixed in size resulting in transistor leakage inefficiencies. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy in instruction caches (i-caches). At the architecture level, we propose the Dynamically ResIzable i-cache (DRI i cache), a novel i-cache design that dynamically resizes and adapts to an application's required size. At the circuit-level, we use gated-Vdd, a novel mechanism that effectively turns off the supply voltage to, and eliminates leakage in, the SRAM cells in a DRI i-cache's unused sections. Architectural and circuit-level simulation results indicate that a DRI i-cache successfully and robustly exploits the cache size variability both within and across applications. Compared to a conventional i-cache using an aggressively-scaled threshold voltage a 64 K DRI i-cache reduces on average both the leakage energy-delay product and cache size by 62%, with less than 4% impact on execution time. Our results also indicate that a wide NMOS dual-Vt gated-Vdd transistor with a charge pump offers the best gating implementation and virtually eliminates leakage energy with minimal increase in an SRAM cell read time area as compared to an i-cache with an aggressively-scaled threshold voltage  相似文献   

17.
一种低功耗Cache设计技术的研究   总被引:2,自引:0,他引:2  
低功耗、高性能的cache系统设计是嵌入式DSP芯片设计的关键。本文在多媒体处理DSP芯片MD32的设计实践中,提出一种利用读/写缓冲器作为零级cache,减少对数据、指令cache的读/写次数,由于缓冲器读取功耗远远小于片上cache,从而减小cache相关功耗的方法。通过多种多媒体处理测试程序的验证,该技术可减少对指令cache或者数据cache20%~40%的读取次数,以较小芯片面积的增加换取了较大的功耗降低。  相似文献   

18.
An area model suitable for comparing data buffers of different organizations (e.g. caches versus register files) and arbitrary sizes is described. The area model considers the supplied bandwidth of a memory cell and includes such buffer overhead as control logic, driver logic and tag storage. The model gave less than 10% error when verified against real caches and register files. It is shown that, comparing caches and register files in terms of area for the same storage capacity, caches generally occupy more area per bit than register files for small caches because the overhead dominates the cache area at these sizes. For larger caches, the smaller storage cells in the cache provide a smaller total cache area per bit than the register set. Studying cache performance (traffic ratio) as a function of area, it is shown that, for small caches (less than the area occupied by 256 registers bits-r.b.e.-or 32 b), direct-mapped caches perform significantly better than four-way set-associative caches and, for caches of medium areas (between 256 r.b.e. and 4096 r.b.e.), both direct-mapped and set-associative caches perform better than fully associative caches  相似文献   

19.
谢子超  陆俊林  佟冬  王箫音  程旭 《电子学报》2011,39(11):2473-2479
路选择技术可以有效降低指令缓存能耗开销,但已有方法通常会由于预测错误或更新机制复杂而引入额外的取指延迟,导致整体能效性降低.本文面向典型超标量处理器的指令缓存结构,提出了一种高能效的路选择融合技术(Combining Way Selective Cache,CWS-Cache).基于对路预测和路历史技术适用条件的分析,...  相似文献   

20.
Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most of the contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying scratchpad is an extension of the Global Register Allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves close to the optimal results and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for ARM7 processor reports, average energy, and execution time reductions of 26% and 14% over the static approach, respectively. Additional experiments comparing the overlayed scratchpads against unified caches of the same size, report average energy, and execution time savings of 20% and 10%, respectively. We also report data memory energy reductions of 45%-57% due to the insertion of a 1024-bytes scratchpad memory in the memory hierarchy of a digital signal processor (DSP).  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号