Similar Literature
20 similar documents found (search time: 31 ms)
1.
On-chip L1 and L2 caches represent a sizable fraction of the total power consumption of microprocessors. In nanometer-scale technology, subthreshold leakage is becoming one of the dominant components of the total power consumption of those caches. In this study, we present optimization techniques to reduce the subthreshold leakage power of on-chip caches, assuming that multiple threshold voltages (V_T's) are available. First, we show a cache leakage optimization technique that examines the tradeoff between access time and subthreshold leakage power by assigning distinct V_T's to each of the four main cache components: address bus drivers, data bus drivers, decoders, and static random access memory (SRAM) cell arrays with sense amplifiers. Second, we show optimization techniques to reduce the leakage power of L1 and L2 on-chip caches without affecting the average memory access time. The key results are: 1) two additional high V_T's are enough to minimize leakage in a single cache (three V_T's if we include a nominal low V_T for the microprocessor core logic); 2) if the L1 size is fixed, increasing the L2 size can result in much lower leakage without reducing average memory access time; 3) if the L2 size is fixed, reducing the L1 size may result in lower leakage without loss of average memory access time for the SPEC2K benchmarks; and 4) smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting the average memory access time.
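
As a rough illustration of the tradeoff this first technique navigates, the sketch below pairs the standard subthreshold-leakage exponential with an alpha-power delay model and evaluates one V_T assignment per cache component. All constants and the assignment itself are illustrative assumptions, not values from the paper.

```python
import math

def leakage(v_t, i0=1.0, n=1.5, v_therm=0.026):
    """Subthreshold leakage ~ i0 * exp(-V_T / (n * v_therm))."""
    return i0 * math.exp(-v_t / (n * v_therm))

def delay(v_t, vdd=1.0, alpha=1.3):
    """Alpha-power-law gate delay ~ Vdd / (Vdd - V_T)^alpha."""
    return vdd / (vdd - v_t) ** alpha

# Distinct V_T per cache component (illustrative assignment):
components = {"addr_bus_drivers": 0.35, "data_bus_drivers": 0.35,
              "decoders": 0.30, "sram_array_sense": 0.40}
for name, v_t in components.items():
    print(f"{name:18s} V_T={v_t:.2f} V  "
          f"leak={leakage(v_t):.2e}  delay={delay(v_t):.2f}")
```

Raising a component's V_T cuts its leakage exponentially while slowing it only polynomially, which is why a few extra high V_T's suffice.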

2.
Most microprocessors employ on-chip caches to bridge the performance gap between the processor and the main memory. However, cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", in this paper we propose a zero-aware SRAM cell with an asymmetric inverter pair, called the ZA cell, to minimize the cache power consumed in writing "0". The ZA cell uses a circuit-level technique that is software-independent and orthogonal to other low-power techniques at the architecture level. Compared to the conventional SRAM cell, experimental results based on SPEC2000 and MediaBench traces show that, without compromising performance or stability, the ZA cell can reduce the average cache write power consumption by over 60% for both the baseline instruction and data caches. The ZA cell is particularly attractive for data caches, which exhibit a high write-zero rate.
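
The premise that most written values are zero is easy to check on a write trace before committing to the cell design. A minimal profiling sketch (the trace below is a stand-in, not actual SPEC2000/MediaBench data):

```python
# Fraction of zero-valued words in a cache write trace; a real run
# would replay a SPEC2000/MediaBench trace instead of this stand-in.
write_trace = [0, 0, 4, 0, 17, 0, 0, 0, 255, 0]

zeros = sum(1 for word in write_trace if word == 0)
print(f"write-zero rate: {zeros / len(write_trace):.0%}")   # -> 70%
```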

3.
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and by 14% compared to a cache architecture with a thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in turn, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power or the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache.
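
The permutation idea is easy to state in code: choose a bijection on block indices so that consecutively addressed blocks land physically far apart. Bit reversal, used below, is one such bijection chosen purely for illustration; the paper's actual mapping may differ.

```python
def bit_reverse(index, bits):
    """Bijective block permutation: reverse the bits of a block index."""
    out = 0
    for _ in range(bits):
        out = (out << 1) | (index & 1)
        index >>= 1
    return out

BITS = 4                              # 16 physical rows (illustrative)
rows = [bit_reverse(i, BITS) for i in range(2 ** BITS)]
min_dist = min(abs(rows[i + 1] - rows[i]) for i in range(len(rows) - 1))
print(rows)       # [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
print(min_dist)   # consecutive addresses end up 4 rows apart, versus 1
```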

4.
Circuit and microarchitectural techniques for reducing cache leakage power   Cited by: 2 (self-citations: 0, external: 2)
On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is centered on only a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.
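
A toy trace-driven sketch of the kind of "simple policy" evaluated here: periodically put every line into the drowsy state and wake lines on demand at a one-cycle penalty. The window size and penalty are assumptions, and the locality-free random trace makes the measured slowdown pessimistic compared with the paper's results.

```python
import random

LINES, WINDOW, WAKE_PENALTY = 512, 2000, 1    # illustrative parameters
random.seed(0)
trace = [random.randrange(LINES) for _ in range(100_000)]  # stand-in trace

drowsy = [False] * LINES
extra_cycles = 0
for t, line in enumerate(trace):
    if t % WINDOW == 0:              # policy tick: drowse the whole cache
        drowsy = [True] * LINES
    if drowsy[line]:                 # demand wake-up costs one cycle
        drowsy[line] = False
        extra_cycles += WAKE_PENALTY
print(f"wake-up slowdown: {extra_cycles / len(trace):.2%}")
```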

5.
Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the Global Register Allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves results close to the optimal and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for an ARM7 processor reports average energy and execution-time reductions of 26% and 14% over the static approach, respectively. Additional experiments comparing the overlaid scratchpads against unified caches of the same size report average energy and execution-time savings of 20% and 10%, respectively. We also report data-memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).
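
To make the link to allocation concrete, here is a deliberately simplified static variant: choosing which objects go to the scratchpad is a 0/1 knapsack over (size, energy saving). The overlay problem in the paper generalizes this across program regions, analogous to live ranges in Global Register Allocation. Object names and numbers below are invented.

```python
from itertools import combinations

SCRATCHPAD = 1024                     # bytes (illustrative capacity)
# (name, size in bytes, energy saved if placed on-chip) -- invented
objects = [("buf_a", 512, 90), ("buf_b", 640, 100),
           ("func_f", 256, 70), ("func_g", 384, 60)]

best = max((subset for r in range(len(objects) + 1)
            for subset in combinations(objects, r)
            if sum(o[1] for o in subset) <= SCRATCHPAD),
           key=lambda s: sum(o[2] for o in s))
print([o[0] for o in best])           # -> ['buf_b', 'func_f']
```

Exhaustive enumeration stands in for the optimal (ILP) approach; a greedy or heuristic pass over the same data would mirror the near-optimal one.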

6.
This quad-issue processor achieves 1-GHz operation through improved dynamic circuit techniques in critical paths and a more extensive on-chip memory system which scales in both bandwidth and latency. Critical logic paths use domino, delayed-clocked domino, and logic embedded in dynamic flip-flops for minimum delay. A 64-KB sum-addressed-memory data cache combines the address offset add with the cache decode, allowing the average memory latency to scale by more than the clock ratio. Memory bandwidth is improved by using wave-pipelined SRAM designs for on-chip caches and a write cache for store traffic. Memory power is controlled without increased latency by use of delayed-reset logic decoders. The chip operates at 1000 MHz and dissipates less than 80 W from a 1.6-V supply. It contains 23 million transistors (12 million in RAM cells) on a 244 mm² die.
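
The sum-addressed idea admits a compact functional description: rather than adding base and offset and then decoding, each row verifies `base + offset == row` without propagating carries, so all rows check in parallel. One well-known formulation of the carry-free equality test (an illustration of the general idea, not necessarily this chip's circuit) is sketched below.

```python
def sum_matches(base, offset, row, bits=8):
    """Carry-free test of (base + offset) % 2**bits == row.

    Each bit checks locally that the XOR of the operands and the
    candidate row equals the carry implied by the lower bits, so no
    carry chain is needed.
    """
    mask = (1 << bits) - 1
    carries = (((base & offset) | ((base | offset) & ~row)) << 1) & mask
    return (base ^ offset ^ row) & mask == carries

# Exhaustive sanity check against ordinary addition:
print(all(sum_matches(b, o, (b + o) & 0xFF) and
          not sum_matches(b, o, (b + o + 1) & 0xFF)
          for b in range(256) for o in range(256)))   # -> True
```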

7.
In this paper, we propose a novel integrated circuit- and architectural-level technique to reduce leakage power consumption in high-performance cache memories using a single-V_t (transistor threshold voltage) process. We utilize the concept of gated-ground (an nMOS transistor inserted between the ground line and the SRAM cell) to achieve a reduction in leakage energy without significantly affecting performance. Experimental results show that a gated-ground cache (DRG-Cache) retains its data even when the memory is put in the standby mode of operation; data is restored when the gated-ground transistor is turned on. Turning off the gated-ground transistor in turn gives a large reduction in leakage power. This technique requires no extra circuitry; the row decoder itself can be used to control the gated-ground transistor. The technique is applicable to data and instruction caches as well as to different levels of the cache hierarchy, such as the L1, L2, or L3 caches. We fabricated a test chip in TSMC 0.25-μm technology to show the data retention capability and the cell stability of the DRG-Cache. Our simulation results on 100-nm and 70-nm processes (Berkeley Predictive Technology Model) show 16.5% and 27% reductions in the energy consumed in the L1 cache, and 50% and 47% reductions in the L2 cache, respectively, with less than 5% impact on execution time and less than a 4% increase in area.

8.
This 64-b microprocessor is the second-generation design of the new Itanium architecture, termed explicitly parallel instruction computing (EPIC). The design seeks to extract maximum performance from EPIC by optimizing the memory system and execution resources for a combination of high bandwidth and low latency. This is achieved by tightly coupling microarchitecture choices to innovative circuit designs and the capabilities of the transistors and wires in the 0.18-μm bulk Al-metal process. The key features of this design are a short eight-stage pipeline, 11 sustainable issue ports (six integer, four floating-point), half-cycle-access level-1 caches, a 64-GB/s level-2 cache, and a 3-MB level-3 cache, all integrated on a 421 mm² die. The chip operates at over 1 GHz and is built on significant advances in CMOS circuits and methodologies. After providing an overview of the processor microarchitecture and design, this paper describes a few of these key enabling circuits and design techniques.

9.
In this paper, we describe a procedure for memory design and exploration for low-power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we try to reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next, we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements of area, number of cycles, and energy consumption. We include energy in the performance metrics since, for different cache configurations, the variation in energy consumption is quite different from the variation in the number of cycles. The memory exploration procedure is very efficient since it exploits the trends in the cycle and energy characteristics to reduce the search space significantly.
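
The exploration step boils down to a search over (cache size, line size) pairs under area, cycle, and energy constraints. A brute-force sketch with an invented cost model is below; the paper's procedure avoids this exhaustive sweep by exploiting the cycle and energy trends.

```python
def cycles(size_kb, line):     # invented model: larger caches miss less
    return 1e6 * (1 + 4 / size_kb) * (1 + line / 256)

def energy(size_kb, line):     # invented model: larger caches cost more
    return cycles(size_kb, line) * (0.1 + size_kb / 64) * 1e-9

AREA_BUDGET_KB, CYCLE_BUDGET = 32, 1.6e6
feasible = [(s, l) for s in (4, 8, 16, 32, 64) for l in (16, 32, 64)
            if s <= AREA_BUDGET_KB and cycles(s, l) <= CYCLE_BUDGET]
best = min(feasible, key=lambda c: energy(*c))
print(best)                    # -> (8, 16) under these invented models
```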

10.
Microelectronics Journal, 2015, 46(11): 1020-1032
This paper describes a new memristor crossbar architecture proposed for use in a high-density cache design. The design consumes less than 10% of the write energy of a simple memristor crossbar. It also offers up to 3 times the bit density of an STT-MRAM system and up to 11 times the bit density of an SRAM architecture. The proposed architecture is analyzed using a detailed SPICE analysis that accounts for the resistance of the wires in the memristor structure. Additionally, the memristor model used in this work has been matched to specific device characterization data to provide accurate results in terms of energy, area, and timing. The proposed memory system was analyzed by modeling two different devices that vary in resistance range and switching time. The system does not require the memristor devices to have inherent diode effects that limit alternate current paths, and it is therefore capable of utilizing a much broader class of devices. An architectural analysis has also been completed that shows how the memory system may perform as a cache memory. A hybrid cache structure was used to alleviate the long write latencies of memristor devices: the tag array is made of SRAM cells, while the data array is made of the proposed memristor circuit. This hybrid scheme allows multiple reads and writes to concurrently access different sub-arrays within a cache. The performance of these novel memristor-based caches was compared to SRAM- and STT-MRAM-based caches through detailed simulations. The results show that the memristor caches are denser and allow better performance along with lower system power when compared to the STT-MRAM and SRAM caches.

11.
In a typical embedded CPU, large on-chip storage is critical to meet high performance requirements. However, the fast-increasing size of on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU's chip is generally considered not worthwhile because DRAM is not compatible with common CMOS logic and requires additional processing steps beyond what is required for CMOS. However, a special DRAM technology, Gain-Cell embedded DRAM (GC-eDRAM) [1], [2], [3], is logic-compatible and retains some of the good properties of DRAM (small and low-power). In this paper, we evaluate the performance of a novel hybrid cache memory in which the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells, while the tag array continues to use SRAM cells. Our evaluation of this cache demonstrates that, compared to conventional SRAM-based designs, the novel architecture exhibits comparable performance with lower energy consumption and smaller silicon area, enabling sustainable on-chip storage scaling for future embedded CPUs.

12.
The read access delay of a static random access memory (SRAM) is dominated by the time required to develop a voltage differential on the bit-lines, particularly for small, fast level-1 (L1) caches in microprocessors. For a robust design, the bit-lines must develop a differential sufficient to overcome mismatch due to sense amplifier offsets and other signal path components before the data is sensed. This must be accomplished across all process skews and voltages. This paper proposes a design and optimization technique to minimize the bit-line voltage differential variation across process corners and voltages, which increases the read frequency by reducing the delay guard-band required at the design process corner. The technique reduces the required timing guard-band by minimizing the effects of process variation on the circuit delays. On a 90 nm high-performance cache memory data array, the typical corner guard-band required to generate the differential is reduced by 78%. Total variation in bit-line differential is reduced from 243 to 45 mV across process and voltage corners.

13.
This article reintroduces the power gating technique at different hierarchical levels of static random-access memory (SRAM) design, including the cell, row, bank, and entire cache memory, in a 16 nm FinFET (fin field-effect transistor) technology. Different SRAM cell structures, such as 6T, 8T, 9T and 10T, are used in the design of a 2 MB cache memory. The power reduction of the entire cache memory employing cell-level optimisation is 99.7%, at the expense of area and other stability overheads. The power saving of cell-level optimisation is 3× (1.2×) higher than power gating at the cache (bank) level, due to its superior selectivity. The access delay times are allowed to increase by 4% at the same energy-delay product to achieve the best power reduction for each supply voltage and optimisation level. The results show that row-level power gating is the best choice for optimising the power of the entire cache with the lowest drawbacks. Comparisons of the cells show that cells whose bodies have higher power consumption are the best candidates for the power gating technique in row-level optimisation. The technique has the lowest percentage of saving at the minimum energy point (MEP) of the design. Power gating also improves the variation of power in all structures by at least 70%.

14.
Tuning a configurable cache subsystem to an application can greatly reduce memory hierarchy energy consumption. Previous tuning methods use a level-one configurable cache only, or a second level with separate instruction and data configurable caches. We instead use a commercially common unified second-level cache, a seemingly minor difference that actually expands the configuration space from 500 to about 20,000 configurations. We develop additive way tuning for tuning a cache subsystem with this large space, yielding 61% energy savings and 9% performance improvements over a nonconfigurable cache, greatly outperforming an extension of a previous method.
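
The roughly 20,000 configurations is what exhaustive search would have to visit; a one-parameter-at-a-time greedy pass evaluates only a sum, not a product, of the per-parameter choices. The sketch below shows that arithmetic with a generic greedy heuristic and an invented cost function; it is not the paper's additive way tuning algorithm, whose details the abstract does not give.

```python
from itertools import product

# Parameters for L1-I, L1-D, and a unified L2 (illustrative choices).
PARAMS = {"l1i_kb": (2, 4, 8), "l1i_ways": (1, 2, 4), "l1i_line": (16, 32, 64),
          "l1d_kb": (2, 4, 8), "l1d_ways": (1, 2, 4), "l1d_line": (16, 32, 64),
          "l2_kb": (16, 32, 64), "l2_ways": (4, 8, 16), "l2_line": (32, 64, 128)}
print(len(list(product(*PARAMS.values()))))       # 19683 exhaustive configs

def cost(cfg):                        # invented stand-in for a simulation
    return sum(v * (i + 1) for i, v in enumerate(cfg.values()))

cfg = {k: v[0] for k, v in PARAMS.items()}
for key, choices in PARAMS.items():   # tune one parameter at a time
    cfg[key] = min(choices, key=lambda c: cost({**cfg, key: c}))
print(sum(len(v) for v in PARAMS.values()), "configs evaluated")  # 27
```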

15.
An area model suitable for comparing data buffers of different organizations (e.g., caches versus register files) and arbitrary sizes is described. The area model considers the supplied bandwidth of a memory cell and includes such buffer overhead as control logic, driver logic, and tag storage. The model gave less than 10% error when verified against real caches and register files. It is shown that, comparing caches and register files in terms of area for the same storage capacity, caches generally occupy more area per bit than register files at small sizes, because the overhead dominates the cache area at these sizes. For larger caches, the smaller storage cells in the cache provide a smaller total cache area per bit than the register set. Studying cache performance (traffic ratio) as a function of area, it is shown that, for small caches (less than the area occupied by 256 register bits, i.e., 256 r.b.e. or 32 B), direct-mapped caches perform significantly better than four-way set-associative caches, and, for caches of medium area (between 256 r.b.e. and 4096 r.b.e.), both direct-mapped and set-associative caches perform better than fully associative caches.
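
The crossover the model predicts can be reproduced with a two-term sketch: per-bit cell area plus a fixed overhead for control, drivers, and tags. The constants below are invented for illustration; the paper's model is calibrated against real designs.

```python
REG_CELL, CACHE_CELL = 1.0, 0.6           # cache storage cells are smaller
REG_OVERHEAD, CACHE_OVERHEAD = 100, 600   # cache pays more fixed overhead

def area_per_bit(bits, cell, overhead):   # in r.b.e. (register-bit equiv.)
    return (bits * cell + overhead) / bits

for bits in (256, 1024, 4096, 16384):
    reg = area_per_bit(bits, REG_CELL, REG_OVERHEAD)
    cache = area_per_bit(bits, CACHE_CELL, CACHE_OVERHEAD)
    print(f"{bits:6d} b  registers: {reg:.2f}  cache: {cache:.2f}  r.b.e./bit")
# Overhead dominates small caches; the smaller cell wins at larger sizes.
```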

16.
In this paper, we present a novel and fast constructive technique that relocates instruction code within the main memory in such a manner that the cache is utilized more efficiently. The technique is applied as a preprocessing step, i.e., before the code is executed. It is applicable in embedded systems where the number and characteristics of the tasks running on the system are known a priori, and it imposes no computational overhead on the system. As a result of applying our technique to a variety of real-world applications, we observed through simulation a significant drop in cache misses. Furthermore, the energy consumption of the whole system (CPU, caches, buses, main memory) is reduced by up to 65%. These benefits are achieved at the cost of a slightly increased main memory size of about 13% on average.
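
One plausible reading of the relocation step, sketched below: functions that interleave at run time should not map to the same cache sets, so slide each function's load address until it no longer clashes with its hot partners. The inputs, the direct-mapped toy cache, and the greedy sliding are all invented to illustrate the idea; the paper's constructive technique is more elaborate.

```python
CACHE_SETS, LINE = 64, 32                 # toy direct-mapped i-cache
funcs = [("isr", 512), ("filter", 1024), ("decode", 1024)]  # name, bytes
hot_pairs = {frozenset(("isr", "decode")), frozenset(("isr", "filter"))}

def sets_of(addr, size):                  # cache sets a placement touches
    return {(addr // LINE + i) % CACHE_SETS for i in range(size // LINE)}

placed, cursor = {}, 0                    # name -> (addr, size)
for name, size in funcs:                  # functions assumed < cache size
    partners = [p for p in placed if frozenset((name, p)) in hot_pairs]
    addr = cursor
    while any(not sets_of(addr, size).isdisjoint(sets_of(*placed[p]))
              for p in partners):
        addr += LINE                      # slide until hot partners don't clash
    placed[name] = (addr, size)
    cursor = addr + size
print({name: hex(addr) for name, (addr, _) in placed.items()})
```

The gaps the sliding introduces are the memory-size overhead the abstract quantifies at about 13%.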

17.
Power consumption is an increasingly pressing problem in modern processor design. Since on-chip caches usually consume a significant amount of power, they are one of the most attractive targets for power reduction. This paper presents a two-level filter scheme, consisting of L1 and L2 filters, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial number of unnecessary activities in conventional cache architectures. We use a single block buffer as the L1 filter to eliminate unnecessary cache accesses. For the L2 filter, we then propose a new sentry-tag architecture to further filter out unnecessary way activities in the case of an L1 filter miss. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme can effectively reduce cache power consumption by eliminating most unnecessary cache activities, while the compromise in system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the two-level filter yields roughly a 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%.
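
The L1 filter is essentially a one-entry cache in front of the cache: if the requested word falls in the most recently accessed block, the tag and data arrays are never activated. A trace-driven sketch with a stand-in trace and an assumed block size:

```python
BLOCK = 32                                # bytes per block (illustrative)
# Stand-in address trace with spatial locality (real inputs: SPEC2000).
trace = [base + off for base in range(0, 4096, 256)
         for off in range(0, 64, 4)]

buffer_tag, filtered = None, 0
for addr in trace:
    tag = addr // BLOCK
    if tag == buffer_tag:
        filtered += 1                     # block-buffer hit: no cache access
    else:
        buffer_tag = tag                  # miss: access the cache, refill
print(f"cache activations avoided: {filtered / len(trace):.0%}")   # -> 88%
```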

18.
Deep-submicron CMOS designs maintain high transistor switching speeds by scaling down the supply voltage and proportionately reducing the transistor threshold voltage. Lowering the threshold voltage increases leakage energy dissipation due to subthreshold leakage current, even when the transistor is not switching. Estimates suggest a five-fold increase in leakage energy in every future generation. In modern microarchitectures, much of the leakage energy is dissipated in large on-chip cache memory structures with high transistor densities. While cache utilization varies both within and across applications, modern cache designs are fixed in size, resulting in transistor leakage inefficiencies. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy in instruction caches (i-caches). At the architecture level, we propose the Dynamically ResIzable i-cache (DRI i-cache), a novel i-cache design that dynamically resizes and adapts to an application's required size. At the circuit level, we use gated-Vdd, a novel mechanism that effectively turns off the supply voltage to, and eliminates leakage in, the SRAM cells in a DRI i-cache's unused sections. Architectural and circuit-level simulation results indicate that a DRI i-cache successfully and robustly exploits the cache-size variability both within and across applications. Compared to a conventional i-cache using an aggressively scaled threshold voltage, a 64 K DRI i-cache reduces, on average, both the leakage energy-delay product and the cache size by 62%, with less than 4% impact on execution time. Our results also indicate that a wide NMOS dual-Vt gated-Vdd transistor with a charge pump offers the best gating implementation and virtually eliminates leakage energy with minimal increase in SRAM cell read time and area, as compared to an i-cache with an aggressively scaled threshold voltage.
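
The resizing loop can be approximated as a miss-bound controller: count misses over a fixed sense interval, upsize when the bound is exceeded, downsize when the working set clearly fits, and gate off (gated-Vdd) the SRAM rows of the unused sections. Thresholds and sizes below are invented; this is an approximation of the DRI policy as the abstract describes it.

```python
MISS_BOUND, MIN_KB, MAX_KB = 100, 8, 64   # invented parameters

def resize(active_kb, interval_misses):
    """One resizing decision at the end of a sense interval."""
    if interval_misses > MISS_BOUND and active_kb < MAX_KB:
        return active_kb * 2              # too many misses: upsize
    if interval_misses < MISS_BOUND // 2 and active_kb > MIN_KB:
        return active_kb // 2             # fits easily: downsize, gate rows
    return active_kb

size = 64
for misses in (30, 20, 40, 250, 180, 60, 10):   # per-interval miss counts
    size = resize(size, misses)
    print(f"misses={misses:4d} -> active size {size:3d} kB")
```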

19.
Fault buffers     
Voltage scaling can be applied to cache memories to reduce their energy consumption. However, a reduced supply voltage to the cache memories increases the number of defective SRAM cells due to process variations, which decreases their yield and nullifies the benefits of voltage scaling. To mitigate this problem, we propose a fault buffer-based scheme for L1 caches. Faults are identified and isolated at the granularity of individual words in the L1 caches. Actively used faulty cache words are dynamically allocated in the fault buffers. The fault buffers are organized as multiple banks for low-cost implementation and can be dynamically reconfigured to reflect the varying performance demands of programs. This dynamic scheme is shown to be more energy- and area-efficient than, and to perform comparably to, the previously proposed static schemes.
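
Functionally, the scheme is a small remap table in front of the data array: accesses to words marked faulty at low voltage are serviced from buffer entries allocated on demand. A minimal sketch under that reading; the banked organization is collapsed here to a single LRU-managed buffer, and all names and addresses are illustrative.

```python
from collections import OrderedDict

faulty_words = {0x1004, 0x2008, 0x3040}   # marked by low-voltage test (toy)
BUFFER_ENTRIES = 2                        # buffer capacity (illustrative)
fault_buffer = OrderedDict()              # word address -> data, LRU order

def access(addr):
    """Route faulty words to the fault buffer; LRU-evict when full."""
    if addr not in faulty_words:
        return "SRAM array"               # healthy word: normal access
    if addr not in fault_buffer:
        if len(fault_buffer) >= BUFFER_ENTRIES:
            fault_buffer.popitem(last=False)   # evict least recently used
        fault_buffer[addr] = 0
    fault_buffer.move_to_end(addr)        # mark as most recently used
    return "fault buffer"

for a in (0x1000, 0x1004, 0x2008, 0x3040, 0x1004):
    print(hex(a), "->", access(a))
```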

20.
This paper proposes a code placement problem, its ILP formulation, and a heuristic algorithm for reducing the total energy consumption of embedded processor systems, including a CPU core and on-chip and off-chip memories. Our approach exploits a non-cacheable memory region for effective use of the cache memory and, as a result, reduces the number of off-chip accesses. Our algorithm simultaneously finds a code layout for the cacheable region, the scratchpad region, and the remaining non-cacheable region of the address space so as to minimize the total energy consumption of the processor system. Experiments using a commercial embedded processor and an off-chip SDRAM demonstrate that our algorithm reduces the energy consumption of the processor system by 23% without any performance degradation, compared to the best result achieved by the conventional approach.
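
A greedy stand-in for the placement decision: rank code objects by accesses per byte, fill the scratchpad with the hottest ones, send rarely reused code to the non-cacheable region so it stops evicting useful lines, and leave the rest cacheable. The ILP in the paper decides this jointly and optimally; all profile numbers below are invented.

```python
SCRATCHPAD = 2048                         # bytes (illustrative)
# (object, size in bytes, access count) -- invented profile data
objs = [("hot_loop", 1024, 90_000), ("dsp_kernel", 1024, 60_000),
        ("init", 4096, 300), ("logging", 2048, 150),
        ("parser", 3072, 12_000)]

regions, spm_left = {}, SCRATCHPAD
for name, size, accesses in sorted(objs, key=lambda o: o[2] / o[1],
                                   reverse=True):
    if size <= spm_left:                  # hottest code: scratchpad
        regions[name], spm_left = "scratchpad", spm_left - size
    elif accesses / size < 1.0:           # barely reused: bypass the cache
        regions[name] = "non-cacheable"
    else:
        regions[name] = "cacheable"
print(regions)
```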
