1.

Static energy reduction techniques for microprocessor caches (total citations: 1; self-citations: 0; citations by others: 1)

Hanson H. Hrishikesh M.S. Agarwal V. Keckler S.W. Burger D. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(3):303-313

Microprocessor performance has been improved by increasing the capacity of on-chip caches. However, the performance gain comes at the price of static energy consumption due to subthreshold leakage current in cache memory arrays. This paper compares three techniques for reducing static energy consumption in on-chip level-1 and level-2 caches. One technique employs low-leakage transistors in the memory cell. Another technique, power supply switching, can be used to turn off memory cells and discard their contents. A third alternative is dynamic threshold modulation, which places memory cells in a standby state that preserves cell contents. In our experiments, we explore the energy and performance tradeoffs of these techniques. We also investigate the sensitivity of microprocessor performance and energy consumption to additional cache latency caused by leakage-reduction techniques.

2.

A single-V_t low-leakage gated-ground cache for deep submicron

Agarwal A. Hai Li Roy K. 《Solid-State Circuits, IEEE Journal of》2003,38(2):319-328

In this paper, we propose a novel integrated circuit and architectural level technique to reduce leakage power consumption in high-performance cache memories using a single-V_t (transistor threshold voltage) process. We utilize the concept of gated-ground (nMOS transistor inserted between ground line and SRAM cell) to achieve a reduction in leakage energy without significantly affecting performance. Experimental results on gated-ground caches show that data is retained (DRG-Cache) even if the memory is put in the standby mode of operation. Data is restored when the gated-ground transistor is turned on. Turning off the gated-ground transistor in turn gives a large reduction in leakage power. This technique requires no extra circuitry; the row decoder itself can be used to control the gated-ground transistor. The technique is applicable to data and instruction caches as well as different levels of cache hierarchy, such as the L1, L2, or L3 caches. We fabricated a test chip in TSMC 0.25-μm technology to show the data retention capability and the cell stability of the DRG-Cache. Our simulation results on 100-nm and 70-nm processes (Berkeley Predictive Technology Model) show 16.5% and 27% reduction in consumed energy in L1 cache and 50% and 47% reduction in L2 cache, respectively, with less than 5% impact on execution time and within 4% increase in area overhead.

3.

Reducing leakage in a high-performance deep-submicron instruction cache

Powell M. Se-Hyun Yang Falsafi B. Roy K. Vijaykumar N. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2001,9(1):77-89

Deep-submicron CMOS designs maintain high transistor switching speeds by scaling down the supply voltage and proportionately reducing the transistor threshold voltage. Lowering the threshold voltage increases leakage energy dissipation due to subthreshold leakage current even when the transistor is not switching. Estimates suggest a five-fold increase in leakage energy in every future generation. In modern microarchitectures, much of the leakage energy is dissipated in large on-chip cache memory structures with high transistor densities. While cache utilization varies both within and across applications, modern cache designs are fixed in size, resulting in transistor leakage inefficiencies. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy in instruction caches (i-caches). At the architecture level, we propose the Dynamically ResIzable i-cache (DRI i-cache), a novel i-cache design that dynamically resizes and adapts to an application's required size. At the circuit level, we use gated-V_dd, a novel mechanism that effectively turns off the supply voltage to, and eliminates leakage in, the SRAM cells in a DRI i-cache's unused sections. Architectural and circuit-level simulation results indicate that a DRI i-cache successfully and robustly exploits the cache size variability both within and across applications. Compared to a conventional i-cache using an aggressively-scaled threshold voltage, a 64 K DRI i-cache reduces on average both the leakage energy-delay product and cache size by 62%, with less than 4% impact on execution time. Our results also indicate that a wide NMOS dual-V_t gated-V_dd transistor with a charge pump offers the best gating implementation and virtually eliminates leakage energy with minimal increase in an SRAM cell read time and area as compared to an i-cache with an aggressively-scaled threshold voltage.

4.

A third-generation SPARC V9 64-b microprocessor

Heald R. Aingaran K. Amir C. Ang M. Boland M. Dixit P. Gouldsberry G. Greenley D. Grinberg J. Hart J. Horel T. Wen-Jay Hsu Kaku J. Chin Kim Song Kim Klass F. Kwan H. Lauterbach G. Lo R. McIntyre H. Mehta A. Murata D. Nguyen S. Yet-Ping Pai Patel S. Shin K. Tam K. Vishwanthaiah S. Wu J. Yee G. You E. 《Solid-State Circuits, IEEE Journal of》2000,35(11):1526-1538

This quad-issue processor achieves 1-GHz operation through improved dynamic circuit techniques in critical paths and a more extensive on-chip memory system which scales in both bandwidth and latency. Critical logic paths use domino, delayed clocked domino, and logic embedded in dynamic flip-flops for minimum delay. A 64-KB sum-addressed memory data cache combines the address offset add with the cache decode, allowing the average memory latency to scale by more than the clock ratio. Memory bandwidth is improved by using wave pipelined SRAM designs for on-chip caches and a write cache for store traffic. Memory power is controlled without increased latency by use of delayed-reset logic decoders. The chip operates at 1000 MHz and dissipates less than 80 W from a 1.6-V supply. It contains 23 million transistors (12 million in RAM cells) on a 244 mm² die.

5.

A Case for Asymmetric-Cell Cache Memories

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):877-881

In this paper, we make the case for building high-performance asymmetric-cell caches (ACCs) that employ recently-proposed asymmetric SRAMs to reduce leakage proportionally to the number of resident zero bits. Because ACCs target memory value content (independent of cell activity and access patterns), they complement prior proposals for reducing cache leakage that target memory access characteristics. Through detailed simulation and leakage estimation using a commercial 0.13-μm CMOS process model, we show that: 1) on average 75% of resident data cache bits and 64% of resident instruction cache bits are zero; 2) while prior research carefully evaluated the fraction of accessed zero bytes, we show that a high fraction of accessed zero bytes is neither a necessary nor a sufficient condition for a high fraction of resident zero bits; 3) the zero-bit program behavior persists even when we restrict our attention to live data, thereby complementing prior leakage-saving techniques that target inactive cells; and 4) ACCs can reduce leakage on the average by 4.3× compared to a conventional data cache without any performance loss, and by 9× at the cost of a 5% increase in overall cache access latency.
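
The distinction drawn in point 2) between zero bytes and zero bits can be illustrated with a small sketch. It is not taken from the paper: the function names and the sample cache-line contents are hypothetical, and it only counts bits in a byte string rather than modeling resident versus accessed data.

```python
def zero_bit_fraction(data: bytes) -> float:
    """Return the fraction of bits in `data` that are zero."""
    if not data:
        raise ValueError("data must be non-empty")
    total_bits = 8 * len(data)
    one_bits = sum(bin(byte).count("1") for byte in data)
    return (total_bits - one_bits) / total_bits


def zero_byte_fraction(data: bytes) -> float:
    """Return the fraction of bytes in `data` that are exactly 0x00."""
    if not data:
        raise ValueError("data must be non-empty")
    return sum(1 for byte in data if byte == 0) / len(data)


# Hypothetical 8-byte line in which every byte is 0x01:
# no byte is zero, yet 7 of every 8 bits are zero.
line = b"\x01" * 8
print(zero_byte_fraction(line))  # 0.0
print(zero_bit_fraction(line))   # 0.875
```

In this made-up line the zero-byte fraction is 0 while the zero-bit fraction is 87.5%, which shows how a high share of zero bits does not require a high share of zero bytes.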

6.

Thermal Management of On-Chip Caches Through Power Density Minimization

Ja Chun Ku Ozdemir S. Memik G. Ismail Y. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(5):592-604

Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in return, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache.

7.

CACTI: an enhanced cache access and cycle time model (total citations: 3; self-citations: 0; citations by others: 3)

Wilton S.J.E. Jouppi N.P. 《Solid-State Circuits, IEEE Journal of》1996,31(5):677-688

This paper describes an analytical model for the access and cycle times of on-chip direct-mapped and set-associative caches. The inputs to the model are the cache size, block size, and associativity, as well as array organization and process parameters. The model gives estimates that are within 6% of Hspice results for the circuits we have chosen. This model extends previous models and fixes many of their major shortcomings. New features include models for the tag array, comparator, and multiplexor drivers, nonstep stage input slopes, rectangular stacking of memory subarrays, a transistor-level decoder model, column-multiplexed bitlines controlled by an additional array organizational parameter, load-dependent size transistors for wordline drivers, and output of cycle times as well as access times. Software implementing the model is available via ftp.

8.

A reconfigurable multilevel parallel texture cache memory with 75-GB/s parallel cache replacement bandwidth

Se-Jeong Park Jeong-Su Kim Ramchan Woo Se-Joong Lee Kang-Min Lee Tae-Hum Yang Jin-Yong Jung Hoi-Jun Yoo 《Solid-State Circuits, IEEE Journal of》2002,37(5):612-623

Recently, the level of realism in PC graphics applications has been approaching that of high-end graphics workstations, necessitating a more sophisticated texture data cache memory to overcome the finite bandwidth of the AGP or PCI bus. This paper proposes a multilevel parallel texture cache memory to reduce the required data bandwidth on the AGP or PCI bus and to accelerate the operations of parallel graphics pipelines in PC graphics cards. The proposed cache memory is fabricated by 0.16-μm DRAM-based SOC technology. It is composed of four components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches, pipelined texture data filters, and a serial-to-parallel loader. For high-speed parallel L1 cache data replacement, the internal bus bandwidth has been maximized up to 75 GB/s with a newly proposed hidden double data transfer scheme. In addition, the cache memory has a reconfigurable architecture in its line size for optimal caching performance in various graphics applications from three-dimensional (3-D) games to high-quality 3-D movies.

9.

Zero-aware asymmetric SRAM cell for reducing cache power in writing zero

Yen-Jen Chang Feipei Lai Chia-Lin Yang 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(8):827-836

Most microprocessors employ the on-chip caches to bridge the performance gap between the processor and the main memory. However, the cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", in this paper we propose a zero-aware SRAM cell with an asymmetric inverter pair, called ZA cell, to minimize the cache power consumption in writing "0". The ZA cell uses a circuit-level technique, which is software independent and orthogonal to other low-power techniques at architecture-level. Compared to the conventional SRAM cell, the experimental results based on the SPEC2000 and MediaBench traces show that without compromise of both performance and stability, the ZA cell can reduce the average cache write power consumption over 60% for both the baseline instruction and data caches. In particular, the ZA cell is attractive in the data caches, which reveal the high write-zero rate.

10.

A 450-MHz RISC microprocessor with enhanced instruction set and copper interconnect

Nicoletta C. Alvarez J. Barkin E. Chai-Chin Chao Johnson B.R. Lassandro F.M. Patel P. Reid D. Sanchez H. Seigel J. Snyder M. Sullivan S. Taylor S.A. Minh Vo 《Solid-State Circuits, IEEE Journal of》1999,34(11):1478-1491

This superscalar microprocessor is the first implementation of a 32-bit RISC architecture specification incorporating a single-instruction, multiple-data vector processing engine. Two instructions per cycle plus a branch can be dispatched to two of seven execution units in this microarchitecture designed for high execution performance, high memory bandwidth, and low power for desktop, embedded, and multiprocessing systems. The processor features an enhanced memory subsystem, 128-bit internal data buses for improved bandwidth, and 32-KB eight-way instruction/data caches. The integrated L2 tag and cache controller with a dedicated L2 bus interface supports L2 cache sizes of 512 KB, 1 MB, or 2 MB with two-way set associativity. At 450 MHz, and with a 2-MB L2 cache, this processor is estimated to have a floating-point and integer performance metric of 20 while dissipating only 7 W at 1.8 V. The 10.5 million transistor, 83-mm² die is fabricated in a 1.8-V, 0.20-μm CMOS process with six layers of copper interconnect.

11.

Circuit and microarchitectural techniques for reducing cache leakage power (total citations: 2; self-citations: 0; citations by others: 2)

Nam Sung Kim Flautner K. Blaauw D. Mudge T. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(2):167-184

On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.
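
The observation that activity concentrates on a small subset of lines can be made concrete with a toy simulation. The policy below (put every line into the drowsy state at the start of each fixed window, and wake a line when it is accessed) is an assumption chosen for illustration; the abstract does not specify the policies the paper evaluates, and the function name, trace, and parameters are hypothetical.

```python
def simulate_drowsy(accesses, num_lines, window):
    """Toy drowsy-cache policy.

    Every `window` accesses all lines are put into the drowsy state;
    a drowsy line is woken up when it is accessed.

    Returns (wakeups, mean fraction of lines drowsy, sampled after each access).
    """
    if not accesses:
        raise ValueError("accesses must be non-empty")
    drowsy = [True] * num_lines
    wakeups = 0
    samples = []
    for step, line in enumerate(accesses):
        if step % window == 0:
            drowsy = [True] * num_lines   # periodic "all lines drowsy" reset
        if drowsy[line]:
            drowsy[line] = False          # wake-up would cost extra latency
            wakeups += 1
        samples.append(sum(drowsy) / num_lines)
    return wakeups, sum(samples) / len(samples)


# Hypothetical trace: two hot lines out of eight, window of four accesses.
trace = [0, 1, 0, 1, 0, 1, 0, 1]
print(simulate_drowsy(trace, num_lines=8, window=4))  # (4, 0.78125)
```

With this made-up trace most lines stay drowsy most of the time while only four wake-ups occur, which is the kind of tradeoff between leakage savings and wake-up penalty the abstract describes.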

12.

A 400-MHz S/390 microprocessor

Webb C.F. Anderson C.J. Sigal L. Shepard K.L. Liptay J.S. Warnock J.D. Curran B. Krumm B.W. Mayo M.D. Camporese P.J. Schwarz E.M. Farrell M.S. Restle P.J. Averill R.M. III Slegel T.J. Houtt W.V. Chan Y.H. Wile B. Nguyen T.N. Emma P.G. Beece D.K. Ching-Te Chuang Price C. 《Solid-State Circuits, IEEE Journal of》1997,32(11):1665-1675

A microprocessor implementing IBM S/390 architecture operates in a 10+2 way system at frequencies up to 411 MHz (2.43 ns). The chip is fabricated in a 0.2-μm L_eff CMOS technology with five layers of metal and tungsten local interconnect. The chip size is 17.35 mm×17.30 mm with about 7.8 million transistors. The power supply is 2.5 V and measured power dissipation at 300 MHz is 37 W. The microprocessor features two instruction units (IUs), two fixed point units (FXUs), two floating point units (FPUs), a buffer control element (BCE) with a unified 64-KB L1 cache, and a register unit (RU). The microprocessor dispatches one instruction per cycle. The dual-instruction, fixed, and floating point units are used to check each other to increase reliability and not for improved performance. A phase-locked-loop (PLL) provides a processor clock that runs at 2× the system bus frequency. High-frequency operation was achieved through careful static circuit design and timing optimization, along with limited use of dynamic circuits for highly critical functions, and several different clocking/latching strategies for cycle time reduction. Timing-driven synthesis and placement of the control logic provided the maximum flexibility with minimum turnaround time. Extensive use of self-resetting CMOS (SRCMOS) circuits in the on-chip L1 cache provides a 2.0-ns access time and up to 500 MHz operation.

13.

A separated bit-line unified cache: Conciliating small on-chip cache die-area and low miss ratio

Mizuno H. Ishibashi K. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1999,7(1):139-144

This paper describes an on-chip cache, called a separated bit-line unified cache, which minimizes the chip-area cost in high-performance microprocessors. This unified cache has two ports; one for the instruction bus and the other for the data bus. A separated bit-line memory hierarchy architecture realizes memory hierarchy design with only 10%-20% area overhead. The total cache area can be reduced by more than 20%-30% on the average at capacities of larger than 64 KB with the same hit rate as the conventional cache. The cache latency reaches 4.2 ns at a supply voltage of 1 V. Additionally, the cache is physically addressable even if the cache has a large capacity.

14.

A single-chip, 1.6-billion, 16-b MAC/s multiprocessor DSP

Ackland B. Anesko A. Brinthaupt D. Daubert S.J. Kalavade A. Knobloch J. Micca E. Moturi M. Nicol C.J. O'Neill J.H. Othmer J. Sackinger E. Singh K.J. Sweet J. Terman C.J. Williams J. 《Solid-State Circuits, IEEE Journal of》2000,35(3):412-424

An MIMD multiprocessor digital signal-processing (DSP) chip containing four 64-b processing elements (PE's) interconnected by a 128-b pipelined split transaction bus (STBus) is presented. Each PE contains a 32-b RISC core with DSP enhancements and a 64-b single-instruction, multiple-data vector coprocessor with four 16-b MAC/s and a vector reduction unit. PEs are connected to the STBus through reconfigurable dual-ported snooping L1 cache memories that support shared memory multiprocessing using a modified-MESI data coherency protocol. High-bandwidth data transfers between system memory and on-chip caches are managed in a pipelined memory controller that supports multiple outstanding transactions. An embedded RTOS dynamically schedules multiple tasks onto the PEs. Process synchronization is achieved using cached semaphores. The 200-mm², 0.25-μm CMOS chip operates at 100 MHz and dissipates 4 W from a 3.3-V supply.

15.

Research on leakage power control policies for on-chip L2 caches

Zhou Hongwei, Ou Guodong, Qi Shubo, Zhang Minxuan (周宏伟, 欧国东, 齐树波, 张民选) 《Acta Electronica Sinica》2008,36(8):1532-1537

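As process geometries shrink and processor frequencies rise, large on-chip L2 caches have become the main source of processor leakage power. This paper proposes two L2 cache leakage power control policies, conservative multi-state (C-SP&SD) and speculative multi-state (S-SP&SD), which combine the State-Preserving and State-Destroying low-power modes. If a datum has multiple copies in the multilevel cache hierarchy, only one copy is kept active; all other copies are switched to a low-power mode and, as far as possible without significantly affecting processor performance, to the lower-power State-Destroying mode. Compared with conventional L2 cache leakage control policies, C-SP&SD trades a small loss in processor performance for a large saving in L2 cache leakage power, while S-SP&SD achieves the best L2 cache leakage power saving and processor energy efficiency.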

16.

A Sub-2 W Low Power IA Processor for Mobile Internet Devices in 45 nm High-k Metal Gate CMOS

Gerosa G. Curtis S. D'Addeo M. Bo Jiang Kuttanna B. Merchant F. Patel B. Taufique M.H. Samarchi H. 《Solid-State Circuits, IEEE Journal of》2009,44(1):73-82

This paper describes a low power Intel Architecture (IA) processor specifically designed for Mobile Internet Devices (MID) with performance similar to mainstream Ultra-Mobile PCs. The design relies on high residency in a new low-power state in order to keep average power and idle power below 220 and 80 mW, respectively. The design consists of an in-order pipeline capable of issuing 2 instructions per cycle supporting 2 threads, 32 KB instruction and 24 KB data L1 caches, independent integer and floating point execution units, x86 front end execution unit, a 512 KB L2 cache and a 533 MT/s dual-mode (GTL and CMOS) front-side-bus (FSB). The design contains 47 million transistors in a die size under 25 mm² manufactured in a 9-metal 45 nm CMOS process with optimized transistors for low leakage. Maximum thermal design power (TDP) consumption is measured at 2 W at 1.0 V, 90°C using a synthetic power-virus test at a frequency of 1.86 GHz.

17.

Partial address directory for cache access

Lishing Liu 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》1994,2(2):226-240

In most high performance computers the speeds of cache accessing are critical in determining the cycle times. A classical method for designing set-associative caches is to late-select array data based on the results of cache directory lookups. The impact on the critical path timing due to late-select will become more significant in future microprocessors with very high clock frequencies. In this paper we propose a new approach to the optimization of array access timing for set-associative caches. The basic idea is to utilize a relatively small partial address directory (PAD) for fast and accurate approximations of cache access coordinates. The PAD can speed up most cache array access by accurately predicting cache locations without having to wait for results from conventional cache directory lookups. Occasionally when the PAD guesses wrong, a memory access can be re-issued with only 1-cycle delays. The PAD may be closely integrated with the array design. The effectiveness of the PAD method is analyzed through combinatorial and simulation studies.
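
The idea of guessing a cache way from a partial address and paying one extra cycle on a wrong guess can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's design: the partial directory is modeled as the low `partial_bits` bits of each way's tag, the guess is the first way whose partial tag matches, and the function name and sample tags are hypothetical.

```python
def pad_lookup(full_tags, tag, partial_bits):
    """Look up `tag` in one cache set using a partial-tag guess.

    `full_tags` holds one full tag per way (None for an empty way).
    Returns (way, extra_cycles); way is None on a cache miss.
    """
    mask = (1 << partial_bits) - 1
    # Fast path: guess the first way whose partial tag matches.
    guess = next(
        (w for w, t in enumerate(full_tags)
         if t is not None and (t & mask) == (tag & mask)),
        None,
    )
    # Slow path: the full directory compare gives the true answer.
    actual = next((w for w, t in enumerate(full_tags) if t == tag), None)
    if actual is None:
        return None, 0              # miss; miss handling is not modeled here
    extra_cycles = 0 if guess == actual else 1   # wrong guess: re-issue
    return actual, extra_cycles


ways = [0x12, 0x22, 0x35, None]     # hypothetical 4-way set
print(pad_lookup(ways, 0x35, partial_bits=4))  # (2, 0): guess was right
print(pad_lookup(ways, 0x22, partial_bits=4))  # (1, 1): 0x12 aliases, re-issue
```

Widening `partial_bits` reduces aliasing between ways at the cost of a larger partial directory, which is the tradeoff a design of this kind would have to balance.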

18.

Design and analysis of low-power cache using two-level filter scheme

Yen-Jen Chang Shanq-Jang Ruan Feipei Lai 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(4):568-580

Power consumption is an increasingly pressing problem in modern processor design. Since the on-chip caches usually consume a significant amount of power, it is one of the most attractive targets for power reduction. This paper presents a two-level filter scheme, which consists of the L1 and L2 filters, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial unnecessary activities in conventional cache architecture. We use a single block buffer as the L1 filter to eliminate the unnecessary cache accesses. In the L2 filter, we then propose a new sentry-tag architecture to further filter out the unnecessary way activities in case of the L1 filter miss. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform the HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme can effectively reduce the cache power consumption by eliminating most unnecessary cache activities, while the compromise of system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the use of a two-level filter can result in roughly 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%.
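
The L1 filter described above, a single block buffer in front of the cache, can be illustrated with a minimal sketch. It assumes the buffer simply remembers the most recently accessed block and that any access falling in that block avoids a cache array access; the function name, address trace, and block size are hypothetical, and the sentry-tag L2 filter is not modeled.

```python
def count_filtered_accesses(addresses, block_size):
    """Count accesses served by a single block buffer.

    The buffer holds the most recently accessed block; an access to the
    same block is filtered and does not reach the cache arrays.
    """
    last_block = None
    filtered = 0
    for addr in addresses:
        block = addr // block_size
        if block == last_block:
            filtered += 1          # served from the block buffer
        else:
            last_block = block     # cache is accessed, buffer is refilled
    return filtered


# Hypothetical byte-address trace with 32-byte blocks.
trace = [0, 4, 8, 32, 36, 0]
print(count_filtered_accesses(trace, block_size=32))  # 3
```

Sequential accesses within a block are filtered, so traces with strong spatial locality benefit most from a buffer of this kind.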

19.

Design and implementation of an embedded 512-KB level-2 cache subsystem

Shin J.L. Petrick B. Singh M. Leon A.S. 《Solid-State Circuits, IEEE Journal of》2005,40(9):1815-1820

Dual on-chip 512-KB unified second level (L2) caches for an UltraSparc processor are implemented using 0.13-μm technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85°C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 cache are discussed.

20.

Gate oxide leakage and delay tradeoffs for dual-T_ox circuits

Sultania A.K. Sylvester D. Sapatnekar S.S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(12):1362-1375

Gate oxide tunneling current (I_gate) is comparable to subthreshold leakage current in CMOS circuits when the equivalent physical oxide thickness (T_ox) is below 15 Å. Increasing the value of T_ox reduces the leakage at the expense of increased delay, and hence a practical tradeoff between delay and leakage can be achieved by assigning one of two permissible T_ox values to each transistor. In this paper, we propose an algorithm for dual-T_ox assignment to optimize the total leakage power under delay constraints and generate a leakage/delay tradeoff curve. As compared to the case where all transistors are set to low T_ox, our approach achieves an average leakage reduction of 86% under 100 nm models and 81% under 70 nm models. We also propose a transistor and pin reordering technique that has minimal layout impact to further reduce the total leakage current up to 12% and I_gate up to 27% without incurring any delay penalty.