Similar Literature
20 similar documents found.
1.
Power consumption is an increasingly pressing problem in modern processor design. Since the on-chip caches usually consume a significant amount of power, they are among the most attractive targets for power reduction. This paper presents a two-level filter scheme, consisting of an L1 and an L2 filter, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial number of unnecessary activities in a conventional cache architecture. We use a single block buffer as the L1 filter to eliminate unnecessary cache accesses. In the L2 filter, we then propose a new sentry-tag architecture to further filter out unnecessary way activities when the L1 filter misses. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme effectively reduces cache power consumption by eliminating most unnecessary cache activities, while the impact on system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the two-level filter yields roughly a 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%.
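The abstract gives only the architectural idea, so the following trace-driven sketch is an illustration rather than the authors' implementation: the L1 single-block buffer absorbs repeated accesses to the same line, and an L2 "sentry" check on a few low-order tag bits reads only the ways that can possibly match. The cache geometry and sentry width below are assumptions.

```python
# Hypothetical model of the two-level filter; geometry and sentry width are
# illustrative assumptions, not the paper's configuration.
NUM_SETS, WAYS, LINE_BYTES, SENTRY_BITS = 256, 2, 32, 2

def simulate(trace):
    """trace: iterable of byte addresses; returns (cache_accesses, way_reads)."""
    tags = [[None] * WAYS for _ in range(NUM_SETS)]    # full tag per set/way
    lru = [list(range(WAYS)) for _ in range(NUM_SETS)]
    buffered_line = None                               # L1 filter: one block buffer
    cache_accesses = way_reads = 0
    mask = (1 << SENTRY_BITS) - 1

    for addr in trace:
        line = addr // LINE_BYTES
        if line == buffered_line:                      # filtered: no cache access
            continue
        buffered_line = line
        cache_accesses += 1
        s, tag = line % NUM_SETS, line // NUM_SETS
        # L2 filter: only ways whose sentry bits match are actually read.
        candidates = [w for w in range(WAYS) if tags[s][w] is not None
                      and (tags[s][w] & mask) == (tag & mask)]
        way_reads += len(candidates)
        hit = next((w for w in candidates if tags[s][w] == tag), None)
        if hit is None:                                # miss: fill the LRU way
            hit = lru[s][0]
            tags[s][hit] = tag
        lru[s].remove(hit)
        lru[s].append(hit)
    return cache_accesses, way_reads
```

Comparing `way_reads` against `cache_accesses * WAYS` gives a rough count of the data-way activations the sentry filter avoids.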

2.
Most microprocessors employ on-chip caches to bridge the performance gap between the processor and main memory. However, cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", this paper proposes a zero-aware SRAM cell with an asymmetric inverter pair, called the ZA cell, to minimize the cache power consumed in writing "0". The ZA cell is a circuit-level technique, which is software-independent and orthogonal to architecture-level low-power techniques. Compared to the conventional SRAM cell, experimental results based on SPEC2000 and MediaBench traces show that, without compromising either performance or stability, the ZA cell reduces the average cache write power consumption by over 60% for both the baseline instruction and data caches. The ZA cell is particularly attractive for data caches, which exhibit a high write-zero rate.
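The ZA cell itself is a transistor-level change, so there is nothing to execute; the short sketch below only illustrates the motivating measurement, i.e. how often values written to the cache are zero. The trace format is an assumption.

```python
def write_zero_rate(written_words):
    """Fraction of written words that are exactly zero (the ZA cell's cheap case)."""
    words = list(written_words)
    return sum(w == 0 for w in words) / len(words)

# Toy store trace dominated by zero writes, as the abstract describes.
print(write_zero_rate([0, 0, 7, 0, 0, 0, 42, 0]))   # -> 0.75
```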

3.
A low-power instruction cache combining tag pre-access and way-selection history
The instruction cache is one of the main power-consuming components of a processor. We observe that during sequential execution, accessing the same cache line requires only one tag-memory access, so the tag memory has many idle cycles. Our method uses these idle cycles to pre-access the tag of the next sequential cache line, obtaining its hit and way-selection information in advance; when the instructions of that next line are actually fetched, the data memories of the unselected ways need not be accessed, given the tag information already obtained for that line. A further advantage of pre-accessing the tag memory is that a way-prediction algorithm can be added to reduce tag-memory accesses. To reduce cache accesses on short-distance jumps, a circular history buffer (CHB) retains part of the way-selection results so that the cache-line information of jump targets is also available. The method incurs no performance loss, is simple to implement in hardware, and has a small hardware cost. It has been applied in a 250 MHz RISC processor.
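As a reading aid, here is a hedged sketch of the tag pre-access idea described above; the cache geometry and bookkeeping are assumptions, not the processor's RTL.

```python
class TagPreAccessICache:
    def __init__(self, num_sets=128, ways=4):
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.num_sets, self.ways = num_sets, ways
        self.predicted = {}              # line -> hit way, filled by pre-access

    def pre_access(self, next_line):
        """Run in an idle tag-array cycle while the current line streams out."""
        s, tag = next_line % self.num_sets, next_line // self.num_sets
        for w in range(self.ways):
            if self.tags[s][w] == tag:
                self.predicted[next_line] = w   # hit and way known in advance
                return

    def fetch(self, line):
        """Return (data_ways_enabled, hit_way_or_None)."""
        if line in self.predicted:              # no tag access needed now:
            return 1, self.predicted.pop(line)  # enable only the selected way
        s, tag = line % self.num_sets, line // self.num_sets
        for w in range(self.ways):
            if self.tags[s][w] == tag:
                return self.ways, w             # all data ways were enabled
        return self.ways, None                  # miss (fill logic omitted)
```

A fetch unit would call `pre_access(line + 1)` while line `line` streams out, so the sequential `fetch(line + 1)` enables a single data way.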

4.
Owing to their powerful error-correcting performance, turbo codes have been adopted in many wireless communication standards such as W-CDMA and CDMA2000. Although several low-power techniques have been proposed, power consumption remains a major issue in practical implementations. Since turbo decoding is a memory-intensive algorithm, reducing memory accesses is crucial to achieving a low-power design. To reduce the number of memory accesses in maximum a posteriori (MAP) decoding, this paper proposes an approximate reverse calculation method that can be implemented with simple arithmetic operations such as addition and comparison. Simulation results show that the proposed method, applied to the W-CDMA standard, reduces the access rate of the backward-metric memory by 87% without degrading error-correcting performance. A prototype log-MAP decoder based on the proposed reverse calculation achieves a 29% power reduction compared to a conventional decoder that does not use reverse calculation.
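The paper's approximate reverse-calculation formula is not given in the abstract, so the sketch below substitutes a plainly named stand-in with the same goal of cutting backward-metric memory traffic: checkpointing the max-log-MAP backward recursion, storing only every Dth β vector and regenerating the rest with additions and comparisons. The 2-state toy trellis and checkpoint spacing are assumptions.

```python
# Stand-in for the paper's method: checkpointed max-log-MAP backward
# recursion on an assumed toy 2-state trellis (next state = s XOR input bit).
D = 8                                          # checkpoint spacing (assumed)

def beta_step(beta_next, g_k):
    """beta_k[s] = max_b( beta_{k+1}[s ^ b] + g_k[b] ): adds and compares only."""
    return [max(beta_next[s ^ b] + g_k[b] for b in (0, 1)) for s in (0, 1)]

def checkpointed_betas(g):
    """g[k][b]: branch metric for input bit b at trellis step k."""
    ckpt, beta = {len(g): [0.0, 0.0]}, [0.0, 0.0]
    for k in range(len(g) - 1, -1, -1):        # one backward sweep
        beta = beta_step(beta, g[k])
        if k % D == 0:
            ckpt[k] = beta                     # only 1/D of the betas hit memory
    return ckpt

def beta_at(k, g, ckpt):
    """Regenerate beta_k from the nearest stored checkpoint at or above k."""
    base = min(c for c in ckpt if c >= k)
    beta = ckpt[base]
    for j in range(base - 1, k - 1, -1):
        beta = beta_step(beta, g[j])
    return beta
```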

5.
This paper explores non-volatile cache memories implemented with spin-transfer torque magnetic random access memories (STT-MRAMs) based on state-of-the-art perpendicular magnetic tunnel junctions (MTJs) and FinFETs. The use of double-barrier MTJs with two reference layers (DMTJs) is benchmarked against solutions relying on single-barrier MTJs (SMTJs) at different technology nodes (from 28 nm down to 20 nm). Our study is carried out through a cross-layer simulation platform, from the device level up to the system level. Our results show that, thanks to the reduced switching currents, DMTJ-based STT-MRAMs decrease write access time by about 63% compared to their SMTJ-based counterparts. This is achieved while assuring reduced energy consumption under both write (−42%) and read (−28%) accesses, lower area occupancy (−40%), and smaller leakage power (−25%), at the sole cost of a longer read access time. This makes DMTJ-based STT-MRAM a promising candidate to replace conventional semiconductor-based cache memory in the next generation of low-power microprocessors with on-chip non-volatility.

6.
Koo S., Kim S., Azougagh D., Cho Y., Maeng S. Electronics Letters, 2006, 42(10):569-571.
A study of the behaviour of general programmes observed that over 50% of the bytes in a data cache are zero-valued. To reduce this waste of zero-valued space in a data cache, an overlapped cache scheme is proposed, which allows one cache line entry to hold up to two cache lines. Experimental results show that, for the SPEC2000 benchmarks, the proposed design reduces cache misses by 29% on average over a conventional direct-mapped cache.
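A hedged sketch of the overlap idea: two lines whose non-zero bytes never collide can share one physical data entry under a per-byte ownership mask. The encoding is an illustrative assumption, not the authors' hardware format.

```python
LINE = 8   # bytes per line (kept small for readability)

def can_overlap(a, b):
    """Two lines fit in one entry if no byte position is non-zero in both."""
    return all(x == 0 or y == 0 for x, y in zip(a, b))

def pack(a, b):
    data = [x if x else y for x, y in zip(a, b)]   # merged byte values
    owner = [0 if x else 1 for x in a]             # which line owns each byte
    return data, owner

def unpack(data, owner, which):
    return [d if o == which else 0 for d, o in zip(data, owner)]

a = [5, 0, 0, 9, 0, 0, 0, 1]
b = [0, 3, 0, 0, 0, 7, 0, 0]
assert can_overlap(a, b)
data, owner = pack(a, b)
assert unpack(data, owner, 0) == a and unpack(data, owner, 1) == b
```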

7.
Chen X., Bajwa H. Electronics Letters, 2007, 43(1):12-13.
A low-leakage-current, area-efficient dual-port cache design is presented, which uses isolation nodes and local sense amplifiers to support dual-port accesses without duplicating the bit lines for the second port. Compared with conventional hardwired dual-port cache designs, the average bit-line leakage current can be reduced by 50%.

8.
Research on a low-power cache design technique
The design of a low-power, high-performance cache system is key to embedded DSP chip design. Based on the design practice of the multimedia DSP chip MD32, this paper proposes a method that uses read/write buffers as a level-zero cache to reduce the number of reads and writes to the data and instruction caches; since reading the buffers consumes far less power than accessing the on-chip caches, the cache-related power consumption is reduced. Validation with a variety of multimedia benchmark programs shows that the technique reduces the number of reads to the instruction or data cache by 20%-40%, trading a small increase in chip area for a substantial reduction in power consumption.
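As an illustration only (the MD32 buffer organisation is not described in the abstract), the following sketch models a small read buffer acting as a level-zero cache in front of the instruction or data cache and counts how many reads still reach the real cache. The buffer size and line size are assumptions.

```python
class L0Buffer:
    def __init__(self, entries=4, line_bytes=16):
        self.lines = []                 # FIFO of recently fetched line ids
        self.entries, self.line_bytes = entries, line_bytes
        self.cache_reads = 0            # reads that reached the real cache

    def read(self, addr):
        line = addr // self.line_bytes
        if line in self.lines:
            return                      # served by the buffer: cache stays idle
        self.cache_reads += 1           # fetch the line from the on-chip cache
        self.lines.append(line)
        if len(self.lines) > self.entries:
            self.lines.pop(0)
```

Feeding an address trace through `read()` and comparing `cache_reads` with the trace length estimates the kind of 20%-40% access reduction reported above.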

9.
A 1-V, 16-KB (L2) / 2-KB (L1) four-way set-associative cache was fabricated in a 0.25-μm CMOS technology for future low-power, high-speed microprocessors. An effective latency of 6.9 ns and power consumption of 10 mW at 100 MHz are obtained at a supply voltage of 1 V. This performance is achieved by using a new separated bit-line memory hierarchy architecture (SBMHA), which shortens latency and reduces power consumption, and domino tag comparators (DTCs), which reduce the power dissipated in tag comparisons.

10.
A cache memory and memory management unit (CAMMU) chip with a 6-ns cycle time and 7.7-ns access time has been developed. The circuit includes two 5-ns 128-kb cache memories, two 4-ns 64-entry fully associative translation lookaside buffers (TLBs), two 4-ns 64-line tag RAMs, comparators, registers, and control logic. The TLB design contains a line encoder and valid bits with flash clear. Timing control allows read, write, associative, and invalid-search accesses with identical timings. The two caches time-share data input and sense-amplifier circuits for improved density, and they are pipelined to allow a new access to start before the previous access completes.

11.
With the scaling of device dimensions, microscopic variations in the number and location of dopant atoms in the channel region induce increasingly limiting deviations in device characteristics such as threshold voltage. These atomic-level intrinsic fluctuations cannot be eliminated by external control of the manufacturing process and are most pronounced in the minimum-geometry transistors commonly used in area-constrained circuits such as SRAM cells. Consequently, a large number of cells in a memory are expected to be faulty due to process variations in sub-50-nm technologies. This paper analyzes SRAM cell failures under process variation and proposes a new variation-aware cache architecture suitable for high-performance applications. The proposed architecture adaptively resizes the cache to avoid faulty cells, thereby improving yield. The scheme is transparent to the processor architecture and has negligible energy and area overhead. Experimental results on a 32-KB direct-mapped L1 cache show that the proposed architecture achieves 93% yield compared to the original 33%. SimpleScalar simulation shows that designing the data and instruction caches using the proposed architecture results in 1.5% and 5.7% average CPU performance loss (over the SPEC2000 benchmarks), respectively, for chips with the maximum number of faulty cells that the proposed scheme can tolerate.
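A hedged sketch of the adaptive-resizing idea: given a per-chip fault map (e.g. from built-in self-test), accesses that index a faulty line are redirected to a healthy partner line, so the cache behaves like a smaller but fully working one. The pairing rule below is an illustrative assumption, not the paper's remapping circuit.

```python
class ResizableCache:
    def __init__(self, num_lines, faulty):
        self.num_lines = num_lines
        self.faulty = set(faulty)               # per-chip fault map (assumed input)

    def effective_line(self, line_index):
        if line_index not in self.faulty:
            return line_index
        partner = line_index ^ (self.num_lines // 2)   # assumed pairing rule
        if partner not in self.faulty:
            return partner                      # cache transparently "resized"
        raise RuntimeError("both paired lines faulty: yield loss")

c = ResizableCache(num_lines=1024, faulty={3, 700})
print(c.effective_line(3))   # -> 515, redirected to its healthy partner line
```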

12.
This paper proposes a block-conflict prediction method for direct-mapped caches, which dynamically predicts conflicts from the recent replacement behavior of cache blocks. Based on this method, we design a cache organization, the conflict-prediction cache, whose main body consists of a direct-mapped cache and a smaller fully associative cache; a conflict prediction table governs the dynamic allocation of cache blocks. SPEC95 simulation results for an on-chip data cache show that, compared with a 16 kB direct-mapped cache, an (8+1) kB conflict-prediction cache improves the hit rate by 12.2% on average. Compared with caches of similar structure (such as the NTS cache and the PCS cache), it reduces hardware overhead, simplifies the control logic, and is easy to implement, while also improving hit rate, bus traffic, and other performance metrics.
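The sketch below is an assumed reading of the mechanism, not the authors' design: a per-set conflict-prediction entry is set when a set suffers a replacement, and blocks that map to a predicted-conflicting set are allocated in the small fully associative cache instead. Sizes and the LRU policy are illustrative.

```python
class ConflictPredictionCache:
    def __init__(self, dm_sets=128, fa_entries=16):
        self.dm = [None] * dm_sets            # direct-mapped tag per set
        self.fa = []                          # small fully associative part (LRU)
        self.conflict = [False] * dm_sets     # conflict prediction table
        self.dm_sets, self.fa_entries = dm_sets, fa_entries
        self.hits = self.accesses = 0

    def access(self, line):
        self.accesses += 1
        s, tag = line % self.dm_sets, line // self.dm_sets
        if self.dm[s] == tag:
            self.hits += 1
            return
        if line in self.fa:
            self.hits += 1
            self.fa.remove(line)              # refresh LRU position
            self.fa.append(line)
            return
        if self.conflict[s]:                  # predicted conflict: allocate in FA
            self.fa.append(line)
            if len(self.fa) > self.fa_entries:
                self.fa.pop(0)
        else:
            if self.dm[s] is not None:        # replacement observed here:
                self.conflict[s] = True       # predict future conflicts
            self.dm[s] = tag
```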

13.
Efficient use of an optimized custom memory hierarchy to exploit temporal locality in data accesses can have a very large impact on the power consumption of data-dominated applications. Past experiments have demonstrated that this task is crucial in a complete low-power memory management methodology, but effective formalized techniques for this specific task have not yet been addressed. In this paper, the surprisingly large design freedom available for the basic problem is explored in depth and the outline of a systematic solution methodology is proposed. The efficiency of the methodology is illustrated on a real-life motion estimation application. The results obtained for this application show power reductions of about 85% for the memory subsystem compared to the case without a custom memory hierarchy. These large gains justify taking data-reuse and memory-hierarchy decisions early in the design flow.

14.
In this paper, we present a new adaptive error-cancellation (AEC) technique, denoted multi-input multi-output (MIMO)-AEC, for the design of low-power MIMO signal processing systems. The MIMO-AEC technique builds on the previously proposed AEC technique by employing an algorithm transformation denoted the MIMO decorrelating (MIMO-DECOR) transform. MIMO-DECOR reduces complexity by exploiting correlations inherent in MIMO systems, thereby improving the energy efficiency of AEC. The proposed MIMO-AEC enables energy minimization of MIMO systems by correcting transient/soft errors that arise in very-large-scale-integration signal processing implementations due to inherent process non-idealities and/or aggressive low-power design styles such as voltage overscaling. We employ MIMO-AEC in the design of a low-power Gigabit Ethernet 1000Base-T device. Simulation results indicate 69.1%-64.2% energy savings over optimally voltage-scaled present-day systems with no loss in algorithmic performance.

15.
This paper describes a low-power Intel Architecture (IA) processor specifically designed for Mobile Internet Devices (MID) with performance similar to mainstream Ultra-Mobile PCs. The design relies on high residency in a new low-power state in order to keep average power and idle power below 220 and 80 mW, respectively. The design consists of an in-order pipeline capable of issuing 2 instructions per cycle and supporting 2 threads, 32 KB instruction and 24 KB data L1 caches, independent integer and floating-point execution units, an x86 front-end execution unit, a 512 KB L2 cache, and a 533 MT/s dual-mode (GTL and CMOS) front-side bus (FSB). The design contains 47 million transistors in a die size under 25 mm², manufactured in a 9-metal 45 nm CMOS process with transistors optimized for low leakage. Maximum thermal design power (TDP) is measured at 2 W at 1.0 V and 90°C using a synthetic power-virus test at a frequency of 1.86 GHz.

16.
Data races are a major cause of concurrency bugs in multi-core programs. To address the high hardware cost of happens-before detection proposals, a lightweight hardware data-race detection approach based on sliding windows is proposed. It uses sliding windows to hold each thread's most recent memory instructions and dynamically detects data races with small race distance, which are the most likely to lead to concurrency bugs. Based on race distance, parallel thread segments are subdivided into concurrent race regions with locks and concurrent race regions without locks. A pair of alternating rewritable sliding windows stores the memory instructions of the lock-free concurrent race regions, and a variable-size sliding window stores the memory instructions of the lock-protected concurrent race regions. A data race is detected when a remote shared access conflicts with the memory accesses recorded in the sliding windows. In the hardware implementation, the addresses of the data in the sliding windows are automatically encoded into three small hardware signatures, so data races can be detected quickly without modifying the L1 cache or the cache-coherence protocol messages. The approach provides efficient guidance for diagnosing concurrency bugs during both development and production runs of multi-core programs, with small hardware and bandwidth overhead.
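The region-splitting and window-pairing details are summarised above; the sketch below illustrates only the signature mechanics under assumed sizes: each window's addresses are hashed into a small Bloom-style signature, and a remote access that tests positive against the window is flagged as a possible race (false positives are possible, false negatives are not).

```python
import hashlib

W = 64                                   # window size (assumption)
SIG_BITS = 256                           # signature width (assumption)

def _bit_positions(addr, k=3):
    """Hash an address into k signature bit positions."""
    h = hashlib.sha256(str(addr).encode()).digest()
    return [int.from_bytes(h[4 * i:4 * i + 4], "big") % SIG_BITS for i in range(k)]

class Window:
    def __init__(self):
        self.accesses, self.sig = [], 0

    def record(self, addr):
        self.accesses.append(addr)
        if len(self.accesses) > W:       # slide: rebuild the signature (crudely
            self.accesses.pop(0)         # models the paper's alternating windows)
            self.sig = 0
            for a in self.accesses:
                for b in _bit_positions(a):
                    self.sig |= 1 << b
        else:
            for b in _bit_positions(addr):
                self.sig |= 1 << b

    def may_conflict(self, remote_addr):
        """Signature membership test on a remote shared access."""
        return all(self.sig >> b & 1 for b in _bit_positions(remote_addr))
```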

17.
Circuit and microarchitectural techniques for reducing cache leakage power
On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time the activity in a data cache is centered on only a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state-preserving, low-power drowsy mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of the drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce the leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. To enable drowsy instruction caches, we propose a technique called cache subbank prediction, which selectively wakes up only the necessary parts of the instruction cache while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.
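A minimal policy sketch of the simple periodic policy family the paper investigates; the window size, initial state, and trace format are assumptions. Counting drowsy line-cycles gives a first-order estimate of the leakage saved, and counting wake-ups bounds the performance cost.

```python
def simulate(trace, num_lines, window=2000):
    """trace: sorted list of (cycle, line_index); returns (wakeups, drowsy_line_cycles)."""
    awake = set()                        # assume all lines start drowsy
    wakeups = drowsy_line_cycles = last = 0
    next_drowse = window
    for cycle, line in trace:
        while cycle >= next_drowse:      # periodic event: drowse every line
            drowsy_line_cycles += (num_lines - len(awake)) * (next_drowse - last)
            last, awake = next_drowse, set()
            next_drowse += window
        drowsy_line_cycles += (num_lines - len(awake)) * (cycle - last)
        last = cycle
        if line not in awake:            # touching a drowsy line: wake-up penalty
            wakeups += 1
            awake.add(line)
    return wakeups, drowsy_line_cycles
```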

18.
A novel scheme for low-power image coding and decoding based on classified vector quantisation is presented. The main idea is to replace accesses to large background memories (the most power-consuming operations) with arithmetic and/or application-specific computations. Specifically, the proposed image coding scheme uses small sub-codebooks to reduce the memory requirements and memory-related power consumption in comparison with classical vector quantisation schemes. By applying simple transformations to the codewords during coding, the proposed scheme extends the small sub-codebooks, compensating for the quality degradation introduced by their small size. Thus, the main coding task becomes computation-based rather than memory-based, leading to a significant reduction in power consumption. The proposed scheme achieves image quality comparable with, or better than, that of traditional vector quantisation schemes, since the parameters of the transformations depend on the image block under coding and the small sub-codebooks are dynamically adapted to each specific image block. The main disadvantage of the proposed scheme is a decrease in compression ratio in comparison with classical vector quantisation. A joint quality/compression-ratio optimisation procedure is used to keep this side-effect as small as possible.
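A hedged sketch of the coding step: the small sub-codebook is virtually enlarged by cheap per-block transforms of its codewords (mean shift and scaling here, an illustrative choice rather than the authors' transform set), and the encoder sends the best codeword index together with the transform parameters.

```python
def encode_block(block, codebook, scales=(0.5, 1.0, 2.0)):
    """Pick the (codeword, transform) pair minimising squared error."""
    mean = sum(block) / len(block)
    best = None
    for ci, cw in enumerate(codebook):
        cw_mean = sum(cw) / len(cw)
        for s in scales:
            # transformed codeword: scaled about its mean, shifted to the block mean
            t = [s * (c - cw_mean) + mean for c in cw]
            err = sum((b - x) ** 2 for b, x in zip(block, t))
            if best is None or err < best[0]:
                best = (err, ci, s, mean)
    _, ci, s, mean = best
    return ci, s, mean          # index plus transform parameters
```

The decoder reapplies the same transform to `codebook[ci]`, which is why quality tracks the block while the stored codebook stays small; the extra transmitted parameters are what lowers the compression ratio.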

19.
Integrated layer processing (ILP) is an implementation concept that "permits the implementor the option of performing all the (data) manipulation steps in one or two integrated processing loops". To estimate the achievable benefits of ILP, a file transfer application with an encryption function on top of a user-level TCP has been implemented, and the performance of the application in terms of throughput and packet processing times has been measured. The results show that it is possible to obtain performance benefits by integrating marshalling, encryption, and TCP checksum calculation. The experiments yielded a throughput gain of only 10-20%, in contrast to the 50% gain achieved in simple loop experiments. Simulations of memory accesses and cache hit rate show that the main benefit of ILP is reduced memory accesses rather than an improved cache hit rate. ILP reduced the number of memory accesses by up to 30% in the experiment, but the relative number of cache misses could not be reduced compared to a carefully designed non-ILP implementation. The results also show that data manipulation characteristics may significantly influence the cache behavior and the achievable performance gain of ILP. Considering these results, ILP can only be recommended in cases where the ILP loop consists of several, but very simple, data manipulations without complex calculations over the data.
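To make the single-loop idea concrete, here is a toy sketch with stand-ins (XOR for the cipher, a simplified 16-bit additive checksum rather than the real TCP checksum): the integrated loop touches each byte once, where a layered implementation would traverse the buffer separately for marshalling, encryption, and checksumming.

```python
def integrated_send(payload: bytes, key: int):
    """One integrated pass: 'encrypt', 'marshal', and checksum each byte once."""
    out = bytearray(len(payload))
    checksum = 0
    for i, b in enumerate(payload):
        e = b ^ (key & 0xFF)               # stand-in encryption step
        out[i] = e                         # stand-in marshalling into send buffer
        checksum = (checksum + e) & 0xFFFF # running simplified checksum
    return bytes(out), checksum
```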

20.
This paper presents a new data cache design, cache-processor coupling, which tightly binds an on-chip data cache to a microprocessor. Parallel architectures and high-speed circuit techniques are developed to speed up the address-handling process associated with accessing the data cache; they reduce the address-handling time by 51%. In addition, newly proposed instructions increase the data cache bandwidth eightfold. Excessive power consumption due to the wide-bandwidth data transfer is carefully avoided by newly developed circuit techniques, which reduce the dissipated power per bit to 1/26. A simulation study of the proposed architecture and circuit techniques yields a 1.8-ns delay each for address handling, cache access, and register access for a 16-kilobyte direct-mapped cache in a 0.4-μm CMOS design rule.
