首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in return, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache  相似文献   

2.
In this brief, we propose a two-level don't-care gating (DCG) scheme that aims to reduce the ternary content-addressable memory (TCAM) power dissipated in the search-line switching activity. By exploiting the vertically continuous ldquodon't-carerdquo feature, the two-level DCG scheme can largely reduce the average search-line power consumption during a switch pattern. In addition, we also use the search enable technique to eliminate the unnecessary search-line switching activity in the quiet pattern. By reducing both the search-line switching activity and average switching power, the proposed design can minimize the TCAM search-line power consumption. For a 128 times 32 TCAM, the best configuration we examined shows that when the first-level and second-level gating granularities are 16 and 8, respectively, with a 9% search performance improvement, the two-level DCG scheme can achieve 70% search-line energy reduction.  相似文献   

3.
本文提出了一种直接映象式高速缓存块冲突预测方法,即借助于高缓块的最近替换行为动态预测冲突发生.基于该方法,我们设计了一种高缓结构-冲突预测高缓,主体为一个直接映象式高缓和一个较小的全相联高缓,利用冲突预测表实行高缓块的动态分配.应用于片上数据高缓的SPEC95 仿真结果表明,与16kB直接映象式高缓相比,(8+1)kB冲突预测高缓命中率平均可提高12.2%,与类似结构的高缓(如NTS高缓、PCS高缓等)相比降低了硬件开销,简化了控制机构,易于实现,并且在命中率和总线交通量等性能方面都有所提高.  相似文献   

4.
作为计算量最多的模块之一,运动补偿占用了解码器与片外数据存储器之间约70%的带宽,是实现超高清视频解码的瓶颈。通过所设计的基于Cache的HEVC运动补偿模块,在保证实时解码数据吞吐量的同时,有效减少了80%的带宽消耗。首先,利用由可复用滤波器构成的插值计算模块和2D Cache设计了可并行化流水线数据处理的运动补偿模块,满足计算过程中高数据吞吐量需求。其次,设计高效内部存储器RAM结构,并提出片内Cache功耗降低的有效解决方案。最后,利用了参考帧数据相关性,设计插值顺序重排,将Cache的硬件开销减少了87.5%。基于HM9.0的HEVC标准测试视频序列实验结构表明,该设计显著地减少了带宽消耗和硬件开销。  相似文献   

5.
In a typical embedded CPU, large on-chip storage is critical to meet high performance requirements. However, the fast increasing size of the on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU’s chip is generally considered not worthwhile because DRAM is not compatible with the common CMOS logic and requires additional processing steps beyond what is required for CMOS. However a special DRAM technology, Gain-Cell embedded-DRAM (GC-eDRAM)  [1], [2], [3] is logic compatible and retains some of the good properties of DRAM (small and low power). In this paper we evaluate the performance of a novel hybrid cache memory where the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells while the tag array continues to use SRAM cells. Our evaluation of this cache demonstrates that, compared to the conventional SRAM-based designs, our novel architecture exhibits comparable performance with less energy consumption and smaller silicon area, enabling the sustainable on-chip storage scaling for future embedded CPUs.  相似文献   

6.
A low-power I-cache architecture is proposed that is appropriate for embedded low-power processors. Unlike existing schemes, the proposed organisation places an extra small cache in parallel alongside the L1 cache. Since it allows simultaneous accesses to both caches, the proposed scheme introduces little performance degradation. Using simple hardware logic (for sequential accesses) and a compiler transformation (for loop accesses), most L1 cache requests are served by a small cache, so that the amount of energy consumed by the L1 cache is significantly reduced. Experimental results show that for the SPEC95 benchmarks, the proposed organisation reduces the energy-delay product on average by 67.2% over a conventional cache design and 16.8% over the filter cache design  相似文献   

7.
On-chip L1 and L2 caches represent a sizeable fraction of the total power consumption of microprocessors. In nanometer-scale technology, the subthreshold leakage power is becoming one of the dominant total power consumption components of those caches. In this study, we present optimization techniques to reduce the subthreshold leakage power of on-chip caches assuming that there are multiple threshold voltages, V/sub T/'s, available. First, we show a cache leakage optimization technique that examines the tradeoff between access time and subthreshold leakage power by assigning distinct V/sub T/'s to each of the four main cache components-address bus drivers, data bus drivers, decoders, and static random access memory (SRAM) cell arrays with sense amplifiers. Second, we show optimization techniques to reduce the leakage power of L1 and L2 on-chip caches without affecting the average memory access time. The key results are: 1) two additional high V/sub T/'s are enough to minimize leakage in a single cache-3 V/sub T/'s if we include a nominal low V/sub T/ for microprocessor core logic; 2) if L1 size is fixed, increasing L2 size can result in much lower leakage without reducing average memory access time; 3) if L2 size is fixed, reducing L1 size may result in lower leakage without loss of the average memory access time for the SPEC2K benchmarks; and 4) smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting the average memory access time.  相似文献   

8.
一种低功耗Cache设计技术的研究   总被引:2,自引:0,他引:2  
低功耗、高性能的cache系统设计是嵌入式DSP芯片设计的关键。本文在多媒体处理DSP芯片MD32的设计实践中,提出一种利用读/写缓冲器作为零级cache,减少对数据、指令cache的读/写次数,由于缓冲器读取功耗远远小于片上cache,从而减小cache相关功耗的方法。通过多种多媒体处理测试程序的验证,该技术可减少对指令cache或者数据cache20%~40%的读取次数,以较小芯片面积的增加换取了较大的功耗降低。  相似文献   

9.
We propose a new CAM architecture for the large-scale integration and low-power operation of a network router application. This CAM reduces entry count by an average of 52%, using a newly developed one-hot-spot block code. This code eliminates redundancy in a memory cell and improves the efficiency of IP address compression. To implement the proposed code, a hierarchical match-line structure and an on-chip entry compression/extraction scheme are introduced. With this architecture, a search-depth control scheme deactivates unnecessary search lines and reduces power consumption by 45%. Using a DRAM cell, our new content addressable memory (CAM) can achieve 1.5 million entries in 0.13-/spl mu/m technology, which is six times more than a conventional static ternary CAM.  相似文献   

10.
Presents a 32-b RISC microcontroller having an on-chip dc-dc converter. The chip standby power including the dc-dc converter is reduced to 63 μW with a new hybrid regulator scheme in which the microcontroller selects a switching regulator in active mode and a series regulator in standby mode. The achieved standby power corresponds to only about 1% of the standby power with a conventional scheme which always uses a switching regulator. Off-chip power transistor-type configuration suppresses pin-count increase caused by implementing an on-chip switching regulator. A series regulator that could be used instead of the switching regulator in active mode was also implemented on the chip. The series regulator does not impact the chip size because it is divided and inserted into I/O area that was originally left unused  相似文献   

11.
In this paper, we present the characterization and design of energy-efficient, on chip cache memories. The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and bit array dissipate comparable power. To optimize performance and power in a processor's cache, a multidivided module (MDM) cache architecture is proposed to conserve energy in the bit array as well as the memory peripheral circuits. Compared to a conventional, nondivided, 16-kB cache, the latency and power of the MDM cache are reduced by a factor of 1.9 and 4.6, respectively. Based on the MDM cache architecture, the energy efficiency of the complete memory hierarchy is analyzed with respect to cache parameters in a multilevel processor cache design. This analysis was conducted by executing the SPECint92 benchmark programs with the miss ratios for reduced instruction set computer (RISC) and complex instruction set computer (CISC) machines  相似文献   

12.
This paper presents a new data cache design, cache-processor coupling, which tightly binds an on-chip data cache with a microprocessor. Parallel architectures and high-speed circuit techniques are developed for speeding address handling process associated with accessing the data cache. The address handling time has been reduced by 51% by these architectures and circuit techniques. On the other hand, newly proposed instructions increase data cache bandwidth by eight times. Excessive power consumption due to the wide-bandwidth data transfer is carefully avoided by newly developed circuit techniques, which reduce dissipation power per bit to 1/26. Simulation study of the proposed architecture and circuit techniques yields a 1.8 ns delay each for address handling, cache access, and register access for a 16 kilobyte direct mapped cache with a 0.4 μm CMOS design rule  相似文献   

13.
Dual on-chip 512-KB unified second level (L2) caches for an UltraSparc processor are implemented using 0.13-/spl mu/m technology. Each 512-KB unit is implemented using 34 million transistors to achieve 1.4 GHz and 2.6 W at 1.3 V and 85/spl deg/C. This fully integrated subsystem is composed of conventional data and tag SRAMs along with datapaths, controller, and test engines. The unit achieves one of the shortest on-chip L2 cache latencies reported for 64-bit microprocessors, with a data latency of only four cycles including ECC correction for 128-bit data. In addition, balanced custom and automated design methodologies are used to achieve the aggressive design cycle. Architectural and physical design solutions to build this integrated short latency L2 cache are discussed.  相似文献   

14.
研制了一款可编程6阶巴特沃斯有源RC滤波器.为提高滤波器中运算放大器的增益带宽积,设计了一种新型的前馈补偿运算放大器.为消除工艺偏差和环境变化对截止频率的影响,设计了一种片上数字控制频率调谐电路,并采用TSMC 0.18 μm CMOS工艺进行了流片.滤波器采用低通滤波结构,测试结果表明,3 dB截止频率为1~32 MHz,步进1 MHz,带内增益0 dB,带内纹波0.8 dB,2倍带宽处带外抑制不小于24 dBc,5倍带宽处带外抑制不小于68 dBc,滤波器等效输入噪声为340 nV/√Hz@1MHz,调谐误差为±3%.滤波器裸芯片面积0.87 mm×1.05 mm.采用1.8V电源电压,滤波器整体功耗小于20 mW.  相似文献   

15.
This paper discusses signature caching strategies to reduce power consumption for wireless broadcast and filtering services. The two-level signature scheme is used for indexing the information frames. A signature is considered as the basic caching entity in this paper. Four caching policies are compared in terms of tune-in time and access time. With reasonable access time delay, all of the caching policies reduce the tune-in time for the two-level signature scheme. Moreover, two cache replacement policies are presented and compared by simulation. The result shows that, when the cache size is small, caching only the integrated signatures is recommended. When the size of cache is greater than that of the integrated signatures, caching both of the integrated and simple signatures is better.  相似文献   

16.
This paper aims at finding fundamental design principles for hierarchical Web caching. An analytical modeling technique is developed to characterize an uncooperative two-level hierarchical caching system where the least recently used (LRU) algorithm is locally run at each cache. With this modeling technique, we are able to identify a characteristic time for each cache, which plays a fundamental role in understanding the caching processes. In particular, a cache can be viewed roughly as a low-pass filter with its cutoff frequency equal to the inverse of the characteristic time. Documents with access frequencies lower than this cutoff frequency have good chances to pass through the cache without cache hits. This viewpoint enables us to take any branch of the cache tree as a tandem of low-pass filters at different cutoff frequencies, which further results in the finding of two fundamental design principles. Finally, to demonstrate how to use the principles to guide the caching algorithm design, we propose a cooperative hierarchical Web caching architecture based on these principles. Both model-based and real trace simulation studies show that the proposed cooperative architecture results in more than 50% memory saving and substantial central processing unit (CPU) power saving for the management and update of cache entries compared with the traditional uncooperative hierarchical caching architecture.  相似文献   

17.
采用0.18μm/1.8V 1P6M数字CMOS工艺设计并实现了一种用于高性能32位RISC微处理器的64kb四路组相联片上高速缓冲存储器(cache).当采用串行访问方式时,该四路组相联cache的功耗比采用传统并行访问方式在cache命中时降低26%,在cache失效时降低35%.该cache的设计中还采用了高速电路模块如高速电流灵敏放大器和分裂式动态tag比较器等来提高电路工作速度.电路仿真结果显示cache命中时从时钟输入到数据输出的延时为2.7ns.  相似文献   

18.
Coarse-grained reconfigurable architectures (CGRAs) require many processing elements (PEs) and a configuration memory unit (configuration cache) for reconfiguration of its PE array. Although this structure is meant for high performance and flexibility, it consumes significant power. Specially, power consumption by configuration cache is explicit overhead compared to other types of intellectual property (IP) cores. Reducing power is very crucial for CGRA to be more competitive and reliable processing core in embedded systems. In this paper, we propose a reusable context pipelining (RCP) architecture to reduce power-overhead caused by reconfiguration. It shows that the power reduction can be achieved by using the characteristics of loop pipelining, which is a multiple instruction stream, multiple data stream (MIMD)-style execution model. RCP efficiently reduces power consumption in configuration cache without performance degradation. Experimental results show that the proposed approach saves much power even with reduced configuration cache size. Power reduction ratio in the configuration cache and the entire architecture are up to 86.33% and 37.19%, respectively, compared to the base architecture.  相似文献   

19.
Most microprocessors employ the on-chip caches to bridge the performance gap between the processor and the main memory. However, the cache accesses usually contribute significantly to the total power consumption of the chip. Based on the observation that an overwhelming majority of the values written to the cache are "0", in this paper we propose a zero-aware SRAM cell with an asymmetric inverter pair, called ZA cell, to minimize the cache power consumption in writing "0". The ZA cell uses a circuit-level technique, which is software independent and orthogonal to other low-power techniques at architecture-level. Compared to the conventional SRAM cell, the experimental results based on the SPEC2000 and MediaBench traces show that without compromise of both performance and stability, the ZA cell can reduce the average cache write power consumption over 60% for both the baseline instruction and data caches. In particular, the ZA cell is attractive in the data caches, which reveal the high write-zero rate.  相似文献   

20.
This paper presents an optimized 3-D Discrete Wavelet Transform (3-DDWT) architecture. 1-DDWT employed for the design of 3-DDWT architecture uses reduced lifting scheme approach. Further the architecture is optimized by applying block enabling technique, scaling, and rounding of the filter coefficients. The proposed architecture uses biorthogonal (9/7) wavelet filter. The architecture is modeled using Verilog HDL, simulated using ModelSim, synthesized using Xilinx ISE and finally implemented on Virtex-5 FPGA. The proposed 3-DDWT architecture has slice register utilization of 5%, operating frequency of 396 MHz and a power consumption of 0.45 W.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号