共查询到20条相似文献,搜索用时 78 毫秒
1.
2.
Yen-Jen Chang Shanq-Jang Ruan Feipei Lai 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2003,11(4):568-580
Power consumption is an increasingly pressing problem in modern processor design. Since the on-chip caches usually consume a significant amount of power, it is one of the most attractive targets for power reduction. This paper presents a two-level filter scheme, which consists of the L1 and L2 filters, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial unnecessary activities in conventional cache architecture. We use a single block buffer as the L1 filter to eliminate the unnecessary cache accesses. In the L2 filter, we then propose a new sentry-tag architecture to further filter out the unnecessary way activities in case of the L1 filter miss. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform the HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme can effectively reduce the cache power consumption by eliminating most unnecessary cache activities, while the compromise of system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the use of a two-level filter can result in roughly 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%. 相似文献
3.
A standard fast programmable logic array (PLA) structure is discussed, with emphasis on its power consumption drawbacks. An exploration of alternatives to this structure leads to a presentation of the architecture and design of a low-power PLA structure used in a digital signal processing (DSP) environment. This PLA achieves low power consumption by a combination of pipelining, use of a NAND-OR configuration, and a simplified addressing scheme. Experimental results for the temperature range of 0 to 70°C indicate that the circuit works as expected in a range extending to at least 3.5 V 相似文献
4.
Prashant Jain Andrew Laffely Wayne Burleson Russell Tessier Dennis Goeckel 《The Journal of VLSI Signal Processing》2004,36(1):27-40
Signal processing algorithms and architectures can use dynamic reconfiguration to exploit variations in signal statistics with the objectives of improved performance and reduced power. Parameters provide a simple and formal way to characterize incremental changes to a computation and its computing mechanism. This paper develops a framework for dynamic parameterization and applies it to MPEG-4 motion estimation. A novel motion estimation architecture facilitates the dynamic variation of parameters to achieve power-compression tradeoffs. Our work shows that parameter variation in motion estimation helps achieve power reduction by an order of magnitude, trading off higher compression for lower power. The magnitude of the tradeoffs depends on the input signal variation. The monitoring of input and output signal statistics and subsequent variation of parameters is accomplished by a hardware controller. To provide the controller with a model of the parameter space and corresponding measures in terms of power and performance, a configuration sample space graph is developed. This graph identifies the parameters which present the best power-performance tradeoffs. The controller samples the operating environment to select the appropriate parameters. This reduces the average power consumption by 40% for 2% loss in compression. Four other signal dependent computations: (1) 2D Discrete Cosine Transform, (2) Lempel-Ziv lossless compression, (3) 3D graphics light rendering, and (4) Viterbi decoding are briefly discussed to demonstrate the applicability of dynamic reconfiguration. 相似文献
5.
Ja Chun Ku Ozdemir S. Memik G. Ismail Y. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(5):592-604
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in return, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache 相似文献
6.
Many architectures have been proposed for rank order and stack filtering. To achieve additional speedup in these structures requires the use of parallel processing techniques such as pipelining and block processing. Pipelining is well understood but few block architectures have been developed for rank order and stack filtering. Block processing is essential for additional speedup when the original architecture has reached the throughput limits caused by the underlying technology. A trivial block structure simply repeats a single input, single output structure to generate a multiple input, multiple output structure. Therefore the architecture can achieve speedups equal to the number of multiple outputs or the block size. However, unlike linear filters, the rank order and stack filter outputs are calculated using comparisons. It is possible to share these comparisons within the block structure and thus substantially reduce the size of the block structure. The authors introduce a systematic method for applying block processing to rank order filters and stack filters. This method takes advantage of shared comparisons within the block structure to generate a block filter with shared substructures whose complexity is reduced by up to one-third compared to the original filter structure times the block size. Furthermore, block processing is important for the generation of low power designs. A block structure can trade its increased speedup for a throughput equal to the original single output architecture but with a significantly lower power requirement. The power reduction in the trivial block structures is limited by the power supply voltage. They demonstrate how block structures with shared substructures allow them to continue decreasing the power consumption beyond the limit imposed by the supply voltage 相似文献
7.
The optimum architecture design and mapping of QRD-RLS adaptive filters can be achieved through filter architecture selections, look-ahead transformations, and hierarchical pipelining/folding transformations. In this paper, a relaxed annihilation-reordering look-ahead (RARL) architecture is proposed, and shown to be more power and area efficient than pipelined processing architecture which was considered the most area efficient. The filters with this architecture are based on relaxed weight-update through filtering approximation, where a filter tap weight is updated upon arrival of every block of input data, and are speeded up with annihilation-reordering look-ahead transformation. As a result of the computational complexity reduction, this architecture does not change the iteration bound and filter clock frequency, and leads to speed up with linear increase in power consumption, while the pipelined processing architectures result in speedup with quadratic increase in power consumption. Upon hardware mapping, this architecture is also more advantageous to achieve low area designs. Two design examples are presented to illustrate mapping optimization using above transformations. These results are important for mapping designs onto ASICs, FPGAs or parallel computing machines. The results show significant improvements in throughput, power consumption and hardware requirement. It is also interesting to show through mathematics and simulations that the RARL QRD-RLS filters have no performance degradation in terms of convergence rate. 相似文献
8.
Kubosawa H. Takahashi H. Ando S. Asada Y. Asato A. Suga A. Kimura M. Higaki N. Miyake H. Sato T. Anbutsu H. Tsuda T. Yoshimura T. Amano I. Kai M. Mitarai S. 《Solid-State Circuits, IEEE Journal of》1998,33(11):1640-1648
We have designed a microprocessor that is based on a single instruction multiple data stream (SIMD) architecture. It features a two-way superscalar architecture for multimedia embedded systems that need to support especially MPEG2 video decoding/encoding and 3DCG image processing. This microprocessor meets all requirements of embedded systems, including (a) MPEG2 (MP@ML) decoding and graphic processing capabilities for three-dimensional images, (b) programming flexibility, and (c) low power consumption and low manufacturing cost. High performance was achieved by enhanced parallel processing capabilities while adopting a SIMD architecture and a two-way superscalar architecture. Programming flexibility was increased by providing 170 dedicated multimedia instructions. Low power consumption was achieved by utilizing advanced process technology and power-saving circuits. The processor supports a general-purpose RISC instruction set. This feature is important, as the processor will have to work as a controller of various target systems. The processor has been fabricated by 0.21-μm CMOS four-metal technology on a 9.84×10.12 mm die. It performs 2.16 GOPS/720 MFLOPS at an operating frequency of 180 MHz, with a power consumption of 1.2 W and a power supply of 1.8 V 相似文献
9.
Junyong Deng Lin Jiang Yun Zhu Xiaoyan Xie Xinchuang Liu Feilong He Shuang Song L.K.John 《半导体学报》2020,(2):42-50
In order to accommodate the variety of algorithms with different performance in specific application and improve power efficiency,reconfigurable architecture has become an effective methodology in academia and industry.However,existing architectures suffer from performance bottleneck due to slow updating of contexts and inadequate flexibility.This paper presents an H-tree based reconfiguration mechanism(HRM)with Huffman-coding-like and mask addressing method in a homogeneous processing element(PE)array,which supports both programmable and data-driven modes.The proposed HRM can transfer reconfiguration instructions/contexts to a particular PE or associated PEs simultaneously in one clock cycle in unicast,multicast and broadcast mode,and shut down the unnecessary PE/PEs according to the current configuration.To verify the correctness and efficiency,we implement it in RTL synthesis and FPGA prototype.Compared to prior works,the experiment results show that the HRM has improved the work frequency by an average of 23.4%,increased the updating speed by 2×,and reduced the area by 36.9%;HRM can also power off the unnecessary PEs which reduced 51%of dynamic power dissipation in certain application configuration.Furthermore,in the data-driven mode,the system frequency can reach 214 MHz,which is 1.68×higher compared with the programmable mode. 相似文献
10.
11.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2009,17(9):1343-1347
12.
《Microelectronics Journal》2015,46(6):551-562
Most commercial Field Programmable Gate Arrays (FPGAs) have limitations in terms of density, speed, configuration overhead and power consumption mostly due to the use of SRAM cells in Look-Up Tables (LUTs), configuration memory and programmable interconnects. Also, hardwired Application Specific Integrated Circuit (ASIC) blocks designed for high performance arithmetic circuits in FPGA reduce the area available for reconfiguration. In this paper, we propose a novel generalized hybrid CMOS-memristor based architecture using stateful-NOR gates as basic building blocks for implementation of logic functions. These logic functions are implemented on memristor nanocrossbar layers, while the CMOS layer is used for selection and connection of memristors. The proposed pipelined architecture combines the features of ASIC, FPGA and microprocessor based designs. It has high density due to the use of nanocrossbar layer and high throughput especially for arithmetic circuits. The proposed architecture for three input one output logic block is compared with conventional LUT based Configurable Logic Block (CLB) having the same number of inputs and outputs; which shows 1.82×area saving, 1.57×speedup and 3.63×less power consumption. The automation algorithm to implement any logic function using proposed architecture is also presented. 相似文献
13.
Moshovos A. Sohi G.S. 《Proceedings of the IEEE. Institute of Electrical and Electronics Engineers》2001,89(11):1560-1575
Semiconductor technology scaling provides faster and more plentiful transistors to build microprocessors, and applications continue to drive the demand for more powerful microprocessors. Weaving the "raw" semiconductor material into a microprocessor that offers the performance needed by modern and future applications is the role of computer architecture. This paper overviews some of the microarchitectural techniques that empower modem high-performance microprocessors. The techniques are classified into: 1) techniques meant to increase the concurrency in instruction processing, while maintaining the appearance of sequential processing and 2) techniques that exploit program behavior. The first category includes pipelining, superscalar execution, out-of-order execution, register renaming, and techniques to overlap memory-accessing instructions. The second category includes memory hierarchies, branch predictors, trace caches, and memory-dependence predictors. The paper also discusses microarchitectural techniques likely to be used in future microprocessors, including data value speculation and instruction reuse, microarchitectures with multiple sequencers and thread-level speculation, and microarchitectural techniques for tackling the problems of power consumption and reliability 相似文献
14.
In this paper, we describe a procedure for memory design and exploration for low power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we try to reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements of area, number of cycles and energy consumption. We include energy in the performance metrics, since for different cache configurations, the variation in energy consumption is quite different from the variation in the number of cycles. The memory exploration procedure is very efficient since it exploits the trends in the cycles and energy characteristics to reduce the search space significantly. 相似文献
15.
《Solid-State Circuits, IEEE Journal of》2008,43(9):2025-2035
16.
In a typical embedded CPU, large on-chip storage is critical to meet high performance requirements. However, the fast increasing size of the on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU’s chip is generally considered not worthwhile because DRAM is not compatible with the common CMOS logic and requires additional processing steps beyond what is required for CMOS. However a special DRAM technology, Gain-Cell embedded-DRAM (GC-eDRAM) [1], [2], [3] is logic compatible and retains some of the good properties of DRAM (small and low power). In this paper we evaluate the performance of a novel hybrid cache memory where the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells while the tag array continues to use SRAM cells. Our evaluation of this cache demonstrates that, compared to the conventional SRAM-based designs, our novel architecture exhibits comparable performance with less energy consumption and smaller silicon area, enabling the sustainable on-chip storage scaling for future embedded CPUs. 相似文献
17.
《Microelectronics Journal》2015,46(11):1020-1032
This paper describes a new memristor crossbar architecture that is proposed for use in a high density cache design. This design has less than 10% of the write energy consumption than a simple memristor crossbar. Also, it has up to 3 times the bit density of an STT-MRAM system and up to 11 times the bit density of an SRAM architecture. The proposed architecture is analyzed using a detailed SPICE analysis that accounts for the resistance of the wires in the memristor structure. Additionally, the memristor model used in this work has been matched to specific device characterization data to provide accurate results in terms of energy, area, and timing. The proposed memory system was analyzed by modeling two different devices that vary in resistance range and switching time. This system does not require that the memristor devices have inherent diode effects which limit alternate current paths. Therefore this system is capable of utilizing a much broader class of devices.An architectural analysis has also been completed that shows how the memory system may perform as a cache memory. A hybrid cache structure was used to alleviate the long write latencies of memristor devices. This approach consisted of the tag array being made of SRAM cells while the data array was made of the memristor circuit proposed. This hybrid scheme allows multiple reads and writes to concurrently access different sub-arrays within a cache. The performance of these novel memristor based caches was compared to SRAM and STT-MRAM based caches through detailed simulations. The results show that the memristor caches are denser and allow better performance along with lower system power when compared to the STT-MRAM and SRAM caches. 相似文献
18.
A new CMOS gate array architecture for digital signal processing (DSP) is presented. The basic cell structure takes into account the high degree of regularity of DSP datapaths. Therefore, it supports in particular the implementation of systolic arrays in connection with a pipelining scheme of one addition per half clock cycle. Together with a new gate array approach (macrocell design style), macrocells can be implemented efficiently on the new architecture. All DSP macrocells use dynamic transmission gate latches. Furthermore, the routing is done exclusively by cell abutment which results in short intercell routing. The macrocell design style is compared with the conventional gate array approach. In the common gate array approach, conventional gate array architectures are used together with conventional design equipment and layout strategies. The comparison shows a reduction in area and power consumption by a factor of 2.5 and 3.7, respectively. The efficiency increases by a factor of at least nine. These results were proved by analog circuit simulations and test chip measurements 相似文献
19.
20.
Nam Sung Kim Flautner K. Blaauw D. Mudge T. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2004,12(2):167-184
On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state preserving, low-power drowsey mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache. 相似文献