首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 78 毫秒
1.
Most of the coarse-grained reconfigurable architectures (CGRAs) are composed of reconfigurable ALU arrays and configuration cache (or context memory) to achieve high performance and flexibility. Specially, configuration cache is the main component in CGRA that provides distinct feature for dynamic reconfiguration in every cycle. However, frequent memory-read operations for dynamic reconfiguration cause much power consumption. Thus, reducing power in configuration cache has become critical for CGRA to be more competitive and reliable for its use in embedded systems. In this paper, we propose dynamically compressible context architecture for power saving in configuration cache. This power-efficient design of context architecture works without degrading the performance and flexibility of CGRA. Experimental results show that the proposed approach saves up to 39.72% power in configuration cache with negligible area overhead (2.16%).   相似文献   

2.
Power consumption is an increasingly pressing problem in modern processor design. Since the on-chip caches usually consume a significant amount of power, it is one of the most attractive targets for power reduction. This paper presents a two-level filter scheme, which consists of the L1 and L2 filters, to reduce the power consumption of the on-chip cache. The main idea of the proposed scheme is motivated by the substantial unnecessary activities in conventional cache architecture. We use a single block buffer as the L1 filter to eliminate the unnecessary cache accesses. In the L2 filter, we then propose a new sentry-tag architecture to further filter out the unnecessary way activities in case of the L1 filter miss. We use SimpleScalar to simulate the SPEC2000 benchmarks and perform the HSPICE simulations to evaluate the proposed architecture. Experimental results show that the two-level filter scheme can effectively reduce the cache power consumption by eliminating most unnecessary cache activities, while the compromise of system performance is negligible. Compared to a conventional instruction cache (32 kB, two-way) implemented with only the L1 filter, the use of a two-level filter can result in roughly 30% reduction in total cache power consumption. Similarly, compared to a conventional data cache (32 kB, four-way) implemented with only the L1 filter, the total cache power reduction is approximately 46%.  相似文献   

3.
A standard fast programmable logic array (PLA) structure is discussed, with emphasis on its power consumption drawbacks. An exploration of alternatives to this structure leads to a presentation of the architecture and design of a low-power PLA structure used in a digital signal processing (DSP) environment. This PLA achieves low power consumption by a combination of pipelining, use of a NAND-OR configuration, and a simplified addressing scheme. Experimental results for the temperature range of 0 to 70°C indicate that the circuit works as expected in a range extending to at least 3.5 V  相似文献   

4.
Signal processing algorithms and architectures can use dynamic reconfiguration to exploit variations in signal statistics with the objectives of improved performance and reduced power. Parameters provide a simple and formal way to characterize incremental changes to a computation and its computing mechanism. This paper develops a framework for dynamic parameterization and applies it to MPEG-4 motion estimation. A novel motion estimation architecture facilitates the dynamic variation of parameters to achieve power-compression tradeoffs. Our work shows that parameter variation in motion estimation helps achieve power reduction by an order of magnitude, trading off higher compression for lower power. The magnitude of the tradeoffs depends on the input signal variation. The monitoring of input and output signal statistics and subsequent variation of parameters is accomplished by a hardware controller. To provide the controller with a model of the parameter space and corresponding measures in terms of power and performance, a configuration sample space graph is developed. This graph identifies the parameters which present the best power-performance tradeoffs. The controller samples the operating environment to select the appropriate parameters. This reduces the average power consumption by 40% for 2% loss in compression. Four other signal dependent computations: (1) 2D Discrete Cosine Transform, (2) Lempel-Ziv lossless compression, (3) 3D graphics light rendering, and (4) Viterbi decoding are briefly discussed to demonstrate the applicability of dynamic reconfiguration.  相似文献   

5.
Various architectural power reduction techniques have been proposed for on-chip caches in the last decade. In this paper, we first show that these power reduction techniques can be suboptimal when thermal effects are considered. Then, we propose a thermal-aware cache power-down technique that minimizes the power density of the active parts by turning off alternating rows of memory cells instead of entire banks. The decrease in the power density lowers the temperature, which then exponentially reduces the leakage. Thus, leakage power of the active parts is reduced in addition to the power eliminated from the parts that are turned off. Simulations based on SPEC2000, NetBench, and MediaBench applications in a 70-nm technology show that the proposed thermal-aware architecture can reduce the total energy consumption by 53% compared to a conventional cache, and 14% compared to a cache architecture with thermal-unaware power reduction scheme. Second, we show a block permutation scheme that can be used during the design of the caches to maximize the distance between blocks with consecutive addresses. Because of spatial locality, blocks with consecutive addresses are likely to be accessed within a short time interval. By maximizing the distance between such blocks, we minimize the power density of the hot spots in the cache, and hence reduce the peak temperature. This, in return, results in an average leakage power reduction of 8.7% compared to a conventional cache without affecting the dynamic power and the latency. Overall, both of our architectures add no extra run-time penalty compared to the thermal-unaware power reduction schemes, yet they result in a significant reduction in the total energy consumption of a cache  相似文献   

6.
Many architectures have been proposed for rank order and stack filtering. To achieve additional speedup in these structures requires the use of parallel processing techniques such as pipelining and block processing. Pipelining is well understood but few block architectures have been developed for rank order and stack filtering. Block processing is essential for additional speedup when the original architecture has reached the throughput limits caused by the underlying technology. A trivial block structure simply repeats a single input, single output structure to generate a multiple input, multiple output structure. Therefore the architecture can achieve speedups equal to the number of multiple outputs or the block size. However, unlike linear filters, the rank order and stack filter outputs are calculated using comparisons. It is possible to share these comparisons within the block structure and thus substantially reduce the size of the block structure. The authors introduce a systematic method for applying block processing to rank order filters and stack filters. This method takes advantage of shared comparisons within the block structure to generate a block filter with shared substructures whose complexity is reduced by up to one-third compared to the original filter structure times the block size. Furthermore, block processing is important for the generation of low power designs. A block structure can trade its increased speedup for a throughput equal to the original single output architecture but with a significantly lower power requirement. The power reduction in the trivial block structures is limited by the power supply voltage. They demonstrate how block structures with shared substructures allow them to continue decreasing the power consumption beyond the limit imposed by the supply voltage  相似文献   

7.
The optimum architecture design and mapping of QRD-RLS adaptive filters can be achieved through filter architecture selections, look-ahead transformations, and hierarchical pipelining/folding transformations. In this paper, a relaxed annihilation-reordering look-ahead (RARL) architecture is proposed, and shown to be more power and area efficient than pipelined processing architecture which was considered the most area efficient. The filters with this architecture are based on relaxed weight-update through filtering approximation, where a filter tap weight is updated upon arrival of every block of input data, and are speeded up with annihilation-reordering look-ahead transformation. As a result of the computational complexity reduction, this architecture does not change the iteration bound and filter clock frequency, and leads to speed up with linear increase in power consumption, while the pipelined processing architectures result in speedup with quadratic increase in power consumption. Upon hardware mapping, this architecture is also more advantageous to achieve low area designs. Two design examples are presented to illustrate mapping optimization using above transformations. These results are important for mapping designs onto ASICs, FPGAs or parallel computing machines. The results show significant improvements in throughput, power consumption and hardware requirement. It is also interesting to show through mathematics and simulations that the RARL QRD-RLS filters have no performance degradation in terms of convergence rate.  相似文献   

8.
We have designed a microprocessor that is based on a single instruction multiple data stream (SIMD) architecture. It features a two-way superscalar architecture for multimedia embedded systems that need to support especially MPEG2 video decoding/encoding and 3DCG image processing. This microprocessor meets all requirements of embedded systems, including (a) MPEG2 (MP@ML) decoding and graphic processing capabilities for three-dimensional images, (b) programming flexibility, and (c) low power consumption and low manufacturing cost. High performance was achieved by enhanced parallel processing capabilities while adopting a SIMD architecture and a two-way superscalar architecture. Programming flexibility was increased by providing 170 dedicated multimedia instructions. Low power consumption was achieved by utilizing advanced process technology and power-saving circuits. The processor supports a general-purpose RISC instruction set. This feature is important, as the processor will have to work as a controller of various target systems. The processor has been fabricated by 0.21-μm CMOS four-metal technology on a 9.84×10.12 mm die. It performs 2.16 GOPS/720 MFLOPS at an operating frequency of 180 MHz, with a power consumption of 1.2 W and a power supply of 1.8 V  相似文献   

9.
In order to accommodate the variety of algorithms with different performance in specific application and improve power efficiency,reconfigurable architecture has become an effective methodology in academia and industry.However,existing architectures suffer from performance bottleneck due to slow updating of contexts and inadequate flexibility.This paper presents an H-tree based reconfiguration mechanism(HRM)with Huffman-coding-like and mask addressing method in a homogeneous processing element(PE)array,which supports both programmable and data-driven modes.The proposed HRM can transfer reconfiguration instructions/contexts to a particular PE or associated PEs simultaneously in one clock cycle in unicast,multicast and broadcast mode,and shut down the unnecessary PE/PEs according to the current configuration.To verify the correctness and efficiency,we implement it in RTL synthesis and FPGA prototype.Compared to prior works,the experiment results show that the HRM has improved the work frequency by an average of 23.4%,increased the updating speed by 2×,and reduced the area by 36.9%;HRM can also power off the unnecessary PEs which reduced 51%of dynamic power dissipation in certain application configuration.Furthermore,in the data-driven mode,the system frequency can reach 214 MHz,which is 1.68×higher compared with the programmable mode.  相似文献   

10.
Cache能够提高DSP处理器对外部存储器的存取速度,提高DSP的性能,设计高性能低功耗的Cache,对于提高DSP芯片的整体性能有着十分重大的意义。描述了DSP芯片中一种高性能低功耗的数据Cache。这种Cache可以通过增加具备重装功能的Line Buffer来减少处理器对Cache的访问频率,从而降低Cache功耗。通过FFT、AC3、FIR三种基准程序测试表明,Line Buffer可以降低35%的Cache访问频率,明显降低了数据Cache功耗。  相似文献   

11.
With advances in process technology, soft errors are becoming an increasingly critical design concern. Owing to their large area, high density, and low operating voltages, caches are worst hit by soft errors. Based on the observation that in multimedia applications, not all data require the same amount of protection from soft errors, we propose a partially protected cache (PPC) architecture, in which there are two caches, one protected and the other unprotected at the same level of memory hierarchy. We demonstrate that as compared to the existing unprotected cache architectures, PPC architectures can provide 47 times reduction in failure rate, at only 1% runtime and 3% power overheads. In addition, the failure rate reduction obtained by PPCs is very sensitive to the PPC cache configuration. Therefore, this observation provides an opportunity for further improvement of the solution by correctly parameterizing the PPC configurations. Consequently, we develop design space exploration (DSE) strategies to discover the best PPC configuration. Our DSE technique can reduce the exploration time by more than six times as compared to an exhaustive approach.   相似文献   

12.
《Microelectronics Journal》2015,46(6):551-562
Most commercial Field Programmable Gate Arrays (FPGAs) have limitations in terms of density, speed, configuration overhead and power consumption mostly due to the use of SRAM cells in Look-Up Tables (LUTs), configuration memory and programmable interconnects. Also, hardwired Application Specific Integrated Circuit (ASIC) blocks designed for high performance arithmetic circuits in FPGA reduce the area available for reconfiguration. In this paper, we propose a novel generalized hybrid CMOS-memristor based architecture using stateful-NOR gates as basic building blocks for implementation of logic functions. These logic functions are implemented on memristor nanocrossbar layers, while the CMOS layer is used for selection and connection of memristors. The proposed pipelined architecture combines the features of ASIC, FPGA and microprocessor based designs. It has high density due to the use of nanocrossbar layer and high throughput especially for arithmetic circuits. The proposed architecture for three input one output logic block is compared with conventional LUT based Configurable Logic Block (CLB) having the same number of inputs and outputs; which shows 1.82×area saving, 1.57×speedup and 3.63×less power consumption. The automation algorithm to implement any logic function using proposed architecture is also presented.  相似文献   

13.
Semiconductor technology scaling provides faster and more plentiful transistors to build microprocessors, and applications continue to drive the demand for more powerful microprocessors. Weaving the "raw" semiconductor material into a microprocessor that offers the performance needed by modern and future applications is the role of computer architecture. This paper overviews some of the microarchitectural techniques that empower modem high-performance microprocessors. The techniques are classified into: 1) techniques meant to increase the concurrency in instruction processing, while maintaining the appearance of sequential processing and 2) techniques that exploit program behavior. The first category includes pipelining, superscalar execution, out-of-order execution, register renaming, and techniques to overlap memory-accessing instructions. The second category includes memory hierarchies, branch predictors, trace caches, and memory-dependence predictors. The paper also discusses microarchitectural techniques likely to be used in future microprocessors, including data value speculation and instruction reuse, microarchitectures with multiple sequencers and thread-level speculation, and microarchitectural techniques for tackling the problems of power consumption and reliability  相似文献   

14.
In this paper, we describe a procedure for memory design and exploration for low power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we try to reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements of area, number of cycles and energy consumption. We include energy in the performance metrics, since for different cache configurations, the variation in energy consumption is quite different from the variation in the number of cycles. The memory exploration procedure is very efficient since it exploits the trends in the cycles and energy characteristics to reduce the search space significantly.  相似文献   

15.
For the demands of mobile multimedia applications, a stream processor core is designed with 8.91 ${hbox {mm}}^{2}$ area in 0.18$ mu{hbox {m}}$ CMOS technology at 50 MHz. Several techniques and architectures are proposed to achieve high performance with low power consumption. First of all, an optimized core pipeline is designed with 2-issue VLIW architecture to achieve the processing capability of 400 MFLOPS or 800 MOPS. In addition, adaptive multi-thread scheme can increase the performance by increasing hardware utilization, and the proposed configurable memory array architecture can reduce off-chip memory accessing frequency by caching both input data and output results. Furthermore, for graphics applications, a geometry-content-aware technique called early-rejection-after-transformation is proposed to remove redundant operations for invisible triangles. As for video applications, the proposed video accelerating instruction set can support motion estimation for video coding. Experimental results show that 86% power reduction and more than ten times speedup of the VLIW architecture can be achieved with the proposed techniques to provide the processing speed of 25 Mvertices/s and power consumption of 8.6 mW. Moreover, CIF (352 $times$ 288) 30 fps video encoding with the search range of {H[ $-$24,24), V[ $-$16,16]} is also supported by the proposed stream processor. By supporting both video and graphics functions, this highly efficient, high performance, and low power processor core is applicable to multimedia mobile devices.   相似文献   

16.
In a typical embedded CPU, large on-chip storage is critical to meet high performance requirements. However, the fast increasing size of the on-chip storage based on traditional SRAM cells makes the area cost and energy consumption unsustainable for future embedded applications. Replacing SRAM with DRAM on the CPU’s chip is generally considered not worthwhile because DRAM is not compatible with the common CMOS logic and requires additional processing steps beyond what is required for CMOS. However a special DRAM technology, Gain-Cell embedded-DRAM (GC-eDRAM)  [1], [2], [3] is logic compatible and retains some of the good properties of DRAM (small and low power). In this paper we evaluate the performance of a novel hybrid cache memory where the data array, generally populated with SRAM cells, is replaced with GC-eDRAM cells while the tag array continues to use SRAM cells. Our evaluation of this cache demonstrates that, compared to the conventional SRAM-based designs, our novel architecture exhibits comparable performance with less energy consumption and smaller silicon area, enabling the sustainable on-chip storage scaling for future embedded CPUs.  相似文献   

17.
《Microelectronics Journal》2015,46(11):1020-1032
This paper describes a new memristor crossbar architecture that is proposed for use in a high density cache design. This design has less than 10% of the write energy consumption than a simple memristor crossbar. Also, it has up to 3 times the bit density of an STT-MRAM system and up to 11 times the bit density of an SRAM architecture. The proposed architecture is analyzed using a detailed SPICE analysis that accounts for the resistance of the wires in the memristor structure. Additionally, the memristor model used in this work has been matched to specific device characterization data to provide accurate results in terms of energy, area, and timing. The proposed memory system was analyzed by modeling two different devices that vary in resistance range and switching time. This system does not require that the memristor devices have inherent diode effects which limit alternate current paths. Therefore this system is capable of utilizing a much broader class of devices.An architectural analysis has also been completed that shows how the memory system may perform as a cache memory. A hybrid cache structure was used to alleviate the long write latencies of memristor devices. This approach consisted of the tag array being made of SRAM cells while the data array was made of the memristor circuit proposed. This hybrid scheme allows multiple reads and writes to concurrently access different sub-arrays within a cache. The performance of these novel memristor based caches was compared to SRAM and STT-MRAM based caches through detailed simulations. The results show that the memristor caches are denser and allow better performance along with lower system power when compared to the STT-MRAM and SRAM caches.  相似文献   

18.
A new CMOS gate array architecture for digital signal processing (DSP) is presented. The basic cell structure takes into account the high degree of regularity of DSP datapaths. Therefore, it supports in particular the implementation of systolic arrays in connection with a pipelining scheme of one addition per half clock cycle. Together with a new gate array approach (macrocell design style), macrocells can be implemented efficiently on the new architecture. All DSP macrocells use dynamic transmission gate latches. Furthermore, the routing is done exclusively by cell abutment which results in short intercell routing. The macrocell design style is compared with the conventional gate array approach. In the common gate array approach, conventional gate array architectures are used together with conventional design equipment and layout strategies. The comparison shows a reduction in area and power consumption by a factor of 2.5 and 3.7, respectively. The efficiency increases by a factor of at least nine. These results were proved by analog circuit simulations and test chip measurements  相似文献   

19.
提出了一种动态可重构高速缓存结构,提升了系统性能;同时,大大降低了功耗。该结构在传统高速缓存上作少量的硬件改动,实现了高速缓存容量、块大小和关联度的动态可配置性。实验结果表明,相对于传统结构,动态可重构高速缓冲存储器在不损失性能的前提下,取得了很好的降低系统功耗的效果。  相似文献   

20.
Circuit and microarchitectural techniques for reducing cache leakage power   总被引:2,自引:0,他引:2  
On-chip caches represent a sizable fraction of the total power consumption of microprocessors. As feature sizes shrink, the dominant component of this power consumption will be leakage. However, during a fixed period of time, the activity in a data cache is only centered on a small subset of the lines. This behavior can be exploited to cut the leakage power of large data caches by putting the cold cache lines into a state preserving, low-power drowsey mode. In this paper, we investigate policies and circuit techniques for implementing drowsy data caches. We show that with simple microarchitectural techniques, about 80%-90% of the data cache lines can be maintained in a drowsy state without affecting performance by more than 0.6%, even though moving lines into and out of a drowsy state incurs a slight performance loss. According to our projections, in a 70-nm complementary metal-oxide-semiconductor process, drowsy data caches will be able to reduce the total leakage energy consumed in the caches by 60%-75%. In addition, we extend the drowsy cache concept to reduce leakage power of instruction caches without significant impact on execution time. Our results show that data and instruction caches require different control strategies for efficient execution. In order to enable drowsy instruction caches, we propose a technique called cache subbank prediction, which is used to selectively wake up only the necessary parts of the instruction cache, while allowing most of the cache to stay in a low-leakage drowsy mode. This prediction technique reduces the negative performance impact by 78% compared with the no-prediction policy. Our technique works well even with small predictor sizes and enables a 75% reduction of leakage energy in a 32-kB instruction cache.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号