Similar Articles
20 similar articles found.
1.
A low-power I-cache architecture is proposed that is appropriate for embedded low-power processors. Unlike existing schemes, the proposed organisation places an extra small cache in parallel alongside the L1 cache. Since it allows simultaneous accesses to both caches, the proposed scheme introduces little performance degradation. Using simple hardware logic (for sequential accesses) and a compiler transformation (for loop accesses), most L1 cache requests are served by the small cache, so that the amount of energy consumed by the L1 cache is significantly reduced. Experimental results show that for the SPEC95 benchmarks, the proposed organisation reduces the energy-delay product on average by 67.2% over a conventional cache design and 16.8% over the filter cache design.
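The energy claim above boils down to a weighted average over fetches served by the small cache versus the L1. The sketch below (not from the paper) illustrates that arithmetic; the per-access energies and the hit ratio are hypothetical placeholders, and it assumes the L1 read is avoided whenever the small cache hits.

```python
# Illustrative sketch only: estimating the energy-delay product (EDP) of an
# instruction-fetch path in which a small cache sits alongside the L1.
# e_small, e_l1, and the hit ratio are hypothetical, not values from the paper.

def fetch_edp(small_hit_ratio, e_small=0.1, e_l1=1.0, t_cycle=1.0):
    """Average fetch energy x cycle time, normalized to an L1 access energy of 1."""
    # Assumption: on a small-cache hit only the small cache's energy is spent;
    # latency is unchanged because both caches can be probed in the same cycle.
    avg_energy = small_hit_ratio * e_small + (1.0 - small_hit_ratio) * e_l1
    return avg_energy * t_cycle

baseline = fetch_edp(0.0)    # conventional design: every fetch reads the L1
proposed = fetch_edp(0.8)    # e.g. 80% of fetches served by the small cache
print(f"EDP reduction: {(1 - proposed / baseline) * 100:.1f}%")
```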

2.
Processors operating in the space radiation environment are prone to single-event effects and system failures caused by cosmic rays and high-energy particle radiation, so processor reliability has become an increasingly serious issue. BCH-based error-correcting codes can correct multi-bit errors, but they introduce large latency overhead. This paper proposes a hybrid error-correction approach that combines BCH and EDAC to correct both multi-bit and single-bit errors in caches at low cost. The proposed technique can correct errors of up to four bits and corrects single-bit errors in one cycle. Evaluation results show that the proposed hybrid error-correction scheme improves cache-access performance by up to 20% compared to a pure BCH scheme.
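As a rough illustration of the hybrid idea, the sketch below models the control decision only: a fast SEC-DED (EDAC) path corrects single-bit errors in one cycle, while the slower BCH decoder is invoked only for multi-bit errors of up to four bits. The cycle counts are hypothetical placeholders, and the actual encoding/decoding circuitry is not modeled.

```python
# Control-flow sketch of a hybrid EDAC/BCH correction policy (illustrative only;
# the real codecs and their latencies are not modeled here).

def correct_word(num_error_bits: int) -> tuple[str, int]:
    """Return (correction path, latency in cycles) for a detected error pattern."""
    if num_error_bits == 0:
        return "clean", 1                 # normal single-cycle cache hit
    if num_error_bits == 1:
        return "EDAC (SEC-DED)", 1        # single-bit error fixed in one cycle
    if num_error_bits <= 4:
        return "BCH", 8                   # multi-bit correction, multi-cycle (placeholder)
    return "uncorrectable", 0             # beyond the scheme's 4-bit capability

for n in range(6):
    print(n, "error bit(s) ->", correct_word(n))
```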

3.
A 32-kB cache macro with an experimental reduced instruction set computer (RISC) is realized. A pipelined cache access to realize a cycle time shorter than the cache access time is proposed. A double-word-line architecture combines single-port cells, dual-port cells, and CAM cells into a memory array to improve silicon area efficiency. The cache macro exhibits 9-ns typical clock-to-HIT delay as a result of several circuit techniques, such as a section word-line selector, a dual transfer gate, and 1.0-μm CMOS technology. It supports multitask operation with logical addressing by a selective clear circuit. The RISC includes a double-word load/store instruction using a 64-b bus to fully utilize the on-chip cache macro. A test scheme allows measurement of the internal signal delay. The test device design is based on unified design rules scalable through multiple generations of process technologies down to 0.8 μm.

4.
A multilevel interconnect architecture design methodology is introduced that optimizes the interconnect cross-sectional dimensions of each metal layer to reduce logic macrocell area, cycle time, power consumption, or the number of metal layers. The predictive capability of this methodology, which is based on a stochastic wiring distribution, provides insight into defining the process technology parameters for current and future generations of microprocessors and application-specific integrated circuits (ASICs). Using this methodology in an ASIC logic macrocell case study for the 100 nm technology generation, the optimized n-tier multilevel interconnect architecture reduces macrocell area by 32%, cycle time by 16%, or the number of wiring tracks required on the topmost tier by 62% compared to a conventional design in which pitches are doubled for every successive pair of levels. A new repeater insertion methodology is also described that further enhances gigascale integration (GSI) system performance. By using repeaters, a further reduction of 70% in macrocell area, 18% in cycle time, 25% in the number of metal levels, or 44% in power dissipation is achieved compared to an n-tier design without repeaters. The key distinguishing feature of the methodology is its comprehensive framework that simultaneously solves two distinct problems, optimal wire sizing and wiring layer assignment, using independent constraints on maximum repeater area for efficient design-space exploration to optimize the area, power, frequency, and metal levels of a GSI logic megacell.

5.
The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power-consuming modules, with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of the program instruction access behavior, which is then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the dissipated energy in the I-Cache subsystem can be saved.
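A toy trace-driven model of the idea, written purely for illustration (it does not reproduce any of the paper's five dynamic policies): basic blocks are promoted into a tiny L0 cache once a simple execution counter marks them as hot, so repeated fetches bypass the larger I-Cache.

```python
# Minimal trace-driven sketch of an L0 cache fed by "hot" blocks only.
# Cache size, hot threshold, and the synthetic trace are hypothetical.
from collections import Counter, OrderedDict

def simulate_l0(trace, l0_lines=8, hot_threshold=4):
    counts = Counter()                 # per-block execution counters
    l0 = OrderedDict()                 # tiny fully associative L0, LRU replacement
    l0_hits = 0
    for block in trace:
        counts[block] += 1
        if block in l0:
            l0.move_to_end(block)
            l0_hits += 1               # served by the L0; the I-Cache is not accessed
        elif counts[block] >= hot_threshold:
            l0[block] = True           # promote a frequently executed block
            if len(l0) > l0_lines:
                l0.popitem(last=False)
    return l0_hits / len(trace)

trace = [i % 4 for i in range(1000)]   # a hypothetical tight loop over 4 basic blocks
print(f"L0 hit ratio: {simulate_l0(trace):.2%}")
```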

6.
Recently, the level of realism in PC graphics applications has been approaching that of high-end graphics workstations, necessitating a more sophisticated texture data cache memory to overcome the finite bandwidth of the AGP or PCI bus. This paper proposes a multilevel parallel texture cache memory to reduce the required data bandwidth on the AGP or PCI bus and to accelerate the operations of parallel graphics pipelines in PC graphics cards. The proposed cache memory is fabricated by 0.16-μm DRAM-based SOC technology. It is composed of four components: an 8-MB DRAM L2 cache, 8-way parallel SRAM L1 caches, pipelined texture data filters, and a serial-to-parallel loader. For high-speed parallel L1 cache data replacement, the internal bus bandwidth has been maximized up to 75 GB/s with a newly proposed hidden double data transfer scheme. In addition, the cache memory has a reconfigurable line-size architecture for optimal caching performance in various graphics applications, from three-dimensional (3-D) games to high-quality 3-D movies.

7.
Face recognition systems based on Convolutional Neural Networks (CNNs) or convolutional architectures currently represent the state of the art, achieving an accuracy comparable to that of humans. Nonetheless, there are two issues that might hinder their adoption on distributed battery-operated devices (e.g., visual sensor nodes, smartphones, and wearable devices). First, convolutional architectures are usually computationally demanding, especially when the depth of the network is increased to maximize accuracy. Second, transmitting the output features produced by a CNN might require a bitrate higher than the one needed for coding the input image. Therefore, in this paper we address the problem of optimizing the energy-rate-accuracy characteristics of a convolutional architecture for face recognition. We carefully profile a CNN implementation on a Raspberry Pi device and optimize the structure of the neural network, achieving a 17-fold speedup without significantly affecting recognition accuracy. Moreover, we propose a coding architecture custom-tailored to the features extracted by such a model.
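The bitrate concern mentioned above is easy to see with a quick size comparison; all numbers below are hypothetical placeholders (feature-map shape, precision, and compressed image size), chosen only to illustrate why raw CNN features can outweigh the coded input image.

```python
# Back-of-the-envelope comparison (hypothetical numbers): size of one raw CNN
# feature map versus a compressed input image.

def feature_map_bits(channels=256, height=14, width=14, bits_per_value=32):
    return channels * height * width * bits_per_value

image_kb = 30                                    # e.g. a ~30 kB JPEG face crop (placeholder)
feature_kb = feature_map_bits() / 8 / 1024
print(f"raw features: {feature_kb:.0f} kB vs compressed input: {image_kb} kB")
```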

8.
9.
On-chip L1 and L2 caches represent a sizeable fraction of the total power consumption of microprocessors. In nanometer-scale technology, subthreshold leakage power is becoming one of the dominant components of the total power consumption of those caches. In this study, we present optimization techniques to reduce the subthreshold leakage power of on-chip caches assuming that multiple threshold voltages (V_T's) are available. First, we show a cache leakage optimization technique that examines the tradeoff between access time and subthreshold leakage power by assigning distinct V_T's to each of the four main cache components: address bus drivers, data bus drivers, decoders, and static random access memory (SRAM) cell arrays with sense amplifiers. Second, we show optimization techniques to reduce the leakage power of L1 and L2 on-chip caches without affecting the average memory access time. The key results are: 1) two additional high V_T's are enough to minimize leakage in a single cache (3 V_T's if we include a nominal low V_T for the microprocessor core logic); 2) if the L1 size is fixed, increasing the L2 size can result in much lower leakage without reducing the average memory access time; 3) if the L2 size is fixed, reducing the L1 size may result in lower leakage without loss of average memory access time for the SPEC2K benchmarks; and 4) smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting the average memory access time.
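The benefit of adding higher threshold voltages comes from the exponential dependence of subthreshold current on V_T; the short sketch below evaluates that standard relation with illustrative numbers (the voltages and slope factor are placeholders, not values from the paper).

```python
# Back-of-the-envelope sketch of why raising V_T cuts subthreshold leakage:
# I_leak scales roughly as exp(-V_T / (n * v_T)), where v_T ~ 26 mV is the
# thermal voltage and n is the subthreshold slope factor. Constants are illustrative.
import math

def leakage_ratio(vt_low=0.20, vt_high=0.35, n=1.5, v_thermal=0.026):
    """Leakage of a high-V_T device relative to an otherwise identical low-V_T device."""
    return math.exp(-(vt_high - vt_low) / (n * v_thermal))

print(f"high-V_T leakage = {leakage_ratio():.3%} of low-V_T leakage")
```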

10.
A heuristic for finding common subexpressions of given Boolean functions based on Shannon-type factoring is proposed. This heuristic limits the search space considerably by applying a top-down approach in which synthesis of a Boolean network flows from the primary outputs to the primary inputs. The common subexpressions and their complements in N variables are extracted before common subexpressions and their complements in (N-1) variables. This decomposition of the network depends on a permutation of Boolean variables and has polynomial complexity for restricted extraction of complements. A multilevel logic optimization system, MULTI, has been implemented using this heuristic. Good results on several benchmark circuits show its effectiveness.
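As a reminder of the identity the heuristic builds on, the sketch below verifies the Shannon expansion f = x·f|x=1 + x'·f|x=0 on a small example; the extraction of cofactors shared across several functions, which is the heuristic's actual contribution, is not shown.

```python
# Minimal illustration of Shannon-type factoring: rebuild a Boolean function
# from its positive and negative cofactors and check equivalence exhaustively.

def cofactor(f, var, value):
    """Restrict a Boolean function f(assignment_dict) at var = value."""
    return lambda a: f({**a, var: value})

def shannon(f, var):
    """f = var * f|var=1 + (not var) * f|var=0."""
    f1, f0 = cofactor(f, var, 1), cofactor(f, var, 0)
    return lambda a: (a[var] and f1(a)) or (not a[var] and f0(a))

f = lambda a: (a["x"] and a["y"]) or (not a["x"] and a["z"])   # a 2:1 multiplexer
g = shannon(f, "x")
assignments = [{"x": x, "y": y, "z": z} for x in (0, 1) for y in (0, 1) for z in (0, 1)]
assert all(bool(f(a)) == bool(g(a)) for a in assignments)      # expansion is exact
print("Shannon expansion verified on all 8 assignments")
```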

11.
12.
In this letter, a genetic algorithm (GA) optimization technique is applied to determine the switching angles for a cascaded multilevel inverter that eliminates specified higher-order harmonics while maintaining the required fundamental voltage. This technique can be applied to multilevel inverters with any number of levels. As an example, a seven-level inverter is considered, and the optimum switching angles are calculated offline to eliminate the fifth and seventh harmonics. These angles are then used in an experimental setup to validate the results.
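For a seven-level cascaded inverter there are three switching angles, and the selective-harmonic-elimination conditions (fundamental fixed by a modulation index m, fifth and seventh harmonics nulled) form the objective the GA minimizes. The sketch below states that objective and, as a stand-in for the paper's genetic algorithm, finds approximate angles with a naive random search; the modulation index and search settings are placeholders.

```python
# Selective harmonic elimination objective for three angles t1 < t2 < t3
# (standard formulation), minimized here by a crude random search instead of a GA.
import math, random

def fitness(angles, m=0.8):
    t1, t2, t3 = angles
    fund = math.cos(t1) + math.cos(t2) + math.cos(t3) - 3 * m    # fundamental condition
    h5 = math.cos(5 * t1) + math.cos(5 * t2) + math.cos(5 * t3)  # 5th harmonic -> 0
    h7 = math.cos(7 * t1) + math.cos(7 * t2) + math.cos(7 * t3)  # 7th harmonic -> 0
    return fund ** 2 + h5 ** 2 + h7 ** 2                         # 0 means all satisfied

random.seed(0)
best, best_fit = None, float("inf")
for _ in range(100_000):                         # naive search over 0 < t1 < t2 < t3 < pi/2
    cand = sorted(random.uniform(0, math.pi / 2) for _ in range(3))
    f = fitness(cand)
    if f < best_fit:
        best, best_fit = cand, f

print("angles (deg):", [round(math.degrees(a), 2) for a in best], "residual:", best_fit)
```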

13.
Power consumption is critical for microcontrollers (MCUs) aimed at low-cost, low-power applications. Studies show that the power consumed by the CPU's instruction-fetch accesses to program memory constitutes a significant part of a microcontroller's overall power consumption, and that microcontroller application programs spend most of their execution time executing fixed loop code. This paper studies a technique that integrates a loop-code cache and executes loop code from it in order to reduce memory-access power.

14.
We consider a task-operator-machine assignment problem in which we seek to minimize the total execution time, to come as close as possible to a perfect load balance among the operators, and to exceed neither predefined inter-operator communication costs nor a prefixed number of resources. Moreover, in an industrial environment where the workforce changes frequently, manufacturing systems need to be flexible and critical decisions have to be taken quickly. In this context, a fuzzy genetic multiobjective optimization algorithm is developed to solve a multilevel generalized assignment problem usually encountered in the clothing industry.

15.
This paper presents a novel control method to lower the harmonic components of the conventional voltage-balancing control for multilevel converters. The harmonic components in the voltage-balancing control theory are analyzed in detail. A revised control method based on the line-to-line voltage redundancy of the three-phase inverter is proposed to lower the harmonic components on both the converter and inverter sides. The proposed method can significantly reduce the harmonic components while keeping the switching frequency nearly the same. Simulation and experimental results are provided to show the validity of the theory.

16.
17.
The recent trends in portable computing technologies have established the need for energy-efficient design strategies. To achieve minimum-energy design goals, system designers need a technique to accurately model the energy consumption of their design alternatives without performing a full physical design and full-circuit simulation. This paper presents and compares five approaches for modeling the energy consumption of CMOS circuits. These five modeling approaches have been chosen to represent the various levels of model complexity and accuracy found in the current literature. The modeling approaches are applied to the energy consumption of SRAMs to provide examples of their use and to allow comparison of their modeling qualities. It was found that a mixed characterization model, using a CV² prediction for the digital subsections and fitted simulation results for the analog subsections, is satisfactory (within ±1 process variation) for predicting the absolute energy consumed per cycle. This same model is also very good (within 2%) for predicting an optimum organization for the internal structures of the SRAM. Several common architectures and circuit designs for SRAMs are analyzed with these models. This analysis shows that global, rather than local, improvements produce the largest energy savings.
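The "CV² prediction for the digital subsections" mentioned above is the simplest of the five models: per-access dynamic energy is approximated as the total switched capacitance times the supply voltage squared. The sketch below shows that estimate with hypothetical capacitance values; it is an illustration, not the paper's calibrated model.

```python
# Minimal CV^2-style estimate (illustrative capacitances, in pF; result in pJ).

SWITCHED_CAP_PF = {
    "row decoder": 2.0,
    "word line": 1.5,
    "bit lines": 6.0,
    "output drivers": 3.0,
}

def access_energy_pj(vdd: float = 1.8) -> float:
    """Dynamic energy per access: E = (sum of switched C) * Vdd^2 (pF * V^2 = pJ)."""
    return sum(SWITCHED_CAP_PF.values()) * vdd ** 2

print(f"estimated dynamic energy per access: {access_energy_pj():.1f} pJ")
```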

18.
Low-power design for embedded processors
Minimization of power consumption in portable and battery-powered embedded systems has become an important aspect of processor and system design. Opportunities for power optimization and tradeoffs emphasizing low power are available across the entire design hierarchy. A review of low-power techniques applied at many levels of the design hierarchy is presented, and an example of a low-power processor architecture is described along with some of the design decisions made in implementation of the architecture.

19.
20.
Agere (Lucent): Agere's chipset comprises the Fast Pattern Processor (FPP), the Routing Switch Processor (RSP), and the Agere System Interface (ASI). It is a platform-processor solution that can handle multiple Layer 2 protocols at speeds up to the OC-48 level. The Agere processor architecture is not based on a RISC design; it was designed from the ground up for packet-processing applications. Agere claims that, compared with RISC-based solutions, its solution can easily reach the OC-192 level simply by raising the processor clock speed. After Lucent acquired Agere, in platform N…
