Similar literature
Found 20 similar articles (search time: 93 ms)
1.
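To address the significant power consumption of the data cache in embedded processors, a low-power data cache design method based on load reuse is proposed. By saving the data that a load instruction fetches from the data cache, subsequent load instructions can reuse that data, which reduces the number of data cache accesses and effectively lowers data cache power. Experimental results on the SuperV_EF01DSP show that, with no loss in processor performance, the method reduces data cache power by 29.48% on average while increasing area by only 0.64%.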

2.
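A cache speeds up a DSP processor's access to external memory and improves DSP performance, so designing a high-performance, low-power cache is of great significance for the overall performance of a DSP chip. This paper describes a high-performance, low-power data cache in a DSP chip. The cache adds a line buffer with a refill function to reduce how often the processor accesses the cache, thereby lowering cache power. Tests with three benchmark programs (FFT, AC3, and FIR) show that the line buffer reduces cache access frequency by 35%, markedly lowering data cache power.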
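The effect of a line buffer on cache access counts can be illustrated with a small trace-driven model. This is only a sketch under stated assumptions, not the design from the paper: the function name `count_cache_accesses`, the 32-byte line size, and the single-entry buffer that is refilled on every cache access are all hypothetical choices made for illustration.

```python
def count_cache_accesses(addresses, line_size=32, use_line_buffer=True):
    """Count how many accesses in a byte-address trace reach the data cache.

    Assumption: the line buffer holds exactly one cache line and is
    refilled with whichever line was most recently read from the cache.
    """
    buffered_line = None  # line number currently held in the line buffer
    cache_accesses = 0
    for addr in addresses:
        line = addr // line_size
        if use_line_buffer and line == buffered_line:
            continue  # served from the line buffer; the cache is not accessed
        cache_accesses += 1
        buffered_line = line  # refill the buffer with the line just read
    return cache_accesses


# Hypothetical trace with strong spatial locality.
trace = [0, 4, 8, 12, 64, 68, 0, 4]
print(count_cache_accesses(trace, use_line_buffer=False))  # 8
print(count_cache_accesses(trace, use_line_buffer=True))   # 3
```

On this toy trace, consecutive accesses to the same line are absorbed by the buffer, so only the three line changes reach the cache. The 35% figure reported in the abstract depends on the real benchmarks and cannot be derived from a model this simple.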

3.
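As one of the most computation-intensive modules, motion compensation takes up about 70% of the bandwidth between the decoder and off-chip data memory and is the bottleneck in ultra-high-definition video decoding. The cache-based HEVC motion compensation module designed here reduces bandwidth consumption by 80% while sustaining the data throughput needed for real-time decoding. First, an interpolation module built from reusable filters and a 2D cache are used to design a motion compensation module with parallel, pipelined data processing, meeting the high data throughput demands of the computation. Second, an efficient internal RAM structure is designed, and an effective solution for reducing on-chip cache power is proposed. Finally, exploiting the data correlation of reference frames, the interpolation order is rearranged, which reduces the cache hardware overhead by 87.5%. Experimental results on HEVC standard test video sequences based on HM9.0 show that the design significantly reduces bandwidth consumption and hardware overhead.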

4.
李茂松 《微电子学》2005,35(2):203-205,209
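This paper describes the architecture and main functions of a computer integrated manufacturing (CIM) system and proposes a new data warehouse design technique that improves on existing database management systems. The online transaction processing (OLTP) database is selectively and incrementally backed up to a temporary cache database, and the data in the cache database is incrementally extracted into an operational data store (ODS) data warehouse. Through a Crystal Reports server or a Microsoft OLAP server, clients can directly access the subject-oriented, integrated factual data in the ODS database. The cache database avoids a direct link between the OLTP and ODS databases, reduces the performance impact on the OLTP database of data combination and computation in the ODS data warehouse, maximizes OLTP database efficiency, and meets the production line's high performance requirements for the OLTP database.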

5.
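Non-uniform cache architecture (NUCA) has all but become the direction of development for future large on-chip caches. This paper points out that the main design issues for a homogeneous single-chip multiprocessor are data coherence in multilevel cache design, inter-core communication, and external bus efficiency, and it describes the corresponding solutions in multiprocessor design. Finally, it compares single-core and dual-core designs in performance and power and gives a floorplan of the dual-core processor. Using IP such as the dual-core processor, an L2 cache controller, and an AXI bus controller, it proposes a non-uniform cache architecture platform for designing AXI-bus SoCs.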

6.
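Global Optimization of Cache Performance in Embedded CPU Design   Total citations: 2, self-citations: 2, citations by others: 0
Based on the characteristics of embedded CPU design methodology, this paper proposes global cache performance optimization methods at two levels. One is the application level: a global optimization method that optimizes data placement, based on compilation techniques and grounded in loop and data transformation theory. The other is the system level: a global optimization method that optimizes cache indexing. These methods offer important guidance for embedded CPU design and can effectively improve the overall performance of embedded systems.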

7.
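ADI's Blackfin DSP is an ideal core processor for embedded multimedia terminals, and its performance is closely tied to how the cache and DMA are used. The AD6532 is ADI's latest dual-core baseband processor (containing a Blackfin core and an ARM core) and can be used in GSM and TD-SCDMA mobile terminals. This paper explains the memory space allocation of the AD6532 and its aliasing technique, and proposes a data operation method based on that technique that lets the data cache and DMA use the same block of memory at the same time. Experiments show that the method outperforms the traditional data cache invalidation method.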

8.
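In a multicore environment, optimizing the shared L2 cache is especially important, because when an accessed data block is not in the L2 cache (an L2 miss), the CPU pays a heavy cost of several hundred cycles to access main memory. The replacement algorithm is an important consideration in cache design, and its quality directly affects cache performance and overall system performance. Although the LRU replacement algorithm is widely used in on-chip caches, it has shortcomings: when the cache capacity is smaller than the program's working set, conflict misses arise easily, and LRU does not consider how frequently a block is accessed. This paper applies the bubble replacement algorithm to a multicore shared cache, taking into account both block access frequency and recency information. Analysis of the experimental data shows that, compared with LRU, the bubble replacement algorithm improves both MPKI (misses per kilo-instructions) and the L2 cache hit rate.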
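The abstract does not define the bubble replacement policy, so the following is a hedged sketch of one common reading, labeled here as an assumption: on a hit, a block swaps places with its neighbor one position closer to the top of the set; on a miss, the bottom block is evicted and the new block enters at the bottom. The function name `simulate` and the trace are hypothetical, and the model covers a single cache set only.

```python
def simulate(trace, ways, policy):
    """Return the miss count for one cache set under 'lru' or 'bubble'.

    stack[0] is the most protected position; stack[-1] is the next victim.
    Assumed bubble rule: a hit moves the block up by one position only.
    """
    stack = []
    misses = 0
    for block in trace:
        if block in stack:
            i = stack.index(block)
            if policy == "lru":
                stack.insert(0, stack.pop(i))  # LRU: move straight to the top
            elif i > 0:
                # Bubble: swap with the neighbor one step closer to the top.
                stack[i - 1], stack[i] = stack[i], stack[i - 1]
        else:
            misses += 1
            if len(stack) == ways:
                stack.pop()  # evict the block in the bottom position
            if policy == "lru":
                stack.insert(0, block)  # LRU: new block enters at the top
            else:
                stack.append(block)  # Bubble: new block enters at the bottom
    return misses


# Hypothetical trace: two frequently reused blocks (A, B) mixed with
# one-time scan blocks (x*, y*), on a 3-way set.
trace = ["A", "A", "B", "B", "x1", "x2", "x3", "A", "B",
         "y1", "y2", "y3", "A", "B"]
print(simulate(trace, ways=3, policy="lru"))     # 12
print(simulate(trace, ways=3, policy="bubble"))  # 8
```

On this particular trace the scan blocks push A and B out under LRU, while the assumed bubble rule keeps them in the upper positions because they have accumulated hits. The result illustrates the frequency-awareness the abstract mentions; it does not show that the policy wins on every workload.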

9.
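An Optimized Design of the Data Cache in Multiprocessor Systems   Total citations: 4, self-citations: 0, citations by others: 4
Caches remain one of the effective measures by which high-performance processors address the speed gap between the CPU and memory. This paper briefly introduces the architecture of the memory unit of the "龙腾" R2, a 32-bit RISC microprocessor that supports multiprocessor systems, and focuses on the optimized design of the data cache, including the implementation of the MEI protocol that ensures memory coherence. Simulation and synthesis show that the design meets the processor's requirements.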

10.
涂卫平 《电声技术》2011,35(11):54-59
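To address the implementation and optimization of low-bit-rate speech coders on DSPs, this paper studies allocation strategies for the on-chip cache. Based on the size of the instruction cache and the amount of data the program processes, the program is divided into reasonably sized segments that are loaded into the cache in stages. Data cache allocation takes into account the cache structure and the characteristics of the data itself, so that the limited data cache is fully utilized. The lifetime of the data is examined comprehensively, so that data already loaded into the data cac...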

11.
In this paper, we present the characterization and design of energy-efficient, on-chip cache memories. The characterization of power dissipation in on-chip cache memories reveals that the memory peripheral interface circuits and bit array dissipate comparable power. To optimize performance and power in a processor's cache, a multidivided module (MDM) cache architecture is proposed to conserve energy in the bit array as well as the memory peripheral circuits. Compared to a conventional, nondivided, 16-kB cache, the latency and power of the MDM cache are reduced by factors of 1.9 and 4.6, respectively. Based on the MDM cache architecture, the energy efficiency of the complete memory hierarchy is analyzed with respect to cache parameters in a multilevel processor cache design. This analysis was conducted by executing the SPECint92 benchmark programs with the miss ratios for reduced instruction set computer (RISC) and complex instruction set computer (CISC) machines.

12.
The traditional Java code generation and instruction fetch path is not efficient, as Java binary code is typically written into the data cache first and then loaded into the instruction cache through the shared L2 cache or memory, which costs both time and energy. In this paper, we study three hardware-based code caching strategies, which attempt to write and read the dynamically generated Java code faster and more energy-efficiently. Our experimental results indicate that with proper architectural support, writing code directly into the instruction cache can improve the performance of a variety of Java applications by 9.6% on average and by up to 42.9%. Also, the average energy dissipation of these Java programs can be reduced by 6% with efficient code caching.

13.
Deep-submicron CMOS designs maintain high transistor switching speeds by scaling down the supply voltage and proportionately reducing the transistor threshold voltage. Lowering the threshold voltage increases leakage energy dissipation due to subthreshold leakage current even when the transistor is not switching. Estimates suggest a five-fold increase in leakage energy in every future generation. In modern microarchitectures, much of the leakage energy is dissipated in large on-chip cache memory structures with high transistor densities. While cache utilization varies both within and across applications, modern cache designs are fixed in size, resulting in transistor leakage inefficiencies. This paper explores an integrated architectural and circuit-level approach to reducing leakage energy in instruction caches (i-caches). At the architecture level, we propose the Dynamically ResIzable i-cache (DRI i-cache), a novel i-cache design that dynamically resizes and adapts to an application's required size. At the circuit level, we use gated-Vdd, a novel mechanism that effectively turns off the supply voltage to, and eliminates leakage in, the SRAM cells in a DRI i-cache's unused sections. Architectural and circuit-level simulation results indicate that a DRI i-cache successfully and robustly exploits the cache size variability both within and across applications. Compared to a conventional i-cache using an aggressively-scaled threshold voltage, a 64 K DRI i-cache reduces on average both the leakage energy-delay product and cache size by 62%, with less than 4% impact on execution time. Our results also indicate that a wide NMOS dual-Vt gated-Vdd transistor with a charge pump offers the best gating implementation and virtually eliminates leakage energy with minimal increase in an SRAM cell's read time and area as compared to an i-cache with an aggressively-scaled threshold voltage.
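The abstract does not spell out how the DRI i-cache decides when to resize. One plausible sketch, assuming the cache counts misses over a fixed interval and compares the count against a threshold, is shown below. The function name `next_size` and the parameters `miss_bound`, `min_size`, `max_size`, and `factor` are hypothetical names introduced for illustration.

```python
def next_size(current_size, misses_in_interval, miss_bound,
              min_size, max_size, factor=2):
    """Pick the i-cache size (in KB) for the next interval.

    Assumed policy: grow when the interval saw more misses than the bound,
    shrink when it saw fewer, and never leave [min_size, max_size].
    """
    if misses_in_interval > miss_bound:
        return min(current_size * factor, max_size)   # upsize, capped
    if misses_in_interval < miss_bound:
        return max(current_size // factor, min_size)  # downsize, floored
    return current_size


# Hypothetical run: the cache shrinks while misses stay low, then grows
# again when a phase with a larger working set begins.
size = 64
for misses in [10, 10, 10, 500, 500]:
    size = next_size(size, misses, miss_bound=100, min_size=1, max_size=64)
    print(size)  # prints 32, 16, 8, 16, 32 on successive lines
```

In a scheme like this, the sections that are switched out are the ones whose supply would be gated off, which is where the leakage saving described in the abstract would come from.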

14.
The memory hierarchy of high-performance and embedded processors has been shown to be one of the major energy consumers. For example, the Level-1 (L1) instruction cache (I-Cache) of the StrongARM processor accounts for 27% of the power dissipation of the whole chip, whereas the instruction fetch unit (IFU) and the I-Cache of Intel's Pentium Pro processor are the single most important power-consuming modules, with 14% of the total power dissipation [2]. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip. In this paper, we propose a technique that uses an additional mini cache, the L0-Cache, located between the I-Cache and the CPU core. This mechanism can provide the instruction stream to the data path and, when managed properly, it can effectively eliminate the need for high utilization of the more expensive I-Cache. We propose, implement, and evaluate five techniques for dynamic analysis of the program instruction access behavior, which is then used to proactively guide the access of the L0-Cache. The basic idea is that only the most frequently executed portions of the code should be stored in the L0-Cache, since this is where the program spends most of its time. We present experimental results to evaluate the effectiveness of our scheme in terms of performance and energy dissipation for a series of SPEC95 benchmarks. We also discuss the performance and energy tradeoffs that are involved in these dynamic schemes. Results for these benchmarks indicate that more than 60% of the dissipated energy in the I-Cache subsystem can be saved.

15.
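In a general-purpose DSP with a modified Harvard bus architecture, a small instruction cache is used to relieve resource conflicts in the pipeline. It uses a two-way set-associative structure and is accessed only when a resource conflict occurs in the pipeline. To reduce the cache's area and power, the cache uses a single address port, which means it can complete only one read or one write per clock cycle. When read and write requests occur at the same time, a priority policy must be applied. Taking the structural characteristics of the DSP into account, this paper analyzes several priority policies and then compares the cost of each policy and its performance on several benchmarks. The results show that, with certain policies, the single-port instruction cache can achieve nearly the same hit rate as a dual-port cache.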

16.
In this paper, we describe a procedure for memory design and exploration for low power embedded systems. Our system consists of an instruction cache and a data cache on-chip, and a large memory off-chip. In the first step, we try to reduce the power consumption due to memory traffic by applying memory-optimizing transformations such as loop transformations. Next we use a memory exploration procedure to choose a cache configuration (cache size and line size) that satisfies the system requirements of area, number of cycles and energy consumption. We include energy in the performance metrics, since for different cache configurations, the variation in energy consumption is quite different from the variation in the number of cycles. The memory exploration procedure is very efficient since it exploits the trends in the cycles and energy characteristics to reduce the search space significantly.
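The configuration-selection step described above can be sketched as a constrained search over cache size and line size. The sketch below is a plain exhaustive search, not the paper's pruned procedure, and everything in it is an assumption made for illustration: the function names `explore` and `toy_estimate`, the constraint values, and in particular the cost formulas, which are placeholder arithmetic rather than measured data.

```python
from itertools import product


def explore(cache_sizes, line_sizes, estimate, max_area, max_cycles):
    """Return (energy, cache_size, line_size) for the lowest-energy
    configuration that meets the area and cycle constraints, or None."""
    best = None
    for size, line in product(cache_sizes, line_sizes):
        area, cycles, energy = estimate(size, line)
        if area > max_area or cycles > max_cycles:
            continue  # configuration violates a system requirement
        if best is None or energy < best[0]:
            best = (energy, size, line)
    return best


def toy_estimate(size, line):
    """Placeholder cost model (hypothetical numbers, not from the paper)."""
    area = size                               # area grows with capacity
    cycles = 8000 // size + 4 * line          # larger caches miss less
    energy = 40 * size + 3200 // size + line  # but cost more per access
    return area, cycles, energy


print(explore([1, 2, 4, 8, 16, 32], [4, 8], toy_estimate,
              max_area=16, max_cycles=1100))  # (724, 8, 4)
```

Under this toy model the 16-unit cache needs fewer cycles than the 8-unit cache but more energy, so the lowest-energy feasible choice is not the fastest one. That mirrors the abstract's reason for treating energy as a separate metric. The paper's procedure additionally exploits trends in the cycle and energy curves to avoid evaluating every configuration, which this sketch does not attempt.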

17.
This paper presents a new data cache design, cache-processor coupling, which tightly binds an on-chip data cache with a microprocessor. Parallel architectures and high-speed circuit techniques are developed to speed up the address handling process associated with accessing the data cache. These architectures and circuit techniques reduce the address handling time by 51%. In addition, newly proposed instructions increase data cache bandwidth eightfold. Excessive power consumption due to the wide-bandwidth data transfer is carefully avoided by newly developed circuit techniques, which reduce the power dissipated per bit to 1/26. A simulation study of the proposed architecture and circuit techniques yields a 1.8 ns delay each for address handling, cache access, and register access for a 16-kilobyte direct-mapped cache with a 0.4 μm CMOS design rule.

18.
曹向荣  张晓林 《电子学报》2014,42(5):982-986
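This paper proposes a search algorithm for optimal cache parameters that balances performance and power. Using a cache evaluation index fed back at run time, it predicts and corrects the cache parameter search space and search order, improving the accuracy of the result while maintaining search efficiency. The algorithm requires nearly 80% fewer iterations than exhaustive search; at the cost of some efficiency, it improves on the dimension-reduction search method by 13.4% in full-parameter accuracy and by 40% in capacity-parameter accuracy.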

19.
A 1.5-ns-access 500-MHz synonym hit RAM has been developed using 0.25-μm CMOS technology as a macro-cell for use in microprocessor chips. We have proposed a virtual cache system with a synonym hit RAM, which achieves both high speed and large capacity because it solves the synonym problem that occurs with large-capacity cache systems. In this system, the RAM macro needs 576-bit parallel comparison and parity check functions. The configuration used achieves testability and low power dissipation for the large 576-bit data output. Moreover, the dynamic-NOR with a dynamic-inverter and the sense-amplifier activation pulse generator are essential for reducing the comparison delay.

20.
A 64-kbyte snoopy cache memory was developed. The modified double word-line architecture with word-line buffers resulted in a large-size memory and a time-multiplex snoop operation by the pseudo-two-port method with a single-port cell. The flexible expandability was achieved by cascading multiple cache memories. The device was successfully implemented with 1.0-μm double-polysilicon and double-metal CMOS technology. Low-power sense amplifiers and comparators limited power dissipation to 0.5 W at 40 MHz.
