Similar Literature
19 similar documents were found.
1.
This paper proposes an on-chip memory allocation strategy for cacheless systems based on the analysis of nested-loop instructions. The strategy analyzes the loop instructions in a program and controls the partitioning granularity to slice every function into blocks, then applies an allocation algorithm combining a knapsack algorithm with a priority algorithm to select suitable blocks for placement in on-chip memory, thereby optimizing program performance. Experimental results show that the strategy improves program performance significantly, by a factor of two on average and sometimes more, and that it can predict the change in execution time after optimization with a maximum error of 2%.
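To make the block-selection step concrete, here is a minimal sketch (in Python, not from the paper) of choosing function blocks for a fixed-capacity on-chip memory with a 0/1 knapsack; the block names, sizes, loop-weighted benefit scores, and capacity are all hypothetical, and the paper's priority component is not modelled.

```python
# Hedged sketch: choose code blocks for a fixed-size on-chip memory with a
# 0/1 knapsack over a hypothetical "benefit" score (e.g. dynamic access count
# estimated from loop analysis). Sizes are in bytes; all numbers are made up.

def select_blocks(blocks, capacity):
    """blocks: list of (name, size, benefit); returns chosen block names."""
    # dp[c] = (best benefit, chosen blocks) achievable within capacity c
    dp = [(0, [])] * (capacity + 1)
    for name, size, benefit in blocks:
        for c in range(capacity, size - 1, -1):
            cand = dp[c - size][0] + benefit
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - size][1] + [name])
    return dp[capacity][1]

blocks = [               # (block, size in bytes, loop-weighted benefit) -- hypothetical
    ("fir_inner", 512, 900),
    ("fft_stage", 1024, 700),
    ("init_tables", 768, 50),
    ("mainloop_body", 896, 640),
]
print(select_blocks(blocks, capacity=2048))   # -> ['fir_inner', 'fft_stage']
```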

2.
In JPEG2000 ASIC design, the interface memory between the DWT and Tier-1 coding directly affects the storage of DWT coefficients, the re-reading of LL sub-band data, and the delivery of code-block data to the Tier-1 bit-plane coder. This paper implements this storage with two on-chip DPRAM blocks, proposes a simple and efficient read/write strategy, and verifies the method by FPGA simulation; performance analysis shows that the parallelism between the DWT and the bit-plane coder approaches 90%.

3.
To address the huge volume of synthetic aperture radar (SAR) imaging echo data and the low utilization of memory resources, a storage scheme based on double data rate dynamic random access memory (DDR) is proposed, together with an in-place transpose storage method. The method first compares the range and azimuth lengths of the echo data and reserves the storage space of a single row or column of the longer dimension, which makes the logical mapping between read and write addresses more flexible, enables in-place transposition of large data volumes on chip, and raises DDR resource utilization. Block sub-matrix address mapping and cross-page address mapping are then applied to the in-place transpose, effectively improving the access efficiency of SAR echo-data transposition. Simulation results show that the storage system meets real-time imaging requirements while halving the amount of DDR memory used, improving DDR resource utilization and reducing cost; it has already been applied successfully to SAR imaging processing in several modes.
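The corner-turn (transpose) access pattern above hinges on how logical (row, column) coordinates map to DDR addresses. Below is a generic tiled (sub-matrix) address-mapping sketch, not the paper's exact scheme: the tile size, matrix width, and the assumption that one tile fits a DDR page are all illustrative.

```python
# Hedged sketch of block (sub-matrix) address mapping for a matrix transpose in
# DDR: the matrix is split into T x T tiles, each tile stored contiguously so
# that it fits in one DDR row/page. Row-order and column-order walks then both
# reuse an open page for up to T consecutive elements. Numbers are illustrative.

T = 32            # tile edge (elements); assume T*T elements fit one DDR page
COLS = 4096       # matrix width in elements (e.g. range samples per pulse)

def addr(r, c):
    """Map logical (row, col) to a linear element address with tiling."""
    tile_row, tile_col = r // T, c // T
    in_r, in_c = r % T, c % T
    tiles_per_row = COLS // T
    tile_id = tile_row * tiles_per_row + tile_col   # tiles laid out row-major
    return tile_id * T * T + in_r * T + in_c        # elements inside a tile

# Writing range lines (row-major) and later reading azimuth lines (column-major)
# both advance through whole tiles, so page activations in the column-order
# pass drop roughly from one per element to one per T elements.
print(addr(0, 0), addr(0, 1), addr(1, 0), addr(T, 0))
```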

4.
涂卫平 《电声技术》2011,35(11):54-59
Targeting the implementation and optimization of a low-bit-rate speech coder on a DSP, this work studies on-chip cache allocation strategies. According to the size of the instruction cache and the amount of data the program processes, the program is divided into reasonably sized segments that are loaded into the cache in stages. The allocation of the data cache takes both the cache structure and the characteristics of the data into account, so that the limited data cache is fully utilized. The lifetimes of the data are examined comprehensively, so that data already loaded into the Cac...

5.
Graph convolutional networks (GCNs) far outperform traditional artificial-intelligence algorithms in tasks such as social networks, e-commerce, and molecular structure inference, and have attracted wide attention in recent years. Unlike convolutional neural networks (CNNs), whose data are independently distributed, GCNs focus on extracting feature relationships among data items and represent those relationships with an adjacency matrix; their input data and operands are therefore much sparser than those of CNNs and involve heavy data movement, which makes an efficient GCN accelerator a challenge. The memristor (ReRAM), an emerging non-volatile memory, offers high density, fast read access, low power, and in-memory computing. ReRAM-based CNN acceleration has been studied widely, but the extreme sparsity of GCNs makes existing accelerators inefficient. This paper therefore proposes an efficient GCN accelerator based on ReRAM crossbar arrays. First, the computation and memory-access characteristics of the different GCN operands are analyzed, and a scheme for mapping the weights and the adjacency matrix onto the ReRAM arrays is proposed, exploiting the compute-intensive nature of these two operands while avoiding the high cost of the memory-intensive feature vectors. Further, the sparsity of the adjacency matrix is fully exploited: a sub-matrix partitioning algorithm and a compressed mapping scheme for the adjacency matrix are proposed to minimize the ReRAM resources the GCN requires. In addition, the accelerator supports sparse computation and accepts feature vectors in the coordinate-list (COO) compressed format, keeping the computation regular and efficient. Experimental results show a 483x speedup and 1569x energy saving over a CPU, and a 28x speedup and 168x energy saving over a GPU.
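For readers unfamiliar with the COO format mentioned above, here is a minimal, self-contained sketch of coordinate-list storage and a sparse product over it; the tiny matrix is illustrative only and unrelated to the paper's workloads.

```python
# Hedged sketch of the coordinate-list (COO) sparse format and a matrix-vector
# product over it: only the stored (row, col, value) triples are visited.

rows = [0, 0, 2]        # COO: parallel lists of nonzero coordinates and values
cols = [1, 2, 0]
vals = [3.0, 1.0, 2.0]

def coo_matvec(rows, cols, vals, x, n_rows):
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):   # skip all zero entries implicitly
        y[r] += v * x[c]
    return y

print(coo_matvec(rows, cols, vals, x=[1.0, 2.0, 3.0], n_rows=3))
# -> [9.0, 0.0, 2.0]
```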

6.
An on-chip memory allocation strategy based on an extended control flow graph
王学香  浦汉来  杨军 《电子学报》2007,35(8):1558-1562
This paper proposes a scratch-pad memory (SPM) allocation strategy based on an extended control flow graph (ECFG). The strategy first partitions the program into nodes such as global variables, the global stack, and instruction blocks, and describes the application with an ECFG containing the nodes and the relationships among them; it then assigns the selected nodes to the SPM using an improved knapsack algorithm that takes the inter-node relationships into account. Experiments show that the strategy reduces application execution time by 11% compared with an SPM allocation strategy using a plain knapsack algorithm, and by 56% compared with not using an SPM, greatly improving the performance of the SoC memory subsystem.

7.
A design method for an FFT processor based on a memory-interleaved architecture is proposed, together with a conflict-free address generation algorithm for radix-8 FFT; data are processed frame by frame. Each memory is divided into 8 independent banks, and by decoding a circular shift register the butterfly units read and write operands in parallel without conflicts, while 8 input channels perform complex multiplications in parallel. Full pipelining is introduced at each stage to reduce the clock-cycle overhead, and the inequality conditions that the local pipeline design must satisfy are derived. The input and output memories use ping-pong operation, alternating frame by frame, so the FFT runs with continuous input and output, the sampling frequency matches the system clock frequency, and real-time performance is good; computational precision is guaranteed by block floating point. The method can be extended to radix-16 FFT processor design.
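One classical way to obtain the conflict-free banking described above for a radix-8 FFT is a digit-sum bank assignment; the sketch below illustrates that idea (the paper itself derives bank indices in hardware by decoding a circular shift register, which is not modelled here), with the transform length chosen arbitrarily.

```python
# Hedged sketch: a classical conflict-free bank assignment for an in-place
# radix-8 FFT (digit-sum scheme). The 8 operands of any butterfly differ in a
# single base-8 digit of their index, so summing the digits mod 8 spreads them
# over 8 distinct banks. N is chosen arbitrarily for illustration.

N = 512                      # transform length, a power of 8

def bank(index, radix=8):
    s = 0
    while index:
        s += index % radix   # accumulate base-8 digits
        index //= radix
    return s % radix

# Check: the operands of the first-stage butterfly starting at index 0 land in
# 8 different banks, so they can be read and written in parallel.
group = [k * (N // 8) for k in range(8)]        # indices 0, 64, 128, ...
print([bank(i) for i in group])                 # all different -> no conflict
```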

8.
In memory systems built mainly around SDRAM, row-switching accesses to the SDRAM produce a large power overhead, so reducing the number of row switches lowers memory-system power. This paper proposes a low-power design strategy that introduces on-chip memory to reduce the number of SDRAM row switches. The strategy first analyzes the instruction execution stream and finds that accesses to the stack and to global variables cause frequent row switches; the stack is then placed in an on-chip stack memory, and a greedy algorithm under a chip-area constraint determines the size of the on-chip data memory and the global variables stored in it. Experimental results show that introducing even a fairly small on-chip memory greatly reduces the number of row switches and brings a marked drop in power; the power spent on row-switching accesses falls by 24% on average.
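A minimal sketch of the area-constrained greedy selection described above, assuming hypothetical profiling counts of row activations per global variable and a byte-denominated area budget:

```python
# Hedged sketch of the greedy idea: rank global variables by profiled SDRAM row
# activations per byte and move the best ones into an on-chip data memory until
# the area budget runs out. The profile numbers and budget are hypothetical.

def pick_globals(variables, area_budget_bytes):
    """variables: list of (name, size_bytes, row_activations). Greedy by density."""
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    chosen, used = [], 0
    for name, size, _acts in ranked:
        if used + size <= area_budget_bytes:
            chosen.append(name)
            used += size
    return chosen

profile = [                 # hypothetical profiling results
    ("coeff_table", 2048, 15000),
    ("frame_buf",  16384,  9000),
    ("state_vec",    512,  6000),
    ("log_buf",     4096,    40),
]
print(pick_globals(profile, area_budget_bytes=4096))   # -> ['state_vec', 'coeff_table']
```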

9.
管茂林  何义  杨乾明  张春元  伍楠 《电子学报》2012,40(7):1379-1385
To address the problems that VLIW code size causes for instruction memory capacity and power in stream architectures, this paper analyzes the instruction characteristics of stream processors and proposes a new field-partitioned VLIW compression technique. On this basis, a distributed on-chip instruction memory is designed for the stream architecture and a SIMD-pipelined execution mode is proposed. Experimental results show that the technique cuts off-chip instruction accesses by 38% and reduces on-chip instruction memory requirements by about 65%; the distributed instruction memory reduces on-chip instruction memory area by about 37%, lowering the MASA system area by 8.92% and instruction memory power by 61%.

10.
In EOS (Ethernet over SDH) designs, media access control (MAC) frames received on the Ethernet side must be stored before the framer reads them out. This article describes the implementation, which uses synchronous dynamic random access memory (SDRAM) as off-chip storage. Exploiting SDRAM's high speed and large capacity reduces the buffer area required on the application-specific integrated circuit (ASIC) and allows Ethernet data to be mapped accurately into the SDH transport network. The design has been verified on a field-programmable gate array (FPGA) and tested with a network analyzer.

11.
Demands have been placed on dynamic random access memory (DRAM) to not only increase memory capacity and data transfer speed but also to reduce operating and standby currents. When a system uses DRAM, the restricted data retention time necessitates a refresh operation because each bit of the DRAM is stored as an amount of electrical charge in a storage capacitor. Power consumption for the refresh operation increases in proportion to memory capacity. A new method is proposed to reduce the refresh power consumption dynamically, when full memory capacity is not required, by effectively extending the memory cell retention time. Conversion from 1 cell/bit to 2N cells/bit reduces the variation of retention times among memory cells. The proposed method reduces the frequency of disturbance and power consumption by two orders of magnitude. Furthermore, the conversion itself can be realized very simply from the structure of the DRAM array circuit, while maintaining all conventional functions and operations in the full array access mode.
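A hedged back-of-the-envelope relation, not the paper's analysis, for why extending retention time cuts refresh power:

```latex
% Hedged sketch: all N_rows rows are refreshed once per refresh interval t_REF,
% and t_REF is bounded by the worst-case cell retention time, so
P_{\text{refresh}} \;\propto\; \frac{N_{\text{rows}}}{t_{\text{REF}}},
\qquad t_{\text{REF}} \le t_{\text{ret,min}} .
% Ganging several cells per bit raises the worst-case retention of the combined
% storage node, so t_REF can be stretched by roughly the same factor and refresh
% power falls proportionally, at the cost of reduced usable capacity.
```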

12.
A method to both reduce energy and improve performance in a processor-based embedded system is described in this paper. Comprising a scratchpad memory instead of an instruction cache, the target system dynamically (at runtime) copies into the scratchpad the code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric called concomitance, which is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining the locations within the program where these custom instructions should be inserted. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% lower energy (on average) and improves performance by 40.0% (on average) when compared to a traditional cache system of identical size.
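As a rough illustration of the copy-or-not decision behind this runtime scratchpad management, the sketch below uses a simple benefit-versus-copy-cost test; the constants and segment profiles are hypothetical, and this gain score merely stands in for the paper's concomitance metric.

```python
# Hedged sketch: copy a code segment to the scratchpad only when its estimated
# execution benefit outweighs the one-off DMA copy cost. All numbers are
# hypothetical and do not reproduce the paper's concomitance-based heuristic.

COPY_COST_PER_BYTE = 0.5      # cycles to copy one byte into the scratchpad
GAIN_PER_EXECUTED_BYTE = 0.3  # cycles saved per byte fetched from scratchpad

def worth_copying(segment_size, expected_executions):
    copy_cost = COPY_COST_PER_BYTE * segment_size
    benefit = GAIN_PER_EXECUTED_BYTE * segment_size * expected_executions
    return benefit > copy_cost

# Candidate segments: (size in bytes, expected executions between copy points)
candidates = {"loop_a": (768, 2000), "cold_init": (1024, 1)}
for name, (size, execs) in candidates.items():
    print(name, worth_copying(size, execs))   # loop_a True, cold_init False
```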

13.
In this paper we describe a multi-module, multi-port memory design procedure that satisfies area and/or energy constraints for embedded applications. Our procedure consists of applying loop transformations and reordering array accesses to reduce the memory bandwidth, followed by memory allocation and assignment procedures based on ILP models and heuristic algorithms. The specific problems include determination of (a) the memory configuration with minimum area, given the energy bound, (b) the memory configuration with minimum energy, given the area bound, and (c) the array allocation such that the energy consumption is minimum for a given memory configuration (number of modules, size and number of ports per module). The results obtained by the heuristics match well with those obtained by the ILP methods.
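The flavour of problem (b) above, the minimum-energy configuration under an area bound, can be written as the following heavily simplified, hypothetical ILP sketch (the authors' actual models are richer):

```latex
% Hedged sketch: x_{a,m}=1 iff array a is assigned to candidate module m,
% y_m=1 iff module m is instantiated. n_a: accesses to a, E_m: energy per
% access of m, s_a: size of a, C_m: capacity of m, A_m: area of m.
\min \sum_a \sum_m x_{a,m}\, n_a E_m
\quad \text{s.t.}\quad
\sum_m x_{a,m} = 1 \;\;\forall a,\qquad
\sum_a x_{a,m}\, s_a \le C_m\, y_m \;\;\forall m,\qquad
\sum_m y_m A_m \le A_{\max},\qquad
x_{a,m},\, y_m \in \{0,1\}
```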

14.
Modern digital signal processors (DSPs) provide dedicated address generation units (AGUs) which support data memory access by indirect addressing with automatic address modification in parallel to other machine operations. There is no address computation overhead if the next address is within the auto-modify range. Typically, optimization of data memory layout and address register assignment makes it possible to reduce both execution time and code size of DSP programs. In this paper, we present an optimization technique for integrated data memory layout generation and address register assignment. We use a generic AGU model which captures important addressing capabilities of DSPs such as linear addressing, modulo addressing, auto-modifying, and indexing within a given auto-modify range. Experimental results demonstrate that the proposed technique significantly outperforms existing optimization strategies.
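A minimal sketch of the cost model behind such AGU-aware layout optimization: an address-register update is free when the next access lies within the auto-modify range, otherwise an extra address-setup instruction is needed. The access sequence, layouts, and range below are hypothetical.

```python
# Hedged sketch: count explicit address-register setups for a given memory
# layout, assuming post-increment/decrement covers steps of at most 1 word.

AUTO_MODIFY_RANGE = 1        # |delta| <= 1 handled by auto-modify for free

def setup_overhead(access_seq, layout):
    """access_seq: variable names in access order; layout: name -> address."""
    extra = 0
    for prev, nxt in zip(access_seq, access_seq[1:]):
        if abs(layout[nxt] - layout[prev]) > AUTO_MODIFY_RANGE:
            extra += 1       # needs an explicit address-register load
    return extra

seq = ["a", "c", "b", "c", "a"]
print(setup_overhead(seq, {"a": 0, "b": 1, "c": 2}))   # 2 extra setups
print(setup_overhead(seq, {"a": 0, "c": 1, "b": 2}))   # better layout: 0
```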

15.
刘军  朱承强  吴玺  王伟  任福继 《电子学报》2018,46(3):629-635
The memory-die stacking scheme and the redundancy-sharing strategy have an important influence on the yield of three-dimensional (3D) memories. To improve the yield of 3D memories while reducing the number of TSVs required for row/column redundancy, a neighboring-layer redundancy-sharing structure is proposed in which the row/column redundancy of each memory die can be used not only by its own layer but also by adjacent layers. On top of this structure, a new memory-die stacking scheme is proposed: by constructing selection constraints for the memory dies, it selects a suitable die at each step to stack into the 3D memory so that the row/column redundancy is fully exploited. Experimental results show that, compared with similar methods reported internationally, the proposed scheme effectively improves the yield of 3D memories and reduces the number of TSVs required for row/column redundancy.

16.
This paper presents a low-power motion estimator for single-chip integration of an MPEG-2 MP@ML video encoder, based on a low-complexity, modified multi-resolution telescopic search algorithm. In the coarse-resolution motion-vector search, the single conventional telescopic search is replaced by a two-stage search, composed of a motion-tracking search and an adaptive telescopic search, that can terminate early. Simulation results show that the new algorithm's computational load is only about 30% of the conventional algorithm's while the decoded picture quality remains the same. A systolic array and an improved tree-structured array are used to accelerate the coarse-resolution search and the refinement search of the motion vectors, respectively, and low power is achieved through techniques such as clock gating and operand masking.

17.
Three-dimensional (3D) memories using through-silicon vias (TSVs) as vertical buses across memory layers will likely be the first commercial application of 3D integrated circuit technology. The memory dies to stack together in a 3D memory are selected by a die-selection method. Conventional die-selection methods do not result in high enough yields of 3D memories because 3D memories are typically composed of known-good-dies (KGDs), which are repaired using self-contained redundancies. In 3D memory, redundancy sharing between neighboring vertical memory dies using TSVs is an effective strategy for yield enhancement. With the redundancy-sharing strategy, a known-bad-die (KBD) can become a KGD after bonding. In this paper, we propose a novel die-selection method that uses KBDs as well as KGDs for yield enhancement in 3D memory. The proposed die-selection method uses three search-space conditions, which reduce the search space for selecting memory dies to manufacture 3D memories. Simulation results show that the proposed die-selection method can significantly improve the yield of 3D memories under various fault distributions.

18.
Estimating the performance of a coarse-grained reconfigurable architecture in the early design phase provides useful guidance for its design. Taking into account the structure of the local data memory and its interface to the reconfigurable array, a parameterized model of the coarse-grained reconfigurable architecture is built, an improved spiral binding strategy is used to bind the operators of the application's data flow graph (DFG) to the processing elements of the reconfigurable array, and a performance estimation method for coarse-grained reconfigurable architectures is proposed. Application examples show that, in the early design phase, the method yields cycle-accurate estimates and effectively guides the design of coarse-grained reconfigurable architectures.

19.
Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the global register allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves results close to the optimal and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for the ARM7 processor reports average energy and execution-time reductions of 26% and 14% over the static approach, respectively. Additional experiments comparing the overlayed scratchpads against unified caches of the same size report average energy and execution-time savings of 20% and 10%, respectively. We also report data-memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).

