Found 19 similar documents (search time: 750 ms)
1.
2.
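In JPEG2000 application-specific integrated circuit design, the interface memory between the DWT and Tier-1 coding directly affects the storage of DWT coefficients, the re-reading of LL subband data, and the supply of code-block data to the Tier-1 bit-plane coder. This paper uses two on-chip DPRAMs for this storage, proposes a simple and efficient read/write strategy, and implements an FPGA simulation of the method. Performance analysis shows that the parallelism between the DWT and the bit-plane coder approaches 90%.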
3.
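To address the very large volume of synthetic aperture radar (SAR) imaging echo data and the low utilization of memory resources, this paper proposes a storage device based on double data rate (DDR) dynamic random access memory and designs an in-place transposition storage method. The method first compares the range and azimuth lengths of the radar echo data, then reserves the storage space occupied by a single row or a single column of whichever dimension is longer. This makes the logical mapping of memory read and write addresses more flexible, achieves in-chip transposition of large data volumes, and raises DDR resource utilization. Block sub-matrix address mapping and cross-page address mapping are also applied to the in-place transposition, which improves the access efficiency of SAR echo data transposition. Simulation results show that the storage system meets real-time imaging requirements while halving DDR memory usage, which improves DDR utilization and lowers cost. It has been successfully applied to SAR imaging processing in multiple modes.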
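The abstract above mentions block sub-matrix address mapping without giving details. The following is a minimal sketch of the general idea only, not the paper's mapping: elements are stored tile by tile so that a column read stays inside one memory page for several consecutive accesses. The function names, the 64 x 64 matrix, the 8 x 8 tile, and the assumption that one tile fills exactly one page are all illustrative.

```python
def tiled_address(row, col, n_cols, tile):
    """Map matrix element (row, col) to a linear address, tile by tile."""
    tiles_per_row = n_cols // tile
    tile_index = (row // tile) * tiles_per_row + (col // tile)
    offset_in_tile = (row % tile) * tile + (col % tile)
    return tile_index * tile * tile + offset_in_tile


def page_switches(addresses, page_size):
    """Count how often consecutive accesses land in different pages."""
    pages = [a // page_size for a in addresses]
    return sum(1 for prev, cur in zip(pages, pages[1:]) if prev != cur)


N_ROWS, N_COLS, TILE = 64, 64, 8
PAGE = TILE * TILE  # assumption: one tile fills exactly one memory page

# Read column 0 from top to bottom, the access pattern a transpose needs.
column_linear = [r * N_COLS for r in range(N_ROWS)]  # plain row-major layout
column_tiled = [tiled_address(r, 0, N_COLS, TILE) for r in range(N_ROWS)]

linear_switches = page_switches(column_linear, PAGE)
tiled_switches = page_switches(column_tiled, PAGE)
print(linear_switches, tiled_switches)
```

Under these assumptions the column read changes page 63 times with the row-major layout and 7 times with the tiled layout, at the price of a row read also changing page 7 times instead of never.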
4.
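For the implementation and optimization of a low-bit-rate speech coder on a DSP, allocation strategies for the on-chip cache are studied. Based on the size of the instruction cache and the amount of data the program processes, the program is divided into reasonably sized segments that are loaded into the cache in stages. Allocation of the data cache takes into account both the cache structure and the characteristics of the data itself, so that the limited data cache is fully used. The lifetime of the data is examined comprehensively, so that data already loaded into the data Cac...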
5.
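Graph convolutional networks (GCNs) far outperform traditional artificial intelligence algorithms in tasks such as social networks, e-commerce, and molecular structure inference, and have attracted wide attention in recent years. Unlike convolutional neural networks (CNNs), whose data are independently distributed, GCNs focus on extracting feature relationships among data and represent those relationships with an adjacency matrix. Their input data and operands are therefore sparser than those of CNNs and involve heavy data movement, so building an efficient GCN accelerator is a challenge. The memristor (ReRAM), an emerging non-volatile memory, offers high density, fast read access, low power, and in-memory computing. ReRAM-based CNN acceleration has been widely studied, but the extreme sparsity of GCNs makes existing accelerators inefficient. This paper therefore proposes an efficient GCN accelerator based on ReRAM crossbar arrays. First, it analyzes the compute and memory-access characteristics of the different operands in GCN and proposes a method for mapping the weights and the adjacency matrix onto ReRAM arrays, which exploits the compute-intensive nature of these two operands and avoids excessive overhead from the memory-intensive feature vectors. Further, it exploits the sparsity of the adjacency matrix, proposing a sub-matrix partitioning algorithm and a compressed mapping scheme for the adjacency matrix that minimize the ReRAM resources GCN requires. In addition, the accelerator supports sparse computation and accepts feature-vector input compressed in coordinate list (COO) format, so that computation proceeds regularly and efficiently. Experimental results show that the accelerator achieves a 483x speedup and a 1569x energy saving over a CPU, and a 28x speedup and a 168x energy saving over a GPU.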
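To make the coordinate list (COO) format mentioned above concrete, here is a small software sketch of multiplying a COO-encoded sparse feature matrix by a dense weight matrix. It only illustrates the data format; it does not model the ReRAM crossbar or the paper's mapping. The function name and the example values are hypothetical.

```python
def coo_times_dense(entries, weights, n_rows):
    """Multiply a sparse matrix in COO form by a dense matrix.

    entries: list of (row, col, value) triples, one per nonzero element
    weights: dense matrix as a list of rows
    n_rows:  number of rows of the sparse matrix
    """
    n_out = len(weights[0])
    out = [[0.0] * n_out for _ in range(n_rows)]
    # Only nonzero entries are visited, so work scales with the nonzero count.
    for row, col, value in entries:
        for j in range(n_out):
            out[row][j] += value * weights[col][j]
    return out


# Hypothetical 2 x 3 feature matrix [[0, 2, 0], [1, 0, 3]] stored as COO triples.
features_coo = [(0, 1, 2.0), (1, 0, 1.0), (1, 2, 3.0)]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # dense 3 x 2 weight matrix

result = coo_times_dense(features_coo, weights, n_rows=2)
print(result)  # [[0.0, 2.0], [4.0, 3.0]]
```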
6.
7.
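A design method for an FFT processor based on a memory-interleaved architecture is proposed, together with a conflict-free address generation algorithm for radix-8 FFT; data are processed frame by frame. Each memory is divided into 8 independent banks. By decoding a circular shift register, the butterfly unit reads and writes operands in parallel without conflict, and the 8 channels of input data undergo complex multiplication in parallel. Each stage of computation is fully pipelined, which reduces clock-cycle overhead, and the inequality conditions that a partially pipelined design must satisfy are derived. The input and output memories operate in ping-pong fashion, alternating per frame, so the FFT accepts input and produces output continuously. The sampling frequency equals the system operating frequency, which gives good real-time behavior, and computational accuracy is ensured by block floating point. The design method can be extended to radix-16 FFT processors.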
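The abstract does not spell out its address generation algorithm, which is based on decoding a circular shift register. As an illustration of what "conflict-free" means for 8 banks, the sketch below uses a different, commonly cited scheme: the bank index is the sum of the base-8 digits of the address, modulo 8. Because the 8 operands of one radix-8 butterfly differ in exactly one digit, they always fall in 8 distinct banks. All names here are assumptions made for the example.

```python
RADIX = 8


def digits(address, n_digits):
    """Base-8 digits of address, least significant first."""
    return [(address // RADIX**i) % RADIX for i in range(n_digits)]


def bank_of(address, n_digits):
    """Digit-sum bank assignment (illustrative, not the paper's scheme)."""
    return sum(digits(address, n_digits)) % RADIX


def butterfly_addresses(base, stage):
    """The 8 operand addresses of one butterfly; digit `stage` of base is 0."""
    return [base + k * RADIX**stage for k in range(RADIX)]


def is_conflict_free(n_digits):
    """Check every butterfly of every stage of an 8**n_digits-point FFT."""
    n_points = RADIX**n_digits
    for stage in range(n_digits):
        for base in range(n_points):
            if digits(base, n_digits)[stage] != 0:
                continue  # not the first operand of a butterfly at this stage
            banks = {bank_of(a, n_digits) for a in butterfly_addresses(base, stage)}
            if len(banks) != RADIX:
                return False
    return True


print(is_conflict_free(3))  # 512-point FFT, three radix-8 stages
```

Varying one digit over 0 to 7 while the others stay fixed shifts the digit sum through all 8 residues, which is why the check succeeds for any number of stages.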
8.
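In memory systems built mainly on SDRAM, row-switching accesses to SDRAM incur a large power overhead, so reducing the number of row switches lowers memory-system power. This paper proposes a low-power design strategy that introduces on-chip memory to reduce the number of SDRAM row switches. The strategy first analyzes the instruction execution stream and finds that accesses to the stack and to global variables cause frequent row switches. It then places the stack in an on-chip stack memory and uses a greedy algorithm under a chip-area constraint to determine the size of the on-chip data memory and which global variables it holds. Experimental results show that even a small on-chip memory greatly reduces the number of row switches and significantly lowers power, with the reduction in row-switching accesses cutting power by 24% on average.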
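A greedy selection of global variables under a capacity limit can be sketched as follows. The ranking metric (profiled accesses per byte), the byte capacity standing in for the chip-area constraint, and the profile numbers are assumptions for illustration; the abstract does not state which metric the paper uses.

```python
def greedy_select(variables, capacity):
    """Pick variables for on-chip memory, most accesses per byte first.

    variables: list of (name, size_bytes, access_count) tuples
    capacity:  on-chip data memory size in bytes (stand-in for an area limit)
    """
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    chosen, used = [], 0
    for name, size, _accesses in ranked:
        if used + size <= capacity:  # skip anything that no longer fits
            chosen.append(name)
            used += size
    return chosen, used


# Hypothetical profile: (variable, size in bytes, number of accesses).
globals_profile = [
    ("buf", 512, 1000),
    ("state", 16, 400),
    ("table", 256, 2560),
    ("flag", 4, 20),
]

chosen, used = greedy_select(globals_profile, capacity=300)
print(chosen, used)  # ['state', 'table', 'flag'] 276
```

A greedy pass like this is fast but not guaranteed optimal; an exact answer would treat the selection as a knapsack problem.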
9.
10.
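In Ethernet over SDH (EOS) application design, media access control (MAC) frames received on the Ethernet side must be stored before the framer fetches the data. This article describes the implementation, which uses synchronous dynamic random access memory (SDRAM) as off-chip memory. Exploiting the characteristics of SDRAM, in particular its high speed and large capacity, reduces the buffer area on the application-specific integrated circuit (ASIC) chip and allows Ethernet data to be mapped accurately into the SDH transport network. The design has been verified on a field-programmable gate array (FPGA) and tested with a network analyzer.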
11.
Demands have been placed on dynamic random access memory (DRAM) not only to increase memory capacity and data transfer speed but also to reduce operating and standby currents. When a system uses DRAM, the restricted data retention time necessitates a refresh operation because each bit of the DRAM is stored as an amount of electrical charge in a storage capacitor. Power consumption for the refresh operation increases in proportion to memory capacity. A new method is proposed to reduce the refresh power consumption dynamically, when full memory capacity is not required, by effectively extending the memory cell retention time. Conversion from 1 cell/bit to 2N cells/bit reduces the variation of retention times among memory cells. The proposed method reduces the frequency of disturbance and power consumption by two orders of magnitude. Furthermore, the conversion itself can be realized very simply from the structure of the DRAM array circuit, while maintaining all conventional functions and operations in the full array access mode.
12.
Janapsatya A. Ignjatovic A. Parameswaran S. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(8):816-829
This paper describes a method to both reduce energy and improve performance in a processor-based embedded system. The target system uses a scratchpad memory instead of an instruction cache and dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric called concomitance, which is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining the locations within the program at which to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate that the method uses 41.9% less energy (on average) and improves performance by 40.0% (on average) when compared to a traditional cache system of identical size.
13.
In this paper we describe a multi-module, multi-port memory design procedure that satisfies area and/or energy constraints for embedded applications. Our procedure applies loop transformations and reorders array accesses to reduce the memory bandwidth, then performs memory allocation and assignment based on ILP models and heuristic-based algorithms. The specific problems include determination of (a) the memory configuration with minimum area, given the energy bound, (b) the memory configuration with minimum energy, given the area bound, and (c) the array allocation that minimizes energy consumption for a given memory configuration (number of modules, size and number of ports per module). The results obtained by the heuristics match well with those obtained by the ILP methods.
14.
Bernhard Wess 《Design Automation for Embedded Systems》1999,4(2-3):167-185
Modern digital signal processors (DSPs) provide dedicated address generation units (AGUs) which support data memory access
by indirect addressing with automatic address modification in parallel to other machine operations. There is no address computation
overhead if the next address is within the auto-modify range. Typically, optimization of data memory layout and address register
assignment reduces both execution time and code size of DSP programs. In this paper, we present an optimization technique
for integrated data memory layout generation and address register assignment. We use a generic AGU model which captures important
addressing capabilities of DSPs such as linear addressing, modulo addressing, auto-modifying, and indexing within a given
auto-modify range. Experimental results demonstrate that the proposed technique significantly outperforms existing optimization
strategies.
This revised version was published online in July 2006 with corrections to the Cover Date.
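The effect of data memory layout on address computation overhead can be illustrated with a deliberately simplified cost model: one address register, and an auto-modify range of plus or minus one word. Each pair of consecutive accesses whose addresses lie farther apart than that range costs one explicit address update. The function name, variable names, and access sequence are hypothetical, and the generic AGU model in the abstract (modulo addressing, indexing, several registers) is richer than this sketch.

```python
def address_update_cost(layout, accesses, auto_modify_range=1):
    """Count explicit address-register updates for one address register.

    layout:   dict mapping variable name -> memory address
    accesses: sequence of variable names in program order
    """
    cost = 0
    for prev, cur in zip(accesses, accesses[1:]):
        # A step within the auto-modify range is free; anything larger
        # needs a separate instruction to reload or adjust the register.
        if abs(layout[cur] - layout[prev]) > auto_modify_range:
            cost += 1
    return cost


accesses = ["a", "c", "a", "c", "b", "c"]

naive_layout = {"a": 0, "b": 1, "c": 2}  # declaration order
tuned_layout = {"a": 0, "c": 1, "b": 2}  # frequently paired variables adjacent

naive_cost = address_update_cost(naive_layout, accesses)
tuned_cost = address_update_cost(tuned_layout, accesses)
print(naive_cost, tuned_cost)  # 3 0
```

Placing `c` between `a` and `b` makes every transition in this sequence a step of one word, so all explicit updates disappear.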
15.
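The memory die stacking scheme and the redundancy sharing strategy strongly affect the yield of 3D memories. To improve 3D memory yield and reduce the number of TSVs needed for row and column redundancy, a redundancy sharing structure between adjacent layers is proposed. In this structure, the row and column redundancy of each memory die can be used not only by its own layer but also by adjacent layers. Based on this structure, a new memory die stacking scheme is proposed. By constructing selection constraints for memory dies, the scheme picks suitable dies each time to stack a 3D memory so that row and column redundancy is fully used. Experimental results show that, compared with similar international methods, the proposed scheme effectively improves 3D memory yield and reduces the number of TSVs needed for row and column redundancy.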
16.
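This paper presents a low-power motion estimator for single-chip integration of an MPEG2 MP@ML video encoder. It is based on a low-complexity, improved multi-resolution telescopic search algorithm. In the coarse-resolution motion vector search, a two-stage search with early termination, consisting of a motion-tracking search and an adaptive telescopic search, replaces the single conventional telescopic search. Simulation results show that the computational load of the new algorithm is only about 30% of that of the conventional algorithm while the decoded video quality stays the same. A systolic array and an improved tree-structured array accelerate the coarse-resolution search and the refinement search of motion vectors, respectively, and low power is achieved through techniques such as clock gating and operand masking.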
17.
Three-dimensional (3D) memories using through-silicon vias (TSVs) as vertical buses across memory layers will likely be the first commercial application of 3D integrated circuit technology. The memory dies to stack together in a 3D memory are selected by a die-selection method. Conventional die-selection methods do not achieve a high enough yield of 3D memories because 3D memories are typically composed of known-good-dies (KGDs), which are repaired using self-contained redundancies. In 3D memory, redundancy sharing between neighboring vertical memory dies using TSVs is an effective strategy for yield enhancement. With the redundancy sharing strategy, a known-bad-die (KBD) may become a KGD after bonding. In this paper, we propose a novel die-selection method using KBDs as well as KGDs for yield enhancement in 3D memory. The proposed die-selection method uses three search-space conditions, which reduce the search space for selecting memory dies to manufacture 3D memories. Simulation results show that the proposed die-selection method can significantly improve the yield of 3D memories under various fault distributions.
18.
19.
Verma M. Marwedel P. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(8):802-815
Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the Global Register Allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves close to the optimal results and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. Evaluation of the approaches on an ARM7 processor reports average energy and execution time reductions of 26% and 14%, respectively, over the static approach. Additional experiments comparing the overlaid scratchpads against unified caches of the same size report average energy and execution time savings of 20% and 10%, respectively. We also report data memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).