Similar literature
 Found 20 similar documents; search took 31 ms
1.
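A multimedia system-on-chip (M-SoC) is a typical multitask system chip. Its many on-chip data requesters must all access a single off-chip memory over the bus, so scheduling these bus requests sensibly is key to the system design. By analyzing in detail the characteristics and data traffic of the on-chip and off-chip data channels on the bus, this paper presents a bus scheduling strategy based on multichannel DMA and applies it successfully to the bus design of a single-chip audio/video decoding system chip. The strategy effectively combines DMA requests with bus arbitration and is generally applicable to multimedia system chips with multiple requesters on a chip-level bus.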

2.
Techniques for reducing interconnect power consumption in realizations of sum-of-products computations are presented. The proposed techniques reorder the sequence of accesses to the coefficient and data memories to minimize power-costly bit switching on the address and data buses. The reordering problem is systematically formulated by mapping it onto the traveling salesman problem (TSP) for both single and multiple functional unit architectures. The cost function driving the memory access reordering procedure explicitly takes into consideration both the static information related to the algorithms' coefficients and storage addresses and the data-related dynamic information. Experimental results from several typical digital signal-processing algorithms show that the proposed techniques lead to significant savings in bus switching activity. The power consumption in the data paths is reduced in most cases as well.
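The reordering idea in this abstract can be illustrated with a small sketch. It is not the authors' procedure: the cost here is only the Hamming distance between consecutive bus words, the ordering uses a plain nearest-neighbor TSP heuristic, and the names `toggles`, `total_toggles`, `greedy_order` and the sample addresses are hypothetical.

```python
def toggles(a: int, b: int) -> int:
    """Number of bus lines that switch when word b follows word a."""
    return bin(a ^ b).count("1")


def total_toggles(seq):
    """Total bit switching over a whole access sequence."""
    return sum(toggles(x, y) for x, y in zip(seq, seq[1:]))


def greedy_order(words):
    """Nearest-neighbor heuristic (an assumption, not the paper's solver):
    start at the first word, then always pick the remaining word that
    flips the fewest bus lines."""
    remaining = list(words[1:])
    order = [words[0]]
    while remaining:
        nxt = min(remaining, key=lambda w: toggles(order[-1], w))
        remaining.remove(nxt)
        order.append(nxt)
    return order


# Hypothetical 4-bit coefficient addresses in their original access order.
addresses = [0b0000, 0b1111, 0b0001, 0b1110, 0b0011]
reordered = greedy_order(addresses)

print(total_toggles(addresses), total_toggles(reordered))
```

Reordering is legal here only because the terms of a sum of products can be accumulated in any order; a real cost function would also weigh data-bus activity, as the abstract notes.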

3.
A nonvolatile ferroelectric-memory-based eight-context dynamically programmable gate array (DPGA) enables low-cost field-programmable systems through the elimination of off-chip nonvolatile memories and through its multicontext architecture. Since the read and program sequences for loading configuration data from/to the DPGA are securely protected, unauthorized users cannot access the stored configuration data. The associated configuration memory consists of an SRAM-based six-transistor, four-ferroelectric-capacitor cell. The developed configuration memory achieves an access time of 4 ns, comparable to standard SRAM and 20 times faster than conventional ferroelectric memory; furthermore, it features a nondestructive read operation and a stable data recall scheme. The employed logic block circuit can effectively improve the available number of logic gates for the multicontext scheme with minimum area overhead. The prototype nonvolatile DPGA is fabricated in 0.35-μm CMOS with ferroelectric memory technology, and an implementation of the Data Encryption Standard (DES) encryption/decryption functions on this DPGA operates properly up to 51 MHz at 3.3 V. The nonvolatile storage of configuration memory is verified for power-supply voltages as low as 1.5 V at room temperature, which is the lowest operating voltage ever reported for PbZrTiO₃ (PZT)-based ferroelectric memories.

4.
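Compared with 2D memory, 3D memory offers larger capacity, higher bandwidth, and lower latency and power consumption, but its yield is low. To address this problem, an effective built-in self-repair scheme for 3D memory is proposed. Each row or column of the memory array is divided into several row blocks or column blocks, and faulty cells are mapped between the row blocks or column blocks of different layers, so that faults in the same row or column of different layers are logically mapped into a single layer. One redundant row or column can thus repair more faults, which greatly increases redundancy utilization and the fault repair rate. Experimental results show that, compared with other repair schemes, the proposed scheme achieves a higher repair rate, needs fewer redundant resources to reach the same repair rate, and adds almost negligible area overhead.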

5.
Multibit error correction codes (MECCs) are an effective approach to mitigating multiple bit upsets (MBUs) in memories. As technology scales, combinational circuits have become more susceptible to radiation-induced single event transients (SETs), so transient faults in encoding and decoding circuits are more frequent than before. First, this paper proposes a new MECC, called the Mix code, to mitigate MBUs in fault-secure memories. Considering the structural characteristics of MECCs, Euclidean Geometry Low Density Parity Check (EG-LDPC) codes and Hamming codes are combined in the proposed Mix codes to protect memories against MBUs with low redundancy overheads. Then, a fault-secure scheme is presented that can tolerate transient faults in both the storage cells and the encoding and decoding circuits. The proposed fault-secure scheme has remarkably lower redundancy overheads than existing fault-secure schemes. Furthermore, the proposed scheme suits the data widths ordinarily accessed between the system bus and memory (e.g., 2^n bits). Finally, the proposed scheme has been implemented in Verilog and validated through a wide set of simulations. The experimental results reveal that the proposed scheme can effectively mitigate multiple errors in whole memory systems. It can not only reduce the redundancy overheads of the storage array but also improve the performance of MECC circuits in fault-secure memory systems.

6.
With increasing design complexity and performance requirements, data arrays in behavioral specifications are usually mapped to memories in behavioral synthesis. This paper describes a new algorithm that overcomes two limitations of previous work on the problem of memory allocation and array mapping to memories. Specifically, its key features are a tight link to the scheduling effect, which was totally or partially ignored by existing memory synthesis systems, and support for nonuniform access speeds among the ports of memories, which greatly diversifies the possible (practical) memory configurations. Experimental data on a set of benchmark filter designs are provided to show the effectiveness of the proposed exploration strategy in finding globally best memory configurations.

7.
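To address the large amount of storage needed between the wavelet transform and the coder in JPEG2000 hardware implementations, this paper proposes a code-block-based storage scheme. By maximizing reuse of on-chip storage sized to a code block and controlling its scheduling simply and efficiently, the scheme reduces hardware cost in both area and power. The implementation uses a line-based lifting transform structure and bit-plane-parallel coding, which improves efficiency and ensures real-time processing throughout. Experimental results show that, under real-time coding requirements, for a four-level 9/7 or 5/3 wavelet decomposition of a 512×512 image tile with a 32×32 code block, the proposed structure uses over 80% less storage than directly using external memory. The whole structure has been verified on an FPGA, and the system clock can run at 100 MHz.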

8.
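Cao Wei, Lin Zhenghui. Microelectronics (《微电子学》), 2000, 30(6): 395-398
In digital systems described in VHDL, large numbers of arrays are often used to represent the memories of the real system. Reducing memory operation time is an important way to raise overall system speed, and improving memory address generation is one route to it. This article examines several new techniques related to this problem [1]; however, using them adds some redundant memory cells, particularly in the multi-array case. A multi-heuristic combined algorithm is therefore proposed, aiming to minimize the redundancy and improve computation speed at the same time.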

9.
In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. Data elements in a stride access can be retrieved in parallel with parallel memories, where the idea is to increase memory bandwidth with several memory modules working in parallel and to feed the processor with only the necessary data. Arbitrary stride access capability with interleaved memories is described in previous research, where the skewing scheme is changed at run time according to the currently used stride. This paper presents improved schemes that are adapted to parallel memories. The proposed novel parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the possible access locations are unrestricted, and the accessed data element count equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18-micrometer CMOS process with memory module counts from 2 to 32. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to be a trade-off between application-specific and highly configurable parallel memory systems.
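The conflict problem that such skewing schemes address can be seen in a minimal sketch. It assumes the simplest possible mapping, low-order interleaving (module = address mod N), which is not the scheme proposed in the paper; `modules_hit` and `conflict_free` are hypothetical helper names.

```python
from math import gcd


def modules_hit(base: int, stride: int, n_modules: int):
    """Module index touched by each of the n_modules elements of a
    stride access, under plain low-order interleaving (an assumption)."""
    return [(base + i * stride) % n_modules for i in range(n_modules)]


def conflict_free(base: int, stride: int, n_modules: int) -> bool:
    """True when every element lands in a different memory module,
    so all of them can be fetched in one parallel access."""
    return len(set(modules_hit(base, stride, n_modules))) == n_modules


# With 4 modules, only strides coprime to 4 avoid conflicts.
for stride in range(1, 9):
    print(stride, conflict_free(0, stride, 4), gcd(stride, 4) == 1)
```

Under this fixed mapping, even strides collide in a power-of-two module count, which is why a stride-dependent skewing scheme is needed to serve all constant strides.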

10.
The aim of this paper is to develop a testing scheme for EPROM memories. The starting point is an assumed general model of the EPROM memory logic structure, for which an adequate fault model is developed. The class of faults considered includes faults in input-output buffers, faults in address decoding circuitry, and faults in memory cell arrays. The proposed testing scheme makes it possible to detect all faults included in the assumed fault model, and it takes technological and economic aspects into account. The proposed method is illustrated by detailed solutions for the 2716 EPROM memory.

11.
An ultrahigh-speed 4.5-Mb CMOS SRAM with 1.8-ns clock-access time, 1.8-ns cycle time, and 9.84-μm² memory cells has been developed using 0.25-μm CMOS technology. Three key circuit techniques for achieving this high speed are a decoder using source-coupled-logic (SCL) circuits combined with reset circuits, a sense amplifier with nMOS source followers, and a sense-amplifier activation-pulse generator that uses a duplicate memory-cell array. The proposed decoder can reduce the delay time between the address input and the word-line signal of the 4.5-Mb SRAM to 68% of that of an SRAM with conventional circuits. The sense amplifier with nMOS source followers can reduce not only the delay time of the sense amplifier but also the power dissipation. In the SRAM, the sense-amplifier activation pulse must be input into the sense amplifier after the signal from the memory cell is input into the sense amplifier. The large timing margin required between these signals results in a long access time in the conventional SRAM. The sense-amplifier activation-pulse generator that uses a duplicate memory-cell array can reduce the required timing margin to less than half of the conventional margin. These three techniques are especially useful for realizing ultrahigh-speed SRAMs, which will be used as on-chip or off-chip cache memories in processor systems.

12.
Systolic arrays (SAs) are very efficient architectures for multimedia processing, database management, and scientific computing applications that are characterized by a high number of data accesses. However, in these data transfer and storage intensive applications, memory access is often the limiting factor for computation speed. Since the memory subsystem dominates the cost (area), performance, and power consumption of the SA, special attention must be paid to how the memory subsystem can benefit from customization. In this paper we consider the memory organization of a linear systolic array with bidirectional links (called BLSA) suitable for implementing a broad class of algorithms. We assume that memory is organized into distributed smaller physical memory modules. To provide high bandwidth in data access we have designed special hardware, called the address generator unit (AGU). The function of the AGU is threefold. First, during initialization, it transforms the host address space into the BLSA address space. Second, it provides efficient memory data access during BLSA operation. Third, it performs fast data transfer between the BLSA and the host at the end of the computation. In this article, we examine the impact on area and performance of memory-access-related circuitry in eliminating computationally intensive offset address calculations performed in software, by implementing the needed address transformations with the AGUs. By involving hardware AGUs we achieved a speedup of approximately two compared to the software implementation of address calculation, with a hardware overhead of only 7.6% in the worst case.

13.
We present a novel coding scheme for reducing bus power dissipation. The presented approach is well suited to driving off-chip buses, where the line capacitance is a dominant factor. A distinctive feature of the technique is the dynamic reordering of bus line positions, in order to minimize the toggling activity on physical bus wires. The effectiveness of the approach is demonstrated through cycle-accurate simulation of industrial benchmarks in conjunction with post-layout evaluation of speed, power and area overhead.

14.
We introduce a novel memory architecture that can count the occurrences of patterns on a system's bus, a task known as profiling. Such profiling can serve a variety of purposes, like detecting a microprocessor's software hot spots or frequently used data values, which can be used to optimize various aspects of the system. The memory, which we call ProMem, is based on a pipelined binary search tree structure, yielding several beneficial features, including nonintrusiveness, accurate counts, excellent size and power efficiency, very fast access times, and the use of standard memories with only simple additional logic. The main limitation is that the set of potential patterns must be preloaded into the memory. We describe the ProMem architecture, and show excellent size and performance advantages compared with content-addressable memory (CAM) based designs.
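A software analogue of this counting structure can be sketched as follows. It only mirrors the lookup logic: binary search over a sorted, preloaded pattern list stands in for the binary search tree, the hardware pipelining is not modeled, and the class name `PatternCounter` and the sample bus trace are hypothetical.

```python
from bisect import bisect_left


class PatternCounter:
    """Counts occurrences of preloaded patterns seen on a bus trace."""

    def __init__(self, patterns):
        # The pattern set must be known up front, as the abstract notes.
        self.patterns = sorted(set(patterns))
        self.counts = [0] * len(self.patterns)

    def _find(self, value):
        """Binary search; returns the pattern's index or None."""
        i = bisect_left(self.patterns, value)
        if i < len(self.patterns) and self.patterns[i] == value:
            return i
        return None

    def observe(self, value):
        """Record one bus value; values outside the set are ignored."""
        i = self._find(value)
        if i is not None:
            self.counts[i] += 1

    def count(self, pattern):
        i = self._find(pattern)
        return 0 if i is None else self.counts[i]


# Hypothetical instruction addresses observed on the bus.
counter = PatternCounter([0x100, 0x104, 0x200])
for addr in [0x100, 0x104, 0x100, 0x300, 0x100]:
    counter.observe(addr)
print(counter.count(0x100), counter.count(0x104), counter.count(0x300))
```

Each lookup takes a number of comparisons logarithmic in the pattern count, which is the property a pipelined tree exploits by assigning one tree level to each pipeline stage.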

15.
This paper presents a novel approach to the synthesis of interleaved memory systems that is especially suited for application-specific processors. Our synthesis system generates optimized interleaved memories for a specific algorithm and finds the best mapping of the arrays in that algorithm onto the memory system to achieve high performance. The design space is four-dimensional (4-D) and comprises the number of memory banks, the type of memory components, the storage scheme, and the range of the clock period in the system. Optimal designs are found among the Pareto points (a set of nondominated points in the design space) computed for our memory model under the performance and cost criteria set by the designer. The memory model includes all the components of an interleaved memory system and covers lookup-table-based address generation with data alignment. The synthesis is based on a general periodic storage scheme, which enables efficient handling of irregular and overlapped access patterns. The synthesis process is an exhaustive search of the heavily pruned design space, and the pruning is based on mathematically proven properties of periodic storage schemes. This paper presents the theorems, the synthesis algorithm, and the methods of effective word and bank address generation. Examples are given to illustrate the effectiveness of our method.

16.
This paper proposes a code placement problem, its ILP formulation, and a heuristic algorithm for reducing the total energy consumption of embedded processor systems that include a CPU core and on-chip and off-chip memories. Our approach exploits a non-cacheable memory region to use the cache memory effectively and, as a result, reduces the number of off-chip accesses. Our algorithm simultaneously finds a code layout for a cacheable region, a scratchpad region, and the other non-cacheable region of the address space so as to minimize the total energy consumption of the processor system. Experiments using a commercial embedded processor and an off-chip SDRAM demonstrate that our algorithm reduces the energy consumption of the processor system by 23% without any performance degradation compared to the best result achieved by the conventional approach.

17.
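As the bridge between the processor and the system bus, the cache is a major source of chip power consumption, so low-power cache design matters greatly in embedded chip design. Traditional cache designs generally depend on a specific architecture, are hard to integrate into different systems, and lack generality. This paper proposes a low-power, high-efficiency unified cache IP design with an AHB-AXI dual-bus structure. Experimental results show that the design can significantly reduce cache power consumption and improve system performance.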

18.
Caches, which account for much of a CPU's chip area and transistor count, are likely targets for transient single and multiple faults induced by energetic particles. This paper presents: (1) a new fault detection scheme for the tag arrays of cache memories and (2) a cache architecture that improves performance as well as dependability. In this architecture, the cache space is divided into sets of different sizes and different tag lengths. Using the proposed fault detection scheme, GParity, when single or multiple errors are detected in a word, the word is rewritten with its correct data from memory and its GParity code is recomputed. The error detection scheme and the cache architecture have been evaluated using trace-driven simulation with soft error injection and SPEC 2000 applications. Moreover, reliability and mean-time-to-failure (MTTF) equations are derived and estimated. The results for the GParity code are compared with those of other protection codes, of memory systems without redundancy, and of memory systems with single parity codes. The results show that the error detection improvement varies between 66% and 96% compared with the single parity already available in microprocessors.

19.
Hardware synthesis from dataflow graphs of signal processing systems is a growing research area as focus shifts to high-level design methodologies. For data-intensive systems, dataflow-based synthesis can lead to inefficient memory usage due to the restrictive nature of synchronous dataflow and its inability to easily model data reuse. This paper explores how dataflow graph changes can be used to drive both the on-chip and off-chip memory organisation and how these memory architectures can be mapped to a hardware implementation. By exploiting the data reuse inherent in many image processing algorithms and by creating memory hierarchies, off-chip memory bandwidth can be reduced by a factor of a thousand from the original dataflow-graph-level specification of a motion estimation algorithm, with a minimal increase in memory size. This analysis is verified using results gathered from an implementation of the motion estimation algorithm on a Xilinx Virtex-4 FPGA, where the delay between the memories and processing elements drops from 14.2 ns to 1.878 ns through refinement of the memory architecture. Care must be taken when modeling these algorithms, however, as inefficiencies in these models can easily translate into overuse of hardware resources.

20.
This paper describes an approach to a low-power, high-speed data transfer scheme for the internal data bus of an AS-Memory, which combines ASIC circuitry with a memory array. Pulse width modulation, operated asynchronously, is applied to the wide internal data bus. An automatic-gain-controlled amplifier that amplifies the many small signals from the memory array is also newly developed to achieve fast data output. Applying this architecture to an AS-Memory reduces the area and power consumption of the internal data bus interface to 25% and 36%, respectively.
