Similar Documents
20 similar documents found (search time: 31 ms)
1.
Real-time on-chip thermal signals are important information for dynamic thermal management (DTM) of integrated circuit chips. A method for reconstructing the chip thermal signal from non-uniform samples based on Voronoi diagrams is proposed. On a 16-core processor with 9 sensors per core, the method achieves a minimum average error of 2.25%, and its advantage is verified by comparing its average runtime against the inverse-distance-weighted averaging method.
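As a rough illustration of the inverse-distance-weighted baseline that the Voronoi-based method is compared against, the sketch below interpolates a thermal map from scattered sensor readings. The sensor coordinates, grid size, and power parameter are hypothetical; the paper's Voronoi construction itself is not reproduced here.

```python
import numpy as np

def idw_reconstruct(sensor_xy, sensor_temp, grid_shape, power=2.0, eps=1e-9):
    """Inverse-distance-weighted reconstruction of an on-chip thermal map
    from sparse sensor readings (the baseline the Voronoi method is compared to)."""
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    # Distance from every grid point to every sensor.
    d = np.linalg.norm(grid[:, None, :] - sensor_xy[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)           # inverse-distance weights
    t = (w @ sensor_temp) / w.sum(axis=1)  # weighted average per grid point
    return t.reshape(grid_shape)

# Hypothetical example: 9 sensors on a 64x64 thermal grid for one core.
rng = np.random.default_rng(0)
sensors = rng.uniform(0, 64, size=(9, 2))
temps = 60 + 10 * rng.random(9)            # degrees Celsius
thermal_map = idw_reconstruct(sensors, temps, (64, 64))
print(thermal_map.shape, thermal_map.min(), thermal_map.max())
```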

2.
Power consumption, particularly runtime leakage, in long on-chip buses has grown to be an unacceptable portion of the total power budget due to heavy buffer insertion used to combat RC delays. In this paper, we propose a new bus encoding algorithm and circuit scheme for on-chip buses that eliminates capacitive crosstalk while simultaneously reducing total power. We utilize a buffer design approach with a selective use of high-threshold voltage transistors and couple this buffer design with a novel bus encoding scheme. The proposed encoding scheme significantly reduces total power by 26% and runtime leakage power by 42% while also eliminating capacitive crosstalk. In addition, the proposed encoding is specifically optimized to reduce the complexity of the encoding logic, allowing for a significant reduction in overhead which has not been considered in previous bus encoding work.
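Crosstalk-avoidance encodings of the kind this abstract describes target the worst-case capacitive-coupling event, in which two adjacent bus lines switch in opposite directions on the same cycle. The checker below is a generic illustration of that event, not this paper's specific encoding.

```python
def has_opposing_adjacent_transitions(prev_word, next_word, width):
    """Check whether any two adjacent bus lines switch in opposite directions,
    the worst-case capacitive-coupling event that crosstalk-avoidance codes forbid."""
    for i in range(width - 1):
        a = ((next_word >> i) & 1) - ((prev_word >> i) & 1)              # -1, 0, +1
        b = ((next_word >> (i + 1)) & 1) - ((prev_word >> (i + 1)) & 1)
        if a * b == -1:            # one line rises while its neighbour falls
            return True
    return False

print(has_opposing_adjacent_transitions(0b0101, 0b1010, 4))  # True: every pair opposes
print(has_opposing_adjacent_transitions(0b0000, 0b1111, 4))  # False: all lines rise together
```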

3.
In SDRAM-dominated memory systems, SDRAM row-switching accesses incur substantial power overhead, and reducing the number of row switches lowers memory-system power. This paper proposes a low-power design strategy that introduces on-chip memory to reduce the number of SDRAM row switches. The strategy first analyzes the instruction execution stream and finds that accesses to the stack and to global variables cause frequent row switches; the stack is then placed in an on-chip stack memory, and a greedy algorithm under a chip-area constraint determines the size of the on-chip data memory and the global variables placed in it. Experimental results show that introducing a small on-chip memory greatly reduces the number of row switches and significantly lowers power consumption; the power attributed to row-switching accesses drops by 24% on average.
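The abstract mentions a greedy algorithm under a chip-area constraint for choosing which global variables go into the on-chip data memory. A minimal sketch of one plausible formulation follows: variables are ranked by row switches saved per byte and packed until the budget is exhausted. The cost model, field names, and byte-based budget are assumptions, not the paper's actual algorithm.

```python
def select_globals(variables, area_budget_bytes):
    """Greedy selection of global variables for an on-chip data memory.

    variables: list of (name, size_bytes, row_switches_saved) tuples.
    Returns the chosen variable names, bytes used, and total estimated savings.
    A sketch under assumed cost metrics, not the paper's exact algorithm.
    """
    # Rank by row switches saved per byte of on-chip memory consumed.
    ranked = sorted(variables, key=lambda v: v[2] / v[1], reverse=True)
    chosen, used, saved = [], 0, 0
    for name, size, benefit in ranked:
        if used + size <= area_budget_bytes:
            chosen.append(name)
            used += size
            saved += benefit
    return chosen, used, saved

# Hypothetical profile data: (variable, size in bytes, row switches avoided).
profile = [("rx_buf", 512, 4000), ("coeffs", 256, 3500),
           ("lut", 2048, 5000), ("state", 64, 1200)]
print(select_globals(profile, area_budget_bytes=1024))
```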

4.
Energy consumption is one of the important parameters to be optimized during the design of portable embedded systems. Thus, most contemporary portable devices feature low-power processors coupled with on-chip memories (e.g., caches, scratchpads). Scratchpads are better than traditional caches in terms of power, performance, area, and predictability. However, unlike caches, they depend upon software allocation techniques for their utilization. In this paper, we present scratchpad overlay techniques which analyze the application and insert instructions to dynamically copy both variables and code segments onto the scratchpad at runtime. We demonstrate that the problem of overlaying the scratchpad is an extension of the global register allocation problem. We present optimal and near-optimal approaches for solving the scratchpad overlay problem. The near-optimal scratchpad overlay approach achieves results close to the optimal and is significantly faster than the optimal approach. Our approaches improve upon the previously known static allocation technique for assigning both variables and code segments onto the scratchpad. The evaluation of the approaches for the ARM7 processor reports average energy and execution-time reductions of 26% and 14%, respectively, over the static approach. Additional experiments comparing the overlaid scratchpads against unified caches of the same size report average energy and execution-time savings of 20% and 10%, respectively. We also report data memory energy reductions of 45%-57% due to the insertion of a 1024-byte scratchpad memory in the memory hierarchy of a digital signal processor (DSP).

5.
AVS1-P2 is the newest video standard of Audio Video coding Standard (AVS) workgroup of China, which provides close performance to H.264/AVC main profile with lower complexity. In this paper, a platform-independent software package with macroblock-based (MB-based) architecture is proposed to facilitate AVS video standard implementation on embedded system. Compared with the frame-based architecture, which is commonly utilized for PC platform oriented video applications, the MB-based decoder performs all of the decoding processes, except the high-level syntax parsing, in a set of MB-based buffers with adequate size for saving the information of the current MB and the neighboring reference MBs to minimize the on-chip memory and to save the time consumed in on-chip/off-chip data transfer. By modifying the data flow and decoding hierarchy, simulating the data transfer between the on-chip memory and the off-chip memory, and modularizing the buffer definition and management for low-level decoding kernels, the MB-based system architecture provides over 80% reduction in on-chip memory compared to the frame-based architecture when decoding 720p sequences. The storage complexity is also analyzed by referencing the performance evaluation of the MB-based decoder. The MB-based decoder implementation provides an efficient reference to facilitate development of AVS applications on embedded system. The complexity analysis provides rough storage complexity requirements for AVS video standard implementation and optimization.

6.
We propose a method to tailor the data bus width to the requirements of an application that is to run on a custom processor. The proposed estimation method is a simulation-based tool that uses Extreme Value Theory to estimate the width of an off-chip or on-chip data bus based on the characteristics of the application. It finds the minimum number of bus lines needed for the bus connecting the custom processor to other units so that a multicycle data transfer on the bus is extremely unlikely. The potential target platforms include embedded systems where a custom processor (i.e., an ASIC or an FPGA) in a system-on-a-chip or a system-on-a-board is connected to memory, I/O, and other processors through a shared bus or through point-to-point links. Our experimental and analytical results show that our estimation method can reduce the data bus width and cost by up to 66%, with an average of 38% for nine benchmarks. The narrower data bus allows us to increase the spacing between the bus lines using the silicon area freed from the eliminated bus lines. This reduces interwire capacitance, which in turn leads to a significant reduction in bus energy consumption. Bus energy can potentially be reduced by up to 89% for on-chip data buses, with an average of 74% for seven benchmarks. Also, the reduction in interwire capacitance improves the bus propagation delay: on-chip bus propagation delay can be reduced by up to 68%, with an average of 51% for seven benchmarks, using a narrower custom data bus.
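A minimal sketch of how Extreme Value Theory might be used to size a data bus: fit a generalized extreme value (GEV) distribution to per-block maxima of the transferred values, pick a very high quantile, and take the number of bits needed to represent it as the bus width. The block size, quantile, trace generator, and use of SciPy's genextreme are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import genextreme

def estimate_bus_width(values, block=1024, quantile=0.999999):
    """Estimate an on-/off-chip data bus width with Extreme Value Theory.

    Fit a GEV distribution to block maxima of |values| and choose enough
    bus lines that a wider (multicycle) transfer is extremely unlikely.
    A sketch of one plausible EVT formulation, not the paper's exact method.
    """
    mags = np.abs(np.asarray(values, dtype=np.int64))
    n_blocks = len(mags) // block
    maxima = mags[: n_blocks * block].reshape(n_blocks, block).max(axis=1)
    shape, loc, scale = genextreme.fit(maxima)
    extreme = genextreme.ppf(quantile, shape, loc=loc, scale=scale)
    return int(np.ceil(np.log2(max(extreme, 1) + 1)))   # bits for the magnitude

# Hypothetical trace: mostly small values with rare large outliers.
rng = np.random.default_rng(1)
trace = rng.geometric(p=1e-4, size=200_000)
print("estimated data bus width (bits):", estimate_bus_width(trace))
```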

7.
To address the low utilization of on-chip RAM in existing wireless sensor network (WSN) nodes, an improved linked-list-based memory management scheme is designed. Assuming an event-driven programming model, the scheme divides the RAM into a static memory region and a dynamic memory region and, through memory isolation, separates the memory-management structures from the managed memory space in physical memory, thereby improving node memory utilization. Tests show that the average memory write rate reaches 500 kb/s, and with the memory-swapping feature enabled, actual memory utilization approaches 80%. The scheme thus provides a good solution for improving node memory utilization.
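Below is a minimal toy model of a linked-list (free-list) allocator of the general kind the abstract describes: a fixed RAM is split into a static region and a dynamic region, and the bookkeeping lives outside the managed space. The region sizes, first-fit policy, and API names are hypothetical, not the paper's design.

```python
class FreeListRAM:
    """Toy model of a WSN node RAM split into static and dynamic regions.

    The free list (management structure) is kept in a separate Python list,
    mirroring the 'memory isolation' idea of keeping bookkeeping outside the
    managed space. Sizes and the first-fit policy are illustrative only.
    """

    def __init__(self, total=4096, static_size=1024):
        self.static_end = static_size
        # Free list over the dynamic region: list of [offset, length].
        self.free = [[static_size, total - static_size]]
        self.used = {}   # offset -> length

    def alloc(self, size):
        for i, (off, length) in enumerate(self.free):
            if length >= size:                     # first fit
                self.used[off] = size
                if length == size:
                    self.free.pop(i)
                else:
                    self.free[i] = [off + size, length - size]
                return off
        return None                                # out of dynamic memory

    def free_block(self, off):
        size = self.used.pop(off)
        self.free.append([off, size])              # (no coalescing in this sketch)

ram = FreeListRAM()
a = ram.alloc(200)
b = ram.alloc(300)
ram.free_block(a)
print(sorted(ram.free))
```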

8.
In this paper, we propose a new early z-test which requires a minimized internal memory while removing redundant z and color reads as well as texture reads. The proposed method determines whether a pixel is screened by a certain mask plane containing the history of pixels appearing in front of it. If a pixel is screened by the plane, it can be removed. Given an initial position, the method adaptively updates the plane position to maximize the number of rejected pixels. As a result, on average 39.9% of the memory bandwidth and 21.2% of the total power consumption is saved with only a 256 B on-chip memory. The proposed method was implemented in a multimedia system-on-chip with 61 k application-specific integrated circuit (ASIC) gates using a 0.13-μm CMOS process technology.

9.
We propose a new content addressable memory (CAM) architecture for the large-scale integration and low-power operation of a network router application. This CAM reduces the entry count by an average of 52%, using a newly developed one-hot-spot block code. This code eliminates redundancy in a memory cell and improves the efficiency of IP address compression. To implement the proposed code, a hierarchical match-line structure and an on-chip entry compression/extraction scheme are introduced. With this architecture, a search-depth control scheme deactivates unnecessary search lines and reduces power consumption by 45%. Using a DRAM cell, our new CAM can achieve 1.5 million entries in 0.13-μm technology, which is six times more than a conventional static ternary CAM.

10.
The performance of a processor core depends on the configuration parameters and the utilization of on-chip memory in multimedia applications such as image, video, and audio processing. The design of the on-chip memory architecture is critical for a power- and area-efficient design without compromising quality in data-intensive computing applications. This paper proposes the design of a high-speed, area- and energy-efficient Static Segment On-Chip (SSOC) memory for error-tolerant applications. In this static segmentation method, the n-bit data array is reduced to an m-bit data array covering the significant portion of the input data, trading a small loss of accuracy for balanced design metrics. The proposed m-bit static segmentation algorithm is implemented and verified in a Single Port Static Random Access Memory (SP SRAM) architecture for approximate computing applications. Overall simulation results show that the proposed 4-bit SSOC SP SRAM design provides 49.02% area savings, 50.62% power reduction, and 16.92% speed improvement at the cost of 0.64% Peak Signal to Noise Ratio (PSNR), and exhibits the same visual quality as the existing 8-bit conventional on-chip SP SRAM design in image processing applications.
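One common form of static segmentation for approximate memories stores only an m-bit segment of each n-bit word: the lower segment when the upper bits carry no information, otherwise the upper segment with the low bits dropped. The sketch below illustrates that general idea for 8-bit data reduced to a 4-bit segment plus a one-bit selector; the exact segmentation rule of the proposed SSOC SRAM may differ.

```python
def ss_compress(value, n=8, m=4):
    """Statically segment an n-bit value into an m-bit segment + 1 selector bit.

    If the upper (n - m) bits are all zero, the lower segment is stored exactly;
    otherwise the upper segment is stored and the low bits are dropped.
    This mirrors the general static-segmentation idea; the paper's exact
    SSOC encoding may differ.
    """
    upper = value >> m
    if upper == 0:
        return 0, value & ((1 << m) - 1)       # selector 0: exact low segment
    return 1, upper                            # selector 1: truncated high segment

def ss_decompress(selector, segment, n=8, m=4):
    if selector == 0:
        return segment
    # Reconstruct with the midpoint of the dropped low bits to reduce bias.
    return (segment << m) | (1 << (m - 1))

for pix in (9, 200):
    sel, seg = ss_compress(pix)
    print(pix, "->", ss_decompress(sel, seg))
```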

11.
Multiview video coding (MVC) plays an important role in a 3-D video system. In addition, the resolution of HDTV is increasing to present a more vivid perception for users. To realize real-time processing of dozens of TOPS, a VLSI solution is necessary. However, ultra-high computational complexity, a large amount of external memory bandwidth and on-chip SRAM, and complex MVC prediction structures are the three main design challenges in implementing an MVC hardware architecture. In this paper, an MVC single-chip encoder is proposed for H.264/AVC Multiview High Profile and High Profile for 3-D and quad full high definition (QFHD) TV applications, respectively. The 4096 × 2160p multiview video encoder chip is implemented on an 11.46 mm² die with 90 nm CMOS technology. An eight-stage macroblock pipelined architecture with the proposed system scheduling and a cache-based prediction core supports real-time processing from one-view 4096 × 2160p to seven-view 720p videos. The 212 Mpixels/s throughput is 3.4 to 7.7 times higher than previous work. A power efficiency of 407 Mpixels/W is achieved, and 94% of on-chip SRAM size and 79% of external memory bandwidth are saved by the proposed techniques.

12.
Reliability evaluation methodologies have become important in circuit design. In this paper, we focus on the probabilistic transfer matrix (PTM), a gate-level approach that accurately assesses the reliability of a combinational circuit at the cost of simulation runtime and memory usage. In order to improve its efficiency, several methodologies based on the traditional PTM are proposed. A general tool is developed to calculate the reliability of a circuit with efficient computation methods based on an optimized PTM (denoted ECPTM), which reduces runtime and memory usage. Experiments demonstrate how the proposed simulation framework, combined with the traditional PTM method, provides a significant reduction in computation runtime and memory usage on different benchmark circuits.
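In the probabilistic-transfer-matrix formalism this abstract builds on, each gate is a matrix of output probabilities conditioned on its inputs; gates in series compose by matrix multiplication (and gates in parallel by the Kronecker product), and reliability is read off by comparing against the ideal truth table. A minimal sketch for a noisy NAND followed by an inverter is shown below; the per-gate error probability is assumed, and the ECPTM optimizations themselves are not reproduced.

```python
import numpy as np

def gate_ptm(truth_table, p_err):
    """PTM of a 1-output gate: rows index input patterns, columns output values.
    Each row puts probability 1-p_err on the correct output and p_err on the flip."""
    ptm = np.full((len(truth_table), 2), p_err)
    for row, out in enumerate(truth_table):
        ptm[row, out] = 1.0 - p_err
    return ptm

p = 0.05                                   # assumed per-gate error probability
nand = gate_ptm([1, 1, 1, 0], p)           # inputs 00, 01, 10, 11
inv  = gate_ptm([1, 0], p)

# Serial composition: PTM of NAND followed by NOT (i.e. a noisy AND).
circuit = nand @ inv

# Ideal AND outputs for the same input ordering, uniform input distribution.
ideal = [0, 0, 0, 1]
reliability = np.mean([circuit[row, out] for row, out in enumerate(ideal)])
print("circuit reliability:", reliability)   # 0.95**2 + 0.05**2 = 0.905
```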

13.
FFT algorithms have memory access patterns that prevent many architectures from achieving high computational utilization, particularly when parallel processing is required to achieve the desired levels of performance. Starting with a highly efficient hybrid linear algebra/FFT core, we co-design the on-chip memory hierarchy, on-chip interconnect, and FFT algorithms for a multicore FFT processor. We show that it is possible to achieve excellent parallel scaling while maintaining power and area efficiency comparable to that of the single-core solution. The result is an architecture that can effectively use up to 16 hybrid cores for transform sizes that can be contained in on-chip SRAM. When configured with 12 MiB of on-chip SRAM, our technology evaluation shows that the proposed 16-core FFT accelerator should sustain 388 GFLOPS of nominal double-precision performance, with power and area efficiencies of 30 GFLOPS/W and 2.66 GFLOPS/mm², respectively.
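Taken at face value, the quoted efficiencies imply the accelerator's total power and area directly: 388 GFLOPS at 30 GFLOPS/W is roughly 12.9 W, and at 2.66 GFLOPS/mm² roughly 146 mm². The snippet below merely reproduces that arithmetic, assuming the quoted figures cover the full 16-core configuration.

```python
gflops = 388.0
power_eff = 30.0      # GFLOPS/W, from the abstract
area_eff = 2.66       # GFLOPS/mm^2, from the abstract
print("implied power (W):", round(gflops / power_eff, 1))    # ~12.9 W
print("implied area (mm^2):", round(gflops / area_eff, 1))   # ~145.9 mm^2
```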

14.
Ensuring the integrity of the power supply in the power distribution networks (PDNs) of a chip is essential for building reliable high-performance chips. To ensure power integrity, accurate and memory- and time-efficient approaches for simulating the power-supply noise in the on-chip PDN are essential. In this paper, a finite-difference formulation based on the latency insertion method (LIM) is employed for simulating the power-supply noise in the on-chip PDN. A new common-mode-type equivalent circuit is proposed. In this equivalent circuit, a capacitance to ideal ground may not be present at every node, and the nodes can be capacitively coupled to each other. To avoid inverting a large non-banded matrix, a small capacitance to ground is added to any node that does not have a capacitance to ground, and a small series inductance is added to any floating capacitor that does not have a series inductance. Approximate closed-form expressions to compute the values of these capacitances to ground and series inductances are proposed. The accuracy of the LIM-enabled transient simulation and of the proposed closed-form expressions is demonstrated. The memory and time complexity of the simulation for each time step are shown to be O(N_n) each, where N_n is the number of nodes in the equivalent circuit. A stability condition is derived for the first time for a multidimensional inhomogeneous RLC circuit, and an upper bound on the time step is derived from it. Using this bound on the time step, the runtime of the overall transient simulation is estimated to be approximately proportional to N_n^2 to N_n^2.5 for N_n on the order of millions.

15.
A fast fractal image coding based on kick-out and zero contrast conditions   (Total citations: 15; self-citations: 0; citations by others: 15)
A fast algorithm for fractal image coding based on a single kick-out condition and zero-contrast prediction is proposed in this paper. The single kick-out condition avoids a large number of range-domain block matches when finding the best-matched domain block. An efficient method for zero-contrast prediction is also proposed, which can determine whether the contrast factor for a domain block is zero and compute the corresponding difference between the range block and the transformed domain block efficiently and exactly. The proposed algorithm achieves the same reconstructed image quality as the exhaustive search while greatly reducing the required computation and runtime. In addition, the algorithm does not need any preprocessing step or additional memory for its implementation, and it can be combined with other fast fractal algorithms to further improve speed. Experimental results show that the runtime is reduced to about 50% of that of the exhaustive search method. When combined with the DCT Inner Product algorithm, the required runtime can be further reduced by about 50%. The proposed algorithm was also compared with two other fast fractal algorithms; experimental results show that it achieves better efficiency and requires a much smaller amount of memory for implementation.
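The zero-contrast case can be illustrated with the standard affine fractal block match: a range block R is approximated by s·D + o for a domain block D, with the least-squares contrast s = (N ΣRD − ΣR ΣD) / (N ΣD² − (ΣD)²). When s quantizes to zero, the best approximation collapses to the range block's mean and the matching error is simply its variance, so the transformed domain block never has to be computed. A minimal sketch of that relationship (not the paper's exact kick-out test) follows.

```python
import numpy as np

def contrast_and_error(range_blk, domain_blk):
    """Least-squares contrast s and offset o for approximating R by s*D + o,
    plus the resulting mean-squared matching error."""
    r, d = range_blk.ravel().astype(float), domain_blk.ravel().astype(float)
    n = r.size
    denom = n * (d @ d) - d.sum() ** 2
    s = 0.0 if denom == 0 else (n * (r @ d) - r.sum() * d.sum()) / denom
    o = (r.sum() - s * d.sum()) / n
    mse = np.mean((r - (s * d + o)) ** 2)
    return s, o, mse

# If s quantizes to zero, the error reduces to the variance of the range block,
# which can be precomputed without touching the domain block at all.
rng = np.random.default_rng(2)
R = rng.integers(0, 256, (8, 8))
D = rng.integers(0, 256, (8, 8))
s, o, mse = contrast_and_error(R, D)
print(s, o, mse, "variance of R:", R.astype(float).var())
```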

16.
On-chip L1 and L2 caches represent a sizeable fraction of the total power consumption of microprocessors. In nanometer-scale technology, subthreshold leakage is becoming one of the dominant components of the total power consumption of those caches. In this study, we present optimization techniques to reduce the subthreshold leakage power of on-chip caches assuming that multiple threshold voltages (V_T) are available. First, we show a cache leakage optimization technique that examines the tradeoff between access time and subthreshold leakage power by assigning distinct V_T's to each of the four main cache components: address bus drivers, data bus drivers, decoders, and static random access memory (SRAM) cell arrays with sense amplifiers. Second, we show optimization techniques to reduce the leakage power of L1 and L2 on-chip caches without affecting the average memory access time. The key results are: 1) two additional high V_T's are enough to minimize leakage in a single cache (3 V_T's if we include a nominal low V_T for the microprocessor core logic); 2) if the L1 size is fixed, increasing the L2 size can result in much lower leakage without reducing average memory access time; 3) if the L2 size is fixed, reducing the L1 size may result in lower leakage without loss of average memory access time for the SPEC2K benchmarks; and 4) smaller L1 and larger L2 caches than are typical in today's processors result in significant leakage and dynamic power reduction without affecting the average memory access time.

17.
The architecture, implementation, and applications of the TMS32020, a second-generation VLSI digital signal processor, are described. The processor has many special features which provide a significant advance over previous VLSI digital signal processors. Its multiprocessor capabilities further distinguish it, allowing for much more flexibility in overall system design. The architecture of the device allows a dual bus structure to be maintained on-chip, while external bus hardware requirements are minimized via the multiplexing of these buses externally. Some of the notable features incorporated onto the device include two large on-chip RAM blocks, large external program/data address spaces, single-cycle multiply/accumulate instructions, hardware and instructions for efficient memory management, and a versatile multiprocessor interface.

18.
A compact on-chip error correcting circuit (ECC) for low-cost flash memories has been developed. The total increase in chip area is 2%, including all cells, sense amplifiers, logic, and wiring associated with the ECC. The proposed on-chip ECC, employing 10 check bits for 512 data bits, has been implemented on an experimental 64-Mbit NAND flash memory. The cumulative sector error rate has been improved from 10^-4 to 10^-10. By transferring read data from the sense amplifiers to the ECC twice, the 522-byte temporary buffers, which are required for a conventional ECC and occupy a large part of its area, have been eliminated. As a result, the area of the circuit has been drastically reduced by a factor of 25. The proposed on-chip ECC has been optimized considering the balance between reliability improvement and cell-area overhead. The increase in operating current has been suppressed to less than 1 mA.
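The 10-check-bits-for-512-data-bits figure is consistent with a single-error-correcting Hamming code, which needs the smallest k with 2^k ≥ data_bits + k + 1. The short check below computes that bound; the identification of the code family is an inference, not something stated in the abstract.

```python
def hamming_check_bits(data_bits):
    """Smallest k such that 2**k >= data_bits + k + 1 (single-error correction)."""
    k = 1
    while 2 ** k < data_bits + k + 1:
        k += 1
    return k

print(hamming_check_bits(512))  # -> 10, matching the 10 check bits per 512 data bits
```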

19.
Efficient exploration of bus-based system-on-chip architectures   (Total citations: 1; self-citations: 0; citations by others: 1)
Separation between computation and communication in system design allows system designers to explore the communication architecture independently after component selection and mapping decisions are made. In this paper, we present an iterative two-step exploration methodology for the bus-based on-chip communication architecture of multitask applications. We assume that the memory traces from the processing components are given. The proposed methodology uses a static performance estimation technique extended for multitask applications to reduce the design space quickly and drastically, and applies trace-driven simulation to the reduced set of design candidates for accurate performance estimation. For the case in which local memory traffic as well as shared memory traffic is involved in bus contention, memory allocation is considered an important axis of the design space in our technique. Experimental results show that the proposed methodology achieves significant performance gains by optimizing on-chip communication alone, up to almost 100% compared with an initial single shared-bus architecture, in two real-life examples: a four-channel digital video recorder and an equalizer for an OFDM DVB-T receiver.

20.
With high clock frequencies, faster transistor rise/fall times, wider wires, and the use of Cu interconnects, interconnect inductive noise is becoming an important design metric in digital circuits. An efficient technique to reduce the inductive noise of on-chip interconnects is to insert shields among signal wires. An efficient solution for the minimum-area shield insertion problem that satisfies given explicit noise bounds on multiple coupled nets is provided. The proposed algorithm determines the locations and number of shields needed to satisfy the noise constraints. Experimental results show that the proposed approach minimizes the number of shields required to satisfy the noise constraints and uses less runtime than the best previously reported alternative.
