20 similar documents found (search time: 15 ms)
1.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(12):1696-1707
2.
Cesare Ferri Dimitra Papagiannopoulou R. Iris Bahar Andrea Calimera 《Journal of Electronic Testing》2012,28(3):349-363
The push to embed reliable, low-power memory architectures into modern systems-on-chip is driving the EDA community to develop new design techniques and circuit solutions that can concurrently optimize aging effects due to Negative Bias Temperature Instability (NBTI) and static power consumption due to leakage mechanisms. While recent works have shown how conventional leakage optimization techniques can help mitigate NBTI-induced aging effects in cache memories, in this paper we focus specifically on scratchpad memory (SPM) and present novel software approaches for alleviating NBTI-induced aging. In particular, we demonstrate how intelligent, software-directed data allocation strategies can extend the lifetime of partitioned SPMs by distributing idleness across the memory sub-banks.
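The idleness-distribution idea can be sketched in a few lines. The toy allocator below is an illustrative assumption, not the paper's actual strategy: it simply rotates data objects across sub-banks so no single bank accumulates all the activity (and thus all the stress) while the others sit idle.

```python
# Illustrative sketch: rotate data allocations across SPM sub-banks so that
# activity and idleness are spread evenly, instead of always filling bank 0.
# The allocator and the "spread" metric are hypothetical, not from the paper.

def allocate_round_robin(objects, num_banks):
    """Assign each data object to a sub-bank in rotation.

    objects:   list of object identifiers, in allocation order
    num_banks: number of SPM sub-banks
    Returns a dict: bank index -> list of objects placed there.
    """
    banks = {b: [] for b in range(num_banks)}
    for i, obj in enumerate(objects):
        banks[i % num_banks].append(obj)
    return banks

def idleness_spread(banks):
    """Max-min difference in per-bank load; lower means idleness is
    distributed more evenly across the sub-banks."""
    loads = [len(v) for v in banks.values()]
    return max(loads) - min(loads)

banks = allocate_round_robin([f"arr{i}" for i in range(10)], 4)
print(idleness_spread(banks))  # round-robin keeps loads within 1 of each other
```

A real allocator would weight objects by access frequency and lifetime; round-robin is just the simplest policy that avoids concentrating stress on one bank.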
3.
Yuan Xie Wolf W. Lekatsas H. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(8):975-980
Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems, since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design the decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead using different probability models. Experimental results for the TMS320C6x VLIW embedded processor show that compression ratios between 67% and 80% can be achieved, depending on the probability model used. A pre-cache decompression unit is implemented in TSMC 0.25-μm technology and a test chip is fabricated.
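The compression ratio such a coder can reach is bounded by the entropy of the chosen probability model, which arithmetic coding approaches closely. The sketch below is not the paper's coder; it only estimates the achievable ratio for an order-0 (per-byte) static model on a synthetic instruction stream.

```python
# Not the paper's algorithm: a quick entropy-bound estimate of the compression
# ratio an arithmetic coder with an order-0 static model could approach.
# The byte stream below is synthetic (skewed, like VLIW code with NOP fillers).
import math
from collections import Counter

def entropy_compression_ratio(code_bytes):
    """Estimated compressed_size / original_size, in percent, for an
    order-0 model (lower is better; the paper reports 67%-80%)."""
    freqs = Counter(code_bytes)
    n = len(code_bytes)
    bits = sum(-f * math.log2(f / n) for f in freqs.values())
    return 100 * bits / (8 * n)

stream = bytes([0x00] * 600 + [0x1A] * 200 + list(range(200)))
print(round(entropy_compression_ratio(stream), 1))
```

A more skewed model (e.g., per-field probabilities for opcode vs. operand bits) lowers the entropy and hence the ratio, which is exactly the model-dependence the abstract describes.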
4.
A system-on-chip (SOC) usually consists of many memory cores with different sizes and functionality; they typically represent a significant portion of the SOC and therefore dominate its yield. Diagnosis for yield enhancement of the memory cores is thus a very important issue. In this paper we present two data compression techniques that can be used to speed up the transmission of diagnostic data from an embedded RAM built-in self-test (BIST) circuit with diagnostic support to the external tester. The proposed syndrome-accumulation approach compresses the faulty-cell address and March syndrome to about 28% of the original size on average under the March-17N diagnostic test algorithm. The key component of the compressor is a novel syndrome-accumulation circuit, which can be realized with a content-addressable memory. Experimental results show that the area overhead is about 0.9% for a 1-Mb SRAM with 164 faults. A tree-based compression technique for word-oriented memories is also presented. By using a simplified Huffman coding scheme and partitioning each 256-bit Hamming syndrome into fixed-size symbols, the average compression ratio (size of the original data to that of the compressed data) is about 10, assuming 16-bit symbols. The additional hardware needed to implement the tree-based compressor is also very small. The proposed compression techniques effectively reduce memory diagnosis time as well as tester storage requirements.
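The tree-based idea works because syndromes from a memory with few faults are mostly zero, so after partitioning into 16-bit symbols the all-zero symbol dominates and Huffman-codes down to one bit. The sketch below uses synthetic syndromes and a generic Huffman length computation; it is a simplification of the flavor of the scheme, not the paper's exact circuit.

```python
# Illustrative sketch: split each 256-bit syndrome into 16-bit symbols and
# Huffman-code the symbols. Syndromes here are synthetic single-fault words;
# symbol width and coding details are simplifications, not the paper's design.
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for an optimal Huffman code."""
    lengths = {s: 0 for s in freqs}
    if len(freqs) == 1:                        # degenerate one-symbol case
        return {next(iter(freqs)): 1}
    heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)                           # unique tie-breaker for the heap
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                      # every merge deepens its leaves
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, tick, s1 + s2))
        tick += 1
    return lengths

def compression_ratio(syndromes, symbol_bits=16):
    """original bits / compressed bits, Huffman-coding fixed-size symbols."""
    symbols = []
    for word in syndromes:                     # each syndrome is a 256-bit int
        for k in range(256 // symbol_bits):
            symbols.append((word >> (k * symbol_bits)) & ((1 << symbol_bits) - 1))
    freqs = Counter(symbols)
    lengths = huffman_code_lengths(freqs)
    compressed = sum(lengths[s] * f for s, f in freqs.items())
    return (len(symbols) * symbol_bits) / compressed

# Sparse syndromes (one faulty bit each) yield a ratio around 10, as reported
sparse = [1 << (17 * i % 256) for i in range(64)]
print(compression_ratio(sparse))
```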
5.
Jun Zhang Tan Deng Qiuyan Gao Qingfeng Zhuge Edwin H.-M. Sha 《Journal of Signal Processing Systems》2013,72(3):151-164
Strict real-time processing and energy efficiency are required by high-performance Digital Signal Processing (DSP) applications. Scratch-Pad Memory (SPM), a software-controlled on-chip memory with small area and low energy consumption, has been widely used in many DSP systems, and various data placement algorithms have been proposed to manage data on SPMs effectively. However, none of them provides an optimal solution to the data placement problem for array data in loops. In this paper, we study how to optimally place array data in loops across multiple types of memory units such that the energy and time costs of memory accesses are minimized. We design a dynamic programming algorithm, Iterational Optimal Data Placement (IODP), to solve the data placement problem for loops on processor architectures with multiple types of memory units. According to the experimental results, IODP reduces energy consumption by 20.04% and 8.98% compared with a random memory placement method and a greedy algorithm, respectively. It also reduces memory access time by 19.01% and 8.62% compared with the same random and greedy approaches.
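The core tension is that the SPM is small and cheap to access while the fallback memory is large and expensive, so the placement problem has a knapsack structure that dynamic programming solves optimally. The toy below is not the paper's IODP algorithm (which handles loop iterations and multiple memory types); it shows the DP flavor on a single SPM-vs-DRAM choice with made-up sizes and costs.

```python
# Toy dynamic-programming placement in the spirit of the problem (NOT the
# paper's IODP): choose which arrays to keep in a small SPM so total access
# cost is minimized, with everything else falling back to DRAM.
# Sizes, access counts, and per-access costs below are invented.

def place_arrays(arrays, spm_capacity, spm_cost=1, dram_cost=10):
    """arrays: list of (name, size, accesses). Returns (min_cost, spm_set).

    Classic 0/1-knapsack DP over SPM capacity: putting an array in SPM
    saves (dram_cost - spm_cost) * accesses.
    """
    base = sum(a * dram_cost for _, _, a in arrays)        # all-DRAM cost
    # best[c] = (max saving, chosen names) using at most c bytes of SPM
    best = [(0, frozenset()) for _ in range(spm_capacity + 1)]
    for name, size, accesses in arrays:
        saving = (dram_cost - spm_cost) * accesses
        for c in range(spm_capacity, size - 1, -1):        # 0/1 item: go down
            cand = (best[c - size][0] + saving, best[c - size][1] | {name})
            if cand[0] > best[c][0]:
                best[c] = cand
    max_saving, chosen = best[spm_capacity]
    return base - max_saving, chosen

arrays = [("A", 4, 100), ("B", 3, 90), ("C", 2, 30), ("D", 5, 120)]
cost, in_spm = place_arrays(arrays, spm_capacity=7)
print(cost, sorted(in_spm))  # DP picks A+B over the greedy pick D
```

Note how the optimal answer keeps A and B rather than D, even though D has the most accesses: exactly the kind of case where a greedy heuristic loses to the DP, matching the gap the abstract reports.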
6.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):877-881
In this paper, we make the case for building high-performance asymmetric-cell caches (ACCs) that employ recently proposed asymmetric SRAMs to reduce leakage in proportion to the number of resident zero bits. Because ACCs target memory value content (independent of cell activity and access patterns), they complement prior proposals for reducing cache leakage that target memory access characteristics. Through detailed simulation and leakage estimation using a commercial 0.13-μm CMOS process model, we show that: 1) on average 75% of resident data-cache bits and 64% of resident instruction-cache bits are zero; 2) while prior research carefully evaluated the fraction of accessed zero bytes, a high fraction of accessed zero bytes is neither a necessary nor a sufficient condition for a high fraction of resident zero bits; 3) the zero-bit program behavior persists even when we restrict our attention to live data, thereby complementing prior leakage-saving techniques that target inactive cells; and 4) ACCs can reduce leakage on average by 4.3× compared to a conventional data cache without any performance loss, and by 9× at the cost of a 5% increase in overall cache access latency.
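The zero-bit statistic the paper builds on is easy to reproduce: small integer values stored in wide words leave most high-order bits at zero. This sketch (synthetic data, not the paper's workloads) computes the resident zero-bit fraction.

```python
# Quick illustration of the zero-bit observation: small values in fixed-width
# words keep most high-order bits at zero. The data set here is synthetic,
# not one of the paper's benchmark traces.

def zero_bit_fraction(words, width=32):
    """Fraction of bits that are 0 across all `width`-bit words."""
    total = len(words) * width
    ones = sum(bin(w & ((1 << width) - 1)).count("1") for w in words)
    return (total - ones) / total

# A typical "small values in wide words" resident set
resident = [0, 1, 2, 3, 255, 1024, 7, 0]
print(zero_bit_fraction(resident))  # well above 0.9 for this skewed data
```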
7.
Gauni Sabitha C. T. Manimegalai Srinivasa Raghavan Karthick Narayanan Nandanamudi Doraditya 《Wireless Personal Communications》2017,97(4):5089-5098
The appearance of highly intelligent robots that recognize and react to their environment has increased the complexity of the embedded systems that govern their reactions. Over time, this growing complexity has degraded the overall responsiveness and battery life of such robots. From controlling a drone to an Internet-controlled coffee maker, an on-board processor is required to compute all the control signals that trigger a reaction to stimuli, and as the complexity of the embedded system rises, more involved computation is needed to react to those inputs. This has not only reduced responsiveness and battery life but has also created the need for hardware upgrades to accommodate the increased computational load. A solution is to offload critical tasks, such as control-signal computation or image processing, to a centralized server, eliminating the need for complex on-board processors. A comprehensive analysis of the feasibility of such code offloading in embedded robotic systems is presented, and its applicability to other domains is discussed. The obtained results are compared with results from other domains involving code offloading and analyzed to identify common patterns.
8.
Ashok Sudarsanam Sharad Malik Masahiro Fujita 《Design Automation for Embedded Systems》1999,4(2-3):187-206
We address the problem of code generation for embedded DSP systems. Such systems devote a limited quantity of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs, because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. To increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs.
This revised version was published online in July 2006 with corrections to the Cover Date.
9.
This paper introduces a color-aware instruction set extension that enhances the performance and efficiency of color image and video processing. Traditional multimedia extensions (e.g., MMX, VIS, and MDMX) depend solely on generic subword parallelism, whereas the proposed color-aware instruction set (CAX) supports parallel operations on two packed 16-bit (6:5:5) YCbCr (luminance-chrominance) values. A 16-bit YCbCr representation reduces storage requirements by 33% over the baseline 24-bit YCbCr representation while maintaining satisfactory image quality. Experimental results on an identically configured, dynamically scheduled superscalar processor indicate that CAX outperforms MDMX (a representative MIPS multimedia extension) in terms of speedup (3.9× over the baseline ISA with CAX versus 2.1× with MDMX) and energy reduction (75.8% reduction over the baseline with CAX versus 54.8% with MDMX). CAX can thus improve the performance and efficiency of future embedded color imaging products.
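The 6:5:5 packing that gives the 33% storage saving can be sketched directly. The exact field order inside the 16-bit word is an assumption here (Y in the high 6 bits); quantization simply truncates the 8-bit components to 6/5/5 bits.

```python
# Sketch of a 16-bit (6:5:5) YCbCr packing like the one CAX operates on.
# Field order (Y high) is an assumption, not taken from the paper; truncation
# is the simplest quantization, so unpacked values differ by < 1 LSB step.

def pack_ycbcr(y, cb, cr):
    """Pack 8-bit Y, Cb, Cr into a 16-bit 6:5:5 word (Y in the top 6 bits)."""
    return ((y >> 2) << 10) | ((cb >> 3) << 5) | (cr >> 3)

def unpack_ycbcr(word):
    """Expand a 6:5:5 word back to approximate 8-bit components."""
    y  = ((word >> 10) & 0x3F) << 2
    cb = ((word >> 5) & 0x1F) << 3
    cr = (word & 0x1F) << 3
    return y, cb, cr

w = pack_ycbcr(200, 100, 50)
print(unpack_ycbcr(w))  # components recovered to within quantization error
```

Two such words fit in a 32-bit register, which is what lets CAX process a pixel pair per subword operation.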
10.
Frustaci F. Corsonello P. Perri S. Cocorullo G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(11):1238-1249
The techniques known in the literature for designing SRAM structures with low standby leakage typically exploit an additional operation mode, called the sleep (or standby) mode. In this paper, existing low-leakage SRAM structures are analyzed using several SPEC2000 benchmarks. As expected, the examined SRAM architectures have lower static power consumption than the conventional 6-T SRAM cell. However, the additional activity required to enter and exit the sleep mode also leads to higher dynamic energy. Our study demonstrates that, as a result, the overall energy consumption of the known low-leakage techniques is greater than that of the conventional approach. In the second part of this paper, a novel low-leakage SRAM cell is presented. The proposed structure decides when to enter and exit the sleep mode based on the data stored in it, without introducing time or energy penalties with respect to the conventional 6-T cell. The new SRAM structure was realized using the UMC 0.18-μm, 1.8-V and the ST 90-nm, 1-V CMOS technologies. Tests performed with a set of SPEC2000 benchmarks have shown that the proposed approach is indeed energy efficient.
11.
《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(10):1259-1267
12.
《Circuits and Systems II: Express Briefs, IEEE Transactions on》2006,53(9):817-821
In this brief, we propose an energy-efficient branch target buffer (BTB) lookup scheme for embedded processors. Unlike the traditional scheme, in which the BTB has to be looked up on every instruction fetch, in our design the BTB is looked up only when the instruction is likely to be a taken branch. By dynamically profiling taken traces during program execution, the new scheme achieves the goal of one BTB lookup per taken trace. The experimental results show that, by filtering out the redundant lookups, our design can reduce total processor energy consumption by about 5.24% on average for the MediaBench applications.
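A back-of-the-envelope model makes the savings intuitive. This is not the paper's profiling mechanism, just the accounting: a conventional BTB is probed on every fetch, while the filtered scheme probes once per taken trace, i.e., once per taken branch.

```python
# Toy accounting model of BTB lookup filtering (not the paper's profiler):
# conventional = one probe per fetched instruction,
# filtered     = one probe per taken trace (a run ending in a taken branch).

def btb_lookups(trace):
    """trace: list of booleans, True where the instruction is a taken branch.
    Returns (conventional_lookups, filtered_lookups)."""
    conventional = len(trace)      # probe on every fetch
    filtered = sum(trace)          # one probe per taken trace
    return conventional, filtered

# 100 instructions with a taken branch every 10th instruction
trace = [(i % 10 == 9) for i in range(100)]
conv, filt = btb_lookups(trace)
print(conv, filt, 1 - filt / conv)  # 90% of lookups filtered out
```

With an average taken trace of 10 instructions, 90% of BTB probes disappear; the reported 5.24% total-energy saving is smaller because the BTB is only one consumer of processor energy.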
13.
mc2llvm is a process-level ARM-to-x86 binary translator developed in our lab over the past several years. Currently, it is able to emulate single-threaded programs. We extend mc2llvm to emulate multi-threaded programs; our main task is to restructure its architecture for multi-threaded execution. Register mapping, code cache management, and address mapping in mc2llvm have all been modified. In addition, to further speed up emulation, we collect hot paths and aggressively optimize and generate code for them at run time, using additional threads to hide the overhead. Thus, when the same hot path is traversed again, the corresponding optimized native code is executed instead. In our experiments, our system is 8.8× faster than QEMU (quick emulator) on average when emulating the specified benchmarks with 8 guest threads.
14.
蒋明 《微电子学与计算机》2000,17(2):61-64
This article describes the design of digital filters using Intel's new 80C196UN microcontroller. Filters built with this embedded digital signal processor achieve relatively accurate amplitude, phase, and frequency response, and eliminate the noise and voltage errors that multiple discrete components would introduce. The article also gives an overview of the 80C196UN hardware architecture, the software design, and the key signal-processing instructions.
15.
Targeting the characteristics of embedded applications, a branch target buffer (BTB) with RAM-based tag comparison is designed. Through hardware simulation (with the BTB control logic implemented in RTL and the storage arrays in custom logic), the effects of the BTB's structural parameters on its own performance and energy consumption, and on those of the whole processor system, are studied, and the optimal BTB configuration for embedded processors is selected from the simulation results. Based on these parameters, a BTB with CAM-based tag comparison is further designed. SPEC2000 evaluation shows that, compared with the RAM-based design, the CAM-based BTB reduces power consumption by 37.17%.
16.
Insoo Lee Jinmo Kwon Jangwon Park Jongsun Park 《Journal of Signal Processing Systems》2013,73(2):123-136
With aggressive supply voltage scaling, SRAM bit-cell failures in the embedded memory of an H.264 system result in significant degradation of video quality. Error Correction Coding (ECC) has been widely used in embedded memories to correct these failures; however, the conventional ECC approach does not consider differences in the importance of the data stored in the memory. This paper presents a priority-based ECC (PB-ECC) approach, in which the more important higher order bits (HOBs) are protected with higher priority than the less important lower order bits (LOBs), since the human visual system is less sensitive to LOB errors. A mathematical analysis of the error correction capability of the PB-ECC scheme and the resulting peak signal-to-noise ratio (PSNR) degradation in an H.264 system is also presented, to help designers determine the bit allocation of the higher- and lower-priority segments of the embedded memory. We designed and implemented three PB-ECC cases (Hamming-only, BCH-only, and hybrid PB-ECC) using 90-nm CMOS technology. With the supply voltage at 900 mV or below, the experimental results deliver up to 6.0 dB of PSNR improvement with a smaller circuit area compared to the conventional ECC approach.
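The priority idea can be made concrete with the smallest standard code. As an assumption for illustration (the paper evaluates Hamming, BCH, and hybrid variants over its own segment sizes), the sketch protects only the 4 high-order bits of an 8-bit pixel with Hamming(7,4) and leaves the low-order bits unprotected.

```python
# Minimal PB-ECC-flavored sketch (an assumption, not the paper's circuit):
# only the 4 HOBs of an 8-bit pixel get Hamming(7,4) protection; a flipped
# HOB is corrected, while LOB errors are tolerated as visually minor.

def hamming74_encode(d):
    """Encode 4 data bits (int 0..15) into 7 bits [p1 p2 d1 p3 d2 d3 d4]."""
    d1, d2, d3, d4 = (d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1
    p1 = d1 ^ d2 ^ d4          # parity over positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s3 * 4 + s2 * 2 + s1        # 1-based position of the error
    if syndrome:
        c[syndrome - 1] ^= 1
    return (c[2] << 3) | (c[4] << 2) | (c[5] << 1) | c[6]

hob = 0b1011
code = hamming74_encode(hob)
code[4] ^= 1                               # a bit-cell failure hits a HOB
print(hamming74_decode(code) == hob)       # the HOB error is corrected
```

The design trade the abstract describes is then just arithmetic: 3 parity bits per 4 HOBs instead of per 8 bits halves the check-bit overhead while protecting the bits that dominate PSNR.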
17.
Embedded content addressable memories (CAMs) are important components in many system chips, where most CAMs are customized and have wide words. This poses challenges for testing and diagnosis. In this paper, two efficient March-like test algorithms are proposed. In addition to typical RAM faults, they also cover CAM-specific comparison faults. For an N × W-bit CAM, the first algorithm requires 9N Read/Write operations and 2(N + W) Compare operations to cover comparison and RAM faults (but does not fully cover the intra-word coupling faults). The second algorithm uses 3N log2 W Write and 2W log2 W Compare operations to cover the remaining intra-word coupling faults. Compared with previous algorithms, the proposed algorithms have higher fault coverage and lower time complexity. Moreover, they can test the CAM even when its comparison result is observed only through the Hit output or the priority encoder output. We also present algorithms that can locate the cells with comparison faults. Finally, a CAM BIST design is briefly described.
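For readers unfamiliar with March notation, the classic March C- algorithm for plain RAMs shows the pattern the paper's CAM-specific algorithms extend: a fixed sequence of read/write elements, each swept over all addresses in a given order. The fault model below (one stuck-at-0 cell) is a textbook illustration, not the paper's CAM comparison-fault model.

```python
# Background sketch: the classic March C- test on a simple RAM model with one
# injected stuck-at-0 cell. The paper extends this March style with Compare
# operations to cover CAM comparison faults; that extension is not shown here.

class FaultyRAM:
    def __init__(self, n, stuck_at_zero=None):
        self.cells = [0] * n
        self.stuck = stuck_at_zero             # index of a stuck-at-0 cell
    def write(self, i, v):
        self.cells[i] = 0 if i == self.stuck else v
    def read(self, i):
        return self.cells[i]

def march_c_minus(ram, n):
    """Run March C-; returns True if the memory passes (no fault detected)."""
    elements = [  # (ascending order?, expected read or None, write or None)
        (True,  None, 0),    # up/down (w0)
        (True,  0,    1),    # up      (r0, w1)
        (True,  1,    0),    # up      (r1, w0)
        (False, 0,    1),    # down    (r0, w1)
        (False, 1,    0),    # down    (r1, w0)
        (True,  0,    None), # up/down (r0)
    ]
    for ascending, expect, wval in elements:
        order = range(n) if ascending else range(n - 1, -1, -1)
        for i in order:
            if expect is not None and ram.read(i) != expect:
                return False
            if wval is not None:
                ram.write(i, wval)
    return True

print(march_c_minus(FaultyRAM(16), 16))                    # fault-free passes
print(march_c_minus(FaultyRAM(16, stuck_at_zero=5), 16))   # SA0 fault caught
```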
18.
19.
This paper proposes a software pipelining framework, CALiBeR (Cluster-Aware Load Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW processors. CALiBeR can be used by embedded system designers to explore different code optimization alternatives, that is, high-quality customized retiming solutions for desired throughput and program memory size requirements, while minimizing register pressure. An extensive set of experimental results is presented, demonstrating that our algorithm compares favorably with one of the best state-of-the-art algorithms, achieving up to 50% improvement in performance and up to 47% improvement in register requirements. In order to empirically assess the effectiveness of clustering for high-ILP applications, additional experiments are presented contrasting the performance achieved by software-pipelined kernels executing on clustered and on centralized machines.
20.
Guillermo Talavera Murali Jayapala Jordi Carrabina Francky Catthoor 《Journal of Signal Processing Systems》2008,53(3):271-284
Embedded systems nowadays are growing at an impressive rate and provide more and more sophisticated applications characterized by complex array index manipulation and a large number of data accesses. Those applications require high-performance, application-specific computation that general-purpose processors cannot deliver at a reasonable energy consumption. Very long instruction word architectures seem a good solution, providing enough computational performance at low power with the programmability required to speed up time to market. Those architectures rely on compiler effort to exploit the available instruction and data parallelism and keep the data path busy all the time. With the density of transistors doubling every 18 months, increasingly sophisticated architectures with a large number of computational resources running in parallel are emerging. With this increasing parallel computation, access to data is becoming the main bottleneck that limits the available parallelism. To alleviate this problem, in current embedded architectures a special unit works in parallel with the main computing elements to ensure efficient feeding and storage of the data: the address generator unit, which comes in many flavors. Future architectures will have to deal with enormous memory bandwidth in distributed memories, and the development of address generator units will be crucial for an effective next generation of embedded processors, where global trade-offs between reaction time, bandwidth, energy, and area must be achieved. This paper surveys methods and techniques that optimize the address generation process for embedded systems, explaining current research trends and future needs.