Similar literature
Found 20 similar documents (search time: 15 ms)
1.
Code density is of increasing concern in embedded system design, since denser code reduces the need for memory, a scarce resource, and also implicitly improves other important design parameters such as power consumption and performance. In this paper we introduce a novel, hardware-supported approach. In addition to the code, the lookup tables (LUTs) are also compressed, since they can become significant in size if the application is large and/or high compression is desired. Our scheme optimizes the number and size of generated LUTs to improve the compression ratio. To show the efficiency of our approach, we apply it to two compression schemes: “dictionary-based” and “statistical”. We achieve an average compression ratio of 48% (already including the overhead of the LUTs). Our scheme is orthogonal to approaches that take particularities of a certain instruction set architecture into account. We have conducted evaluations using a representative set of applications and have applied the scheme to three major embedded processor architectures, namely ARM, MIPS, and PowerPC.
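To make the phrase "compression ratio including the overhead of the LUTs" concrete, here is a minimal Python sketch of a plain dictionary-based scheme. It is not the paper's method: the 1-bit flag encoding, the function names, and the parameters `word_bits` and `max_entries` are assumptions made for illustration, and the LUT itself is only counted, not compressed.

```python
from collections import Counter


def dictionary_compress(words, word_bits=32, max_entries=256):
    """Replace the most frequent instruction words with short LUT indices.

    Hypothetical encoding: every item carries a 1-bit flag
    (1 = LUT index follows, 0 = raw instruction word follows).
    """
    freq = Counter(words)
    lut = [w for w, _ in freq.most_common(max_entries)]
    index = {w: i for i, w in enumerate(lut)}
    # Bits needed to address any LUT entry (at least one bit).
    index_bits = max(1, (len(lut) - 1).bit_length())
    stream = [(1, index[w]) if w in index else (0, w) for w in words]
    code_bits = sum(1 + (index_bits if flag else word_bits) for flag, _ in stream)
    lut_bits = len(lut) * word_bits  # uncompressed LUT storage
    return stream, lut, code_bits, lut_bits


def decompress(stream, lut):
    """Invert dictionary_compress: look up indices, pass raw words through."""
    return [lut[value] if flag else value for flag, value in stream]


def compression_ratio(words, word_bits=32, max_entries=256):
    """Compressed size (code plus LUT) divided by original size; lower is better."""
    _, _, code_bits, lut_bits = dictionary_compress(words, word_bits, max_entries)
    return (code_bits + lut_bits) / (len(words) * word_bits)
```

With this convention a ratio of 0.48 would mean the compressed code plus its LUTs occupy 48% of the original size. Counting `lut_bits` in the numerator is what keeps a very large dictionary from looking artificially good.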

2.
The push to embed reliable and low-power memory architectures into modern systems-on-chip is driving the EDA community to develop new design techniques and circuit solutions that can concurrently optimize aging effects due to Negative Bias Temperature Instability (NBTI) and static power consumption due to leakage mechanisms. While recent works have shown how conventional leakage optimization techniques can help mitigate NBTI-induced aging effects on cache memories, in this paper we focus specifically on scratchpad memory (SPM) and present novel software approaches as a means of alleviating the NBTI-induced aging effects. In particular, we demonstrate how intelligent software-directed data allocation strategies can extend the lifetime of partitioned SPMs by distributing the idleness across the memory sub-banks.

3.
Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design the decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead when different probability models are used. Experimental results for a VLIW embedded processor, the TMS320C6x, show that compression ratios between 67% and 80% can be achieved, depending on the probability models used. A precache decompression unit design is implemented in TSMC 0.25 μm technology and a test chip is fabricated.

4.
A system-on-chip (SOC) usually consists of many memory cores with different sizes and functionality, and they typically represent a significant portion of the SOC and therefore dominate its yield. Diagnostics for yield enhancement of the memory cores is thus a very important issue. In this paper we present two data compression techniques that can be used to speed up the transmission of diagnostic data from the embedded RAM built-in self-test (BIST) circuit that has diagnostic support to the external tester. The proposed syndrome-accumulation approach compresses the faulty-cell address and March syndrome to about 28% of the original size on average under the March-17N diagnostic test algorithm. The key component of the compressor is a novel syndrome-accumulation circuit, which can be realized by a content-addressable memory. Experimental results show that the area overhead is about 0.9% for a 1Mb SRAM with 164 faults. A tree-based compression technique for word-oriented memories is also presented. By using a simplified Huffman coding scheme and partitioning each 256-bit Hamming syndrome into fixed-size symbols, the average compression ratio (size of original data to that of compressed data) is about 10, assuming 16-bit symbols. Also, the additional hardware to implement the tree-based compressor is very small. The proposed compression techniques effectively reduce the memory diagnosis time as well as the tester storage requirement.

5.
Strict real-time processing and energy efficiency are required by high-performance Digital Signal Processing (DSP) applications. Scratch-Pad Memory (SPM), a software-controlled on-chip memory with small area and low energy consumption, has been widely used in many DSP systems. Various data placement algorithms have been proposed to effectively manage data on SPMs. However, none of them can provide an optimal solution to the data placement problem for array data in loops. In this paper, we study the problem of how to optimally place array data in loops into multiple types of memory units such that the energy and time costs of memory accesses are minimized. We design a dynamic programming algorithm, Iterational Optimal Data Placement (IODP), to solve the data placement problem for loops on processor architectures with multiple types of memory units. According to the experimental results, the IODP algorithm reduced energy consumption by 20.04% and 8.98% compared with a random memory placement method and a greedy algorithm, respectively. It also reduced memory access time by 19.01% and 8.62% compared with the random memory placement method and the greedy approach, respectively.

6.
In this paper, we make the case for building high-performance asymmetric-cell caches (ACCs) that employ recently proposed asymmetric SRAMs to reduce leakage proportionally to the number of resident zero bits. Because ACCs target memory value content (independent of cell activity and access patterns), they complement prior proposals for reducing cache leakage that target memory access characteristics. Through detailed simulation and leakage estimation using a commercial 0.13-μm CMOS process model, we show that: 1) on average 75% of resident data cache bits and 64% of resident instruction cache bits are zero; 2) while prior research carefully evaluated the fraction of accessed zero bytes, we show that a high fraction of accessed zero bytes is neither a necessary nor a sufficient condition for a high fraction of resident zero bits; 3) the zero-bit program behavior persists even when we restrict our attention to live data, thereby complementing prior leakage-saving techniques that target inactive cells; and 4) ACCs can reduce leakage on average by 4.3× compared to a conventional data cache without any performance loss, and by 9× at the cost of a 5% increase in overall cache access latency.
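The distinction between zero bits and zero bytes in point 2 can be illustrated with a small Python sketch. It simplifies the paper's accessed-versus-resident comparison to a single buffer, and the function names and sample buffer are hypothetical; it only shows that a block of memory can contain no zero bytes at all while still consisting mostly of zero bits.

```python
def zero_bit_fraction(data: bytes) -> float:
    """Fraction of bits in `data` that are 0."""
    total_bits = len(data) * 8
    if total_bits == 0:
        raise ValueError("data must not be empty")
    one_bits = sum(bin(byte).count("1") for byte in data)
    return (total_bits - one_bits) / total_bits


def zero_byte_fraction(data: bytes) -> float:
    """Fraction of bytes in `data` that are exactly 0x00."""
    if not data:
        raise ValueError("data must not be empty")
    return sum(1 for byte in data if byte == 0) / len(data)


# Illustrative buffer: four bytes, each with a single bit set.
sample = bytes([0x01, 0x01, 0x01, 0x01])
print(zero_byte_fraction(sample))  # no byte is entirely zero
print(zero_bit_fraction(sample))   # yet 28 of the 32 bits are zero
```

In this example the zero-byte fraction is 0 while the zero-bit fraction is 0.875, which is one way a high zero-byte fraction can fail to be a necessary condition for a high zero-bit fraction.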

7.
The appearance of highly intelligent and advanced robots that recognize and react to their environment has led to an increase in the complexity of the embedded systems that govern their reactions. Over time, this growing complexity has affected the overall responsiveness and battery life of the robot. From controlling a drone to an internet-controlled coffee maker, an on-board processor is required to compute all the control signals that trigger a reaction to stimuli. As the complexity of the embedded system rises, more complicated computation is required to react to those inputs. This has not only reduced the overall responsiveness and battery life of the robot but has also prompted the need to upgrade the hardware to accommodate the increase in required computation. The proposed solution is to offload critical tasks such as control signal computation or image processing to a centralized server, which would eliminate the need for complex on-board processors. A comprehensive analysis of the feasibility of such code offloading in embedded robotic systems is presented, and its applicability to other domains is discussed. The obtained results will be compared with results obtained in other domains involving code offloading and analysed to identify any patterns.

8.
We address the problem of code generation for embedded DSP systems. Such systems devote a limited quantity of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs, because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. In order to increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs. This revised version was published online in July 2006 with corrections to the Cover Date.

9.
This paper introduces a color-aware instruction set extension that enhances the performance and efficiency in the processing of color images and video. Traditional multimedia extensions (e.g., MMX, VIS, and MDMX) depend solely on generic subword parallelism whereas the proposed color-aware instruction set (CAX) supports parallel operations on two-packed 16-bit (6:5:5) YCbCr (luminance-chrominance) values. A 16-bit YCbCr representation reduces storage requirements by 33% over the baseline 24-bit YCbCr representation while maintaining satisfactory image quality. Experimental results on an identically configured, dynamically scheduled superscalar processor indicate that CAX outperforms MDMX (a representative MIPS multimedia extension) in terms of speedup (3.9× baseline ISA with CAX versus 2.1× with MDMX) and energy reduction (75.8% reduction over baseline with CAX versus 54.8% reduction with MDMX). CAX can improve the performance and efficiency of future embedded color imaging products.
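As a rough illustration of the 16-bit (6:5:5) representation, the Python sketch below packs 8-bit Y, Cb, and Cr samples into one 16-bit value and places two such values in a 32-bit word. The bit layout (Y in the top six bits), the use of truncation rather than rounding, and all function names are assumptions; the abstract does not specify them.

```python
def pack_ycbcr655(y: int, cb: int, cr: int) -> int:
    """Pack 8-bit Y, Cb, Cr into 16 bits, keeping the top 6, 5, and 5 bits.

    Assumed layout: [15:10] = Y, [9:5] = Cb, [4:0] = Cr.
    """
    return ((y >> 2) << 10) | ((cb >> 3) << 5) | (cr >> 3)


def unpack_ycbcr655(packed: int) -> tuple:
    """Expand a 6:5:5 value back to 8-bit components (low bits are lost)."""
    y = ((packed >> 10) & 0x3F) << 2
    cb = ((packed >> 5) & 0x1F) << 3
    cr = (packed & 0x1F) << 3
    return y, cb, cr


def pack_pair(pixel0: int, pixel1: int) -> int:
    """Place two 16-bit pixels in one 32-bit word ("two-packed")."""
    return ((pixel0 & 0xFFFF) << 16) | (pixel1 & 0xFFFF)


# Storage saving of 16 bits per pixel relative to 24 bits per pixel.
saving_percent = round(100 * (1 - 16 / 24))
```

The saving works out to one third, which matches the 33% figure quoted above. The round trip is lossy: the luminance channel keeps six bits and each chrominance channel keeps five.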

10.
The techniques known in the literature for the design of SRAM structures with low standby leakage typically exploit an additional operation mode, named the sleep mode or the standby mode. In this paper, existing low-leakage SRAM structures are analyzed using several SPEC2000 benchmarks. As expected, the examined SRAM architectures have lower static power consumption than the conventional 6-T SRAM cell. However, the additional activities performed to enter and to exit the sleep mode also lead to higher dynamic energy. Our study demonstrates that, due to this, the overall energy consumption achieved by the known low-leakage techniques is greater than that of the conventional approach. In the second part of this paper, a novel low-leakage SRAM cell is presented. The proposed structure establishes when to enter and to exit the sleep mode on the basis of the data stored in it, without introducing time and energy penalties with respect to the conventional 6-T cell. The new SRAM structure was realized using the UMC 0.18-μm, 1.8-V and the ST 90-nm, 1-V CMOS technologies. Tests performed with a set of SPEC2000 benchmarks have shown that the proposed approach is actually energy efficient.

11.
Automatic generation of a customized instruction set, starting from an input application code, is a complex problem that has received considerable attention in the past few years. Because of its complexity, only simplified versions of the problem have been solved exactly so far. For example, exact algorithms have been proposed for custom instruction identification that do not consider recurrence; other methods exist that can indeed handle recurrence, but are limited in how complex an instruction they can identify. However, an exact solution that can handle identification and recurrence simultaneously has been missing. We divide the problem into several parts and concentrate on covering, that is, selecting a set of nonoverlapping and possibly recurrent custom instructions to be implemented and used. We then propose a range of novel algorithms, both exact and approximate, to solve the covering problem in conjunction with the recurrence of candidate extensions. We propose an optimal search technique that uses branch-and-bound to improve an existing solution, in conjunction with a greedy search to help the algorithm out of any local optima, and achieve a tangible improvement over nonrecurrence-aware covering.

12.
In this brief, we propose an energy-efficient branch target buffer (BTB) lookup scheme for embedded processors. Unlike the traditional scheme, in which the BTB has to be looked up on every instruction fetch, in our design the BTB is only looked up when the instruction is likely to be a taken branch. By dynamically profiling the taken traces during program execution, the new scheme can achieve the goal of one BTB lookup per taken trace. The experimental results show that, by filtering out the redundant lookups, our design can reduce the total processor energy consumption by about 5.24% on average for the MediaBench applications.

13.
mc2llvm is a process-level ARM-to-x86 binary translator developed in our lab over the past several years. Currently, it is able to emulate single-threaded programs. We extend mc2llvm to emulate multi-threaded programs. Our main task is to restructure its architecture for multi-threaded programs. Register mapping, code cache management, and address mapping in mc2llvm have all been modified. In addition, to further speed up the emulation, we collect hot paths, then aggressively optimize and generate code for them at run time. Additional threads are used to alleviate the overhead. Thus, when the same hot path is walked through again, the corresponding optimized native code is executed instead. In our experiments, our system is 8.8X faster than QEMU (quick emulator) on average when emulating the specified benchmarks with 8 guest threads.

14.
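This article describes the design of a digital filter using Intel's new 80C196UN microcontroller. The magnitude, phase, and response of a filter designed with this embedded digital signal processor are relatively accurate, and errors such as noise and voltage errors introduced by using multiple components are eliminated. The article also gives an overview of the C196UN hardware structure, the software design, and the key statements for carrying out signal processing.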

15.
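Targeting the characteristics of embedded applications, a branch target buffer (BTB) based on RAM tag comparison is designed. Using hardware simulation (with the BTB control logic implemented in RTL and the storage array implemented in custom logic), the influence of the BTB structural parameters on the performance and energy consumption of the BTB itself and of the whole processor system is studied, and the optimal BTB structural parameters for an embedded processor are selected according to the simulation results. Based on these parameters, a BTB based on CAM tag comparison is then designed. In a SPEC2000 evaluation, the BTB based on CAM tag comparison reduces power consumption by 37.17% relative to the BTB based on RAM tag comparison.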

16.
With aggressive supply voltage scaling, SRAM bit-cell failures in the embedded memory of an H.264 system result in significant degradation of video quality. Error Correction Coding (ECC) has been widely used in embedded memories to correct these failures; however, the conventional ECC approach does not consider the differences in the importance of the data stored in the memory. This paper presents a priority-based ECC (PB-ECC) approach, in which the more important higher order bits (HOBs) are protected with higher priority than the less important lower order bits (LOBs), since the human visual system is less sensitive to LOB errors. A mathematical analysis of the error correction capability of the PB-ECC scheme and its resulting peak signal-to-noise ratio (PSNR) degradation in an H.264 system is also presented to help designers determine the bit allocation of the higher and lower priority segments of the embedded memory. We designed and implemented three PB-ECC cases (Hamming only, BCH only, and Hybrid PB-ECC) using 90 nm CMOS technology. With the supply voltage at 900 mV or below, the experimental results show up to 6.0 dB PSNR improvement with a smaller circuit area compared to the conventional ECC approach.

17.
Embedded content addressable memories (CAMs) are important components in many system chips, where most CAMs are customized and have wide words. This poses challenges for testing and diagnosis. In this paper, two efficient March-like test algorithms are first proposed. In addition to typical RAM faults, they also cover CAM-specific comparison faults. The first algorithm requires 9N Read/Write operations and 2(N + W) Compare operations to cover comparison and RAM faults (but does not fully cover the intra-word coupling faults) for an N × W-bit CAM. The second algorithm uses 3N log2 W Write and 2W log2 W Compare operations to cover the remaining intra-word coupling faults. Compared with previous algorithms, the proposed algorithms have higher fault coverage and lower time complexity. Moreover, they can test the CAM even when its comparison result is observed only through the Hit output or the priority encoder output. We also present algorithms that can locate the cells with comparison faults. Finally, a CAM BIST design is briefly described.
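The operation counts quoted above can be evaluated for a concrete CAM size with a short Python sketch. The function name and the example dimensions are hypothetical, and rounding log2 W up to an integer for word widths that are not powers of two is an assumption not stated in the abstract.

```python
def march_cam_operation_counts(n_words: int, word_bits: int) -> dict:
    """Operation counts for the two March-like CAM tests on an N x W-bit CAM.

    First algorithm:  9N Read/Write and 2(N + W) Compare operations.
    Second algorithm: 3N log2(W) Write and 2W log2(W) Compare operations.
    """
    # Integer ceil(log2(W)); exact when W is a power of two (assumed rounding).
    log2_w = (word_bits - 1).bit_length()
    return {
        "first_read_write": 9 * n_words,
        "first_compare": 2 * (n_words + word_bits),
        "second_write": 3 * n_words * log2_w,
        "second_compare": 2 * word_bits * log2_w,
    }


# Example: a hypothetical CAM with 1024 words of 64 bits each.
counts = march_cam_operation_counts(1024, 64)
for name, value in counts.items():
    print(f"{name}: {value}")
```

For wide words the second algorithm's write count, which grows as N log2 W, dominates the total, while both compare counts stay small relative to the number of words.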

18.
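To effectively reduce program code size and save chip design area, dictionary-based compression and decompression of program code was studied and implemented, targeting an independently designed transport triggered architecture (TTA) microprocessor core written in the Verilog HDL hardware description language. Based on the characteristics of the TTA microprocessor core, code compression was optimized at different compression granularities, and the classic LZ78 dictionary compression algorithm was adapted for use in code compression. Test results show that the optimized dictionary compression algorithm improves the code compression effect. A code compression ratio evaluation method that takes the dictionary size into account was adopted, enabling an effective evaluation of the code compression ratio.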
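For readers unfamiliar with the classic LZ78 algorithm mentioned above, here is a minimal Python sketch of the textbook version. It does not include the paper's TTA-specific modifications, which the abstract does not describe; the function names and the use of `None` to mark a final phrase without a new symbol are illustrative choices.

```python
def lz78_encode(symbols):
    """Textbook LZ78: emit (phrase_index, next_symbol) pairs.

    Index 0 denotes the empty phrase; new phrases are numbered from 1.
    """
    dictionary = {}  # phrase (tuple of symbols) -> index
    output = []
    phrase = ()
    for symbol in symbols:
        candidate = phrase + (symbol,)
        if candidate in dictionary:
            phrase = candidate  # keep extending the current match
        else:
            output.append((dictionary.get(phrase, 0), symbol))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ()
    if phrase:
        # Input ended inside a known phrase: emit its index with no new symbol.
        output.append((dictionary[phrase], None))
    return output


def lz78_decode(pairs):
    """Rebuild the symbol sequence from (phrase_index, next_symbol) pairs."""
    phrases = [()]  # index 0 is the empty phrase
    output = []
    for index, symbol in pairs:
        phrase = phrases[index] + (() if symbol is None else (symbol,))
        output.extend(phrase)
        phrases.append(phrase)
    return output


encoded = lz78_encode("ABABABA")
decoded = "".join(lz78_decode(encoded))
```

When applied to program code, the symbols could be bytes, instruction fields, or whole instructions, which is one way to read "different compression granularities". Because the decoder rebuilds the dictionary as it goes, no dictionary needs to be stored in this textbook form.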

19.
This paper proposes a software pipelining framework, CALiBeR (Cluster Aware Load Balancing Retiming Algorithm), suitable for compilers targeting clustered embedded VLIW processors. CALiBeR can be used by embedded system designers to explore different code optimization alternatives, that is, high-quality customized retiming solutions for desired throughput and program memory size requirements, while minimizing register pressure. An extensive set of experimental results is presented, demonstrating that our algorithm compares favorably with one of the best state-of-the-art algorithms, achieving up to 50% improvement in performance and up to 47% improvement in register requirements. In order to empirically assess the effectiveness of clustering for high ILP applications, additional experiments are presented contrasting the performance achieved by software pipelined kernels executing on clustered and on centralized machines.

20.
Embedded systems are growing at an impressive rate and provide more and more sophisticated applications, characterized by complex array index manipulation and a large number of data accesses. These applications require high-performance, application-specific computation that general-purpose processors cannot deliver at a reasonable energy consumption. Very long instruction word architectures seem a good solution, providing enough computational performance at low power with the programmability required to shorten the time to market. These architectures rely on compiler effort to exploit the available instruction and data parallelism and keep the data path busy all the time. With the density of transistors doubling every 18 months, more and more sophisticated architectures with a high number of computational resources running in parallel are emerging. With this increasing parallel computation, access to data is becoming the main bottleneck that limits the available parallelism. To alleviate this problem, in current embedded architectures a special unit works in parallel with the main computing elements to ensure efficient feeding and storage of the data: the address generator unit, which comes in many flavors. Future architectures will have to deal with enormous memory bandwidth in distributed memories, and the development of address generator units will be crucial for an effective next generation of embedded processors, where global trade-offs between reaction time, bandwidth, energy, and area must be achieved. This paper provides a survey of methods and techniques that optimize the address generation process for embedded systems, explaining current research trends and future needs.
Francky Catthoor
