1.

Efficient Code Compression for Embedded Processors

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(12):1696-1707

Code density is of increasing concern in embedded system design, since denser code reduces the need for memory, a scarce resource, and also implicitly improves other important design parameters such as power consumption and performance. In this paper we introduce a novel, hardware-supported approach. In addition to the code, the lookup tables (LUTs) are also compressed, since they can become significant in size if the application is large and/or high compression is desired. Our scheme optimizes the number and size of generated LUTs to improve the compression ratio. To show the efficiency of our approach, we apply it to two compression schemes: “dictionary-based” and “statistical”. We achieve an average compression ratio of 48% (already including the overhead of the LUTs). Our scheme is orthogonal to approaches that take particularities of a certain instruction set architecture into account. We have conducted evaluations using a representative set of applications and have applied our scheme to three major embedded processor architectures, namely ARM, MIPS, and PowerPC.
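To make the role of LUT overhead concrete, the following is a minimal sketch of a generic dictionary-based scheme; it is not the scheme proposed in the paper above. It assumes fixed-width instruction words, one flag bit per word, and a compression ratio defined as compressed size (including the LUT) divided by original size. The function name `dictionary_compression_ratio` and its parameters are hypothetical.

```python
from collections import Counter


def dictionary_compression_ratio(instructions, dict_size, word_bits=32):
    """Return (compressed bits + LUT bits) / original bits for a toy scheme.

    Assumptions (not from the paper): every word costs one flag bit, followed
    by either a LUT index (hit) or the raw instruction word (miss).
    """
    if not instructions or dict_size < 1:
        raise ValueError("need at least one instruction and one LUT entry")

    # The LUT holds the most frequent instruction words.
    lut = [word for word, _ in Counter(instructions).most_common(dict_size)]
    index_of = {word: i for i, word in enumerate(lut)}
    index_bits = max(1, (len(lut) - 1).bit_length())

    compressed_bits = 0
    for word in instructions:
        compressed_bits += 1 + (index_bits if word in index_of else word_bits)

    lut_bits = len(lut) * word_bits  # overhead of storing the LUT itself
    original_bits = len(instructions) * word_bits
    return (compressed_bits + lut_bits) / original_bits


if __name__ == "__main__":
    # Hypothetical program: one word repeated six times, another twice.
    program = [0xE1A00000] * 6 + [0xE12FFF1E] * 2
    for size in (1, 2):
        print(size, dictionary_compression_ratio(program, dict_size=size))
```

Counting `lut_bits` in the numerator is what "already including the overhead of the LUTs" amounts to in this simplified model: a larger table shortens the code stream but must itself be stored.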

2.

NBTI-Aware Data Allocation Strategies for Scratchpad Based Embedded Systems

Cesare Ferri Dimitra Papagiannopoulou R. Iris Bahar Andrea Calimera 《Journal of Electronic Testing》2012,28(3):349-363

The push to embed reliable and low-power memory architectures into modern systems-on-chip is driving the EDA community to develop new design techniques and circuit solutions that can concurrently optimize aging effects due to Negative Bias Temperature Instability (NBTI) and static power consumption due to leakage mechanisms. While recent works have shown how conventional leakage optimization techniques can help mitigate NBTI-induced aging effects on cache memories, in this paper we focus specifically on scratchpad memory (SPM) and present novel software approaches as a means of alleviating the NBTI-induced aging effects. In particular, we demonstrate how intelligent software-directed data allocation strategies can extend the lifetime of partitioned SPMs by distributing the idleness across the memory sub-banks.

3.

Code Decompression Unit Design for VLIW Embedded Processors

Yuan Xie Wolf W. Lekatsas H. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2007,15(8):975-980

Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design the decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead by using different probability models. Experimental results for a VLIW embedded processor, the TMS320C6x, show that compression ratios between 67% and 80% can be achieved, depending on the probability models used. A precache decompression unit design is implemented in TSMC 0.25 μm technology and a test chip is fabricated.

4.

Diagnostic Data Compression Techniques for Embedded Memories with Built-In Self-Test

Jin-Fu Li Ruey-Shing Tzeng Cheng-Wen Wu 《Journal of Electronic Testing》2002,18(4-5):515-527

A system-on-chip (SOC) usually consists of many memory cores with different sizes and functionality, and they typically represent a significant portion of the SOC and therefore dominate its yield. Diagnostics for yield enhancement of the memory cores thus is a very important issue. In this paper we present two data compression techniques that can be used to speed up the transmission of diagnostic data from the embedded RAM built-in self-test (BIST) circuit that has diagnostic support to the external tester. The proposed syndrome-accumulation approach compresses the faulty-cell address and March syndrome to about 28% of the original size on average under the March-17N diagnostic test algorithm. The key component of the compressor is a novel syndrome-accumulation circuit, which can be realized by a content-addressable memory. Experimental results show that the area overhead is about 0.9% for a 1Mb SRAM with 164 faults. A tree-based compression technique for word-oriented memories is also presented. By using a simplified Huffman coding scheme and partitioning each 256-bit Hamming syndrome into fixed-size symbols, the average compression ratio (size of original data to that of compressed data) is about 10, assuming 16-bit symbols. Also, the additional hardware to implement the tree-based compressor is very small. The proposed compression techniques effectively reduce the memory diagnosis time as well as the tester storage requirement.

5.

A Case for Asymmetric-Cell Cache Memories

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2005,13(7):877-881

In this paper, we make the case for building high-performance asymmetric-cell caches (ACCs) that employ recently-proposed asymmetric SRAMs to reduce leakage proportionally to the number of resident zero bits. Because ACCs target memory value content (independent of cell activity and access patterns), they complement prior proposals for reducing cache leakage that target memory access characteristics. Through detailed simulation and leakage estimation using a commercial 0.13-μm CMOS process model, we show that: 1) on average 75% of resident data cache bits and 64% of resident instruction cache bits are zero; 2) while prior research carefully evaluated the fraction of accessed zero bytes, we show that a high fraction of accessed zero bytes is neither a necessary nor a sufficient condition for a high fraction of resident zero bits; 3) the zero-bit program behavior persists even when we restrict our attention to live data, thereby complementing prior leakage-saving techniques that target inactive cells; and 4) ACCs can reduce leakage on average by 4.3× compared to a conventional data cache without any performance loss, and by 9× at the cost of a 5% increase in overall cache access latency.
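The distinction between zero bits and zero bytes drawn in this abstract can be illustrated with a small sketch. It only measures a static byte buffer; the paper's own measurements concern resident versus accessed cache contents, which this toy example does not model. The function names are hypothetical.

```python
def zero_bit_fraction(data: bytes) -> float:
    """Fraction of individual bits in `data` that are zero."""
    if not data:
        raise ValueError("data must not be empty")
    one_bits = sum(bin(byte).count("1") for byte in data)
    total_bits = 8 * len(data)
    return (total_bits - one_bits) / total_bits


def zero_byte_fraction(data: bytes) -> float:
    """Fraction of whole bytes in `data` that are exactly 0x00."""
    if not data:
        raise ValueError("data must not be empty")
    return sum(1 for byte in data if byte == 0) / len(data)


if __name__ == "__main__":
    # Small non-zero values: no zero bytes at all, yet mostly zero bits.
    block = bytes([0x01, 0x01, 0x01, 0x01])
    print(zero_byte_fraction(block), zero_bit_fraction(block))
```

In this example no byte is zero, yet 28 of the 32 bits are, which is one way a buffer can have few zero bytes and still be dominated by zero bits.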

6.

Optimizing Data Placement of Loops for Energy Minimization with Multiple Types of Memories

Jun Zhang Tan Deng Qiuyan Gao Qingfeng Zhuge Edwin H.-M. Sha 《Journal of Signal Processing Systems》2013,72(3):151-164

Strict real-time processing and energy efficiency are required by high-performance Digital Signal Processing (DSP) applications. Scratch-Pad Memory (SPM), a software-controlled on-chip memory with small area and low energy consumption, has been widely used in many DSP systems. Various data placement algorithms have been proposed to effectively manage data on SPMs. However, none of them can provide an optimal solution to the data placement problem for array data in loops. In this paper, we study the problem of how to optimally place array data in loops into multiple types of memory units such that the energy and time costs of memory accesses are minimized. We design a dynamic programming algorithm, Iterational Optimal Data Placement (IODP), to solve the data placement problem for loops on processor architectures with multiple types of memory units. According to the experimental results, the IODP algorithm reduced the energy consumption by 20.04% and 8.98% compared with a random memory placement method and a greedy algorithm, respectively. It also reduced the memory access time by 19.01% and 8.62% compared with a random memory placement method and a greedy approach, respectively.

7.

Reduction of Complexity of On-Board Embedded Robotic System Processors Using Code Offloading

Gauni Sabitha C. T. Manimegalai Srinivasa Raghavan Karthick Narayanan Nandanamudi Doraditya 《Wireless Personal Communications》2017,97(4):5089-5098

The appearance of highly intelligent and advanced robots that recognize and react to their environment has led to an increase in the complexity of the embedded systems that govern their reactions. Over time, this growing complexity has affected the general responsiveness and battery life of the robot. From controlling a drone to an internet-controlled coffee maker, an on-board processor is required to calculate all the control signals that trigger a reaction to the stimuli. As the complexity of the embedded system rises, more complicated computation is required to react to those inputs. This has not only reduced the general responsiveness and battery life of the robot but has also prompted the need to upgrade the hardware to accommodate the increase in the computation required. The proposed solution to this problem is to offload critical tasks such as control signal computation or image processing to a centralized server, which would eliminate the requirement for complex on-board processors. A comprehensive analysis is presented on the feasibility of such code offloading in embedded robotic systems, and its applicability to other domains is discussed. The obtained results will be compared with the results obtained in other domains involving code offloading and analysed to identify any patterns in their results.

8.

Color-Aware Instructions for Embedded Superscalar Processors

Jongmyon Kim Linda M. Wills D. Scott Wills 《Journal of Signal Processing Systems》2011,64(3):335-350

This paper introduces a color-aware instruction set extension that enhances the performance and efficiency in the processing of color images and video. Traditional multimedia extensions (e.g., MMX, VIS, and MDMX) depend solely on generic subword parallelism whereas the proposed color-aware instruction set (CAX) supports parallel operations on two-packed 16-bit (6:5:5) YCbCr (luminance-chrominance) values. A 16-bit YCbCr representation reduces storage requirements by 33% over the baseline 24-bit YCbCr representation while maintaining satisfactory image quality. Experimental results on an identically configured, dynamically scheduled superscalar processor indicate that CAX outperforms MDMX (a representative MIPS multimedia extension) in terms of speedup (3.9× baseline ISA with CAX versus 2.1× with MDMX) and energy reduction (75.8% reduction over baseline with CAX versus 54.8% reduction with MDMX). CAX can improve the performance and efficiency of future embedded color imaging products.
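The 16-bit (6:5:5) representation mentioned above can be sketched as follows. The bit layout (Y in the top six bits, then Cb, then Cr) and the truncation of low-order bits are assumptions made for illustration; the abstract does not specify how CAX arranges or rounds the fields. The function names are hypothetical.

```python
def pack_ycbcr_655(y: int, cb: int, cr: int) -> int:
    """Pack 8-bit Y, Cb, Cr into 16 bits: 6 bits Y, 5 bits Cb, 5 bits Cr."""
    for value in (y, cb, cr):
        if not 0 <= value <= 255:
            raise ValueError("components must be 8-bit values")
    # Assumed layout: [15:10] = Y, [9:5] = Cb, [4:0] = Cr (low bits dropped).
    return ((y >> 2) << 10) | ((cb >> 3) << 5) | (cr >> 3)


def unpack_ycbcr_655(packed: int) -> tuple:
    """Expand a 16-bit 6:5:5 value back to approximate 8-bit components."""
    y = (packed >> 10) & 0x3F
    cb = (packed >> 5) & 0x1F
    cr = packed & 0x1F
    return (y << 2, cb << 3, cr << 3)


if __name__ == "__main__":
    packed = pack_ycbcr_655(200, 100, 50)
    print(hex(packed), unpack_ycbcr_655(packed))
    print("storage saving:", 1 - 16 / 24)  # one third, i.e. about 33%
```

Going from 24 to 16 bits per pixel saves one third of the storage, which matches the 33% figure in the abstract; two such values fit in one 32-bit word, which is what makes "two-packed" operations possible.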

9.

A Retargetable Compilation Methodology for Embedded Digital Signal Processors Using a Machine-Dependent Code Optimization Library

Ashok Sudarsanam Sharad Malik Masahiro Fujita 《Design Automation for Embedded Systems》1999,4(2-3):187-206

We address the problem of code generation for embedded DSP systems. Such systems devote a limited quantity of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs, because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. In order to increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs. This revised version was published online in July 2006 with corrections to the Cover Date.

10.

Techniques for Leakage Energy Reduction in Deep Submicrometer Cache Memories

Frustaci F. Corsonello P. Perri S. Cocorullo G. 《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2006,14(11):1238-1249

The techniques known in the literature for the design of SRAM structures with low standby leakage typically exploit an additional operation mode, named the sleep mode or the standby mode. In this paper, existing low-leakage SRAM structures are analyzed using several SPEC2000 benchmarks. As expected, the examined SRAM architectures have static power consumption lower than the conventional 6-T SRAM cell. However, the additional activities performed to enter and to exit the sleep mode also lead to higher dynamic energy. Our study demonstrates that, due to this, the overall energy consumption achieved by the known low-leakage techniques is greater than that of the conventional approach. In the second part of this paper, a novel low-leakage SRAM cell is presented. The proposed structure establishes when to enter and to exit the sleep mode on the basis of the data stored in it, without introducing time and energy penalties with respect to the conventional 6-T cell. The new SRAM structure was realized using the UMC 0.18-μm, 1.8-V and the ST 90-nm, 1-V CMOS technologies. Tests performed with a set of SPEC2000 benchmarks have shown that the proposed approach is actually energy efficient.

11.

Dynamically Translating Binary Code for Multi-Threaded Programs Using Shared Code Cache

Chia-Lun Liu Jiunn-Yeu Chen Wuu Yang Wei-Chung Hsu 《中国电子科技》2014,(4):434-438

mc2llvm is a process-level ARM-to-x86 binary translator developed in our lab over the past several years. Currently, it is able to emulate single-threaded programs. We extend mc2llvm to emulate multi-threaded programs. Our main task is to reconstruct its architecture for multi-threaded programs. Register mapping, code cache management, and address mapping in mc2llvm have all been modified. In addition, to further speed up the emulation, we collect hot paths, then aggressively optimize and generate code for them at run time. Additional threads are used to alleviate the overhead. Thus, when the same hot path is walked through again, the corresponding optimized native code will be executed instead. In our experiments, our system is 8.8X faster than QEMU (quick emulator) on average when emulating the specified benchmarks with 8 guest threads.

12.

Recurrence-Aware Instruction Set Selection for Extensible Embedded Processors

《Very Large Scale Integration (VLSI) Systems, IEEE Transactions on》2008,16(10):1259-1267

Automatic generation of a customized instruction set, starting from an input application code, is a complex problem that has received considerable attention in the past few years. Because of its complexity, only simplified versions of the problem have been solved exactly so far. For example, exact algorithms have been proposed for custom instruction identification, but they do not consider recurrence; other methods exist that can indeed handle recurrence, but are limited in how complex an instruction they can identify. However, an exact solution that can handle identification and recurrence simultaneously has been missing. We divide the problem into several parts and concentrate on covering, that is, selecting a set of nonoverlapping and possibly recurrent custom instructions to be implemented and used. We then propose a range of novel algorithms, both exact and approximate, to solve the covering problem in conjunction with the recurrence of candidate extensions. We propose an optimal search technique that uses branch-and-bound to improve an existing solution, in conjunction with a greedy search to help the algorithm out of any local optima, and achieve a tangible improvement over nonrecurrence-aware covering.

13.

An Energy-Efficient BTB Lookup Scheme for Embedded Processors

《Circuits and Systems II: Express Briefs, IEEE Transactions on》2006,53(9):817-821

In this brief, we propose an energy-efficient branch target buffer (BTB) lookup scheme for embedded processors. Unlike the traditional scheme, in which the BTB has to be looked up on every instruction fetch, in our design the BTB is only looked up when the instruction is likely to be a taken branch. By dynamically profiling the taken traces during program execution, the new scheme can achieve the goal of one BTB lookup per taken trace. The experimental results show that, by filtering out the redundant lookups, our design can reduce the total processor energy consumption by about 5.24% on average for the MediaBench applications.

14.

Design of a Digital Filter and Implementation of Its Algorithm

蒋明《微电子学与计算机》2000,17(2):61-64

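This article describes the design of a digital filter using Intel's new 80C196UN microcontroller. A filter designed with this embedded digital signal processor has relatively accurate magnitude, phase, and response, and eliminates errors such as noise and voltage deviations caused by using multiple devices. The article also gives an overview of the C196UN hardware structure, the software design, and the key statements used to carry out signal processing.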

15.

Research and Design of a Branch Target Buffer for Embedded Processors

王晨旭 张凯峰 张祥建 喻明艳 《微电子学与计算机》2012,29(1):27-31

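Targeting the characteristics of embedded applications, a branch target buffer (BTB) based on RAM tag comparison is designed. Using hardware simulation (the BTB control logic is implemented in RTL and the storage array in custom logic), we study the influence of the BTB structural parameters on the performance and energy consumption of the BTB itself and of the whole processor system, and select the optimal BTB structural parameters for embedded processors according to the simulation results. Based on these parameters, a BTB based on CAM tag comparison is then designed. Evaluated with SPEC2000, the CAM-based BTB reduces power consumption by 37.17% relative to the RAM-based BTB.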

16.

Testing and Diagnosis Methodologies for Embedded Content Addressable Memories

Jin-Fu Li Ruey-Shing Tzeng Cheng-Wen Wu 《Journal of Electronic Testing》2003,19(2):207-215

Embedded content addressable memories (CAMs) are important components in many system chips where most CAMs are customized and have wide words. This poses challenges for testing and diagnosis. In this paper two efficient March-like test algorithms are proposed first. In addition to typical RAM faults, they also cover CAM-specific comparison faults. The first algorithm requires 9N Read/Write operations and 2(N + W) Compare operations to cover comparison and RAM faults (but does not fully cover the intra-word coupling faults), for an N × W-bit CAM. The second algorithm uses 3N log₂ W Write and 2W log₂ W Compare operations to cover the remaining intra-word coupling faults. Compared with the previous algorithms, the proposed algorithms have higher fault coverage and lower time complexity. Moreover, they can test the CAM even when its comparison result is observed only by the Hit output or the priority encoder output. We also present algorithms that can locate the cells with comparison faults. Finally, a CAM BIST design is briefly described.
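The operation counts quoted in this abstract can be evaluated for a concrete CAM size with a few lines of arithmetic. The sketch below simply plugs N and W into the stated formulas; rounding log₂ W up to an integer for word widths that are not powers of two is an assumption, and the function name and dictionary keys are hypothetical.

```python
def cam_test_operation_counts(n_words: int, word_bits: int) -> dict:
    """Evaluate the abstract's formulas for an N x W-bit CAM."""
    if n_words < 1 or word_bits < 1:
        raise ValueError("N and W must be positive")
    # ceil(log2(W)); exact when W is a power of two (rounding is an assumption).
    log_w = (word_bits - 1).bit_length()
    return {
        "alg1_read_write": 9 * n_words,               # 9N
        "alg1_compare": 2 * (n_words + word_bits),    # 2(N + W)
        "alg2_write": 3 * n_words * log_w,            # 3N log2 W
        "alg2_compare": 2 * word_bits * log_w,        # 2W log2 W
    }


if __name__ == "__main__":
    for name, count in cam_test_operation_counts(1024, 64).items():
        print(f"{name}: {count}")
```

For a 1024 × 64-bit CAM this gives 9216 Read/Write and 2176 Compare operations for the first algorithm, and 18432 Write and 768 Compare operations for the second.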

17.

Priority Based Error Correction Code (ECC) for the Embedded SRAM Memories in H.264 System

Insoo Lee Jinmo Kwon Jangwon Park Jongsun Park 《Journal of Signal Processing Systems》2013,73(2):123-136

With aggressive supply voltage scaling, SRAM bit-cell failures in the embedded memory of the H.264 system result in significant degradation of video quality. Error Correction Coding (ECC) has been widely used in embedded memories in order to correct these failures; however, the conventional ECC approach does not consider the differences in the importance of the data stored in the memory. This paper presents a priority based ECC (PB-ECC) approach, where the more important higher order bits (HOBs) are protected with higher priority than the less important lower order bits (LOBs), since the human visual system is less sensitive to LOB errors. A mathematical analysis of the error correction capability of the PB-ECC scheme and its resulting peak signal-to-noise ratio (PSNR) degradation in the H.264 system is also presented to help designers determine the bit-allocation of the higher and lower priority segments of the embedded memory. We designed and implemented three PB-ECC cases (Hamming only, BCH only, and Hybrid PB-ECC) using 90 nm CMOS technology. With the supply voltage at 900 mV or below, the experimental results show up to 6.0 dB PSNR improvement with a smaller circuit area compared to the conventional ECC approach.
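The idea of protecting higher order bits more strongly than lower order bits can be sketched with a textbook Hamming(7,4) code. This is an illustration only: the 4/4 split of an 8-bit pixel, the choice to leave the LOBs entirely unprotected, and all function names are assumptions, not the bit-allocation or codes evaluated in the paper.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits as 7 code bits at positions 1..7 (p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = [(nibble >> shift) & 1 for shift in (3, 2, 1, 0)]
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list) -> int:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s3  # 0 means no error detected
    if error_position:
        c[error_position - 1] ^= 1
    return (c[2] << 3) | (c[4] << 2) | (c[5] << 1) | c[6]


def protect_pixel(pixel: int) -> tuple:
    """Assumed split: upper 4 bits get Hamming(7,4), lower 4 bits stay raw."""
    if not 0 <= pixel <= 255:
        raise ValueError("pixel must be an 8-bit value")
    return hamming74_encode(pixel >> 4), pixel & 0x0F


def recover_pixel(hob_code: list, lob: int) -> int:
    """Rebuild the pixel, correcting a single-bit error in the HOB codeword."""
    return (hamming74_decode(hob_code) << 4) | (lob & 0x0F)


if __name__ == "__main__":
    code, lob = protect_pixel(0xA5)
    code[2] ^= 1  # inject a single-bit failure into the protected segment
    print(hex(recover_pixel(code, lob)))
```

In this sketch a single-bit failure in the protected segment is corrected, while a failure in the unprotected lower four bits changes the pixel value by at most 15, which reflects the intuition that LOB errors are less visible.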

18.

System-Level Energy-Delay Exploration for Multimedia Applications on Embedded Cores with Hardware Cache

C. Kulkarni D. Moolenaar L. Nachtergaele F. Catthoor H. De Man 《The Journal of VLSI Signal Processing》1999,22(1):45-57

Program transformations are a powerful way of optimizing given applications for lower power and higher performance. In this paper, we explore avenues for power reduction by program transformations using the real-time constraints. In the sequel, we discuss the effects of our methodology for power optimization on cache-related performance aspects. Our target applications are in the real-time multimedia applications domain, implemented on programmable multimedia or DSP processors. The effectiveness of our approach in obtaining a low-power implementation and real-time performance is illustrated on three real-life applications, viz. an MPEG-2 decoder, a QSDPCM video codec, and a Voicecoder application. Our experimental results indeed show that we are able to obtain lower power and still achieve real-time performance.

19.

Compositional, Dynamic Cache Management for Embedded Chip Multiprocessors

Anca M. Molnos Sorin D. Cotofana Marc J. M. Heijligers Jos T. J. van Eijndhoven 《Journal of Signal Processing Systems》2009,57(2):155-172

This paper proposes a dynamic cache repartitioning technique that enhances compositionality on platforms executing media applications with multiple utilization scenarios. Because the repartitioning between scenarios requires a cache flush, two undesired effects may occur: (1) in particular, the execution of critical tasks may be disturbed and (2) in general, a performance penalty is involved. To cope with these effects we propose a method which: (1) determines, at design time, the cache footprint of each task, such that it creates the premises for critical task safety, and minimum flush in general, and (2) enforces, at run time, the cache footprints determined at design time and further decreases the flush penalty. We implement our dynamic cache management strategy on a CAKE multiprocessor with 4 Trimedia cores. The experimental workload consists of 6 multimedia applications, each of which is formed by multiple tasks belonging to an extended MediaBench suite. We found on average that: (1) the relative variations of critical task execution time are less than 0.1%, regardless of the scenario switching frequency, (2) for realistic scenario switching frequencies the inter-task cache interference is at most 4% for the repartitioned cache, whereas for the shared cache it reaches 68%, and (3) the off-chip memory traffic is reduced by 60%, and the performance (in cycles per instruction) improves by 10%, when compared with the shared cache.


20.

Instruction Cache Locking for Embedded Systems using Probability Profile

Tiantian Liu Minming Li Chun Jason Xue 《Journal of Signal Processing Systems》2012,69(2):173-188

Cache is effective in bridging the gap between processor and memory speed. It is also a source of unpredictability because of its dynamic and adaptive behavior. Many modern processors provide a cache locking capability, which locks instructions or data of a program into the cache so that a more precise estimation of execution time can be obtained. The selection of instructions or data to be locked in the cache has a dramatic influence on system performance. For real-time systems, cache locking is mostly utilized to improve the Worst-Case Execution Time (WCET). However, Average-Case Execution Time (ACET) is also an important criterion for some embedded systems, especially for soft real-time embedded systems such as image processing systems. This paper aims to utilize the instruction cache (I-Cache) locking technique to guarantee a minimized estimable ACET for embedded systems by exploring the probability profile information. A Probability Execution Flow Tree (PEFT) is introduced to model an embedded application with runtime profile information. The static I-Cache locking problem is proved to be NP-Hard, and two kinds of locking, full locking and partial locking, are proposed to find the instructions to be locked. Dynamic I-Cache locking can further improve the ACET. For dynamic I-Cache locking, an algorithm that leverages the application’s branching information is proposed. All the algorithms are executed at compilation time and the results are applied at runtime. Experimental results show that the proposed algorithms reduce the ACET of embedded applications further compared to state-of-the-art techniques.