Similar Literature
20 similar documents found.
1.
We address the problem of code generation for embedded DSP systems. In such systems, it is typical for one or more digital signal processors (DSPs), program memory, and custom circuitry to be integrated onto a single IC. Consequently, the amount of silicon area that is dedicated to program memory is limited, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, existing compiler technology is unable to generate dense, high-performance code for DSPs since it does not provide adequate support for the specialized architectural features of DSPs. These specialized features not only allow for the fast execution of common DSP operations, but they also allow for the generation of dense assembly code that specifies these operations. Thus, system designers often hand-program the embedded software in assembly, which is a very time-consuming task. In this paper, we focus on providing compiler support for one particular specialized architectural feature, namely the paged absolute addressing mode. This feature is found in two commercial DSPs, the Texas Instruments TMS320C25 and TMS320C50 fixed-point DSPs, but it may also be featured in application-specific processors (ASIPs). We present some machine-dependent code optimizations that improve code density by exploiting this architectural feature. Experimental results demonstrate that for a set of typical DSP benchmarks, some of our optimizations reduce overall code size and data memory consumption by an average of 5.0% and 16.0%, respectively. Our experimental vehicle throughout this research is the TMS320C25.
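For intuition only: on a processor with paged direct addressing (the TMS320C25 selects a 128-word page through a data-page pointer), code density depends on how rarely that pointer has to be reloaded. The sketch below is a hypothetical greedy packer that co-locates variables with high access affinity in the same page and then counts the remaining page-pointer loads; it is not the machine-dependent optimization described in the paper.

```python
# Illustrative sketch only: place variables that are accessed close together
# in time into the same 128-word data page, so that fewer page-pointer loads
# (e.g. LDP on the TMS320C25) need to be emitted. The affinity heuristic is
# an assumption, not the paper's optimization.

PAGE_SIZE = 128  # words per page in the TMS320C25 direct addressing mode

def assign_pages(access_trace):
    """access_trace: list of variable names in the order the code touches them."""
    # Count how often each pair of distinct variables is accessed back to back.
    affinity = {}
    for a, b in zip(access_trace, access_trace[1:]):
        if a != b:
            key = tuple(sorted((a, b)))
            affinity[key] = affinity.get(key, 0) + 1

    page_of = {}
    pages = []  # each page is a list of variables (at most PAGE_SIZE entries)

    # Handle the highest-affinity pairs first, trying to co-locate each pair.
    for (a, b), _ in sorted(affinity.items(), key=lambda kv: -kv[1]):
        for v in (a, b):
            if v not in page_of:
                other = b if v == a else a
                target = page_of.get(other)
                if target is not None and len(pages[target]) < PAGE_SIZE:
                    pages[target].append(v)
                    page_of[v] = target
                else:
                    pages.append([v])
                    page_of[v] = len(pages) - 1

    # Variables never seen next to a different variable go into the last page.
    for v in access_trace:
        if v not in page_of:
            if not pages or len(pages[-1]) >= PAGE_SIZE:
                pages.append([])
            pages[-1].append(v)
            page_of[v] = len(pages) - 1
    return page_of

def count_page_loads(access_trace, page_of):
    """Every change of the current data page costs one page-pointer load."""
    loads, current = 0, None
    for v in access_trace:
        if page_of[v] != current:
            loads += 1
            current = page_of[v]
    return loads
```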

2.
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so in addition to meeting various high-performance constraints, the application software must be sufficiently dense. Unfortunately, existing compiler technology is unable to generate high-quality code for DSPs since it does not provide adequate support for the specialized architectural features of DSPs. Thus, designers often resort to programming application software in assembly, which is a very tedious and time-consuming task. In this paper, we focus on providing compiler support for a group of specialized architectural features that exist in many DSPs, namely indirect addressing modes with auto-increment/decrement arithmetic. In these DSPs, an indexed addressing mode is generally not available, so automatic variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and the size of the generated code. Our objective is to provide a method for comprehensively analyzing the performance benefits and hardware cost of an auto-increment/decrement feature that varies from −l to +l, while allowing access to k address registers in an address generator. We provide this method via a parameterizable optimization algorithm that operates on a procedure-wise basis. Thus, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions on address arithmetic features.
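To make the design-space question concrete, the following sketch (an illustration under assumed costs, not the paper's parameterizable algorithm) simulates an address generator with k address registers and free auto-modify steps in the range [−l, +l], and counts how many explicit address-arithmetic instructions a given access trace still needs; sweeping k and l gives a rough picture of the benefit/cost trade-off the paper analyzes rigorously.

```python
# Minimal sketch, not the paper's algorithm: estimate the number of explicit
# address-arithmetic instructions left over for an access sequence when the
# address generator offers k address registers and free auto-modify steps
# in the range [-l, +l].

def explicit_address_ops(offsets, k, l):
    """offsets: memory offsets of the variables in access order."""
    regs = []           # current contents of the (at most k) address registers
    last_used = {}      # register index -> time of last use, for LRU eviction
    explicit = 0
    for t, off in enumerate(offsets):
        # Prefer a register that can reach `off` with a free auto-modify step.
        best = None
        for i, val in enumerate(regs):
            if abs(off - val) <= l:
                best = i
                break
        if best is None:
            if len(regs) < k:            # a free register: load it explicitly
                regs.append(off)
                best = len(regs) - 1
            else:                        # evict the least recently used register
                best = min(range(k), key=lambda i: last_used.get(i, -1))
                regs[best] = off
            explicit += 1                # explicit pointer load / add
        else:
            regs[best] = off             # auto-increment/decrement, no extra cost
        last_used[best] = t
    return explicit

# Example sweep over the (k, l) design space for an assumed access trace:
trace = [0, 1, 2, 5, 6, 3, 4, 9, 8, 7]
for k in (1, 2, 3):
    for l in (1, 2):
        print(f"k={k} l={l}: {explicit_address_ops(trace, k, l)} explicit address ops")
```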

3.
To meet strict speed and power requirements for embedded applications, many high-end digital signal processors (DSPs) employ non-orthogonal architectures that are typically characterized by irregular data paths, heterogeneous registers, and multiple memory banks. To harvest the benefits provided by such non-orthogonal architectures, sufficient compiler support is necessary. However, the complexity of these architectures presents a great challenge to compiler design, and the usual compilation techniques for general-purpose CPUs do not adapt well to the irregularity of DSPs. The entire code generation process must include the following phases: intermediate representation, code compaction, instruction scheduling, memory bank assignment (or variable partition), and register/accumulator assignment. Much related research considers only some of these phases, which is inadequate. In this paper, we present an effective code generation algorithm named Rotation Scheduling with Spill Codes Predicting (RSSP) to maximally exploit the benefits of non-orthogonal architectures. It contains six parts that cover almost the entire code generation process. As well as introducing the detailed principles and algorithms of the proposed RSSP, we use an analytic model to evaluate its preliminary performance. Evaluation results clearly demonstrate the effectiveness of the proposed method. Furthermore, we also present some preliminary ideas to generalize RSSP, making it more practical and suited to various DSPs with similar architectural features.
Cheng Chen (corresponding author)

4.
Reducing address arithmetic operations by optimizing address offset assignment greatly improves the performance of digital signal processor (DSP) applications. However, minimizing address operations alone may not directly reduce code size and schedule length for DSPs with multiple functional units. Little research has been conducted on the loop optimization with address offset assignment problem for architectures with multiple functional units. In this paper, we combine loop scheduling, array interleaving, and address assignment to minimize the schedule length and the number of address operations for loops on DSP architectures with multiple functional units. Array interleaving is applied to optimize address assignment for arrays in the loop scheduling process. An algorithm, Address Operation Reduction Rotation Scheduling (AORRS), is proposed; it minimizes both the schedule length and the number of address operations. Compared with list scheduling, AORRS shows an average reduction of 38.4% in schedule length and an average reduction of 31.7% in the number of address operations. Compared with rotation scheduling, AORRS shows an average reduction of 15.9% in schedule length and 33.6% in the number of address operations.
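As a small illustration of why array interleaving interacts with address assignment (a toy example, not the AORRS algorithm): if a loop alternately accesses a[i] and b[i], separate array layouts force a long pointer jump on every access, whereas interleaving the two arrays turns every step into a unit auto-increment that costs no extra instruction.

```python
# Toy illustration of array interleaving for address-operation reduction.
# Assumed cost model: a pointer step within the free auto-modify range
# (here +/-1) is free, anything larger needs one explicit address operation.

N = 8

def address_ops(addresses, auto_step=1):
    """Count explicit pointer updates: steps larger than the auto-modify range."""
    ops = 0
    for prev, cur in zip(addresses, addresses[1:]):
        if abs(cur - prev) > auto_step:
            ops += 1
    return ops

# Separate layout: a occupies [0, N), b occupies [N, 2N).
separate = []
for i in range(N):
    separate += [i, N + i]          # access a[i] then b[i]

# Interleaved layout: a[i] at address 2i, b[i] at 2i + 1.
interleaved = []
for i in range(N):
    interleaved += [2 * i, 2 * i + 1]

print("separate layout:   ", address_ops(separate), "explicit address operations")
print("interleaved layout:", address_ops(interleaved), "explicit address operations")
```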

5.
We address the problem of code generation for embedded DSP systems. Such systems devote a limited quantity of silicon to program memory, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints. Unfortunately, current compiler technology is unable to generate dense, high-performance code for DSPs because it does not provide adequate support for the specialized architectural features of DSPs via machine-dependent code optimizations. Thus, designers often program the embedded software in assembly, a very time-consuming task. In order to increase productivity, compilers must be developed that are capable of generating high-quality code for DSPs. The compilation process must also be made retargetable, so that a variety of DSPs may be efficiently evaluated for potential use in an embedded system. We present a retargetable compilation methodology that enables high-quality code to be generated for a wide range of DSPs. Previous work in retargetable DSP compilation has focused on complete automation, and this desire for automation has limited the number of machine-dependent optimizations that can be supported. In our efforts, we have given code quality higher priority than complete automation. We demonstrate how, by using a library of machine-dependent optimization routines accessible via a programming interface, it is possible to support a wide range of machine-dependent optimizations, albeit at some cost to automation. Experimental results demonstrate the effectiveness of our methodology, which has been used to build good-quality compilers for three fixed-point DSPs.

6.
A 2-μm CMOS VLSI digital signal processor (DSP) family, the SP50, is described that is capable of eight million instructions per second and up to six concurrent operations in each instruction. Two DSPs, the PCB5010 and PCB5011, have been developed. Both are based on a common architecture which contains two 16-bit data buses, a 16 × 16 → 40-bit multiplier-accumulator, and a 16-bit ALU, both with multiprecision support in hardware. Also implemented are two static data RAMs (128 × 16 or 256 × 16), a data ROM (51 × 16), a 15-word three-port register file, three address computation units, and five serial and parallel I/O interfaces. The data path is controlled by an orthogonal instruction set, using 40-bit microcode words. The controller contains a five-level stack and an instruction repeat register, and can have either on-chip program memory (RAM: 32 × 40; ROM: 987 × 40) or off-chip program memory (up to 64K × 40). Benchmarks show a two- to sixfold improvement in overall performance over its predecessors.

7.
Many software compilers for embedded processors produce machine code of insufficient quality. Since for most applications the software must meet tight code speed and size constraints, embedded software is still largely developed in assembly language. In order to eliminate this bottleneck and to enable the use of high-level language compilers also for embedded software, new code generation and optimization techniques are required. This paper describes a novel code generation technique for embedded processors with irregular data path architectures, such as those typically found in fixed-point DSPs. The proposed code generation technique maps a data flow graph representation of a program into highly efficient machine code for a target processor modeled by its instruction set behavior. High code quality is ensured by tight coupling of the different code generation phases. In contrast to earlier work, which is mainly based on heuristics, our approach is constraint-based. An initial set of constraints on code generation is prescribed by the given processor model. Further constraints arise during code generation from decisions concerning code selection, register allocation, and scheduling. Whenever possible, decisions are postponed until sufficient information about a good decision has been collected. The constraints are active in the "background" and guarantee local satisfiability at any point of time during code generation. This mechanism makes it possible to cope simultaneously with special-purpose registers and instruction-level parallelism. We describe the detailed integration of the code generation phases. The implementation is based on the constraint logic programming (CLP) language ECLiPSe. For a standard DSP, we show that the quality of the generated code comes close to that of hand-written assembly code. Since the input processor model can be edited by the user, retargetability of the code generation technique is also achieved within a certain processor class.

8.
The synthesis of efficient programs for digital signal processors with non-homogeneous register sets is still a challenge for compiler design. In this paper, we introduce the concept of a data flow graph compiler for digital signal processors. In a first step, the data flow graph is decomposed into constrained expression trees and represented by trellis trees, which makes it possible to apply a straight-line code generation algorithm whose complexity grows only linearly with the size of the graph. Registers are assigned by taking into account the constraints of multi-function instructions. The execution time of the resulting assembly code is minimized by exploiting instruction-level parallelism and memory layout optimizations.

9.
The focus of high-level built-in self-test (BIST) synthesis is register assignment, which involves system register assignment, BIST register assignment, and interconnection assignment. To reduce the complexity involved in the assignment process, existing high-level BIST synthesis methods decouple the three tasks and perform them sequentially at the cost of global optimality. They also try to achieve only one objective: minimizing either area overhead or test time. Hence, those methods do not allow exploration of a large design space and may result in a local optimum. In this paper, we propose a new approach to BIST data path synthesis based on integer linear programming that performs the three register assignment tasks concurrently to yield optimal designs. In addition, our approach finds an optimal register assignment for each k-test session. Therefore, it offers a range of designs with different figures of merit in area and test time. Our experimental results show that our method successfully synthesizes a BIST circuit for every k-test session for all six circuits examined. All the BIST circuits have lower area overhead than those generated by existing high-level BIST synthesis methods.

10.
Efficient address register allocation has been shown to be a central problem in code generation for processors with restricted addressing modes. This paper extends previous work on Global Array Reference Allocation (GARA), the problem of allocating address registers to array references in loops. It describes two heuristics for the problem, presenting experimental data to support them. In addition, it proposes an approach to solve GARA optimally which, albeit computationally exponential, is useful for measuring the efficiency of other methods. Experimental results, using the MediaBench benchmark and profiling information, reveal that the proposed heuristics can solve the majority of the benchmark loops near-optimally in polynomial time. A substantial execution-time speedup is reported for the benchmark programs when compiled with the optimized version of GCC rather than the original.

11.
The performance of signal-processing algorithms implemented in hardware depends on the efficiency of the datapath, memory speed, and address computation. The pattern of data access in signal-processing applications is complex, and it is desirable to execute the innermost loop of a kernel in a single clock cycle. This necessitates the generation of typically three addresses per clock: two addresses for data samples/coefficients and one for the storage of processed data. Most reconfigurable processors designed for multimedia focus on mapping multimedia applications written in a high-level language directly onto the reconfigurable fabric, implying the use of the same datapath resources for kernel processing and address generation. This results in inconsistent and non-optimal use of finite datapath resources. The presence of a set of dedicated, efficient Address Generator Units (AGUs) helps in better utilisation of the datapath elements by using them only for kernel operations, and will certainly enhance performance. This article focuses on the design and application-specific integrated circuit implementation of address generators for the complex addressing modes required by multimedia signal-processing kernels. A novel algorithm and hardware for an AGU are developed for accessing data and coefficients in bit-reversed order for the fast Fourier transform kernel spanning log2 N stages, together with AGUs for zig-zag-ordered data access for entropy coding after the Discrete Cosine Transform (DCT), for convolution kernels with stored/streaming data, for accessing data for motion estimation using the block-matching technique, and for other conventional addressing modes. When mapped to hardware, they scale linearly in gate complexity with increasing size.
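For reference, the bit-reversed access order an FFT-oriented AGU has to produce can be written out directly. The sketch below assumes a power-of-two N and simply reverses the log2 N index bits in software, whereas the hardware unit would generate the same sequence with reverse-carry addition.

```python
# Sketch of the classic bit-reversed address sequence used by FFT kernels,
# assuming N is a power of two. Written in software purely to illustrate the
# access pattern a dedicated AGU generates.

def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev

def bit_reversed_order(n):
    bits = n.bit_length() - 1          # log2(n) for a power-of-two n
    return [bit_reverse(i, bits) for i in range(n)]

print(bit_reversed_order(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```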

12.
A synthesis environment that targets software-programmable architectures such as digital signal processors (DSPs) is presented. These processors are well suited for the implementation of real-time signal processing systems with medium throughput requirements. Techniques that tightly couple the synthesis environment to an existing communication system simulator are also presented. This enables a seamless transition between the simulation and implementation design levels of communication systems. Special focus is on optimization techniques for mapping data-flow-oriented block diagrams onto DSPs. The combination of different mapping and optimization strategies allows comfortable synthesis of real-time code that is highly adapted to application-specific needs imposed by constraints on memory space, sampling rate, or latency. Thus, tradeoff analysis is supported by efficient interactive or automatic exploration of the design space. All presented concepts are illustrated by the design of a phase synchronizer with automatic gain control on a floating-point DSP.

13.
14.
In this paper, we present the problem of storage bandwidth optimization (SBO) in VLSI system realizations. Our goal is to minimize the required memory bandwidth within a given cycle budget by adding ordering constraints to the flow graph. This allows the subsequent memory allocation and assignment tasks to arrive at a cheaper memory architecture with fewer memories and memory ports. The importance and the effect of SBO are shown on realistic examples in both the video and asynchronous transfer mode (ATM) domains. We show that it is important to take into account which data are being accessed in parallel, instead of only considering the number of simultaneous memory accesses. Our problem formulation leads to the optimization of a conflict (hyper)graph. For the target domain of ATM, only flat graphs without loops have to be treated. For this subproblem, a prototype tool has been implemented to demonstrate the feasibility of automating this important system design step.
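A minimal sketch of the conflict-graph view (assumed data and a plain greedy coloring, not the paper's optimization flow): arrays accessed in the same cycle are connected by an edge, and any valid assignment of arrays to single-port memories must keep connected arrays apart.

```python
# Build a conflict graph whose nodes are arrays and whose edges say "these
# arrays are accessed in the same cycle", then greedily assign arrays to
# single-port memories so that conflicting arrays never share one.
# The schedule below is an assumed example.

from itertools import combinations

# One entry per cycle: the set of arrays accessed in parallel in that cycle.
schedule = [
    {"A", "B"},
    {"A", "C"},
    {"B", "C", "D"},
    {"D"},
]

# Conflict graph: an edge for every pair of arrays accessed in the same cycle.
conflicts = set()
for accesses in schedule:
    for x, y in combinations(sorted(accesses), 2):
        conflicts.add((x, y))

arrays = sorted({a for accesses in schedule for a in accesses})
memory_of = {}
for a in arrays:
    used = {memory_of[b] for b in memory_of
            if (a, b) in conflicts or (b, a) in conflicts}
    m = 0
    while m in used:
        m += 1
    memory_of[a] = m            # lowest-numbered memory with no conflict

print(memory_of)                # e.g. {'A': 0, 'B': 1, 'C': 2, 'D': 0}
```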

15.
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient prefetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system's performance. The application code is split into two parallel programs: the first runs on the Access processor and computes the addresses of the data in the memory hierarchy; the second processes the application data and runs on the Execute processor, a processor whose address space is limited to the register file. Each transfer of any block in the memory hierarchy up to the Execute processor's register file is controlled by the Access processor and the DMA units. This strongly differentiates this architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies and with (c) existing decoupled architectures, showing its higher normalized performance. The reason for this gain is the efficiency of data transfer that the scratch-pad memory hierarchy provides, combined with the ability of decoupled processors to eliminate memory latency by using memory management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance is increased by up to almost 2 times compared to uniprocessor architectures with scratch-pad memories and up to 3.7 times compared to those with caches. The proposed architecture achieves this performance without penalties in energy-delay product.
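As a rough software analogy of the access/execute split (an assumption for illustration, not the proposed hardware), the sketch below separates a streaming kernel into an address-generating producer and a data-processing consumer; all address computation lives on the producer side, mirroring how the Access processor feeds operands to the Execute processor.

```python
# Software analogy only: Python generators stand in for the Access and
# Execute processors and the operand stream between them.

def access_program(memory, n):
    """Access side: computes addresses (a simple stride pattern here, as an
    assumed example) and streams the fetched operands onward."""
    for i in range(n):
        addr = 2 * i                 # address computation lives only here
        yield memory[addr]

def execute_program(operands):
    """Execute side: pure data processing, no addresses, only the operand stream."""
    acc = 0
    for x in operands:
        acc += x * x
    return acc

scratch_pad = list(range(32))
print(execute_program(access_program(scratch_pad, 8)))   # sum of squares of even entries
```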

16.
A new technique is proposed for simultaneously executing multiway branching with PLAs. The basic structure consists of three units: the microcode ROM, the microsequencer PLA, and a register counter. The main idea is to store only the branching information in the PLA, exploiting its associative addressing properties. Benefits in speed and memory space are obtained.

17.
This paper presents a network flow approach to solving the register binding and allocation problem for multiword memory access DSP processors. Recently announced DSP processors support 16-bit instructions that simultaneously access four words from memory. A polynomial-time network flow methodology is used to allocate multiword accesses, including constant data-memory layout, while minimizing code size. Results show that improvements of up to 87% in terms of memory bandwidth are obtained compared to compiler-generated DSP code. This research is important for industry since this value-added technique can improve code size and utilize higher memory bandwidths without increasing cost.

18.
The channel assignment problem has become increasingly important in mobile telephone communication: since the usable range of the frequency spectrum is limited, channels must be assigned optimally. Recently, Genetic Algorithms (GAs) have been proposed as new computational tools for solving optimization problems. GAs are more attractive than other optimization techniques, such as neural networks or simulated annealing, since GAs are generally good at finding an acceptably good global solution to a problem very quickly. In this paper, a new channel assignment algorithm using GAs is proposed. The channel assignment problem is formulated as an energy minimization problem that is implemented by GAs. Appropriate GA operators such as reproduction, crossover, and mutation are developed and tested. In this algorithm, the cell frequency is not fixed before the assignment procedure, as it is in a previously reported channel assignment algorithm using neural networks. The average number of generations and the convergence rate of the GA are shown as simulation results. When the number of cells in one cluster is increased, the number of generations increases and the convergence rate decreases. On the other hand, with an increased minimal frequency interval, the number of generations decreases and the convergence rate increases. A comparison of various crossover and mutation techniques in simulation shows that the combination of two-point crossover and selective mutation provides better results. Three constraints are considered for the channel assignments: the co-channel constraint, the adjacent channel constraint, and the co-site channel constraint. The goal of this paper is the assignment of channel frequencies that satisfy these constraints with the lower-bound number of channels.
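A toy version of the GA formulation is sketched below under simplifying assumptions: one channel per cell (so the co-site constraint drops out), a chain of interfering neighbouring cells, and a minimum channel separation of 2; the energy function counts violated separation constraints, and the operators (tournament selection, one-point crossover, random-reset mutation) are illustrative stand-ins for the operators developed in the paper.

```python
# Toy GA sketch, not the authors' algorithm: a chromosome assigns one channel
# to each cell, and the "energy" counts violations of a required minimum
# channel separation between interfering cells. All parameters are assumed.

import random

N_CELLS, N_CHANNELS = 6, 10

# Minimum channel separation between interfering (here: neighbouring) cells;
# 2 covers both the co-channel and adjacent-channel constraints.
SEP = [[0] * N_CELLS for _ in range(N_CELLS)]
for i in range(N_CELLS - 1):
    SEP[i][i + 1] = SEP[i + 1][i] = 2

def energy(chrom):
    """Number of violated separation constraints (the quantity the GA minimizes)."""
    return sum(1 for i in range(N_CELLS) for j in range(i + 1, N_CELLS)
               if SEP[i][j] and abs(chrom[i] - chrom[j]) < SEP[i][j])

def tournament(pop):
    a, b = random.sample(pop, 2)
    return a if energy(a) <= energy(b) else b

def crossover(p1, p2):
    cut = random.randrange(1, N_CELLS)        # one-point crossover
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.1):
    return [random.randrange(N_CHANNELS) if random.random() < rate else g
            for g in chrom]

random.seed(0)
pop = [[random.randrange(N_CHANNELS) for _ in range(N_CELLS)] for _ in range(40)]
best = min(pop, key=energy)
for generation in range(200):
    pop = [mutate(crossover(tournament(pop), tournament(pop))) for _ in pop]
    best = min(pop + [best], key=energy)      # keep the best chromosome seen so far
    if energy(best) == 0:
        break
print("best assignment:", best, "violations:", energy(best))
```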

19.
This paper presents a novel approach to the synthesis of interleaved memory systems that is especially suited for application-specific processors. Our synthesis system generates optimized interleaved memories for a specific algorithm and finds the best mapping of the arrays in that algorithm onto the memory system to achieve high performance. The design space is four-dimensional (4-D) and comprises the number of memory banks, the type of memory components, the storage scheme, and the range of the clock period in the system. Optimal designs are found among the Pareto points (a set of nondominated points in the design space) computed for our memory model under the performance and cost criteria set by the designer. The memory model includes all the components of an interleaved memory system and covers lookup-table-based address generation with data alignment. The synthesis is based on a general periodic storage scheme, which enables efficient handling of irregular and overlapped access patterns. The synthesis process is an exhaustive search of the heavily pruned design space, where the pruning is based on mathematically proven properties of periodic storage schemes. This paper presents the theorems, the synthesis algorithm, and methods for effective word and bank address generation. Examples are given to illustrate the effectiveness of our method.
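For intuition about interleaved storage, the sketch below assumes a simple periodic scheme with B = 4 banks and a lookup-table bank address: an address a maps to bank a mod B and word a div B, and parallel accesses conflict exactly when they land in the same bank, which the stride examples demonstrate. This is only an illustration of the general idea, not the paper's storage scheme or synthesis algorithm.

```python
# Assumed example of a periodic storage scheme for an interleaved memory:
# bank(a) = a mod B, word(a) = a div B, with the modulo replaced by a
# lookup table as a simple stand-in for non-power-of-two bank counts.

B = 4                                      # number of memory banks (assumed)
bank_lut = [a % B for a in range(64)]      # lookup-table-based bank address

def bank(a):
    return bank_lut[a % len(bank_lut)]

def word(a):
    return a // B

def conflicts(addresses):
    """Addresses issued in the same cycle conflict when they map to one bank."""
    banks = [bank(a) for a in addresses]
    return len(banks) - len(set(banks))

# Stride-1 access: 4 parallel requests hit 4 different banks -> no conflict.
print(conflicts([0, 1, 2, 3]))     # 0
# Stride-4 access: every request lands in bank 0 -> 3 conflicts.
print(conflicts([0, 4, 8, 12]))    # 3
# Stride-5 access (stride coprime with B): again conflict free.
print(conflicts([0, 5, 10, 15]))   # 0
```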

20.
Data-transfer-intensive applications typically contain heavily accessed memories and involve considerable arithmetic for computing and selecting the different memory access pointers. This data processing, namely addressing, becomes dominant in the overall arithmetic cost, and it has to be executed under very tight timing constraints. Different high-level optimizing alternatives suitable for addressing are explored in our Adopt methodology and prototype tool environment to reduce the addressing overhead. They include address expression splitting/clustering, induction variable analysis, target architecture selection, and global-scope algebraic optimization. In addition, some steps aiming to reduce the time-multiplexed address unit cost at the system level are also incorporated for area and power efficiency. The techniques are demonstrated on test vehicles representative of real-life applications, showing important savings in the overall arithmetic cost.
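One of the listed transformations, induction variable analysis, can be shown with a hand-worked example (source-level Python, not the Adopt tool flow): the two-dimensional address expression i*COLS + j is strength-reduced to a running pointer that only needs a unit increment per access.

```python
# Hand-worked illustration of induction-variable-based address optimization
# on an assumed row-major 2-D array traversal.

ROWS, COLS = 4, 8
data = list(range(ROWS * COLS))

# Naive addressing: one multiply and one add per access.
def checksum_naive():
    s = 0
    for i in range(ROWS):
        for j in range(COLS):
            addr = i * COLS + j          # full address expression each time
            s += data[addr]
    return s

# After induction variable analysis: the multiply disappears and the address
# is maintained with a unit increment, matching what an address unit can do
# for free with auto-increment.
def checksum_optimized():
    s = 0
    addr = 0                             # induction variable for the address
    for i in range(ROWS):
        for j in range(COLS):
            s += data[addr]
            addr += 1                    # cheap auto-increment-style update
    return s

assert checksum_naive() == checksum_optimized()
```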
