Similar Documents
20 similar documents found (search time: 31 ms).
1.
Designing compilers for Explicitly Parallel Instruction Computing (EPIC) architectures presents challenges substantially different from those encountered in designing compilers for traditional sequential architectures. These challenges are addressed not only by employing new optimizations that are specific to EPIC, but also by employing new ways to architect compilers. EPIC architectures provide features that allow compilers to take a proactive role in exploiting instruction-level parallelism. Compiler technology is intimately intertwined with the target processor architecture, and compiler architects must solve new analysis and optimization problems to achieve the highest levels of performance. When complex optimizations are uniformly applied to large applications, the resulting slow compile speeds are unacceptable. The demanding requirement to produce high-quality code at high compile speed shapes the fundamental structure of EPIC compilers.

2.
In this paper, we consider programmable tightly-coupled processor arrays consisting of interconnected small light-weight VLIW cores, which can exploit both loop-level parallelism and instruction-level parallelism. These arrays are well suited for compute-intensive nested loop applications, often providing higher power and area efficiency than commercial off-the-shelf processors. They are ideal candidates for accelerating the computation of nested loop programs in future heterogeneous systems, where energy efficiency is one of the most important goals of overall system-on-chip design. In this context, we present a novel design methodology for mapping nested loop programs onto such processor arrays. Key features of our approach are: (1) design entry in the form of a functional programming language and loop parallelization in the polyhedron model; (2) support of zero-overhead looping not only for innermost loops but also for arbitrarily nested loops. Processors of such arrays are often limited in instruction memory size to reduce area and power consumption; hence, (3) we present methods for code compaction and code generation and integrate these methods into a design tool. Finally, (4) we evaluate selected benchmarks by comparing our code generator with the Trimaran and VEX compiler frameworks. As the results show, our approach can reduce the size of the generated processor code by up to 64% (Trimaran) and 55% (VEX) while at the same time achieving significantly higher throughput.
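The zero-overhead looping for arbitrarily nested loops mentioned in (2) can be pictured with a small sketch of my own (not the authors' tool): a table of loop bounds drives a single counter-update step, so the loop body needs no compare or branch instructions at any nesting level.

```python
# Sketch (mine, not the authors' tool) of zero-overhead looping generalized to
# arbitrarily nested loops: the loop bounds live in a small table and one
# counter-update step advances all loop indices, so the loop body contains no
# compare/branch instructions for any nesting level.
from itertools import product

def advance(indices, bounds):
    """One controller step; returns False once the outermost loop completes."""
    for level in reversed(range(len(bounds))):   # innermost level first
        indices[level] += 1
        if indices[level] < bounds[level]:
            return True
        indices[level] = 0                       # wrap this level, carry outward
    return False

bounds = [3, 4]                                  # a 3 x 4 loop nest
idx, visited = [0] * len(bounds), []
running = True
while running:
    visited.append(tuple(idx))                   # the "loop body"
    running = advance(idx, bounds)

assert visited == list(product(range(3), range(4)))
print(len(visited), "iterations driven without per-level branches")
```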

3.
Many software compilers for embedded processors produce machine code of insufficient quality. Since for most applications the software must meet tight code speed and size constraints, embedded software is still largely developed in assembly language. In order to eliminate this bottleneck and to enable the use of high-level language compilers also for embedded software, new code generation and optimization techniques are required. This paper describes a novel code generation technique for embedded processors with irregular data path architectures, such as those typically found in fixed-point DSPs. The proposed technique maps a data flow graph representation of a program into highly efficient machine code for a target processor modeled by its instruction set behavior. High code quality is ensured by tight coupling of the different code generation phases. In contrast to earlier work, which is mainly based on heuristics, our approach is constraint-based. An initial set of constraints on code generation is prescribed by the given processor model. Further constraints arise during code generation from decisions concerning code selection, register allocation, and scheduling. Whenever possible, decisions are postponed until sufficient information for a good decision has been collected. The constraints are active in the background and guarantee local satisfiability at any point in time during code generation. This mechanism makes it possible to cope simultaneously with special-purpose registers and instruction-level parallelism. We describe the detailed integration of the code generation phases. The implementation is based on the constraint logic programming (CLP) language ECLiPSe. For a standard DSP, we show that the quality of the generated code comes close to hand-written assembly code. Since the input processor model can be edited by the user, retargetability of the code generation technique is also achieved within a certain processor class.
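A deliberately simplified sketch of the constraint-driven style described above, with an invented register set, constraints, and interference (it is not the authors' ECLiPSe implementation): virtual values are bound to registers under special-purpose-register and interference constraints, and the most constrained decision is taken first, echoing the idea of postponing decisions until enough information is available.

```python
# Deliberately simplified constraint-driven binding of values to registers.
# The register set, constraints, and interference are invented for the example.

REGS = {"A", "B", "X", "Y"}                    # tiny fictional DSP register set

values = ["v1", "v2", "v3", "v4"]
allowed = {                                    # constraints from code selection
    "v1": {"A"},                               # e.g. a MAC result must live in A
    "v2": {"X", "Y"},                          # e.g. an address operand: X or Y
    "v3": set(REGS),
    "v4": {"A", "B"},
}
interference = {("v1", "v4"), ("v2", "v3")}    # pairs live at the same time

def conflicts(v, reg, binding):
    """True if some value interfering with v is already bound to reg."""
    return any(binding.get(u) == reg
               for (a, b) in interference
               for u in ((b,) if a == v else (a,) if b == v else ()))

def solve(binding=None):
    """Backtracking search; the most constrained value is decided first."""
    binding = binding or {}
    free = [v for v in values if v not in binding]
    if not free:
        return binding
    v = min(free, key=lambda u: len(allowed[u]))
    for reg in sorted(allowed[v]):
        if not conflicts(v, reg, binding):
            result = solve({**binding, v: reg})
            if result:
                return result
    return None                                 # dead end: force backtracking

print(solve())
```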

4.
An Algorithm-Hardware-System Approach to VLIW Multimedia Processors
Very Long Instruction Word (VLIW) processor architectures for multimedia applications are discussed from an algorithm, hardware, and system point of view. VLIW processors show high flexibility and processing power, as well as good utilization of resources by compiler-generated code, but their exclusive exploitation of instruction-level parallelism (ILP) decreases in efficiency as the degree of parallelism increases. This is mainly caused by characteristics of multimedia algorithms, increasing wiring delays, compiler restrictions, and a widening gap between on-chip processing speed and available bandwidth to external memory. As new multimedia applications and standards continue to evolve (MPEG-4), the demand for higher processing power will continue. Therefore, parallel processing in all its available forms will have to be exploited to achieve significant performance improvements. We show that, due to the diminishing returns from a further increase in ILP, multimedia applications will benefit more from an additional exploitation of parallelism at thread level. We examine how simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can be used efficiently to further increase the performance of typical multimedia workloads.

5.
High-Performance Code Generation Techniques for VLIW Architectures
DSP processors achieve high performance by adopting VLIW architectures, but this also makes it harder for compilers to generate assembly code for them. The code generator, as the code-generation component of the compiler, is key to realizing the performance potential of a VLIW architecture. This paper therefore proposes and implements a code generator based on a retargetable compilation framework. The code generator fully exploits the architectural characteristics of VLIW, supports SIMD instructions and predicated execution, and can generate assembly code with a high degree of instruction-level parallelism, significantly improving application performance.
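The predicated execution mentioned above can be illustrated with a small, hypothetical if-conversion sketch; the pseudo-assembly syntax is invented and is not the output of the described code generator.

```python
# Hypothetical if-conversion sketch: a short branch diamond becomes straight-line
# predicated pseudo-assembly, so both sides can be bundled into VLIW slots.
# The syntax ("[ p]" / "[!p]" guards) is invented for illustration.

def if_convert(cond, then_ops, else_ops):
    code = [f"cmp    p, {cond}"]                  # set predicate register p
    code += [f"[ p]  {op}" for op in then_ops]    # executes only if p is true
    code += [f"[!p]  {op}" for op in else_ops]    # executes only if p is false
    return code

# if (a > b) x = a - b; else x = b - a;
for line in if_convert("a > b", ["sub  x, a, b"], ["sub  x, b, a"]):
    print(line)
```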

6.
We address the problem of code generation for embedded DSP systems. In such systems, it is typical for one or more digital signal processors (DSPs), program memory, and custom circuitry to be integrated onto a single IC. Consequently, the amount of silicon area that is dedicated to program memory is limited, so the embedded software must be sufficiently dense. Additionally, this software must be written so as to meet various high-performance constraints, which may include hard real-time constraints. Unfortunately, existing compiler technology is unable to generate dense, high-performance code for DSPs since it does not provide adequate support for the specialized architectural features of DSPs. These specialized features not only allow for the fast execution of common DSP operations, but they also allow for the generation of dense assembly code that specifies these operations. Thus, system designers often hand-program the embedded software in assembly, which is a very time-consuming task. In this paper, we focus on providing compiler support for one particular specialized architectural feature, namely the paged absolute addressing mode. This feature is found in two commercial DSPs, the Texas Instruments TMS320C25 and TMS320C50 fixed-point DSPs, but it may also appear in application-specific processors (ASIPs). We present some machine-dependent code optimizations that improve code density by exploiting this architectural feature. Experimental results demonstrate that for a set of typical DSP benchmarks, some of our optimizations reduce overall code size and data memory consumption by an average of 5.0% and 16.0%, respectively. Our experimental vehicle throughout this research is the TMS320C25.
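A toy model of the effect being exploited (page granularity, variable set, and access sequence are invented): consecutive accesses that stay within one data page need no page-pointer reload, so grouping co-accessed variables into the same page reduces the number of LDPK-style page-pointer loads.

```python
# Toy model of the paged-absolute-addressing optimization (page granularity,
# variables, and access sequence are invented): every time two consecutive
# accesses fall into different data pages, the compiler must emit a page-pointer
# load (LDPK-style); placing co-accessed variables in one page avoids that.

def page_pointer_loads(accesses, page_of):
    loads, current = 0, None
    for var in accesses:
        if page_of[var] != current:              # leaving the current data page
            loads += 1
            current = page_of[var]
    return loads

accesses = ["a", "b", "a", "b", "c", "d", "c", "d"]

scattered = {"a": 0, "b": 1, "c": 0, "d": 1}     # co-accessed vars on different pages
grouped   = {"a": 0, "b": 0, "c": 1, "d": 1}     # co-accessed vars share a page

print("scattered layout:", page_pointer_loads(accesses, scattered), "page loads")
print("grouped layout  :", page_pointer_loads(accesses, grouped),   "page loads")
```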

7.
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so in addition to meeting various high-performance constraints, the application software must be sufficiently dense. Unfortunately, existing compiler technology is unable to generate high-quality code for DSPs since it does not provide adequate support for the specialized architectural features of DSPs. Thus, designers often resort to programming application software in assembly, which is a very tedious and time-consuming task. In this paper, we focus on providing compiler support for a group of specialized architectural features that exist in many DSPs, namely indirect addressing modes with auto-increment/decrement arithmetic. In these DSPs, an indexed addressing mode is generally not available, so automatic variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and the size of the generated code. Our objective is to provide a method for comprehensively analyzing the performance benefits and hardware cost of an auto-increment/decrement feature that varies from −l to +l, while allowing access to k address registers in an address generator. We provide this method via a parameterizable optimization algorithm that operates on a procedure-wise basis. Thus, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions on address arithmetic features.
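A rough sketch of the cost model implied above, with an invented access sequence (the parameters l and k are those named in the abstract): steps between consecutive accesses that fall within −l..+l on one of k address registers are subsumed by auto-increment/decrement, and everything else costs an explicit address operation. A simple greedy register choice stands in for the paper's optimization algorithm.

```python
# Rough cost model with an invented access sequence: a step between consecutive
# accesses is free when it can be folded into an auto-increment/decrement of
# -l..+l on one of k address registers; otherwise an explicit address operation
# is charged.  A greedy register choice stands in for the real optimization.

def explicit_addr_ops(offsets, access_seq, l, k):
    regs = [None] * k                            # address currently held by each AR
    cost = 0
    for var in access_seq:
        target = offsets[var]
        # distance to steer each AR there; an unused AR ranks just outside
        # the free auto-inc/dec range so it is used only when needed
        dist = lambda r: abs(target - regs[r]) if regs[r] is not None else l + 1
        best = min(range(k), key=dist)
        if dist(best) > l:
            cost += 1                            # explicit address load/add needed
        regs[best] = target                      # otherwise subsumed by auto-inc/dec
    return cost

offsets = {"a": 0, "b": 1, "c": 2, "d": 7}       # stack frame layout (invented)
seq = ["a", "b", "c", "b", "a", "d", "a"]        # access sequence (invented)

for k in (1, 2):
    print(f"l=1, k={k}:", explicit_addr_ops(offsets, seq, l=1, k=k), "explicit address ops")
```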

8.
High-performance scalar architectures that have the capability to issue multiple instructions per clock period are considered. The essential characteristics and the principal architectural tradeoffs in scientific array processors, very-long-instruction-word (VLIW) machines, the polycyclic architecture and decoupled computers are examined. Array processors rely solely on static code scheduling done manually or by the compiler. The scheduling task is quite complex, and the resulting code may not be very efficient. In a VLIW, sophisticated compiler technology provides software solutions for functions traditionally done in hardware. The polycyclic architecture is similar to array processors in its structure but provides architectural support to the instruction scheduling task. In decoupled architectures the hardware changes the order of instruction execution at run time. This dynamic code scheduling capability does not come at the expense of additional control complexity.

9.
This paper addresses instruction-level parallelism in code generation for digital signal processors (DSPs). In the presence of potential parallelism, the task of code generation includes code compaction, which parallelizes primitive processor operations under given dependency and resource constraints. Furthermore, DSP algorithms in most cases are required to guarantee real-time response. Since the exact execution speed of a DSP program is only known after compaction, real-time constraints should be taken into account during the compaction phase. While previous DSP code generators rely on rigid heuristics for compaction, we propose a novel approach to exact local code compaction based on an integer programming (IP) model, which handles time constraints. Due to a general problem formulation, the IP model also captures encoding restrictions and handles instructions having alternative encodings and side effects, and therefore applies to a large class of instruction formats.
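A minimal sketch of an IP formulation in this spirit, assuming the PuLP modeling package is available (this is not the paper's model): binary placement variables, dependency and issue-width constraints, and a deadline, with the makespan as the objective.

```python
# Minimal IP sketch of local code compaction (not the paper's model), assuming
# the PuLP package: x[i,t] = 1 iff operation i is placed in cycle t; dependency,
# issue-width, and deadline constraints are linear; the makespan is minimized.
import pulp

ops  = ["ld_a", "ld_b", "add", "st"]
deps = [("ld_a", "add"), ("ld_b", "add"), ("add", "st")]   # producer -> consumer
WIDTH, CYCLES = 2, 4                                       # issue slots, time horizon

prob = pulp.LpProblem("code_compaction", pulp.LpMinimize)
x = {(i, t): pulp.LpVariable(f"x_{i}_{t}", cat="Binary")
     for i in ops for t in range(CYCLES)}
makespan = pulp.LpVariable("makespan", lowBound=0)
start = {i: pulp.lpSum(t * x[i, t] for t in range(CYCLES)) for i in ops}

for i in ops:                                    # each operation scheduled once
    prob += pulp.lpSum(x[i, t] for t in range(CYCLES)) == 1
for t in range(CYCLES):                          # resource constraint: issue width
    prob += pulp.lpSum(x[i, t] for i in ops) <= WIDTH
for i, j in deps:                                # dependency: at least one cycle apart
    prob += start[j] >= start[i] + 1
for i in ops:                                    # makespan covers every operation
    prob += makespan >= start[i]
prob += makespan <= CYCLES - 1                   # real-time (deadline) constraint
prob += makespan                                 # objective: shortest schedule

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for t in range(CYCLES):
    bundle = [i for i in ops if pulp.value(x[i, t]) > 0.5]
    if bundle:
        print(f"cycle {t}: {bundle}")
```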

10.
Design automation for embedded systems comprising both hardware and software components demands code generators integrated into electronic CAD systems. These code generators provide the necessary link between software synthesis tools in HW/SW codesign systems and embedded processors. General-purpose compilers for standard processors are often insufficient, because they do not provide flexibility with respect to different target processors and also suffer from inferior code quality. While recent research on code generation for embedded processors has primarily focused on code quality issues, in this contribution we emphasize the importance of retargetability, and we describe an approach to achieve it. We propose the use of uniform, external target processor models in code generation, which describe embedded processors by means of RT-level netlists. Such structural models incorporate more hardware details than purely behavioral models, thereby permitting a close link to hardware design tools and fast adaptation to different target processors. The MSSQ compiler, which is part of the MIMOLA hardware design system, operates on structural models. We describe the input formats, central data structures, and code generation techniques in MSSQ. The compiler has been successfully retargeted to a number of real-life processors, which proves the feasibility of our approach with respect to retargetability. We discuss the capabilities and limitations of MSSQ, and identify possible areas of improvement.

11.
尹娟  孙巧稚 《电子科技》2014,27(5):123-126
Based on the embedded-system compiler LCC and a 32-bit MIPS processor, this work ports LCC to the target MIPS processor. To generate a code generator quickly and effectively, the machine description file is rewritten according to the characteristics of the new target machine, splitting the original macro assembly instructions and converting between instructions, so that the generated target code uses a smaller and more compact instruction set. The opcode footprint of the target code is reduced by about 50%, and the translation from C code to assembly code is achieved and verified with the MIPS simulator PCSPIM, while performance is also greatly improved. The corresponding machine code is produced with an assembler and its correctness is verified with Isim (ISE Simulator), the simulation tool bundled with Xilinx ISE, completing the port of LCC to the MIPS processor.
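The instruction-splitting idea can be seen in a tiny example using the standard MIPS expansion of a macro instruction (this is a generic MIPS idiom, not code from the paper): the pseudo-instruction li with a 32-bit immediate is rewritten into base instructions.

```python
# Generic MIPS macro-instruction splitting (standard idiom, not code from the
# paper): `li rd, imm32` is rewritten into base instructions that the target
# actually implements, so the machine description only needs the base set.

def expand_li(rd, imm):
    imm &= 0xFFFFFFFF
    upper, lower = imm >> 16, imm & 0xFFFF
    if upper == 0:                                    # fits in 16 unsigned bits
        return [f"ori   {rd}, $zero, 0x{lower:04x}"]
    code = [f"lui   {rd}, 0x{upper:04x}"]             # load upper halfword
    if lower:
        code.append(f"ori   {rd}, {rd}, 0x{lower:04x}")   # merge lower halfword
    return code

for ins in expand_li("$t0", 0x1234ABCD) + expand_li("$t1", 42):
    print(ins)
```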

12.
Code size "bloating" in embedded very long instruction word (VLIW) processors is a major concern for embedded systems since memory is one of the most restricted resources. In this paper, we describe a code compression algorithm based on arithmetic coding, discuss how to design the decompression architecture, and illustrate the tradeoffs between compression ratio and decompression overhead by using different probability models. Experimental results for the VLIW embedded processor TMS320C6x show that compression ratios between 67% and 80% can be achieved, depending on the probability models used. A pre-cache decompression unit design is implemented in TSMC 0.25 μm and a test chip has been fabricated.
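As a back-of-the-envelope illustration of how the probability model drives the achievable compression ratio, the sketch below uses invented instruction-bit statistics (not the paper's data): an ideal arithmetic coder spends −log2(p) bits per coded symbol, so a per-bit-position model beats a uniform one.

```python
# Back-of-the-envelope model of the compression-ratio / probability-model
# trade-off (instruction bits are synthetic, not the paper's data): an ideal
# arithmetic coder spends -log2(p) bits per coded bit, so a model that tracks
# per-bit-position statistics of the instruction words compresses better than
# a uniform model.
import math, random

random.seed(0)
WORD_BITS = 32
# Synthetic "program": opcode-like high bits are strongly biased, low bits less so.
bias = [0.95 if b < 8 else 0.7 if b < 20 else 0.55 for b in range(WORD_BITS)]
words = [[1 if random.random() < bias[b] else 0 for b in range(WORD_BITS)]
         for _ in range(2000)]

def ideal_bits(words, p_one):
    """Ideal arithmetic-coded size when bit position b is modeled as P(1)=p_one[b]."""
    total = 0.0
    for w in words:
        for b, bit in enumerate(w):
            p = p_one[b] if bit else 1.0 - p_one[b]
            total += -math.log2(p)
    return total

orig    = len(words) * WORD_BITS
uniform = ideal_bits(words, [0.5] * WORD_BITS)   # memoryless uniform model: no gain
freq    = [sum(w[b] for w in words) / len(words) for b in range(WORD_BITS)]
modeled = ideal_bits(words, freq)                # per-bit-position model

print(f"uniform model : {uniform / orig:.0%} of original size")
print(f"bitwise model : {modeled / orig:.0%} of original size")
```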

13.
This paper proposes a low-power approach to the design of embedded very long instruction word (VLIW) processor architectures based on the forwarding (or bypassing) hardware, which provides operands from interstage pipeline registers directly to the inputs of the function units. The power optimization technique exploits the forwarding paths to avoid the power cost of writing/reading short-lived variables to/from the register file (RF). Such optimization is justified by the fact that, in application-specific embedded systems, a significant number of variables are short-lived, that is, their liveness (from first definition to last use) spans only a few instructions. Values of short-lived variables can thus be accessed directly through the forwarding registers, avoiding the writeback to the RF by the producer instruction and the subsequent read from the RF by the consumer instruction. The decision to enable or disable the RF writeback phase is taken at compile time by the compiler's static scheduling algorithm. This approach implies minimal overhead on the complexity of the processor control logic and, thus, no critical path increase. Applying the proposed solution to a VLIW embedded core has shown an average RF power saving of 7.8% with respect to the unoptimized approach on the given set of target benchmarks.
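A simplified sketch of the compile-time decision described above, with an invented schedule and forwarding depth: if every use of a value falls within the forwarding window and the value is not live afterwards, the producer's register-file writeback can be suppressed and consumers read the value from the forwarding registers instead.

```python
# Simplified sketch of the compile-time writeback decision (schedule and
# forwarding depth are invented): a destination's register-file writeback is
# suppressed when all its uses lie within the forwarding window and the value
# is not live afterwards, so consumers read it from the forwarding registers.

FWD_WINDOW = 1                  # only the very next instruction can read the bypass

# (dest, sources) per scheduled instruction, in issue order
code = [
    ("t1", ("a", "b")),         # 0: t1 used only at 1  -> within the window
    ("t2", ("t1", "c")),        # 1: t2 used at 3       -> too far, keep writeback
    ("t3", ("a", "c")),         # 2: t3 used at 3       -> within the window
    ("d",  ("t2", "t3")),       # 3: d is live-out      -> keep writeback
]

def suppressed_writebacks(code, window, live_out=()):
    result = []
    for idx, (dest, _) in enumerate(code):
        uses = [i for i, (_, srcs) in enumerate(code) if dest in srcs and i > idx]
        short_lived = bool(uses) and max(uses) - idx <= window
        if short_lived and dest not in live_out:
            result.append(dest)
    return result

print("writeback suppressed for:", suppressed_writebacks(code, FWD_WINDOW, live_out={"d"}))
```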

14.
Embedded systems are characterized by the requirement of small-memory-footprint code. A popular architectural modification to improve code density in RISC embedded processors is to use a reduced bit-width instruction set. This approach shortens the instructions to improve code size. However, because fewer registers are addressable by the reduced instructions, these architectures suffer a slight performance degradation, as more reduced instructions are required to execute a given task. On the other hand, 0-operand computers such as stack and queue machines access their source and destination operands implicitly, making instructions naturally short. Unlike the stack model, the queue model offers a highly parallel computation model. This paper proposes a novel alternative for reducing code size by using a queue-based reduced instruction set while retaining the highly parallel characteristics of programs. We introduce an efficient code generation algorithm to generate programs for our reduced instruction set. Our algorithm successfully constrains the code to the reduced instruction set with the addition of only 4% extra code on average. We show that our proposed technique is able to generate about 16% more compact code than MIPS16, 26% more compact than ARM/Thumb, and 50% more compact than MIPS32 code. Furthermore, we show that our compiler is able to extract about the same parallelism as fully optimized RISC code.
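The queue computation model can be made concrete with a toy of my own (not the paper's code generator): for a leveled expression tree, queue code is simply the reverse level-order traversal, and execution dequeues operands from the front of an operand queue and enqueues results at the back, so instructions carry no register numbers.

```python
# Toy queue-machine sketch (mine, not the paper's code generator): for a leveled
# expression tree, queue code is the reverse level-order traversal (leaves first,
# root last); execution dequeues operands from the front of the operand queue and
# enqueues results at the back, so instructions name no registers.
from collections import deque

# (a + b) * (c - d) as nested tuples: (op, left, right) or a leaf variable name
expr = ("*", ("+", "a", "b"), ("-", "c", "d"))

def queue_code(tree):
    levels, frontier = [], [tree]
    while frontier:
        levels.append(frontier)
        frontier = [kid for node in frontier if isinstance(node, tuple)
                    for kid in node[1:]]
    code = []
    for level in reversed(levels):                 # deepest level first
        for node in level:
            code.append(("push", node) if isinstance(node, str) else ("op", node[0]))
    return code

def run(code, env):
    q = deque()
    ops = {"+": lambda x, y: x + y, "-": lambda x, y: x - y, "*": lambda x, y: x * y}
    for kind, arg in code:
        if kind == "push":
            q.append(env[arg])
        else:
            x, y = q.popleft(), q.popleft()        # operands come from the front
            q.append(ops[arg](x, y))               # result goes to the back
    return q.popleft()

env = {"a": 6, "b": 2, "c": 9, "d": 4}
code = queue_code(expr)
print(code)
print(run(code, env), "==", (env["a"] + env["b"]) * (env["c"] - env["d"]))
```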

15.
16.
17.
The compiler is generally regarded as the software component most critical to the success of a processor design. This paper describes our application of the Open Research Compiler infrastructure to a novel VLIW DSP (known as the PAC DSP core) and the specific design of code generation for its register file architecture. The PAC DSP utilizes port-restricted, distributed, and partitioned register file structures in addition to a heterogeneous clustered data-path architecture to attain low power consumption and a smaller die. As part of an effort to overcome the new challenges of code generation for the PAC DSP, we have developed a new register allocation scheme and other retargeting optimization phases that allow the effective generation of high-quality code. Our preliminary experimental results indicate that our compiler can efficiently utilize the features of the specific register file architectures of the PAC DSP. Our experiences in designing compiler support for the PAC VLIW DSP with irregular resource constraints may also be of interest to those involved in developing compilers for similar architectures.
Jenq-Kuen Lee (corresponding author)
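As a rough illustration of the port-restricted, partitioned register files mentioned in this entry, the following toy check uses an invented bank layout and port budget (not the PAC DSP's actual configuration): a VLIW bundle is legal only if the operand reads it issues do not exceed any bank's read ports.

```python
# Toy legality check in the spirit of a port-restricted, partitioned register
# file (bank layout and port budget are invented, not the PAC DSP's real
# configuration): a VLIW bundle is only schedulable if the operand reads it
# issues in one cycle do not exceed any bank's read ports, so the register
# allocator must spread operands across banks.
from collections import Counter

READ_PORTS_PER_BANK = 2
bank_of = lambda reg: reg // 4        # regs 0-3 in bank 0, 4-7 in bank 1, ...

def bundle_is_legal(bundle):
    """bundle = list of operations, each given as a tuple of source register numbers."""
    reads = Counter(bank_of(r) for op in bundle for r in op)
    return all(n <= READ_PORTS_PER_BANK for n in reads.values())

ok_bundle  = [(0, 4), (1, 5)]         # reads spread over banks 0 and 1
bad_bundle = [(0, 1), (2, 3)]         # four reads all hit bank 0's two ports

print(bundle_is_legal(ok_bundle), bundle_is_legal(bad_bundle))   # True False
```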

18.
Retargetable pipeline hazard detection for partially bypassed processors
Register bypassing is a widely used feature in modern processors to eliminate certain data hazards. Although complete bypassing is ideal for performance, it has a significant impact on the cycle time, area, and power consumption of the processor. Owing to the strict design constraints on performance, cost, and power consumption in embedded processor systems, architects seek a compromise between the design parameters by implementing partial bypassing in processors. However, partial bypassing presents challenges for compilation: traditional data hazard detection and avoidance techniques used in retargetable compilers, which assume a constant value of operation latency, break down in the presence of partial bypassing. In this article, we present the concept of operation tables (OTs), which can be used to accurately detect data hazards even in the presence of incomplete bypassing. OTs integrate the detection of all kinds of pipeline hazards in a unified framework and can therefore be easily deployed in a compiler to generate better schedules. Our experimental results on the popular Intel XScale embedded processor running embedded applications from the MiBench suite demonstrate that accurate pipeline hazard detection with OTs can yield up to 20% performance improvement over the best-performing GCC-generated code. Finally, we demonstrate the usefulness of OTs over various bypass configurations of the Intel XScale.
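A minimal sketch of the operation-table idea with invented pipeline timings and bypass connectivity (not the XScale's real configuration): each operation lists, per cycle after issue, which operands it reads and where its result becomes available on the bypass network, and a hazard is flagged when a consumer would read before the producer's value is reachable.

```python
# Minimal operation-table sketch (pipeline timings and bypass connectivity are
# invented, not the XScale's real configuration).  Each operation has a table:
# cycle offset after issue -> operands read / result written, plus whether that
# stage is connected to the bypass network (partial bypassing).  A hazard exists
# when a consumer would read before the producer's value is reachable.

OT = {
    "load": {0: {"reads": ["addr"]},
             2: {"writes": ["dst"], "bypass": False},   # this stage is not bypassed
             3: {"writes": ["dst"], "bypass": True}},   # earliest reachable point
    "add":  {0: {"reads": ["src1", "src2"]},
             1: {"writes": ["dst"], "bypass": True}},
}

def read_offset(op):
    """Cycle offset (after issue) at which `op` reads its source operands."""
    return min(off for off, row in OT[op].items() if row.get("reads"))

def ready_cycle(op, issue_cycle):
    """Earliest absolute cycle at which `op`'s result can be picked up."""
    return min(issue_cycle + off for off, row in OT[op].items()
               if "dst" in row.get("writes", ()) and row.get("bypass", False))

def hazard(producer, p_cycle, consumer, c_cycle):
    return c_cycle + read_offset(consumer) < ready_cycle(producer, p_cycle)

# load at cycle 0 feeding an add at cycle 1: the value is reachable only at
# cycle 3, but the add reads at cycle 1 -> hazard, the scheduler must stall.
print(hazard("load", 0, "add", 1))   # True
# issuing the add two cycles later lets the partial bypass cover the dependence
print(hazard("load", 0, "add", 3))   # False
```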

19.
A four-way very long instruction word (VLIW), 312-MHz geometry processor with a peripheral component interconnect/accelerated graphics port (PCI/AGP) bus bridge was implemented in a 0.21-μm, 2.5-V, three-layer-metal CMOS process. We adopted (1) a software bypass mechanism, (2) single-instruction multiple-data stream instructions, (3) four sets of floating-point multiply-add-and-accumulate execution units, (4) special condition code registers and a branch condition generator for the clipping operation, and (5) an automatic clock delay tuning methodology. As a result of these features, we achieved a performance of 2.5 GFLOPS and 6.5 million polygons per second for a three-dimensional geometry processor, which is the highest published performance for a single geometry processor.

20.
As new applications in embedded communications and control systems push the computational limits of digital signal processing (DSP) functions, there will be an increasing need for software applications to be migrated to hardware in the form of a hardware-software codesign system. In many cases, access to the high-level source code may not be available. It is thus desirable to have a technology to translate software binaries intended for processors into hardware implementations. This paper provides details on the retargetable FREEDOM compiler. The compiler automatically translates DSP software binaries to register-transfer level (RTL) VHDL and Verilog for implementation on field-programmable gate arrays (FPGAs) as standalone or system-on-chip implementations. We describe the underlying optimizations and some novel algorithms for alias analysis, data dependency analysis, memory optimizations, procedure call recovery, and back-end code scheduling. Experimental results on resource usage and performance are shown for several program binaries intended for the Texas Instruments C6211 DSP (VLIW) and the ARM922T reduced instruction set computer (RISC) processors. Implementation results for four kernels from the Simulink demo library and others from commonly used DSP applications, such as MPEG-4, Viterbi, and JPEG, are also discussed. The compiler-generated RTL code is mapped to Xilinx Virtex II and Altera Stratix FPGAs. We record overall performance gains of 1.5 to 26.9 times for the hardware implementations of the kernels. Comparisons with the power aware compiler techniques (PACT) high-level synthesis compiler are used to show that software binaries can be used as intermediate representations from any high-level language to generate efficient hardware implementations.
