
1.

The superblock: An effective technique for VLIW and superscalar compilation Total citations: 8; self-citations: 1; citations by others: 7

Wen-Mei W. Hwu Scott A. Mahlke William Y. Chen Pohua P. Chang Nancy J. Warter Roger A. Bringmann Roland G. Ouellette Richard E. Hank Tokuzo Kiyohara Grant E. Haab John G. Holm Daniel M. Lavery 《The Journal of supercomputing》1993,7(1-2):229-248

A compiler for VLIW and superscalar processors must expose sufficient instruction-level parallelism (ILP) to effectively utilize the parallel hardware. However, ILP within basic blocks is extremely limited for control-intensive programs. We have developed a set of techniques for exploiting ILP across basic block boundaries. These techniques are based on a novel structure called the superblock. The superblock enables the optimizer and scheduler to extract more ILP along the important execution paths by systematically removing constraints due to the unimportant paths. Superblock optimization and scheduling have been implemented in the IMPACT-I compiler. This implementation gives us a unique opportunity to fully understand the issues involved in incorporating these techniques into a real compiler. Superblock optimizations and scheduling are shown to be useful while taking into account a variety of architectural features.

2.

A Software Fanout Tree Generation Algorithm Based on Profiling Information and Critical Path Length

曾斌安虹王莉《计算机科学》2010,37(3):248-252

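Exploiting instruction-level parallelism (ILP) is one of the key factors behind the performance of modern high-performance processors. Wide-issue superscalar processors, very long instruction word (VLIW) processors, and dataflow processors achieve high performance only when they execute multiple adjacent instructions in parallel. A key problem for dataflow processors is how to deliver an instruction's result efficiently to its target instructions without reading or writing a centralized register file. For every instruction that has more targets than the instruction can encode, the compiler must insert a software fanout tree built from MOV instructions to deliver the result to the multiple target instructions. To expose more ILP to the hardware execution substrate, an improved software fanout tree generation algorithm is proposed. The algorithm computes a weight for each target instruction from its execution probability and from the critical path length between the target instruction and the exit of its enclosing block, sorts the leaves by priority weight, and then constructs the software fanout tree in that order so that the result is delivered to the multiple target instructions. Experimental results show that the algorithm performs considerably better than the traditional software fanout tree generation algorithm.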
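The weighting and tree construction described in this abstract can be illustrated with a small sketch. Everything concrete here is an assumption rather than the paper's method: the weight is taken to be execution probability times critical path length, each instruction (including each MOV) is assumed to encode `fanout` targets, and the tree is built with a greedy grouping that pushes low-priority targets deeper. The names `build_fanout_tree` and `mov_depth` are hypothetical.

```python
import heapq
from itertools import count


def build_fanout_tree(targets, fanout=2):
    """Build a MOV fanout tree for one producer instruction.

    targets: list of (name, exec_probability, critical_path_length).
    Returns (root_targets, movs), where root_targets are the names the
    producer encodes directly and movs maps each inserted MOV to its targets.
    The weight formula below is an assumption for illustration.
    """
    tie = count()  # tie-breaker so the heap never compares names
    heap = [(p * cpl, next(tie), name) for name, p, cpl in targets]
    heapq.heapify(heap)
    movs = {}
    # While the producer cannot encode all remaining nodes, group the
    # lowest-priority ones under a new MOV instruction.
    while len(heap) > fanout:
        children = [heapq.heappop(heap) for _ in range(fanout)]
        mov = f"mov{len(movs)}"
        movs[mov] = [child[2] for child in children]
        # A MOV inherits the priority of its most urgent child.
        heapq.heappush(heap, (max(c[0] for c in children), next(tie), mov))
    root_targets = [name for _, _, name in sorted(heap, reverse=True)]
    return root_targets, movs


def mov_depth(root_targets, movs):
    """Number of MOVs each real target sits behind."""
    depth = {}
    stack = [(name, 0) for name in root_targets]
    while stack:
        node, d = stack.pop()
        if node in movs:
            stack.extend((child, d + 1) for child in movs[node])
        else:
            depth[node] = d
    return depth
```

With this grouping, the target with the highest weight is reached directly from the producer, while unlikely or non-critical targets pay for the extra MOV hops.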

3.

Trace Software Pipelining

Wang Jian Andreas Krall M. Anton Ertl 《计算机科学技术学报》1995,10(6):481-490

Global software pipelining is a complex but efficient compilation technique to exploit instruction-level parallelism for loops with branches. This paper presents a novel global software pipelining technique, called Trace Software Pipelining, targeted to instruction-level parallel processors such as Very Long Instruction Word (VLIW) and superscalar machines. Trace software pipelining applies a global code scheduling technique to compact the original loop body. The resulting loop is called trace software pipelined (TSP) code. The trace software pipelined code can be directly executed with special architectural support or can be transformed into a globally software pipelined loop for current VLIW and superscalar processors. Thus, exploiting parallelism across all iterations of a loop can be completed through compacting the original loop body with any global code scheduling technique. This makes our new technique very promising in practical compilers. Finally, we also present preliminary experimental results to support our new approach.

4.

The multiflow trace scheduling compiler Total citations: 3; self-citations: 0; citations by others: 3

P. Geoffrey Lowney Stefan M. Freudenberger Thomas J. Karzes W. D. Lichtenstein Robert P. Nix John S. O'Donnell John C. Ruttenberg 《The Journal of supercomputing》1993,7(1-2):51-142

The Multiflow compiler uses the trace scheduling algorithm to find and exploit instruction-level parallelism beyond basic blocks. The compiler generates code for VLIW computers that issue up to 28 operations each cycle and maintain more than 50 operations in flight. At Multiflow the compiler generated code for eight different target machine architectures and compiled over 50 million lines of Fortran and C applications and systems code. The requirement of finding large amounts of parallelism in ordinary programs, the trace scheduling algorithm, and the many unique features of the Multiflow hardware placed novel demands on the compiler. New techniques in instruction scheduling, register allocation, memory-bank management, and intermediate-code optimizations were developed, as were refinements to reduce the overhead of trace scheduling. This article describes the Multiflow compiler and reports on the Multiflow practice and experience with compiling for instruction-level parallelism beyond basic blocks.

5.

VLIW Microprocessor Characteristics and Compiler Technology Support

郑飞陆鑫达《微处理机》1996,(3):1-4

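VLIW is a microprocessor design philosophy and technique that appeared long ago but was never widely adopted, and that is now again the subject of intensive research. Like superscalar techniques, it supports executing multiple instructions per cycle, but with a higher degree of parallelism. This paper introduces the concept of VLIW and its development history in detail, discusses the characteristics of VLIW microprocessors and the compiler support they require, and compares them with superscalar microprocessors.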

6.

Hardware-Software Collaborative Techniques for Runtime Profiling and Phase Transition Detection

Youfeng Wu Yong-Fong Lee 《计算机科学技术学报》2005,20(5):665-675

Dynamic optimization relies on runtime profile information to improve the performance of program execution. Traditional profiling techniques incur significant overhead and are not suitable for dynamic optimization. In this paper, a new profiling technique is proposed that incorporates the strengths of both software and hardware to achieve near-zero overhead profiling. The compiler passes profiling requests as a few bits of information in branch instructions to the hardware, and the processor executes profiling operations asynchronously in available free slots or on dedicated hardware. The compiler instrumentation of this technique is implemented using an Itanium research compiler. The results show that accurate block profiling incurs very little overhead to the user program in terms of the program scheduling cycles. For example, the average overhead is 0.6% for the SPECint95 benchmarks. The hardware support required for the new profiling is practical. The technique is extended to collect edge profiles for continuous phase transition detection. It is believed that the hardware-software collaborative scheme will enable many profile-driven dynamic optimizations for EPIC processors such as the Itanium processors.

7.

Static scheduling for barrier MIMD architectures

Henry G. Dietz Abderrazek Zaafrani Matthew T. O'Keefe 《The Journal of supercomputing》1992,5(4):263-289

In a SIMD or VLIW machine, conceptual synchronizations are accomplished by using a static code schedule that does not require run-time synchronization. The lack of run-time synchronization overhead makes these machines very effective for fine-grain parallelism, but they cannot execute parallel code structures as general as those executed by MIMD architectures, and this limits their utility. In this paper we present a timing analysis that allows a compiler for a MIMD machine to eliminate a large fraction of the run-time synchronization by making efficient use of static code scheduling. Although these techniques can be adapted to be applied to most MIMD machines, this paper centers on the analysis and scheduling for barrier MIMD machines. Barrier MIMDs are asynchronous multiple instruction stream/multiple data stream architectures capable of parallel execution of variable execution-time instructions and arbitrary control flow (e.g., while loops and calls). However, they also incorporate a special hardware barrier synchronization mechanism that facilitates static scheduling by providing a mechanism which the compiler can use to enforce precise timing constraints. In other words, the compiler tracks relative timing between processors and uses static code scheduling until the timing imprecision becomes too large, at which point the compiler simply inserts a barrier to reduce that timing imprecision to zero (or a small constant). This paper describes new scheduling and barrier placement algorithms for barrier MIMDs that are based loosely on the list scheduling approach employed for VLIWs [Ellis 1985]. In addition, the experimental results from scheduling thousands of synthetic benchmark programs for a parameterized barrier MIMD machine are presented.
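The idea of tracking timing imprecision and inserting a barrier once it grows too large can be sketched as follows. This is a deliberately simplified illustration and not the paper's scheduling or barrier placement algorithm: it looks at a single instruction stream, assumes each instruction has a known minimum and maximum execution time, and uses a hypothetical threshold `max_skew`. The function name `place_barriers` is also hypothetical.

```python
def place_barriers(instrs, max_skew):
    """Insert barriers so accumulated timing imprecision stays bounded.

    instrs: list of (name, min_cycles, max_cycles).
    max_skew: largest tolerated imprecision (assumed threshold, in cycles).
    Returns the instruction names with "BARRIER" markers interleaved.
    """
    schedule = []
    skew = 0  # cycles of uncertainty accumulated since the last barrier
    for name, lo, hi in instrs:
        # If this instruction would push the uncertainty past the limit,
        # synchronize first; the barrier resets the imprecision to zero.
        if skew > 0 and skew + (hi - lo) > max_skew:
            schedule.append("BARRIER")
            skew = 0
        schedule.append(name)
        skew += hi - lo
    return schedule
```

Fixed-latency instructions add no imprecision, so long runs of them never trigger a barrier; only variable-latency operations such as loads consume the budget.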

8.

Static and dynamic evaluation of data dependence analysis techniques Total citations: 2; self-citations: 0; citations by others: 2

Petersen P.M. Padua D.A. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(11):1121-1132

Data dependence analysis techniques are the main component of today's strategies for automatic detection of parallelism. Parallelism detection strategies are being incorporated in commercial compilers with increasing frequency because of the widespread use of processors capable of exploiting instruction-level parallelism and the growing importance of multiprocessors. An assessment of the accuracy of data dependence tests is therefore of great importance for compiler writers and researchers. The tests evaluated in this study include the generalized greatest common divisor test, three variants of Banerjee's test, and the Omega test. Their effectiveness was measured with respect to the Perfect Benchmarks and the linear algebra libraries, EISPACK and LAPACK. Two methods were applied, one using only compile-time information for the analysis, and the second using information gathered during program execution. The results indicate that Banerjee's test is for all practical purposes as accurate as the more complex Omega test in detecting parallelism. However, the Omega test is quite effective in proving the existence of dependences, in contrast with Banerjee's test, which can only disprove, or break dependences. The capability of
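As a concrete reference point for the tests named in this abstract, here is a minimal sketch of the basic single-subscript GCD test, not the generalized version evaluated in the study. It assumes one array written at index `a*i + c1` and read at index `b*j + c2` for integer loop variables, and it ignores loop bounds. The function name is hypothetical.

```python
from math import gcd


def gcd_test_may_depend(a, c1, b, c2):
    """Basic GCD dependence test for A[a*i + c1] versus A[b*j + c2].

    a*i + c1 == b*j + c2 has an integer solution exactly when
    gcd(a, b) divides (c2 - c1). Returns False when a dependence is
    disproved and True when one may exist (loop bounds are ignored).
    """
    g = gcd(a, b)
    if g == 0:  # both coefficients are zero: two fixed elements
        return c1 == c2
    return (c2 - c1) % g == 0
```

For example, a write to `A[2*i]` and a read of `A[2*j + 1]` can never touch the same element, so the test reports no dependence. A `True` result is only conservative, since the solution may lie outside the loop bounds.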

9.

The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication

David I. August Wen-Mei W. Hwu Scott A. Mahlke 《International journal of parallel programming》1999,27(5):381-423

Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply if-conversion effectively, one must address two major issues: what should be if-converted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework based on partial reverse if-conversion that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.
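A toy model may help make if-conversion concrete. The sketch below covers only plain if-conversion of a single if/else diamond, not the partial reverse if-conversion framework of the paper. The instruction encoding as `(guard, destination, function)` triples, the predicate register name `p`, and the helper names `if_convert` and `run` are all assumptions for illustration.

```python
def if_convert(cond, then_ops, else_ops):
    """Turn an if/else diamond into straight-line predicated code.

    cond: function env -> truth value, defining the predicate "p".
    then_ops / else_ops: lists of (dest, function env -> value).
    Returns a list of (guard, dest, fn) with guard in {None, "p", "!p"}.
    """
    code = [(None, "p", cond)]  # predicate define, always executed
    code += [("p", dest, fn) for dest, fn in then_ops]
    code += [("!p", dest, fn) for dest, fn in else_ops]
    return code


def run(code, env):
    """Execute predicated code: an op is nullified when its guard is false."""
    env = dict(env)
    for guard, dest, fn in code:
        if guard is None or bool(env["p"]) == (guard == "p"):
            env[dest] = fn(env)
    return env


# y = x + 1 if x > 0 else -x, with no branch in the generated sequence
code = if_convert(
    lambda e: e["x"] > 0,
    [("y", lambda e: e["x"] + 1)],
    [("y", lambda e: -e["x"])],
)
```

Both arms are always fetched in this form, which is why the balance between control flow and predication matters: a long, rarely taken arm wastes issue slots that a branch would have skipped.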

10.

Code scheduling for multiple instruction stream architectures

Gary Tyson Matthew Farrens 《International journal of parallel programming》1994,22(3):243-272

Extensive research has been done on extracting parallelism from single instruction stream processors. This paper presents our investigation into ways to modify MIMD architectures to allow them to extract the instruction level parallelism achieved by current superscalar and VLIW machines. A new architecture is proposed which utilizes the advantages of a multiple instruction stream design while addressing some of the limitations that have prevented MIMD architectures from performing ILP operation. A new code scheduling mechanism is described to support this new architecture by partitioning instructions across multiple processing elements in order to exploit this level of parallelism.

11.

The Metaflow architecture

Popescu V. Schultz M. Spracklen J. Gibson G. Lightner B. Isaman D. 《Micro, IEEE》1991,11(3)

The Metaflow architecture, a unified approach to maximizing the performance of superscalar microprocessors, is introduced. The Metaflow architecture exploits inherent instruction-level parallelism in conventional sequential programs by hardware means, without relying on optimizing compilers. It is based on a unified structure, the DRIS (deferred-scheduling, register-renaming instruction shelf), that manages out-of-order execution and most of the attendant problems. Coupling the DRIS with a speculative-execution mechanism that avoids conditional branch stalls results in performance limited only by inherent instruction-level parallelism and available execution resources. Although presented in the context of superscalar machines, the technique is equally applicable to a superpipelined implementation. Lightning, the first implementation of the Metaflow architecture, which executes the Sparc RISC instruction set, is described.

12.

Region-based compilation: Introduction, motivation, and initial experience Total citations: 1; self-citations: 0; citations by others: 1

Richard E. Hank Wen-mei W. Hwu B. Ramakrishna Rau 《International journal of parallel programming》1997,25(2):113-146

The most important task of a compiler designed to exploit instruction-level parallelism (ILP) is instruction scheduling. If higher levels of ILP are to be achieved, the compiler must use, as the unit of scheduling, regions consisting of multiple basic blocks, preferably those that frequently execute consecutively and that capture cycles in the program’s execution. Traditionally, compilers have been built using the function as the unit of compilation. In this framework, function boundaries often act as barriers to the formation of the most suitable scheduling regions. Function inlining may be used to circumvent this problem by assembling strongly coupled functions into the same compilation unit, but at the cost of very large function bodies. Consequently, global optimizations whose compile time and space requirements are superlinear in the size of the compilation unit may be rendered prohibitively expensive. This paper introduces a new approach, called region-based compilation, wherein the compiler, after inlining, repartitions the program into more desirable compilation units, termed regions. Region-based compilation allows the compiler to control problem size and complexity while exposing inter-procedural scheduling, optimization and code motion opportunities.

13.

Flexible VLIW processor based on FPGA for efficient embedded real-time image processing

Vincent Brost Fan Yang Charles Meunier 《Journal of Real-Time Image Processing》2014,9(1):47-59

Modern field programmable gate array (FPGA) chips, with their larger memory capacity and reconfigurability potential, are opening new frontiers in rapid prototyping of embedded systems. With the advent of high-density FPGAs, it is now possible to implement a high-performance VLIW (very long instruction word) processor core in an FPGA. With a VLIW architecture, processor effectiveness depends on the ability of compilers to provide sufficient ILP (instruction-level parallelism) from program code. This paper describes research results on enabling the VLIW processor model for real-time processing applications by exploiting FPGA technology. Our goals are to keep the flexibility of processors to shorten the development cycle, and to use the powerful FPGA resources to increase real-time performance. We present a flexible VLIW VHDL processor model with a variable instruction set and a customizable architecture which allows exploiting intrinsic parallelism of a target application using advanced compiler technology and implementing it in an optimal manner on FPGA. Some common algorithms of image processing were tested and validated using the proposed development cycle. We also realized the rapid prototyping of embedded contactless palmprint extraction on an FPGA Virtex-6 based board for a biometric application and obtained a processing time of 145.6 ms per image. Our approach applies some criteria for co-design tools: flexibility, modularity, performance, and reusability.

14.

Performance Evaluation of Dynamic Speculative Multithreading with the Cascadia Architecture Total citations: 1; self-citations: 0; citations by others: 1

Zier David A. Lee Ben 《Parallel and Distributed Systems, IEEE Transactions on》2010,21(1):47-59

Thread-level parallelism (TLP) has been extensively studied in order to overcome the limitations of exploiting instruction-level parallelism (ILP) on high-performance superscalar processors. One promising method of exploiting TLP is Dynamic Speculative Multithreading (D-SpMT), which extracts multiple threads from a sequential program without compiler support or instruction set extensions. This paper introduces Cascadia, a D-SpMT multicore architecture that provides multigrain thread-level support and is used to evaluate the performance of several benchmarks. Cascadia applies a unique sustainable IPC (sIPC) metric on a comprehensive loop tree to select the best performing nested loop level to multithread. This paper also discusses the relationships that loops have with one another, in particular, how loop nesting levels can be extended through procedures. In addition, a detailed study is provided on the effects that thread granularity and interthread dependencies have on the entire system.

15.

Hardware-Based Profiling: An Effective Technique for Profile-Driven Optimization

Thomas M. Conte Burzin A. Patel Kishore N. Menezes J. Stan Cox 《International journal of parallel programming》1996,24(2):187-206

Profile-based optimization can be used for instruction scheduling, loop scheduling, data preloading, function in-lining, and instruction cache performance enhancement. However, these techniques have not been embraced by software vendors because programs instrumented for profiling run significantly slower, an awkward compile-run-recompile sequence is required, and a test input suite must be collected and validated for each program. This paper introduces hardware-based profiling that uses traditional branch handling hardware to generate profile information in real time. Techniques are presented for both one-level and two-level branch hardware organizations. The approach produces high accuracy with small slowdown in execution (0.4%–4.6%). This allows a program to be profiled while it is used, eliminating the need for a test input suite. With contemporary processors driven increasingly by compiler support, hardware-based profiling is important for high-performance systems.

16.

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

《Parallel Computing》2007,33(10-11):700-719

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically “rightsize” the dimensions and degrees of parallelism on the Cell Broadband Engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2–10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters.

17.

Design and Optimization of an Adaptive Value Predictor Based on Real History Feedback

隋兵才《计算机工程与科学》2021,43(2):274-279

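The instruction-level parallelism that out-of-order superscalar processors can obtain is increasingly limited; to obtain higher instruction parallelism, more out-of-order execution and control resources must be added. As processor architectures evolve, value prediction can, on top of existing mainstream processor microarchitectures and at lower hardware cost, obtain higher data parallelism and further improve the out-of-order execution performance of processors. A context value predictor based on real history feedback (RH-VTAGE) is proposed. It uses a failure list and a prediction accuracy table to control the prediction accuracy of RH-VTAGE through feedback and to reduce the pipeline recovery overhead on mispredictions. In addition, a control counter for real history feedback is added in the final stage of the value predictor, and adaptive confidence control logic is designed that adjusts the confidence dynamically and probabilistically for different types of instructions. Test results show that, relative to other predictors, RH-VTAGE brings no obvious improvement in prediction performance for integer programs, but improves the performance of floating-point programs by up to 31.2%.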

18.

Research and Implementation of a Dynamic VLIW Scheduling Mechanism Total citations: 2; self-citations: 0; citations by others: 2

李云照王志英沈立《计算机工程与科学》2008,30(7):90-93

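The VLIW architecture is an important means of exploiting ILP; its advantages are a regular, simple structure and low hardware complexity. However, relying entirely on the compiler for instruction scheduling limits the performance gains of VLIW architectures. This paper proposes a dynamic VLIW scheduling mechanism based on deterministic instruction latencies. The mechanism exploits the fact that most instructions have deterministic execution times and reschedules the execution order of instructions according to runtime information, in order to exploit ILP further. Experimental results on an FPGA show that the mechanism has linear hardware complexity.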

19.

Optimization Methods for Digital Image Processing Algorithms on VLIW DSPs

张帆葛颖增窦勇《微计算机应用》2008,29(10)

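Digital image processing is widely used in aerospace, biomedical engineering, communications engineering, industry and engineering, military and public security, culture and the arts, and other fields. Because of the real-time and environmental requirements of some applications, digital signal processors (DSPs) are commonly used to process images. DSPs with a very long instruction word (VLIW) architecture are widely used in real-time image processing applications because of their low power consumption, simple hardware structure, and good parallelism. Based on the characteristics of image processing algorithms and of the VLIW DSP architecture, a general method for optimizing image processing algorithms on VLIW DSPs is proposed, comprising memory optimization methods and instruction-level parallelism optimization methods. Finally, the proposed method is applied to several commonly used image processing algorithms, and the experimental results show good optimization effects.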

20.

Tuning Compiler Optimizations for Simultaneous Multithreading

Jack L. Lo Susan J. Eggers Henry M. Levy Sujay S. Parekh Dean M. Tullsen 《International journal of parallel programming》1999,27(6):477-503

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a processor's functional units each cycle. Unlike shared-memory multiprocessors, SMT provides and benefits from fine-grained sharing of processor and memory system resources; unlike current uniprocessors, SMT exposes and benefits from inter-thread instruction-level parallelism when hiding long-latency operations. Compiler optimizations are often driven by specific assumptions about the underlying architecture and implementation of the target machine, particularly for parallel processors. For example, when targeting shared-memory multiprocessors, parallel programs are compiled to minimize sharing, in order to decrease high-cost inter-processor communication. Therefore, optimizations that are appropriate for these conventional machines may be inappropriate for SMT, which can benefit from fine-grained resource sharing within the processor. This paper reexamines several compiler optimizations in the context of simultaneous multithreading. We revisit three optimizations in this light: loop-iteration scheduling, software speculative execution, and loop tiling. Our results show that all three optimizations should be applied differently in the context of SMT architectures: threads should be parallelized with a cyclic, rather than a blocked algorithm; non-loop programs should not be software speculated, and compilers no longer need to be concerned about precisely sizing tiles to match cache sizes. By following these new guidelines, compilers can generate code that improves the performance of programs executing on SMT machines.
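The difference between the blocked and cyclic loop-iteration distributions mentioned in this abstract can be shown in a few lines. This is only an illustration of the two assignment patterns, with hypothetical function names; it does not model SMT hardware or the paper's experiments.

```python
def blocked(n_iters, n_threads):
    """Each thread gets one contiguous chunk of iterations."""
    size = -(-n_iters // n_threads)  # ceiling division
    return [
        list(range(t * size, min((t + 1) * size, n_iters)))
        for t in range(n_threads)
    ]


def cyclic(n_iters, n_threads):
    """Iterations are dealt out round-robin, one at a time."""
    return [list(range(t, n_iters, n_threads)) for t in range(n_threads)]


print(blocked(8, 2))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(cyclic(8, 2))   # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Under the cyclic pattern, threads running at the same time work on neighboring iterations, so they tend to touch nearby data in the shared cache, which fits the fine-grained resource sharing the abstract describes.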