1.

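Design and Implementation of a Compiler for a Clustered VLIW DSP  Total citations: 5, self-citations: 0, citations by others: 5

胡定磊 陈书明 刘春林 《小型微型计算机系统》2006,27(2):348-353

Very long instruction word (VLIW) is the architecture commonly adopted by high-end DSPs. A VLIW DSP has no hardware mechanism for scheduling or conflict resolution, so its performance depends entirely on how well the compiler optimizes. Based on the retargetable compiler infrastructure IMPACT, an optimizing compiler was designed and implemented for the clustered VLIW DSP YHFT-D4. The paper focuses on the definition of retargeting information, code annotation, support for SIMD instructions, clustered register allocation, exploitation of instruction-level parallelism, and resolution of resource conflicts. Experimental results show that the compiler achieves good optimization results.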

2.

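A Vectorization Optimization Algorithm for Multi-Cluster VLIW DSPs

徐华叶 郑启龙 丁陈飞 徐东鹏 《计算机系统应用》2013,22(12):140-143

BWDSP is a processor designed for high-performance computing. It adopts a multi-cluster very long instruction word (VLIW) architecture with SIMD and also provides many vector instructions. However, the existing compiler framework cannot support these vector instructions. This paper therefore proposes a vectorization optimization algorithm that significantly improves the performance of some compute-intensive programs widely used in the DSP domain. Experimental results show that the algorithm achieves an average speedup of 6.60×.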

3.

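An Instruction Scheduling Optimization Algorithm Based on DAGs  Total citations: 1, self-citations: 0, citations by others: 1

陆伯鹰 尹宝林 《计算机工程与应用》2001,37(12):121-124

Instruction scheduling is a key technique in optimizing compilers, and it is especially important for CPUs with a VLIW architecture. It is an optimization that, while preserving program semantics, reorders instructions to reduce idle cycles in the pipeline and thereby improve CPU performance. The paper analyzes the instruction scheduling problem in optimizing compilation, proposes an instruction scheduling algorithm and a method for simplifying the DAG, proves the correctness of the algorithm, analyzes its efficiency, and compares the total execution time of the generated instruction sequence with that of the optimal sequence. It also proposes a solution to problems in the instruction scheduling algorithm of the widely used compiler GCC.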
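The abstract does not give the paper's algorithm, so the following is only a generic illustration of scheduling over a dependence DAG: a greedy list scheduler that prioritizes instructions by their latency-weighted distance to the end of the block. The function name `list_schedule`, the uniform issue `width`, and the instruction names are assumptions made for this sketch. It also assumes the graph is acyclic and every latency is at least 1.

```python
from collections import defaultdict

def list_schedule(latency, deps, width):
    """Greedy list scheduling of a dependence DAG (illustrative sketch).

    latency: dict instr -> cycles until its result is available (>= 1)
    deps:    list of (producer, consumer) edges; must be acyclic
    width:   instructions that may issue per cycle (slots assumed identical)
    Returns a dict instr -> issue cycle.
    """
    succs = defaultdict(list)
    preds = defaultdict(list)
    for producer, consumer in deps:
        succs[producer].append(consumer)
        preds[consumer].append(producer)

    # Priority: longest latency-weighted path from the instruction to a leaf.
    height = {}
    def path_height(n):
        if n not in height:
            height[n] = latency[n] + max(
                (path_height(s) for s in succs[n]), default=0)
        return height[n]
    for n in latency:
        path_height(n)

    issue = {}
    cycle = 0
    while len(issue) < len(latency):
        # Ready: not yet issued, and all operands are available this cycle.
        ready = [n for n in latency
                 if n not in issue
                 and all(p in issue and issue[p] + latency[p] <= cycle
                         for p in preds[n])]
        ready.sort(key=lambda n: (-height[n], n))  # critical path first
        for n in ready[:width]:
            issue[n] = cycle
        cycle += 1
    return issue

latency = {"load_a": 2, "load_b": 2, "mul": 3, "add": 1, "store": 1}
deps = [("load_a", "mul"), ("load_b", "mul"), ("mul", "add"),
        ("load_b", "add"), ("add", "store")]
print(list_schedule(latency, deps, width=2))
# {'load_a': 0, 'load_b': 0, 'mul': 2, 'add': 5, 'store': 6}
```

With two issue slots the two loads share cycle 0, which is the kind of idle-cycle reduction the abstract describes. A real scheduler would also model functional-unit types and register pressure.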

4.

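Design and Implementation of an Assembly-Code-Based Instruction Scheduler  Total citations: 1, self-citations: 0, citations by others: 1

田祖伟 李勇帆 《计算机科学》2009,36(3):45-47

As embedded processors come into wide use across many fields, embedded software is becoming increasingly complex. Fully exploiting the performance of embedded processors requires the support of advanced compiler optimization techniques. Instruction scheduling is one of the key techniques by which a compiler exploits instruction-level parallelism in programs. An instruction scheduler that operates on assembly code was designed and implemented. Experimental results show that integrating the scheduler into the TECC embedded compiler significantly improves program performance.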

5.

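SIMD Compiler Optimization Supporting Single/Double-Word Mode Selection on a Clustered VLIW DSP

黄胜兵 郑启龙 郭连伟 《计算机应用》2015,35(8):2371-2374

BWDSP100 is a 32-bit static scalar digital signal processor designed for high-performance computing that adopts very long instruction word (VLIW) and single instruction, multiple data (SIMD) architectures. Its instruction-level parallelism (ILP) is achieved mainly through its special clustered architecture and SIMD instructions, but the existing compiler framework cannot support these special SIMD instructions. BWDSP100 has rich SIMD vectorization resources, and the radar digital signal processing domain in which it is used places very high demands on program performance. Therefore, based on the architectural characteristics of BWDSP100 and building on the SIMD optimization framework of the traditional Open64 compiler, a SIMD compiler optimization algorithm supporting single/double-word mode selection is proposed and implemented. The algorithm significantly improves the performance of some compute-intensive programs widely used on DSPs. Experimental results show that, compared with the unoptimized code, the implementation in the BWDSP compiler achieves an average speedup of 5.66.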

6.

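Design and Optimization of an LLVM Compiler for a Domestic High-Performance Accelerator

宋强 唐俊龙 陈照云 时洋 谭期轩 肖紫阳 邹望辉 《计算机工程》2024,(4):321-331

The high-performance accelerator independently developed by the National University of Defense Technology adopts an on-chip heterogeneous fused architecture of central processing unit (CPU) + general-purpose digital signal processor (GPDSP). The GPDSP, which uses a vectorized structure of very long instruction word (VLIW) + single instruction, multiple data (SIMD), is the acceleration core that provides most of the peak performance. Mainstream compilers do not support the accelerator well in arranging dense data-computation instructions, statically assigning hardware execution units to instructions, or handling GPDSP-specific vector instructions. Based on the Low Level Virtual Machine (LLVM) compiler framework, in the pre-register-allocation scheduling stage, the cost model is optimized by combining the peak register pressure-aware method (PERP), the ant colony optimization (ACO) algorithm, and the structural characteristics of the GPDSP, and an instruction scheduling module that supports register pressure awareness is designed. In the post-register-allocation stage, an instruction scheduling strategy that supports static functional unit assignment is proposed; a conflict detection mechanism ensures the correctness of the assignment and provides a software basis for parallel instruction execution. In the back end, a rich and regular set of vector instruction interfaces is encapsulated to support GPDSP vector instructions. Experimental results show that the proposed LLVM compiler optimization method supports the GPDSP well in both functionality and performance: the average overall speedup is 4.539 on the GCC testsuite and 4.49 on the SPEC CPU 2017 floating-point tests, and the average overall on the SPEC CPU 2017 integer tests...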

7.

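Research on Compiler Optimization Techniques Based on Predicated Code

田祖伟 孙光 《计算机科学》2010,37(5):130-133

The large number of branch instructions in programs severely limits the ability of architectures and compilers to exploit parallelism. A major challenge in effectively exploiting instruction-level parallelism is overcoming the limits imposed by branch instructions. Predicated execution can effectively remove branches by converting branch instructions into predicated code, which enlarges the scope of instruction scheduling and eliminates the performance loss caused by branch misprediction. The paper describes compiler optimization techniques based on predicated code, including instruction scheduling, software pipelining, register allocation, and instruction merging. An instruction scheduling algorithm based on predicated code was designed and implemented. Experiments show that optimizing predicated code effectively increases instruction-level parallelism, shortens execution time, and improves program performance.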
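To make the idea of converting a branch into predicated code concrete, here is a toy sketch. It is not taken from the paper: the tuple encoding of operations, the names `if_convert` and `run_predicated`, and the predicate register `p` are all assumptions for illustration. It also assumes neither branch body writes the predicate.

```python
import operator

def if_convert(cond, then_ops, else_ops):
    """Turn an if/else diamond into straight-line predicated operations.

    cond:     name of the register holding the branch condition
    then_ops, else_ops: lists of (dest, fn, srcs) register operations
    Returns a list of (pred, negate, dest, fn, srcs); an operation takes
    effect only when bool(regs[pred]) != negate.
    """
    code = [(cond, False, d, f, s) for d, f, s in then_ops]
    code += [(cond, True, d, f, s) for d, f, s in else_ops]
    return code

def run_predicated(code, regs):
    """Visit every operation in order; the guard below models the hardware
    nullifying an operation whose predicate is false."""
    regs = dict(regs)
    for pred, negate, dest, fn, srcs in code:
        if bool(regs[pred]) != negate:
            regs[dest] = fn(*(regs[s] for s in srcs))
    return regs

# Source form:  if p: x = a + b  else: x = a - b
code = if_convert("p",
                  [("x", operator.add, ("a", "b"))],
                  [("x", operator.sub, ("a", "b"))])
print(run_predicated(code, {"p": True, "a": 5, "b": 3})["x"])   # 8
print(run_predicated(code, {"p": False, "a": 5, "b": 3})["x"])  # 2
```

After conversion both arms sit in one straight-line sequence, so a scheduler is free to interleave them with surrounding code, which is the enlarged scheduling scope the abstract refers to.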

8.

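Support and Optimization of Complex Types on the 魂芯 DSP

王玉林 郑启龙 赵高义 《计算机系统应用》2017,26(9):40-45

The 魂芯 DSP is a 32-bit static scalar digital signal processor designed for high-performance computing that adopts VLIW and SIMD architectures. To meet the performance requirements of digital high-performance computing, it provides a rich set of complex-number instructions, but the compiler cannot use these instructions directly to improve the performance of compiled code. Given the large number of complex-number instructions that the chip provides, the work builds on the compilation framework of the traditional open-source compiler Open64 and implements support for complex numbers as a basic compiler type and for complex arithmetic operations. In addition, by recognizing specific patterns of complex operations, the compiler uses the complex-number instructions of the 魂芯 DSP to optimize programs. Experimental results show that, after optimization of complex-number programs, the implementation in the 魂芯 DSP compiler achieves an average speedup of 5.28.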

9.

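Research on a GCC-Based VLIW Compilation System  Total citations: 1, self-citations: 1, citations by others: 0

朱凯佳 尹宝林 《计算机工程与应用》2001,37(12):125-128

A VLIW machine issues and executes multiple parallel operations in a single machine cycle, achieving a high degree of instruction-level parallelism. Dependence analysis and scheduling of these operations are left entirely to the compiler, so whether VLIW parallelism is fully realized depends on the quality of the compiler for the VLIW architecture. GCC, developed by GNU, is one of the most widely used compilation systems. It supports multiple languages and platforms, has an open structure, and can apply a range of mature conventional optimizations to generate efficient code. The paper analyzes the structural characteristics of VLIW and GCC and proposes a design for a GCC-based VLIW compilation system: GCC performs architecture-independent optimizations and a small number of architecture-dependent optimizations at the RTL intermediate-code level, and architecture-dependent optimizations for the VLIW structure are performed at the assembly-code level. This makes full use of GCC's mature compilation technology to rapidly develop an efficient multi-language VLIW compilation system.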

10.

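An Improved Power Optimization Strategy for the Instruction Bus

徐步荣 李曦 魏亮辉 《计算机辅助工程》2007,16(1):64-68

For low-power optimization in compiler system design and compilation, a strategy for optimizing the power of the VLIW instruction bus in the compiler back end is implemented on a retargetable compiler. Horizontal rescheduling of the compiled binary object code reduces the number of high/low voltage-level transitions on the instruction bus, thereby lowering system power consumption. Comparative experiments with two back-end performance optimization strategies, software pipelining and superblock scheduling, show an optimization effect of more than 30% and a clear correlation between the code's instruction-level parallelism (ILP) and the optimization effect. Finally, the strategy is improved using ILP: ILP information guides power optimization, saving up to 20% of the algorithm overhead with little loss in the power optimization effect.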
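The transition count that this strategy targets can be illustrated with a small sketch. This is not the paper's algorithm. It assumes each VLIW word is a tuple of equal-width operation encodings, that any operation may occupy any slot (real machines usually restrict this by functional unit), and that a greedy word-by-word search is acceptable. The names `toggles`, `reorder_slots`, and `total_toggles` are hypothetical.

```python
from itertools import permutations

def toggles(prev_word, word):
    """Bus lines that switch level between two consecutive instruction words
    (Hamming distance, summed slot by slot)."""
    return sum(bin(a ^ b).count("1") for a, b in zip(prev_word, word))

def total_toggles(words):
    """Total switching activity over a sequence of instruction words."""
    return sum(toggles(a, b) for a, b in zip(words, words[1:]))

def reorder_slots(words):
    """Greedily permute the operations inside each word so that it differs
    as little as possible from the word emitted just before it."""
    out = [tuple(words[0])]
    for word in words[1:]:
        best = min(permutations(word),
                   key=lambda cand: toggles(out[-1], cand))
        out.append(best)
    return out

words = [(0b0000, 0b1111), (0b1111, 0b0000), (0b0000, 0b1111)]
print(total_toggles(words))                 # 16
print(total_toggles(reorder_slots(words)))  # 0
```

Trying every permutation is factorial in the issue width, so this only suits narrow words. The abstract's improvement, using ILP information to guide the search, addresses exactly this kind of algorithm overhead.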

11.

Profile-assisted instruction scheduling

William Y. Chen Scott A. Mahlke Nancy J. Warter Sadun Anik Wen-Mei W. Hwu 《International journal of parallel programming》1994,22(2):151-181

Instruction schedulers for superscalar and VLIW processors must expose sufficient instruction-level parallelism to the hardware in order to achieve high performance. Traditional compiler instruction scheduling techniques typically take into account the constraints imposed by all execution scenarios in the program. However, there are additional opportunities to increase instruction-level parallelism for the frequent execution scenarios at the expense of the less frequent ones. Profile information identifies these important execution scenarios in a program. In this paper, two major categories of profile information are studied: control-flow and memory-dependence. Profile-assisted code scheduling techniques have been incorporated into the IMPACT-I compiler. These techniques are acyclic global scheduling and software pipelining. This paper describes the scheduling algorithms, highlights the modifications required to use profile information, and explains the hardware and compiler support for dealing with hazards that arise from aggressive use of profile information. The effectiveness of these profile-based scheduling techniques is evaluated for a range of superscalar and VLIW processors.

12.

Automatically Partitioning Threads for Multithreaded Architectures

《Journal of Parallel and Distributed Computing》1999,58(2):159-189

There is an enormous amount of parallelism exposed to fine-grain multithreaded architectures to cover latencies. It is a demanding task for a multithreading programmer to manage such a degree of parallelism by hand. To use multithreaded architectures efficiently it is essential to have compiler support for automatically partitioning programs into threads. This paper solves a fundamental problem in compiling for multithreaded architectures: automatically partitioning a program into threads. The focus of such partitioning is to overlap the remote communication latency and minimize the total execution time. We first formulate the partitioning problem based on a multithreaded execution cost model. Then, we prove that this formulation is NP-hard. Therefore, we propose two heuristic thread-partitioning methods to solve this problem in practice. The advanced partitioning algorithm is a novel extension of list scheduling, and it takes advantage of the cost model to generate near-optimum partitioning results. The remote-path-based partitioning algorithm is a simplified version of the advanced one, but it is easy to implement in a compiler. The two partitioning algorithms were implemented respectively in a thread partitioning testbed and a research EARTH-C compiler. The experimental results show that both partitioning algorithms are effective at generating efficient threaded code, and code generated by the compiler is comparable to hand-written code.

13.

The Partial Reverse If-Conversion Framework for Balancing Control Flow and Predication

David I. August Wen-Mei W. Hwu Scott A. Mahlke 《International journal of parallel programming》1999,27(5):381-423

Predicated execution is a promising architectural feature for exploiting instruction-level parallelism in the presence of control flow. Compiling for predicated execution involves converting program control flow into conditional, or predicated, instructions. This process is known as if-conversion. In order to apply if-conversion effectively, one must address two major issues: what should be if-converted and when the if-conversion should be performed. A compiler's use of predication as a representation is most effective when large amounts of code are if-converted and when if-conversion is performed early in the compilation procedure. On the other hand, efficient execution of code generated for a processor with predicated execution requires a delicate balance between control flow and predication. The appropriate balance is tightly coupled with scheduling decisions and detailed processor characteristics. This paper presents a compilation framework based on partial reverse if-conversion that allows the compiler to maximize the benefits of predication as a compiler representation while delaying the final balancing of control flow and predication to schedule time.

14.

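Research on a Dataflow Programming Model and Compiler Optimization Methods for Storm

杨秋吉 于俊清 莫斌生 何云峰 《计算机工程与科学》2016,38(12):2409-2418

The dataflow programming model separates a program's computation from its communication, exposing the application's potential parallelism and simplifying programming. Distributed computing frameworks build multicore clusters from inexpensive PCs to solve large-scale parallel computing problems, but the hierarchical memory structure and processing units of multicore clusters pose new challenges for the performance of dataflow programs. To address the problems that dataflow programs face on distributed architectures, the dataflow programming model is combined with a distributed computing framework: a Storm-oriented compiler optimization framework is proposed on the basis of COStream. The framework comprises two modules: Storm-oriented hierarchical task partitioning and scheduling, and Storm-oriented hierarchical software pipelining and code generation. Hierarchical task partitioning uses Storm's task scheduling mechanism to assign all subtasks of the program to the cores within the nodes of the Storm cluster. Hierarchical software pipelining and code generation organizes the subtasks into a software pipeline across cluster nodes and a software pipeline across the cores within each node, and generates the corresponding target code. The experiments use a multicore cluster as the target platform, deploy the Storm distributed architecture on it, and use typical digital media processing programs as benchmarks to analyze the programs produced by the Storm-oriented compiler optimization. The results demonstrate the effectiveness of the combined approach.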

15.

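Research on a Software PLC System Based on Embedded Machine Code

黄仁杰 陈浚清 郑霁 《工业控制计算机》2008,21(3):40-41

To address the poor portability and low execution efficiency of software PLCs based on a virtual machine mechanism, a software PLC system based on embedded machine code is studied. Through a ladder diagram compiler, a code parsing generator, an assembler, and related processing, the logic program developed by the user is compiled directly into embedded machine code that can execute in the CPU environment. This approach reduces the execution of PLC virtual instructions and improves the execution efficiency of the software PLC.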

16.

Region-based compilation: Introduction, motivation, and initial experience  Total citations: 1, self-citations: 0, citations by others: 1

Richard E. Hank Wen-mei W. Hwu B. Ramakrishna Rau 《International journal of parallel programming》1997,25(2):113-146

The most important task of a compiler designed to exploit instruction-level parallelism (ILP) is instruction scheduling. If higher levels of ILP are to be achieved, the compiler must use, as the unit of scheduling, regions consisting of multiple basic blocks, preferably those that frequently execute consecutively, and which capture cycles in the program's execution. Traditionally, compilers have been built using the function as the unit of compilation. In this framework, function boundaries often act as barriers to the formation of the most suitable scheduling regions. Function inlining may be used to circumvent this problem by assembling strongly coupled functions into the same compilation unit, but at the cost of very large function bodies. Consequently, global optimizations whose compile time and space requirements are superlinear in the size of the compilation unit may be rendered prohibitively expensive. This paper introduces a new approach, called region-based compilation, wherein the compiler, after inlining, repartitions the program into more desirable compilation units, termed regions. Region-based compilation allows the compiler to control problem size and complexity while exposing inter-procedural scheduling, optimization and code motion opportunities.

17.

Optimizing OpenMP Programs on Software Distributed Shared Memory Systems

Min Seung-Jai Basumallik Ayon Eigenmann Rudolf 《International journal of parallel programming》2003,31(3):225-249

This paper describes compiler techniques that can translate standard OpenMP applications into code for distributed computer systems. OpenMP has emerged as an important model and language extension for shared-memory parallel programming. However, despite OpenMP's success on these platforms, it is not currently being used on distributed systems. The long-term goal of our project is to quantify the degree to which such a use is possible and develop supporting compiler techniques. Our present compiler techniques translate OpenMP programs into a form suitable for execution on a Software DSM system. We have implemented a compiler that performs this basic translation, and we have studied a number of hand optimizations that improve the baseline performance. Our approach complements related efforts that have proposed language extensions for efficient execution of OpenMP programs on distributed systems. Our results show that, while kernel benchmarks can show high efficiency of OpenMP programs on distributed systems, full applications need careful consideration of shared data access patterns. A naive translation (similar to OpenMP compilers for SMPs) leads to acceptable performance in very few applications only. However, additional optimizations, including access privatization, selective touch, and dynamic scheduling, result in a 31% average improvement on our benchmarks.

18.

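A Vectorization Method Based on Compiler Directives

姚远 赵荣彩 《计算机工程》2012,38(12):272-275

Because of limited program analysis capability, a compiler may fail to vectorize loops automatically or may vectorize them blindly. To address this, a vectorization method based on compiler directives is proposed. Inserting vectorization directives into the code guides the automatic vectorization tool, which then generates efficient vectorized code automatically. Test results show that the method effectively improves the runtime performance of the target code.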