
1.

周波 张源 杨珉 周曦 《小型微型计算机系统》2013,34(6)

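Selective compilation reduces compilation overhead and the storage needed for generated code, but it suffers from hot-method detection latency and compilation latency. Existing techniques for reducing these latencies require complex data structures, algorithms, or special hardware support, which makes them unsuitable for embedded virtual machine platforms. For embedded platforms, the paper proposes reducing both latencies by caching executable code in files and reusing it on demand. To this end, the authors design and implement the lightweight CCARF (Code Cache and Reuse Framework) on top of the just-in-time compiler of the Android virtual machine. CCARF gives the JIT compiler a position-independent code generation algorithm, so that the generated code contains no position-dependent information and can be reused correctly. Building on that algorithm, CCARF implements a code manager that efficiently caches and reuses the position-independent code. Results on the SPECjvm98 benchmark suite show that CCARF improves benchmark performance by about 11% on average while keeping the growth of generated code under control.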

2.

An Improved Power Optimization Strategy for the Instruction Bus

徐步荣 李曦 魏亮辉 《计算机辅助工程》2007,16(1):64-68

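To address low-power optimization in compiler system design and compilation, a strategy for optimizing the power consumption of the VLIW instruction bus is implemented in the back end of a retargetable compiler. The binary object code produced by compilation is rescheduled horizontally to reduce the number of high/low level transitions on the instruction bus, thereby lowering system power consumption. Comparative experiments with two back-end performance optimizations, software pipelining and superblock scheduling, show an optimization effect of more than 30% and a clear correlation between the instruction-level parallelism (ILP) of the code and the optimization effect. Finally, the strategy is improved using ILP: instruction-level parallelism information guides the power optimization, saving up to 20% of the algorithm's overhead with little loss in power optimization effect.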
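The idea of horizontal rescheduling can be illustrated with a small sketch. It is not the authors' algorithm: it assumes, for illustration only, that every operation in a VLIW bundle may be placed in any slot, that bus switching is measured as the Hamming distance between consecutive instruction words, and that a greedy per-bundle choice is acceptable. The names `toggles`, `pack`, `total_toggles`, and `reschedule` are hypothetical.

```python
from itertools import permutations


def toggles(prev: int, cur: int) -> int:
    """Count bus lines that change level between two instruction words."""
    return bin(prev ^ cur).count("1")


def pack(slots, slot_bits: int) -> int:
    """Concatenate the slot encodings of one bundle into a single word."""
    word = 0
    for encoding in slots:
        word = (word << slot_bits) | encoding
    return word


def total_toggles(bundles, slot_bits: int) -> int:
    """Sum the bit transitions over a sequence of bundles."""
    words = [pack(b, slot_bits) for b in bundles]
    return sum(toggles(a, b) for a, b in zip(words, words[1:]))


def reschedule(bundles, slot_bits: int):
    """Greedy horizontal rescheduling (assumes interchangeable slots).

    For each bundle, pick the slot order that causes the fewest bit
    transitions relative to the previously emitted word.
    """
    if not bundles:
        return []
    out = [tuple(bundles[0])]
    prev = pack(out[0], slot_bits)
    for bundle in bundles[1:]:
        best = min(permutations(bundle),
                   key=lambda p: toggles(prev, pack(p, slot_bits)))
        out.append(best)
        prev = pack(best, slot_bits)
    return out
```

Trying every permutation is only practical for narrow bundles. A real back end would also have to respect functional-unit constraints, which this sketch ignores.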

3.

Research on an Optimized Code Generation Algorithm for Reconfigurable Instruction Set Processors

张惠臻 王超 李曦 周学海 《计算机研究与发展》2012,49(9):2018-2026

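Reconfigurable instruction set processors can meet the performance and flexibility demands of changing computational tasks, but traditional compiler back-end techniques cannot generate efficient executable code for them, so new code generation methods are needed. The hybrid optimized code generation algorithm, which extends the traditional three-phase back-end code generation approach, is one such method. The algorithm reuses much of the original three-phase code generation process. Because reconfigurable instruction sets are dynamic, it also adds optimization steps for generating reconfigurable-instruction code according to the system's hardware resources and reconfiguration settings, and so produces executable code that fits the architectural characteristics of reconfigurable instruction set processors. Experiments and analysis show that the algorithm's reconfigurable-instruction code generation is effective for new platforms obtained by hardware reconfiguration and improves application performance on them.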

4.

Loop Permutation and Auto-tuning under the Polyhedral Model

彭畅 刘青枝 陈长波 《计算机工程与科学》2023,(12):2121-2134

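The default loop schedule and tile sizes of the widely used polyhedral compiler Pluto often perform poorly. To address this, a method is proposed that computes multiple legal permutations for Pluto's schedule and auto-tunes loop programs over the configuration space formed by permutations and tile sizes. By handling the scalar dimensions that define loop fusion, the method permutes imperfectly nested loops both across and within blocks at the same time. Four machine-learning-driven auto-tuning strategies are built to find a good combination of permutation order and tile sizes for a loop program at a given problem size. With the default tile sizes, the best permutation generated by the extended Pluto compiler in a parallel environment achieves a maximum speedup of 4.02 and a geometric-mean speedup of 2.12 over Pluto's default schedule. By further searching for better combinations of permutation order and tile sizes, the best auto-tuning strategy achieves a maximum speedup of 5.48 and a geometric-mean speedup of 2.86 over Pluto's default optimization in a parallel environment. In addition, when the best configuration and the learned model obtained for a given problem size are applied to similar problem sizes, they also deliver varying degrees of improvement over Pluto's default optimization.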

5.

An Open64 Compiler Back End for MPI Code Generation

《计算机学报》2014,(7)

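As computer architectures have evolved, distributed-memory designs have come to dominate the high-performance computer architecture market thanks to their good scalability. Parallelizing compilers were proposed to convert existing serial programs into parallel programs that run on high-performance computers. However, compilers for distributed-memory parallel systems have developed relatively slowly, whereas compilers for shared-memory parallel systems and their associated techniques have gradually matured. One feasible way to build a compiler for distributed-memory parallel systems is to adapt an existing shared-memory compiler so that it automatically generates MPI (Message Passing Interface) parallel programs that run on distributed-memory high-performance computers. The paper therefore designs and implements a back end with MPI code generation support for Open64, a compiler for shared-memory parallel systems. Following the characteristics of distributed parallelizing compilation, Open64 is improved in three respects: automatic generation of computation partitions, improved loop optimization, and automatic generation of MPI parallel code. In the experiments, serial programs are compiled with Open64 and its MPI back end, and the generated MPI parallel code runs directly on distributed-memory high-performance computers. Compared with MPI code generated by a traditional compiler for distributed-memory parallel systems, the generated code shows a clear improvement in parallel efficiency.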

6.

A Survey of Machine-Learning-Based Compiler Auto-tuning

池昊宇 陈长波 《计算机科学》2022,49(1):241-251

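Modern compilers offer a large number of optimization options. Choosing parameter values, choosing which options to combine, and choosing the order in which to apply them have become complex problems, of which the ordering problem is the hardest. With improvements to traditional methods (iterative compilation combined with heuristic search) and the emergence of new techniques (machine learning), it has become possible to build a relatively efficient and intelligent compiler auto-tuning framework. By surveying related research from the past several decades, the article summarizes previous research...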

7.

Design and Implementation of a Class-Library-Based Retargetable Compiler Back End

王民华 张素琴 田金兰 《计算机工程与应用》2003,39(9):115-118

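Based on an analysis of several retargetable compilers, the paper proposes a class-library-based design technique for retargetable compiler back ends. By suitably defining the interface between machine description and code generation, the technique abstracts the operations and functions common to different hardware platforms and isolates the intermediate representation from the differences among the platforms' assembly languages. The interface is implemented with object-oriented techniques according to the characteristics of each hardware platform, forming a class library that supports retargeting. The code generator calls the interface to translate the intermediate representation into the assembly language of the target platform, completing the retargeting of the compiler back end.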

8.

Generating Data Distribution Code Based on Exposed Sets in Program Parallelization

丁锐 赵荣彩 韩林 《计算机工程与设计》2009,30(15)

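In parallelizing compilation, code generation belongs to the compiler back end and determines the execution efficiency of the parallel program. Data partitioning maps data that is redefined or never read in the computation loop onto processors, so generating communication code directly from the data partition produces redundant communication. An optimization algorithm is proposed that uses array data-flow analysis to compute exposed sets, builds a system of inequality constraints over the computation partition, the loop iterations, and the exposed sets, and finally generates data distribution code through Fourier-Motzkin elimination (FME). Test results show that the algorithm clearly improves data distribution.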
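The elimination step named in this abstract can be sketched generically. The code below is a textbook Fourier-Motzkin step, not the paper's algorithm. The representation (each inequality as a pair `(coeffs, bound)` meaning `sum(coeffs[i] * x[i]) <= bound`) and the function name `fm_eliminate` are assumptions made for illustration.

```python
from fractions import Fraction


def fm_eliminate(rows, k):
    """Project variable x[k] out of a system of linear inequalities.

    rows: list of (coeffs, bound), each meaning sum(coeffs[i]*x[i]) <= bound.
    Returns an equivalent system in which x[k] has coefficient 0 everywhere.
    """
    pos, neg, zero = [], [], []
    for coeffs, bound in rows:
        row = ([Fraction(a) for a in coeffs], Fraction(bound))
        c = row[0][k]
        (pos if c > 0 else neg if c < 0 else zero).append(row)

    out = list(zero)
    # Pair every upper bound on x[k] with every lower bound on x[k].
    for pc, pb in pos:
        for nc, nb in neg:
            sp = 1 / pc[k]       # scales the x[k] coefficient to +1
            sn = 1 / (-nc[k])    # scales the x[k] coefficient to -1
            coeffs = [sp * a + sn * b for a, b in zip(pc, nc)]
            out.append((coeffs, sp * pb + sn * nb))
    return out
```

For example, with variables `(i, j)` and the constraints `0 <= i`, `i <= 9`, `i <= j`, `j <= 5`, eliminating `i` leaves bounds on `j` alone, which is the kind of projection used to derive loop bounds. The number of inequalities can grow quadratically per eliminated variable, so practical implementations also remove redundant rows.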

9.

An Iterative Compilation Method Based on Incremental Instance-Based Learning


马晓东 李中升 漆锋滨 尉红梅 《计算机工程》2012,38(3):4-6

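To make compilers more adaptive in the face of complex architectures, a compilation framework combining iterative compilation and machine learning is proposed. The compiler stores the best compiler options found by searching the optimization space in a knowledge base, and learns from the knowledge base the best options for the current program. The instance-based learning algorithm is incremental and makes effective use of the data accumulated during compilation. Learning accuracy and efficiency are maintained by keeping redundant instances out of the knowledge base and removing noisy instances from it.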
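A knowledge base of this kind might look roughly like the sketch below. It is an assumption-laden illustration rather than the paper's design: program features are taken to be numeric vectors, prediction is plain 1-nearest-neighbour, redundancy is detected with a hypothetical distance threshold `min_dist`, and noise removal is left out. The class name `KnowledgeBase` and the example flag strings are illustrative only.

```python
import math


class KnowledgeBase:
    """Incremental instance store mapping program features to options."""

    def __init__(self, min_dist: float = 0.1):
        self.instances = []       # list of (feature tuple, option string)
        self.min_dist = min_dist  # hypothetical redundancy threshold

    def add(self, features, options: str) -> bool:
        """Store a new instance unless a near-identical one already exists."""
        for known, known_options in self.instances:
            same_answer = known_options == options
            if same_answer and math.dist(known, features) < self.min_dist:
                return False      # redundant: adds no new information
        self.instances.append((tuple(features), options))
        return True

    def predict(self, features):
        """Return the options of the nearest stored instance, or None."""
        if not self.instances:
            return None
        nearest = min(self.instances,
                      key=lambda inst: math.dist(inst[0], features))
        return nearest[1]
```

Each search result from iterative compilation would be passed to `add`, and `predict` would supply a starting point for a new program. Because instances are appended one at a time, no retraining step is needed, which is the sense in which such a scheme is incremental.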

10.

Design of an Optimizing Compiler for SIMD Computers (total citations: 1; self-citations: 1; citations by others: 0)


赵辉 黄石 《计算机工程》2009,35(1):201-203

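Exploiting the processor's resources, improving the compiler's optimization capability, and making code more adaptable are the keys to optimizing compilation for SIMD processors. Based on the M language and the LSSIMD architecture, and drawing on modern compilation techniques, the paper proposes optimization and implementation methods for a SIMD coprocessor compiler, including register allocation, single-value merging, and code compression. Experimental results show that the generated object code is correct and efficient.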

11.

Towards optimized tensor code generation for deep learning on sunway many-core processor

Mingzhen LI Changxi LIU Jianjin LIAO Xuegui ZHENG Hailong YANG Rujun SUN Jun XU Lin GAN Guangwen YANG Zhongzhi LUAN Depei QIAN 《Frontiers of Computer Science》2024,18(2):182101

The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the existing deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific computing and deep learning workloads. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient codes for deep learning workloads on Sunway. The experiment results show that the codes generated by swTVM achieve 1.79× improvement of inference latency on average compared to the state-of-the-art deep learning framework on Sunway, across eight representative benchmarks. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and Sunway processor particularly with productivity and efficiency in mind. We believe this work will encourage more people to embrace the power of deep learning and Sunway many-core processor.

12.

Proofs of numerical programs when the compiler optimizes

Sylvie Boldo Thi Minh Tuyen Nguyen 《Innovations in Systems and Software Engineering》2011,7(2):151-160

On certain recently developed architectures, a numerical program may give different answers depending on the execution hardware and the compilation. Our goal is to formally prove properties about numerical programs that are true for multiple architectures and compilers. We propose an approach that states the rounding error of each floating-point computation whatever the environment and the compiler choices. This approach is implemented in the Frama-C platform for static analysis of C code. Small case studies using this approach are entirely and automatically proved.

13.

Milepost GCC: Machine Learning Enabled Self-tuning Compiler

Grigori Fursin Yuriy Kashnikov Abdul Wahid Memon Zbigniew Chamski Olivier Temam Mircea Namolaru Elad Yom-Tov Bilha Mendelson Ayal Zaks Eric Courtois Francois Bodin Phil Barnard Elton Ashton Edwin Bonilla John Thomson Christopher K. I. Williams Michael O'Boyle 《International journal of parallel programming》2011,39(3):296-327

Tuning compiler optimizations for rapidly evolving hardware makes porting and extending an optimizing compiler for each new platform extremely challenging. Iterative optimization is a popular approach to adapting programs to a new architecture automatically using feedback-directed compilation. However, the large number of evaluations required for each program has prevented iterative compilation from widespread take-up in production compilers. Machine learning has been proposed to tune optimizations across programs systematically but is currently limited to a few transformations, long training phases and critically lacks publicly released, stable tools. Our approach is to develop a modular, extensible, self-tuning optimization infrastructure to automatically learn the best optimizations across multiple programs and architectures based on the correlation between program features, run-time behavior and optimizations. In this paper we describe Milepost GCC, the first publicly-available open-source machine learning-based compiler. It consists of an Interactive Compilation Interface (ICI) and plugins to extract program features and exchange optimization data with the cTuning.org open public repository. It automatically adapts the internal optimization heuristic at function-level granularity to improve execution time, code size and compilation time of a new program on a given architecture. Part of the MILEPOST technology together with low-level ICI-inspired plugin framework is now included in the mainline GCC. We developed machine learning plugins based on probabilistic and transductive approaches to predict good combinations of optimizations. Our preliminary experimental results show that it is possible to automatically reduce the execution time of individual MiBench programs, some by more than a factor of 2, while also improving compilation time and code size. 
On average we are able to reduce the execution time of the MiBench benchmark suite by 11% for the ARC reconfigurable processor. We also present a realistic multi-objective optimization scenario for Berkeley DB library using Milepost GCC and improve execution time by approximately 17%, while reducing compilation time and code size by 12% and 7% respectively on Intel Xeon processor.

14.

Compilation techniques for parallel systems

《Parallel Computing》1999,25(13-14):1741-1783

Over the past two decades tremendous progress has been made in both the design of parallel architectures and the compilers needed for exploiting parallelism on such architectures. In this paper we summarize the advances in compilation techniques for uncovering and effectively exploiting parallelism at various levels of granularity. We begin by describing the program analysis techniques through which parallelism is detected and expressed in form of a program representation. Next compilation techniques for scheduling instruction level parallelism (ILP) are discussed along with the relationship between the nature of compiler support and type of processor architecture. Compilation techniques for exploiting loop and task level parallelism on shared-memory multiprocessors (SMPs) are summarized. Locality optimizations that must be used in conjunction with parallelization techniques for achieving high performance on machines with complex memory hierarchies are also discussed. Finally we provide an overview of compilation techniques for distributed memory machines that must perform partitioning of both code and data for parallel execution. Communication optimization and code generation issues that are unique to such compilers are also briefly discussed.

15.

Alternatives of profile-guided code optimizations for one-stage compilation

O. A. Chetverina 《Programming and Computer Software》2016,42(1):34-40

Optimizing compilers increase the resulting code performance by carrying out a number of code optimization techniques. Profile information assistance for code optimizations gives an opportunity to greatly increase the code performance in some cases. However, the impossibility to provide a representative training execution often leads to the decline in efficiency of profile-dependent code optimizations. This paper investigates the main causes of the performance loss for the one-stage optimization as compared to the profile-guided optimization (PGO) and introduces some alternative compilation techniques to reduce this loss. The effectiveness of these techniques is evaluated for a VLIW-architecture Elbrus compiler.

16.

Java Compiler Technology and Java Performance (total citations: 4; self-citations: 1; citations by others: 3)

冀振燕 程虎 《软件学报》2000,11(2):173-178

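The paper surveys Java compiler technology and divides Java compilers into five categories: compilers with interpretation, compilers with just-in-time (JIT) compilation, compilers with adaptive optimization, native compilers, and translators. Their architectures and working principles are described and analyzed in detail, as is the effect of compiler technology on Java performance.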

17.

An Object-Oriented Framework for Loop Parallelization

Omori Youichi Fukuda Akira Joe Kazuki 《The Journal of supercomputing》1999,13(1):57-69

Generation of efficient parallel code is a major goal of a well-designed and developed parallelizing compiler. Another important goal is portability of both compiler system and the resulting output source codes. The various choices of current and future parallel computer architectures as well as the cost of developing a parallelizing compiler make portability a very important design goal. Since the design of parallelizing compilers is considerably more complex than designing conventional compilers, it is very important to achieve both efficiency and portability. To meet this dual goal, we have investigated the application of object oriented design to parallelizing compilers. Our parallelizing compiler design is based on abstractions of intermediate representations of loops and their class definitions. In this paper, we address the problem of loop parallelization and propose a framework where the loop parallelization process is divided into three phases and the optimization of loops is performed via a cyclic application of these three phases. The class of each phase is hierarchically derived from intermediate representations of loops. This facilitates the portability of the resulting parallelizing compilers. Furthermore, one of the phases uses a reservation table of hardware resources in order to obtain optimized parallel programs for given hardware resources. The validation of the proposed framework is given through the application of the object oriented design on an example program which is then parallelized efficiently.

18.

A highly flexible, parallel virtual machine: design and experience of ILDJIT

Simone Campanoni Giovanni Agosta Stefano Crespi Reghizzi Andrea Di Biagio 《Software》2010,40(2):177-207

ILDJIT, a new‐generation dynamic compiler and virtual machine designed to support parallel compilation, is introduced here. Our dynamic compiler targets the increasingly popular ECMA‐335 specification. The goal of this project is twofold: on one hand, it aims at exploiting the parallelism exposed by multi‐core architectures to hide the dynamic compilation latencies by pipelining compilation and execution tasks; on the other hand, it provides a flexible, modular and adaptive framework for dynamic code optimization. The ILDJIT organization and the compiler design choices are presented and discussed highlighting how adaptability and extensibility can be achieved. Thanks to the compilation latency masking effect of the pipeline organization, our dynamic compiler is able to mask most of the compilation delay, when the underlying hardware exposes sufficient parallelism. Even when running on a single core, the ILDJIT adaptive optimization framework manages to speedup the computation with respect to other open‐source implementations of ECMA‐335. Copyright © 2010 John Wiley & Sons, Ltd.

19.

Compiler techniques for the distribution of data and computation

Navarro A. Zapata E. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):545-562

This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.

20.

Optimizing Matrix Multiplication Programs by Combining Models and Iterative Compilation

陆平静 王正华 车永刚 《计算机工程与科学》2009,31(Z1)

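The gap between the sustained performance achieved by high-performance computing applications and the machine's peak performance keeps widening, which seriously constrains the development of high-performance computing. Program transformation, which applies optimizing transformations that adapt a program to the characteristics of the machine architecture, improves actual execution performance and is one effective way to address the problem. Many high-level program transformations take numeric parameters whose values must be chosen carefully to obtain the best performance. Traditional compilers choose these parameters with simple models that struggle to keep up with increasingly complex hardware platforms and applications. Iterative compilation generates different program versions and runs them on the actual hardware to evaluate the values of key optimization parameters and pick those that give the best performance. It clearly outperforms static methods, but its large optimization overhead limits its applicability. For matrix multiplication programs, the paper proposes an optimization method that combines a performance model with iterative compilation: a performance model built from empirical knowledge of the machine architecture and the program constrains the optimization space, and a genetic algorithm speeds up the search for good solutions within it. Experimental results show that the method achieves better optimization results at lower overhead.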
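The combination of a model that prunes the space and a genetic search inside it can be sketched as follows. Everything here is a simplified assumption, not the paper's method: the only tuned parameter is a square tile size, the "model" is a hypothetical rule that three tiles must fit in a cache of `cache_elems` elements, and `measure` stands in for compiling and timing one program version.

```python
import random


def model_ok(tile: int, cache_elems: int) -> bool:
    """Hypothetical model: three tile x tile blocks must fit in cache."""
    return 3 * tile * tile <= cache_elems


def genetic_search(measure, candidates, generations=10, pop_size=6, seed=0):
    """Search the model-pruned candidates for the lowest measured cost.

    measure: callable returning a cost for one candidate (lower is better);
             in real iterative compilation this compiles and runs a version.
    """
    rng = random.Random(seed)
    pop = rng.sample(candidates, min(pop_size, len(candidates)))
    best = min(pop, key=measure)
    for _ in range(generations):
        pop.sort(key=measure)
        parents = pop[:max(2, len(pop) // 2)]   # selection: keep the fittest
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = rng.sample(parents, 2)
            child = (a + b) // 2                # crossover: midpoint
            if rng.random() < 0.3:              # mutation: random candidate
                child = rng.choice(candidates)
            # Snap to the nearest value the model allows.
            child = min(candidates, key=lambda c: abs(c - child))
            children.append(child)
        pop = parents + children
        best = min(pop + [best], key=measure)
    return best
```

A caller would first build `candidates` with `model_ok` and then pass a timing function as `measure`. Since each measurement is expensive in practice, results should be cached so that no candidate is compiled and run twice.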