1.

A Timing Optimization Model for Computation Partitioning in Data-Parallel Languages  Cited by: 2 (self-citations: 0; by others: 2)

余华山 胡长军 黄其军 丁文魁 许卓群 《软件学报》2001,12(10):1434-1446

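The computation partition (CP) of the data-parallel statements in a program has a decisive effect on that program's run-time performance. Although this problem has been studied extensively, prior work has concentrated on improving the spatial locality of the chosen computation partition. This paper proposes a timing optimization model for the computation partitioning of parallel loop structures. In the model, a computation partition is represented as a directed graph that, while mapping the operations of a parallel statement onto the processors, also captures the dependences among operations assigned to different processors. For a given data-parallel statement, the model optimizes each candidate computation partition with several effective optimization strategies; estimates the execution efficiency of each candidate by jointly considering four factors (load balance, inter-processor operation dependences, and the spatial and temporal locality of data accesses); and finally selects the candidate with the best estimated efficiency as the statement's computation partition. The authors have used the model to implement support for the FORALL construct in the HPF compiler p-HPF. Experimental results show that the model is highly general and achieves good speedups on a variety of data-parallel problems from different domains. With only minor changes, the model can also be applied to the computation partitioning of other kinds of data-parallel statements.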
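The selection step described in this abstract (score several candidate partitions, keep the best) can be illustrated with a deliberately small sketch. It considers only one of the four factors, load balance, and compares only two candidates, block and cyclic. The function names and the cost model are assumptions made for illustration; they are not taken from the paper or from p-HPF.

```python
# Illustrative sketch only: scores candidate computation partitions by load
# balance. Names and the cost model are assumptions, not the paper's model.

def block_partition(n_iters, n_procs):
    """Assign contiguous chunks of iterations to processors."""
    chunk = -(-n_iters // n_procs)  # ceiling division
    return [i // chunk for i in range(n_iters)]

def cyclic_partition(n_iters, n_procs):
    """Assign iterations to processors round-robin."""
    return [i % n_procs for i in range(n_iters)]

def load_imbalance(owner, cost, n_procs):
    """Return the maximum processor load divided by the mean load (1.0 is perfect)."""
    loads = [0] * n_procs
    for i, p in enumerate(owner):
        loads[p] += cost(i)
    return max(loads) / (sum(loads) / n_procs)

def pick_partition(n_iters, n_procs, cost):
    """Choose the candidate partition with the smallest imbalance."""
    candidates = {
        "block": block_partition(n_iters, n_procs),
        "cyclic": cyclic_partition(n_iters, n_procs),
    }
    return min(candidates, key=lambda k: load_imbalance(candidates[k], cost, n_procs))
```

For a triangular workload such as `cost = lambda i: i + 1` with 16 iterations on 4 processors, `pick_partition` selects the cyclic candidate, because the block candidate gives the last processor the most expensive iterations.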

2.

High-Level Loop Optimization Based on the Tree-SSA Optimization Framework

杨夏 《数字社区&智能家居》2009,(24)

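Modern processors and computer systems implement many advanced techniques, and exploiting them for high performance requires compiler support. The Tree-SSA optimization framework in GCC provides a powerful framework for program analysis. Enhanced data-dependence information allows the compiler to transform an algorithm to achieve better locality and better resource utilization, which increases throughput and improves performance. This paper studies data dependence, matrix transformations, and loop transformations; analyzes their characteristics, algorithms, and performance; describes the current state of loop transformations in GCC; and outlines directions for future research.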

3.

Single-Node Performance Optimization of a Structured-Grid Parallel CFD Program

车永刚 张理论 王勇献 徐传福 刘巍 王正华 刘化勇 《计算机科学》2013,40(3):116-120

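This paper optimizes a high-order-accuracy structured-grid parallel CFD program from the perspective of single-node performance. Key variables are identified and turned into constant parameters so that the compiler can apply more aggressive, targeted optimizations. Based on the program's data structures and access patterns, a hierarchical data-buffering technique is designed so that the main computational code accesses the main data structures in a more favorable way, which improves the spatial locality of memory accesses. Various loop transformations are applied to optimize memory-access performance. Tests on the Tianhe-1A parallel machine at the National Supercomputing Center in Changsha show that, relative to the version built with the Intel compiler's highest optimization level, serial performance improves by about 22.2%-28.9% on a two-dimensional airfoil case with 1 million grid points, and parallel performance improves by about 13.9%-20.2% on a delta-wing case with 112 million grid points.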

4.

Data Prefetching and Optimization Methods for Instruction-Level Parallel Compilers  Cited by: 6 (self-citations: 0; by others: 6)

连瑞琦 张兆庆 乔如良 《计算机学报》2000,23(6):576-584

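Microprocessors keep getting more powerful, but memory speed lags far behind, which leaves overall system performance unsatisfactory. To address this problem, compilers have developed techniques such as locality optimization and data prefetching. This paper presents a data prefetching technique for an ILP (instruction-level parallelism) optimizing compiler, together with a method that uses the register file to reduce the number of main-memory accesses. Together they improve average memory performance and are quite effective for scientific and engineering applications.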

5.

Design of an Optimizing Compiler for SIMD Computers  Cited by: 1 (self-citations: 1; by others: 0)

赵辉 黄石 《计算机工程》2009,35(1):201-203

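Exploiting processor resources to improve the compiler's optimization capability and make the generated code more adaptable is the key to optimizing compilation for SIMD processors. Based on the M language and the LSSIMD architecture, and drawing on modern compilation techniques, this paper proposes optimization and implementation methods for a compiler targeting a SIMD coprocessor, including register allocation, single-value merging, and code compaction. Experimental results show that the generated object code is correct and efficient.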

6.

A Method for Improving Locality and Reducing False Sharing through Data Fusion  Cited by: 6 (self-citations: 0; by others: 6)

曾丽芳 杨学军 夏军 陈娟 《计算机学报》2004,27(1):32-41

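Some applications cannot gain performance from reordering elements within a single array. To address this, the paper extends inter-array data regrouping and analyzes the properties of an optimization that fuses the data of several arrays in a prescribed way to improve locality and reduce false sharing. For several typical array-association patterns, it proposes corresponding data-fusion methods and establishes a set of rough performance-cost rules that guide the compiler to fuse arrays selectively, improving the global optimization result. Based on measurements on several platforms, the paper also analyzes the performance portability of data fusion across architectures and adds architectural characteristics to the cost rules so that the optimization applies to different architectures. The results show that data fusion is very effective for some applications, particularly on software DSM architectures.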
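A rough way to see why fusing arrays can help is to count the cache lines touched by an access pattern that reads `a[i]` and `b[i]` together. The sketch below models addresses only; the 64-byte line, the 8-byte element, the base addresses, and the pairwise interleaving are assumptions for illustration and are not the fusion patterns or cost rules from the paper.

```python
# Address-level model of array fusion. All sizes and layouts are assumed.
LINE = 64  # assumed cache-line size in bytes
ELEM = 8   # assumed element size in bytes (for example, a double)

def lines_touched_separate(indices, base_a, base_b):
    """Cache lines touched when a[i] and b[i] live in two separate arrays."""
    lines = set()
    for i in indices:
        lines.add((base_a + i * ELEM) // LINE)
        lines.add((base_b + i * ELEM) // LINE)
    return len(lines)

def lines_touched_fused(indices, base):
    """Cache lines touched when a[i] and b[i] are interleaved in one array."""
    lines = set()
    for i in indices:
        addr_a = base + (2 * i) * ELEM      # a[i] is stored in slot 2*i
        addr_b = base + (2 * i + 1) * ELEM  # b[i] is stored in the next slot
        lines.add(addr_a // LINE)
        lines.add(addr_b // LINE)
    return len(lines)
```

With the non-contiguous indices `[0, 100, 200, 300]`, the separate layout touches 8 lines and the fused layout touches 4. With the contiguous indices `range(8)`, both layouts touch 2 lines, which is consistent with the abstract's point that fusion should be applied selectively.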

7.

Implementation of a Co-Array Fortran Compiler Based on Software Shared Memory

黄春 《计算机科学》2012,39(1):287-289,304

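Co-Array Fortran (CAF) has become part of the Fortran language standard and is gradually gaining acceptance in scientific computing. This paper implements a CAF compiler on top of software shared memory: co-array data communication is realized through direct array assignment, and data padding is used to improve data locality, reduce false sharing, and optimize the performance of CAF programs. Tests on typical scientific computing programs show that CAF achieves performance comparable to MPI.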
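The effect of data padding on false sharing can be sketched by checking which per-thread slots land on the same cache line. The model below is an assumption-laden illustration (64-byte lines, slots laid out at a fixed stride from an aligned base); it does not reproduce the padding scheme of the CAF compiler described above.

```python
# Address-level model of padding against false sharing. Sizes are assumed.
LINE = 64  # assumed cache-line size in bytes

def shared_line_pairs(n_threads, stride):
    """Count pairs of threads whose private slots start on the same cache line.

    Thread t owns the slot that starts at byte offset t * stride.
    """
    lines = [(t * stride) // LINE for t in range(n_threads)]
    return sum(
        1
        for i in range(n_threads)
        for j in range(i + 1, n_threads)
        if lines[i] == lines[j]
    )

def padded_stride(elem_size):
    """Round the slot size up to a whole number of cache lines."""
    return -(-elem_size // LINE) * LINE
```

Eight threads with unpadded 8-byte slots all share one line (28 conflicting pairs); using `padded_stride(8)`, which is 64, leaves no two slots on the same line, at the cost of 56 unused bytes per slot.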

8.

An Effective Implementation Strategy for Optimizing Compilation

王新辉 于健 王昭顺 《计算机工程与应用》2001,37(11):33-35,38

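Optimizing compilation plays an increasingly important role in research on modern processors. Starting from the structure of modern compilers, this paper surveys the optimization techniques they commonly use and proposes an effective implementation strategy for an optimizing compiler.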

9.

Source-to-Source Compiler Translation with Type Recovery

米伟 李玉祥 陈莉 冯晓兵 张兆庆 《计算机研究与发展》2010,47(7)

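Source-to-source translation is an important way to make a compiler's analyses and optimizations retargetable. It is widely used to support parallel language extensions and architecture-independent optimizations, and it helps programmers debug correctness and performance. In the multicore/manycore era, program analysis and optimization tend to involve the user more, so this platform-independent, user-friendly form of code generation is increasingly popular. Adding source-to-source support to a simple compiler is easy, but few compilers that implement complex program analyses and aggressive optimizations offer robust source-to-source translation. The main reason is that optimizations change the program's structure. Drawing on a large number of failure cases, the paper analyzes the difficulties that optimizations create for source-to-source translation, proposes a translation technique based on type recovery, and implements it in the Open64 compiler. Tests with the supertest and spec2000 suites confirm that the method greatly improves the robustness of source-to-source translation. The implementation is integrated into the source-to-source translator and is independent of the compiler's analysis and optimization modules, so it is easy to port to other compilers.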

10.

A Cross-Architecture Compiler Analysis Method Oriented toward the Ideal Performance Space

赖庆宽 吕方 贺春林 何先波 冯晓兵 《计算机研究与发展》2021,58(3):668-680

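Compiler performance reflects how fully a computer architecture's strengths are exploited, and compiler optimization is affected by the characteristics of both the machine platform and the compiler. Compiler analysis is carried out between the target compiler and multiple reference compilers, and between the target platform and multiple reference platforms; that is, combinations of compiler and platform form the basis of the analysis. Only with many combinations can the analysis expose the largest possible room for performance improvement and a detailed optimization plan for the target compiler, but adding combinations tends to increase the analysis workload beyond what is manageable. The paper therefore proposes a cross-platform, cross-compiler analysis method based on peak architectures. It builds an ideal performance space for the target compiler from a set of peak architectures, uses fine-grained localization of advantageous optimizations to provide the target compiler with advantageous optimization options and optimization directions, and implements the compiler optimizations. Experiments confirm the practicality and generality of the analysis technique and yield optimization directions for the target compiler gcc on Intel platforms.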

11.

Compiler techniques for the distribution of data and computation

Navarro A. Zapata E. Padua D. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(6):545-562

This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (LCG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.

12.

Data locality and parallelism optimization using a constraint-based approach

Ozcan Ozturk 《Journal of Parallel and Distributed Computing》2011,71(2):280-287

Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In the context of data-intensive embedded applications, there have been two complementary approaches to enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism. Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This can be achieved by considering multiple loop nests simultaneously. Although compilers address these two problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously. Therefore, an integrated approach that combines these two can generate much better results than each individual approach. Based on these observations, this paper proposes a constraint network (CN)-based formulation for data locality optimization and code parallelization. The paper also presents experimental evidence, demonstrating the success of the proposed approach, and compares our results with those obtained through previously proposed approaches. The experiments from our implementation indicate that the proposed approach is very effective in enhancing data locality and parallelization.

13.

The Implementation of a High Performance GPGPU Compiler

Yi Yang Huiyang Zhou 《International journal of parallel programming》2013,41(6):768-781

In this paper we present our experience in developing an optimizing compiler for general purpose computation on graphics processing units (GPGPU) based on the Cetus compiler framework. The input to our compiler is a naïve GPU kernel procedure, which is functionally correct but without any consideration for performance optimization. Our compiler applies a set of optimization techniques to the naive kernel and generates the optimized GPU kernel. Our compiler supports optimizations for GPU kernels using either global memory or texture memory. The implementation of our compiler is facilitated with a source-to-source compiler infrastructure, Cetus. The code transformation in the Cetus compiler framework is called a pass. We classify all the passes used in our work into two categories: functional passes and optimization passes. The functional passes translate input kernels into desired intermediate representation, which clearly represents memory access patterns and thread configurations. A series of optimization passes improve the performance of the kernels by adapting them to the target GPGPU architecture. Our experiments show that the optimized code achieves very high performance, either superior or very close to highly fine-tuned libraries.

15.

Integrating code generation and peephole optimization

Mahadevan Ganapathi Charles N. Fischer 《Acta Informatica》1988,25(1):85-109

Peephole optimization when integrated with automatic code generation into a uniform framework has significant advantages in the specification and implementation of efficient compiler back-ends. Attribute grammars provide a framework for expression of machine-specific code optimizations. We present a grammar-driven peephole optimization algorithm that is particularly well suited to attributed-parsing code generators. Integration via semantic attributes corrects interrelated phase-ordering problems and produces a faster and smaller compiler back-end.
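For readers unfamiliar with the term, peephole optimization rewrites short runs of adjacent instructions into cheaper equivalents. The sketch below is a conventional hand-written, window-based pass over a made-up tuple instruction format; it is not the grammar-driven, attribute-based algorithm the abstract describes, and the instruction names and rewrite rules are assumptions chosen for illustration.

```python
# Conventional window-based peephole pass over a hypothetical instruction set.

def peephole(code):
    """Apply local rewrites repeatedly until no rule fires.

    Instructions are tuples such as ("push", "r1"), ("pop", "r1"),
    ("add", "r1", 0), or ("mul", "r1", 2).
    """
    changed = True
    while changed:
        changed = False
        out = []
        i = 0
        while i < len(code):
            ins = code[i]
            nxt = code[i + 1] if i + 1 < len(code) else None
            # Rule 1: "push r" immediately followed by "pop r" has no effect.
            if nxt and ins[0] == "push" and nxt[0] == "pop" and ins[1] == nxt[1]:
                i += 2
                changed = True
                continue
            # Rule 2: adding the constant 0 has no effect.
            if ins[0] == "add" and ins[2] == 0:
                i += 1
                changed = True
                continue
            # Rule 3: multiplying by 2 becomes a left shift by 1.
            if ins[0] == "mul" and ins[2] == 2:
                out.append(("shl", ins[1], 1))
                i += 1
                changed = True
                continue
            out.append(ins)
            i += 1
        code = out
    return code
```

The outer loop matters: removing an inner `push r2` / `pop r2` pair can make an outer `push r1` / `pop r1` pair adjacent, so a second pass is needed to remove it.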

16.

Performance Portability Testing and Analysis of Data Fusion Optimization on IA-64 Machines

曾丽芳 杨学军 《计算机工程与应用》2005,41(15):1-4,16

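Reference [1] proposed a data-fusion optimization across arrays and measured its effect on an IA-32 server. The results showed that on IA-32 machines, data fusion guided by a performance-cost model improves cache utilization for applications with non-contiguous data accesses. How well does data fusion work on the newer IA-64 architecture? Using an Intel IA-32 server and an HP Itanium server as platforms, this paper compiles and runs programs before and after the data-fusion transformation with the Intel Fortran compilers ifc and efc and the free compiler g95, and collects execution times and related performance data on both platforms. The results show that source-level data fusion does not cooperate well with the high-level optimizations of the efc compiler on IA-64: at the O3 optimization level, the optimization has a negative effect. This further indicates that high-level compiler optimizations such as data prefetching, loop transformations, and data transformations must be considered together with the architecture's characteristics to achieve a good global result. The paper is offered as a starting point for studying the performance portability to IA-64 of compiler optimization algorithms designed for IA-32.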

17.

Multi-Platform Data Layout Optimization Based on Relaxed Reuse Distance on Heterogeneous Architectures

刘颖 黄磊 吕方 崔慧敏 王蕾 冯晓兵 《软件学报》2016,27(8):2168-2184

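Heterogeneous architectures are developing rapidly, and it is important to rely on the compiler to exploit the data locality of applications and make full use of the on-chip caches of accelerator devices. Traditional reuse distance, however, faces the challenge of platform differences in a heterogeneous setting and lacks a unified computation framework. To better characterize and optimize the locality of heterogeneous programs, the paper establishes a reuse-distance computation mechanism and a data-layout optimization framework that are unified across platforms. Based on how an application executes in parallel on a heterogeneous architecture, the framework defines a relaxed reuse distance from a statistical-average perspective, gives a method for computing it using OpenCL programs as an example, and thereby provides a uniform basis for multi-platform data-layout decisions. Experiments on three platforms (Intel Xeon Phi, AMD Opteron CPU, and Tilera TileGX-36) show that the method achieves an average speedup of at least 1.14x across platforms.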
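As background for this abstract, the traditional reuse distance of an access is the number of distinct addresses touched since the previous access to the same address. The sketch below computes that classic, sequential quantity for a small trace. It does not implement the relaxed, statistically averaged variant proposed in the paper, and the function names are hypothetical.

```python
# Classic (sequential) reuse distance, computed the simple quadratic way.

def reuse_distances(trace):
    """Return one entry per access: the number of distinct addresses touched
    since the previous access to the same address, or None on first use."""
    last_pos = {}
    result = []
    for pos, addr in enumerate(trace):
        if addr in last_pos:
            between = trace[last_pos[addr] + 1 : pos]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_pos[addr] = pos
    return result

def hits_in_lru_cache(trace, capacity):
    """Count accesses that hit a fully associative LRU cache of the given
    capacity: an access hits exactly when its reuse distance is below it."""
    return sum(1 for d in reuse_distances(trace) if d is not None and d < capacity)
```

For the trace `["a", "b", "c", "a", "b", "b"]`, the distances are `[None, None, None, 2, 2, 0]`, so a 2-entry LRU cache sees one hit and a 3-entry cache sees three.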

18.

Arranging statements and data of program instances for locality

Claudia Leopold 《Future Generation Computer Systems》1998,14(5-6):293-311

In memory hierarchies, programs can be speeded up by increasing their degree of locality. One human approach to locality optimization considers several small program instances of a given program, optimizes the instances for locality, and generalizes the structure of the solutions to the program. The paper suggests a semi-automatic locality-optimization method based on this approach. The major contribution is a local search algorithm that automatically optimizes program instances via combined code and data transformations. The algorithm uses a novel objective function that quantifies the intuitive notion of locality. Experimental results indicate that our method compares well with current compiler optimizations.

19.

Data-Centric Transformations for Locality Enhancement

Kodukula Induprakas Pingali Keshav 《International journal of parallel programming》2001,29(3):319-364

On modern computers, the performance of programs is often limited by memory latency rather than by processor cycle time. To reduce the impact of memory latency, the restructuring compiler community has developed locality-enhancing program transformations such as loop permutation and tiling. These transformations work well for perfectly nested loops (loops in which all assignment statements are contained in the innermost loop), but their performance on codes such as matrix factorizations that contain imperfectly nested loops leaves much to be desired. In this paper, we propose an alternative approach called data-centric transformation. Instead of reasoning directly about the control structure of the program, a compiler using the data-centric approach chooses an order for the arrival of data elements in the cache, determines what computations should be performed when that data arrives, and generates the appropriate code. At runtime, program execution will automatically pull data into the cache in an order that corresponds approximately to the order chosen by the compiler; since statements that touch a data structure element are scheduled close together, locality is improved. The idea of data-centric transformation is very general, and in this paper, we discuss a particular transformation called data-shackling. We have implemented shackling in the SGI MIPSPro compiler which already has a sophisticated implementation of control-centric transformations for locality enhancement. We present experimental results on the SGI Octane comparing the performance of the two approaches, and show that for dense numerical linear algebra codes, data-shackling does better by factors of two to five.
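The control-centric transformations that this abstract takes as its baseline, loop permutation and tiling, can be shown on a perfectly nested matrix multiply. The sketch below is that conventional tiling, not data-shackling; the function names and the tile size are assumptions. Pure Python will not show the cache benefit, so the example only demonstrates that the reordered loops compute the same result.

```python
# Conventional loop tiling of matrix multiply (not data-shackling).

def matmul_naive(a, b, n):
    """Reference i-j-k triple loop for n x n matrices stored as lists of lists."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c

def matmul_tiled(a, b, n, tile):
    """Same computation with all three loops blocked by the given tile size.

    Within a block the loops run in i-k-j order, so the innermost loop walks
    rows of b and c contiguously.
    """
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The `min(...)` bounds handle matrix sizes that are not a multiple of the tile size. With integer inputs the two functions return identical matrices, because tiling only reorders the additions into each `c[i][j]`.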