首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 889 毫秒
1.
 Vectorization and parallelization of programs written in languages where pointers are used is now a subject of increasing interest. The presence of pointers in programs, however, poses new problems to dependence analysis in vectorizing and parallelizing compilers which had been designed to target only at FORTRAN77 programs. In this paper, a new method to analyze dependencies between pointer references in Pascal is proposed, which can also be applied to Fortran 90. It is designed to handle programs with dynamic data structures, such as linear linked lists or trees, which are the most common use of pointers. The method divides into two stages. The first stage is a safe alias analysis which handles any kind of dynamic data structures. The second stage focuses on the specific data structures. It first detects linear linked lists, and then performs dependence analysis between pointer references to the same list. The paper also proposes ways to enhance the second stage. Tree structures are handled here. Loops which manipulate linked lists can now be considered for vectorization by the proposed analysis. Techniques to vectorize such loops are presented in this paper. Some of the proposed algorithms are implemented in V-Pascal, the automatic vectorizing Pascal compiler of our laboratory. The effectiveness of the vectorization of list operations is proved by an experiment on HITAC S-820/80. Received: August 5, 1994/January 17, 1995  相似文献   

2.
Many scientific codes can achieve significant performance improvement when executed on a computer equipped with a vector processor. Vector constructs in source code should be recognized by a vectorizing compiler or preprocessor. This paper discusses, from a general point of view, how a vectorizing compiler/preprocessor can be evaluated. The areas discussed include data dependence analysis, IF loop analysis, nested loops, loop interchanging, loop collapsing, indirect addressing, use of temporary storage, and order of arithmetic. The ideas presented are based on vectorization of over a million lines of production codes and an extensive test suite developed to evaluate preprocessors under varying degrees of code complexity. Areas for future research are also discussed.  相似文献   

3.
Traditionally a vectorizing compiler matches the iterative constructs of a program against a set of predefined templates. If a loop contains no dependency cycles then amaptemplate can be used; other simple dependencies can often be expressed in terms offoldorscantemplates. This paper addresses the template matching problem within the context of functional programming. A small collection of program identities are used to specify vectorizable for-loops. By incorporating these program identities within a monad,allwell-typed for-loops in which the body of the loop is expressed using thevectorization monadcan be vectorized. This technique enables the elimination of template matching from a vectorizing compiler, and the proof of the safety of vectorization can be performed by a type inference mechanism.  相似文献   

4.
Automatic Intra-Register Vectorization for the Intel® Architecture   总被引:2,自引:0,他引:2  
Recent extensions to the Intel® Architecture feature the SIMD technique to enhance the performance of computational intensive applications that perform the same operation on different elements in a data set. To date, much of the code that exploits these extensions has been hand-coded. The task of the programmer is substantially simplified, however, if a compiler does this exploitation automatically. The high-performance Intel® C++/Fortran compiler supports automatic translation of serial loops into code that uses the SIMD extensions to the Intel® Architecture. This paper provides a detailed overview of the automatic vectorization methods used by this compiler together with an experimental validation of their effectiveness.  相似文献   

5.
A Vectorizing Compiler for Multimedia Extensions   总被引:6,自引:0,他引:6  
In this paper, we present an implementation of a vectorizing C compiler for Intel's MMX (Multimedia Extension). This compiler would identify data parallel sections of the code using scalar and array dependence analysis. To enhance the scope for application of the subword semantics, our compiler performs several code transformations. These include strip mining, scalar expansion, grouping and reduction, and distribution. Thereafter inline assembly instructions corresponding to the data parallel sections are generated. We have used the Stanford University Intermediate Format (SUIF), a public domain compiler tool, for our implementation. We evaluated the performance of the code generated by our compiler for a number of benchmarks. Initial performance results reveal that our compiler generated code produces a reasonable performance improvement (speedup of 2 to 6.5) over the the code generated without the vectorizing transformations/inline assembly. In certain cases, the performance of the compiler generated code is within 85% of the hand-tuned code for MMX architecture.  相似文献   

6.
在Open64编译框架基础上,提出一种基于Profile信息的循环内数据访问连续性分析算法及其向量化优化方法。采用反馈式编译优化技术,获取程序运行时的连续性Profile信息,通过结构体剥离和数据重组方法实现程序向量化。实验结果表明,该算法针对不规则程序代码,可提供更精确的向量化信息,提高程序的向量化程度。  相似文献   

7.
The structure and implementation of P3C, a compiler that uses relatively simple parallelization techniques to produce highly efficient code for both shared-memory and distributed-memory MIMD multiprocessors, are discussed. P3C is a fully automatic, portable parallelizing Pascal compiler for scientific code, which is characterized by loops operating on regular data structures. P3 C compiles the same source code to all target machines without modification. P3C gives an accurate estimate about whether the parallel code will actually reduce execution time over serial code, taking into account the associated overhead. It derives the estimate by statically analyzing the program at compile time, referring to a table of the target machine's parameters. Speedups of up to 24 have been achieved on a Sequent Symmetry with 25 active processors. The estimated speedup was within 8% of the actual figure  相似文献   

8.
随着向量长度的不断增长, SIMD扩展部件得以处理更为庞大的数据级并行, 但程序的并行阈值也随之提高. 对于现有的自动向量化编译器, 如果在分析阶段不能从串行代码中发掘出足够的数据级并行以完全填充向量寄存器, 则不会进入相应的向量代码变换阶段, 从而无法向量化. 较长的向量长度使得某些并行性不足的程序失去了向量化的机会, 造成了性能下降. 为了更加充分的利用SIMD部件, 介绍了一种面向基本块的非满载向量化方法ISLP. 基于开源GCC编译器, 从并行性检测、代码生成和代价模型3个方面详细阐述了ISLP的设计与实现. 在标准测试集上的实验结果表明, 该方法可以有效地对超字级并行性不足的程序进行向量化处理, 提高程序执行效率. 选取的测试用例在向量化后的平均加速比达到1.14, 性能较常规SLP方法提升11.8%.  相似文献   

9.
Data dependence and its application to parallel processing   总被引:8,自引:0,他引:8  
Data dependence testing is required to detect parallelism in programs. In this paper data dependence concepts and data dependence direction vectors are reviewed. Data dependence computation in parallel and vector constructs as well as serialdo loops is covered. Several transformations that require data dependence are given as examples, such as vectorization (translating serial code into vector code), concurrentization (translating serial code into concurrent code for multiprocessors), scalarization (translating vector or concurrent code into serial code for a scalar uniprocessor), loop interchanging and loop fusion. The details of data dependence testing including several data dependence decision algorithms are given.  相似文献   

10.
An introduction to a formal theory of dependence analysis   总被引:3,自引:1,他引:2  
Dependence analysis is a very important part of any vectorizing or concurrentizing compiler. This paper is an introduction to a formal theory of dependence analysis. The emphasis here is on rigor —the subject matter is not new. The program model is a Fortran do loop consisting of loops and assignment statements. We carefully explain the key dependence concepts and indicate through examples how the dependence tests work. These ideas and methods can be readily extended to more general programs.  相似文献   

11.
Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as are available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators.  相似文献   

12.
13.
赵捷  赵荣彩  丁锐  黄品丰 《软件学报》2012,23(10):2695-2704
传统的分布存储并行编译系统大多是在共享存储并行编译系统的基础上开发的.共享存储并行编译系统的并行识别技术适合OpenMP代码生成,实现方式是将所有嵌套循环都按照相同的识别方法进行处理,用于分布存储并行编译系统必然会导致无法高效发掘程序的并行性.分布存储并行编译系统应根据嵌套循环结构的特点进行分类处理,提出适合MPI代码生成的并行识别技术.为解决上述问题,根据嵌套循环的结构和MPI并行程序的特点,提出了一种新的嵌套循环分类方法,并针对不同的嵌套循环分别提出了相应的并行识别技术.实验结果表明,与采用传统并行识别技术的分布存储并行编译系统相比,按照所提方法对嵌套循环进行分类,采用相应并行识别技术的编译系统能够更高效地识别基准程序中的并行循环,自动生成的MPI并行代码其性能加速比提高了20%以上.  相似文献   

14.
作为SIMD扩展部件向量化的重要手段,自动向量化已在LLVM编译器中得到实现,但向量长度以及指令集功能的差异,导致国产平台在自动向量化过程中容易错失向量化机会以及向量化后产生倒加速的问题。为使SIMD得到充分应用,结合国产平台的指令集特征完善指令代价信息以提高收益分析精准度,使其在自动向量化后生成后端支持且简洁高效的向量指令。在此基础上,提出一种改进的控制流向量化方法,通过添加指令代价信息提高自动向量化的适配能力,从而形成一套面向国产平台的LLVM自动向量化系统。实验结果表明,相比自动向量化移植前,通过该方法进行移植优化后,SPEC测试的整体性能提升10.8%,TSVC测试集中的加速比提升16%,精准代价指导下的加速比提升42%,控制流向量化下的加速比提升51%。  相似文献   

15.
姚远  赵荣彩 《计算机工程》2012,38(12):272-275
编译器由于程序分析能力不足,无法自动实现循环向量化或者会造成盲目自动向量化。为此,提出一种基于编译指示的向量化方法。通过在代码中插入向量化编译指示语句,指导自动向量化编译工具的处理过程,自动生成高效的向量化代码。测试结果表明,该方法能够有效提高目标代码的运行性能。  相似文献   

16.
Linear recurrences are the most important class of nonvectorizable problems in typical scientific/engineering calculations. This work discusses high-performance methods for solving first-order linear recurrences on a vector computer, investigates automatic transformations, and develops compiling techniques for first-order linear recurrence problems. The results show that the improved vector code generated by the vectorizing compiler on the HITAC S-820 supercomputer runs at the rate of 150 MFLOPS (million floating operations per second) for moderate loop lengths (>1000) and over 200 MFLOPS for long loop lengths (> 10000). Also, overall performance improvements of 69% in the 14 Lawrence Livermore Loops and 25 % in the 24 Lawrence Livermore Loops, as measured by the harmonic mean, are attained.  相似文献   

17.
韩林  徐金龙  李颖颖  王阳 《计算机科学》2017,44(2):70-74, 81
大量循环中都存在着少数无法向量化的语句以及许多可向量化语句,循环分布通常可以将这些语句分离到不同的循环中,进而实现循环的部分向量化。目前主流的优化编译器仅支持简单激进的循环分布方法,因而导致向量化后的循环开销过大,且不利于寄存器和cache的重用。针对上述问题,提出了面向部分向量化的循环分布及聚合方法。首先,分析了一般循环分布的两个关键问题:语句集的划分和循环执行顺序的确定;其次,提出了面向最大聚合的凝聚图结点排序方法来指导循环合并,在不影响并行性的前提下减小了循环开销;最后,通过实验对提出的方法进行了验证。实验结果表明,对于测试用例,提出的方法能够生成正确的向量化代码,并且能够显著提高向量化程序的执行效率。  相似文献   

18.
李朋远  赵荣彩  高伟  张庆花 《计算机科学》2015,42(5):194-199, 203
随着SIMD扩展部件的迅速发展,自动向量化工具已逐渐成熟.现阶段的工具能对连续访存程序进行较好的处理,然而,大部分非连续访存的多媒体程序并不能被转换为高效的向量化代码.提出并实现了一种支持跨幅访存的向量化代码生成方法,其利用目标系统已有的基本数据处理指令实现多个向量间的任意重组来解决含有非连续访存语句的向量化代码生成问题.经过实验分析和验证,提出的代码生成方法能够将含有跨幅访存的语句转化为面向目标系统的高效向量化代码,以提高程序执行效率.  相似文献   

19.
1 引言在微处理器设计中,开发指令级并行(ILP)以提高微处理器系统的性能受到了很大的限制。研究更大发射的超标量微处理器已经是一件极其复杂而没有意义的事。但是如果在开发指令级并行的同时,开发数据级并行,理论分析表明可以显著提高微处理器的性能,微处理器的等效IPC(每个时钟周期发射的指令条数)和超标量微处理器相比可以提高20~40多倍。因此,在微处理器系统设计中开发数据级并行具有重要的理论意义和实用价值。  相似文献   

20.
面向SLP 的多重循环向量化   总被引:1,自引:0,他引:1  
魏帅  赵荣彩  姚远 《软件学报》2012,23(7):1717-1728
如今,越来越多的处理器集成了SIMD(single instruction multiple data)扩展,现有的编译器大多也实现了自动向量化的功能,但是一般都只针对最内层循环进行向量化,对于多重循环缺少一种通用、易行的向量化方法.为此,提出了一种面向SLP(superword level parallelism)的多重循环向量化方法,从外至内依次对各个循环层次进行分析,收集各层循环对应的一些影响向量化效果的属性值,主要包括能否对该循环进行直接循环展开和压紧、有多少数组引用相对于该循环索引连续以及该循环所包含的区域等,然后根据这些属性值决定在哪些循环层次进行直接循环展开和压紧,最后通过SLP对循环中的语句进行向量化.实验结果表明,该算法相对于内层循环向量化和简单的外层循环向量化平均加速比提升了2.13和1.41,对于一些常用的核心循环可以得到高达5.3的加速比.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号