期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Loop Staggering,Loop Compacting:Restructuring Techniques for Thrashing Problem 总被引：1，自引：0，他引：1

Jin Guohua Yang Xuejun Chen Fujie 《计算机科学技术学报》1993,8(1):49-57

Parallel loops account for the greatest amount of parallelism in numerical programs.Executing nested loops in parallel wit low run-time overhead is thus very important for achieving high performance in paralleo processing systems.However,in parallel processing systems with caches of local memories in memory hierarchies,“thrashing problemmay” may arise when data move back and forth frequently between the caches or local memories in different processors.The techniques associated with parallel compiler to solve the problem are not completely developed.In this paper,we present two restructuring techniques called loopg staggering,loop staggering and compacting,with which we can not only eliminate the cache or local memory thrashing phemomena significantly,but also exploit the potential parallelism existing in outer serial loop.Loop staggering benefits the dynamic loop scheduling strategies,whereas loop staggering and compacting is good for static loop scheduling strategies,Our method especially benefits parallel programs,in which a parallel loop is enclosed by a serial loop and array elements are repeatedly used in the different iterations of the parallel loop. 相似文献

2.

Achieving full parallelism using multidimensional retiming

Passos N.L. Sha E.H.-M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(11):1150-1163

Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms 相似文献

3.

复杂访问模式下假共享Cache行抖动的消除

金国华陈福接《计算机学报》1994,17(6):446-455

在详细讨论了简单数据组访问模式下假共享抖动现象及其消除方法的基础上，本文着重分析了复杂访问模式下的假共享Ｃａｃｈｅ行抖动现象和真假共享抖动并存现象，引入了并行循环访问距概念，提出了消除假共享抖动的编译方法－块化错位方法。结合块化错位方法，我们提出了多维数组的数组扩展思想，给出了多重嵌套循环含多次写访问情况下减少或消除抖动的算法。相似文献

4.

Coarse-grained loop parallelization: Iteration Space Slicing vs affine transformations

Anna Beletska Albert Cohen 《Parallel Computing》2011,37(8):479-497

Automatic coarse-grained parallelization of program loops is of great importance for parallel computing systems. This paper presents the theory of Iteration Space Slicing aimed at extracting synchronization-free parallelism available in arbitrarily nested program loops. We demonstrate that Iteration Space Slicing algorithms permits for extracting more coarse-grained parallelism than that extracted by means of the Affine Transformation Framework provided that we are able to calculate the transitive closure of the union of relations describing all dependences in the affine loop. Experimental results show that by means of Iteration Space Slicing algorithms, we are able to extract coarse-grained parallelism for many loops of NAS and UTDSP benchmarks. Problems to be resolved in order to enhance the theory of Iteration Space Slicing are discussed. 相似文献

5.

EIGENVECTORS-BASED PARALLELISATION OF NESTED LOOPS WITH AFFINE DEPENDENCES

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(3):227-248

Abstract

This paper presents a method for parallelising nested loops with affine dependences. The data dependences of a program are represented exactly using a dependence matrix rather than an imprecise dependence abstraction. By a careful analysis of the eigenvectors and eigenvalues of the dependence matrix, we detect the parallelism inherent in the program, partition the iteration space of the program into sequential and parallel regions, and generate parallel code to execute these regions. For a class of programs considered in the paper, the proposed method can expose more coarse-grain and fine-grain parallelism than a hyperplane-based loop transformation. 相似文献

6.

EFFECTIVE PARALLELIZATION TECHNIQUES FOR LOOP NESTS WITH NON-UNIFORM DEPENDENCES

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1):37-64

The parallelism of loop nests with non-uniform dependences is difficult to extract and ineffectively explored by the existing parallelization schemes. In this paper, we propose new efficient techniques in extracting parallelism of loop nests with non-uniform dependences using their irregularity. By this way, current highly parallel multiprocessor systems such as multithreaded and clustering multiprocessor systems can be fully utilized. These four mechanisms are (a) parallelization part splitting, (b) partial parallelization decomposition, (c) irregular loop interchange and (d) growing pattern detection. They explore parallelisms of special parallel patterns for nested loops with non-uniform dependences. The loop transformations used in uniform loops are also applied in non-uniform dependence loops after legality tests. We apply the results of classical convex theory and detect special parallel patterns of dependence vectors. We also proposed an algorithm that combines above mechanisms to enhance parallelism. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. Thus, we are encouraged by these apparent enhancements to pursue further development. 相似文献

7.

基于嵌套循环分类的并行识别技术

赵捷赵荣彩丁锐黄品丰《软件学报》2012,23(10):2695-2704

传统的分布存储并行编译系统大多是在共享存储并行编译系统的基础上开发的.共享存储并行编译系统的并行识别技术适合OpenMP代码生成,实现方式是将所有嵌套循环都按照相同的识别方法进行处理,用于分布存储并行编译系统必然会导致无法高效发掘程序的并行性.分布存储并行编译系统应根据嵌套循环结构的特点进行分类处理,提出适合MPI代码生成的并行识别技术.为解决上述问题,根据嵌套循环的结构和MPI并行程序的特点,提出了一种新的嵌套循环分类方法,并针对不同的嵌套循环分别提出了相应的并行识别技术.实验结果表明,与采用传统并行识别技术的分布存储并行编译系统相比,按照所提方法对嵌套循环进行分类,采用相应并行识别技术的编译系统能够更高效地识别基准程序中的并行循环,自动生成的MPI并行代码其性能加速比提高了20%以上. 相似文献

8.

A compiler optimization algorithm for shared-memory multiprocessors

McKinley K.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):769-787

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations. It also optimizes across procedures by using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel, Fortran programs operating over dense matrices. The programs initially were hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives, and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program, and show that our algorithm improves three programs, matches four programs, and degrades one program in our test suite on a shared-memory, bus-based parallel machine with local caches. This experiment suggests existing dependence and interprocedural array analysis can automatically detect user parallelism, and demonstrates that user parallelized codes often benefit from our compiler optimizations, providing evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines 相似文献

9.

基于LLVM Pass的复杂嵌套循环自动并行化框架

马春燕吕炳旭叶许姣张雨《软件学报》2023,34(7):3022-3042

随着多核处理器的普及应用,针对嵌入式遗留系统中串行代码的自动并行化方法是研究热点.其中,针对具有非完美嵌套结构、非仿射依赖关系特征的复杂嵌套循环的自动并行化方法存在技术挑战.提出了一种基于LLVMPass的复杂嵌套循环的自动并行化框架(CNLPF).首先,提出了一种复杂嵌套循环的表示模型,即循环结构树,并将嵌套循环的正则区域自动转换为循环结构树表示;然后,对循环结构树进行数据依赖分析,构建循环内和循环间的依赖关系;最后,基于OpenMP共享内存的编程模型生成并行的循环程序.针对SPEC2006数据集中包含近500个复杂嵌套循环的6个程序案例,分别对其进行复杂嵌套循环占比统计和并行性能加速测试.结果表明,提出的自动并行化框架可以处理LLVMPolly无法优化的复杂嵌套循环,增强了LLVM的并行编译优化能力,且该方法结合Polly的组合优化,比单独采用Polly优化的加速效果提升了9%-43%. 相似文献

10.

PFACC: An OpenACC-like programming model for irregular nested parallelism

Ming Hsiang Huang Wuu Yang 《Software》2020,50(10):1877-1904

OpenACC is a directive-based programming model which allows programmers to write graphic processing unit (GPU) programs by simply annotating parallel loops. However, OpenACC has poor support for irregular nested parallel loops, which are natural choices to express nested parallelism. We propose PFACC, a programming model similar to OpenACC. PFACC directives can be used to annotate parallel loops and to guide data movement between different levels of memory hierarchy. Parallel loops can be arbitrarily nested or be placed inside functions that would be (possibly recursively) called in other parallel loops. The PFACC translator translates C programs with PFACC directives into CUDA programs by inserting runtime iteration-sharing and memory allocation routines. The PFACC runtime iteration-sharing routine is a two-level mechanism. Thread blocks dynamically organize loop iterations into batches and execute the batches in a depth-first order. Different thread blocks share iterations among one another with an iteration-stealing mechanism. PFACC generates CUDA programs with reasonable memory usage because of the depth-first execution order. The two-level iteration-sharing mechanism is implemented purely in software and fits well with the CUDA thread hierarchy. Experiments show that PFACC outperforms CUDA dynamic parallelism in terms of performance and code size on most benchmarks. 相似文献

11.

测试任一嵌套DO循环置换合法性的阻碍矩阵算法

林源朱传琪《计算机工程与科学》1996,18(4):1-6

对程序进行并行变换是提高程序并行性的有效手段，许多并行变换都要寻找一种最优的循不置换，在寻找过程中，如果对每一个被考察的置换都重新进行相关性测试，那么整个寻找过程将极费时间，本文给出了一个测试嵌套循环任一置换的阻碍矩阵测试算法，它将测试一循环置换的合成法性转化为测试一组向前置换的合法性，并且仅需要嵌套循循环做一遍相关性测试，利用该算法可以简便迅速检查任一循环置换的合法性，从而使许多并行变换变得实际可相似文献

12.

Loops skewing: The wavefront method revisited

Michael Wolfe 《International journal of parallel programming》1986,15(4):279-293

Loop skewing is a new procedure to derive the wavefront method of execution of nested loops. The wavefront method is used to execute nested loops on parallel and vector computers when none of the loops can be done in vector mode. Loop skewing is a simple transformation of loop bounds and is combined with loop interchanging to generate the wavefront. This derivation is particularly suitable for implementation in compilers that already perform automatic detection of parallelism and generation of vector and parallel code, such as are available today. Loop normalization, a loop transformation used by several vectorizing translators, is related to loop skewing, and we show how loop normalization, applied blindly, can adversely affect the parallelism detected by these translators. 相似文献

13.

On the Problem of Optimizing Parallel Programs for Complex Memory Hierarchies

下载免费PDF全文

Jin Guohua Chen Fujie 《计算机科学技术学报》1994,9(1):1-26

Based on a thorough study of the relationship between array element accesses and loop indices of the nested loop,a method is presented with which the staggering relation and the compacting relation between the threads of the nested loop (either with a single linear function of with multiple linear functions) can be determined at compile-time,and accordingly the nested loop (either perfectly nested one or imperfectly nested one) can be restructured to avoid the thrashing problem.Due to its simplicity,our method can be efficiently implemented in any parallel compiler,and the improvement of the performance is significant as shown be the experimental results. 相似文献

14.

一个有效的并行分析算法 总被引：3，自引：0，他引：3

胡永刚乔如良《计算机学报》1999,22(2):134-140

并行分析在并行编译系统中有着很重要的作用,它的优劣直接影响到编译系统的成败,随着机群系统及其并行开发环境的发展,多数的并行系统可支持多重并行循环的运行。而对只支持一重并行循环的编程系统,选择并行运行效率最高的循环,也是很重要的。为此,本文提出了一个有效的循环并行分析方案,它不但能给出多层循环的并行性,而且能够处理绝大部分实际应用中的并行性问题,本文对传统的并行分析算法进行修改,并给出了一个有效的并相似文献

15.

Loop Parallelization using the 3D Iteration Space Visualizer

《Journal of Visual Languages and Computing》2001,12(2):163-181

A 3D iteration space visualizer (ISV) is presented to analyze the parallelism in loops and to find loop transformations which enhance the parallelism. Using automatic program instrumentation, the iteration space dependency graph (ISDG) is constructed, which shows the exact data dependencies of arbitrarily nested loops. Various graphical operations such as rotation, zooming, clipping, coloring and filtering, permit a detailed examination of the dependence relations. Furthermore, an animated dataflow execution shows the maximal parallelism and the parallel loops are indicated automatically by an embedded data dependence analysis. In addition, the user may discover and indicate additional parallelism for which a suitable unimodular loop transformation is calculated and verified. The ISV has been applied to parallelize algorithmic kernel programs, a computational fluid dynamics (CFD) simulation code, the detection of statement-level parallelism and loop variable privatization. The applications show that the visualizer is a versatile and easy to use tool for the high-performance application programmer. 相似文献

16.

Using knowledge-based systems for research on parallelizing compilers

Chao-Tung Yang Shian-Shyong Tseng Yun-Woei Fann Ting-Ku Tsai Ming-Huei Hsieh Cheng-Tien Wu 《Concurrency and Computation》2001,13(3):181-208

相似文献

17.

On loop transformations for generalized cycle shrinking

Weijia Shang O'Keefe M.T. Fortes J.A.B. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(2):193-204

This paper describes several loop transformation techniques for extracting parallelism from nested loop structures. Nested loops can then be scheduled to run in parallel so that execution time is minimized. One technique is called selective cycle shrinking, and the other is called true dependence cycle shrinking. It is shown how selective shrinking is related to linear scheduling of nested loops and how true dependence shrinking is related to conflict-free mappings of higher dimensional algorithms into lower dimensional processor arrays. Methods are proposed in this paper to find the selective and true dependence shrinkings with minimum total execution time by applying the techniques of finding optimal linear schedules and optimal and conflict-free mappings proposed by W. Shang and A.B. Fortes 相似文献

18.

COMPILE TIME PARTITIONING OF NESTED LOOP ITERATION SPACES WITH NON-UNIFORM DEPENDENCES*

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141

In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences these will have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. 相似文献

19.

选择性循环的并行方法

下载免费PDF全文

吴悦雷超付杨洪斌《计算机工程》2010,36(9):35-37,40

针对含有大量循环的串行程序存在的问题,提出一种基于线程级前瞻技术的循环选择方案。该方案对循环进行最优选择后建立一个可并行运行的循环集。对于该集合中的循环,选择并行效率高的代码段作并行处理,以加快串行程序运行速度。实验表明,相对于一般的简单内部循环或外部循环并行方法,该方案使9种基准代码的加速比平均上升23.8%,从而提高串行程序并行运行的效率。相似文献

20.

选择性循环的并行方法

下载免费PDF全文

吴悦雷超付杨洪斌《计算机工程》2010,36(9):35-37,4

针对含有大量循环的串行程序存在的问题,提出一种基于线程级前瞻技术的循环选择方案。该方案对循环进行最优选择后建立一个可并行运行的循环集。对于该集合中的循环,选择并行效率高的代码段作并行处理,以加快串行程序运行速度。实验表明,相对于一般的简单内部循环或外部循环并行方法,该方案使9种基准代码的加速比平均上升23.8%,从而提高串行程序并行运行的效率。相似文献