期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An efficient code generation technique for tiled iteration spaces

Goumas G. Athanasaki M. Koziris N. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(10):1021-1034

This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code. 相似文献

2.

COMPILE TIME PARTITIONING OF NESTED LOOP ITERATION SPACES WITH NON-UNIFORM DEPENDENCES*

《International Journal of Parallel, Emergent and Distributed Systems》2012,27(1-3):113-141

In this paper we address the problem of partitioning nested loops with non-uniform (irregular) dependence vectors. Parallelizing and partitioning of nested loops requires efficient inter-iteration dependence analysis. Although many methods exist for nested loop partitioning, most of these perform poorly when parallelizing nested loops with irregular dependences. Unlike the case of nested loops with uniform dependences these will have a complicated dependence pattern which forms a non-uniform dependence vector set. We apply the results of classical convex theory and principles of linear programming to iteration spaces and show the correspondence between minimum dependence distance computation and iteration space tiling. Cross-iteration dependences are analyzed by forming an Integer Dependence Convex Hull (IDCH). Every integer point in this IDCH corresponds to a dependence vector in the iteration space of the nested loops. A simple way to compute minimum dependence distances from the dependence distance vectors of the extreme points of the IDCH is presented. Using these minimum dependence distances the iteration space can be tiled. Iterations within a tile can be executed in parallel and the different tiles can then be executed with proper synchronization. We demonstrate that our technique gives much better speedup and extracts more parallelism than the existing techniques. 相似文献

3.

Tiling Nested Loops into Maximal Rectangular Blocks

Yeong-Sheng Chen Sheng-De Wang Chien-Min Wang 《Journal of Parallel and Distributed Computing》1996,35(2):123

In this paper, an approach to tiling nested loops for maximizing parallelism is proposed. The proposed method aims at aggregating independent computations of a loop nest into rectangular blocks and maximizing the block sizes for maximizing parallelism. At first, all the independent computations that can be executed in the first time unit are identified. These computations are called the initially independent computations. Then it is shown that all of them can be collected as a union of rectangular blocks. So, based on these, the entire iteration space of the loops is partitioned into rectangular blocks for maximizing parallelism. The proposed method is formulated as systematic procedures which can easily be implemented in a parallelizing compiler. It is shown that when the wavefront transformation is combined with the proposed method, the loops can always be tiled so that the tile size is greater than one. In comparison with previous work on tiling, the proposed method is shown to have several advantages as summarized in the conclusions of this paper. 相似文献

4.

Optimal semi-oblique tiling 总被引：1，自引：0，他引：1

Andonov R. Balev S. Rajopadhye S. Yanev N. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(9):944-960

For 2D iteration space tiling, we address the problem of determining the tile parameters that minimize the total execution time on a parallel machine. We consider uniform dependency computations tiled so that (at least) one of the tile boundaries is parallel to the domain boundaries. We determine the optimal tile size as a closed form solution. In addition, we determine the optimal number of processors and also the optimal slope of the oblique tile boundary. Our results are based on the BSP model, which assures the portability of the results. Our predictions are justified on a sequence global alignment problem specialized to similar sequences using Fickett's k-band algorithm, for which our optimal semi-oblique tiling yields an improvement of a factor of 2.5 over orthogonal tiling. Our optimal solution requires a block-cyclic distribution of tiles to processors. The best one can obtain with only block distribution (as many authors require) is three times slower. Furthermore, our best running time is within 10 percent of the "predicted theoretical peak" performance of the machine!. 相似文献

5.

Reuse-Driven Tiling for Improving Data Locality

Jingling Xue Chua-Huang Huang 《International journal of parallel programming》1998,26(6):671-696

This paper applies unimodular transformations and tiling to improve data locality of a loop nest. Due to data dependences and reuse information, not all dimensions of the iteration space will and can be tiled. By using cones to represent data dependences and vector spaces to quantify data reuse in the program, a reuse-driven transformational approach is presented, which aims at maximizing the amount of data reuse carried in the tiled dimensions of the iteration space while keeping the number of tiled dimensions to a minimum (to reduce loop control overhead). In the special case of one single fully permutable loop nest, an algorithm is presented that tiles the program optimally so that all data reuse is carried in the tiled dimensions. In the general case of multiple fully permutable loop nests, data dependences can prevent all data reuse to be carried in the tiled dimensions. An algorithm is presented that aims at localizing data reuse in the tiled dimensions so that the reuse space localized has the largest dimensionality possible. 相似文献

6.

一类基于迭代空间条块的并行有限差分Stencil 算法

张纪林狄鹏蒋从锋张伟徐向华万健任永坚《软件学报》2010,21(Z1):270-283

高效的并行有限差分Stencil 算法对于求解大型线性方程组是十分重要的.针对并行有限差分Stencil 算法中数据局部性差、同步和通信开销大的问题.首先改进传统有限差分Stencil 算法,提出了多层对称遍历有限差分Stencil 算法.然后给出了以迭代空间条块序作为执行序的串行算法,通过沿时间轴对迭代空间进行时滞划分,在不改变迭代算法性质的同时,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行算法,该算法利用改进的多面体模型对迭代空间网格划分,并通过网格条块重排序减少了Cache 缺失率、通信启动和同步次数.理论分析和实验结果表明,该并行模型比传统的区域分解方法和红黑排序并行算法具有更好的数据局部性,并行效率和可扩展性. 相似文献

7.

迭代空间交错条块并行Gauss-Seidel算法 总被引：1，自引：0，他引：1

胡长军张纪林王珏李建江《软件学报》2008,19(6):1274-1282

针对并行GS(Gauss-Seidel)迭代算法中数据局部性差、同步和通信开销大的问题,首先改进传统GS迭代,提出了多层对称GS迭代算法.然后给出了以迭代空间条块序作为执行序的串行执行模型.该模型通过对迭代空间进行"时滞"划分,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行执行模型.该模型改进了迭代空间网格划分,并通过网格条块重排序减少了cache缺失率、通信启动和同步次数.实验结果表明,迭代空间交错条块并行算法比传统的区域分解方法和红黑排序并行算法具有更好的并行效率和可扩展性. 相似文献

8.

Parallel loop generation and scheduling

Shahriar Lotfi Saeed Parsa 《The Journal of supercomputing》2009,50(3):289-306

Loop tiling is an efficient loop transformation, mainly applied to detect coarse-grained parallelism in loops. It is a difficult task to apply n-dimensional non-rectangular tiles to generate parallel loops. This paper offers an efficient scheme to apply non-rectangular n-dimensional tiles in non-rectangular iteration spaces, to generate parallel loops. In order to exploit wavefront parallelism efficiently, all the tiles with equal sum of coordinates are assumed to reside on the same wavefront. Also, in order to assign parallelepiped tiles on each wavefront to different processors, an improved block scheduling strategy is offered in this paper. 相似文献

9.

Optimal Orthogonal Tiling of 2-D Iterations

Rumen Andonov Sanjay Rajopadhye 《Journal of Parallel and Distributed Computing》1997,45(2):1324

Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely tile periodP_tand intertile latencyL_t. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed form solution, which can be used by a compiler before code generation. 相似文献

10.

Tiling on systems with communication/computation overlap

Pierre-Yves Calland Jack Dongarra Yves Robert 《Concurrency and Computation》1999,11(3):139-153

In the framework of fully permutable loops, tiling is a compiler technique (also known as ‘loop blocking’) that has been extensively studied as a source-to-source program transformation. Little work has been devoted to the mapping and scheduling of the tiles on to physical parallel processors. We present several new results in the context of limited computational resources and assuming communication–computation overlap. In particular, under some reasonable assumptions, we derive the optimal mapping and scheduling of tiles to physical processors. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献

11.

A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping

《Journal of Parallel and Distributed Computing》2003,63(11):1138-1151

This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication to computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part from every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are not anymore interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We performed two sets of experimental results, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced. 相似文献

12.

Automatic partitioning of parallel loops and data arrays fordistributed shared-memory multiprocessors

Agarwal A. Kranz D.A. Natarajan V. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(9):943-962

Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor 相似文献

13.

向量化友好的循环分块因子选择算法

下载免费PDF全文

柴晓菲刘松屈彬王倩伍卫国《计算机工程与应用》2020,56(15):37-42

具有病态规模的嵌套循环程序在进行循环分块时容易忽略分块因子对向量化的影响,导致非对齐数据访问,降低分块后循环代码的性能。提出了一种向量化友好的循环分块因子选择算法VEC-TSS。该算法对可向量化循环层以向量化收益分析确定分块因子,对其他循环层通过以局部性收益和并行粒度确定分块因子。实验结果表明,针对具有病态规模的循环程序,VEC-TSS算法与另外两种分块因子选择算法相比可以获得更好的程序加速比,同时具有良好的可扩展性。相似文献

14.

一种面向服务器制图可视化的矢量数据多尺度组织方法 总被引：1，自引：0，他引：1

孙璐陈荦刘露苏德国《计算机工程与科学》2014,36(2):226-232

提出了一种面向服务器制图可视化的矢量数据多尺度组织方法。基于矢量数据瓦片化思想,将矢量数据按照全球地理空间金字塔索引模型划分为层次化瓦片数据,将服务器制图可视化处理中对数据图层的空间查询操作,转化为对瓦片数据的数据读取操作。实验及应用表明,该方法减少了数据读取时间,降低了I/O代价,提高了矢量数据服务器制图可视化的整体性能。相似文献

15.

Real-Time Texture Synthesis Using s-Tile Set 总被引：3，自引：0，他引：3

下载免费PDF全文

Feng Xue You-Sheng Zhang Ju-Lang Jiang Min Hu Xin-Dong Wu and Rong-Gui Wang 《计算机科学技术学报》2007,22(4):590-596

This paper presents a novel method of generating a set of texture tiles from samples,which can be seamlessly tiled into arbitrary size textures in real-time.Compared to existing methods,our approach is simpler and more advantageous in eliminating visual seams that may exist in each tile of the existing methods,especially when the samples have elaborate features or distinct colors.Texture tiles generated by our approach can be regarded as single-colored tiles on each orthogonal direction border,which are easier for tiling and more suitable for sentence tiling.Experimental results demonstrate the feasibility and effectiveness of our approach. 相似文献

16.

A Loop Transformation Algorithm for Communication Overlapping

Kazuaki Ishizaki Hideaki Komatsu Toshio Nakatani 《International journal of parallel programming》2000,28(2):135-154

Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization. 相似文献

17.

Polyominoes simulating arbitrary-neighborhood zippers and tilings

Lila Kari Benoît Masson 《Theoretical computer science》2011,412(43):6083-6100

This paper provides a bridge between the classical tiling theory and the complex-neighborhood self-assembling situations that exist in practice.The neighborhood of a position in the plane is the set of coordinates which are considered adjacent to it. This includes classical neighborhoods of size four, as well as arbitrarily complex neighborhoods. A generalized tile system consists of a set of tiles, a neighborhood, and a relation which dictates which are the “admissible” neighboring tiles of a given tile. Thus, in correctly formed assemblies, tiles are assigned positions of the plane in accordance with this relation.We prove that any validly tiled path defined in a given but arbitrary neighborhood (a zipper) can be simulated by a simple “ribbon” of microtiles. A ribbon is a special kind of polyomino, consisting of a non-self-crossing sequence of tiles on the plane, in which successive tiles stick along their adjacent edge.Finally, we extend this construction to the case of traditional tilings, proving that we can simulate arbitrary-neighborhood tilings by simple-neighborhood tilings, while preserving some of their essential properties. 相似文献

18.

A cost-effective implementation of multilevel tiling

Jimenez M. Llaberia J.M. Fernandez A. 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(10):1006-1020

This paper presents a new cost-effective algorithm to compute exact loop bounds when multilevel tiling is applied to a loop nest having affine functions as bounds (nonrectangular loop nest). Traditionally, exact loop bounds computation has not been performed because its complexity is doubly exponential on the number of loops in the multilevel tiled code and, therefore, for certain classes of loops (i.e., nonrectangular loop nests), can be extremely time consuming. Although computation of exact loop bounds is not very important when tiling only for cache levels, it is critical when tiling includes the register level. This paper presents an efficient implementation of multilevel tiling that computes exact loop bounds and has a much lower complexity than conventional techniques. To achieve this lower complexity, our technique deals simultaneously with all levels to be tiled, rather than applying tiling level by level as is usually done. For loop nests having very simple affine functions as bounds, results show that our method is between 15 and 28 times faster than conventional techniques. For loop nests caving not so simple bounds, we have measured speedups as high as 2,300. Additionally, our technique allows eliminating redundant bounds efficiently. Results show that eliminating redundant bounds in our method is between 22 and 11 times faster than in conventional techniques for typical linear algebra programs. 相似文献

19.

A New Genetic Algorithm for Loop Tiling

Saeed Parsa Shahriar Lotfi 《The Journal of supercomputing》2006,37(3):249-269

Tiling is a known problem especially in the field of computational geometry and its related engineering branches. In fact, a tile is a set of points in the Cartesian space. The goal is to partition the space of the points as tiles with optimal dimensions and shapes such that a number of predefined semantic relations holds amongst the tiles. So far, this problem has been solved in special cases with two or three dimensions. The problem of determining the optimal tile is an NP-Hard problem. Presenting a novel constraint genetic algorithm in this paper, we have been able to solve the tiling problem in Cartesian spaces with more than two dimensions, for the loop parallelization problem. 相似文献

20.

一个有效的并行分析算法 总被引：3，自引：0，他引：3

胡永刚乔如良《计算机学报》1999,22(2):134-140

并行分析在并行编译系统中有着很重要的作用,它的优劣直接影响到编译系统的成败,随着机群系统及其并行开发环境的发展,多数的并行系统可支持多重并行循环的运行。而对只支持一重并行循环的编程系统,选择并行运行效率最高的循环,也是很重要的。为此,本文提出了一个有效的循环并行分析方案,它不但能给出多层循环的并行性,而且能够处理绝大部分实际应用中的并行性问题,本文对传统的并行分析算法进行修改,并给出了一个有效的并相似文献