1.

A Matrix-Based Approach to Global Locality Optimization

《Journal of Parallel and Distributed Computing》1999,58(2):190-235

Global locality optimization is a technique for improving the cache performance of a sequence of loop nests through a combination of loop and data layout transformations. Pure loop transformations are restricted by data dependencies and may not be very successful in optimizing imperfectly nested loops and explicitly parallelized programs. Although pure data transformations are not constrained by data dependencies, the impact of a data transformation on an array might be program-wide; that is, it can affect all the references to that array in all the loop nests. Therefore, in this paper we argue for an integrated approach that employs both loop and data transformations. The method enjoys the advantages of most of the previous techniques for enhancing locality and is efficient. In our approach, the loop nests in a program are processed one by one and the data layout constraints obtained from one nest are propagated for optimizing the remaining loop nests. We show a simple and effective matrix-based framework to implement this process. The search space that we consider for possible loop transformations can be represented by general nonsingular linear transformation matrices and the data layouts that we consider are those that can be expressed using hyperplanes. Experiments with several floating-point programs on an 8-processor SGI Origin 2000 distributed-shared-memory machine demonstrate the efficacy of our approach.

2.

Communication-free data allocation techniques for parallelizing compilers on multicomputers

Tzung-Shi Chen Jang-Ping Sheu 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(9):924-938

In distributed memory multicomputers, local memory accesses are much faster than those involving interprocessor communication. For the sake of reducing or even eliminating the interprocessor communication, the array elements in programs must be carefully distributed to local memory of processors for parallel execution. We devote our efforts to the techniques of allocating array elements of nested loops onto multicomputers in a communication-free fashion for parallelizing compilers. We first analyze the pattern of references among all arrays referenced by a nested loop, and then partition the iteration space into blocks without interblock communication. The arrays can be partitioned under the communication-free criteria with nonduplicate or duplicate data. Finally, a heuristic method for mapping the partitioned array elements and iterations onto the fixed-size multicomputers under the consideration of load balancing is proposed. Based on these methods, the nested loops can execute without any communication overhead on the distributed memory multicomputers. Moreover, the performance of the strategies with nonduplicate and duplicate data for matrix multiplication is studied.
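
The idea of partitioning an iteration space into blocks without interblock communication can be illustrated by brute force. The sketch below is not the authors' algorithm: it simply enumerates a small iteration space, merges any two iterations that touch the same array element (the nonduplicate-data case), and reports the resulting blocks. The loop body, the function name `communication_free_blocks`, and the 4x4 bounds are all assumptions made for illustration.

```python
from collections import defaultdict


def communication_free_blocks(iterations, references):
    """Group iterations that share an array element (illustrative brute force).

    iterations: list of iteration tuples, e.g. (i, j)
    references: list of (array_name, subscript_function) pairs
    """
    parent = {it: it for it in iterations}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_user = {}  # array element -> first iteration that touched it
    for it in iterations:
        for array, subscript in references:
            element = (array, subscript(*it))
            if element in first_user:
                union(first_user[element], it)
            else:
                first_user[element] = it

    blocks = defaultdict(list)
    for it in iterations:
        blocks[find(it)].append(it)
    return sorted(blocks.values())


# Hypothetical loop nest: for i, j in 0..3: A[i][j] = A[i-1][j-1] + 1
iterations = [(i, j) for i in range(4) for j in range(4)]
references = [
    ("A", lambda i, j: (i, j)),
    ("A", lambda i, j: (i - 1, j - 1)),
]
blocks = communication_free_blocks(iterations, references)
for block in blocks:
    print(block)
```

For this assumed loop, every block lies on one diagonal (constant `i - j`), so each diagonal could be assigned to a processor together with the elements it touches. A compiler would derive such blocks symbolically from the subscript functions rather than by enumeration.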

3.

Communication-Free Alignment for Array References with Linear Subscripts in Three Loop Index Variables or Quadratic Subscripts

Chang Weng-Long Chu Chih-Ping Wu Jia-Hwa 《The Journal of supercomputing》2001,20(1):67-83

4.

Precise Data Locality Optimization of Nested Loops

Vincent Loechner Benoît Meister Philippe Clauss 《The Journal of supercomputing》2002,21(1):37-76

A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of nested loops is proposed, driven by parameterized cost functions. The considered loops can be imperfectly nested. New data layouts are propagated through the connected references and through the loop nests as constraints for optimizing the next connected reference in the same nest or in the other ones. Unlike many existing methods, special attention is paid to TLB (Translation Lookaside Buffer) effectiveness since TLB misses can take from tens to hundreds of processor cycles. Our approach only considers active data, that is, array elements that are actually accessed by a loop, in order to prevent useless memory loads and take advantage of storage compression and temporal locality. Moreover, the same data transformation is not necessarily applied to a whole array. Depending on the referenced data subsets, the transformation can result in different data layouts for a same array. This can significantly improve the performance since a priori incompatible references can be simultaneously optimized. Finally, the process does not only consider the innermost loop level but all levels. Hence, large strides when control returns to the enclosing loop are avoided in several cases, and better optimization is provided in the case of a small index range of the innermost loop.

5.

Compile-time techniques for data distribution in distributed memory machines

Ramanujam J. Sadayappan P. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(4):472-482

A solution to the problem of partitioning data for distributed memory machines is discussed. The solution uses a matrix notation to describe array accesses in fully parallel loops, which allows the derivation of sufficient conditions for communication-free partitioning (decomposition) of arrays. A series of examples that illustrate the effectiveness of the technique for linear references, the use of loop transformations in deriving the necessary data decompositions, and a formulation that aids in deriving heuristics for minimizing communication when communication-free partitions are not feasible are presented.

6.

A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems

Santosh Pande Tareq Bali 《Journal of Parallel and Distributed Computing》1999,58(3):251

Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. 
We then perform data space partitioning based on a new "larger partition owns" rule to minimize the communication overhead for a compute-intensive partition by localizing its references relatively more than those of a smaller, non-compute-intensive partition. A partition interaction graph is then constructed, which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.

7.

A Representative-Element-Based Partitioning Algorithm

Zhang Weihua Wang Peng Zang Binyu Zhu Chuanqi 《计算机学报》2008,31(3):400-410

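Partitioning is an optimization technique that assigns the different computations and data of a program to different processors of a parallel processing system, so as to make full use of the system's computing resources and speed up program execution. The quality of a partitioning has a crucial impact on a program's execution efficiency on a parallel system, so the partitioning problem has long been a hot topic in parallel computing research. However, some characteristics of application programs pose great obstacles to solving the partitioning problem effectively, such as imperfectly nested loops, overlap among multiple references to a non-read-only array within one statement, and references by different statements to the same array with different strides. Existing partitioning algorithms cannot automatically partition programs with these characteristics. Although some intuitive partitioning strategies exist when such programs are optimized by hand, these strategies cannot be applied in a compiler to guide automatic partitioning. Based on the characteristics of such programs, this paper proposes a representative-element-based partitioning algorithm. The algorithm uses the array references that actually affect the partitioning computation as representative elements to construct the various partitioning constraints, and thereby completes the partitioning of the program. In addition, by finding the maximal consistent data partitioning direction, it effectively reduces the data reorganization communication incurred during partitioning. The algorithm has been implemented in AFT2004 and achieves good results on application programs.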

8.

Unimodular transformations of non-perfectly nested loops

《Parallel Computing》1997,22(12):1621-1645

A framework is described in which a class of imperfectly nested loops can be restructured using unimodular transformations. In this framework, an imperfect loop nest is converted to a perfect loop nest using Abu-Sufah's Non-Basic-to-Basic-Loop transformation. Conditions for the legality of this transformation and techniques for their verification are discussed. An iteration space, which extends the usual concept so as to represent explicitly the executions of individual statements, is proposed to model the converted loop nest. Since the converted loop nest is a perfect loop nest, data dependences can be extracted and optimal transformations can be selected for parallelism and/or locality in the normal manner. To generate the restructured code for a unimodular transformation, a code generation method is provided that produces the restructured code that is free of if statements by construction.

9.

On the Problem of Optimizing Parallel Programs for Complex Memory Hierarchies

Jin Guohua Chen Fujie 《计算机科学技术学报》1994,9(1):1-26

Based on a thorough study of the relationship between array element accesses and loop indices of the nested loop, a method is presented with which the staggering relation and the compacting relation between the threads of the nested loop (either with a single linear function or with multiple linear functions) can be determined at compile-time, and accordingly the nested loop (either perfectly nested or imperfectly nested) can be restructured to avoid the thrashing problem. Due to its simplicity, our method can be efficiently implemented in any parallel compiler, and the improvement of the performance is significant as shown by the experimental results.

10.

Analysis and Improvement of the GCC 4.1 Data Dependence Analyzer

Zeng Liyong Yang Canqun Huang Chun 《计算机工程与科学》2006,28(10):104-106

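This paper analyzes the data dependence analyzer of GCC 4.1 in depth and, to address its shortcomings in analyzing linearized array accesses in Fortran programs, presents two improvements: first, a preliminary implementation of a dependence analysis algorithm for non-affine array subscripts; second, a proposed and implemented data dependence analysis method for affine array subscripts based on splitting chains of recurrences. Experiments show that these two improvements strengthen the data dependence analysis capability of GCC 4.1 and provide more accurate data dependence information for loop transformations such as loop interchange.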

11.

A linear algebra framework for automatic determination of optimal data layouts

Kandemir M. Choudhary A. Shenoy N. Banerjee P. Ramanujam J. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):115-135

This paper presents a data layout optimization technique for sequential and parallel programs based on the theory of hyperplanes from linear algebra. Given a program, our framework automatically determines suitable memory layouts that can be expressed by hyperplanes for each array that is referenced. We discuss the cases where data transformations are preferable to loop transformations and show that under certain conditions a loop nest can be optimized for perfect spatial locality by using data transformations. We argue that data transformations can also optimize spatial locality for some arrays without distorting temporal/spatial locality exhibited by others. We divide the problem of optimizing data layout into two independent subproblems: 1) determining optimal static data layouts, and 2) determining data transformation matrices to implement the optimal layouts. By postponing the determination of the transformation matrix to the last stage, our method can be adapted to compilers with different default layouts. We then present an algorithm that considers optimizing parallelism and spatial locality simultaneously. Our results on eight programs on two distributed shared-memory multiprocessors, the Convex Exemplar SPP-2000 and the SGI Origin 2000, show that the layout optimizations are effective in optimizing spatial locality and parallelism.
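
A small sketch may help make "layouts expressed by hyperplanes" concrete. It is written under stated assumptions and is not the paper's algorithm: arrays are two-dimensional, a layout is a vector g such that elements d with equal g·d are stored contiguously (so (1, 0) stands for row-major, (0, 1) for column-major and (1, -1) for a diagonal layout), and a reference is described by an access matrix whose last column gives the element step taken per iteration of the innermost loop. The function names are hypothetical.

```python
from math import gcd


def innermost_stride(access_matrix):
    """Element step between consecutive innermost iterations (last column)."""
    return tuple(row[-1] for row in access_matrix)


def has_spatial_locality(layout, access_matrix):
    """True if consecutive innermost iterations stay on one layout hyperplane."""
    stride = innermost_stride(access_matrix)
    return sum(g * s for g, s in zip(layout, stride)) == 0


def layout_for(access_matrix):
    """Pick a 2-D layout vector orthogonal to the innermost stride.

    Returns None when the stride is zero (the reference does not move in the
    innermost loop, so any layout works).
    """
    a, b = innermost_stride(access_matrix)
    if a == 0 and b == 0:
        return None
    d = gcd(abs(a), abs(b))
    g = (b // d, -a // d)
    # Normalize the sign so the first nonzero component is positive.
    if g[0] < 0 or (g[0] == 0 and g[1] < 0):
        g = (-g[0], -g[1])
    return g


# Loop nest (i outer, j inner); rows of each matrix are array subscripts.
A_ij = [[1, 0], [0, 1]]      # A[i][j]
B_ji = [[0, 1], [1, 0]]      # B[j][i]
C_skew = [[1, 1], [0, 1]]    # C[i+j][j]

print(layout_for(A_ij))      # (1, 0): row-major
print(layout_for(B_ji))      # (0, 1): column-major
print(layout_for(C_skew))    # (1, -1): diagonal
```

Choosing a layout per array this way leaves the loop order untouched, which is why a data transformation can fix one reference without disturbing the locality of the others in the same nest.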

12.

A Data Partitioning Pattern for Dynamically Distributed Arrays

Ding Qiang Zang Binyu Zhu Chuanqi 《计算机工程与设计》2005,26(5):1135-1139,1143

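Data partitioning is a key technique for parallelizing compilation on distributed-memory systems. It takes arrays and the nested loops that contain them as its objects of study, with the fundamental goals of improving data locality and exploiting computational parallelism. For dynamically distributed array vectors that satisfy a given pattern, a data partitioning pattern is given by selecting representative elements. By combining data partitioning within a single loop nest with interprocedural projection, the data partitioning problem for dynamically distributed arrays is solved. This pattern fills a gap in existing data partitioning research.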

13.

Communication-free data alignment for arrays with exponential references in parallelizing compilers for scalable parallel systems

Minyi Guo Weng-Long Chang Bo Jiang Shu-Chien Huang Sien-Tang Tsai Michael Ho 《The Journal of supercomputing》2012,60(1):4-30

14.

Automatic Array Partitioning Based on the Smith Normal Form

Eric Hung-Yu Tseng Jean-Luc Gaudiot 《International journal of parallel programming》2005,33(1):35-56

We investigate the lattice-based array partitioning based on the theory of the Smith Normal Form and we present two elegant techniques for partitioning arrays in parallel DoAll loops for message-passing parallel machines: (1) DoAll loops with constant dependencies for communication-free partitioning: a general solution of all possible communication-free partitioning is derived where the dependencies among array references are described in constant distance vectors. (2) DoAll loops with non-constant dependencies for block-communication partitioning: the dependencies among array references are described in non-constant distance vectors. We derive the partitioning equations which allocate all remote data to a unique processor such that only one block-communication can obtain all the remote data for the computation. By using the Smith Normal Form decomposition, we are also able to verify our partitioning results.

15.

A unified framework for optimizing locality, parallelism, and communication in out-of-core computations

Kandemir M. Choudhary A. Ramanujam J. Kandaswamy M.A. 《Parallel and Distributed Systems, IEEE Transactions on》2000,11(7):648-668

This paper presents a unified framework that optimizes out-of-core programs by exploiting locality and parallelism, and reducing communication overhead. For out-of-core problems where the data set sizes far exceed the size of the available in-core memory, it is particularly important to exploit the memory hierarchy by optimizing the I/O accesses. We present algorithms that consider both iteration space (loop) and data space (file layout) transformations in a unified framework. We show that the performance of an out-of-core loop nest containing references to out-of-core arrays can be improved by using a suitable combination of file layout choices and loop restructuring transformations. Our approach considers array references one-by-one and attempts to optimize each reference for parallelism and locality. When there are references for which parallelism optimizations do not work, communication is vectorized so that data transfer can be performed before the innermost loop. Results from hand-compiles on IBM SP-2 and Intel Paragon distributed-memory message-passing architectures show that this approach reduces the execution times and improves the overall speedups. In addition, we extend the base algorithm to work with file layout constraints and show how it is useful for optimizing programs that consist of multiple loop nests.

16.

Communication-Free Hyperplane Partitioning of Nested Loops

《Journal of Parallel and Distributed Computing》1993,19(2):90-102

This paper addresses the problem of partitioning the iterations of nested loops, and data arrays accessed by the loops. Hyperplane partitions of disjoint subsets of data arrays and loop iterations that result in the elimination of communication are sought. A characterization of necessary and sufficient conditions for communication-free hyperplane partitioning is provided.

17.

A Direct Approach for Finding Loop Transformation Matrices

LIN Hua LU Mi Jesse Z. FANG 《计算机科学技术学报》1996,11(3):237-256

Loop transformations, such as loop interchange, reversal and skewing, have been unified under linear matrix transformations. A legal transformation matrix is usually generated based upon distance vectors or direction vectors. Unfortunately, for some nested loops, distance vectors may not be computable and direction vectors, on the other hand, may not contain useful information. We propose the use of linear equations or inequalities of distance vectors to approximate data dependence. This approach is advantageous since (1) many loops having no constant distance vectors have very simple equations of distance vectors; (2) these equations contain more information than direction vectors do, thus the chance of exploiting potential parallelism is improved. In general, the equations or inequalities that approximate the data dependence of a given nested loop are not unique, hence classification is discussed for the purpose of loop transformation. Efficient algorithms are developed to generate all kinds of linear equations of distance vectors for a given nested loop. The issue of how to obtain a desired transformation matrix from those equations is also addressed.
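
For readers unfamiliar with the conventional starting point this abstract refers to, the sketch below shows the standard distance-vector legality test: a linear transformation T is legal when T·d remains lexicographically positive for every dependence distance vector d. It covers only the constant-distance case, not the paper's equation-based method, and the matrices, helper names and the example dependence (1, -1) are assumptions chosen for illustration.

```python
def mat_vec(T, d):
    """Multiply matrix T (list of rows) by vector d."""
    return tuple(sum(t * x for t, x in zip(row, d)) for row in T)


def mat_mul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def lex_positive(v):
    """True if the first nonzero component of v is positive."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def is_legal(T, distances):
    """A transformation is legal if every transformed distance stays positive."""
    return all(lex_positive(mat_vec(T, d)) for d in distances)


INTERCHANGE = [[0, 1], [1, 0]]   # swap the two loops
SKEW = [[1, 0], [1, 1]]          # skew the inner loop by the outer loop

distances = [(1, -1)]            # assumed dependence of a hypothetical loop nest

print(is_legal(INTERCHANGE, distances))                  # False: (1,-1) -> (-1,1)
print(is_legal(SKEW, distances))                         # True:  (1,-1) -> (1,0)
print(is_legal(mat_mul(INTERCHANGE, SKEW), distances))   # True:  (1,-1) -> (0,1)
```

The last line shows why the matrix view is convenient: skewing followed by interchange is a single matrix product, and the combined transformation is legal even though interchange alone is not.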

18.

A Data Partitioning Pattern Based on Pointer Arrays

Ding Qiang Zang Binyu Zhu Chuanqi 《计算机工程与应用》2005,41(27):62-65,183

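Data partitioning is a key technique for parallelizing compilation on distributed-memory systems. It takes arrays and the nested loops that contain them as its objects of study, with the fundamental goals of improving data locality and exploiting computational parallelism. Traditional data partitioning patterns are not suited to partitioning arrays of pointers that point to arrays. This paper proposes a partitioning pattern for such pointer arrays, referred to in the paper as data partitioning of array vectors. It analyzes the characteristics of their data references and, by selecting representative elements, gives a data partitioning strategy, filling a gap in existing data partitioning research.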

19.

Using knowledge-based systems for research on parallelizing compilers

Chao-Tung Yang Shian-Shyong Tseng Yun-Woei Fann Ting-Ku Tsai Ming-Huei Hsieh Cheng-Tien Wu 《Concurrency and Computation》2001,13(3):181-208

20.

SLP-Oriented Vectorization for Nested Loops

Wei Shuai Zhao Rongcai Yao Yuan 《软件学报》2012,23(7):1717-1728

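Today, more and more processors integrate SIMD (single instruction multiple data) extensions, and most existing compilers implement automatic vectorization. However, they generally vectorize only the innermost loop, and a general, easy-to-apply vectorization method for nested loops is lacking. To this end, this paper proposes an SLP (superword level parallelism)-oriented vectorization method for nested loops. It analyzes each loop level in turn from the outermost inward and collects the attribute values of each loop level that affect vectorization, mainly whether the loop can be directly unrolled and jammed, how many array references are contiguous with respect to the loop index, and the region the loop contains. Based on these attribute values, it decides at which loop levels to apply direct unroll-and-jam, and finally vectorizes the statements in the loop through SLP. Experimental results show that, compared with inner-loop vectorization and simple outer-loop vectorization, the algorithm improves the average speedup by 2.13 and 1.41, respectively, and reaches a speedup of up to 5.3 for some commonly used kernel loops.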