期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

基于SIMD机器的优化数据传输的并行循环分割 总被引：2，自引：1，他引：2

林进张兆庆祝明发《计算机学报》1998,21(7):577-585

本文提出一个基于分布式局存的ＳＩＭＤ机器的循环分割理论体系以优化运算中所需要的数据传输。该体系使用矩阵表示迭代空间、数据空间和数组存取式。我们引入数据传输概念，并建立一个简单有效的数据传输模型来评估数据在全局内存和局部内存之间的传输开销。最后，对于给定的循环嵌套，我们给出一个循环分割算法以获得优化循环块，使得循环嵌套中所需要的数据传输开销最小，并且大大减少了数据传输和计算的同步开销。实验结果证明了相似文献

2.

基于线性不等式的数据划分方法的优化 总被引：1，自引：0，他引：1

董春丽赵荣彩杜澎王峥《计算机应用》2007,27(5):1251-1253

计算和数据划分是串行程序并行化时所要解决的一个重要问题，如何对程序中引用的数据进行合理的分布以最大限度的发现程序的并行性减少数据重分布的通信开销，是并行编译优化的重点。给出的数据和计算的优化分解方法是基于Anderson-Lam的分解算法上改进得到的。根据Anderson-Lam的算法得到数据和计算划分后，以线性不等式的形式表示，然后通过分析循环嵌套中能够进行边界冗余的只读数组，重新构造数据划分不等式，根据此不等式进行数据分布，实现具有边界冗余的只读数组的数据划分，有效地减少了数据收发的通信量。相似文献

3.

程序并行化中基于暴露集生成数据分布代码

丁锐赵荣彩韩林《计算机工程与设计》2009,30(15)

在并行化编译中,代码生成属于编译器的后端,决定着并行程序的执行效率.数据划分将计算循环中被重定义或没被读引用的数据映射到处理器,按照数据划分生成通信代码会产生冗余通信.提出了利用数组数据流分析求解暴露集,并建立计算划分、循环迭代以及暴露集的不等式限制系统,最后通过FME(fourier Motzkin elimination)消元生成数据分布代码的优化算法.测试结果表明该算法对数据分布的优化效果明显. 相似文献

4.

一类非规则并行应用问题的通信集生成算法

胡长军李静王珏姚广利李永红丁良李建江《计算机学报》2008,31(1):120-126

非规则计算是大规模并行应用中普遍存在和影响效率的关键问题.在基于分布式内存的数据并行范例中,如何针对非规则数组引用,有效地生成本地内存访问序列和通信集,是并行编译生成SPMD结点程序所必须解决的重要问题.文中针对两重嵌套循环中,下一层循环边界是上一层循环变量的线性或非线性函数,数组下标是两层循环变量的非线性函数这样一类包含非规则数组引用的并行应用问题,提出了一种在编译时生成通信集的代数算法.并且针对cyclic(k)数据分布和线性对齐模板,借助整数格概念,给出了编译时全局地址和本地地址之间的转换方法.文中还给出了相应的经过通信优化的SPMD结点程序.最后通过实例验证了算法的正确性.该算法的意义在于避免了传统Inspector/Executor非规则计算模型中的Inspector阶段,从而节省了运行时Inspector阶段通过穷举下标生成通信集的巨大开销. 相似文献

5.

迭代空间交错条块并行Gauss-Seidel算法 总被引：1，自引：0，他引：1

胡长军张纪林王珏李建江《软件学报》2008,19(6):1274-1282

针对并行GS(Gauss-Seidel)迭代算法中数据局部性差、同步和通信开销大的问题,首先改进传统GS迭代,提出了多层对称GS迭代算法.然后给出了以迭代空间条块序作为执行序的串行执行模型.该模型通过对迭代空间进行"时滞"划分,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行执行模型.该模型改进了迭代空间网格划分,并通过网格条块重排序减少了cache缺失率、通信启动和同步次数.实验结果表明,迭代空间交错条块并行算法比传统的区域分解方法和红黑排序并行算法具有更好的并行效率和可扩展性. 相似文献

6.

一类基于迭代空间条块的并行有限差分Stencil 算法

张纪林狄鹏蒋从锋张伟徐向华万健任永坚《软件学报》2010,21(Z1):270-283

高效的并行有限差分Stencil 算法对于求解大型线性方程组是十分重要的.针对并行有限差分Stencil 算法中数据局部性差、同步和通信开销大的问题.首先改进传统有限差分Stencil 算法,提出了多层对称遍历有限差分Stencil 算法.然后给出了以迭代空间条块序作为执行序的串行算法,通过沿时间轴对迭代空间进行时滞划分,在不改变迭代算法性质的同时,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行算法,该算法利用改进的多面体模型对迭代空间网格划分,并通过网格条块重排序减少了Cache 缺失率、通信启动和同步次数.理论分析和实验结果表明,该并行模型比传统的区域分解方法和红黑排序并行算法具有更好的数据局部性,并行效率和可扩展性. 相似文献

7.

MPI自动并行化编译系统中消息传递代码生成算法

陈达智赵荣彩姚远韩林《计算机科学》2012,39(6):301-304

传统MPI自动并行化编译系统从数据重分布的角度,生成面向分布式存储系统的消息传递程序,但是大量数据重分布通信的额外开销导致其加速比低。为了解决此问题,在基于Open64的MPI自动并行化编译系统后端,提出了一种消息传递代码生成算法。该算法以统一数据分布为中心,根据给定的并行化循环集和通信数组集,通过修改WHIRL表示的串行代码语法结构树,生成更精确的消息传递代码。实验结果表明,该算法能够较大程度地降低消息传递程序的通信开销,并且明显提升其加速比。相似文献

8.

分布式图计算框架混合计算模式的研究

丁鑫陈榕陈海波《小型微型计算机系统》2015,(4):665-670

分析了分布式图计算框架的同步和异步计算模式在调度开销和收敛速度上存在的优点与不足.同步计算模式调度开销小,但是收敛较慢;而异步计算模式收敛较快,但调度开销大.基于上述发现,提出一种混合计算模式,能够在分布式环境下有效地结合同步与异步计算模式的优点克服各自不足,以获得最优性能.混合计算模式采用"同步控制流"以降低分布式环境下的调度开销,同时采用"异步数据流"使计算过程使用较新的数据以加快收敛速度.基于多个典型图算法和真实大规模图的评测显示,混合计算模式的性能是原有同步计算模式的1.2倍到2.4倍,计算量平均减少30%;相对于异步计算模式通过减少调度开销,整体性能可以提升至其2.3倍到4.6倍. 相似文献

9.

一种分布式共享存储系统的线程分配算法 总被引：3，自引：0，他引：3

刘轶郑守淇钱德沛《计算机研究与发展》2000,37(5):521-526

讨论了软件实现了多线程ＤＳＭ的通信开销和线程分配问题,给出了一种基于线程关系图的调度模型,并在此基础上提出了一种基于迭代的线程分配算法,通过大量的线程关系图对算法进行了评价,并且在一个软件ＤＳＭ系统中实现了该算法,同时给出了算法的评价结果和应用程序的性能数据。相似文献

10.

Asyn-SimRank：一种可异步执行的大规模 SimRank 算法

王春磊张岩峰鲍玉斌赵长宽于戈高立新《计算机研究与发展》2015,(7)

SimRank 算法利用网络结构来评估网络中任意2点的相似性,它被广泛应用于社交网络和链接预测等诸多领域中．近年来,随着大数据技术的发展,SimRank 算法处理的数据不断增大,人们利用MapReduce 等分布式计算模型设计实现分布式的大规模 SimRank 算法来适应大数据处理的需求．但是,由于 SimRank 算法包含开销较大的迭代过程,每次迭代之后都需要一个全局同步,且每次迭代的计算复杂度高、通信量大,SimRank 算法不能在分布式环境下高效地实现．1）提出 Asyn‐SimRank 算法,该算法采用迭代‐累积的方式完成迭代计算,异步执行 SimRank 的核心迭代过程,避免了大规模分布式计算中的大量同步开销,同时有效降低计算量并减少通信开销;2）提出关键点优先调度计算,提升了 Asyn‐SimRank 算法的全局收敛速度;3）证明了 Asyn‐SimRank 算法的正确性和收敛性以及关键点优先调度计算的有效性;4）支持异步迭代的分布式框架 Maiter 上实现了 Asyn‐SimRank 算法．实验结果显示,相比较于 Hadoop ,Spark 上实现的 SimRank 算法和 Delta‐SimRank 算法,Asyn‐SimRank 算法大大提升了算法的计算效率,加速了算法收敛．相似文献

11.

A Computation+Communication Load Balanced Loop Partitioning Method for Distributed Memory Systems

Santosh Pande Tareq Bali 《Journal of Parallel and Distributed Computing》1999,58(3):251

Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new larger partition owns rule to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller noncompute intensive partition. A partition interaction graph is then constructed which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D. 相似文献

12.

A Loop Transformation Algorithm for Communication Overlapping

Kazuaki Ishizaki Hideaki Komatsu Toshio Nakatani 《International journal of parallel programming》2000,28(2):135-154

Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization. 相似文献

13.

A general methodology of partitioning and mapping for given regulararrays

Xian Chen Megson G.M. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(10):1100-1107

A methodology for partitioning and mapping of arbitrary uniform recurrence equations (UREs) expressed as computation graphs onto a given regular array is proposed. Deriving and based on a set of canonical dependencies together with two models of space projection, we give a general method of parallelization. The method has significant advantages in mapping an arbitrary computation graph onto a given processor array while preserving high efficiency in both communication and computation 相似文献

14.

分布存储系统中优化通信的冗余计算分割 总被引：1，自引：0，他引：1

陈莉张兆庆冯晓兵《计算机学报》2003,26(2):180-187

针对并行循环套序列，提出一种冗余计算分割的通信优化方法，根据数据流分析，文中给出用以确定每个循环套的冗余计算量的一般方法，并在此基础上提出冗余计算分割的实现和判定，针对规则依赖的程序，该文还提出了一个高效的冗余计算分割的实现方法，该技术已经在一个并行编译器中实现，试验结果表明，它比传统的通信优化技术有明显的优越性。相似文献

15.

Partitioning and mapping nested loops on multiprocessor systems

Jang-Ping Sheu Tsu-Huei Tai 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(4):430-439

A method for executing nested loops with constant loop-carried dependencies in parallel on message-passing multiprocessor systems to reduce communication overhead is presented. In the partitioning phase, the nested loop is divided into blocks that reduce the interblock communication, without regard to the machine topology. The execution ordering of the iterations is defined by a given time function based on L. Lamport's (1974) hyperplane method. The iterations are then partitioned into blocks so that the execution ordering is not disturbed, and the amount of interblock communication is minimized. In the mapping phase, the partitioned blocks are mapped onto a fixed-size multiprocessor system in such a manner that the blocks that have to exchange data frequently are allocated to the same processor or neighboring processors. A heuristic mapping algorithm for hypercube machines is proposed 相似文献

16.

A consistent generation of pipeline parallelism and distribution of operations and data among processors

E. V. Adutskevich N. A. Likhoded 《Programming and Computer Software》2006,32(3):166-176

The problem of mapping affine loop nests onto parallel computers with distributed memory is considered. A technique for algorithm scheduling and distributing operations and data over processors is proposed. This technique makes it possible to generate pipeline parallelism and minimize the amount of data exchanges between the processors. The method is adapted for automation and explicitly allows for dependence on outer variables of loops. 相似文献

17.

Parallel Image Correlation: Case Study to Examine Trade-Offs in Algorithm-to-Machine Mappings

Armstrong James B. Maheswaran Muthucumaru Theys Mitchell D. Siegel Howard Jay Nichols Mark A. Casey Kenneth H. 《The Journal of supercomputing》1998,12(1-2):7-35

Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines. 相似文献

18.

分布式系统中数据分解的研究

沈亚楠姚远张平赵荣彩罗向阳《计算机工程》2006,32(11):114-115,132

数据分解对消息传递并行机下的并行编译器取得高性能至关重要。根据编译器自动得出的数据分解（映射数据到处理机）信息，C语言版本的发送／接收消息循环嵌套可产生出来，从而在处理机之间实现分布数据。不仅一个已被证明且功能强大的数学模型用于产生数据分解代码，而且一个形式化的算法及其实现也已给出。初步实验结果显示该算法能显著提高性能。相似文献

19.

Mapping of Affine Loop Nests onto Independent Processors

N. A. Likhoded 《Cybernetics and Systems Analysis》2003,39(3):459-466

A method is given for obtaining independent parts of algorithms, represented by affine loop nests (not necessary perfectly nested). The method is based on a modular affine mapping of algorithm operations onto independent virtual processors. The method can select more independent computations than the known procedures based on affine mappings. 相似文献