首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 140 毫秒
1.
基于SIMD机器的优化数据传输的并行循环分割   总被引:2,自引:1,他引:2  
本文提出一个基于分布式局存的SIMD机器的循环分割理论体系以优化运算中所需要的数据传输。该体系使用矩阵表示迭代空间、数据空间和数组存取式。我们引入数据传输概念,并建立一个简单有效的数据传输模型来评估数据在全局内存和局部内存之间的传输开销。最后,对于给定的循环嵌套,我们给出一个循环分割算法以获得优化循环块,使得循环嵌套中所需要的数据传输开销最小,并且大大减少了数据传输和计算的同步开销。实验结果证明了  相似文献   

2.
基于线性不等式的数据划分方法的优化   总被引:1,自引:0,他引:1  
董春丽  赵荣彩  杜澎  王峥 《计算机应用》2007,27(5):1251-1253
计算和数据划分是串行程序并行化时所要解决的一个重要问题,如何对程序中引用的数据进行合理的分布以最大限度的发现程序的并行性减少数据重分布的通信开销,是并行编译优化的重点。给出的数据和计算的优化分解方法是基于Anderson-Lam的分解算法上改进得到的。根据Anderson-Lam的算法得到数据和计算划分后,以线性不等式的形式表示,然后通过分析循环嵌套中能够进行边界冗余的只读数组,重新构造数据划分不等式,根据此不等式进行数据分布,实现具有边界冗余的只读数组的数据划分,有效地减少了数据收发的通信量。  相似文献   

3.
在并行化编译中,代码生成属于编译器的后端,决定着并行程序的执行效率.数据划分将计算循环中被重定义或没被读引用的数据映射到处理器,按照数据划分生成通信代码会产生冗余通信.提出了利用数组数据流分析求解暴露集,并建立计算划分、循环迭代以及暴露集的不等式限制系统,最后通过FME(fourier Motzkin elimination)消元生成数据分布代码的优化算法.测试结果表明该算法对数据分布的优化效果明显.  相似文献   

4.
非规则计算是大规模并行应用中普遍存在和影响效率的关键问题.在基于分布式内存的数据并行范例中,如何针对非规则数组引用,有效地生成本地内存访问序列和通信集,是并行编译生成SPMD结点程序所必须解决的重要问题.文中针对两重嵌套循环中,下一层循环边界是上一层循环变量的线性或非线性函数,数组下标是两层循环变量的非线性函数这样一类包含非规则数组引用的并行应用问题,提出了一种在编译时生成通信集的代数算法.并且针对cyclic(k)数据分布和线性对齐模板,借助整数格概念,给出了编译时全局地址和本地地址之间的转换方法.文中还给出了相应的经过通信优化的SPMD结点程序.最后通过实例验证了算法的正确性.该算法的意义在于避免了传统Inspector/Executor非规则计算模型中的Inspector阶段,从而节省了运行时Inspector阶段通过穷举下标生成通信集的巨大开销.  相似文献   

5.
迭代空间交错条块并行Gauss-Seidel算法   总被引:1,自引:0,他引:1  
胡长军  张纪林  王珏  李建江 《软件学报》2008,19(6):1274-1282
针对并行GS(Gauss-Seidel)迭代算法中数据局部性差、同步和通信开销大的问题,首先改进传统GS迭代,提出了多层对称GS迭代算法.然后给出了以迭代空间条块序作为执行序的串行执行模型.该模型通过对迭代空间进行"时滞"划分,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行执行模型.该模型改进了迭代空间网格划分,并通过网格条块重排序减少了cache缺失率、通信启动和同步次数.实验结果表明,迭代空间交错条块并行算法比传统的区域分解方法和红黑排序并行算法具有更好的并行效率和可扩展性.  相似文献   

6.
高效的并行有限差分Stencil 算法对于求解大型线性方程组是十分重要的.针对并行有限差分Stencil 算法中数据局部性差、同步和通信开销大的问题.首先改进传统有限差分Stencil 算法,提出了多层对称遍历有限差分Stencil 算法.然后给出了以迭代空间条块序作为执行序的串行算法,通过沿时间轴对迭代空间进行时滞划分,在不改变迭代算法性质的同时,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行算法,该算法利用改进的多面体模型对迭代空间网格划分,并通过网格条块重排序减少了Cache 缺失率、通信启动和同步次数.理论分析和实验结果表明,该并行模型比传统的区域分解方法和红黑排序并行算法具有更好的数据局部性,并行效率和可扩展性.  相似文献   

7.
传统MPI自动并行化编译系统从数据重分布的角度,生成面向分布式存储系统的消息传递程序,但是大量数据重分布通信的额外开销导致其加速比低。为了解决此问题,在基于Open64的MPI自动并行化编译系统后端,提出了一种消息传递代码生成算法。该算法以统一数据分布为中心,根据给定的并行化循环集和通信数组集,通过修改WHIRL表示的串行代码语法结构树,生成更精确的消息传递代码。实验结果表明,该算法能够较大程度地降低消息传递程序的通信开销,并且明显提升其加速比。  相似文献   

8.
分析了分布式图计算框架的同步和异步计算模式在调度开销和收敛速度上存在的优点与不足.同步计算模式调度开销小,但是收敛较慢;而异步计算模式收敛较快,但调度开销大.基于上述发现,提出一种混合计算模式,能够在分布式环境下有效地结合同步与异步计算模式的优点克服各自不足,以获得最优性能.混合计算模式采用"同步控制流"以降低分布式环境下的调度开销,同时采用"异步数据流"使计算过程使用较新的数据以加快收敛速度.基于多个典型图算法和真实大规模图的评测显示,混合计算模式的性能是原有同步计算模式的1.2倍到2.4倍,计算量平均减少30%;相对于异步计算模式通过减少调度开销,整体性能可以提升至其2.3倍到4.6倍.  相似文献   

9.
一种分布式共享存储系统的线程分配算法   总被引:3,自引:0,他引:3  
讨论了软件实现了多线程DSM 的通信开销和线程分配问题,给出了一种基于线程关系图的调度模型,并在此基础上提出了一种基于迭代的线程分配算法,通过大量的线程关系图对算法进行了评价,并且在一个软件DSM系统中实现了该算法,同时给出了算法的评价结果和应用程序的性能数据。  相似文献   

10.
SimRank 算法利用网络结构来评估网络中任意2点的相似性,它被广泛应用于社交网络和链接预测等诸多领域中.近年来,随着大数据技术的发展,SimRank 算法处理的数据不断增大,人们利用MapReduce 等分布式计算模型设计实现分布式的大规模 SimRank 算法来适应大数据处理的需求.但是,由于 SimRank 算法包含开销较大的迭代过程,每次迭代之后都需要一个全局同步,且每次迭代的计算复杂度高、通信量大,SimRank 算法不能在分布式环境下高效地实现.1)提出 Asyn‐SimRank 算法,该算法采用迭代‐累积的方式完成迭代计算,异步执行 SimRank 的核心迭代过程,避免了大规模分布式计算中的大量同步开销,同时有效降低计算量并减少通信开销;2)提出关键点优先调度计算,提升了 Asyn‐SimRank 算法的全局收敛速度;3)证明了 Asyn‐SimRank 算法的正确性和收敛性以及关键点优先调度计算的有效性;4)支持异步迭代的分布式框架 Maiter 上实现了 Asyn‐SimRank 算法.实验结果显示,相比较于 Hadoop ,Spark 上实现的 SimRank 算法和 Delta‐SimRank 算法,Asyn‐SimRank 算法大大提升了算法的计算效率,加速了算法收敛.  相似文献   

11.
Due to a significant communication overhead of sending and receiving data, the loop partitioning approaches on distributed memory systems must guarantee not just the computation load balance but computation+communication load balance. The previous approaches in loop partitioning have achieved a communication-free, computation load balanced iteration space partitioning solution for a limited subset of DOALL loops. But a large category of DOALL loops inevitably result in communication and the trade-offs between computation and communication must be carefully analyzed for these loops in order to balance out the combined computation time and communication overheads. In this work, we describe a partitioning approach based on the above motivation for the general cases of DOALL loops. Our goal is to achieve a computation+communication load balanced partitioning through static data and iteration space distribution. Our approach first performs partitioning of iteration and data spaces of a loop nest by analyzing communication and parallelism; it then performs architecture-dependent analysis to adjust the granularity of partitions, load balance each partition with respect to total computation+communication, and then performs mapping of partitions onto the available number of processors. This multiphase partitioning method works as follows. First, the code partitioning phase analyzes the references in the body of the DOALL loop nest and determines a set of directions for reducing a larger degree of communication by trading a lesser degree of parallelism. The partitioning is carried out in the iteration space of the loop by cyclically following a set of direction vectors such that the data references are maximally localized and reused, eliminating a larger communication volume than parallelism. We then perform data space partitioning based on a new larger partition owns rule to minimize the communication overhead for a compute intensive partition by localizing its references relatively more than a smaller noncompute intensive partition. A partition interaction graph is then constructed which is used by the architecture-dependent analysis phase to merge the partitions to achieve granularity adjustment, computation+communication load balance, and mapping on the actual number of available processors. Relevant theory and algorithms are developed along with a performance evaluation on the Cray T3D.  相似文献   

12.
Overlapping communication with computation is a well-known approach to improving performance. Previous research has focused on optimizations performed by the programmer. This paper presents a compiler algorithm that automatically determines the appropriate loop indices of a given nested loop and applies loop interchange and tiling in order to overlap communication with computation. The algorithm avoids generating redundant communication by providing a framework for combining information on data dependence, communication, and reuse. It also describes a method of generating messages to exchange data between processors for tiled loops on distributed memory machines. The algorithm has been implemented in our High Performance Fortran (HPF) compiler, and experimental results have shown its effectiveness on distributed memory machines, such as the RISC System/6000 Scalable POWERparallel System. This paper also discusses the architectural problems of efficient optimization.  相似文献   

13.
A methodology for partitioning and mapping of arbitrary uniform recurrence equations (UREs) expressed as computation graphs onto a given regular array is proposed. Deriving and based on a set of canonical dependencies together with two models of space projection, we give a general method of parallelization. The method has significant advantages in mapping an arbitrary computation graph onto a given processor array while preserving high efficiency in both communication and computation  相似文献   

14.
分布存储系统中优化通信的冗余计算分割   总被引:1,自引:0,他引:1  
针对并行循环套序列,提出一种冗余计算分割的通信优化方法,根据数据流分析,文中给出用以确定每个循环套的冗余计算量的一般方法,并在此基础上提出冗余计算分割的实现和判定,针对规则依赖的程序,该文还提出了一个高效的冗余计算分割的实现方法,该技术已经在一个并行编译器中实现,试验结果表明,它比传统的通信优化技术有明显的优越性。  相似文献   

15.
A method for executing nested loops with constant loop-carried dependencies in parallel on message-passing multiprocessor systems to reduce communication overhead is presented. In the partitioning phase, the nested loop is divided into blocks that reduce the interblock communication, without regard to the machine topology. The execution ordering of the iterations is defined by a given time function based on L. Lamport's (1974) hyperplane method. The iterations are then partitioned into blocks so that the execution ordering is not disturbed, and the amount of interblock communication is minimized. In the mapping phase, the partitioned blocks are mapped onto a fixed-size multiprocessor system in such a manner that the blocks that have to exchange data frequently are allocated to the same processor or neighboring processors. A heuristic mapping algorithm for hypercube machines is proposed  相似文献   

16.
The problem of mapping affine loop nests onto parallel computers with distributed memory is considered. A technique for algorithm scheduling and distributing operations and data over processors is proposed. This technique makes it possible to generate pipeline parallelism and minimize the amount of data exchanges between the processors. The method is adapted for automation and explicitly allows for dependence on outer variables of loops.  相似文献   

17.
Performance of a parallel algorithm on a parallel machine depends not only on the time complexity of the algorithm, but also on how the underlying machine supports the fundamental operations used by the algorithm. This study analyzes various mappings of image correlation algorithms in SIMD, MIMD, and mixed-mode environments. Experiments were conducted on the Intel Paragon, MasPar MP-1, nCUBE 2, and PASM prototype. The machine features considered in this study include: modes of parallelism, communication/computation ratio, network topology and implementation, SIMD CU/PE overlap, and communication/computation overlap. Performance of an implementation can be enhanced by using algorithmic techniques that match the machine features. Some algorithmic techniques discussed here are additional communication versus redundant computation, data block transfers, and communication/computation overlap. The results presented are applicable to a large class of image processing tasks. Case studies, such as the one presented here, are a necessary step in developing software tools for mapping an application task onto a single parallel machine and for mapping the subtasks of an application task, or a set of independent application tasks, onto a heterogeneous suite of parallel machines.  相似文献   

18.
沈亚楠  姚远  张平  赵荣彩  罗向阳 《计算机工程》2006,32(11):114-115,132
数据分解对消息传递并行机下的并行编译器取得高性能至关重要。根据编译器自动得出的数据分解(映射数据到处理机)信息,C语言版本的发送/接收消息循环嵌套可产生出来,从而在处理机之间实现分布数据。不仅一个已被证明且功能强大的数学模型用于产生数据分解代码,而且一个形式化的算法及其实现也已给出。初步实验结果显示该算法能显著提高性能。  相似文献   

19.
A method is given for obtaining independent parts of algorithms, represented by affine loop nests (not necessary perfectly nested). The method is based on a modular affine mapping of algorithm operations onto independent virtual processors. The method can select more independent computations than the known procedures based on affine mappings.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号