首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到18条相似文献,搜索用时 140 毫秒
1.
高效的并行有限差分Stencil 算法对于求解大型线性方程组是十分重要的.针对并行有限差分Stencil 算法中数据局部性差、同步和通信开销大的问题.首先改进传统有限差分Stencil 算法,提出了多层对称遍历有限差分Stencil 算法.然后给出了以迭代空间条块序作为执行序的串行算法,通过沿时间轴对迭代空间进行时滞划分,在不改变迭代算法性质的同时,对迭代空间条块内部多次迭代计算,提高算法的数据局部性.最后提出一种基于迭代空间条块的并行算法,该算法利用改进的多面体模型对迭代空间网格划分,并通过网格条块重排序减少了Cache 缺失率、通信启动和同步次数.理论分析和实验结果表明,该并行模型比传统的区域分解方法和红黑排序并行算法具有更好的数据局部性,并行效率和可扩展性.  相似文献   

2.
逐次超松弛迭代方法被广泛应用于油藏数值模拟中压力方程的求解.其并行实现是提高模拟速度的重要途径.传统并行方案大都只是在一次迭代内进行数据划分,而没有进一步将数据划分与迭代空间划分相结合,故针对SOR算法和SMP(symmetric multi-processors)系统的特点,以OpenMP为并行化实现工具,提出了基于SMP的并行逐次超松弛迭代方法(parallelSOR).方法通过改变不同迭代步内数据点的更新次序,使不同区域内的数据点可以并行执行多次迭代.总结出针对三维油藏区域在数据空间划分和迭代空间合并上相对较优的策略,分析了迭代过程中网格块的生长形状.与传统的并行策略相比,该方法具有可减小同步开销、改进数据局部性、cache命中率高等优点.实验结果表明,该方法具有较高的加速比和效率.  相似文献   

3.
为适应海量地震数据以及集群并行规模不断增大的趋势,提出了多维度成像空间分解算法.根据大规模集群系统有多个并行层次的特征,首先沿炮检距方向分解成像空间;然后再沿in-line方向继续切分,直到成像空间小于计算节点物理内存;最后在二维地表上以面元为单位分解成像空间.算法实现上,共炮检距成像空间映射到计算节点组上,计算节点内的CPU核之间按照round-robin均分面元.该并行算法在不增加数据通信量的情况下,降低了内存的需求,减少了通信开销和同步时间,提高了数据的局部性.实际资料测试表明,该并行算法比传统的输出并行和输入并行算法具备更好的性能与可扩展性,实验作业调度多达497个节点、7 552个线程,仍然具备较好的加速效果.  相似文献   

4.
面向大型网格模型的简化问题,提出了一种基于顶点聚类方法采用多数据流策略的并行核外模型简化算法。算法首先将传统顶点聚类简化算法中的代表点计算方法改进为顶点筛选方法,进而设计了一种适用于分布式计算环境的数据外存储策略,最后采用多数据流的思想改进单元筛选与顶点筛选两个方法的执行过程,从而形成完整的并行核外模型简化算法。实验结果表明,该算法有效避免了基于区域分解的并行算法对模型结构的破坏,提高了模型简化的质量;相比于多种现有的并行算法,该算法极大程度优化了并行资源的负载分配问题,具备更为理想的加速比和并行效率。  相似文献   

5.
在图像并行处理应用中,有很大一部分并行算法是属于迭代同步的数据并行算法。这类分布式应用的子任务间需要某种形式的同步,因而要有协作调度的算法保证子任务基本上同时开始,并以同样速度执行。该文提出了一种以并行程序最短执行时间为目标的数据划分和任务协作调度算法。与其它类似算法相比,这个算法的最大特点是考虑了通信开销,因而更加符合实际的应用,更加有效。  相似文献   

6.
Jacobi迭代算法是一种求解偏微分方程组的常用循环运算.由于该算法存在语句间的数据相关,阻碍了其在图像处理单元(Graphic Processing Unit,GPU)等并行计算平台的高效实现.通过数学证明与实验验证,比较不同的循环优化策略,消除语句间数据相关,增强数据局部性,从而获得更高的执行性能.此外,利用块(Tile)大小选取模型,合理的划分计算数据,充分利用GPU的运算资源,进一步提高性能.实验结果表明,Jacobi奇偶复制算法比传统Jacobi并行算法在GPU上的性能提高4倍以上.  相似文献   

7.
逐次松弛迭代算法(SOR)是求解线性方程组的一种常用迭代算法,当系数矩阵正定时,它具有较快的收敛速度。但是,由于每个迭代步内存在数据相关,它难以实现并行计算。目前的SOR并行算法采用数据分解的方法,但由于该法并行区域过小,同步通讯代价大,并行效率低。本文提出了SOR的一种新型并行算法,该算法与传统SOR方法等价,具有相同的收敛性和迭代结果。该并行算法通过矩阵分块增大了可并行计算的区域,并引入流水线技术,利用各处理器间通讯与计算时间的重叠,获得较理想的并行加速效率。通过多核微机以及小规模集群上的数值实验证明,本文提出的SOR并行算法在求解大型稠密线性方程组时具有较好的并行效率。  相似文献   

8.
一个用于数据并行语言计算划分的时序优化模型   总被引:2,自引:0,他引:2  
一个程序中数据并行语句的计算划分(CP)对该程序的运行性能有决定性的作用.尽管人们对这一问题已经进行了广泛的研究,但这些研究的重点都集中在如何提高被选择计算划分的空间局部性上.针对并行循环结构的计算划分问题,提出了一个时序优化模型.在该模型中,一个计算划分被表示成一个有向图,在把并行语句中的操作映射到各个处理器的同时,给出了被分配到不同处理器上的操作之间的相关性.对于一条数据并行语句,时序优化模型对它的每个计算划分选择方案分别采用多种有效的优化策略进行优化;并综合考虑各个计算划分选择方案的负载平衡性、处理器间的操作依赖性、数据访问的空间局部性和时间局部性四个方面的因素,估算每个方案的执行效率;最后从这些方案中选择一个执行效率最优的方案作为该语句的计算划分.作者已在HPF编译器p-HPF采用时序优化模型实现了对FORALL结构的支持.实验结果表明,该模型具有非常好的通用性,对不同领域多种数据并行问题均取得了理想的加速比.同时,只需略微改动,该模型也可用于其他类型数据并行语句的计算划分.  相似文献   

9.
为减少空间降水插值的计算时间,以MPI并行接口为技术手段,采用数据划分建模方法,实现改进Kriging算法的并行算法.在Linux操作系统上搭建并行计算环境,试验数据表明,该并行算法能有效节省计算时间并具有良好的加速比、并行效率和扩展性.为Kriging插值算法的并行化实现和应用提供有意义的参考.  相似文献   

10.
作为颗粒离散元软件并行化的前期研究,对二维稳态导热问题的有限差分法求解程序进行了并行化处理.并行算法将计算域划分为若干个子域,并将各子域上的迭代计算任务分配给相应的处理器执行.同时,算法考虑负载平衡,并采用计算和通信的重叠技术,提高并行算法的效率.通过对二维稳态温度场导热问题的串/并行程序在曙光TC2600刀片服务器上的计算结果进行比较分析,验证了该并行方法的有效性.实验结果表明,计算耗时与通信耗时的比值越大,并行效率越高.  相似文献   

11.
Optimal semi-oblique tiling   总被引:1,自引:0,他引:1  
For 2D iteration space tiling, we address the problem of determining the tile parameters that minimize the total execution time on a parallel machine. We consider uniform dependency computations tiled so that (at least) one of the tile boundaries is parallel to the domain boundaries. We determine the optimal tile size as a closed form solution. In addition, we determine the optimal number of processors and also the optimal slope of the oblique tile boundary. Our results are based on the BSP model, which assures the portability of the results. Our predictions are justified on a sequence global alignment problem specialized to similar sequences using Fickett's k-band algorithm, for which our optimal semi-oblique tiling yields an improvement of a factor of 2.5 over orthogonal tiling. Our optimal solution requires a block-cyclic distribution of tiles to processors. The best one can obtain with only block distribution (as many authors require) is three times slower. Furthermore, our best running time is within 10 percent of the "predicted theoretical peak" performance of the machine!.  相似文献   

12.
Many computationally-intensive programs, such as those for differential equations, spatial interpolation, and dynamic programming, spend a large portion of their execution time in multiply-nested loops that have a regular stencil of data dependences. Tiling is a well-known compiler optimization that improves performance on such loops, particularly for computers with a multilevel hierarchy of parallelism and memory. Most previous work on tiling is limited in at least one of the following ways: they only handle nested loops of depth two, orthogonal tiling, or rectangular tiles. In our work, we tile loop nests of arbitrary depth using polyhedral tiles. We derive a prediction formula for the execution time of such tiled loops, which can be used by a compiler to automatically determine the tiling parameters that minimizes the execution time. We also explain the notion of rise, a measure of the relationship between the shape of the tiles and the shape of the iteration space generated by the loop nest. The rise is a powerful tool in predicting the execution time of a tiled loop. It allows us to reason about how the tiling affects the length of the longest path of dependent tiles, which is a measure of the execution time of a tiling. We use a model of the tiled iteration space that allows us to determine the length of the longest path of dependent tiles using linear programming. Using the rise, we derive a simple formula for the length of the longest path of dependent tiles in rectilinear iteration spaces, a subclass of the convex iteration spaces, and show how to choose the optimal tile shape.  相似文献   

13.
Presents a theoretical framework for automatically partitioning parallel loops to minimize cache coherency traffic on shared-memory multiprocessors. While several previous papers have looked at hyperplane partitioning of iteration spaces to reduce communication traffic, the problem of deriving the optimal tiling parameters for minimal communication in loops with general affine index expressions has remained open. Our paper solves this open problem by presenting a method for deriving an optimal hyperparallelepiped tiling of iteration spaces for minimal communication in multiprocessors with caches. We show that the same theoretical framework can also be used to determine optimal tiling parameters for both data and loop partitioning in distributed memory multicomputers. Our framework uses matrices to represent iteration and data space mappings and the notion of uniformly intersecting references to capture temporal locality in array references. We introduce the notion of data footprints to estimate the communication traffic between processors and use linear algebraic methods and lattice theory to compute precisely the size of data footprints. We have implemented this framework in a compiler for Alewife, a distributed shared-memory multiprocessor  相似文献   

14.
This paper presents a novel approach for the problem of generating tiled code for nested for-loops, transformed by a tiling transformation. Tiling or supernode transformation has been widely used to improve locality in multilevel memory hierarchies, as well as to efficiently execute loops onto parallel architectures. However, automatic code generation for tiled loops can be a very complex compiler work, especially when nonrectangular tile shapes and iteration space bounds are concerned. Our method considerably enhances previous work on rewriting tiled loops, by considering parallelepiped tiles and arbitrary iteration space shapes. In order to generate tiled code, we first enumerate all tiles containing points within the iteration space and, second, sweep all points within each tile. For the first subproblem, we refine upon previous results concerning the computation of new loop bounds of an iteration space that has been transformed by a nonunimodular transformation. For the second subproblem, we transform the initial parallelepiped tile into a rectangular one, in order to generate efficient code with the aid of a nonunimodular transformation matrix and its Hermite Normal Form (HNF). Experimental results show that the proposed method significantly accelerates the compilation process and generates much more efficient code.  相似文献   

15.
This paper presents a new method that can be applied by a parallelizing compiler to find, without user intervention, the iteration and data decompositions that minimize communication and load imbalance overheads in parallel programs targeted at NUMA architectures. One of the key ingredients in our approach is the representation of locality as a locality-communication graph (ICG) and the formulation of the compiler technique as a mixed integer nonlinear programming (MINLP) optimization problem on this graph. The objective function and constraints of the optimization problem model communication costs and load imbalance. The solution to this optimization problem is a decomposition that minimizes the parallel execution overhead. This paper summarizes the process of how the compiler extracts the locality information from a nonannotated code and focuses on how this compiler can derive the optimization problem, solve it, and generate the parallel code with the automatically selected iteration and data distributions. In addition, we include a discussion about our model and the solutions - the decompositions - that it provides. The approach presented in the paper is evaluated using several benchmarks. The experimental results demonstrate that the MINLP formulation does not increase compilation time significantly and that our framework generates very efficient iteration/data distributions for a variety of NUMA machines.  相似文献   

16.
Iteration space tiling is a common strategy used by parallelizing compilers and in performance tuning of parallel codes. We address the problem of determining the tile size that minimizes the total execution time. We restrict our attention to uniform dependency computations with two-dimensional, parallelogram-shaped iteration domain which can be tiled with lines parallel to the domain boundaries. The target architecture is a linear array (or a ring). Our model is developed in two steps. We first abstract each tile by two simple parameters, namely tile periodPtand intertile latencyLt. We formulate and partially resolve the corresponding optimization problem independent of the machine and program. Next, we refine the model with realistic machine and program parameters, yielding a discrete nonlinear optimization problem. We solve this analytically, yielding a closed form solution, which can be used by a compiler before code generation.  相似文献   

17.
全局同步计算模型简单易用,但是路障同步导致收敛速度变慢。以顶点为中心的异步迭代虽然提高了收敛速度,但在计算节点之间需要频繁发送信息。在Spark环境下提出一种基于子图的异步迭代更新方法。在子图之间建立异步消息通信连接后,子图能以异步方式发送数据块;通过多线程同步避免数据读写冲突,保证异步更新时顶点状态的一致性。在大规模样本数据集上分别从收敛结果、收敛速度和通信代价验证方法有效性。实验结果表明,与全局同步迭代相比,该方法有效提高了计算收敛速度。与顶点为中心的异步更新方式相比,该方法在收敛时间上略有增长,但是显著降低了通信开销。  相似文献   

18.
This paper proposes a new method for the problem of minimizing the execution time of nested for-loops using a tiling transformation. In our approach, we are interested not only in tile size and shape according to the required communication to computation ratio, but also in overall completion time. We select a time hyperplane to execute different tiles much more efficiently by exploiting the inherent overlapping between communication and computation phases among successive, atomic tile executions. We assign tiles to processors according to the tile space boundaries, thus considering the iteration space bounds. Our schedule considerably reduces overall completion time under the assumption that some part from every communication phase can be efficiently overlapped with atomic, pure tile computations. The overall schedule resembles a pipelined datapath where computations are not anymore interleaved with sends and receives to nonlocal processors. We survey the application of our schedule to modern communication architectures. We performed two sets of experimental results, one using MPI primitives over FastEthernet and one using the SISCI API over an SCI network. In both cases, the total completion time is significantly reduced.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号