1.
Building on a simplification method, a distributed parallel partitioning algorithm with asynchronous communication is proposed for solving fuzzy linear systems on the MIMD model. The algorithm balances the load effectively, and its time complexity and communication complexity are analyzed.
2.
《计算机应用与软件》 (Computer Applications and Software), 2016(9)
For the Cholesky factorization of large real symmetric positive definite matrices, a concrete implementation on graphics processing units (GPUs) is presented. Volkov's hybrid parallel algorithm for computing the Cholesky factorization is analyzed in detail, and on that basis a more balanced three-stage hybrid scheduling scheme is derived from the measured compute capabilities of the host CPU and the GPU, further reducing CPU idle time and avoiding GPU idle periods. Numerical experiments show that for matrices of order greater than 7000 the new hybrid scheduling algorithm achieves a speedup of more than 5x over the standard MKL routine, as well as a significant performance improvement over Volkov's original hybrid algorithm.
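For context (not the paper's code), hybrid CPU/GPU schedules of this kind are built on the blocked right-looking Cholesky factorization, in which a small diagonal-block factorization and a panel solve are followed by a large trailing-matrix update; the update carries most of the floating-point work and is the natural candidate for the GPU. A minimal CPU-only sketch of that blocked structure, with illustrative sizes and names:

```c
#include <math.h>

#define N  512   /* matrix order (illustrative) */
#define NB 64    /* block size (illustrative)   */

/* Unblocked Cholesky of the nb-by-nb diagonal block starting at (k,k);
 * only the lower triangle of A is referenced and overwritten with L. */
static void chol_diag(double A[N][N], int k, int nb)
{
    for (int j = k; j < k + nb; j++) {
        for (int p = k; p < j; p++)
            A[j][j] -= A[j][p] * A[j][p];
        A[j][j] = sqrt(A[j][j]);
        for (int i = j + 1; i < k + nb; i++) {
            for (int p = k; p < j; p++)
                A[i][j] -= A[i][p] * A[j][p];
            A[i][j] /= A[j][j];
        }
    }
}

/* Blocked right-looking Cholesky: factor the diagonal block, solve the
 * panel below it, then apply the SYRK/GEMM-like trailing update.  In a
 * hybrid scheme the trailing update (the bulk of the flops) is the part
 * handed to the GPU. */
void blocked_cholesky(double A[N][N])
{
    for (int k = 0; k < N; k += NB) {
        int nb = (k + NB <= N) ? NB : N - k;

        chol_diag(A, k, nb);                         /* small, CPU-friendly */

        for (int i = k + nb; i < N; i++)             /* panel solve (TRSM)  */
            for (int j = k; j < k + nb; j++) {
                for (int p = k; p < j; p++)
                    A[i][j] -= A[i][p] * A[j][p];
                A[i][j] /= A[j][j];
            }

        for (int i = k + nb; i < N; i++)             /* trailing update     */
            for (int j = k + nb; j <= i; j++)
                for (int p = k; p < k + nb; p++)
                    A[i][j] -= A[i][p] * A[j][p];
    }
}
```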
3.
In systems engineering and scientific computing, a single processor often cannot meet the demands of large block tridiagonal linear systems, and parallel processing is an effective way to improve both efficiency and accuracy. A parallel algorithm for solving block tridiagonal linear systems in a distributed environment is proposed. The algorithm fully exploits the special structure of the coefficient matrix: an iterative scheme is constructed through an appropriate splitting of the coefficient matrix, so that only three communications between adjacent processors are required. A sufficient condition for convergence is given theoretically. Finally, numerical simulations on an HP rx2600 cluster show that the computed results agree with the theory and that parallel efficiency and accuracy are improved.
4.
To address the increased parallelization overhead and thread load imbalance that arise when the Cholesky factorization is parallelized with OpenMP, hotspot analysis of the serial program is performed with a parallel performance analysis tool, and a task-based multicore parallel Cholesky factorization algorithm is proposed. The algorithm partitions the large loops into independent small tasks and uses work stealing and dynamic load balancing so that the tasks can be executed in parallel. The parallel program is debugged and tuned with Parallel Amplifier; experimental results show a substantial performance improvement.
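As a rough illustration of the task-based formulation (a generic sketch, not the authors' implementation), a tiled Cholesky factorization can be expressed as OpenMP tasks whose depend clauses let the runtime schedule independent tiles in parallel and balance the load dynamically; the tile count T, tile size B, and kernel names below are assumptions:

```c
#include <math.h>
#include <omp.h>

#define T 8    /* tiles per dimension (illustrative) */
#define B 64   /* tile size (illustrative)           */

/* Cholesky of one diagonal tile (lower triangle, in place). */
static void potrf_tile(double D[B][B]) {
    for (int j = 0; j < B; j++) {
        for (int p = 0; p < j; p++) D[j][j] -= D[j][p] * D[j][p];
        D[j][j] = sqrt(D[j][j]);
        for (int i = j + 1; i < B; i++) {
            for (int p = 0; p < j; p++) D[i][j] -= D[i][p] * D[j][p];
            D[i][j] /= D[j][j];
        }
    }
}

/* C := C * L^{-T}, where L is the factored diagonal tile of the column. */
static void trsm_tile(double L[B][B], double C[B][B]) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++) {
            for (int p = 0; p < j; p++) C[i][j] -= C[i][p] * L[j][p];
            C[i][j] /= L[j][j];
        }
}

/* D := D - C * C^T (lower triangle only, used for diagonal tiles). */
static void syrk_tile(double C[B][B], double D[B][B]) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j <= i; j++)
            for (int p = 0; p < B; p++) D[i][j] -= C[i][p] * C[j][p];
}

/* D := D - Ca * Cb^T (off-diagonal tiles). */
static void gemm_tile(double Ca[B][B], double Cb[B][B], double D[B][B]) {
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            for (int p = 0; p < B; p++) D[i][j] -= Ca[i][p] * Cb[j][p];
}

/* A[i][j] is the B-by-B tile in block row i, block column j (lower part). */
void tiled_cholesky(double A[T][T][B][B]) {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < T; k++) {
        #pragma omp task depend(inout: A[k][k])
        potrf_tile(A[k][k]);

        for (int i = k + 1; i < T; i++) {
            #pragma omp task depend(in: A[k][k]) depend(inout: A[i][k])
            trsm_tile(A[k][k], A[i][k]);
        }
        for (int i = k + 1; i < T; i++) {
            #pragma omp task depend(in: A[i][k]) depend(inout: A[i][i])
            syrk_tile(A[i][k], A[i][i]);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k], A[j][k]) depend(inout: A[i][j])
                gemm_tile(A[i][k], A[j][k], A[i][j]);
            }
        }
    }   /* barrier at the end of single waits for all generated tasks */
}
```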
5.
A fine-grained parallel algorithm for Cholesky factorization
This paper proposes a fine-grained pipelined parallel algorithm for Cholesky factorization. The algorithm can handle data of arbitrary size and fully exploits the fine-grained parallelism offered by FPGA accelerators. Experiments show that the algorithm scales well: 36 processing elements (PEs) can be integrated on a Xilinx XC5VLX330 FPGA, and for a matrix of order 16384 running at 200 MHz the design achieves 14.3 GFLOPS.
6.
This paper presents a tile-algorithm-based parallel QR factorization of dense matrices and its implementation. The idea of the tile algorithm is to partition the full matrix into blocks and store the data of each block contiguously. Each tile is first factored independently, and the other tiles receive the data produced by the current tile's factorization to update their own contents. We implemented both a serial and a parallel tile algorithm using a hybrid MPI and OpenMP parallel programming model, validated the parallel algorithm on the "元" supercomputer, and compared it with the PLASMA library; the efficiency and scalability of our code exceed those of PLASMA, and good scalability is demonstrated when running on multiple nodes.
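The tile layout the abstract describes, with the matrix partitioned into blocks and each block stored contiguously, can be sketched as a simple packing step; this is a generic illustration with assumed names and a square, evenly divisible matrix, not code from the paper or from PLASMA. Once tiles are contiguous, each tile can be factored, updated, or shipped over MPI as a single buffer.

```c
#include <stdlib.h>
#include <string.h>

/* Pack a row-major n-by-n matrix into (n/b)*(n/b) tiles of size b-by-b
 * (n assumed divisible by b for simplicity).  Tile (ti, tj) occupies
 * b*b consecutive doubles in the returned array. */
double *pack_tiles(const double *A, int n, int b)
{
    int nt = n / b;                       /* tiles per dimension */
    double *tiles = malloc((size_t)n * n * sizeof *tiles);
    if (!tiles) return NULL;

    for (int ti = 0; ti < nt; ti++)
        for (int tj = 0; tj < nt; tj++) {
            double *dst = tiles + (size_t)(ti * nt + tj) * b * b;
            for (int r = 0; r < b; r++)   /* copy one tile row at a time */
                memcpy(dst + r * b,
                       A + (size_t)(ti * b + r) * n + tj * b,
                       (size_t)b * sizeof *dst);
        }
    return tiles;
}
```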
7.
Building on a simplification method, a distributed parallel partitioning algorithm with asynchronous communication is proposed for solving fuzzy linear systems on the MIMD model. The algorithm balances the load effectively, and its time complexity and communication complexity are analyzed.
8.
Finding dense subgraphs in large-scale networks has very wide applications, such as community detection and spam detection. To discover dense subgraphs quickly and effectively in large-scale network data, this paper proposes GASTruss, a distributed k-truss algorithm based on the GAS (Gather-Apply-Scatter) programming model. The algorithm uses the GAS pattern for data synchronization and iteration, effectively overcoming problems of traditional parallel algorithms such as repeated computation and poor handling of data with heavy dependencies. Experiments on the GraphLab platform show that, compared with a serial k-truss algorithm and the MapReduce-based GPTruss algorithm, GASTruss scales well with data size and reduces running time while preserving the quality of the results.
9.
In out-of-core computation, I/O operations are relatively slow, so file access accounts for a large share of the running time, and overlapping file operations with computation can greatly improve efficiency. Software data prefetching is an effective technique for hiding storage latency: data are read from disk into a buffer before they are actually used, which raises the cache hit rate and reduces the time spent reading data. By setting up two buffers that alternately hold the current data block and the next block to be read, memory accesses are always served from the buffer, and the out-of-core performance of a parallel Cholesky factorization program is improved substantially. The ratio of I/O time to CPU time remains a major factor affecting efficiency.
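A minimal sketch of the two-buffer prefetching scheme described above, in C with POSIX AIO: while the current block is processed, the read of the next block proceeds asynchronously into the other buffer, so I/O is hidden behind computation. The file layout, block size, and process_block are illustrative assumptions; the cited work applies the idea to the out-of-core blocks of a parallel Cholesky factorization.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_BYTES (1 << 20)   /* size of one data block (illustrative) */

/* Placeholder for the real work on one block (e.g. one out-of-core block
 * of the factorization); assumed for this sketch. */
static void process_block(const char *buf, ssize_t nbytes) {
    (void)buf; (void)nbytes;
}

int process_file(const char *path, long nblocks)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    static char buf[2][BLOCK_BYTES];   /* two buffers, used alternately */
    struct aiocb cb;

    /* Prime the pipeline: synchronously read block 0 into buffer 0. */
    ssize_t got = pread(fd, buf[0], BLOCK_BYTES, 0);

    for (long k = 0; k < nblocks && got > 0; k++) {
        int cur = (int)(k & 1), nxt = cur ^ 1;

        /* Start the asynchronous prefetch of block k+1 into the other buffer. */
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[nxt];
        cb.aio_nbytes = BLOCK_BYTES;
        cb.aio_offset = (off_t)(k + 1) * BLOCK_BYTES;
        int prefetching = (k + 1 < nblocks) && aio_read(&cb) == 0;

        process_block(buf[cur], got);          /* compute while the read runs */

        if (!prefetching) break;
        const struct aiocb *const list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)  /* wait for the prefetch */
            aio_suspend(list, 1, NULL);
        got = aio_return(&cb);
    }
    close(fd);
    return 0;
}
```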
10.
In computations on graphics processing unit (GPU) platforms, device memory is much smaller than host memory, so the data to be processed usually cannot be copied from host memory to device memory in a single transfer. To address this, an overlapped Cholesky factorization algorithm is proposed. Using a prefetching technique, data transfer is overlapped with computation to reduce device idle time: device memory is divided into two buffers that alternately hold the data for the current computation and the data for the next one, and the transfers between device memory and host memory are completed while the device is computing. Experimental results show that the algorithm effectively improves computational efficiency.
11.
Alan George, Joseph W. H. Liu, Esmond Ng. Parallel Computing, 1989, 10(3): 287-298
We consider the problem of reducing data traffic among processor nodes during the parallel factorization of a sparse matrix on a hypercube multiprocessor. A task assignment strategy based on the structure of an elimination tree is presented. This assignment is aimed at achieving load balancing among the processors and also reducing the amount of processor-to-processor data communication. An analysis of regular grid problems is presented, providing a bound on communication volume generated by the new strategy, and showing that the allocation scheme is optimal in the asymptotic sense. Some experimental results on the performance of this scheme are presented.
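Entries 11, 12, and 14 in this list all schedule factorization tasks using the elimination tree of the sparse matrix. As generic background (not the task-assignment strategy of any of these papers), the tree itself can be computed in near-linear time from the nonzero pattern by the classical parent-finding pass with path compression; a sketch in C, taking the pattern in compressed sparse column form:

```c
#include <stdlib.h>

/* Compute the elimination tree of a sparse symmetric matrix whose nonzero
 * pattern is given in compressed sparse column (CSC) form: Ap[0..n],
 * Ai[0..Ap[n]-1].  Only entries A(i,k) with i < k drive the computation;
 * other entries are ignored.  On return parent[k] is the parent of column
 * k in the tree, or -1 for a root.  The "ancestor" array implements path
 * compression, so the cost is nearly linear in the number of nonzeros. */
int etree(int n, const int *Ap, const int *Ai, int *parent)
{
    int *ancestor = malloc((size_t)n * sizeof *ancestor);
    if (!ancestor) return -1;

    for (int k = 0; k < n; k++) {
        parent[k] = -1;
        ancestor[k] = -1;
        for (int p = Ap[k]; p < Ap[k + 1]; p++) {
            /* climb from row index Ai[p] toward the root, compressing paths */
            for (int i = Ai[p]; i != -1 && i < k; ) {
                int inext = ancestor[i];
                ancestor[i] = k;                 /* path compression        */
                if (inext == -1) parent[i] = k;  /* i had no parent yet     */
                i = inext;
            }
        }
    }
    free(ancestor);
    return 0;
}
```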
12.
In this paper, we consider the problem of finding fill-preserving sparse matrix orderings for parallel factorization. That is, given a large sparse symmetric and positive definite matrix A that has been ordered by some fill-reducing ordering, we want to determine a reordering that is appropriate in terms of preserving the sparsity and minimizing the cost of performing the Cholesky factorization in parallel. Past research on this problem is based on the elimination tree model, in which each node represents the task of factoring a column, and which can therefore be seen as a coarse-grained task dependence model. To exploit more parallelism, Joseph Liu proposed a medium-grained task model, called the column task graph, and showed that it is amenable to shared-memory supercomputers. Based on the column task graph, we devise a greedy reordering algorithm and show that it finds the optimal ordering among the class of all fill-preserving orderings of the given sparse matrix A.
13.
This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which calls parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm, which calls serial Level 3 BLAS, provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%, depending on the problem size and the number of processors used. This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990.
14.
This paper presents a solution to the problem of partitioning the work of sparse matrix factorization among the individual processors of a multiprocessor system. The proposed task assignment strategy is based on the structure of the elimination tree associated with the given sparse matrix. The goal of the task scheduling strategy is to achieve load balancing and a high degree of concurrency among the processors while reducing the amount of processor-to-processor data communication, even for arbitrarily unbalanced elimination trees. This is important because popular fill-reducing ordering methods, such as the minimum degree algorithm, often produce unbalanced elimination trees. Results from the Intel iPSC/2 are presented for various finite-element problems using both nested dissection and minimum degree orderings. Research supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.
15.
A parallel local block factorization preconditioner for block tridiagonal matrices
This paper first analyzes the eigenvalue distribution of the parallel local block factorization preconditioner and shows that it is essentially the same as that of the serial local block factorization preconditioner, which theoretically guarantees high efficiency when the preconditioner is used in parallel computation. It then analyzes the factors that affect the speedup of parallel computation with this preconditioner, explaining why efficiency inevitably degrades as the number of processors grows while the problem size stays fixed. Finally, the preconditioner is compared with several preconditioners constructed by multisplitting techniques on a cluster of six PCs; the experimental results show that it is more efficient than the other preconditioning methods. Experiments on a supercomputer further show that the preconditioner remains effective when the number of processors is relatively large.
16.
Based on an analysis of solving linear systems by Cholesky factorization, a graph model of the triangularization of a symmetric positive definite matrix by Cholesky factorization is established. Building on this model and on the characteristics of mesh-structured P/G (power/ground) networks, a fast P/G network analysis algorithm is proposed. Experiments show that the algorithm greatly reduces the analysis time and memory usage for mesh-structured P/G networks.
17.
Marc Baboulin, Luc Giraud, Serge Gratton, Julien Langou. Concurrency and Computation, 2007, 19(4): 483-502
In this paper we propose a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency of a full storage format for a wide range of operations. This work has been motivated by the fact that, in contrast to sequential linear algebra libraries (e.g. LAPACK), no routine or format in the currently available parallel distributed libraries handles packed matrices. The proposed algorithms exclusively use the existing ScaLAPACK computational kernels, which proves the generality of the approach, provides easy portability of the code, and allows efficient re-use of existing software. The performance results obtained for the Cholesky factorization show that our packed format performs as well as or better than the ScaLAPACK full-storage algorithm for a small number of processors. For a larger number of processors, the ScaLAPACK full-storage routine performs slightly better until each processor runs out of memory.
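For intuition about packed storage (the serial idea only; the distributed format built on ScaLAPACK kernels in the paper is more involved), the standard column-major lower-packed layout stores just the n(n+1)/2 entries of the lower triangle, and the location of any element can be computed directly; a small sketch with assumed names:

```c
#include <stddef.h>

/* Index of element (i, j), with i >= j, of an n-by-n symmetric matrix
 * stored in column-major lower-packed form.  Column k holds n-k entries,
 * so columns 0..j-1 occupy j*n - j*(j-1)/2 slots, and element (i, j)
 * sits at offset i-j within column j; the two terms combine as below.
 * The packed array has n*(n+1)/2 entries in total. */
static size_t packed_index(size_t n, size_t i, size_t j)
{
    return i + j * n - j * (j + 1) / 2;
}
```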
18.
John M. Conroy. Parallel Computing, 1990, 16(2-3): 139-156
Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine-grain algorithm for a square array of processors. Their algorithm uses O(N^2) processors, each with O(N) memory, to factor an N^2 by N^2 sparse matrix whose graph is an N × N mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6] [8] give a medium-grain algorithm for hypercube architectures, while George et al. [7] give an algorithm for shared-memory machines. These papers present a column-oriented approach which can exploit O(N) parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column-oriented scheme which achieves up to 75% efficiency and O(N) parallelism. In this paper, we present a medium- to fine-grain algorithm for a P × P array of processors with local memory. This algorithm can exploit up to O(N^2) parallelism. The efficiency of the fine-grain version is comparable to [16], while the medium-grain version achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, overlapping computation and communication, and the use of Level 3 BLAS-like primitives. In addition to its high efficiency, its memory requirement is optimal: only O(N^2 log N / P^2) words of memory are needed per processor.
19.
This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well-known reduction of 2D Delaunay triangulation to finding the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log^2 n) to O(n log n). The depth (parallel time) is O(log^3 n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms, leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work of the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster. Received June 1, 1997; revised March 10, 1998.
20.
In this paper new parallel algorithms to solve the Lyapunov equations for the Cholesky factor using Hammarling's method on message passing multiprocessors are described. These algorithms are based on previous work carried out on the parallel solution of triangular linear systems by using row block data distribution and a wavefront of antidiagonals. The algorithms are theoretically analyzed and experimental results obtained on an SGI Power Challenge and a Cray T3D are presented.