Similar literature
 19 similar documents found (search time: 187 ms)
1.
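Design of an efficient parallel block algorithm for tridiagonalization of symmetric matrices   Total citations: 1, self-citations: 0, citations by others: 1
In numerical matrix computations, block algorithms are usually more efficient than non-block algorithms, but they also make parallel algorithms harder to design and implement. In a parallel solver for the generalized dense symmetric eigenproblem, parallel block algorithms can be constructed for the Cholesky factorization of a symmetric positive definite matrix, the tridiagonalization of a symmetric matrix, and the back-transformation step. This paper concentrates on the representative case of symmetric tridiagonalization and presents a design method for a parallel block tridiagonalization algorithm under a non-block storage scheme. It analyzes how the algorithmic block size relates to the matrix size and the number of processors. Experiments on the DeepComp 6800 (深腾6800) show that the algorithm performs well and achieves higher performance than ScaLAPACK.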

2.
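沈雁 (Shen Yan), 戴瑜兴 (Dai Yuxing). 《计算机工程》 (Computer Engineering), 2019, 45(2): 284-289
In the clMAGMA library for the OpenCL parallel computing framework, the Cholesky factorization uses large-block parallelism, which cannot make full use of the GPU's fast local memory and requires multiple GPU-CPU data transfers during the computation. To address this, the paper proposes a small-block parallel method that makes full use of the GPU's fast local memory and reuses the inverses of matrix sub-blocks. The result is an efficient Cholesky factorization of symmetric positive definite matrices that can also be applied to the large positive definite matrices arising in bundle adjustment for 3D vision. Experimental results show that the method's Cholesky factorization is more than 50% faster than clMAGMA and, for bundle adjustment problems, about 38 times faster than the Eigen library used in Ceres Solver.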

3.
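Solving the generalized dense symmetric eigenproblem is a central task in many areas of applied science and engineering, and an important part of computations in computational electromagnetics, electronic structure, finite element modeling, and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is the key computational step. This paper presents an MPI+CUDA implementation, for GPU clusters, of a block algorithm for this transformation to standard form. To suit the GPU cluster architecture, the algorithm combines the Cholesky factorization of the positive definite matrix with the traditional block algorithm for reducing the generalized eigenproblem to standard form, which removes unnecessary communication overhead and increases parallelism. In the MPI+CUDA implementation, data transfers between GPU and CPU are overlapped with data copies inside the GPU, which hides the time spent on copying and improves performance. The paper also gives a fully parallel point-to-point transposition algorithm between the row and column communicators of a two-dimensional communication grid, and an MPI+CUDA parallel block algorithm for solving the triangular matrix equation BX=A with multiple right-hand sides. The algorithm was tested on the "元" supercomputer at the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node has 2 Nvidia Tesla K20 GPGPU cards and 2 Intel E5-2680 V2 processors. Using up to 32 GPUs on matrices of various sizes, it achieved good speedup, performance, and scalability. With 32 GPUs and a matrix of order 50000×50000, peak performance reached about 9.21 Tflops.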
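The reduction to standard form mentioned in this abstract can be written out as a short derivation. This is the textbook formulation for the real symmetric case, not necessarily the exact variant used in the paper; the symbols L, C, Y, x, y and λ are notation assumed here.

```latex
% Generalized problem, with A symmetric and B symmetric positive definite:
%   A x = \lambda B x
\begin{align*}
  B &= L L^{T}                      && \text{(Cholesky factorization)} \\
  C &= L^{-1} A L^{-T}, \quad y = L^{T} x \\
  C y &= \lambda y                  && \text{(standard symmetric problem)}
\end{align*}
% C is formed without explicit inverses, by two triangular solves
% with multiple right-hand sides:
\begin{align*}
  L Y &= A      && \text{solve for } Y = L^{-1} A \\
  C L^{T} &= Y  && \text{solve for } C = Y L^{-T}
\end{align*}
```

Both steps are triangular matrix equations with many right-hand sides, which is the kind of kernel (BX=A) the abstract describes as a parallel block algorithm.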

4.
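This paper studies the computation of the triangular factors of a square, real, symmetric positive definite matrix on a multiprocessor MIMD computer. Parallel algorithms based on Cholesky decomposition and Gaussian elimination are derived and analyzed in terms of speedup and efficiency when the number of processors is (i) unlimited and (ii) O(n), where n is the order of the matrix. For case (ii), it is shown that the parallel elimination method can achieve the same speedup as the parallel Cholesky method while using only half as many processors.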

5.
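The incomplete Cholesky factorization preconditioned conjugate gradient (ICCG) method is an effective way to solve large sparse symmetric positive definite linear systems. However, ICCG requires two sparse triangular solves in every iteration, and the inherently sequential nature of sparse triangular solves is the bottleneck for running ICCG in parallel on a GPU. This paper presents an effective GPU-accelerated method for sparse triangular solves. To expose more multithreaded parallelism on the GPU, it applies level scheduling to the sparse triangular matrices produced by the incomplete Cholesky factorization. To improve parallel performance further, it reorders the coefficient matrix with the approximate minimum degree (AMD) algorithm before level scheduling and sorts the sparse triangular matrix by level afterwards, which reduces the number of levels produced by level scheduling and improves the GPU memory access pattern of the triangular solve. Numerical experiments show that, compared with an ICCG implementation based on NVIDIA CUSPARSE, the method improves performance by more than 100% on average.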
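The idea of level scheduling can be illustrated with a minimal serial sketch. It is not the paper's GPU implementation: the dict-of-rows storage and the function names `level_schedule` and `solve_lower_by_levels` are assumptions made for this illustration. Rows in the same level do not depend on each other, so a parallel implementation could process each level's rows concurrently.

```python
def level_schedule(L):
    """Group the rows of a sparse lower-triangular matrix into levels.

    L is a list of dicts (hypothetical storage): L[i][j] is the nonzero in
    row i, column j, with j <= i. Row i depends on every row j < i that has
    a nonzero in column j of row i.
    """
    n = len(L)
    level = [0] * n
    for i in range(n):
        deps = [j for j in L[i] if j < i]
        # A row sits one level after its deepest dependency.
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    levels = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        levels[lv].append(i)
    return levels


def solve_lower_by_levels(L, b, levels):
    """Solve L x = b level by level (forward substitution)."""
    x = [0.0] * len(b)
    for rows in levels:
        # Rows within one level are mutually independent; this inner loop
        # is the part a GPU kernel would run in parallel.
        for i in rows:
            s = b[i] - sum(v * x[j] for j, v in L[i].items() if j != i)
            x[i] = s / L[i][i]
    return x
```

Fewer levels mean fewer sequential steps, which is why the abstract's reordering step aims to reduce the level count.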

6.
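The performance of a parallel solver for the generalized Hermitian eigenproblem depends on many factors, including the choice of parallel algorithm and the matrix distribution strategy. Based on block storage and a block algorithm strategy, this paper proposes a new parallel algorithm for the transformation to standard form. The algorithm integrates the Cholesky factorization into the transformation of the generalized eigenproblem to standard form, which reduces the communication overhead of existing parallel algorithms and increases parallelism. The new algorithm significantly improves on the performance and scalability of existing parallel algorithms. The paper also gives an efficient parallel block algorithm for solving the triangular matrix equation AX=B with multiple right-hand sides. Tests with PSEPS, the authors' own parallel eigenproblem software package, show that the new parallel algorithm is roughly twice as fast as the traditional one and scales well.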

7.
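张云泉 (Zhang Yunquan), 施巍松 (Shi Weisong). 《软件学报》 (Journal of Software), 2000, 11(12): 1674-1680
When writing parallel programs, users usually treat the physical processors as a logical processor (process) grid to simplify the implementation of algorithms. As the number of available processors grows, so does the number of possible grid shapes, and choosing a grid shape that lets a message-passing parallel program exploit the potential performance of the parallel machine becomes a pressing problem. After introducing the minimum-degree communication point set method, which is based on the concept of communication points, the paper analyzes the communication patterns of parallel programs in an attempt to solve the problem of selecting the most suitable processor grid for parallel programs whose grid choice does not depend on load balance. By analyzing the communication point set degree of a parallel test program in the ScaLAPACK package, namely parallel Cholesky (factorization of symmetric positive definite matrices), the method successfully selected the most suitable processor grid shape, in agreement with experimental results.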

8.
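Anti-jamming algorithms for array signals often need to invert a covariance matrix. Cholesky decomposition exploits the Hermitian positive definite property of the covariance matrix and greatly reduces the computational cost of the inversion. This paper introduces the mathematical principles of Cholesky decomposition and proposes an architecture suited to FPGA implementation. Compared with a traditional fixed-point implementation, the floating-point implementation greatly improves the accuracy of the results. Because Cholesky decomposition involves floating-point square roots, the paper introduces a reciprocal square root method to speed up the square root computation. Based on simulation and measurement, the implementation with the best trade-off between resources and speed was selected.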
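As a rough illustration of how a reciprocal square root can replace both the square root and the divisions in Cholesky decomposition, here is a software sketch. It assumes a real symmetric positive definite matrix (the abstract's covariance matrices are Hermitian) and uses Newton's iteration, which is one common way to compute 1/sqrt(x). The abstract does not say which reciprocal square root scheme the paper uses, so `rsqrt_newton` and `cholesky_rsqrt` are hypothetical names and not the paper's design.

```python
import math


def rsqrt_newton(x, iterations=8):
    """Approximate 1/sqrt(x) for x > 0 with Newton's iteration.

    The update y <- y * (1.5 - 0.5 * x * y * y) uses only multiplications
    and subtractions, which is why it is attractive in hardware.
    """
    # Initial guess from the binary exponent: x = m * 2**e with 0.5 <= m < 1,
    # so x * y0**2 lies in [0.5, 2), inside the convergence region.
    _, e = math.frexp(x)
    y = math.ldexp(1.0, -(e // 2))
    for _ in range(iterations):
        y = y * (1.5 - 0.5 * x * y * y)
    return y


def cholesky_rsqrt(A):
    """Lower-triangular L with A = L L^T, using rsqrt instead of sqrt and divide."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        inv = rsqrt_newton(d)
        L[j][j] = d * inv  # sqrt(d) = d * (1 / sqrt(d))
        for i in range(j + 1, n):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s * inv  # multiply by 1/sqrt(d) instead of dividing
    return L
```

The point of the design is that one reciprocal square root per column yields both the diagonal entry and the scaling factor for the whole column.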

9.
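For the Cholesky factorization of large real symmetric positive definite matrices, this paper gives a concrete implementation on a graphics processing unit (GPU). It analyzes in detail Volkov's hybrid parallel algorithm for Cholesky factorization and, building on it and on the measured performance of the CPU and GPU at hand, proposes a more balanced three-stage hybrid scheduling scheme that further reduces CPU idle time and avoids GPU idle time. Numerical experiments show that for matrix orders above 7000 the new hybrid scheduling algorithm achieves a speedup of more than 5 over the standard MKL routine, and a significant performance gain over Volkov's original hybrid algorithm.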

10.
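To meet the memory requirements of large linear systems, this paper proposes a parallel Gaussian elimination scheme for symmetric systems. During Gaussian elimination of a symmetric system, the remaining square submatrix stays symmetric, so the parallel computation reads and updates only the triangular part of the data, which reduces storage and improves parallel efficiency. Tests show that the scheme has better parallel efficiency than the traditional algorithm and can be applied to large-scale numerical computation with symmetric systems.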
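The observation that the trailing submatrix stays symmetric can be shown with a small serial sketch that stores and updates only the lower triangle. This is not the paper's parallel scheme: the function name `solve_symmetric` and the row-list storage are assumptions for this illustration, and the sketch does no pivoting, so it assumes all pivots are nonzero (for example, a positive definite matrix).

```python
def solve_symmetric(lower, b):
    """Solve A x = b for symmetric A given only its lower triangle.

    lower[i] holds A[i][0..i]. Elimination touches only these entries and
    ends up with A = L D L^T stored in place (D on the diagonal, the
    multipliers of L below it).
    """
    n = len(b)
    a = [row[:] for row in lower]
    for k in range(n):
        pivot = a[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        for i in range(k + 1, n):
            factor = a[i][k] / pivot
            for j in range(k + 1, i + 1):
                # a[j][k] stands in for the upper entry A[k][j] by symmetry.
                a[i][j] -= factor * a[j][k]
        for i in range(k + 1, n):
            a[i][k] /= pivot  # store the multiplier l_ik
    x = list(b)
    for i in range(n):  # forward substitution with unit lower L
        x[i] -= sum(a[i][j] * x[j] for j in range(i))
    for i in range(n):  # diagonal scaling with D
        x[i] /= a[i][i]
    for i in reversed(range(n)):  # back substitution with L^T
        x[i] -= sum(a[j][i] * x[j] for j in range(i + 1, n))
    return x
```

Only about half of the matrix entries are stored and updated, which is the storage saving the abstract refers to.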

11.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of a symmetric matrix for real and complex data types. These programs are built on the fast BLAS library of Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for Cholesky factorization. The execution time, measured performance, and speedup for each problem on Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached.

12.
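Within the interior point method (IPM) framework, this paper derives the reduced and the simplest correction equations for linear programming (LP) problems whose coefficient matrix has a block bordered structure, and proves that the diagonal blocks of the simplest correction equation are positive definite. Combining the Cholesky factorization of positive definite matrices with a decoupling technique, it designs a parallel method for solving the correction equations and gives the structure of a parallel interior point algorithm for LP. Numerical experiments in a cluster environment show that the proposed algorithm has good speedup and scalability and is suitable for large-scale structured LP problems.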

13.
Jürgen Garloff. Computing, 2012, 94(2-4): 97-107
The paper considers systems of linear interval equations, i.e., linear systems where the coefficients of the matrix and the right-hand side vary between given bounds. We focus on symmetric matrices and consider direct methods for the enclosure of the solution set of such a system. One of these methods is the interval Cholesky method, which is obtained from the ordinary Cholesky decomposition by replacing the real numbers by the related intervals and the real operations by the respective interval operations. We present a method by which the diagonal entries of the interval Cholesky factor can be tightened for positive definite interval matrices, such that a breakdown of the algorithm can be prevented. In the case of positive definite symmetric Toeplitz matrices, a further tightening of the diagonal entries and also of other entries of the Cholesky factor is possible. Finally, we numerically compare the interval Cholesky method with interval variants of two methods which exploit the Toeplitz structure with respect to the computing time and the quality of the enclosure of the solution set.

14.
LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix-matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different ratios of computation to communication performance, the computational block size that maximizes an algorithm's performance differs from machine to machine. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes achieving good performance on a machine nearly impossible. In such a case, getting better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with an 'algorithmic blocking' on two-dimensional block cyclic data distribution. With the algorithmic blocking, it is possible to obtain near-optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd.

15.
In this paper, we introduce and analyze a modification of the Hermitian and skew-Hermitian splitting iteration method for solving a broad class of complex symmetric linear systems. We show that the modified Hermitian and skew-Hermitian splitting (MHSS) iteration method is unconditionally convergent. Each iteration of this method requires the solution of two linear systems with real symmetric positive definite coefficient matrices. These two systems can be solved inexactly. We consider acceleration of the MHSS iteration by Krylov subspace methods. Numerical experiments on a few model problems are used to illustrate the performance of the new method.
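For orientation, the MHSS iteration is commonly stated as below. The abstract itself does not give the formulas, so the notation is assumed here: the system matrix is split as A = W + iT with W and T real symmetric, W positive definite and T positive semidefinite, α > 0 is a parameter, and x^{(k)} is the k-th iterate.

```latex
% Complex symmetric system:  (W + iT) x = b
\begin{align*}
  (\alpha I + W)\, x^{(k+1/2)} &= (\alpha I - iT)\, x^{(k)} + b, \\
  (\alpha I + T)\, x^{(k+1)}   &= (\alpha I + iW)\, x^{(k+1/2)} - i\, b.
\end{align*}
```

Both coefficient matrices, αI + W and αI + T, are real symmetric positive definite, which matches the abstract's remark about the two systems solved in each iteration. A quick consistency check: substituting an exact solution x for every iterate reduces each line to (W + iT)x = b.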

16.
The problem of computing the triangular factors of a square, real, symmetric, and positive definite matrix by using the facilities of a multiprocessor MIMD-type computer is considered. The parallel algorithms based on Cholesky decomposition and Gaussian elimination are derived and analyzed in terms of their speedup and efficiency, when the available number of processors is O(n), where n is the size of the matrix. It is shown that the parallel elimination method can achieve the same speedup as the parallel Cholesky method while using only half the number of processors required by the Cholesky method.

17.
In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.

18.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
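The block cyclic distribution mentioned in this abstract can be made concrete with a small index-mapping sketch. It uses 0-based indices and assumes the first block lives on process 0. The function names `owner_and_local` and `owner_2d` are hypothetical and are not ScaLAPACK routines.

```python
def owner_and_local(g, nb, p):
    """Map a global index g to (owning process, local index) in one dimension.

    Blocks of nb consecutive indices are dealt out to p processes in
    round-robin order, starting with process 0.
    """
    block = g // nb                     # which global block g falls in
    proc = block % p                    # blocks are dealt cyclically
    local = (block // p) * nb + g % nb  # position inside the owner's storage
    return proc, local


def owner_2d(i, j, mb, nb, prow, pcol):
    """Map matrix entry (i, j) onto a prow x pcol process grid.

    Rows use block size mb and columns use block size nb; the two
    dimensions are distributed independently.
    """
    pr, li = owner_and_local(i, mb, prow)
    pc, lj = owner_and_local(j, nb, pcol)
    return (pr, pc), (li, lj)
```

For example, with block size 2 on 3 processes, global indices 0 and 1 go to process 0, indices 2 and 3 to process 1, indices 4 and 5 to process 2, and index 6 wraps around to process 0 again at local index 2. This wrap-around is what keeps the load balanced as a factorization works its way down the matrix.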

19.
This paper describes an efficient implementation and evaluation of a parallel eigensolver for computing all eigenvalues of dense symmetric matrices. Our eigensolver uses a Householder tridiagonalization method, which has higher parallelism and performance than conventional methods when the problem size is relatively small, e.g., of the order of 10,000. This is very important for practical applications in which many diagonalizations of such matrices are required. The routine was evaluated on a 1024-processor HITACHI SR2201 and achieved speedups of about 2-5 times over the ScaLAPACK library on the same 1024 processors.
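For readers unfamiliar with Householder tridiagonalization, a minimal serial, unblocked sketch follows. It illustrates only the basic reduction, not the parallel method evaluated in the paper. The function name `tridiagonalize` is an assumption, and the input is taken to be a real symmetric matrix given as a list of rows.

```python
import math


def tridiagonalize(A):
    """Reduce a real symmetric matrix to tridiagonal form T = Q^T A Q.

    Each step applies a Householder reflection H = I - 2 v v^T that zeroes
    column k below the subdiagonal. Only T is returned; Q is not accumulated.
    """
    n = len(A)
    T = [row[:] for row in A]
    for k in range(n - 2):
        x = [T[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue  # column is already in the required form
        alpha = -math.copysign(norm_x, x[0])  # sign chosen to avoid cancellation
        v = x[:]
        v[0] -= alpha
        norm_v = math.sqrt(sum(t * t for t in v))
        v = [t / norm_v for t in v]
        m = n - k - 1
        # Symmetric rank-2 update of the trailing block B: B <- H B H.
        p = [sum(T[k + 1 + i][k + 1 + j] * v[j] for j in range(m)) for i in range(m)]
        kappa = sum(v[i] * p[i] for i in range(m))
        w = [p[i] - kappa * v[i] for i in range(m)]
        for i in range(m):
            for j in range(m):
                T[k + 1 + i][k + 1 + j] -= 2.0 * (v[i] * w[j] + w[i] * v[j])
        # H maps x to alpha * e1, so column k and row k are now known exactly.
        T[k + 1][k] = T[k][k + 1] = alpha
        for i in range(k + 2, n):
            T[i][k] = T[k][i] = 0.0
    return T
```

Because the transformation is an orthogonal similarity, T has the same eigenvalues as A, and quantities such as the trace and the Frobenius norm are preserved. These make convenient sanity checks.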
