1.

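Efficient Algorithms for a Parallel Matrix Eigenproblem Solver on SMP Clusters  Total citations: 2; self-citations: 0; citations by others: 2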

赵永华 迟学斌 程强 《计算机研究与发展》 2007, 44(2): 334-340

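Tridiagonalizing a symmetric matrix and computing the eigenvalues of the resulting symmetric tridiagonal matrix are the key steps of a parallel eigensolver for dense symmetric matrices. Targeting the multi-level architecture of SMP clusters, this paper presents hybrid MPI+OpenMP parallel algorithms for Householder-based tridiagonalization and for the divide-and-conquer algorithm for the tridiagonal eigenvalue problem. The study concentrates on load balance, communication overhead, and performance evaluation in the SMP cluster environment. The hybrid design combines a coarse-grained thread-parallel model with a dynamic, task-sharing invocation method, which improves the load balance of the MPI algorithm and reduces communication overhead. Experiments on the Deepcomp 6800 (深腾6800) show that the solver based on the hybrid algorithms performs and scales better than the pure-MPI version.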
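For orientation, Householder-based tridiagonalization can be sketched serially as below. This is a textbook version in plain Python, not the hybrid MPI+OpenMP algorithm of the paper; the function name `householder_tridiagonalize` and the list-of-lists matrix layout are assumptions made for this illustration.

```python
import math


def householder_tridiagonalize(a):
    """Reduce a symmetric matrix (list of lists) to tridiagonal form.

    Serial textbook sketch: for each column k, build the reflection
    H = I - 2 v v^T / (v^T v) that zeroes the entries below the
    subdiagonal, then apply the similarity transform A <- H A H.
    """
    n = len(a)
    a = [row[:] for row in a]              # work on a copy
    for k in range(n - 2):
        # x is the part of column k below the diagonal.
        x = [a[i][k] for i in range(k + 1, n)]
        norm_x = math.sqrt(sum(t * t for t in x))
        if norm_x == 0.0:
            continue                       # nothing to eliminate in this column
        # Pick the sign that avoids cancellation when forming v[0].
        alpha = -math.copysign(norm_x, x[0])
        v = x[:]
        v[0] -= alpha
        vtv = sum(t * t for t in v)
        m = len(v)
        # Left multiplication (rows k+1..n-1): A <- H A
        for j in range(n):
            dot = sum(v[i] * a[k + 1 + i][j] for i in range(m))
            f = 2.0 * dot / vtv
            for i in range(m):
                a[k + 1 + i][j] -= f * v[i]
        # Right multiplication (columns k+1..n-1): A <- A H
        for i in range(n):
            dot = sum(a[i][k + 1 + j] * v[j] for j in range(m))
            f = 2.0 * dot / vtv
            for j in range(m):
                a[i][k + 1 + j] -= f * v[j]
    return a
```

Because every step is an orthogonal similarity transform, the result has the same eigenvalues as the input; entries outside the three central diagonals are zero only up to rounding error.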

2.

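Design of a Hybrid Parallel Algorithm for Tridiagonalization of Symmetric Matrices  Total citations: 2; self-citations: 0; citations by others: 2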

赵永华 迟学斌 陈江 《计算机工程》 2005, 31(22): 39-41, 53

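Based on Householder transformations, this paper presents a hybrid MPI+OpenMP parallel algorithm for tridiagonalizing dense symmetric matrices. It concentrates on load balance, communication overhead, and performance evaluation of the algorithm in the SMP cluster environment. The OpenMP shared-memory parallelism uses a coarse-grained approach, which resolves the load-balance problem of the MPI algorithm and reduces communication overhead. Tests on the Deepcomp 6800 (深腾6800) show that the MPI+OpenMP version performs and scales better than the pure-MPI version.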

3.

A Fast Algorithm for Symmetric Dense Matrix Eigenvalue Problems Based on Two-Step Diagonalization

郑啸天 赵永华 《数据与计算发展前沿》 2015, 6(4): 39-46

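Computing selected eigenvalues and eigenvectors of a symmetric matrix is an important problem in many fields of scientific computing. In electronic structure calculations in particular, the eigenvalue computation becomes the bottleneck. When most of the eigenvalues and eigenvectors are required, direct solvers have generally been used. To make better use of the memory hierarchy, we design a diagonalization algorithm that splits the reduction and the back transformation into stages; by redesigning the whole process, it exploits the memory structure, raises single-core speed, and improves parallel efficiency. This paper focuses on the tridiagonal-to-banded back transformation. The algorithms described here are used in the MESIA electronic structure package, where they yield some performance improvement.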

4.

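Efficient Implementation on GPU Clusters of an Algorithm for Reducing Generalized Dense Symmetric Eigenproblems to Standard Form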

刘世芳 赵永华 于天禹 黄荣锋 《计算机科学》 2020, 47(4): 6-12

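Solving generalized dense symmetric eigenproblems is a central task in many areas of applied science and engineering, and an important part of computations in computational electromagnetics, electronic structure, finite element modeling, and quantum chemistry. Transforming the generalized symmetric eigenproblem into a standard symmetric eigenproblem is the key computational step. For GPU clusters, this paper presents an MPI+CUDA implementation of a blocked algorithm for reducing the generalized dense symmetric eigenproblem to standard form. To suit the GPU-cluster architecture, the reduction algorithm combines the Cholesky factorization of the positive definite matrix with the traditional blocked reduction algorithm, which removes unnecessary communication overhead and increases parallelism. In the MPI+CUDA reduction algorithm, data transfers between GPU and CPU are used to hide the data copies inside the GPU, which eliminates the time spent on copying and thereby improves performance. The paper also gives a fully parallel point-to-point transposition algorithm between the row and column communicators of a two-dimensional communication grid, and an MPI+CUDA parallel blocked algorithm for solving the triangular matrix equation BX = A with multiple right-hand sides. On the supercomputer system “元” at the Computer Network Information Center of the Chinese Academy of Sciences, where each compute node has two Nvidia Tesla K20 GPGPU cards and two Intel E5-2680 V2 processors, the MPI+CUDA reduction algorithm was tested on matrices of various sizes with up to 32 GPUs; it achieved good speedup and performance and scaled well. With 32 GPUs and a matrix of order 50000×50000, peak performance reached about 9.21 Tflops.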
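The reduction step itself can be illustrated with a small serial sketch. Assuming B is symmetric positive definite with Cholesky factor L (B = L L^T), the problem A x = λ B x becomes C y = λ y with C = L^{-1} A L^{-T} and y = L^T x. The helper names below (`cholesky`, `solve_lower`, `to_standard_form`) are hypothetical, and the code is an unblocked, single-process illustration rather than the blocked MPI+CUDA algorithm of the paper.

```python
import math


def cholesky(b):
    """Return lower-triangular L with B = L L^T (B symmetric positive definite)."""
    n = len(b)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = b[j][j] - sum(l[j][k] ** 2 for k in range(j))
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            l[i][j] = (b[i][j] - sum(l[i][k] * l[j][k] for k in range(j))) / l[j][j]
    return l


def solve_lower(l, rhs):
    """Solve L X = RHS by forward substitution; RHS may have many columns."""
    n, m = len(l), len(rhs[0])
    x = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for c in range(m):
            s = rhs[i][c] - sum(l[i][k] * x[k][c] for k in range(i))
            x[i][c] = s / l[i][i]
    return x


def transpose(a):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def to_standard_form(a, b):
    """Return C = L^{-1} A L^{-T}, where B = L L^T."""
    l = cholesky(b)
    y = solve_lower(l, a)                           # Y = L^{-1} A
    return transpose(solve_lower(l, transpose(y)))  # C = Y L^{-T}
```

Both triangular solves act on a full matrix of right-hand sides, which is the kind of multiple-right-hand-side triangular equation the abstract mentions.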

5.

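A Divide-and-Conquer Algorithm for the Eigenvalue Problem of Real Symmetric Banded Matrices  Total citations: 2; self-citations: 0; citations by others: 2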

罗晓广 李晓梅 《数值计算与计算机应用》 1998, (3)

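§1. Introduction. Consider the matrix eigenvalue problem Ax = λx, where A is an n × n real symmetric banded matrix whose half-bandwidth is greater than 1 and much smaller than n, so that a_ij = 0 whenever |i − j| exceeds the half-bandwidth. The classical algorithm for this problem first tridiagonalizes the banded matrix with stable orthogonal transformations (Householder or Givens transformations) and then computes the eigenpairs of the symmetric tridiagonal matrix with the QR algorithm. Its drawback is that it is hard to parallelize, especially on distributed-memory parallel machines. The homotopy divide-and-conquer algorithm proposed in [3] is fast and has high parallel efficiency, but it applies only to symmetric tridiagonal matrices. This paper generalizes the results of [3] and proposes a homotopy divide-and-conquer algorithm for the eigenvalue problem of real symmetric banded matrices. §2. Theoretical background of the algorithm. Partition the matrix A as follows: where A…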

6.

A Parallel Iterative Algorithm for Block Tridiagonal Linear Systems

樊艳红 吕全义 《计算机仿真》 2011, 28(2)

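In scientific and engineering computing, a single processor often cannot meet the requirements, and parallel processing is a good way to solve block tridiagonal linear systems with better efficiency and accuracy. This paper proposes a parallel algorithm for solving block tridiagonal linear systems in a distributed environment. The algorithm exploits the special structure of the coefficient matrix: an iterative method is constructed from a suitable splitting of the coefficient matrix, so that the algorithm needs three parallel communications between neighboring processors. A sufficient condition for convergence is derived theoretically. Finally, numerical simulations on an HP rx2600 cluster show that the computed results agree with the theory and that parallel efficiency and accuracy are improved.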

7.

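Convolutive Blind Separation Based on Robust Joint Block Diagonalization  Total citations: 1; self-citations: 0; citations by others: 1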

汤辉 王殊 《自动化学报》 2013, 39(9): 1502-1510

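A new joint block diagonalization (JBD) algorithm for matrices is proposed for the convolutive blind separation problem. Existing iterative non-orthogonal JBD algorithms can all fail to converge; this paper exploits the special structure of the separation matrix to guarantee its invertibility, which keeps the iteration stable. Assuming the block structure of the matrices is known, the convolutive blind separation model is first written in instantaneous form and shown to satisfy the joint block diagonalization structure. A cost function for JBD is then proposed; because minimizing the cost function is equivalent to minimizing the norm of each block of the matrices, the iterative update of the whole separation matrix is converted into iterative updates of its individual blocks. Finally, the minimization condition yields the iterative algorithm. The algorithm is derived for both the real and the complex case. Basic experiments verify the performance of the new algorithm under different conditions; in the simulations, convolutive mixtures of signals that overlap in both the time and the frequency domain are blindly separated, and the results confirm that the new algorithm separates better and more stably.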

8.

Distributed Parallel Algorithms for Block Banded Linear Systems  Total citations: 3; self-citations: 0; citations by others: 3

迟利华 李晓梅 《计算机工程与科学》 1999, 21(3): 61-65

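Following the divide-and-conquer idea, this paper first proposes a new distributed parallel algorithm for solving block tridiagonal linear systems, then extends it to the parallel solution of block pentadiagonal and block heptadiagonal linear systems, and analyzes the performance of the algorithms. Trial runs on an SGI workstation cluster and on a cluster of 586 PCs show that the speedup grows linearly.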

9.

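Parallel Bisection/Multisection Methods for the Eigenvalue Problem of Symmetric Banded Matrices  Total citations: 1; self-citations: 1; citations by others: 0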

魏立峰 李晓梅 《计算机工程与设计》 2001, 22(1): 51-55

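This paper proposes a parallel bisection/multisection method, together with an improved variant, for solving the eigenvalue problem of symmetric banded matrices in a distributed environment. The algorithm computes the Sturm sequence of the symmetric banded matrix with a modified Gaussian elimination and uses Rayleigh quotient iteration to improve bisection/multisection. During parallel execution the processors do not need to communicate, which makes the method particularly suitable for parallel computing in distributed environments. Numerical results are given at the end.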

10.

A New Algorithm for Inverting Block Periodic Tridiagonal Matrices

杜永恩 陆全 徐仲 《计算机工程与应用》 2012, 48(17): 41-43

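Using the special structure of the inverse of a block tridiagonal matrix, its LU and UL factorizations, and the Sherman-Morrison-Woodbury formula, the paper derives a new algorithm for inverting block periodic tridiagonal matrices, and from it new algorithms for inverting periodic tridiagonal matrices and symmetric periodic tridiagonal matrices. The new algorithm has lower computational complexity and shorter running time than the traditional algorithm.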
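For reference, the Sherman-Morrison-Woodbury identity in its usual form is given below. The symbols A, U, C, and V are generic notation chosen here, not taken from the paper. One natural reading, offered as an assumption, is that a block periodic tridiagonal matrix is a block tridiagonal matrix A plus a low-rank correction UCV that carries the two corner blocks.

```latex
(A + UCV)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}
```

The identity holds whenever A, C, and C^{-1} + V A^{-1} U are invertible. It pays off when the inner matrix is much smaller than A.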

11.

Parallel block tridiagonalization of real symmetric matrices

Yihua Bai Robert C. Ward 《Journal of Parallel and Distributed Computing》2008

Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for the block tridiagonal divide-and-conquer algorithm for computing eigensystems and is useful for many algorithms desiring the efficiencies of block structure in matrices. For an “effectively” sparse matrix, which frequently results from applications with strong locality properties, a heuristic parallel algorithm is used to transform it into a block tridiagonal matrix such that the eigenvalue errors remain bounded by some prescribed accuracy tolerance. For a dense matrix without any usable structure, orthogonal transformations are used to reduce it to block tridiagonal form using mostly level 3 BLAS operations. Numerical experiments show that block tridiagonal structure obtained from this algorithm directly affects the computational complexity of the parallel block tridiagonal divide-and-conquer eigensolver. Reduction to block tridiagonal form provides significantly lower execution times, as well as memory traffic and communication cost, over the traditional reduction to tridiagonal form for eigensystem computations.

12.

2.5D Parallel Matrix Multiplication Based on BLACS

廖霞 李胜国 卢宇彤 杨灿群 《计算机学报》 2021, 44(5): 1037-1050

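Parallel matrix multiplication is one of the most important basic operations in linear algebra and a cornerstone of many scientific applications. As high-performance computing (HPC) moves toward exascale, communication accounts for a growing share of the cost of parallel matrix multiplication. Reducing this communication overhead and improving scalability is an active research topic. This paper proposes a new distributed parallel dense matrix multiplication algorithm, a 2.5D version of PUMMA (Parallel Universal Matrix Multiplication Algorithm). The algorithm divides the initial processes into c groups, uses the extra memory of the compute nodes to store the matrices A and B on every process group at the same time, lets each group execute 1/c of the PUMMA algorithm, and finally obtains the product through a reduction. Based on the BLACS (Basic Linear Algebra Communication Subprograms) communication library, the paper implements a new data redistribution algorithm from 2D to 2.5D and combines it with PUMMA to obtain the 2.5D PUMMA algorithm, which can directly replace PDGEMM (Parallel Double-precision General Matrix-matrix Multiplication) and is highly portable. Compared with classical 2D algorithms such as PDGEMM in the standard library ScaLAPACK (Scalable Linear Algebra PACKage), the algorithm needs fewer communications, has better data locality, and scales better. With many processes, for example 4096, system tests show a speedup of 2.20 to 2.93 over PDGEMM. The paper further applies 2.5D PUMMA to accelerate the eigendecomposition of symmetric tridiagonal matrices, with a speedup above 1.2. The performance of 2.5D PUMMA is analyzed on a large number of numerical examples, and the paper closes with practical recommendations and future work.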
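The "1/c of the work per group, then a reduction" structure can be mimicked serially as follows. The sketch assumes that each group's share corresponds to one slice of the inner (summation) dimension; that reading, the function names, and the loop standing in for process groups are all assumptions for illustration. No PUMMA data distribution or BLACS call is reproduced.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    inner = len(b)
    return [[sum(a[i][t] * b[t][j] for t in range(inner))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_25d_sketch(a, b, c):
    """Serial stand-in for the 2.5D idea with c groups.

    Each "group" g multiplies only its slice of the inner dimension,
    producing a partial product; the partial products are then summed,
    which plays the role of the final reduction.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    bounds = [inner * g // c for g in range(c + 1)]   # slice boundaries
    total = [[0.0] * cols for _ in range(rows)]
    for g in range(c):
        lo, hi = bounds[g], bounds[g + 1]
        for i in range(rows):
            for j in range(cols):
                partial = sum(a[i][t] * b[t][j] for t in range(lo, hi))
                total[i][j] += partial                # reduction step
    return total
```

In a real 2.5D algorithm the c partial products are computed concurrently on replicated copies of A and B, so the extra memory buys a reduction in communication; the loop over `g` here only shows that the partial results add up to the full product.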

13.

Parallel solution of partial symmetric eigenvalue problems from electronic structure calculations

T. Auckenthaler H. Lederer 《Parallel Computing》2011,37(12):783-794

The computation of selected eigenvalues and eigenvectors of a symmetric (Hermitian) matrix is an important subtask in many contexts, for example in electronic structure calculations. If a significant portion of the eigensystem is required then typically direct eigensolvers are used. The central three steps are: reduce the matrix to tridiagonal form, compute the eigenpairs of the tridiagonal matrix, and transform the eigenvectors back. To better utilize memory hierarchies, the reduction may be effected in two stages: full to banded, and banded to tridiagonal. Then the back transformation of the eigenvectors also involves two stages. For large problems, the eigensystem calculations can be the computational bottleneck, in particular with large numbers of processors. In this paper we discuss variants of the tridiagonal-to-banded back transformation, improving the parallel efficiency for large numbers of processors as well as the per-processor utilization. We also modify the divide-and-conquer algorithm for symmetric tridiagonal matrices such that it can compute a subset of the eigenpairs at reduced cost. The effectiveness of our modifications is demonstrated with numerical experiments.

14.

The symmetric tridiagonal eigenvalue problem on a transputer network

T. Z. Kalamboukis 《Parallel Computing》1990,15(1-3):101-106

In this paper we present an implementation of multisection and parallel bisection method on a transputer network for finding the eigenvalues and corresponding eigenvectors of a symmetric tridiagonal matrix which lie in a specified interval (a, b). Although several similar studies in the literature have reported significant speedups over the sequential versions of the algorithms, it remains to be determined which multiprocessor configuration is the most advantageous for these problems.
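As a serial illustration of bisection on a symmetric tridiagonal matrix, the sketch below counts eigenvalues with the standard Sturm-sequence (LDL^T pivot) recurrence and then bisects for each eigenvalue in (a, b). The function names and the diagonal/off-diagonal list layout (`d`, `e`) are assumptions; the transputer implementation and the multisection variant are not reproduced.

```python
def count_below(d, e, x):
    """Count eigenvalues smaller than x for the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (len(e) == len(d) - 1)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        off = e[i - 1] ** 2 if i > 0 else 0.0
        q = d[i] - x - off / q          # next pivot of T - x*I
        if q == 0.0:
            q = 1e-30                   # nudge an exact zero pivot
        if q < 0.0:
            count += 1                  # one negative pivot per eigenvalue < x
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """Bisect for the k-th smallest eigenvalue (k counted from 0)."""
    n = len(d)
    # Gershgorin discs give an interval containing every eigenvalue.
    radius = [(abs(e[i - 1]) if i > 0 else 0.0) +
              (abs(e[i]) if i < n - 1 else 0.0) for i in range(n)]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid                    # eigenvalue k lies below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def eigenvalues_in_interval(d, e, a, b, tol=1e-12):
    """Return the eigenvalues located in (a, b), in ascending order."""
    first = count_below(d, e, a)
    last = count_below(d, e, b)
    return [kth_eigenvalue(d, e, k, tol) for k in range(first, last)]
```

Each index `k` is handled independently of the others, which is why bisection-type methods distribute easily across processors.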

15.

Tridiagonalization of a symmetric matrix on a square array of mesh-connected processors

A. Bojańczyk R.P. Brent 《Journal of Parallel and Distributed Computing》1985,2(3):261-276

A parallel algorithm for transforming an n × n symmetric matrix to tridiagonal form is described. The algorithm implements Givens rotations on a square array of n × n processors in such a way that the transformation can be performed in time O(n log n). The processors require only nearest-neighbor communication. The reduction to tridiagonal form could be the first step in the parallel solution of the symmetric eigenvalue problem in time O(n log n).
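The basic building block, a single Givens rotation applied as a similarity transform, can be sketched as below. The function names are hypothetical, and the sketch shows only how one rotation annihilates one entry of a symmetric matrix; the scheduling of rotations on the processor array is not reproduced.

```python
import math


def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0
    r = math.hypot(a, b)
    return a / r, b / r


def apply_givens_similarity(m, p, q, c, s):
    """Overwrite m with G m G^T, where G rotates coordinates p and q."""
    n = len(m)
    for j in range(n):                  # rows p and q: m <- G m
        mp, mq = m[p][j], m[q][j]
        m[p][j] = c * mp + s * mq
        m[q][j] = -s * mp + c * mq
    for i in range(n):                  # columns p and q: m <- m G^T
        mp, mq = m[i][p], m[i][q]
        m[i][p] = c * mp + s * mq
        m[i][q] = -s * mp + c * mq


# Usage: zero the (2, 0) entry of a symmetric 3 x 3 matrix.
example = [[4.0, 3.0, 4.0],
           [3.0, 2.0, 1.0],
           [4.0, 1.0, 5.0]]
c, s = givens(example[1][0], example[2][0])
apply_givens_similarity(example, 1, 2, c, s)
```

After the call, `example` is tridiagonal up to rounding and has the same eigenvalues as before, because G is orthogonal.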

16.

Applying parallel computer systems to solve symmetric tridiagonal eigenvalue problems

Mi Lu Xiangzhen Qiao 《Parallel Computing》1992,18(12):1301-1315

A block parallel partitioning method for computing the eigenvalues of a symmetric tridiagonal matrix is presented. The algorithm is based on partitioning, in a way that ensures load balance during computation. This method is applicable to both shared-memory and distributed-memory MIMD systems. Compared with other parallel tridiagonal eigenvalue algorithms existing in the literature, the proposed algorithm achieves a higher speedup of O(p) on a parallel computer with p-fold parallelism, which is linear, and the data communication between processors is less than that required for other methods. The results were tested and evaluated on an MIMD machine, and were within 62% to 98% of the predicted performance.

17.

Resolving small random symmetric linear systems on graphics processing units

L. A. Abbas-Turki S. Graillat 《The Journal of supercomputing》2017,73(4):1360-1386

This paper focuses on the resolution of a large number of small random symmetric linear systems and its parallel implementation in single precision on graphics processing units (GPUs). The computations involved by each linear system are independent from the others, and the number of unknowns does not exceed 64. For this purpose, we present the adaptation to our context of widely used methods that include: LDLt factorization, Householder reduction to a tridiagonal matrix, parallel cyclic reduction (PCR) for sizes that are not a power of two, and the divide and conquer algorithm for tridiagonal eigenproblems. We not only detail the implementation and optimization of each method, but we also compare the sustainability of each solution and its performance, which includes both parallel complexity and cache memory occupation. In the context of solving a large number of small random linear systems on GPUs with no information about their conditioning, our research indicates that the best strategy requires the use of Householder tridiagonalization + PCR followed if necessary by a divide and conquer diagonalization.

18.

Parallel Implementation and Analysis of the SMW and DAC Algorithms

李娜 蔡大用 《数值计算与计算机应用》 2001, 22(1): 9-21

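§1. Introduction. With the development of science and technology, the demands placed on large-scale scientific computing keep rising. First, problem sizes keep growing: three-dimensional reservoir simulation, atmosphere-ocean interaction, and nuclear safety analysis, for example, all require solving very large nonlinear systems of equations (with as many as 10^6 to 10^8 unknowns). Second, real-time requirements are becoming more and more pressing; the real-time demands of power-system security analysis and weather forecasting are the clearest evidence. The traditional approach of solving problems serially on a single machine can no longer meet practical needs, so research on vectorized and parallel (and even parallel + vectorized) algorithms of all kinds has received widespread attention. Whatever method is used to solve nonlinear partial differential equations (or systems of them), it ultimately leads to thousands upon thousands of…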