期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

基于曙光并行机的超大规模非线性方程组并行算法研究 总被引：8，自引：0，他引：8

杨庚王绍棣沈金龙《计算机学报》2002,25(4):397-402

该文讨论了一类求解大规模非线性方程组算法的并行性能及其在曙光并行机上的实现过程，与传统的算法不同之处是用一个块对角矩阵作为迭代矩阵，且该矩阵可由一个仅包含向量内积和矩阵与向量乘积的递推关系简便计算得到，在对算法进行描述之后，分析了算法的并行加速比和存储需求，讨论了算法在基于消息传递的MPI并行环境下的实现流程，数值计算表明理论分析与数值结果相比，算法在分布式并行环境下具有有较好的并行主攻较低的存储要求，可适用于大规模科学与工程的高性能计算。相似文献

2.

对称矩阵三对角化的有效并行块算法设计 总被引：1，自引：0，他引：1

赵永华迟学斌程强陈江赵涛《数值计算与计算机应用》2006,27(2):123-132

在矩阵数值计算中,块算法通常比非块算法更有效,但这也增加了并行算法设计和实现的难度．在广义稠密对称矩阵特征问题并行求解器中,并行块算法的构造可应用到正定对称矩阵的Choleski分解、对称矩阵的三对角化和回代转化(back-transiation)操作中．本文将并行块算法的讨论集中在具有代表性的对称矩阵三对角化上,给出在非块存储方式下对称矩阵三对角化的并行块算法设计方法．分析块算法大小同矩阵规模和处理器数量的关系．在深腾6800上的试验表明,我们的算法具有很好的性能,并得到了比ScaLAPACK更高的性能．相似文献

3.

三角形方程组的一种分布式并行算法 总被引：5，自引：0，他引：5

张学波李晓梅《计算机工程与科学》2003,25(5):69-71

本文提出了一种在分布式存储环境下求解三角形方程组的并行算法,该算法将系数矩阵及右端项以行卷帘方式分布存储到各处理机中。算法中引入了一个一维p阶向量F,该向量的循环传送使处理机间的通信次数明显下降,同时该算法还采用了计算与通信重叠的技术。理论分析与数值实验表明,该算法较列扫描并行算法优越。相似文献

4.

基于多层半可分结构矩阵的快速算法与并行实现

《数值计算与计算机应用》2017,(1)

基于多层半可分(HSS)结构矩阵的快速算法可有效降低具有数值低秩属性的稠密线性方程组求解的复杂度.采用随机取样和保结构秩显(SPRR)分解相结合的方法替代秩显QR(RRQR)分解可以快速构造HSS结构矩阵.该方法将压缩构造HSS结构矩阵转换成小矩阵计算,减少存储和通信开销,使构造HSS结构矩阵的时间复杂度进一步降低.在分布式机群上采用ScaLapack的二维循环块分布方式存储各矩阵块,将HSS树和处理机网格进行映射.构造HSS结构矩阵的并行算法包括对矩阵的多层块压缩,数据交换和重分布,然后结合并行ULV分解和并行三角求解实现快速并行求解,分析了该并行算法的复杂度.最后以二维电磁散射问题为例,数值结果表明该算法不仅比直接LU分解快一个数量级,而且具有良好的并行可扩展性. 相似文献

5.

一类Toeplitz三对角方程组的一种分布式并行算法 总被引：3，自引：0，他引：3

骆志刚李晓梅《计算机学报》2001,24(2):173-178

文中提出一类Toeplitz三对角方程组的一种分布式并行算法。该算法以系数矩阵的分解为基础,充分利用了系数矩阵结构的特殊性,算法因并行化而引入的冗余计算量非常少,算法的通信机制简单,通信量仅与处理机台数p有关,与方程组规模n无关,算法具有很高的并行效率,理论分析和数值试验表明,其加速比Sp(n)→p(n→ ∞）,此为线性加速比的理想情况。文中给出了算法在分布存储多计算机系统上的数值试验结果。相似文献

6.

一种针对GPU上的油藏数值模拟的高效SpMV

《数值计算与计算机应用》2016,(4)

油藏数值模拟和很多其他科学计算问题一样需要求解大型稀疏线性代数方程组.在求解稀疏线性代数方程组的迭代法中,稀疏矩阵向量乘法(SpMV)是影响计算效率的核心函数之一.随着计算机硬件架构异构化,科学计算从单核、多核CPU计算架构逐渐发展到多核CPU+众核加速卡(GPU卡或MIC等)的计算架构.SpMV的实现效率与稀疏矩阵的存储格式及硬件架构关系密切.本文针对油藏模拟中常见的Jacobian矩阵的稀疏模式,利用GPU核心的合并访问和并发计算等特点,结合油藏模拟线性解法器的算法要求,设计了一种BHYB矩阵存储格式及其对应的线程组并行策略.数值实验测得基于该存储格式的SpMV相对串行BCSR格式的SpMV的加速比可达19倍,比cuSPARSE库中效率最高的HYB格式的SpMV快30%到80%.此外,本文所提出的BHYB存储格式对块状矩阵在GPU上的存储以及线程组并行策略对其它GPU并行程序中内核函数的设计和优化能起到一定的借鉴作用. 相似文献

7.

并行计算稀疏矩阵乘以向量的负载平衡算法

下载免费PDF全文

刘杰迟利华胡庆丰李晓梅《计算机工程与科学》2006,28(3):76-77

稀疏矩阵乘以一个向量（SpM×V）的问题是许多大型应用问题的核心计算问题，文中提出了一种在并行计算机上并行计算SpMXV的负载平衡算法，计算复杂性为O（N）（N为稀疏矩阵的阶），而目前计算此类问题的最优负载平衡算法的计算复杂性为O（N·P）（P为处理机台数）。文章最后给出了并行数值实验。相似文献

8.

基于GPU加速的并行WMD算法

胡蓉阳王东王昊天罗辉章李肯立《计算机科学》2021,48(12):24-28

Word Mover's Distance(WMD)是一种度量文本相似度的方法,它将两个文本之间的差异定义为文本的词嵌入向量之间的最小距离.WMD利用词汇表,将文本表示为归一化的词袋向量.文本的单词在语料中所占的比例很小,因此用词袋模型生成的文本向量很稀疏.多个文本可以组成一个高维的稀疏矩阵,这样的稀疏矩阵会生成大量不必要的运算.通过一次性对多个目标文本计算单个源文本的WMD,可以使计算过程高度并行化.针对文本向量的稀疏性,文中提出了一种基于GPU的并行Sinkhorn-WMD算法,采取压缩格式存储目标文本的方式来提高内存利用率,根据稀疏结构减少中间过程的计算.利用预训练词嵌入向量计算单词距离矩阵,对WMD算法进行改进,在两个公开的新闻数据集上进行优化算法的验证.实验结果表明,在NVIDIA TITAN RTX上并行算法与CPU串行相比最高可以达到67.43倍的加速. 相似文献

9.

基于并行机的矩量法研究

孙鹏《计算机光盘软件与应用》2010,(9):104-104

并行处理技术相对于传统的串行处理,具有无可比拟的优越性。以代数方程组和微分方程组的求解为线索,对并行算法在矩阵向量乘法计算中的应用进行了分析。通过在计算机机群上将矩阵向量乘法的并行方法实现,研究其算法性能,并分析了通信量对并行算踟陡能的影响。相似文献

10.

带状线性方程组的并行交替方向算法 总被引：1，自引：1，他引：0

下载免费PDF全文

段治健吕全义马欣荣《计算机工程与应用》2009,45(20):54-56

提出了分布式存储环境下求解带状线性方程组的并行交替方向迭代算法。充分利用系数矩阵的结构特点,给出了在系数矩阵分别为Hermite正定矩阵和M-矩阵时算法的充分条件,并针对采用的分裂方式,讨论了参数的收敛范围,最后在HPrx2600集群系统上进行了数值计算,结果表明实算与理论相一致,算法简便可行且具有良好的并行性。相似文献

11.

自动寻找使多重串行循环并行化的幺模变换 总被引：2，自引：0，他引：2

俞一峻臧斌宇施武朱传琪《软件学报》1999,10(4):366-371

对于已知n维距离向量矩阵的多重串行循环,过去的并行化编译研究还缺乏寻找使循环外层并行化的幺模矩阵的可行算法.文章介绍了多重串行循环并行化的幺模变换方法,不仅从理论上证明满足外层并行化要求的合法幺模矩阵是存在的,而且通过构造性证明给出一个计算外层并行化幺模变换矩阵的可行算法,并探讨了扩大其适用范围于非完全嵌套和非常数相关距离循环的有效途径. 相似文献

12.

Toward an automatic parallelization of sparse matrix computations

《Journal of Parallel and Distributed Computing》2005,65(3):313-330

In this paper, we propose a generic method of automatic parallelization for sparse matrix computation. This method is based on both a refinement of the data-dependence test proposed by Bernstein and an inspector–executor scheme which is specialized to each input program of the compiler. This analysis mixes compilation process and run-time process.The sparsity of underlying data-structure determines a specific parallelism which increases the degree of parallelism of an algorithm. Such a source of parallelism had already been applied to many numerical algorithms such as the usual Cholesky factorization or LU-decomposition algorithms considered as the gold standards of parallelization based on sparsity. The standard automatic parallelization method cannot tackle such source of parallelism because it is based on the value of cells arrays and not merely on the memory addressing function.Addressing the automatization of this parallelism requires to develop a mixed compile-time and runtime approach integrated in a inspector–executor process. The compilation step provides a dedicated inspector devoted to the analyzed program. The inspector computes the dependence graph at runtime which allows a dynamic parallelization of the execution.As expressed just before, the generic scheme developed in this paper follows the design principles which have been applied, but at each time in an ad hoc way, to many sparse parallelization of numerical algorithms such as Cholesky algorithm. As far as we know, no general formal framework has been proposed to automate such a method of sparse parallelization. In this paper, we propose a generic framework of sparse parallelization (i.e. numerical program independent) which can be applied to any numerical programs satisfying the usual syntactic constraints of parallelization. 相似文献

13.

An efficient method for computing eigenvalues of a real normal matrix

《Journal of Parallel and Distributed Computing》2003,63(6):638-648

Jacobi-based algorithms have attracted attention as they have a high degree of potential parallelism and may be more accurate than QR-based algorithms. In this paper we discuss how to design efficient Jacobi-like algorithms for eigenvalue decomposition of a real normal matrix. We introduce a block Jacobi-like method. This method uses only real arithmetic and orthogonal similarity transformations and achieves ultimate quadratic convergence. A theoretical analysis is conducted and some experimental results are presented. 相似文献

14.

Efficient mapping and implementation of matrix algorithms on a hypercube

Vladimir Cherkassky Ross Smith 《The Journal of supercomputing》1988,2(1):7-27

It is well known that parallelism by itself does not lead to higher speeds. This study shows how to put parallelism to best use, that is, how to find an optimal balance between communication and computation overheads for two parallel matrix algorithms. The problem graph for matrix algorithms analyzed in this paper is a two-dimensional grid (toroidal mesh) which is mapped onto a hypercube topology. To perform matrix operations on a hypercube, a matrix is partitioned into several submatrices which are stored and manipulated in the nodes. We seek to find an optimal matrix partitioning to minimize overall execution time. The NCUBE parallel machine is used for experimental performance evaluation. For matrix multiplication, we derive an exact analytical model to determine the optimal partitioning size and perform its experimental verification on the NCUBE parallel processor. For a parallel Gaussian elimination known as the balanced algorithm, we present performance measurements and an approximate analytical model for performance evaluation. Our analyses show that the optimal submatrix size is typically small and does not depend on the original matrix size. 相似文献

15.

Highly parallel execution of production systems: A model,algorithms and architecture

Kemal Oflazer 《New Generation Computing》1992,10(3):287-313

相似文献

16.

一种基于代价子图的子字并行指令选择算法

王淼王志英《计算机工程与科学》2008,30(9):141-144

子字并行能够充分利用多媒体算法的数据精度小、内部循环处理形式规则的特点,是加速多媒体处理的有效方式。然而,如何充分挖掘多媒体应用中的子字并行仍然是一个难题。本文说明传统的并行技术可以有效地开发循环中的子字并行性,同时提出一种基于代价子图的子字并行指令自动识别的方法。与其他方法相比,该方法利用代价模型对子子字并行指令选择进行定量评估。本文在TTA体系结构框架下实现了这一方法。实验结果表明,该方法可以充分地提取循环中的子字并行性。相似文献

17.

Coarse-Grained Parallel Algorithms for Multi-Dimensional Wavelet Transforms

Yang Linda Misra Manavendra 《The Journal of supercomputing》1998,12(1-2):99-118

This paper presents parallel algorithms for computing multi-dimensional wavelet transforms on both shared memory and distributed memory machines. Traditional data partitioning methods for n-dimensional Discrete Wavelet Transforms (DWTs) call for data redistribution once a one dimensional wavelet transform is computed along each dimension. To avoid the data communication inherent in this redistribution, two new partitioning methods called CRBP (Communication Reduced Block Partitioning) and CRLP (Communication Reduced Layer Partitioning) are proposed. The efficiency of the algorithms is compared through several examples implemented on a cluster of SGI workstations. Two kinds of parallel approaches are used to compute multi-dimensional wavelet transforms on shared memory machines: homogeneous parallelism, and heterogeneous parallelism. Homogeneous parallelism uses traditional data partitioning while heterogeneous parallelism uses the CRBP approach. The effectiveness of these approaches is demonstrated through several examples implemented on an SGI Power Challenge. The paper discusses the effectiveness of each of the approaches on the two kinds of architectures. 相似文献

18.

MapReduce框架下并行知识约简算法模型研究 总被引：5，自引：0，他引：5

钱进苗夺谦张泽华张志飞《计算机科学与探索》2013,(1):35-45

面向大规模数据进行知识约简是近年来粗糙集理论研究热点。经典的知识约简算法是一次性将小数据集装入单机主存中进行约简,无法处理海量数据。深入剖析了知识约简算法中的可并行性;设计并实现了数据和任务同时并行的Map和Reduce函数,用于计算不同候选属性集导出的等价类和属性重要性;构建了一种MapReduce框架下并行知识约简算法模型,用于计算基于正区域、基于差别矩阵或基于信息熵的知识约简算法的一个约简。在Hadoop平台上进行了相关实验,实验结果表明,该并行知识约简算法模型可以高效地处理海量数据集。相似文献

19.

Multiprocessing linear algebra algorithms on the CRAY X-MP-2: Experiences with small granularity

Steven S. Chen Jack J. Dongarra Christopher C. Hsiung 《Journal of Parallel and Distributed Computing》1984,1(1):22-31

This paper gives a brief overview of the CRAY X-MP-2 general-purpose multiprocessor system and discusses how it can be used effectively to solve problems that have small granularity. An implementation is described for linear algebra algorithms that solve systems of linear equations when the matrix is general and when the matrix is symmetric and positive definite. 相似文献