1.

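《计算机工程与设计》 (Computer Engineering and Design), 2003, 24(10): 75-77

Two square-root-free parallel Cholesky factorization algorithms are presented for networked clusters, a new parallel environment, using the Message Passing Interface (MPI). The algorithms use a row-wrapped storage scheme and a send-ahead strategy, which reduces load imbalance, increases the overlap of computation with communication, and shortens communication time. Theoretical analysis and numerical experiments both show that the algorithms achieve high parallel speedup and efficiency.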

2.

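Implementation of a Parallel LDL^T Factorization Algorithm for Symmetric Positive Definite Matrices (total citations: 1; self-citations: 0; citations by others: 1)

Zhang Jianfei (张健飞), Jiang Hongdao (姜弘道). 《计算机工程与设计》 (Computer Engineering and Design), 2003, 24(10): 75-77

Two square-root-free parallel Cholesky factorization algorithms are presented for networked clusters, a new parallel environment, using the Message Passing Interface (MPI). The algorithms use a row-wrapped storage scheme and a send-ahead strategy, which reduces load imbalance, increases the overlap of computation with communication, and shortens communication time. Theoretical analysis and numerical experiments both show that the algorithms achieve high parallel speedup and efficiency.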
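For orientation, here is a minimal serial sketch of a square-root-free (LDL^T) factorization, together with a row-wrapped (cyclic) row-to-process mapping. It is not the paper's parallel algorithm: the function names `ldlt` and `row_owner` are hypothetical, and the MPI communication and send-ahead strategy are left out.

```python
def ldlt(a):
    """Factor a symmetric positive definite matrix a as L * D * L^T.

    L is unit lower triangular and D is diagonal (returned as list d),
    so no square roots are needed.
    """
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Diagonal entry of D for column j.
        d[j] = a[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        # Entries of column j of L below the diagonal.
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))
            L[i][j] = s / d[j]
    return L, d


def row_owner(i, nprocs):
    """Row-wrapped (cyclic) storage: row i lives on process i mod nprocs."""
    return i % nprocs


A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
L, d = ldlt(A)
print(d)                                    # [4.0, 4.0, 4.0]
print([row_owner(i, 2) for i in range(3)])  # [0, 1, 0]
```

Dealing rows out cyclically means every process keeps some rows in the still-active trailing part of the matrix as the factorization proceeds, which is the usual reason a wrapped layout balances load better than contiguous blocks.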

3.

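A Simulation Study of the Parallel J-Variant Block Cholesky Factorization Algorithm

Gu Yaolin (顾耀林), Liu Wanlong (刘万龙), Liu Qiang (刘强), Hu Shouwei (胡寿伟). 《计算机仿真》 (Computer Simulation), 2006, 23(8): 82-85

This paper presents a parallel distributed algorithm for a linear solver for large dense real symmetric positive definite systems or dense complex symmetric non-Hermitian systems. Unlike ScaLAPACK, it uses the J-variant block Cholesky factorization algorithm with a one-dimensional block-cyclic column data distribution. Using MPI as the message-passing library, on clusters of up to 16 processors it delivers floating-point performance close to that of ScaLAPACK for dense real symmetric positive definite systems, and it can solve some electromagnetic scattering problems involving dense complex symmetric non-Hermitian systems. Its advantage is that the Cholesky factorization needs only half the storage required by the standard parallel library ScaLAPACK. The numerical simulation results show that the algorithm is correct and effective.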
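The one-dimensional block-cyclic column distribution mentioned above can be illustrated with a small index-mapping sketch. The helper names `block_cyclic_owner` and `local_columns` and the block size are illustrative assumptions, not taken from the paper.

```python
def block_cyclic_owner(col, block_size, nprocs):
    """Process that owns column `col` when consecutive blocks of
    `block_size` columns are dealt out to processes round-robin."""
    return (col // block_size) % nprocs


def local_columns(rank, ncols, block_size, nprocs):
    """Global indices of the columns stored on process `rank`."""
    return [c for c in range(ncols)
            if block_cyclic_owner(c, block_size, nprocs) == rank]


# 10 columns, blocks of 2 columns, 3 processes.
for rank in range(3):
    print(rank, local_columns(rank, ncols=10, block_size=2, nprocs=3))
```

With these parameters process 0 holds columns 0, 1, 6, 7, so each process owns columns from both the early and the late stages of the factorization.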

4.

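GPU-Based Parallel Cholesky Factorization and Its Application

Shen Yan (沈雁), Dai Yuxing (戴瑜兴). 《计算机工程》 (Computer Engineering), 2019, 45(2): 284-289

In clMAGMA, a library built on the OpenCL parallel computing framework, the Cholesky factorization uses a large-block parallel scheme that cannot fully exploit the GPU's fast local memory and requires multiple GPU-CPU data transfers during the computation. To address this, a small-block parallel scheme is proposed that makes full use of the GPU's fast local memory and reuses the inverses of matrix sub-blocks, giving an efficient Cholesky factorization of symmetric positive definite matrices that can also be applied to the large positive definite matrices arising in bundle adjustment for 3D vision. Experiments show that the method's Cholesky factorization is more than 50% faster than clMAGMA and, on bundle adjustment problems, about 38 times faster than the Eigen library used in Ceres Solver.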

5.

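Parallel Factorization of Positive Definite Matrices on MIMD Computers

Swarn P. Kumar, Janusz S. Kowalik, Wu Gongguang (吴功广). 《计算机工程与科学》 (Computer Engineering & Science), 1988, (4)

This paper studies the computation of the triangular factors of a square, real, symmetric positive definite matrix on a multiprocessor MIMD computer. Parallel algorithms based on Cholesky decomposition and Gaussian elimination are derived and analyzed in terms of speedup and efficiency when the number of processors is (i) unlimited and (ii) O(n), where n is the order of the matrix. For case (ii), it is shown that parallel elimination achieves the same speedup as the parallel Cholesky method while using only half as many processors.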

6.

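Gene Expression Data Classification with RELM Based on Fish Swarm Optimization and Cholesky Decomposition

Lu Huijuan (陆慧娟), Wei Shasha (魏莎莎), Guan Wei (关伟), Miao Yanzi (缪燕子). 《计算机科学》 (Computer Science), 2014, 41(12): 226-230

An improved regularized extreme learning machine algorithm based on the fish swarm optimization algorithm and Cholesky decomposition (FSC-RELM) is proposed for classifying gene expression data. FSC-RELM first optimizes the RELM input-layer weights with the fish swarm optimization algorithm, with the objective function defined as the reciprocal of the error function. It then decomposes the RELM output-layer weight matrix, using Cholesky decomposition to speed up the algorithm and reduce training time. Experiments on several standard gene data sets show that FSC-RELM reaches high classification accuracy in a short time.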
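As a rough illustration of the Cholesky step, the sketch below solves the regularized normal equations (H^T H + lam I) beta = H^T t for the output weights of a single-output network. That ridge-style formulation is an assumption about how RELM is usually stated, not something the abstract spells out. The names `relm_output_weights`, `H`, `t`, and `lam` are hypothetical, and the fish swarm optimization of the input weights is not shown.

```python
import math


def cholesky(a):
    """Lower-triangular L with L * L^T == a (a symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(a[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def solve_spd(a, b):
    """Solve a x = b via Cholesky: forward then backward substitution."""
    L = cholesky(a)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def relm_output_weights(H, t, lam):
    """Output weights beta from hidden-layer outputs H (one row per sample),
    targets t, and regularization strength lam (assumed formulation)."""
    m = len(H[0])
    G = [[sum(row[i] * row[j] for row in H) + (lam if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * ti for row, ti in zip(H, t)) for i in range(m)]
    return solve_spd(G, rhs)


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
t = [1.0, 2.0, 3.0]
print(relm_output_weights(H, t, lam=0.0))
```

Because H^T H + lam I is symmetric positive definite for lam > 0, a Cholesky solve costs about half as much as a general LU solve, which matches the speed motivation given in the abstract.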

7.

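A Parallel BFGS Algorithm for Unconstrained Optimization and Its Implementation

Li Wenjing (李文敬), Wang Ruliang (王汝凉), Liao Weizhi (廖伟志). 《计算机工程》 (Computer Engineering), 2009, 35(15): 58-60,6

The BFGS algorithm for unconstrained optimization and its convergence are introduced. A parallel BFGS algorithm is built from a row-wrapped parallel Cholesky factorization, a synchronous parallel Wolfe-Powell nonlinear search, and parallel evaluation of the BFGS update formula, and its time complexity and speedup are analyzed. Numerical experiments on a PC cluster show that the parallel BFGS algorithm speeds up the solution of unconstrained optimization problems. The theoretical analysis agrees with the experimental results, and the parallel algorithm exhibits linear speedup.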
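For reference, the standard textbook form of the BFGS update and the linear system for the search direction are written out below. The abstract does not give the paper's exact notation, so the symbols $B_k$, $s_k$, $y_k$, and $d_k$ are assumptions.

```latex
\begin{aligned}
s_k &= x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \\
B_{k+1} &= B_k - \frac{B_k s_k s_k^{T} B_k}{s_k^{T} B_k s_k}
           + \frac{y_k y_k^{T}}{y_k^{T} s_k}, \\
B_k d_k &= -\nabla f(x_k).
\end{aligned}
```

A line search satisfying the Wolfe-Powell conditions guarantees $y_k^{T} s_k > 0$, so $B_{k+1}$ stays symmetric positive definite whenever $B_k$ is. This is what allows the direction $d_k$ to be computed with a Cholesky factorization at every iteration.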

8.

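An Efficient Parallel Block Algorithm for Reducing the Generalized Hermitian Eigenproblem to Standard Form

Zhao Yonghua (赵永华), Chi Xuebin (迟学斌), Cheng Qiang (程强). 《计算机研究与发展》 (Journal of Computer Research and Development), 2007, 44(10): 1724-1732

The performance of a parallel solver for the generalized Hermitian eigenproblem depends on many factors, including the choice of parallel algorithm and the matrix distribution strategy. Based on block storage and block-algorithm strategies, a new parallel algorithm for the reduction to standard form is proposed. It merges the Cholesky factorization into the reduction, which lowers the communication overhead of existing parallel algorithms and increases parallelism. The new algorithm significantly improves on the performance and scalability of existing parallel algorithms. An efficient parallel block algorithm is also given for the triangular matrix equation AX=B with multiple right-hand sides. Tests with PSEPS, the authors' own parallel eigenproblem software package, show that the new algorithm is roughly twice as fast as the traditional parallel algorithm and scales well.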
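The reduction itself can be sketched serially for the real symmetric case: with B = L L^T, the generalized problem A x = lambda B x becomes the standard problem C y = lambda y, where C = L^{-1} A L^{-T} and y = L^T x. The helper names below are hypothetical, and the sketch shows only the textbook transformation, not the paper's parallel block algorithm.

```python
import math


def cholesky(b):
    """Lower-triangular L with L * L^T == b (b symmetric positive definite)."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        L[j][j] = math.sqrt(b[j][j] - sum(L[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = b[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L


def lower_solve(L, M):
    """Solve L X = M for the matrix X by forward substitution."""
    n, m = len(L), len(M[0])
    X = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = M[i][j] - sum(L[i][k] * X[k][j] for k in range(i))
            X[i][j] = s / L[i][i]
    return X


def transpose(M):
    return [list(row) for row in zip(*M)]


def to_standard_form(A, B):
    """Return C = L^{-1} A L^{-T} and L, where B = L L^T."""
    L = cholesky(B)
    W = lower_solve(L, A)                        # W = L^{-1} A
    C = transpose(lower_solve(L, transpose(W)))  # C = W L^{-T}
    return C, L


A = [[2.0, 1.0], [1.0, 3.0]]
B = [[4.0, 2.0], [2.0, 2.0]]
C, L = to_standard_form(A, B)
print(C)  # [[0.5, 0.0], [0.0, 2.5]]
```

In this small example C comes out diagonal, so the generalized eigenvalues 0.5 and 2.5 can be read off directly. Both triangular solves are instances of the multiple-right-hand-side equation AX=B that the abstract mentions.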

9.

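Parallel Task Partitioning Based on Load Balancing for Multilayer Reservoir Problems

Shu Jiwu (舒继武), Zhao Jinxi (赵金熙), Zhou Weisi (周维四), Zhang Defu (张德富). 《软件学报》 (Journal of Software), 1999, 10(10): 19

For distributed parallel computer systems, three task partitioning strategies are given for a class of multilayer, two-dimensional, two-phase-flow reservoir simulation problems: the "wrap" scheme, domain decomposition, and a combination of the two. The strategies are compared, and a task partitioning method is proposed that reduces solution time, favors load balance, and improves parallel performance. The method was applied to a large-scale reservoir simulation with as many as 720,000 grid nodes. The computed results show that the parallel tasks produced by this partitioning are well balanced, which helps improve speedup. The method also applies to task partitioning for domain-parallel or data-parallel problems.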

10.

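Parallel Task Partitioning Based on Load Balancing for Multilayer Reservoir Problems (total citations: 1; self-citations: 0; citations by others: 1)

Shu Jiwu (舒继武), Zhao Jinxi (赵金熙), Zhou Weisi (周维四), Zhang Defu (张德富). 《软件学报》 (Journal of Software), 1999, 10(10): 1061-1066

For distributed parallel computer systems, three task partitioning strategies are given for a class of multilayer, two-dimensional, two-phase-flow reservoir simulation problems: the "wrap" scheme, domain decomposition, and a combination of the two. The strategies are compared, and a task partitioning method is proposed that reduces solution time, favors load balance, and improves parallel performance. The method was applied to a large-scale reservoir simulation with as many as 720,000 grid nodes. The computed results show that the parallel tasks produced by this partitioning are well balanced, which helps improve speedup. The method also applies to task partitioning for domain-parallel or data-parallel problems.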
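To make the load-balance argument concrete, the sketch below compares a wrap (round-robin) assignment with a contiguous block assignment for tasks whose costs grow with their index. The cost vector and the function names are illustrative assumptions, and the paper's actual reservoir tasks are not modeled.

```python
def wrap_partition(n_tasks, nprocs):
    """Wrap scheme: task t goes to process t mod nprocs."""
    return [list(range(p, n_tasks, nprocs)) for p in range(nprocs)]


def block_partition(n_tasks, nprocs):
    """Contiguous blocks of nearly equal size."""
    base, extra = divmod(n_tasks, nprocs)
    parts, start = [], 0
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        parts.append(list(range(start, start + size)))
        start += size
    return parts


def imbalance(parts, cost):
    """Maximum process load divided by mean load (1.0 is perfect balance)."""
    loads = [sum(cost[t] for t in part) for part in parts]
    return max(loads) / (sum(loads) / len(loads))


cost = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical per-task costs
print(imbalance(wrap_partition(8, 4), cost))
print(imbalance(block_partition(8, 4), cost))
```

For this cost profile the wrap scheme gives a lower imbalance than contiguous blocks, because each process receives a mix of cheap and expensive tasks.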

11.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou Yong Dou Song Guo Rongchun Li Lin Deng 《Concurrency and Computation》2014,26(16):2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.

12.

Computational models and task scheduling for parallel sparse Cholesky factorization

Joseph W. H. Liu 《Parallel Computing》1986,3(4):327-342

In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical path scheduling method. This will give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.

13.

Triangularization of a positive definite matrix on a parallel computer

Swarn P. Kumar Janusz S. Kowalik 《Journal of Parallel and Distributed Computing》1986,3(4)

The problem of computing the triangular factors of a square, real, symmetric, and positive definite matrix by using the facilities of a multiprocessor MIMD-type computer is considered. The parallel algorithms based on Cholesky decomposition and Gaussian elimination are derived and analyzed in terms of their speedup and efficiency, when the available number of processors is O(n), where n is the size of the matrix. It is shown that the parallel elimination method can achieve the same speedup as the parallel Cholesky method while using only half the number of processors required by the Cholesky method.
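The comparison above is easier to read with the usual definitions of speedup and efficiency written out. The symbols $T_1$, $T_p$, $S_p$, and $E_p$ are the conventional ones and are assumed here rather than taken from the paper.

```latex
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}, \qquad
\text{and if } S^{\mathrm{elim}}_{p/2} = S^{\mathrm{chol}}_{p}
\text{ then } E^{\mathrm{elim}}_{p/2} = \frac{S^{\mathrm{chol}}_{p}}{p/2} = 2\,E^{\mathrm{chol}}_{p}.
```

Under these definitions, matching the Cholesky method's speedup with half as many processors means the elimination method runs at twice the efficiency.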

14.

Graph grammar‐driven parallel partial differential equation solver

Maciej Paszyński Robert Schaefer 《Concurrency and Computation》2010,22(9):1063-1097

The paper presents an extension of the composite programmable graph grammar (CP‐graph grammar) suitable for modeling the parallel direct solver algorithm utilized by the hp finite element method (hp‐FEM). In the proposed graph grammar model, the computational mesh is represented by a CP‐graph. The presented graph grammar models the solver algorithm by a set of graph grammar productions. The graph grammar model makes it possible to examine the concurrency of the algorithm by analyzing the interdependence between the atomic tasks, tasks and super‐tasks. The atomic tasks correspond to the graph grammar productions, representing basic indivisible parts of the algorithm. The level of atomic tasks models the concurrency for shared memory architectures. On the other hand, the tasks correspond to groups of atomic tasks with predefined inter‐task communication channels. They constitute the grain for the decomposition of the parallel algorithm for the distributed memory architecture. Finally, the super‐tasks correspond to a group of tasks resulting from the execution of the load balancing algorithm. The solver algorithm is tested on a distributed memory Linux cluster for up to 192 processors. Copyright © 2009 John Wiley & Sons, Ltd.

15.

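A Fine-Grained Parallel Algorithm for Cholesky Factorization (total citations: 1; self-citations: 0; citations by others: 1)

Wu Guiming (邬贵明), Dou Yong (窦勇), Wang Miao (王淼). 《计算机工程与科学》 (Computer Engineering & Science), 2010, 32(9): 102-106

A fine-grained pipelined parallel algorithm for Cholesky factorization is proposed. It handles data of arbitrary size and fully exploits the fine-grained parallelism offered by FPGA accelerators. Experiments show that the algorithm scales well: 36 processing elements (PEs) fit on a Xilinx XC5VLX330 FPGA, and for a matrix of order 16384 at a clock frequency of 200 MHz the performance reaches 14.3 GFLOPS.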

16.

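Action Recognition Based on an Improved Extreme Learning Machine Algorithm

Zhou Shuren (周书仁), Cao Sisi (曹思思), Cai Biye (蔡碧野). 《计算机工程与科学》 (Computer Engineering & Science), 2017, 39(9): 1749-1757

The paper focuses on how well the extreme learning machine (ELM) performs for action recognition. To address the computational complexity and time cost of online learning and action classification, a new learning algorithm for action recognition (ELM-Cholesky) is proposed. The algorithm first introduces a Cholesky-based method for solving the ELM. Then, exploiting the way the kernel matrix is updated during online learning, it applies the block-matrix Cholesky factorization to the online ELM solution so that the triangular factor is updated online, which yields a new ELM-Cholesky online learning algorithm. The new algorithm makes full use of historical training data, reduces computational complexity, and improves recognition accuracy. Extensive experiments on benchmark databases demonstrate the effectiveness of the online learning algorithm.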
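The idea of updating the triangular factor online can be illustrated with the standard bordered-matrix update: when a kernel matrix K with known factor L gains one new row and column, only one triangular solve and one square root are needed. This is a generic sketch under that assumption. `extend_cholesky` is a hypothetical name, and the paper's block version may differ in detail.

```python
import math


def extend_cholesky(L, k, c):
    """Given lower-triangular L with L * L^T == K, return the factor of the
    bordered matrix [[K, k], [k^T, c]] without refactoring K."""
    n = len(L)
    # Forward substitution: solve L s = k.
    s = [0.0] * n
    for i in range(n):
        s[i] = (k[i] - sum(L[i][j] * s[j] for j in range(i))) / L[i][i]
    d2 = c - sum(v * v for v in s)
    if d2 <= 0.0:
        raise ValueError("extended matrix is not positive definite")
    new_L = [row + [0.0] for row in L]
    new_L.append(s + [math.sqrt(d2)])
    return new_L


L = [[2.0]]                                # factor of K = [[4]]
L = extend_cholesky(L, [2.0], 5.0)         # K = [[4, 2], [2, 5]]
L = extend_cholesky(L, [2.0, 3.0], 6.0)    # K grows to 3 x 3
print(L)  # [[2.0, 0.0, 0.0], [1.0, 2.0, 0.0], [1.0, 1.0, 2.0]]
```

Each extension costs O(n^2) operations instead of the O(n^3) of a full refactorization, which is the source of the reduced online-learning cost.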

17.

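Multi-Constraint Graph Partitioning for Contact Problems Using an Edge-Weight Constraint Method

Wang Chenhui (王晨晖), Shen Hao (沈浩), Xiao Yonghao (肖永浩). 《计算机辅助工程》 (Computer Aided Engineering), 2012, 21(1): 36-41

To improve the efficiency of parallel computation for contact problems, the parallelism of the internal-force and contact computations is analyzed, and a method for constructing a contact multi-constraint graph based on edge-weight constraints is proposed. The load balance and communication performance of the multi-constraint graph partitioning algorithm and the dual domain partitioning algorithm are compared and analyzed. Numerical experiments show that, on a typical two-dimensional model, the multi-constraint graph partitioning algorithm balances load slightly worse than the dual domain partitioning algorithm but still keeps the load imbalance within a good range, while simplifying the communication in the parallel computation, reducing total communication volume, and lowering the share of dynamic communication.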