Similar literature
 20 similar articles found (search time: 15 ms)
1.
In the direct solution of sparse symmetric positive definite linear systems, finding an ordering of the matrix that minimizes the height of the elimination tree (an indication of the number of parallel elimination steps) is crucial for computing the Cholesky factor effectively in parallel. This problem is known to be NP-hard. Though many effective heuristics have been proposed, the questions of how close these heuristics come to optimal and how to further reduce the height of the elimination tree remain unanswered. This paper is an effort toward answering them. We introduce a genetic algorithm tailored to this parallel ordering problem, characterized by two novel genetic operators: adaptive merge crossover and tree rotate mutation. Experiments showed that our approach is cost-effective in the number of generations needed to reach a better solution in reducing the height of the elimination tree.
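As a rough illustration of the quantity being minimized in item 1, the sketch below builds the elimination tree of a symmetric sparsity pattern with the standard ancestor-compression algorithm and reports its height for two orderings of the same 5-vertex path graph. It is not the genetic algorithm from the abstract; the function names and the input convention (`upper_cols[j]` lists the rows i < j with a nonzero in column j) are assumptions made for this example.

```python
def elimination_tree(n, upper_cols):
    """Return parent[] of the elimination tree (-1 marks a root).

    upper_cols[j] lists the row indices i < j with A[i][j] != 0
    (hypothetical input convention for this sketch).
    """
    parent = [-1] * n
    ancestor = [-1] * n  # path-compressed shortcut toward the current root
    for j in range(n):
        for i in upper_cols[j]:
            # Climb from i to the root of its subtree, redirecting the path to j.
            while i != -1 and i < j:
                nxt = ancestor[i]
                ancestor[i] = j
                if nxt == -1:
                    parent[i] = j
                i = nxt
    return parent


def tree_height(parent):
    """Height in edges; relies on parent[i] > i for every non-root node."""
    depth = [0] * len(parent)
    for i in range(len(parent) - 1, -1, -1):
        if parent[i] != -1:
            depth[i] = depth[parent[i]] + 1
    return max(depth, default=0)


# Path graph a-b-c-d-e in its natural ordering: the tree is a chain.
natural = [[], [0], [1], [2], [3]]
# Same graph with the middle vertex c ordered last (a, b, d, e, c).
dissected = [[], [0], [], [2], [1, 2]]

h_natural = tree_height(elimination_tree(5, natural))
h_dissected = tree_height(elimination_tree(5, dissected))
print(h_natural, h_dissected)
```

In this toy case the natural ordering yields a chain of height 4, while ordering the middle vertex last halves the height to 2, which is the kind of gain an ordering heuristic is looking for.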

2.
This paper presents a solution to the problem of partitioning the work for sparse matrix factorization among the individual processors of a multiprocessor system. The proposed task assignment strategy is based on the structure of the elimination tree associated with the given sparse matrix. The goal of the task scheduling strategy is to achieve load balancing and a high degree of concurrency among the processors while reducing the amount of processor-to-processor data communication, even for arbitrarily unbalanced elimination trees. This is important because popular fill-reducing ordering methods, such as the minimum degree algorithm, often produce unbalanced elimination trees. Results from the Intel iPSC/2 are presented for various finite-element problems using both nested dissection and minimum degree orderings. Research supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.

3.
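This paper presents a parallel distributed algorithm for a linear solver for large dense systems of equations that are either real symmetric positive definite or complex symmetric non-Hermitian. Unlike ScaLAPACK, it uses a J-variant block Cholesky factorization algorithm and a one-dimensional block-cyclic column data distribution. Using MPI as the message-passing library, the algorithm delivers floating-point performance comparable to that of ScaLAPACK for real symmetric positive definite dense systems on clusters of up to 16 processors, and it can also solve some electromagnetic scattering problems involving complex symmetric non-Hermitian dense systems. Its advantage is that the storage required for the Cholesky factorization is only half of that needed by the standard parallel library ScaLAPACK. Numerical simulation results show that the algorithm is correct and effective.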

4.
Large sparse matrices with compound entries, i.e. complex and quaternionic matrices as well as matrices with dense blocks, are a core component of many algorithms in geometry processing, physically based animation and other areas of computer graphics. We generalize several matrix layouts and apply joint schedule and layout autotuning to improve the performance of the sparse matrix-vector product on massively parallel graphics processing units. Compared to schedule tuning without layout tuning, we achieve speedups of up to 5.5×. In comparison to cuSPARSE, we achieve speedups of up to 4.7×.

5.
In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model on multiple processors is described. It is based on a heuristic critical path scheduling method. This gives an overall scheme for parallel sparse Cholesky factorization that is appropriate for parallel machines with a shared-memory architecture, such as the Denelcor HEP.

6.
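By analyzing the Cholesky factorization method for solving linear systems, a graph model is established for the triangularization of symmetric positive definite matrices by Cholesky factorization. Based on this model and the characteristics of mesh-structured P/G networks, a fast analysis algorithm for P/G networks is proposed. Experiments show that the algorithm greatly reduces both the analysis time and the memory usage for mesh-structured P/G networks.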

7.
In this paper we present PMORSy, a new parallel software package for symmetric sparse matrix ordering on shared memory systems. The NP-complete fill-in minimization problem is solved by means of a multilevel nested dissection algorithm with modifications for vertex separators. Parallel processing is done in a task-based fashion with granularity tuning. We employ threading techniques on shared memory using OpenMP 3.0 technology, as opposed to the Message Passing Interface-based approach widely used for parallel sparse matrix ordering. Experimental results on symmetric matrices from the University of Florida Sparse Matrix Collection and matrices from finite-element analysis of three-dimensional strength problems show that our implementation is competitive with the ParMETIS and PT-Scotch libraries in both ordering quality and performance. The PMORSy library is publicly available from the Lobachevsky State University Supercomputing Center website.

8.
We consider the problem of reducing data traffic among processor nodes during the parallel factorization of a sparse matrix on a hypercube multiprocessor. A task assignment strategy based on the structure of an elimination tree is presented. This assignment is aimed at achieving load balancing among the processors and also reducing the amount of processor-to-processor data communication. An analysis of regular grid problems is presented, providing a bound on communication volume generated by the new strategy, and showing that the allocation scheme is optimal in the asymptotic sense. Some experimental results on the performance of this scheme are presented.

9.
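Multiplying a sparse matrix by a vector (SpM×V) is the core computational kernel of many large-scale applications. This paper proposes a load-balancing algorithm for computing SpM×V in parallel on parallel computers. Its computational complexity is O(N), where N is the order of the sparse matrix, whereas the best existing load-balancing algorithm for this type of problem has complexity O(N·P), where P is the number of processors. Parallel numerical experiments are given at the end of the paper.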
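To make the load-balancing idea in item 9 concrete, here is a minimal sketch that splits the rows of a CSR matrix into contiguous blocks with roughly equal nonzero counts in a single pass over the row pointer array. It is a generic greedy scheme written for illustration, not the algorithm of the abstract; `partition_rows_by_nnz`, `block_loads`, and the sample `row_ptr` are hypothetical.

```python
def partition_rows_by_nnz(row_ptr, num_procs):
    """Split rows into num_procs contiguous blocks of similar nonzero count.

    row_ptr is the CSR row pointer array (length n + 1). Returns block
    boundaries: block p owns rows bounds[p] .. bounds[p + 1] - 1.
    """
    n = len(row_ptr) - 1
    total = row_ptr[n]
    bounds = [0]
    row = 0
    for p in range(1, num_procs):
        # Advance while the cumulative nonzero count stays within p/num_procs
        # of the total (integer comparison avoids floating-point rounding).
        while row < n and row_ptr[row + 1] * num_procs <= total * p:
            row += 1
        bounds.append(row)
    bounds.append(n)
    return bounds


def block_loads(row_ptr, bounds):
    """Number of nonzeros owned by each block."""
    return [row_ptr[b] - row_ptr[a] for a, b in zip(bounds, bounds[1:])]


# Six rows holding 4, 1, 1, 1, 1 and 4 nonzeros respectively.
row_ptr = [0, 4, 5, 6, 7, 8, 12]
bounds = partition_rows_by_nnz(row_ptr, 3)
print(bounds, block_loads(row_ptr, bounds))
```

Because `row` only moves forward, the loop touches each row at most once, so the cost is O(N + P). A split by row count alone would give the three blocks 5, 2 and 5 nonzeros here; the nonzero-based split gives 4 each.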

10.
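Sparse matrix multiplication is widely used in scientific and engineering computing and is a common basic operation in scientific computing. It faces problems such as large data volumes, irregular distribution of nonzero values, difficulty in balancing load, and irregularly distributed column indices in the result matrix. These problems are addressed through matrix blocking, optimized data transfer, load balancing, and an improved parallel quicksort, which together increase computational efficiency. With multithreading, the computation is on average 56% faster than the commercial software Intel MKL (Intel Math Kernel Library). In addition, a hybrid parallel optimization using MPI+OpenMP is implemented; on shared-memory systems the two achieve similar computation speeds.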

11.
This paper introduces new approaches to the data distribution-partition problem for sparse matrices in a multiprocessor environment. In this work, the data partition problem of a sparse matrix is modeled as a Min-Max Problem subject to the uniformity constraint when the goal is to balance the load for both sparse and dense operations. This problem is NP-complete, and two heuristic solutions (ABO and GPB) are proposed. The key of ABO and GPB is to determine the permutation of rows/columns of the input sparse matrix to obtain a sorted matrix with a homogeneous density of nonzero elements. Due to the heuristic nature of the proposed methods, their validation is carried out by a comparative study of the parallel efficiency of two types of problems (sparse and mixed) when ABO, GPB, Block, Cyclic and MRD data distributions are applied. This work has been partially supported by the Spanish CICYT through grant TIC2002-00228.

12.
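Yuan Huiping (袁晖坪), Mi Ling (米玲). Chinese Journal of Computers (《计算机学报》), 2012, 35(5): 1073-1074
Properties of row (column) unitary symmetric matrices are studied, and the formula and fast algorithm for the QR decomposition of row (column) unitary symmetric matrices are corrected. The results reduce the computation and storage required for the QR decomposition of such matrices without loss of numerical accuracy.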

13.
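Local block factorization of block tridiagonal matrices and its application to preconditioning
This paper uses estimates of the factors in the factorization of block tridiagonal matrices to analyze their local dependence, and uses this to construct a class of incomplete-factorization preconditioners. An estimate of the condition number of the preconditioned five-point difference matrix is given and compared with the actual value, demonstrating the accuracy of the estimate and the effectiveness of the preconditioner. For the implementation, six serial implementation schemes of the preconditioner are considered and an effective parallelization method with low communication volume is proposed. Finally, extensive numerical experiments are carried out on a cluster of four microcomputers connected by high-speed Ethernet, and the method is compared with other effective preconditioning methods. The results show that the preconditioner performs well and is particularly suitable for parallel computing.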

14.
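Chen Daokun (陈道琨), Yang Chao (杨超), Liu Fangfang (刘芳芳), Ma Wenjing (马文静). Journal of Software (《软件学报》), 2023, 34(11): 4941-4951
Sparse triangular solve (SpTRSV) is an important operation in preconditioners. Structured SpTRSV problems are a common problem type in scientific computing programs that solve systems of partial differential equations with iterative methods, and they are often a performance bottleneck that such programs need to address. On GPU platforms, commercial GPU math libraries, represented by CUSPARSE, currently parallelize SpTRSV with the level-scheduling method. This method not only requires time-consuming preprocessing but also leaves GPU threads seriously idle when handling structured SpTRSV problems. A parallel algorithm tailored to structured SpTRSV problems is proposed. The algorithm partitions tasks by exploiting the special nonzero distribution pattern of structured SpTRSV problems, which avoids preprocessing analysis of the nonzero structure of the input problem. It also improves the element-wise processing strategy of the existing level-scheduling method, which effectively alleviates GPU thread idling and additionally hides the memory access latency of some of the matrix nonzeros. Based on the task partitioning characteristics of the algorithm, a state-variable compression technique is used, which significantly improves the cache hit rate of operations on the algorithm's state variables. On this basis, the implementation is thoroughly optimized using GPU hardware features such as predicated execution. Measured on an NVIDIA V100 GPU, the proposed algorithm achieves an average speedup of 2.71× over CUSPARSE, with an effective memory bandwidth of up to 225.2 GB/s. The improved element-wise processing strategy, combined with a series of GPU-specific tuning techniques, yields significant gains, raising the algorithm's effective memory bandwidth by about 1.15×.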

15.
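An optimization method for computing the principal eigenvectors of large-scale sparse matrices
Principal eigenvector computing (PEC) for matrices is an important problem in scientific and engineering computing. With the rise of general-purpose computing on graphics processing units (GPGPU), using GPUs to accelerate PEC for large-scale sparse matrices has attracted wide attention. The performance bottlenecks of PEC are analyzed from two perspectives: application characteristics and GPU architecture characteristics. A GPU-oriented sparse matrix storage format, GPU-ELL, and an optimized thread mapping strategy for GPUs are proposed, and a corresponding optimized PEC execution algorithm is designed. Experimental results on an ATI Radeon HD 5850 show a speedup of up to about 200× over a conventional CPU and a 2× speedup over existing GPU implementations.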
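The following sketch shows the two ingredients that item 15 combines, in their simplest sequential form: a matrix-vector product over a plain ELL layout (every row padded to the same width) and power iteration for the dominant eigenpair. The abstract's GPU-ELL format and thread mapping are not described in enough detail to reproduce, so this uses textbook ELL with a padding column index of -1, which is an assumption of this example, as are the function names.

```python
import math


def ell_spmv(cols, vals, x):
    """y = A x for a matrix in plain ELL form.

    cols[i][k] and vals[i][k] describe the k-th stored entry of row i;
    padding slots carry column index -1 (convention assumed here).
    """
    y = []
    for row_cols, row_vals in zip(cols, vals):
        s = 0.0
        for c, v in zip(row_cols, row_vals):
            if c >= 0:
                s += v * x[c]
        y.append(s)
    return y


def power_iteration(cols, vals, x0, iters=100):
    """Estimate the dominant eigenvalue and eigenvector by power iteration."""
    norm0 = math.sqrt(sum(t * t for t in x0))
    x = [t / norm0 for t in x0]
    lam = 0.0
    for _ in range(iters):
        y = ell_spmv(cols, vals, x)
        norm = math.sqrt(sum(t * t for t in y))
        if norm == 0.0:
            break
        lam = sum(a * b for a, b in zip(x, y))  # Rayleigh quotient, x has unit norm
        x = [t / norm for t in y]
    return lam, x


# A = [[2, 1, 0], [1, 2, 0], [0, 0, 1]] stored with two slots per row.
cols = [[0, 1], [0, 1], [2, -1]]
vals = [[2.0, 1.0], [1.0, 2.0], [1.0, 0.0]]
lam, vec = power_iteration(cols, vals, [1.0, 0.0, 0.0])
print(lam, vec)
```

The fixed row width is what makes ELL attractive on GPUs: each thread can process one row with a regular access pattern, at the cost of wasted work on padding when row lengths vary widely.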

16.
Fortran 90 provides a rich set of array intrinsic functions. Each of these array intrinsic functions operates on the elements of multi-dimensional array objects concurrently. They provide a rich source of parallelism and play an increasingly important role in automatic support of data parallel programming. However, there is no such support if these intrinsic functions are applied to sparse data sets. In this paper, we address this gap by presenting an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method provides both compression schemes and distribution schemes on distributed memory environments applicable to higher-dimensional sparse arrays. This way, programmers need not worry about low-level system details when developing sparse applications. Sparse programs can be expressed concisely using array expressions, and parallelized with the help of our library. Our sparse libraries are built for array intrinsics of Fortran 90, and they include an extensive set of array operations such as CSHIFT, EOSHIFT, MATMUL, MERGE, PACK, SUM, RESHAPE, SPREAD, TRANSPOSE, UNPACK, and section moves. To the best of our knowledge, ours is the first work to provide sparse and parallel sparse support for the array intrinsics of Fortran 90. In addition, we provide a complete complexity analysis for our sparse implementation. The complexity of our algorithms is proportional to the number of nonzero elements in the arrays, which is consistent with the conventional design criteria for sparse algorithms and data structures. Our current testbed is an IBM SP2 workstation cluster. Preliminary experimental results with numerical routines, numerical applications, and data-intensive applications related to OLAP (on-line analytical processing) show that our approach is promising in speeding up sparse matrix computations on both sequential and distributed memory environments if the programs are expressed with Fortran 90 array expressions.

17.
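Jin Guanghao (金光浩), Mo Zeyao (莫则尧). Chinese Journal of Computers (《计算机学报》), 2005, 28(12): 2045-2051
In some numerical simulations based on discrete meshes, the data dependencies among mesh cells can be abstracted as a directed graph. Partitioning such a directed graph into multiple subgraphs and mapping the simulation tasks of each subgraph onto different processors is the foundation of parallel computing for this class of simulations. The partitioning algorithm must take four objectives into account together: connectivity, degree of parallelism, load balance, and communication overhead. Building on traditional directed-graph partitioning algorithms, this paper proposes a multi-objective directed-graph partitioning algorithm for domain decomposition that trades off these four objectives. When applied to the parallel computation of cylindrically symmetric neutron transport on two-dimensional unstructured meshes, the parallel flux sweep algorithm achieves a parallel efficiency more than 50% higher with this partitioning than with the previous undirected-graph domain partitioning algorithm.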

18.
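Graph partitioning is widely applied in many scientific and engineering fields. When it is applied to task assignment in parallel computing, however, data dependencies are represented by an undirected graph, which limits its applicability (for example, undirected graphs cannot represent applications with rectangular or asymmetric dependency structures). To overcome this drawback, we distinguish the direction of the dependencies between data (that is, for each edge we distinguish the sender and the receiver of the communication), model the problem as a 0-1 program, and solve it with the NEOS server, a service commonly used on the Internet for solving optimization problems. Experiments on several data sets show that the 0-1 programming approach outperforms the popular multilevel partitioning method for graph partitioning.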

19.
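Sparse matrix-vector multiplication (SpMV) is one of the most commonly used kernels in scientific computing, and its performance depends on the distribution of nonzeros. For diagonal sparse matrices, the compressed row segment diagonal (CRSD) storage format is proposed. It uses a "diagonal format" to describe the diagonal distribution of the matrix effectively. Unlike earlier general-purpose approaches, CRSD samples the diagonal sparse matrices of a given application and then applies specific optimizations. At software installation time, an adaptive method selects the best SpMV implementation for the target platform. For the multithreaded parallel implementation on CPUs, the information collected during adaptive tuning is also used to partition tasks among threads to achieve load balance. CRSD is also implemented on GPUs and optimized according to the computation and memory access characteristics of the GPU. Experimental results show that on Intel and AMD multicore platforms with the same number of threads, CRSD achieves a speedup of up to 2.37× (1.7× on average) over DIA and up to 4.6× (2.1× on average) over CSR.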
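As background for item 19, the sketch below implements SpMV for the plain DIA format that CRSD is compared against. Each stored diagonal is identified by its offset (column minus row), and the entry for row i of a diagonal is kept at position i of that diagonal's array. This row-indexed layout is one common convention and is assumed here; CRSD itself is not reproduced, and `dia_spmv` is a hypothetical name.

```python
def dia_spmv(n, offsets, data, x):
    """y = A x for an n x n matrix stored by diagonals.

    offsets[d] is the diagonal's offset (column - row); data[d][i] holds
    A[i][i + offsets[d]]. Positions that fall outside the matrix are padding.
    """
    y = [0.0] * n
    for off, diag in zip(offsets, data):
        lo = max(0, -off)      # first row whose column i + off is >= 0
        hi = min(n, n - off)   # one past the last row whose column is < n
        for i in range(lo, hi):
            y[i] += diag[i] * x[i + off]
    return y


# Tridiagonal 4x4 matrix with 2 on the diagonal and -1 beside it.
offsets = [-1, 0, 1]
data = [
    [0, -1, -1, -1],  # subdiagonal (slot 0 is padding)
    [2, 2, 2, 2],     # main diagonal
    [-1, -1, -1, 0],  # superdiagonal (slot 3 is padding)
]
y = dia_spmv(4, offsets, data, [1, 2, 3, 4])
print(y)
```

DIA needs no column index array and reads `x` contiguously along each diagonal, which is why it performs well when the nonzeros really do lie on a few dense diagonals and poorly when diagonals are broken or sparse.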

20.
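A parallel matrix multiplication algorithm, IPBPMM (Interconnected Processor-Based Parallel Matrix Multiplication), is proposed. The algorithm runs in a cluster parallel computing environment of n independent processors whose topology is a Moore graph of diameter 2, such as the pentagon, the Petersen graph, or the Hoffman-Singleton graph (satisfying n = d^2 + 1, where n is the number of nodes and d is the degree). Compared with parallel matrix multiplication algorithms such as Cannon's and Fox's, which are based on a two-dimensional wraparound mesh topology, IPBPMM has lower communication overhead and higher speedup, and the matrix blocks can be distributed randomly among the nodes without having to be loaded into the nodes in a prescribed pattern beforehand. IPBPMM also extends well to parallel computing environments whose topology is composed of several Moore graphs of diameter 2, and its parallel speedup increases as the network grows.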
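The topological property that item 20 relies on can be checked directly on a small case. The sketch below builds the Petersen graph and verifies that it is 3-regular, has diameter 2, and has n = d^2 + 1 = 10 nodes, so any processor is at most two hops from any other. It says nothing about the IPBPMM algorithm itself; the vertex numbering and function names are choices made for this example.

```python
from collections import deque


def petersen_graph():
    """Adjacency sets of the Petersen graph on vertices 0..9."""
    adj = {v: set() for v in range(10)}

    def add_edge(u, v):
        adj[u].add(v)
        adj[v].add(u)

    for i in range(5):
        add_edge(i, (i + 1) % 5)          # outer 5-cycle
        add_edge(i, i + 5)                # spokes
        add_edge(5 + i, 5 + (i + 2) % 5)  # inner pentagram
    return adj


def diameter(adj):
    """Largest shortest-path distance, via breadth-first search from every node."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) < len(adj):
            raise ValueError("graph is disconnected")
        best = max(best, max(dist.values()))
    return best


graph = petersen_graph()
degrees = {len(neighbors) for neighbors in graph.values()}
print(len(graph), degrees, diameter(graph))
```

The same check applies to the pentagon (n = 5, d = 2); the Hoffman-Singleton graph (n = 50, d = 7) satisfies the same relation but is more involved to construct.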
