首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper presents a solution to the problem of partitioning the work for sparse matrix factorization to individual processors on a multiprocessor system. The proposed task assignment strategy is based on the structure of the elimination tree associated with the given sparse matrix. The goal of the task scheduling strategy is to achieve load balancing and a high degree of concurrency among the processors while reduçing the amount of processor-to-processor data comnication, even for arbitrarily unbalanced elimination trees. This is important because popular fill-reducing ordering methods, such as the minimum degree algorithm, often produce unbalanced elimination trees. Results from the Intel iPSC/2 are presented for various finite-element problems using both nested dissection and minimum degree orderings.Research supported by the Applied Mathematical Sciences Research Program, Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems Inc.  相似文献   

2.
We consider the problem of reducing data traffic among processor nodes during the parallel factorization of a sparse matrix on a hypercube multiprocessor. A task assignment strategy based on the structure of an elimination tree is presented. This assignment is aimed at achieving load balancing among the processors and also reducing the amount of processor-to-processor data communication. An analysis of regular grid problems is presented, providing a bound on communication volume generated by the new strategy, and showing that the allocation scheme is optimal in the asymptotic sense. Some experimental results on the performance of this scheme are presented.  相似文献   

3.
This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which call parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%;, depending on the problem size and the number of processors used.This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990.  相似文献   

4.
不完全Cholesky分解预条件共轭梯度(incomplete Cholesky factorization preconditioned conjugate gradient, ICCG)法是求解大规模稀疏对称正定线性方程组的有效方法.然而ICCG法要求在每次迭代中求解2个稀疏三角方程组,稀疏三角方程组求解固有的串行性成为了ICCG法在GPU上并行求解的瓶颈.针对稀疏三角方程组求解,给出了一种利用GPU加速的有效方法.为了增加稀疏三角方程组求解在GPU上的多线程并行性,提出了对不完全Cholesky分解产生的稀疏三角矩阵进行分层调度(level scheduling)的方法.为了进一步提高稀疏三角方程组求解的并行性能,提出了在分层调度前通过近似最小度(approximate minimum degree, AMD)算法对系数矩阵进行重排序、在分层调度后对稀疏三角矩阵进行层排序的方法,降低了分层调度过程中产生的层数,优化了稀疏三角方程组求解的GPU内存访问模式.数值实验表明,与利用NVIDIA CUSPARSE实现的ICCG法相比,采用上述方法性能可以获得平均1倍以上的提升.  相似文献   

5.
刘亚楠  涂铮铮  罗斌 《计算机应用》2013,33(10):2871-2873
为了充分利用图像本身的结构信息并充分压缩图像数据,把得到的子空间中数据(反馈)的稀疏性作为约束项加入非负张量分解目标函数中,即采用基于反馈稀疏约束的非负张量分解算法对图像集合进行降维。最后,将该算法应用于手写数字图像库中,实验结果表明所提出的方法能有效改善图像分类的准确性  相似文献   

6.
In this paper, a systematic and unified treatment of computational task models for parallel sparse Cholesky factorization is presented. They are classified as fine-, medium-, and large-grained graph models. In particular, a new medium-grained model based on column-oriented tasks is introduced, and it is shown to correspond structurally to the filled graph of the given sparse matrix. The task scheduling problem for the various task graphs is also discussed. A practical algorithm to schedule the column tasks of the medium-grained model for multiple processors is described. It is based on a heuristic critical path scheduling method. This will give an overall scheme for parallel sparse Cholesky factorization, appropriate for parallel machines with shared-memory architecture like the Denelcor HEP.  相似文献   

7.
Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations which involve sufficient dense math, the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost. However, for many other cases, the prevalence of small/irregular dense math and the relatively slow communication between the host and device over the PCIe bus, make it challenging to significantly accelerate sparse factorization using the GPU.In this paper we describe a left-looking supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead and many of these batched kernels are executed concurrently to maximize device utilization.Performance results for commonly studied matrices are presented along with suggested actions for further optimization.  相似文献   

8.
提出一种基于交替方向乘子法的(Alternating Direction Method of Multipliers,ADMM)稀疏非负矩阵分解语音增强算法,该算法既能克服经典非负矩阵分解(Nonnegative Matrix Factorization,NMF)语音增强算法存在收敛速度慢、易陷入局部最优等问题,也能发挥ADMM分解矩阵具有的强稀疏性。算法分为训练和增强两个阶段:训练时,采用基于ADMM非负矩阵分解算法对噪声频谱进行训练,提取噪声字典,保存其作为增强阶段的先验信息;增强时,通过稀疏非负矩阵分解算法,从带噪语音频谱中对语音字典和语音编码进行估计,重构原始干净的语音,实现语音增强。实验表明,该算法速度更快,增强后语音的失真更小,尤其在瞬时噪声环境下效果显著。  相似文献   

9.
We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA®  Tesla T20 series GPUs, we find double-precision speedups of up to 30×30× compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60×60× faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU.  相似文献   

10.
Algorithms and software for solving sparse symmetric positive definite systems on serial computers have reached a high state of development. In this paper, we present algorithms for performing sparse Cholesky factorization and sparse triangular solutions on a shared-memory multiprocessor computer, along with some numerical experiments demonstrating their performance on a Sequent Balance 8000 system.Research was supported by the Applied Mathematical Sciences Research Program of the Office of Energy Research, U.S. Department of Energy under contract DE-AC05-84OR21400 with Martin Marieta Energy Systems, Inc., by the U.S. Air Force Office of Scientific Research under contract AFOSR-ISSA-85-00083 and by the Canadian Natural Sciences and Engineering Research Council under grants A8111 and A5509.  相似文献   

11.
In this paper, we consider the problem of finding fill-preserving sparse matrix orderings for parallel factorization. That is, given a large sparse symmetric and positive definite matrix A that has been ordered by some fill-reducing ordering, we want to determine a reordering that is appropriate in terms of preserving the sparsity and minimizing the cost to perform the Cholesky factorization in parallel. Past researches on this problem all are based on the elimination tree model, in which each node represents the task for factoring a column, and thus, can be seen as a coarse-grained task dependence model. To exploit more parallelism, Joseph Liu proposed a medium-grained task model, called the column task graph, and showed that it is amenable to the shared-memory supercomputers. Based on the column task graph, we devise a greedy reordering algorithm, and show that our algorithm can find the optimal ordering among the class of all fill-preserving orderings of the given sparse matrix A.  相似文献   

12.
下一代观测望远镜将会产生数以亿计的星系测量数据值,这将导致使用中央处理器处理数据时效率低下、成本较高。为了解决这一问题,提出了基于宇宙计算的图形处理器算法。研究了两点式角相关函数以及孔径质量统计这两种宇宙学的计算方法,构建算法代码,并使用统一计算设备架构在图形处理器上实现了这两种算法;比较了算法在中央处理器和图形处理器上使用的运行速度。实验结果表明,与中央处理器相比,使用图形处理器的计算速度得到了显著提高。  相似文献   

13.
The incomplete Cholesky (IC) factorization preconditioning technique is applied to the Krylov subspace methods for solving large systems of linear equations resulted from the use of edge-based finite element method (FEM). The construction of the preconditioner is based on the fact that the coefficient matrix is represented in an upper triangular compressed sparse row (CSR) form. An efficient implementation of the IC factorization is described in detail for complex symmetric matrices. With some ordering schemes our IC algorithm can greatly reduce the memory requirement as well as the iteration numbers. Numerical tests on harmonic analysis for plane wave scattering from a metallic plate and a metallic sphere coated by a lossy dielectric layer show the efficiency of this method.  相似文献   

14.
目的 各类终端设备获取的大量数据往往由于信息丢失而导致数据不完整,或经常受到降质问题的困扰。为有效恢复缺损或降质数据,低秩张量补全备受关注。张量分解可有效挖掘张量数据的内在特征,但传统分解方法诱导的张量秩函数无法探索张量不同模式之间的相关性;另外,传统张量补全方法通常将全变分约束施加于整体张量数据,无法充分利用张量低维子空间的平滑先验。为解决以上两个问题,提出了基于稀疏先验与多模式张量分解的低秩张量恢复方法。方法 在张量秩最小化模型基础上,融入多模式张量分解技术以及分解因子局部稀疏性。首先对原始张量施加核范数约束,以此捕获张量的全局低秩性,然后,利用多模式张量分解将整体张量沿着每个模式分解为一组低维张量和一组因子矩阵,以探索不同模式之间的相关性,对因子矩阵施加因子梯度稀疏正则化约束,探索张量子空间的局部稀疏性,进一步提高张量恢复性能。结果 在高光谱图像、多光谱图像、YUV(也称为YCbCr)视频和医学影像数据上,将本文方法与其他8种修复方法在3种丢失率下进行定量及定性比较。在恢复4种类型张量数据方面,本文方法与深度学习GP-WLRR方法(global prior refined weighted low-rank representation)的修复效果基本持平,本文方法的MPSNR(mean peak signal-to-noise ratio)在所有丢失率及张量数据上的总体平均高0.68dB,MSSIM(mean structural similarity)总体平均高0.01;与其他6种张量建模方法相比,本文方法的MPSNR及MSSIM均取得最优结果。结论 提出的基于稀疏先验与多模式张量分解的低秩张量恢复方法,可同时利用张量的全局低秩性与局部稀疏性,能够对受损的多维视觉数据进行有效修复。  相似文献   

15.
传统的多标签分类算法是以二值标签预测为基础的,而二值标签由于仅能指示数据是否具有相关类别,所含语义信息较少,无法充分表示标签语义信息。为充分挖掘标签空间的语义信息,提出了一种基于非负矩阵分解和稀疏表示的多标签分类算法(MLNS)。该算法结合非负矩阵分解与稀疏表示技术,将数据的二值标签转化为实值标签,从而丰富标签语义信息并提升分类效果。首先,对标签空间进行非负矩阵分解以获得标签潜在语义空间,并将标签潜在语义空间与原始特征空间结合以形成新的特征空间;然后,对此特征空间进行稀疏编码来获得样本间的全局相似关系;最后,利用该相似关系重构二值标签向量,从而实现二值标签与实值标签的转化。在5个标准多标签数据集和5个评价指标上将所提算法与MLBGM、ML2、LIFT和MLRWKNN等算法进行对比。实验结果表明,所提MLNS在多标签分类中优于对比的多标签分类算法,在50%的案例中排名第一,在76%的案例中排名前二,在全部的案例中排名前三。  相似文献   

16.
Current generations of graphics processing units have turned into highly parallel devices with general computing capabilities. Thus, graphics processing units may be utilized, for example, to solve time dependent partial differential equations by the Fourier split operator method. In this contribution, we demonstrate that graphics processing units are capable to calculate fast Fourier transforms much more efficiently than traditional central processing units. Thus, graphics processing units render efficient implementations of the Fourier split operator method possible. Performance gains of more than an order of magnitude as compared to implementations for traditional central processing units are reached in the solution of the time dependent Schrödinger equation and the time dependent Dirac equation.  相似文献   

17.
Molecular dynamics (MD) methods compute the trajectory of a system of point particles in response to a potential function by numerically integrating Newton?s equations of motion. Extending these basic methods with rigid body constraints enables composite particles with complex shapes such as anisotropic nanoparticles, grains, molecules, and rigid proteins to be modeled. Rigid body constraints are added to the GPU-accelerated MD package, HOOMD-blue, version 0.10.0. The software can now simulate systems of particles, rigid bodies, or mixed systems in microcanonical (NVE), canonical (NVT), and isothermal-isobaric (NPT) ensembles. It can also apply the FIRE energy minimization technique to these systems. In this paper, we detail the massively parallel scheme that implements these algorithms and discuss how our design is tuned for the maximum possible performance. Two different case studies are included to demonstrate the performance attained, patchy spheres and tethered nanorods. In typical cases, HOOMD-blue on a single GTX 480 executes 2.5–3.6 times faster than LAMMPS executing the same simulation on any number of CPU cores in parallel. Simulations with rigid bodies may now be run with larger systems and for longer time scales on a single workstation than was previously even possible on large clusters.  相似文献   

18.
We present a computation method to accelerate the calculation of the Hamiltonian of a three-body time independent Schrödinger equation for collisions. The Hamiltonian is constructed with one dimensional (basis overlaps) and two dimensional (interparticle interaction) integrals that are mapped into a computational grid in a Graphics Processing Unit (GPU). We illustrate the method for the case of an electron impact single ionization of a two electron atom. This proposal makes use of a Generalized Sturmian Basis set for each electron, which are obtained numerically on a quadrature grid that is used to compute the integrals in the GPU. The optimal computation is more than twenty times faster in the GPU than the calculation in CPU. The method can be easily scaled to computers with several Graphics Processing Units or clusters.  相似文献   

19.
In this study, we discover the parallelism of the forward/backward substitutions (FBS) for two cases and thus propose an efficient preconditioned conjugate gradient algorithm with the modified incomplete Cholesky preconditioner on the GPU (GPUMICPCGA). For our proposed GPUMICPCGA, the following are distinct characteristics: (1) the vector operations are optimized by grouping several vector operations into single kernels, (2) a new kernel of inner product and a new kernel of the sparse matrix–vector multiplication with high optimization are presented, and (3) an efficient parallel implementation of FBS on the GPU (GPUFBS) for two cases are suggested. Numerical results show that our proposed kernels outperform the corresponding ones presented in CUBLAS or CUSPARSE, and GPUFBS is almost 3 times faster than the implementation of FBS using the CUSPARSE library. Furthermore, GPUMICPCGA has better behavior than its counterpart implemented by the CUBLAS and CUSPARSE libraries.  相似文献   

20.
该文提出一个针对大型实对称正定稠密方程组或复对称非Hermitian稠密方程组线性求解器的并行分布式算法。它使用了不同于ScaLAPACK的J-变量块Cholesky分解算法和一维块循环列数据分配。该算法以MPI作为消息传递库,在最多可达16个处理器的集群上针对实对称正定稠密方程组可提供与ScaLAPACK近似的浮点操作性能,并可解决一些涉及复对称非Hermitian稠密方程组的电磁场散射问题。该算法的优点是执行Cholesky分解所需的存储量只是标准并行库ScaLAPACK的一半。仿真的数值结果表明该算法是正确、有效的。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号