Similar Literature
20 similar articles retrieved.
1.
In this paper, some parallel algorithms are described for solving numerical linear algebra problems on the Dawning-1000. They include matrix multiplication, LU factorization of a dense matrix, Cholesky factorization of a symmetric matrix, and eigendecomposition of symmetric matrices for real and complex data types. These programs are built on the fast BLAS library of the Dawning-1000 under the NX environment. Some comparison results under different parallel environments and implementation methods are also given for the Cholesky factorization. The execution time, measured performance, and speedup for each problem on the Dawning-1000 are shown. For matrix multiplication and LU factorization, 1.86 GFLOPS and 1.53 GFLOPS are reached, respectively.
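
Routines of this kind are typically layered on BLAS-3 kernels. The following is a minimal illustrative sketch (not the Dawning-1000 code) of a right-looking blocked Cholesky factorization in that style; NumPy stands in for the vendor BLAS, and the block size nb is an arbitrary assumption:

```python
# A minimal sketch of a right-looking blocked Cholesky built on BLAS-3
# style operations; numpy stands in for the vendor BLAS library.
import numpy as np

def blocked_cholesky(A, nb=64):
    """Overwrite the lower triangle of A with its Cholesky factor L."""
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Factor the diagonal block (LAPACK potrf on a small dense block).
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        if k + b < n:
            Lkk = A[k:k+b, k:k+b]
            # Triangular solve for the panel below the diagonal (trsm role).
            A[k+b:, k:k+b] = np.linalg.solve(Lkk, A[k+b:, k:k+b].T).T
            # Symmetric rank-b update of the trailing matrix (syrk role).
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k+b:, k:k+b].T
    return np.tril(A)

# Check against the library routine on a random SPD matrix.
rng = np.random.default_rng(0)
M = rng.standard_normal((200, 200))
A = M @ M.T + 200 * np.eye(200)
print(np.allclose(blocked_cholesky(A.copy()), np.linalg.cholesky(A)))
```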

2.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and concerns about low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for the GPU and propose a queue-based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree-based parallel method for multi-GPU systems. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with existing parallel solvers by using problems arising from practical applications. The experimental results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12-core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state-of-the-art solver based on the supernodal method for CPU-GPU heterogeneous platforms, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.

3.
This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution used in all three factorization algorithms is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
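
The block cyclic data distribution mentioned above can be summarized by a one-line mapping from a global index to an owning process and a local index. A small sketch (block size and grid shape are illustrative assumptions):

```python
# Sketch of the block cyclic mapping along one dimension of a process grid:
# given a global row (or column) index, find the process coordinate that
# owns it and the index inside that process's local array.
def owner_and_local(g, nb, p):
    """Map global index g to (process coordinate, local index) for block
    size nb on p processes in this grid dimension."""
    block = g // nb                 # which global block g falls in
    proc = block % p                # blocks are dealt out cyclically
    local_block = block // p        # this process's preceding block count
    return proc, local_block * nb + g % nb

# Example: 8 global indices, block size 2, 2 processes ->
# blocks 0,1,2,3 are owned by processes 0,1,0,1.
print([owner_and_local(g, 2, 2) for g in range(8)])
```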

4.
Within the interior point method (IPM) framework, this paper derives the reduced and minimal correction equations for linear programming (LP) problems whose coefficient matrices have a block-angular structure with a coupling border, and proves that the diagonal blocks of the minimal correction equation are positive definite. Combining the Cholesky factorization of positive definite matrices with a decoupling technique, a parallel solution method for the correction equation is designed, and the structure of a parallel interior point algorithm for LP is given. Numerical experiments in a cluster environment show that the proposed algorithm achieves good speedup and scalability and is well suited to solving large-scale structured LP problems.
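
As an illustration of the decoupling the abstract exploits, the sketch below (random data and block sizes are assumptions, not taken from the paper) factors the positive definite diagonal blocks of a block-diagonal correction equation independently, which is the natural unit of parallel work:

```python
# Illustrative sketch: once the correction equation's diagonal blocks are
# known to be positive definite, each block can be factored by Cholesky
# and solved independently of the others. In the parallel algorithm each
# block would live on its own process; here they run in a simple loop.
import numpy as np

rng = np.random.default_rng(0)
blocks = []
for n in (4, 3):                            # two diagonal blocks (sizes assumed)
    M = rng.standard_normal((n, n))
    blocks.append(M @ M.T + n * np.eye(n))  # symmetric positive definite

factors = [np.linalg.cholesky(B) for B in blocks]
rhs = [rng.standard_normal(B.shape[0]) for B in blocks]
# Forward/back substitution per block solves the decoupled system.
sol = [np.linalg.solve(Lb.T, np.linalg.solve(Lb, r))
       for Lb, r in zip(factors, rhs)]
print([np.allclose(B @ x, r) for B, x, r in zip(blocks, sol, rhs)])
```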

5.
We describe an efficient and scalable symmetric iterative eigensolver developed for distributed memory multi‐core platforms. We achieve over 80% parallel efficiency by major reductions in communication overheads for the sparse matrix‐vector multiplication and basis orthogonalization tasks. We show that the scalability of the solver is significantly improved compared to an earlier version, after we carefully reorganize the computational tasks and map them to processing units in a way that exploits the network topology. We discuss the advantage of using a hybrid OpenMP/MPI programming model to implement such a solver. We also present strategies for hiding communication on a multi‐core platform. We demonstrate the effectiveness of these techniques by reporting the performance improvements achieved when we apply our solver to large‐scale eigenvalue problems arising in nuclear structure calculations. Because sparse matrix‐vector multiplication and inner product computation constitute the main kernels in most iterative methods, our ideas are applicable in general to the solution of problems involving large‐scale symmetric sparse matrices with irregular sparsity patterns. Copyright © 2013 John Wiley & Sons, Ltd.
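
Since sparse matrix-vector multiplication is named as one of the dominant kernels, here is a minimal CSR SpMV sketch; in the distributed solver each rank would hold a row block and exchange the needed entries of x, which is omitted here:

```python
# Minimal sketch of the sparse matrix-vector multiply kernel, for a matrix
# stored in compressed sparse row (CSR) form.
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A in CSR form (indptr, indices, data)."""
    y = np.zeros(len(indptr) - 1)
    for i in range(len(y)):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# 2x2 example: A = [[1, 2], [0, 3]] -> y = [3, 3].
print(csr_spmv([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0], np.array([1.0, 1.0])))
```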

6.
This paper presents a parallel distributed algorithm for a linear solver for large dense real symmetric positive definite systems and dense complex symmetric non-Hermitian systems. Unlike ScaLAPACK, it uses a J-variant block Cholesky factorization algorithm and a one-dimensional block-cyclic column data distribution. With MPI as the message-passing library, on clusters of up to 16 processors the algorithm delivers floating-point performance close to that of ScaLAPACK for real symmetric positive definite dense systems, and it can solve electromagnetic scattering problems involving dense complex symmetric non-Hermitian systems. Its advantage is that the storage required for the Cholesky factorization is only half that of the standard parallel library ScaLAPACK. Simulated numerical results show that the algorithm is correct and effective.

7.
The continuing increase of computer performance supports the practice of large-scale optimization. Computers with multiple computing cores and vector processing capabilities are now widely available. We investigate how the Advanced Vector Extensions (AVX) instruction set on Intel-compatible architectures can be exploited in interior point methods for linear and nonlinear optimization. We focus on data structures and implementation techniques that utilize the new vector instructions. Our numerical experiments demonstrate that the AVX instruction set provides a significant performance boost in our implementation on large-scale problems that have significant fill-in in the sparse Cholesky factorization, achieving up to 100 gigaflops on a standard desktop computer on linear optimization problems for which the required Cholesky factorization is relatively dense.

8.
A simple direct method and its variants, based on matrix multiplications, are developed to invert square non-singular matrices (which need not be positive definite) and to solve systems of linear equations. Since the method and its variants involve only matrix multiplications, they are straightforward to parallelise and hence have an advantage over some existing well-known direct methods. Theoretical background, an analysis of the proposed method, and the complexity of the algorithm for sparse and dense matrices are given. Two significantly different parallel algorithms for dense and sparse matrices are developed. In the case of the dense matrix, standard parallelisation of the matrix multiplications was undertaken. However, for sparse matrices the standard parallel matrix multiplication is not effective, which led us to develop a parallel algorithm different from that for the dense matrix. The performance of our algorithms is tested via numerical examples.
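
The abstract does not spell out the iteration itself; the classical Newton-Schulz scheme is one well-known inversion method that, like the one described, uses only matrix multiplications and therefore parallelizes in the same way. A sketch:

```python
# Newton-Schulz iteration: inversion using only matrix multiplications.
# This is a classical illustrative stand-in, not necessarily the paper's
# exact method.
import numpy as np

def newton_schulz_inverse(A, iters=50):
    """Iterate X <- X(2I - AX); converges when the initial residual
    norm ||I - AX|| is below 1."""
    n = A.shape[0]
    # Standard convergent starting guess: X0 = A^T / (||A||_1 ||A||_inf).
    X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))
    I2 = 2 * np.eye(n)
    for _ in range(iters):
        X = X @ (I2 - A @ X)         # two matrix multiplications per step
    return X

A = np.array([[4.0, 1.0], [2.0, 3.0]])
print(np.round(newton_schulz_inverse(A) @ A, 6))   # ~ identity
```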

9.
The incomplete Cholesky (IC) factorization preconditioning technique is applied to Krylov subspace methods for solving large systems of linear equations resulting from the use of the edge-based finite element method (FEM). The construction of the preconditioner is based on the fact that the coefficient matrix is represented in an upper triangular compressed sparse row (CSR) form. An efficient implementation of the IC factorization is described in detail for complex symmetric matrices. With some ordering schemes, our IC algorithm can greatly reduce the memory requirement as well as the number of iterations. Numerical tests on harmonic analysis of plane wave scattering from a metallic plate and from a metallic sphere coated with a lossy dielectric layer show the efficiency of this method.
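
For illustration, an IC(0) factor restricted to the nonzero pattern of A can be written compactly; the sketch below works on a dense array for clarity rather than the paper's upper-triangular CSR store, and omits the complex symmetric details:

```python
# Illustrative IC(0): an incomplete Cholesky factor that keeps only the
# positions where A itself is nonzero (no fill-in is allowed).
import numpy as np

def ic0(A):
    n = A.shape[0]
    L = np.tril(A).astype(float)
    pattern = A != 0
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])
        for i in range(k + 1, n):
            if pattern[i, k]:
                L[i, k] /= L[k, k]
        for j in range(k + 1, n):
            for i in range(j, n):
                if pattern[i, j]:        # update only A's nonzero positions
                    L[i, j] -= L[i, k] * L[j, k]
    return L

A = np.array([[4., 1., 0.],
              [1., 4., 1.],
              [0., 1., 4.]])
L = ic0(A)   # tridiagonal pattern has no fill, so IC(0) is exact here
print(np.allclose(L @ L.T, A))
```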

10.
In this paper, we consider the problem of finding fill-preserving sparse matrix orderings for parallel factorization. That is, given a large sparse symmetric positive definite matrix A that has been ordered by some fill-reducing ordering, we want to determine a reordering that preserves the sparsity and minimizes the cost of performing the Cholesky factorization in parallel. Past research on this problem has been based on the elimination tree model, in which each node represents the task of factoring a column; it can thus be seen as a coarse-grained task dependence model. To exploit more parallelism, Joseph Liu proposed a medium-grained task model, called the column task graph, and showed that it is amenable to shared-memory supercomputers. Based on the column task graph, we devise a greedy reordering algorithm and show that our algorithm finds the optimal ordering among the class of all fill-preserving orderings of the given sparse matrix A.
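
Both task models refine the elimination tree, in which parent[j] is the row of the first off-diagonal nonzero in column j of the Cholesky factor. A compact sketch of the classic path-compression construction (the input format here is an assumption):

```python
# Sketch of elimination tree construction for a symmetric sparsity pattern,
# given for each row i the columns j < i with A[i, j] != 0.
def elimination_tree(lower_rows, n):
    """Return parent[] of the elimination tree (-1 marks a root)."""
    parent = [-1] * n
    ancestor = [-1] * n
    for i in range(n):
        for j in lower_rows[i]:
            # Climb from j to its current root, compressing the path to i.
            while ancestor[j] != -1 and ancestor[j] != i:
                next_j = ancestor[j]
                ancestor[j] = i          # path compression toward i
                j = next_j
            if ancestor[j] == -1:
                ancestor[j] = i
                parent[j] = i
    return parent

# Arrow matrix: row 3 is connected to 0, 1, 2 -> all parents are 3.
print(elimination_tree([[], [], [], [0, 1, 2]], 4))   # [3, 3, 3, -1]
```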

11.
In this paper we propose a distributed packed storage format that exploits the symmetry or the triangular structure of a dense matrix. This format stores only half of the matrix while maintaining most of the efficiency of full storage for a wide range of operations. This work has been motivated by the fact that, in contrast to sequential linear algebra libraries (e.g. LAPACK), no routine or format in the currently available parallel distributed libraries handles packed matrices. The proposed algorithms exclusively use the existing ScaLAPACK computational kernels, which proves the generality of the approach, provides easy portability of the code, and allows efficient re-use of existing software. The performance results obtained for the Cholesky factorization show that our packed format performs as well as or better than the ScaLAPACK full storage algorithm for a small number of processors. For a larger number of processors, the ScaLAPACK full storage routine performs slightly better until each processor runs out of memory. Copyright © 2006 John Wiley & Sons, Ltd.
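
The indexing idea behind packed symmetric storage is that the lower triangle of an n-by-n matrix is laid out column by column in a one-dimensional array of length n(n+1)/2. A sketch of the sequential (LAPACK-style) mapping, without the paper's distributed block layout:

```python
# Sketch of column-major lower packed storage indexing: only the lower
# triangle is stored, so an n x n symmetric matrix needs n(n+1)/2 entries.
def packed_index(i, j, n):
    """Position of element (i, j), i >= j, in the packed 1-D array."""
    # Columns 0..j-1 hold (n - k) entries each, for k = 0..j-1.
    return j * n - j * (j - 1) // 2 + (i - j)

# Example: n = 4 stores 10 entries; (0,0)->0, (3,0)->3, (1,1)->4, (3,3)->9.
n = 4
print([packed_index(i, j, n) for j in range(n) for i in range(j, n)])
```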

12.
LU, QR, and Cholesky factorizations are the most widely used methods for solving dense linear systems of equations, and have been extensively studied and implemented on vector and parallel computers. Most of these factorization routines are implemented with block-partitioned algorithms in order to perform matrix–matrix operations, that is, to obtain the highest performance by maximizing reuse of data in the upper levels of memory, such as cache. Since parallel computers have different ratios of computation to communication performance, the optimal computational block sizes differ from machine to machine if the maximum performance of an algorithm is to be achieved. Therefore, the data matrix should be distributed with the machine-specific optimal block size before the computation. Too small or too large a block size makes achieving good performance on a machine nearly impossible. In such a case, getting better performance may require a complete redistribution of the data matrix. In this paper, we present parallel LU, QR, and Cholesky factorization routines with 'algorithmic blocking' on a two-dimensional block cyclic data distribution. With algorithmic blocking, it is possible to obtain near optimal performance irrespective of the physical block size. The routines are implemented on the Intel Paragon and the SGI/Cray T3E and compared with the corresponding ScaLAPACK factorization routines. Copyright © 2001 John Wiley & Sons, Ltd.

13.
In this paper, a fully implicit finite volume Eulerian scheme and a corresponding scalable parallel solver are developed for some tracer transport problems on the cubed sphere. To efficiently solve the large sparse linear system at each time step on parallel computers, we introduce a Schwarz preconditioned Krylov subspace method using two discretizations. More precisely, the higher-order method is used for the residual calculation and the lower-order method is used for the construction of the preconditioner. The matrices from the two discretizations have similar sparsity patterns and eigenvalue distributions, but the matrix from the lower-order method is much sparser; as a result, excellent scalability (in total computing time and number of iterations) is obtained. Even though the Schwarz preconditioner was originally designed for elliptic problems, our experiments indicate clearly that the method scales well for this class of purely hyperbolic problems. In addition, we show numerically that the proposed method is highly scalable in terms of both strong and weak scalability on a supercomputer with thousands of processors.
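
A small stand-in for the two-discretization idea (using SciPy, a 1-D implicit advection step instead of the cubed sphere, and ILU in place of the Schwarz preconditioner; all of these are illustrative assumptions): the Krylov solver sees the higher-order operator, while the preconditioner is built from the sparser lower-order one.

```python
# Two-discretization preconditioning sketch: GMRES on the higher-order
# operator, preconditioned by a factorization of the lower-order operator.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
dt = 0.5
# Implicit step of u_t + u_x = 0: A = I + dt * D for two upwind stencils.
D_low = sp.diags([1.0, -1.0], [0, -1], shape=(n, n))            # 1st order
D_high = sp.diags([1.5, -2.0, 0.5], [0, -1, -2], shape=(n, n))  # 2nd order
A_high = (sp.identity(n) + dt * D_high).tocsc()
A_low = (sp.identity(n) + dt * D_low).tocsc()

ilu = spla.spilu(A_low)                       # built from the sparser matrix
M = spla.LinearOperator((n, n), ilu.solve)    # preconditioner application
b = np.sin(np.linspace(0, 2 * np.pi, n))
x, info = spla.gmres(A_high, b, M=M)
print(info, np.linalg.norm(A_high @ x - b))   # info == 0 means converged
```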

14.
Parallel algorithms for direct methods of analysis and solution of linear algebra problems with sparse symmetric irregularly structured matrices are considered. The performance of such algorithms is investigated. Upper estimates of the speedup and efficiency factors are obtained for a parallel algorithm for triangular decomposition of sparse matrices. Some results of numerical experiments carried out on a MIMD computer are given.

15.
Transient simulation in circuit simulation tools, such as SPICE and Xyce, depends on scalable and robust sparse LU factorizations for efficient numerical simulation of circuits and power grids. As the need for simulations of very large circuits grows, the prevalence of multicore architectures enables us to use shared-memory parallel algorithms for such simulations. A parallel factorization is a critical component of such shared-memory parallel simulations. We develop a parallel sparse factorization algorithm that can solve problems from circuit simulations efficiently and that maps well to architectural features. This new factorization algorithm exposes hierarchical parallelism to accommodate the irregular structure that arises in our target problems. It also uses a hierarchical two-dimensional data layout, which reduces synchronization costs and maps to the memory hierarchy found in multicore processors. We present an OpenMP-based implementation of the parallel algorithm in a new multithreaded solver called Basker in the Trilinos framework. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulation. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to the state-of-the-art solver KLU. Basker outperforms the Intel MKL Pardiso solver (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides a 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

16.
For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures, but an effective scheduling scheme is also needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)-2 symmetric matrix-vector multiplication and the BLAS-3 low-rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi-GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU-GPU kernel into computational kernels at higher levels of the software stack, that is, a shared-memory dense eigensolver and a distributed-memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher-level kernels, not only reducing the solution time but also enabling the solution of larger-scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.

17.
Parallel Computing, 1997, 23(13): 2075–2093
This paper studies the parallel solution of large-scale sparse linear least squares problems on distributed-memory multiprocessors. The key components required for solving a sparse linear least squares problem are sparse QR factorization and sparse triangular solution. A block-oriented parallel algorithm for sparse QR factorization has already been described in the literature. In this paper, new block-oriented parallel algorithms for sparse triangular solution are proposed. The arithmetic and communication complexities of the new algorithms applied to regular grid problems are analyzed. The proposed parallel sparse triangular solution algorithms, together with the block-oriented parallel sparse QR factorization algorithm, result in a highly efficient approach to the parallel solution of sparse linear least squares problems. Performance results obtained on an IBM Scalable POWERparallel system SP2 are presented. The largest least squares problem solved has over two million rows and more than a quarter million columns. The numerical factorization of this problem runs at over 3.7 gigaflops on an IBM SP2 machine with 128 processors.
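
The sparse triangular solution that the paper parallelizes in block form reduces, sequentially, to substitution over compressed rows. A minimal sketch for a lower-triangular CSR matrix, assuming each row's entries are sorted and end with the diagonal:

```python
# Sequential sketch of sparse forward substitution: solve L x = b with L
# lower triangular in CSR form, the diagonal entry ending each row.
import numpy as np

def lower_csr_solve(indptr, indices, data, b):
    """Solve L x = b; row i's entries are sorted and end with L[i, i]."""
    x = b.astype(float).copy()
    for i in range(len(b)):
        start, end = indptr[i], indptr[i + 1]
        for k in range(start, end - 1):        # off-diagonal entries
            x[i] -= data[k] * x[indices[k]]
        x[i] /= data[end - 1]                  # divide by the diagonal
    return x

# L = [[2, 0], [1, 1]], b = [2, 2] -> x = [1, 1].
print(lower_csr_solve([0, 1, 3], [0, 0, 1], [2.0, 1.0, 1.0],
                      np.array([2.0, 2.0])))
```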

18.
Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.
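
A sequential sketch of the tile Cholesky task pattern that underlies such tile algorithms: each call below is one task (POTRF, TRSM, SYRK, or GEMM) that a runtime like the one described would schedule on a CPU core or a GPU once its inputs are ready. Tiles are NumPy blocks, and the tile count and size are assumptions:

```python
# Tiled Cholesky sketch: the loop nest enumerates exactly the tasks of the
# task graph; a dynamic runtime would dispatch them as dependencies resolve.
import numpy as np

def tile_cholesky(A, nt, ts):
    """In-place tiled Cholesky of an (nt*ts) x (nt*ts) SPD matrix."""
    T = lambda i, j: A[i*ts:(i+1)*ts, j*ts:(j+1)*ts]
    for k in range(nt):
        T(k, k)[:] = np.linalg.cholesky(T(k, k))                # POTRF task
        for i in range(k + 1, nt):
            T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T  # TRSM task
        for i in range(k + 1, nt):
            T(i, i)[:] -= T(i, k) @ T(i, k).T                   # SYRK task
            for j in range(k + 1, i):
                T(i, j)[:] -= T(i, k) @ T(j, k).T               # GEMM task
    return np.tril(A)

rng = np.random.default_rng(0)
nt, ts = 4, 8
M = rng.standard_normal((nt * ts, nt * ts))
A = M @ M.T + nt * ts * np.eye(nt * ts)
print(np.allclose(tile_cholesky(A.copy(), nt, ts), np.linalg.cholesky(A)))
```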

19.
In optimization routines used for on-line Model Predictive Control (MPC), linear systems of equations are solved in each iteration. This is true for Active Set (AS) solvers as well as for Interior Point (IP) solvers, and for linear MPC as well as for nonlinear and hybrid MPC. The main computational effort is spent solving these linear systems of equations, so it is of great interest to solve them efficiently. In high-performance solvers for MPC, this is performed using Riccati recursions or generic sparsity-exploiting algorithms. To obtain this performance gain, the problem has to be formulated in a sparse way, which introduces more variables. The alternative is to use a smaller formulation in which the objective function Hessian is dense. In this work, it is shown that the structure can be exploited also in the dense formulation. More specifically, it is shown that a standard Cholesky factorization for the dense formulation can be computed efficiently, with a computational complexity that grows quadratically in the prediction horizon length instead of cubically as for the generic Cholesky factorization.

20.
A. U. M. K., Future Generation Computer Systems, 2005, 21(8): 1275–1284
For the solution of sparse linear systems from circuit simulation whose coefficient matrices include a few dense rows and columns, a parallel iterative algorithm with distributed Schur complement preconditioning is presented. The parallel efficiency of the solver is increased by transforming the equation system into a problem without dense rows and columns and by exploiting parallel graph partitioning methods. The cost of the local incomplete LU decompositions is decreased by fill-in-reducing matrix reordering methods and a threshold strategy for the factorization. The efficiency of the parallel solver is demonstrated on real circuit simulation problems on PC clusters.
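
A small dense sketch (illustrative sizes and random data) of the bordered-block elimination behind such Schur complement approaches: the few dense rows and columns are eliminated through the Schur complement of the large sparse block B, so the solver only ever factors B-sized systems:

```python
# Schur complement sketch for a bordered system [[B, F], [E, C]]: B is the
# large (here small, for illustration) "sparse" block, and E, C, F hold the
# few dense rows and columns.
import numpy as np

rng = np.random.default_rng(1)
nb, nc = 6, 2                         # main block size, border size (assumed)
B = np.eye(nb) * 4 + rng.standard_normal((nb, nb)) * 0.1
F = rng.standard_normal((nb, nc))     # dense columns
E = rng.standard_normal((nc, nb))     # dense rows
C = np.eye(nc) * 4
b1, b2 = rng.standard_normal(nb), rng.standard_normal(nc)

S = C - E @ np.linalg.solve(B, F)     # tiny nc x nc Schur complement
x2 = np.linalg.solve(S, b2 - E @ np.linalg.solve(B, b1))
x1 = np.linalg.solve(B, b1 - F @ x2)

A = np.block([[B, F], [E, C]])        # check against the monolithic solve
print(np.allclose(np.linalg.solve(A, np.concatenate([b1, b2])),
                  np.concatenate([x1, x2])))
```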
