期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A locally optimized reordering algorithm and its application to a parallel sparse linear system solver

K. Gallivan P. C. Hansen Tz. Ostromsky Z. Zlatev 《Computing》1995,54(1):39-67

A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result inO(n ²) operations. By using binary trees this cost can be reduced toO(NZ logn), whereNZ is the number of non-zero elements in the matrix andn is its order (normallyNZ is much smaller thann ²). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is onlyO(NZ) operations. 相似文献

2.

Highly scalable parallel algorithms for sparse matrix factorization

Gupta A. Karypis G. Kumar V. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(5):502-520

In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Gray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems-both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Gray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer 相似文献

3.

An optimal storage format for sparse matrices

《Information Processing Letters》2004,90(2):87-92

The irregular nature of sparse matrix-vector multiplication, Ax=y, has led to the development of a variety of compressed storage formats, which are widely used because they do not store any unnecessary elements. One of these methods, the Jagged Diagonal Storage format (JDS) is, in addition, considered appropriate for the implementation of iterative methods on parallel and vector processors. In this work we present the Transpose Jagged Diagonal Storage format (TJDS) which drew inspiration from the Jagged Diagonal Storage scheme but requires less storage space than JDS. We propose an alternative storage scheme which makes no assumptions about the sparsity pattern of the matrix and only needs three linear arrays instead of the four linear arrays required by JDS. Specifically, the data is aligned in such a way that the permutation array used in JDS, to permute the solution vector back to the original ordering, is unnecessary. This allow us to save the memory space required to store an integer vector of length n, where n stands for the number of columns in the sparse matrix A. This storage saving reaches, for the selection of matrices used in this work, from 14% up to 45% of the number of non-zero values of the sparse matrices. We present a case study of a 6×6 sparse matrix to show the data structures and the algorithm to compute Ax=y using the TJDS format. 相似文献

4.

A recursive algorithm for generating the transition matrices of multistation multiserver exponential reliable queueing networks

《Computers & Operations Research》2001,28(9):853-883

This paper is concerned with reliable multistation series queueing networks. Items arrive at the first station according to a Poisson distribution and an operation is performed on each item by a server at each station. Every station is allowed to have more than one server with the same characteristics. The processing times at each station are exponentially distributed. Buffers of nonidentical finite capacities are allowed between successive stations. The structure of the transition matrices of these specific type of queueing networks is examined and a recursive algorithm is developed for generating them. The transition matrices are block-structured and very sparse. By applying the proposed algorithm the transition matrix of a K-station network can be created for any K. This process allows one to obtain the exact solution of the large sparse linear system by the use of the Gauss–Seidel method. From the solution of the linear system the throughput and other performance measures can be calculated.Scope and purposeThe exact analysis of queueing networks with multiple servers at each workstation and finite capacities of the intermediate queues is extremely difficult as for even the case of exponential operation (service or processing) times the Markovian chain that models the system consists of a huge number of states which grows exponentially with the number of stations, the number of servers at each station and the queue capacity of each intermediate queue of the resulting system. The scope and purpose of the present paper is to analyze and provide a recursive algorithm for generating the transition matrices of multistation multiserver exponential reliable queueing networks. By applying the proposed algorithm one may create the transition matrix of a K-station queueing network for any K. This process allows one to obtain the exact solution of the resulting large sparse linear system by the use of the Gauss–Seidel method. From the solution of the linear system the throughput and other performance measures of the system can be obtained. 相似文献

5.

Parallel sparse matrix vector multiply software for matrices with data locality

RAY S. TUMINARO JOHN N. SHADID SCOTT A. HUTCHINSON 《Concurrency and Computation》1998,10(3):229-247

In this paper we describe general software utilities for performing unstructured sparse matrix–vector multiplications on distributed-memory message-passing computers. The matrix–vector multiply comprises an important kernel in the solution of large sparse linear systems by iterative methods. Our focus is to present the data structures and communication parameters required by these utilities for general sparse unstructured matrices with data locality. These types of matrices are commonly produced by finite difference and finite element approximations to systems of partial differential equations. In this discussion we also present representative examples and timings which demonstrate the utility and performance of the software. © 1998 John Wiley & Sons, Ltd. 相似文献

6.

Multicoloring of grid-structured PDE solvers on shared-memorymultiprocessors

Hwang-Cheng Wang Kai Hwang 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(11):1195-1205

In order to execute a parallel PDE (partial differential equation) solver on a shared-memory multiprocessor, we have to avoid memory conflicts in accessing multidimensional data grids. A new multicoloring technique is proposed for speeding sparse matrix operations. The new technique enables parallel access of grid-structured data elements in the shared memory without causing conflicts. The coloring scheme is formulated as an algebraic mapping which can be easily implemented with low overhead on commercial multiprocessors. The proposed multicoloring scheme bas been tested on an Alliant FX/80 multiprocessor for solving 2D and 3D problems using the CGNR method. Compared to the results reported by Saad (1989) on an identical Alliant system, our results show a factor of 30 times higher performance in Mflops. Multicoloring transforms sparse matrices into ones with a diagonal diagonal block (DDB) structure, enabling parallel LU decomposition in solving PDE problems. The multicoloring technique can also be extended to solve other scientific problems characterized by sparse matrices 相似文献

7.

Auto-tuned Krylov methods on cluster of graphics processing unit

Frédéric Magoulès Abal-Kassim Cheik Ahamed Roman Putanowicz 《国际计算机数学杂志》2015,92(6):1222-1250

Exascale computers are expected to have highly hierarchical architectures with nodes composed by multiple core processors (CPU; central processing unit) and accelerators (GPU; graphics processing unit). The different programming levels generate new difficult algorithm issues. In particular when solving extremely large linear systems, new programming paradigms of Krylov methods should be defined and evaluated with respect to modern state of the art of scientific methods. Iterative Krylov methods involve linear algebra operations such as dot product, norm, addition of vectors and sparse matrix–vector multiplication. These operations are computationally expensive for large size matrices. In this paper, we aim to focus on the best way to perform effectively these operations, in double precision, on GPU in order to make iterative Krylov methods more robust and therefore reduce the computing time. The performance of our algorithms is evaluated on several matrices arising from engineering problems. Numerical experiments illustrate the robustness and accuracy of our implementation compared to the existing libraries. We deal with different preconditioned Krylov methods: Conjugate Gradient for symmetric positive-definite matrices, and Generalized Conjugate Residual, Bi-Conjugate Gradient Conjugate Residual, transpose-free Quasi Minimal Residual, Stabilized BiConjugate Gradient and Stabilized BiConjugate Gradient (L) for the solution of sparse linear systems with non symmetric matrices. We consider and compare several sparse compressed formats, and propose a way to implement effectively Krylov methods on GPU and on multicore CPU. Finally, we give strategies to faster algorithms by auto-tuning the threading design, upon the problem characteristics and the hardware changes. As a conclusion, we propose and analyse hybrid sub-structuring methods that should pave the way to exascale hybrid methods. 相似文献

8.

Efficient Incremental Algorithms for the Sparse Resultant and the Mixed Volume

《Journal of Symbolic Computation》1995,20(2):117-149

We propose a new and efficient algorithm for computing the sparse resultant of a system of n + 1 polynomial equations in n unknowns. This algorithm produces a matrix whose entries are coefficients of the given polynomials and is typically smaller than the matrices obtained by previous approaches. The matrix determinant is a non-trivial multiple of the sparse resultant from which the sparse resultant itself can be recovered. The algorithm is incremental in the sense that successively larger matrices are constructed until one is found with the above properties. For multigraded systems, the new algorithm produces optimal matrices, i.e. expresses the sparse resultant as a single determinant. An implementation of the algorithm is described and experimental results are presented. In addition, we propose an efficient algorithm for computing the mixed volume of n polynomials in n variables. This computation provides an upper bound on the number of common isolated roots. A publicly available implementation of the algorithm is presented and empirical results are reported which suggest that it is the fastest mixed volume code to date. 相似文献

9.

Explicit semi-direct methods based on approximate inverse matrix techniques for solving boundary-value problems on parallel processors

《Mathematics and computers in simulation》1987,29(1):1-17

Generalized approximate inverse matrix techniques and sparse Gauss-Jordan elimination procedures based on the concept of sparse product form of the inverse are introduced for calculating explicitly approximate inverses of large sparse unsymmetric (n × n) matrices. Explicit first and second order semi-direct methods in conjunction with the derived approximate inverse matrix techniques are presented for solving Parabolic and Elliptic difference equations on parallel processors. Application of the new methods on a 2D-model problem is discussed and numerical results are given. 相似文献

10.

Preconditioned CG methods for sparse matrices on massively parallel machines

《Parallel Computing》1997,23(3):381-398

Conjugate gradient (CG) methods to solve sparse systems of linear equations play an important role in numerical methods for solving discretized partial differential equations. The large size and the condition of many technical or physical applications in this area result in the need for efficient parallelization and preconditioning techniques of the CG method, in particular on massively parallel machines. Here, the data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements. Polynomial preconditioning is shown to reduce global synchronizations considerably, and a fully local incomplete Cholesky preconditioner is presented. On a PARAGON XP/S 10 with 138 processors, the developed parallel methods outperform diagonally scaled CG markedly with respect to both scaling behavior and execution time for many matrices from real finite element applications. 相似文献

11.

Run-Time Techniques for Exploiting Irregular Task Parallelism on Distributed Memory Architectures

Cong Fu Tao Yang 《Journal of Parallel and Distributed Computing》1997,42(2):486

Automatic scheduling for directed acyclic graphs (DAG) and its applications for coarse-grained irregular problems such as largen-body simulation have been studied in the literature. However, solving irregular problems with mixed granularities such as sparse matrix factorization is challenging since it requires efficient run-time support to execute a DAG schedule. In this paper, we investigate run-time optimization techniques for executing general asynchronous DAG schedules on distributed memory machines and discuss an approach for exploiting parallelism from commuting operations in the DAG model. Our solution tightly integrates the run-time scheme with a fast communication mechanism to eliminate unnecessary overhead in message buffering and copying. We present a consistency model incorporating the above optimizations, and take advantage of task dependence properties to ensure the correctness of execution. We demonstrate the applications of this scheme in sparse matrix factorizations and triangular equation solving for which actual speedups are difficult to obtain. We provide a detailed experimental study on Meiko CS-2 to show that the automatically scheduled code has achieved good performance for these difficult problems, and the run-time overhead is small compared to total execution times. 相似文献

12.

Decomposition in Multidimensional Boolean-Optimization Problems with Sparse Matrices

D. V. Kovkov D. V. Lemtyuzhnikova 《Journal of Computer and Systems Sciences International》2018,57(1):97-108

In this paper, we review problems associated with sparse matrices. We formulate several theorems on the allocation of a quasi-block structure in a sparse matrix, as well as on the relation of the degree of the quasi-block structure and the number of its blocks, depending on the dimension of the matrix and the number of nonzero elements in it. Algorithms for the solution of integer optimization problems with sparse matrices that have the quasi-block structure are considered. Algorithms for allocating the quasi-block structures are presented. We describe the local elimination algorithm, which is efficient for problems with matrices that have a quasi-block structure. We study the problem of an optimal sequence for the elimination of variables in the local elimination algorithm. For this purpose, we formulate a series of notions and prove the properties of graph structures corresponding to the order of the solution of subproblems. Different orders of the elimination of variables are tested. 相似文献

13.

Solution of linear equations with a symmetrically skyline-stored nonsymmetric matrix

Eliezer Mendelssohn Menachem Baruch 《Computers & Structures》1984,18(2):215-246

Fortran IV subroutines for the in-core solution of linear algebraic systems with a sparse, symmetrically skylined-stored nonsymmetric coefficient matrix are presented. Such systems arise in various computations, among which are the finite element discretization in conjunction with incremental continuum mechanics, or space-time finite elements for dynamical systems. These routines can be used for constrained systems without prearranging. The feature of partial decomposition is installed and its application to the analysis of singular matrices is discussed. 相似文献

14.

Symbolic and Numeric Methods for Exploiting Structure in Constructing Resultant Matrices

《Journal of Symbolic Computation》2002,33(4):393-413

Resultants characterize the existence of roots of systems of multivariate nonlinear polynomial equations, while their matrices reduce the computation of all common zeros to a problem in linear algebra. Sparse elimination theory has introduced the sparse (or toric) resultant, which takes into account the sparse structure of the polynomials. The construction of sparse resultant, or Newton, matrices is the critical step in the computation of the multivariate resultant and the solution of a nonlinear system. We reveal and exploit the quasi-Toeplitz structure of the Newton matrix, thus decreasing the time complexity of constructing such matrices by roughly one order of magnitude to achieve quasi-quadratic complexity in the matrix dimension. The space complexity is also decreased analogously. These results imply similar improvements in the complexity of computing the resultant polynomial itself and of solving zero-dimensional systems. Our approach relies on fast vector-by-matrix multiplication and uses the following two methods as building blocks. First, a fast and numerically stable method for determining the rank of rectangular matrices, which works exclusively over floating point arithmetic. Second, exact polynomial arithmetic algorithms that improve upon the complexity of polynomial multiplication under our model of sparseness, offering bounds linear in the number of variables and the number of non-zero terms. 相似文献

15.

The effects on communication of data representation of nested preconditionings for massively parallel architectures

《Computing Systems in Engineering》1995,6(4-5):437-441

The effect which the representation of the data (matrices and vectors) has on the communication patterns of preconditionings for exploitation of massively parallel architectures is discussed. Preconditioned iterative methods are used to solve the sparse linear systems generated by discretizations of partial differential equations in many areas of science and engineering. The preconditionings considered are based on nested incomplete factorization with approximate tridiagonal inverses using a two color line ordering of the discretization grid. These preconditionings can be described in terms of vector-vector to vector operations of dimension equal to half the total number of grid points. 相似文献

16.

Parallel Solutions for Large-Scale General Sparse Nonlinear Systems of Equations

下载免费PDF全文

HU Chengyi 《计算机科学技术学报》1996,11(3):257-271

In solving application problems,many large-scale nonlinear systems of equaions result in sparse Jacobian matrices.Such nonlinear systems are called sparse nonlinear systems.The irregularity of the locations of nonzrero elements of a general sparse matrix makes it very difficult to generally map sparse matrix computations to multiprocessors for parallel processing in a well balanced manner.To overcome this difficulty,we define a new storage scheme for general sparse matrices in this paper,With the new storage scheme,we develop parallel algorithms to solve large-scale general sparse systems of equations by interval Newton/Generalized bisection methods which reliably find all numerical solutions within a given domain.I n Section 1,we provide an introduction to the addressed problem and the interval Newton‘s methods.In Section 2,some currently used storage schemes for sparse systems are reviewed.In Section 3,new index schemes to store general sparse matrices are reported.In Section 4,we present a parallel algorithm to evaluate a general sparse Jacobian matrix.In Section 5,we present a parallel algorithm to solve the corresponding interval linear system by the all-row preconditioned scheme.Conclusions and future work are discussed in Section 6. 相似文献

17.

Accelerating sparse Cholesky factorization on GPUs

《Parallel Computing》2016

Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations which involve sufficient dense math, the substantial computational capability provided by GPUs (Graphics Processing Units) can help alleviate this cost. However, for many other cases, the prevalence of small/irregular dense math and the relatively slow communication between the host and device over the PCIe bus, make it challenging to significantly accelerate sparse factorization using the GPU.In this paper we describe a left-looking supernodal Cholesky factorization algorithm which permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead and many of these batched kernels are executed concurrently to maximize device utilization.Performance results for commonly studied matrices are presented along with suggested actions for further optimization. 相似文献

18.

Algorithms for the structural diagnosis and decomposition of sparse,underconstrained design systems

《Computer aided design》1996,28(4):237-249

Most design systems and variational geometry scenarios are represented by a set of sparse constraints. Typically, such a system is underconstrained and is solved in an interactive framework by the specification of additional constraints (inputs) by the designer. At the onset of the design solution process, and also after every input operation, it must be determined if the complete set of design constraints (both initial and input) has any redundancies. This defines the structural diagnosis problem. The set of variables and constraints that can then be solved (solution set) has to be identified, along with the most efficient order of solving them. This is the structural decomposition problem. It has been demonstrated in literature that occurrence matrices can be used to represent the structure of design systems. In this context, the diagnosis of the design system can be accomplished by the maximum transversal, and the decomposition process involves the block decomposition of the occurrence matrix representing the system of design constraints. This paper uses certain existing sparse matrix decomposition algorithms to derive algorithms for the structural diagnosis and decomposition of sparse, underconstrained design systems using occurrence matrices. 相似文献

19.

Distributed generic approximate sparse inverses

George A. Gravvanis Christos K. Filelis-Papadopoulos 《The Journal of supercomputing》2014,70(1):365-384

The need for accuracy in the solution of linear systems derived from the discretization of partial differential equations leads to large sparse linear systems. The solution of sparse linear systems requires efficient scalable methods. Iterative solvers require efficient parallel preconditioning methods to solve effectively sparse linear systems. Herewith, a new parallel algorithm for the generic approximate sparse inverse matrix method for distributed memory systems is proposed. The computation of the distributed generic approximate sparse inverse matrix is based on a column-wise approach, which allows the separation to independent problems that can be handled in parallel without synchronization points or intermediate communications. This is achieved by reforming the generic approximate sparse inverse matrix algorithm and its process of computation with a new partial solution method for the computation of the nonzero elements of each column dictated by the approximate inverse sparsity pattern. Moreover, an algorithmic scheme is proposed for the efficient distribution of data amongst the available workstations, along with a load balancing scheme for problems with large standard deviation in the number of nonzero elements per column. Numerical results are presented for the proposed schemes for various model problems. 相似文献

20.

Efficient GPU Data Structures and Methods to Solve Sparse Linear Systems in Dynamics Applications

Daniel Weber Jan Bender Markus Schnoes André Stork Dieter Fellner 《Computer Graphics Forum》2013,32(1):16-26

We present graphics processing unit (GPU) data structures and algorithms to efficiently solve sparse linear systems that are typically required in simulations of multi‐body systems and deformable bodies. Thereby, we introduce an efficient sparse matrix data structure that can handle arbitrary sparsity patterns and outperforms current state‐of‐the‐art implementations for sparse matrix vector multiplication. Moreover, an efficient method to construct global matrices on the GPU is presented where hundreds of thousands of individual element contributions are assembled in a few milliseconds. A finite‐element‐based method for the simulation of deformable solids as well as an impulse‐based method for rigid bodies are introduced in order to demonstrate the advantages of the novel data structures and algorithms. These applications share the characteristic that a major computational effort consists of building and solving systems of linear equations in every time step. Our solving method results in a speed‐up factor of up to 13 in comparison to other GPU methods. 相似文献