1.

Parallel solution of sparse linear least squares problems on distributed-memory multiprocessors

《Parallel Computing》1997,23(13):2075-2093

This paper studies the parallel solution of large-scale sparse linear least squares problems on distributed-memory multiprocessors. The key components required for solving a sparse linear least squares problem are sparse QR factorization and sparse triangular solution. A block-oriented parallel algorithm for sparse QR factorization has already been described in the literature. In this paper, new block-oriented parallel algorithms for sparse triangular solution are proposed. The arithmetic and communication complexities of the new algorithms applied to regular grid problems are analyzed. The proposed parallel sparse triangular solution algorithms together with the block-oriented parallel sparse QR factorization algorithm result in a highly efficient approach to the parallel solution of sparse linear least squares problems. Performance results obtained on an IBM Scalable POWERparallel system SP2 are presented. The largest least squares problem solved has over two million rows and more than a quarter million columns. The execution speed for the numerical factorization of this problem achieves over 3.7 gigaflops per second on an IBM SP2 machine with 128 processors.

2.

Decomposing Monomial Representations of Solvable Groups (total citations: 1; self-citations: 0; citations by others: 1)

Markus Püschel 《Journal of Symbolic Computation》2002,34(6):561

We present an efficient algorithm that decomposes a monomial representation of a solvable group G into its irreducible components. In contradistinction to other approaches, we also compute the decomposition matrix A in the form of a product of highly structured, sparse matrices. This factorization provides a fast algorithm for the multiplication with A. In the special case of a regular representation, we hence obtain a fast Fourier transform for G. Our algorithm is based on a constructive representation theory that we develop. The term “constructive” signifies that concrete matrix representations are considered and manipulated, rather than equivalence classes of representations as it is done in approaches that are based on characters. Thus, we present well-known theorems in a constructively refined form and derive new results on decomposition matrices of representations. Our decomposition algorithm has been implemented in the GAP share package AREP. One application of the algorithm is the automatic generation of fast algorithms for discrete linear signal transforms.

3.

Iterative algorithms for Gram-Schmidt orthogonalization

Walter Hoffmann 《Computing》1989,41(4):335-348

The algorithms that are treated in this paper are based on the classical and the modified Gram-Schmidt algorithms. It is shown that Gram-Schmidt orthogonalization for constructing a QR factorization should be carried out iteratively to obtain a matrix Q that is orthogonal in almost full working precision. In the formulation of the algorithms, the parts that express manipulations with matrices or vectors are clearly identified to enable an optimal implementation of the algorithms on parallel and/or vector machines. An extensive error analysis is presented. It shows, for instance, that the iterative classical algorithm is not inferior to the iterative modified algorithm when full precision of Q is required. Experiments are reported to support the outcomes of the analysis.
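
The idea of carrying out Gram-Schmidt iteratively can be illustrated with a small sketch. The function name `cgs2`, the fixed count of two passes per column, and the list-of-columns data layout are assumptions made for this example; the paper's own formulation and stopping criterion are not given in the abstract.

```python
import math


def cgs2(columns):
    """QR factorization by classical Gram-Schmidt with one re-orthogonalization pass.

    `columns` is a list of column vectors (lists of floats), assumed linearly
    independent. Returns (q_cols, r), where q_cols holds the orthonormal
    columns of Q and r is upper triangular with r[i][j] = R_ij.
    """
    n = len(columns)
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(columns):
        v = list(a)
        # Two passes (an assumption): the second pass removes the components
        # along earlier q vectors that rounding left behind in the first pass.
        for _ in range(2):
            # Classical variant: every coefficient is computed from the same v.
            coeffs = [sum(qi * vi for qi, vi in zip(q, v)) for q in q_cols]
            for i, (c, q) in enumerate(zip(coeffs, q_cols)):
                r[i][j] += c
                v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        r[j][j] = norm
        q_cols.append([vi / norm for vi in v])
    return q_cols, r
```

Calling `cgs2` on nearly dependent columns, such as `[1, eps, 0, 0]`, `[1, 0, eps, 0]`, `[1, 0, 0, eps]` with a small `eps`, is a convenient way to see the effect of the second pass on the orthogonality of Q.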

4.

A family of parallel QR factorization algorithms

Gerard G.L. Meyer Mike Pascale 《Concurrency and Computation》1996,8(6):461-473

Rapid computation of the QR factorization of a matrix is fundamental to many scientific and engineering problems. The paper presents a family of algorithms parameterized by the number of processors available P, arithmetic grain aggregation parameters g₁, g₂, …, g_P, and communication grain aggregation parameter h, which compute the QR factorization of a matrix A ∈ C^{m × n} with minimal latency. The approach is particularly well suited for dedicated distributed memory architectures such as linear arrays of INMOS Transputers, Texas Instruments C40s or Analog Devices 21060s.

5.

Scalability Issues Affecting the Design of a Dense Linear Algebra Library

Dongarra J. J. Vandegeijn R. A. Walker D. W. 《Journal of Parallel and Distributed Computing》1994,22(3)

This paper discusses the scalability of Cholesky, LU, and QR factorization routines on MIMD distributed memory concurrent computers. These routines form part of the ScaLAPACK mathematical software library that extends the widely used LAPACK library to run efficiently on scalable concurrent computers. To ensure good scalability and performance, the ScaLAPACK routines are based on block-partitioned algorithms that reduce the frequency of data movement between different levels of the memory hierarchy, and particularly between processors. The block cyclic data distribution, which is used in all three factorization algorithms, is described. An outline of the sequential and parallel block-partitioned algorithms is given. Approximate models of the algorithms' performance are presented to indicate which factors in the design of the algorithm have an impact upon scalability. These models are compared with timing results on a 128-node Intel iPSC/860 hypercube. It is shown that the routines are highly scalable on this machine for problems that occupy more than about 25% of the memory on each processor, and that the measured timings are consistent with the performance model. The contribution of this paper goes beyond reporting our experience: our implementations are available in the public domain.
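
The block cyclic data distribution mentioned above can be sketched as an index mapping. The function names, the square block size `block`, and the convention that the first block lives on process (0, 0) are assumptions made for this illustration, not ScaLAPACK's actual interface.

```python
def block_cyclic_owner(i, j, block, grid_rows, grid_cols):
    """Return the (process row, process column) that owns global entry (i, j).

    Assumes zero-based indices, square blocks of size `block`, and that
    block (0, 0) is assigned to process (0, 0).
    """
    return (i // block) % grid_rows, (j // block) % grid_cols


def local_index(g, block, nprocs):
    """Map a global row or column index to its local index on the owning process.

    `nprocs` is the number of processes along that dimension of the grid.
    """
    # Count the full blocks this process already holds, then add the offset
    # of the index within its own block.
    return (g // (block * nprocs)) * block + g % block
```

For example, with `block=2` on a 2 × 3 process grid, `block_cyclic_owner(5, 7, 2, 2, 3)` returns `(0, 0)`. Dealing blocks out cyclically means that every process keeps some work as the factorization shrinks the active part of the matrix.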

6.

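Improved QR decomposition detection algorithm for VBLAST-OFDM systems

Fu Hongliang Li Yongjie Zhang Yuan Guan Aihong 《Journal of Data Acquisition and Processing》2011,26(1)

Building on the conventional QR decomposition algorithm for VBLAST-OFDM systems, this paper introduces the ideas of maximum likelihood detection and parallel interference cancellation into the QR decomposition algorithm and proposes an improved method that overcomes the poor detection performance of the conventional algorithm. Using QR decomposition, detection starts from the last two signal layers, which are treated as an equivalent MIMO system with 2 transmit and 2 receive antennas. After these layers have been detected and decided with the optimal algorithm, either the decisions are substituted back into the original QR decomposition algorithm to detect the remaining layers, or the influence of the already decided signals is cancelled in parallel, step by step, before the decision for the next equivalent MIMO system is made, until all signals have been detected. Simulation results show that the improved QR decomposition detection algorithm achieves better bit error performance than the conventional QR algorithm and the zero-forcing algorithm.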

7.

Parallel tiled QR factorization for multicore architectures

Alfredo Buttari Julien Langou Jakub Kurzak Jack Dongarra 《Concurrency and Computation》2008,20(13):1573-1590

As multicore systems continue to gain ground in the high‐performance computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine‐grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data (referred to as ‘tiles’). These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out‐of‐order execution of the tasks that will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithm for QR factorization where parallelism can be exploited only at the level of the BLAS operations and with vendor implementations. Copyright © 2008 John Wiley & Sons, Ltd.

8.

Approximating the inverse of a matrix for use in iterative algorithms on vector processors

P. F. Dubois A. Greenbaum G. H. Rodrigue 《Computing》1979,22(3):257-268

Most iterative techniques for solving the symmetric positive-definite system Ax=b involve approximating the matrix A by another symmetric positive-definite matrix M and then solving a system of the form Mz=d at each iteration. On a vector machine such as the CDC-STAR-100, the solution of this new system can be very time consuming. If, however, an approximation M^{-1} can be given to A^{-1}, the solution z=M^{-1}d can be computed rapidly by matrix multiplication, a fast operation on the STAR. Approximations using the Neumann expansion of the inverse of A give reasonable forms for M^{-1} and are presented. Computational results using the conjugate gradient method for the “5-point” matrix A are given.
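
A truncated Neumann expansion can be applied to a vector using only matrix-vector products, as the following sketch shows. It assumes the expansion is taken around the diagonal D of A, so that M^{-1} = Σ_{k=0}^{m} (I − D^{-1}A)^k D^{-1}; the splitting, the function names, and the dense list-of-rows storage are illustrative assumptions, not details taken from the paper.

```python
def matvec(a, x):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def neumann_apply(a, d, terms):
    """Return z = M^{-1} d for the Neumann approximation truncated after `terms` powers.

    Assumes a nonzero diagonal and a spectral radius of I - D^{-1}A below 1,
    which holds, for instance, for strictly diagonally dominant matrices.
    """
    diag = [a[i][i] for i in range(len(a))]
    t = [di / dii for di, dii in zip(d, diag)]  # k = 0 term: D^{-1} d
    z = list(t)
    for _ in range(terms):
        at = matvec(a, t)
        # t <- (I - D^{-1} A) t, computed without forming any matrix power.
        t = [ti - ati / dii for ti, ati, dii in zip(t, at, diag)]
        z = [zi + ti for zi, ti in zip(z, t)]
    return z
```

In a preconditioned conjugate gradient iteration, `neumann_apply` with a small `terms` would stand in for the solve with M. Each call costs `terms` matrix-vector products, which is the kind of operation the abstract describes as fast on a vector machine.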

9.

On the use of parallel processors for implicit Runge-Kutta methods

G. J. Cooper R. Vignesvaran 《Computing》1993,51(2):135-150

An iteration scheme, for solving the non-linear equations arising in the implementation of implicit Runge-Kutta methods, is proposed. This scheme is particularly suitable for parallel computation and can be applied to any method which has a coefficient matrix A with all eigenvalues real (and positive). For such methods, the efficiency of a modified Newton scheme may often be improved by the use of a similarity transformation of A but, even when this is the case, the proposed scheme can have advantages for parallel computation. Numerical results illustrate this. The new scheme converges in a finite number of iterations when applied to linear systems of differential equations, achieving this by using the nilpotency of a strictly lower triangular matrix S^{-1}AS − Λ, with Λ a diagonal matrix. The scheme reduces to the modified Newton scheme when S^{-1}AS is diagonal. A convergence result is obtained which is applicable to nonlinear stiff systems.

10.

Matching and Perturbation Theories for Affine-Invariant Shapes Using QR Factorization with Column Pivoting

Zhaozhong Wang Yuanyuan Yu 《Journal of Mathematical Imaging and Vision》2014,49(3):633-651

Affine-invariant shape matching aims to find correspondence of shapes invariant under affine transformations. This problem is helpful to many computer vision tasks such as stereo matching and object recognition. Since the solution of this problem is affected by perturbations such as noise, the analysis of the perturbation bounds is crucial to evaluate the performance of the algorithm from a mathematical point of view. In this paper we first introduce a shape matching algorithm using QR factorization with column pivoting (QRP), which gives a closed-form solution to the problem. In the QRP-based algorithm, the Householder transform plays a central role. So a sharp perturbation bound for the Householder transform is derived, then theoretical perturbation bounds for the matching algorithm are proven. Based upon the perturbation analysis a point clustering scheme (PCS) is proposed to improve the algorithm's robustness. Experimental results on synthetic and real-world images, as well as application results for logo retrieval, are provided to demonstrate the proposed method and compare it with existing algorithms.

11.

Algebraic aspects of some Riordan arrays related to binary words avoiding a pattern

D. Merlini R. Sprugnoli 《Theoretical computer science》2011,412(27):2988-3001

We consider some Riordan arrays related to binary words avoiding a pattern p, which can be easily studied by means of an A-matrix rather than their A-sequence. Both concepts allow us to define every element as a linear combination of other elements in the array; the A-sequence is unique and corresponds to a linear dependence from the previous row. The A-matrix is not unique and corresponds to a linear dependence from several previous rows. However, for the problems considered in the present paper, we show that the A-matrix approach is more convenient. We provide explicit algebraic generating functions for these Riordan arrays and obtain many statistics on the corresponding languages. We thus obtain deeper insight into the languages L^[p] of binary words avoiding p that have a number of 0s less than or equal to the number of 1s.

12.

A new and fast implementation for null space based linear discriminant analysis

Delin Chu Goh Siong Thye 《Pattern recognition》2010,43(4):1373-1379

In this paper we present a new implementation for the null space based linear discriminant analysis. The main features of our implementation include: (i) the optimal transformation matrix is obtained easily by only orthogonal transformations without computing any eigendecomposition or singular value decomposition (SVD); consequently, our new implementation is eigendecomposition-free and SVD-free; (ii) its main computational complexity is from an economic QR factorization of the data matrix and an economic QR factorization of an n×n matrix with column pivoting, where n is the sample size; thus our new implementation is a fast one. The effectiveness of our new implementation is demonstrated by some real-world data sets.

13.

A posteriori error estimates for the pseudoinverse and the minimum-length solution [A posteriori-Fehlerabschätzungen für die Pseudoinverse und die Lösung minimaler Länge]

Dr. W. Sautter 《Computing》1975,14(1-2):37-44

The pseudoinverse A^I of a matrix A is characterized through two equations that are linear in A^I and the condition rank(A^I) ≤ rank(A). A posteriori error bounds are developed for the derivation of an approximation X of A^I and the errors of the residues AA^I − AX and A^I A − XA. The results are extended to the best least squares solution. A numerical example illustrates the technique.

14.

On the RAS-algorithm

A. Bachem B. Korte 《Computing》1979,23(2):189-198

Given a nonnegative real (m, n) matrix A and positive vectors u, v, the biproportional constrained matrix problem is to find a nonnegative (m, n) matrix B such that B = diag(x) A diag(y) holds for some vectors x ∈ ℝ^m and y ∈ ℝ^n and the row (column) sums of B equal u_i (v_j), i=1,...,m (j=1,...,n). A solution procedure (called the RAS-method) was proposed by Bacharach [1] to solve this problem. The main disadvantage of this algorithm is that round-off errors slow down the convergence. Here we present a modified RAS-method which together with several other improvements overcomes this disadvantage.
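
The basic RAS iteration alternates row and column scalings until both sets of sums match. The sketch below shows this unmodified procedure only; it does not include the modifications the paper proposes. The function name, the fixed number of sweeps, and the assumptions of strictly positive entries and of u and v having equal totals are choices made for this example.

```python
def ras(a, u, v, sweeps=200):
    """Scale matrix `a` so its row sums approach u and its column sums approach v.

    Assumes all entries of `a` are positive and sum(u) == sum(v). Each sweep
    rescales every row to its target sum and then every column to its target
    sum, so the result has the form diag(x) * a * diag(y).
    """
    b = [list(row) for row in a]
    for _ in range(sweeps):
        for i, row in enumerate(b):
            s = sum(row)
            b[i] = [x * u[i] / s for x in row]
        for j in range(len(v)):
            s = sum(row[j] for row in b)
            for row in b:
                row[j] *= v[j] / s
    return b
```

Because the last step of every sweep is a column scaling, the column sums are met almost exactly while the row sums are met only up to the remaining iteration error, which makes the row-sum residual a natural convergence measure.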

15.

Polynomial approximation of functions of matrices and applications

Hillel Tal-Ezer 《Journal of scientific computing》1989,4(1):25-60

In solving a mathematical problem numerically, we frequently need to operate on a vector by an operator that can be expressed as f(A), where A is an N × N matrix [e.g., exp(A), sin(A), A^{-1}]. Except for very simple matrices, it is impractical to construct the matrix f(A) explicitly. Usually an approximation to it is used. This paper develops an algorithm based upon a polynomial approximation to f(A). First the problem is reduced to a problem of approximating f(z) by a polynomial in z, where z belongs to a domain D in the complex plane that includes all the eigenvalues of A. This approximation problem is treated by interpolating f(z) in a certain set of points that is known to have some maximal properties. The approximation thus achieved is almost best. The application of the algorithm to some practical problems is described. Since a solution to a linear system Ax=b is x=A^{-1}b, an iterative solution algorithm can be based upon a polynomial approximation to f(A)=A^{-1}. We give special attention to this important problem.

16.

Implementing QR factorization updating algorithms on GPUs

Robert Andrew Nicholas Dingle 《Parallel Computing》2014

Linear least squares problems are commonly solved by QR factorization. When multiple solutions need to be computed with only minor changes in the underlying data, knowledge of the difference between the old data set and the new can be used to update an existing factorization at reduced computational cost. We investigate the viability of implementing QR updating algorithms on GPUs and demonstrate that GPU-based updating for removing columns achieves speed-ups of up to 13.5× compared with full GPU QR factorization. We characterize the conditions under which other types of updates also achieve speed-ups.

17.

On the computational complexity of the solution of linear systems with moduli

Anatoly V. Lakeyev 《Reliable Computing》1996,2(2):125-131

The problem of solvability for systems of equations of the form Ax=D|x|+δ is investigated. This problem is proved to be NP-complete even in the case when the number of equations is equal to the number of variables, the matrix A is nonsingular, A≥D≥0, δ≥0, and it is initially known that the system has a finite (possibly zero) number of solutions. For an arbitrary system of m equations in n variables, under the additional conditions that the matrix D is nonnegative and its rank is one, a polynomial-time algorithm (of the order O((max{m, n})³)) has been found which determines whether the system is solvable and, if it is, finds one of its solutions.

18.

Efficient Parallel Algorithms for Some Graph Theory Problems

Ma Jun Ma Shaohan 《计算机科学技术学报》1993,8(4):76-80

In this paper, a sequential algorithm computing the all vertex pair distance matrix D and the path matrix P is given. On a PRAM EREW model with p, 1≤p≤n^2, processors, a parallel version of the sequential algorithm is shown. This method can also be used to get a parallel algorithm to compute the transitive closure array A^* of an undirected graph. The time complexity of the parallel algorithm is O(n^3/p). If D, P and A^* are known, it is shown that the problems of finding all connected components, computing the diameter of an undirected graph, determining the center of a directed graph, and searching for a directed cycle with the minimum (maximum) length in a directed graph can all be solved in O(n^2/p logp) time.
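
The abstract does not say which sequential algorithm is used to obtain D and P. As a stand-in, the sketch below uses the standard Floyd-Warshall recurrence, which also runs in O(n^3) time and produces both a distance matrix and a predecessor-style path matrix. The function name and the use of `None` for "no path" are assumptions made for this example.

```python
INF = float("inf")


def all_pairs(weights):
    """Compute all-pairs shortest distances and a path (predecessor) matrix.

    `weights[i][j]` is the length of edge i -> j, INF if there is no edge,
    and 0 on the diagonal. Returns (d, p), where d[i][j] is the shortest
    distance and p[i][j] is the vertex preceding j on a shortest i -> j
    path (None if i == j or no path exists).
    """
    n = len(weights)
    d = [list(row) for row in weights]
    p = [[i if i != j and weights[i][j] < INF else None for j in range(n)]
         for i in range(n)]
    for k in range(n):          # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
                    p[i][j] = p[k][j]
    return d, p
```

For a fixed `k`, the updates over `i` and `j` are independent of one another, which is what makes a version with p ≤ n^2 processors and O(n^3/p) time plausible for this kind of recurrence.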

19.

Efficient unsteady high Reynolds number flow computations on unstructured grids

Peter Lucas Hester Bijl Alexander H. van Zuijlen 《Computers & Fluids》2010,39(2):271-9215

Despite the advances in computer power and numerical algorithms over the last decades, solutions to unsteady flow problems remain computing time intensive. Especially for high Reynolds number flows, nonlinear multigrid, which is commonly used to solve the nonlinear systems of equations, converges slowly. The stiffness induced by the high-aspect ratio cells and turbulence is not tackled well by this solution method. In this paper, it is investigated whether a Jacobian-free Newton-Krylov (JFNK) solution method can speed up unsteady flow computations at high Reynolds numbers. Preconditioning of the linear systems that arise after Newton linearization is commonly performed with matrix-free preconditioners or approximate factorizations based on crude approximations of the Jacobian. Approximate factorizations based on a Jacobian that matches the target residual operator are unpopular because these preconditioners consume a large amount of memory and can suffer from robustness issues. However, these preconditioners remain appealing because they closely resemble A^-1. In this paper, it is shown that a JFNK solution method with an approximate factorization preconditioner based on a Jacobian that approximately matches the target residual operator enables a speed-up of a factor 2.5-12 over nonlinear multigrid for two-dimensional high Reynolds number flows. The solution method performs equally well as nonlinear multigrid for three-dimensional laminar problems. A modest memory consumption is achieved by partly lumping the Jacobian before constructing the approximate factorization preconditioner, whereas robustness is ensured with enhanced diagonal dominance.

20.

A scalable approach to solving dense linear algebra problems on hybrid CPU‐GPU systems

Fengguang Song Jack Dongarra 《Concurrency and Computation》2015,27(14):3702-3723

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd.