期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast and highly scalable parallel computations for fundamental matrix problems on distributed memory systems

Keqin Li 《The Journal of supercomputing》2010,54(3):271-297

We present fast and highly scalable parallel computations for a number of important and fundamental matrix problems on distributed memory systems (DMS). These problems include matrix multiplication, matrix chain product, and computing the powers, the inverse, the characteristic polynomial, the determinant, the rank, the Krylov matrix, and an LU- and a QR-factorization of a matrix, and solving linear systems of equations. Our highly scalable parallel computations for these problems are based on a highly scalable implementation of the fastest sequential matrix multiplication algorithm on DMS. We show that compared with the best known parallel time complexities on parallel random access machines (PRAM), the most powerful but unrealistic shared memory model of parallel computing, our parallel matrix computations achieve the same speeds on distributed memory parallel computers (DMPC), and have an extra polylog factor in the time complexities on DMS with hypercubic networks. Furthermore, our parallel matrix computations are fully scalable on DMPC and highly scalable over a wide range of system size on DMS with hypercubic networks. Such fast (in terms of parallel time complexity) and highly scalable (in terms of our definition of scalability) parallel matrix computations were rarely seen before on any distributed memory systems. 相似文献

2.

GPU-LMDDA: a bit-vector GPU-based deadlock detection algorithm for multi-unit resource systems

Stephen Abell Nhan Do 《International Journal of Parallel, Emergent and Distributed Systems》2016,31(6):562-590

相似文献

3.

Efficient parallel algorithms for graph problems

Clyde P. Kruskal Larry Rudolph Marc Snir 《Algorithmica》1990,5(1-4):43-64

We present an efficient technique for parallel manipulation of data structures that avoids memory access conflicts. That is, this technique works on the Exclusive Read/Exclusive Write (EREW) model of computation, which is the weakest shared memory, MIMD machine model. It is used in a new parallel radix sort algorithm that is optimal for keys whose values are over a small range. Using the radix sort and known results for parallel prefix on linked lists, we develop parallel algorithms that efficiently solve various computations on trees and “unicycular graphs.” Finally, we develop parallel algorithms for connected components, spanning trees, minimum spanning trees, and other graph problems. All of the graph algorithms achieve linear speedup for all but the sparsest graphs. 相似文献

4.

Tuning the Schur complement computations for finite element partitions

G. P. Nikishkov H. Kanda A. Makinouchi 《Advances in Engineering Software》2000,31(12):913-920

The domain decomposition method (DDM) is an efficient algorithmic tool for the parallelization of finite element computer codes. A variant of the DDM with direct solution algorithm is based on computation of Schur complement matrices for finite element partitions. This paper describes a simple technique that considerably improves execution rate of computationally intensive routines of the Schur complement computations. The technique uses ‘block of columns’ matrix operations and loop unrolling to reduce load instructions from cache memory and to increase instruction-level parallelism. For superscalar RISC processors, experimental results show that it is possible to improve performance of the DDM solution procedure by several times. 相似文献

5.

Solving Linear Recurrences with Loop Raking

《Journal of Parallel and Distributed Computing》1995,25(1):91-97

We present a variation of the partition method for solving linear recurrences that is well-suited to vector multiprocessors. The algorithm fully utilizes both vector and multiprocessor capabilities, and reduces the number of memory accesses and temporary memory requirements as compared to the more commonly used version of the partition method. Our variation uses a general loop restructuring technique called loop raking. We describe an implementation of this technique on the CRAY Y-MP C90, and present performance results for first- and second-order linear recurrences. On a single processor of the C90 our implementations are up to 7.3 times faster than the corresponding optimized library routines in SCILIB, an optimized mathematical library supplied by Cray Research. On four processors, we gain an additional speedup of at least 3.7. 相似文献

6.

On the computation of quotients and factors of regular languages

Mircea Marin Temur Kutsia 《Frontiers of Computer Science in China》2010,4(2):173-184

Quotients and factors are important notions in the design of various computational procedures for regular languages and for the analysis of their logical properties. We propose a new representation of regular languages, by linear systems of language equations, which is suitable for the following computations: language reversal, left quotients and factors, right quotients and factors, and factor matrices. We present algorithms for the computation of all these notions, and indicate an application of the factor matrix to the computation of solutions of a particular language reconstruction problem. The advantage of these algorithms is that they all operate only on linear systems of language equations, while the design of the same algorithms for other representations often require translation to other representations. 相似文献

7.

Computing programs containing band linear recurrences on vectorsupercomputers

Haigeng Wang Nicolau A. Keung S. Kai-Yeung Siu 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(8):769-782

Many large-scale scientific and engineering computations, e.g., some of the Grand Challenge problems, spend a major portion of execution time in their core loops computing band linear recurrences (BLRs). Conventional compiler parallelization techniques cannot generate scalable parallel code for this type of computation because they respect loop-carried dependences (LCDs) in programs, and there is a limited amount of parallelism in a BLR with respect to LCDs. For many applications, using library routines to replace the core BLR requires the separation of BLR from its dependent computation, which usually incurs significant overhead. In this paper, we present a new scalable algorithm called the Regular Schedule, for parallel evaluation of BLRs. We describe our implementation of the Regular Schedule and discuss how to obtain maximum memory throughput in implementing the schedule on vector supercomputers. We also illustrate our approach, based on our Regular Schedule, to parallelizing programs containing BLR and other kinds of code. Significant improvements in CPU performance for a range of programs containing BLR implemented using the Regular Schedule in C over the same programs implemented using highly optimized coded-in-assembly BLAS routines [11] are demonstrated on Convex C240. Our approach can be used both at the user level in parallel programming code containing BLRs, and in compiler parallelization of such programs combined with recurrence recognition techniques for vector supercomputers 相似文献

8.

Sparse Matrix Computations on the Hypercube and Related Networks

《Journal of Parallel and Distributed Computing》1994,21(2):169-183

In this paper we present some parallel algorithms for matrix addition, matrix multiplication, Gaussian elimination, and other related computations on sparse matrices. Our algorithms are designed for the hypercube and related networks, but they can be easily implemented on any other local memory machine. We prove that, under certain assumptions, on a hypercube or related network with p processors our algorithms achieve a speedup proportional to p/log p. 相似文献

9.

Array Organization in Parallel Memories

Mayez Al-Mouhamed 《International journal of parallel programming》2004,32(2):123-163

The bandwidth mismatch between processor and main memory is one major throughput limiting problem. Although streamed computations have predictable access patterns their references have little temporal locality and are generally too long to cache. A memory and compiler co-optimization aimed at reducing low-level memory accesses using software and hardware locality optimizations is presented. We propose a scalable and predictable parallel memory based on a compiler synthesis of storage schemes for multi-dimensional arrays that are accessed by an arbitrary but known set of data access patterns. Using algebra of non-singular Boolean matrices, we present analysis of conflict-free access to (1) parallel memories, and (2) alignment networks. Finding a multi-pattern storage scheme is one NP-complete problem. An effective compiler heuristic is proposed for finding a storage matrix that minimizes overall memory access time. This applies to arbitrary linear patterns and arbitrary alignment networks. It is shown that the proposed storage scheme finds an optimal storage scheme for parallel (1) FFT, and (2) bitonic sorting. The proposed storage scheme outperforms statically optimized storages in the case of power-of-2 multi-stride access. The case of non power-of-2 strides is also addressed. The performance and scalability of the proposed parallel memory and its predictable access time are presented using numerical and multimedia algorithms. It is shown that a memory utilization above 83% is achieved by our storage scheme for 64 memories, which largely outperforms previous proposals. Our approach provides a tool for matching the storage pattern with the data access patterns needed for embedded systems running streamed computations with predictable data access patterns. 相似文献

10.

PB-BLAS: a set of parallel block basic linear algebra subprograms

Jaeyoung Choi Jack J. Dongarra David W. Walker 《Concurrency and Computation》1996,8(7):517-535

We propose a new software package which would be very useful for implementing dense linear algebra algorithms on block-partitioned matrices. The routines are referred to as block basic linear algebra subprograms (BLAS), and their use is restricted to computations in which one or more of the matrices involved consists of a single row or column of blocks, and in which no more than one of the matrices consists of an unrestricted two-dimensional array of blocks. The functionality of the block BLAS routines can also be provided by Level 2 and 3 BLAS routines. However, for non-uniform memory access machines the use of the block BLAS permits certain optimizations in memory access to be taken advantage of. This is particularly true for distributed memory machines, for which the block BLAS are referred to as the parallel block basic linear algebra subprograms (PB-BLAS). The PB-BLAS are the main focus of this paper, and for a block-cyclic data distribution, in a single row or column of blocks lies in a single row or column of the processor template. The PB-BLAS consist of calls to the sequential BLAS for local computations, and calls to the BLACS for communication. The PB-BLAS are the building blocks for implementing ScaLAPACK, the distributed-memory version of LAPACK, and provide the same ease-of-use and portability for ScaLAPACK that the BLAS provide for LAPACK. The PB-BLAS consist of all Level 2 and 3 BLAS routines for dense matrix computations (not for banded matrix) and four auxiliary routines for transposing and copying of a vector and/or a block vector. The PB-BLAS are currently available for all numeric data types, i.e., single and double precision, real and complex. 相似文献

11.

Multicoloring of grid-structured PDE solvers on shared-memorymultiprocessors

Hwang-Cheng Wang Kai Hwang 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(11):1195-1205

In order to execute a parallel PDE (partial differential equation) solver on a shared-memory multiprocessor, we have to avoid memory conflicts in accessing multidimensional data grids. A new multicoloring technique is proposed for speeding sparse matrix operations. The new technique enables parallel access of grid-structured data elements in the shared memory without causing conflicts. The coloring scheme is formulated as an algebraic mapping which can be easily implemented with low overhead on commercial multiprocessors. The proposed multicoloring scheme bas been tested on an Alliant FX/80 multiprocessor for solving 2D and 3D problems using the CGNR method. Compared to the results reported by Saad (1989) on an identical Alliant system, our results show a factor of 30 times higher performance in Mflops. Multicoloring transforms sparse matrices into ones with a diagonal diagonal block (DDB) structure, enabling parallel LU decomposition in solving PDE problems. The multicoloring technique can also be extended to solve other scientific problems characterized by sparse matrices 相似文献

12.

Inverse system design for LPV systems using parameter-dependent Lyapunov functions

Masayuki Sato^{Author Vitae} 《Automatica》2008,44(4):1072-1077

This paper addresses the design problem of gain-scheduled inverse systems (GSISs) for linear parameter-varying (LPV) systems, whose state-space matrices are represented as parametrically affine matrices, using parameter-dependent Lyapunov functions (PDLFs), and proposes a method for them via parametrically affine linear matrix inequalities (LMIs). Our method includes robust inverse system (RIS) design as a special case. For RIS design, our method theoretically encompasses the method using constant Lyapunov functions. A design example is included to illustrate our conclusions. 相似文献

13.

Parallel block tridiagonalization of real symmetric matrices

Yihua Bai Robert C. Ward 《Journal of Parallel and Distributed Computing》2008

Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for the block tridiagonal divide-and-conquer algorithm for computing eigensystems and is useful for many algorithms desiring the efficiencies of block structure in matrices. For an “effectively” sparse matrix, which frequently results from applications with strong locality properties, a heuristic parallel algorithm is used to transform it into a block tridiagonal matrix such that the eigenvalue errors remain bounded by some prescribed accuracy tolerance. For a dense matrix without any usable structure, orthogonal transformations are used to reduce it to block tridiagonal form using mostly level 3 BLAS operations. Numerical experiments show that block tridiagonal structure obtained from this algorithm directly affects the computational complexity of the parallel block tridiagonal divide-and-conquer eigensolver. Reduction to block tridiagonal form provides significantly lower execution times, as well as memory traffic and communication cost, over the traditional reduction to tridiagonal form for eigensystem computations. 相似文献

14.

Stochastic approximation and realization of Hankel matrices

TOMMASO ISOLA ANTONIO TORNAMBÈ 《International journal of systems science》2013,44(7):1229-1240

Dynamic systems are considered whose outputs can be represented either by a deterministic series of the input variables or by a stochastic series of brownian motion processes. Approximate representations of such series are proposed. The representations are based on the Hankel matrix representation of the series. First, it is shown that any Hankel matrix, belonging to suitable classes, can be approximated in the deterministic and stochastic sense by finite rank Hankel matrices. Then, a method is proposed to design bilinear realizations of finite rank Hankel matrices, effective in the deterministic and stochastic sense. 相似文献

15.

On solving Diophantine equations by real matrix manipulation

Yamada M. Piao Chung Zun Funahashi Y. 《Automatic Control, IEEE Transactions on》1995,40(1):118-122

This note presents simple algorithms for obtaining the solutions of the Diophantine equation. Our methods can produce classes of all solutions with lower degree than a specified number. The previous algorithms involve some troublesome computations, e.g., the calculation of both the controllability indexes and the observability indexes or the solution of a pole assignment problem, etc. Our contribution is that our algorithm requires only basic matrix operations such as addition, subtraction, multiplication, and inversion of given real matrices. In addition, by solving simple linear equations, the class of all minimum degree solutions can be given. Therefore the computational efforts are reduced compared with previous algorithms 相似文献

16.

GPU-OSDDA: a bit-vector GPU-based deadlock detection algorithm for single-unit resource systems

Stephen Abell Nhan Do 《International Journal of Parallel, Emergent and Distributed Systems》2016,31(5):450-468

This article presents a GPU-based single-unit deadlock detection methodology and its algorithm, GPU-OSDDA. Our GPU-based design utilizes parallel hardware of GPU to perform computations and thus is able to overcome the major limitation of prior hardware-based approaches by having the capability of handling thousands of processes and resources, whilst achieving real-world run-times. By utilizing a bit-vector technique for storing algorithm matrices and designing novel, efficient algorithmic methods, we not only reduce memory usage dramatically but also achieve two orders of magnitude speedup over CPU equivalents. Additionally, GPU-OSDDA acts as an interactive service to the CPU, because all of the aforementioned computations and matrix management techniques take place on the GPU, requiring minimal interaction with the CPU. GPU-OSDDA is implemented on three GPU cards: Tesla C2050, Tesla K20c, and Titan X. Our design shows overall speedups of 6-595X over CPU equivalents. 相似文献

17.

Constant Time Boolean Matrix Multiplication on a Linear Array with a Reconfigurable Pipelined Bus System

Li Keqin 《The Journal of supercomputing》1997,11(4):391-403

We show that the product of two N × N boolean matrices can be calculated in constant time on an LARPBS with O(N3 / log N) processors. All data communications and computations are performed on the bit level. To the best of the author's knowledge, this is the first parallel boolean matrix multiplication algorithm that has constant execution time, and is executed on a distributed memory system with (N3) processors. By using our boolean matrix multiplication algorithm, it is shown that the transitive closure of a directed graph can be obtained in O(log N) time ( measured by bit level operations) on an LARPBS with O (N3 / log N) processors. To the best of our knowledge, this is the first parallel algorithm for tansitive closure of directed graphs with time complexity O(log N) (comparable to that of CRCW PRAM) and cost O (N3) on a realistic parallel computing model, which has no shared memory, and interprocessor communications are dealt with explicitly and efficiently. 相似文献

18.

Distributed optimal component fusion weighted by scalars for fixed-lag Kalman smoother 总被引：2，自引：0，他引：2

Shu-Li Sun 《Automatica》2005,41(12):2153-2159

Based on the optimal fusion criterion in the linear minimum variance sense, a distributed optimal fusion fixed-lag Kalman smoother with a three-layer fusion structure is given for the discrete time-varying linear stochastic control systems with multiple sensors and correlated noises. Its components are estimated by scalar weighting fusion, respectively. It only requires in parallel a series of computations of the weighted scalars, and avoids the computations of the weighted matrices, so that the computational burden can obviously be reduced. Further, the steady-state fusion smoother is also given for the discrete time-invariant linear stochastic control systems. The scalar weights can be obtained by fusing once after all local estimations reach steady state. It can reduce the online computational burden. Also, the computation formulas of smoothing error cross-covariance matrices are given. Two simulation examples show the performance. 相似文献

19.

Learning low-rank Mercer kernels with fast-decaying spectrum

Binbin PanAuthor Vitae Pong C. Yuen^{Author Vitae} 《Neurocomputing》2011,74(17):3028-3035

Low-rank representations have received a lot of interest in the application of kernel-based methods. However, these methods made an assumption that the spectrum of the Gaussian or polynomial kernels decays rapidly. This is not always true and its violation may result in performance degradation. In this paper, we propose an effective technique for learning low-rank Mercer kernels (LMK) with fast-decaying spectrum. What distinguishes our kernels from other classical kernels (Gaussian and polynomial kernels) is that the proposed always yields low-rank Gram matrices whose spectrum decays rapidly, no matter what distribution the data are. Furthermore, the LMK can control the decay rate. Thus, our kernels can prevent performance degradation while using the low-rank approximations. Our algorithm has favorable in scalability—it is linear in the number of data points and quadratic in the rank of the Gram matrix. Empirical results demonstrate that the proposed method learns fast-decaying spectrum and significantly improves the performance. 相似文献

20.

Partial eigenstructure assignment and its application to largescale systems

Lu J. Chiang H.-D. Thorp J.S. 《Automatic Control, IEEE Transactions on》1991,36(3):340-347

The notion of partial eigenstructure assignment (PEA) via linear state feedback control in linear multivariable systems is introduced. This notion is a natural extension of eigenstructure assignment and partial eigenvalue assignment. Some theoretical basis for PEA is provided, and a parametric expression for feedback gain matrices achieving PEA is derived. An effective numerical algorithm for PEA tailored to large-scale systems is presented. As an extension of the algorithm, a recursive algorithm for eigenstructure assignment is presented. These algorithms possess the following desired properties: (1) compared to existing methods, the presented algorithms significantly reduce the required computation time via transforming high-dimensional matrix computations into low-dimensional matrix computations; (2) they can be implemented in a parallel fashion. The proposed algorithm for PEA is applied to modal control of large flexible space structure systems 相似文献