
1.

Parallel solution of dense linear systems using diagonalization methods

《International Journal of Computer Mathematics》2012,89(3-4):249-270

We study the parallel implementation of two diagonalization methods for solving dense linear systems: the well-known Gauss-Jordan method and a new one introduced by Huard. The number of arithmetic operations performed by the Huard method is the same as for Gaussian elimination, namely 2n³/3, which is less than for the Jordan method, namely n³. We introduce parallel versions of these methods, compare their performance, and study their complexity. We assume a shared-memory computer with a number of processors p of the order of n, the size of the problem to be solved. We show that the best parallel version of Jordan's method is by rows, whereas the best one for Huard's method is by columns. Our main result states that for a small number of processors the parallel Huard method is faster than the parallel Jordan method, and slower otherwise. The separation is obtained for p = 0.44n.
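For readers unfamiliar with the Jordan method named in this abstract, the following is a minimal sequential sketch of textbook Gauss-Jordan elimination. It is not the parallel algorithm of the paper; the function name `gauss_jordan_solve` and the use of exact rational arithmetic are assumptions made for illustration. The loop over rows `i` updates each row independently of the others, which is the kind of work a row-wise parallel version could distribute across processors.

```python
from fractions import Fraction


def gauss_jordan_solve(a, b):
    """Solve a x = b by textbook Gauss-Jordan elimination (illustrative sketch)."""
    n = len(a)
    # Augmented matrix [a | b], held in exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to row k.
        pivot = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[pivot][k] == 0:
            raise ValueError("matrix is singular")
        m[k], m[pivot] = m[pivot], m[k]
        # Scale the pivot row so the diagonal entry becomes 1.
        pk = m[k][k]
        m[k] = [v / pk for v in m[k]]
        # Eliminate column k from every other row, above and below the pivot.
        # Each row update is independent of the others.
        for i in range(n):
            if i != k and m[i][k] != 0:
                f = m[i][k]
                m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    # The matrix part is now the identity; the last column holds the solution.
    return [row[n] for row in m]
```

Because elimination is applied above and below the pivot, no back-substitution is needed, which is what distinguishes this diagonalization method from Gaussian elimination.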

2.

Fast linear systolic matrix vector multiplication

《International Journal of Computer Mathematics》2012,89(3-4):231-248

The systolic concept in parallel architecture design proposed by H. T. Kung [1,2] obtains high throughput and speedups. The linear array for matrix-vector multiplication executes the algorithm in 2n − 1 time steps using 2n − 1 processors. Although the speedup obtained is very high, the efficiency is very poor (typically about 25% for problem sizes greater than 10). H. T. Kung proposed an idea for a linear systolic array using two data streams flowing in opposite directions. However, the processors in the array perform operations only in every second time step.

Attempts to improve this design have been made by many researchers. Nonlinear and folding transformation techniques [3,4,5] only halve the number of processors used, but do not affect the time.

We propose the use of a fast linear systolic computation procedure to obtain a solution that uses 3n/2 processors and executes the algorithm in 3n/2 time steps with the same cells, the same communication and the same regular data flow as the H. T. Kung linear array. Only the algorithm is restructured and more efficiently organized. The processors are now utilized in every time step and no idle steps are required.
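The efficiency figures quoted in this abstract can be checked with a short calculation. The sketch below assumes that the sequential algorithm needs T_1 = n² multiply-add steps for an n × n matrix, which the abstract does not state explicitly, and uses the usual definition of efficiency as speedup divided by processor count.

```latex
% Assumption: sequential matrix-vector multiplication takes T_1 = n^2 steps.
\[
E = \frac{T_1}{p\,T_p}, \qquad
E_{\text{Kung}} = \frac{n^2}{(2n-1)^2} \xrightarrow{\;n\to\infty\;} \frac{1}{4}, \qquad
E_{\text{fast}} = \frac{n^2}{(3n/2)^2} = \frac{4}{9} \approx 44\%.
\]
```

Under this assumption the original array tends to the 25% efficiency mentioned above, while the restructured procedure would reach roughly 44%.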

3.

Performance of parallel Cholesky factorization algorithms using BLAS

Glenn R. Luecke Jae Heon Yun Philip W. Smith 《The Journal of supercomputing》1992,6(3-4):315-329

This paper considers four parallel Cholesky factorization algorithms, including SPOTRF from the February 1992 release of LAPACK, each of which calls parallel Level 2 or 3 BLAS, or both. A fifth parallel Cholesky algorithm that calls serial Level 3 BLAS is also described. The efficiency of these five algorithms on the CRAY-2, CRAY Y-MP/832, Hitachi Data Systems EX 80, and IBM 3090-600J is evaluated and compared with a vendor-optimized parallel Cholesky factorization algorithm. The fifth parallel Cholesky algorithm that calls serial Level 3 BLAS provided the best performance of all algorithms that called BLAS routines. In fact, this algorithm outperformed the Cray-optimized libsci routine (SPOTRF) by 13–44%, depending on the problem size and the number of processors used. This work was supported by grants from IMSL, Inc., and Hitachi Data Systems. The first version of this paper was presented as a poster session at Supercomputing '90, New York City, November 1990.

4.

Parallel Option Price Valuations with the Explicit Finite Difference Method

Alexandros V. Gerbessiotis 《International journal of parallel programming》2010,38(2):159-182

We show how computations such as those involved in American or European-style option price valuations with the explicit finite difference method can be performed in parallel. To this end, we introduce a latency-tolerant parallel algorithm for performing such computations efficiently that achieves optimal theoretical speedup p, where p is the number of processors of the parallel system. An implementation of the parallel algorithm has been undertaken, and an evaluation of its performance is carried out by performing an experimental study on a high-latency PC cluster and, at a smaller scale, on a multi-core processor, using in addition the SWARM parallel computing framework for multi-core processors. Our implementation of the parallel algorithm is not only architecture independent but also communication library independent: the same code works under LAM-MPI and Open MPI and also BSPlib, two sets of library frameworks that facilitate parallel programming. The suitability of our approach to multi-core processors is also established.

5.

Improving Parallel Ordering of Sparse Matrices Using Genetic Algorithms

Wen-Yang Lin 《Applied Intelligence》2005,23(3):257-265

In the direct solution of sparse symmetric and positive definite linear systems, finding an ordering of the matrix to minimize the height of the elimination tree (an indication of the number of parallel elimination steps) is crucial for effectively computing the Cholesky factor in parallel. This problem is known to be NP-hard. Though many effective heuristics have been proposed, the questions of how close to optimal these heuristics are and how to further reduce the height of the elimination tree remain unanswered. This paper is an effort toward this investigation. We introduce a genetic algorithm tailored to this parallel ordering problem, which is characterized by two novel genetic operators, adaptive merge crossover and tree rotate mutation. Experiments showed that our approach is cost effective in the number of generations evolved to reach a better solution in reducing the height of the elimination tree.

6.

Massively parallel processing in coastal ocean circulation model

José Gómez-Valdés Dong-Ping Wang 《Journal of scientific computing》1995,10(3):305-323

Massively parallel computers are becoming quite common in computational fluid dynamics. In this study, a parallel algorithm for a 3-D primitive-equation coastal ocean circulation model is designed for the hypercube MIMD computer architecture. The grid is partitioned using one-dimensional domain decomposition. The code is tested on a uniform rectangular grid problem for which the model domain in each node is a cube. For the problem where the grain size (n_y) is fixed, the speedup is linear and is close to ideal for P 8 processors. The overhead (F_C) increases as the number of processors increases. The background overhead is inversely proportional to the size of the grain. The slope of F_C vs. P is a measure of the fraction of non-parallel code. For the problem where the domain is fixed, the speedup is 7.8 using 8 processors and 29.6 using 32 processors. The overhead increases linearly with P. The slope of F_C is a measure of the communication cost. The load balancing problem is examined for a model of the Gulf of California whose computational domain is irregular.
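The fixed-domain speedups reported in this abstract translate directly into parallel efficiencies. The short calculation below uses the standard definition E(P) = S(P)/P, where S is the speedup on P processors; the definition is the usual one and is an assumption here, since the abstract reports only speedups.

```latex
% Efficiency from the reported fixed-domain speedups, E(P) = S(P)/P.
\[
E(8) = \frac{7.8}{8} = 0.975, \qquad
E(32) = \frac{29.6}{32} = 0.925 .
\]
```

Efficiency thus drops by about five percentage points when going from 8 to 32 processors, consistent with the statement that overhead grows with P.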

7.

A locally optimized reordering algorithm and its application to a parallel sparse linear system solver

K. Gallivan P. C. Hansen Tz. Ostromsky Z. Zlatev 《Computing》1995,54(1):39-67

A coarse-grain parallel solver for systems of linear algebraic equations with general sparse matrices by Gaussian elimination is discussed. Before the factorization two other steps are performed. A reordering algorithm is used during the first step in order to obtain a permuted matrix with as many zero elements under the main diagonal as possible. During the second step the reordered matrix is partitioned into blocks for asynchronous parallel processing (normally the number of blocks is equal to the number of processors). It is possible to obtain blocks with nearly the same number of rows, because there is no requirement to produce square diagonal blocks. The first step is much more important than the second one and has a significant influence on the performance of the solver. A straightforward implementation of the reordering algorithm will result in O(n²) operations. By using binary trees this cost can be reduced to O(NZ log n), where NZ is the number of non-zero elements in the matrix and n is its order (normally NZ is much smaller than n²). Some experiments on parallel computers with shared memory have been performed. The results show that a solver based on the proposed reordering performs better than another solver based on a cheaper (but at the same time rather crude) reordering whose cost is only O(NZ) operations.

8.

An optimal parallel algorithm for triangulating a set of points in the plane

Merks Ed 《International journal of parallel programming》1986,15(5):399-411

This paper presents an optimal parallel algorithm for triangulating an arbitrary set of n points in the plane. The algorithm runs in O(log n) time using O(n) space and O(n) processors on a Concurrent-Read, Exclusive-Write Parallel RAM model (CREW PRAM). The parallel lower bound on triangulation is Ω(log n) time, so the best possible linear speedup has been achieved. A parallel divide-and-conquer technique of subdividing a problem into subproblems is employed.

9.

Parallel state-space search for a first solution with consistent linear speedups

L. V. Kalé Vikram A. Saletore 《International journal of parallel programming》1990,19(4):251-293

Consider the problem of exploring a large state-space for a goal state where, although many such states may exist in the state-space, finding any one state satisfying the requirements is sufficient. All the methods known until now for conducting such a search in parallel using multiprocessors fail to provide consistent linear speedups over sequential execution. The speedups vary from sublinear to superlinear and from one execution to another. Further, adding more processors may sometimes lead to a slow-down rather than a speedup, giving rise to the speedup anomalies reported in the literature. We present a prioritizing strategy which yields consistent speedups that are close to P with P processors, and that monotonically increase with the addition of processors. This is achieved by keeping the total number of nodes expanded during parallel search very close to that of a sequential search. In addition, the strategy requires substantially less memory than other methods. The performance of this strategy is demonstrated on a multiprocessor with several state-space search problems. This research has been supported in part by the National Science Foundation under Contract No. CCR-89-02496.

10.

Efficient Parallel Computation of the Characteristic Polynomial of a Sparse, Separable Matrix

J. H. Reif 《Algorithmica》2001,29(3):487-510

This paper is concerned with the problem of computing the characteristic polynomial of a matrix. In a large number of applications, the matrices are symmetric and sparse, with O(n) non-zero entries. The problem has an efficient sequential solution in this case, requiring O(n²) work by use of the sparse Lanczos method. A major remaining open question is to find a polylog time parallel algorithm with matching work bounds. Unfortunately, the sparse Lanczos method cannot be parallelized to faster than time Ω(n) using n processors. Let M(n) be the processor bound to multiply two n × n matrices in O(log n) parallel time. Giesbrecht [G2] gave the best previous polylog time parallel algorithms for the characteristic polynomial of a dense matrix with O(M(n)) processors. There is no known improvement to this processor bound in the case where the matrix is sparse. Often, in addition to being symmetric and sparse, the matrix has a sparsity graph (which has edges between indices of the matrix with non-zero entries) that has small separators. This paper gives a new algorithm for computing the characteristic polynomial of a sparse symmetric matrix, assuming that the sparsity graph is s(n)-separable and has a separator of size s(n) = O(n^γ), for some γ, 0 < γ < 1, that when deleted results in connected components of ≤ αn vertices, for some 0 < α < 1, with the same property. We derive an interesting algebraic version of Nested Dissection, which constructs a sparse factorization of the matrix A − λI_n, where A is the input matrix and I_n is the n × n identity matrix. While Nested Dissection is commonly used to minimize the fill-in in the solution of sparse linear systems, our innovation is to use the separator structure to bound also the work for manipulation of rational functions in the recursively factored matrices. The matrix elements are assumed to be over an arbitrary field.
We compute the characteristic polynomial of a sparse symmetric matrix in polylog time using P(n)(n + M(s(n))) ≤ P(n)(n + s(n)^2.376) processors, where P(n) is the processor bound to multiply two degree n polynomials in O(log n) parallel time using a PRAM (P(n) = O(n) if the field supports an FFT of size n but is otherwise O(n log log n) [CK]). Our method requires only that a matrix be symmetric and non-singular (it need not be positive definite as usual for Nested Dissection techniques). For the frequently occurring case where the matrix has small separator size, our polylog parallel algorithm has work bounds competitive with the best known sequential algorithms (i.e., the Ω(n²) work of sparse Lanczos methods); for example, when the sparsity graph is a planar graph, s(n) ≤ O(√n), and we require polylog time with only P(n)n^1.188 processors. Received September 26, 1997; revised June 5, 1999.

11.

Sparse Matrix Computations on the Hypercube and Related Networks

《Journal of Parallel and Distributed Computing》1994,21(2):169-183

In this paper we present some parallel algorithms for matrix addition, matrix multiplication, Gaussian elimination, and other related computations on sparse matrices. Our algorithms are designed for the hypercube and related networks, but they can be easily implemented on any other local memory machine. We prove that, under certain assumptions, on a hypercube or related network with p processors our algorithms achieve a speedup proportional to p/log p.

12.

Parallel bisecting k-means with prediction clustering algorithm

Yanjun Li Soon M. Chung 《The Journal of supercomputing》2007,39(1):19-37

In this paper, we propose a new parallel clustering algorithm, named Parallel Bisecting k-means with Prediction (PBKP), for message-passing multiprocessor systems. Bisecting k-means tends to produce clusters of similar sizes, and according to our experiments, it produces clusters with smaller entropy (i.e., purer clusters) than k-means does. Our PBKP algorithm fully exploits the data-parallelism of the bisecting k-means algorithm, and adopts a prediction step to balance the workloads of multiple processors to achieve a high speedup. We implemented PBKP on a cluster of Linux workstations and analyzed its performance. Our experimental results show that the speedup of PBKP is linear with the number of processors and the number of data points. Moreover, PBKP scales up better than the parallel k-means with respect to the dimension and the desired number of clusters. This research was supported in part by AFRL/Wright Brothers Institute (WBI).

13.

An optimal scheduling procedure for matrix inversion on linear array at a processor level

M. K. Stojčev E. I. Milovanović I. Ž. Milovanović 《International journal of parallel programming》1994,22(4):435-448

This paper presents a parallel algorithm for computing the inverse of a dense matrix based on a modified Jordan elimination which requires fewer calculation steps than the standard one. The algorithm is proposed for implementation on a linear array with a small to moderate number of processors which operate in a parallel-pipeline fashion. Communication between neighboring processors is achieved by a common memory module implemented as a FIFO memory module. For the proposed algorithm we define a task scheduling procedure and prove that it is time optimal. In order to compute the speedup and efficiency of the system, two definitions (Amdahl's and Gustafson's) were used. For the proposed architecture, involving two to 16 processors, the estimated Gustafson's (Amdahl's) speedups are in the range 1.99 to 13.76 (1.99 to 9.69).
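The two speedup definitions mentioned in this abstract can be sketched as follows. These are the textbook forms of Amdahl's and Gustafson's laws, not the paper's own estimation procedure; the function names and the `serial_fraction` parameter are assumptions for illustration. Note that the serial fraction is measured differently in the two laws (relative to the sequential run time for Amdahl, relative to the parallel run time for Gustafson), so the same program generally has a different value in each.

```python
def amdahl_speedup(serial_fraction, p):
    """Fixed-size speedup: serial_fraction is the share of sequential run time."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def gustafson_speedup(serial_fraction, p):
    """Scaled speedup: serial_fraction is the share of parallel run time."""
    return p - serial_fraction * (p - 1)


if __name__ == "__main__":
    # Illustrative values only; the fraction 0.05 is not taken from the paper.
    for p in (2, 4, 8, 16):
        print(p, round(amdahl_speedup(0.05, p), 2), round(gustafson_speedup(0.05, p), 2))
```

For any fixed serial fraction between 0 and 1 and p > 1, the Gustafson value exceeds the Amdahl value, which matches the ordering of the two ranges quoted above.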

14.

Pursuit and Evasion on a Ring: An Infinite Hierarchy for Parallel Real-Time Systems

S. D. Bruda S. G. Akl 《Theory of Computing Systems》2001,34(6):565-576

We show that, for any positive integer n, there exists at least one timed ω-language L_n which is accepted by a 2n-processor real-time algorithm using arbitrarily slow processors, but cannot be accepted by a (2n−1)-processor real-time algorithm. It follows therefore that real-time algorithms form an infinite hierarchy with respect to the number of processors used. Furthermore, such a result holds for any model of parallel computation. Received October 9, 2000 and in final form August 9, 2001. Online publication November 23, 2001.

15.

A systolic algorithm for Riccati and Lyapunov equations

J. -P. Charlier P. Van Dooren 《Mathematics of Control, Signals, and Systems (MCSS)》1989,2(2):109-136

Riccati and Lyapunov equations can be solved using the recursive matrix sign method applied to symmetric matrices constructed from the corresponding Hamiltonian matrices. In this paper we derive an efficient systolic implementation of that algorithm where the LDL^T and UDU^T decompositions of those symmetric matrices are propagated. As a result the solution of a class of Riccati and Lyapunov equations can be obtained in O(n) time steps on a bidimensional (triangular) grid of O(n²) processors, leading to an optimal speedup.

16.

Approximate algorithms for the knapsack problem on parallel computers

P. S. Gopalakrishnan I. V. Ramakrishnan L. N. Kanal 《Information and Computation》1991,91(2)

Computing an optimal solution to the knapsack problem is known to be NP-hard. Consequently, fast parallel algorithms for finding such a solution without using an exponential number of processors appear unlikely. An attractive alternative is to compute an approximate solution to this problem rapidly using a polynomial number of processors. In this paper, we present an efficient parallel algorithm for finding approximate solutions to the 0–1 knapsack problem. Our algorithm takes an ε, 0 < ε < 1, as a parameter and computes a solution whose deviation from the optimal solution is at most a fraction ε of the optimal solution. For a problem instance having n items, this computation uses O(n^(5/2)/ε^(3/2)) processors and requires O(log³ n + log² n log(1/ε)) time. The upper bound on the processor requirement of our algorithm is established by reducing it to a problem on weighted bipartite graphs. This processor complexity is a significant improvement over that of other known parallel algorithms for this problem.

17.

A method for updating Cholesky factorization of a band matrix

Wei H. Yang 《Computer Methods in Applied Mechanics and Engineering》1977,12(3):281-288

A method is presented for updating the Cholesky factorization of a band symmetric matrix modified by a rank-one matrix which has the same band width. Problems which could involve applications of such a method arise frequently in plasticity and structural optimization where repeated solutions of a band algebraic system with a changing matrix are needed. The Cholesky factorization of a stiffness matrix can be updated after modifying a local stiffness matrix which can be written as a sum of a few rank-one matrices. The number of operations required for the updating is of the order mn or less, where n is the dimension of the global matrix and m is its half band width (including the diagonal).
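To make the O(mn) operation count concrete, here is a minimal sketch of the standard rank-one Cholesky update restricted to a band. It is the textbook update for A + x xᵀ, not necessarily the method of the paper. The function name is hypothetical, and the sketch assumes that L is lower triangular with half band width m (including the diagonal) and that the non-zero entries of x lie within m consecutive positions, so that x xᵀ has the same band width.

```python
import math


def band_cholesky_rank_one_update(L, x, m):
    """Return the Cholesky factor of L L^T + x x^T (illustrative sketch).

    Assumes L[i][k] == 0 whenever i - k >= m, and that the non-zeros of x
    fit in a window of m consecutive indices.
    """
    n = len(L)
    L = [row[:] for row in L]  # work on copies, leave the inputs untouched
    x = list(x)
    for k in range(n):
        if x[k] == 0:
            continue  # nothing to fold into column k
        r = math.hypot(L[k][k], x[k])
        c = r / L[k][k]
        s = x[k] / L[k][k]
        L[k][k] = r
        # Only the m - 1 entries below the diagonal can be non-zero,
        # so each column costs O(m) and the whole update O(mn).
        for i in range(k + 1, min(n, k + m)):
            L[i][k] = (L[i][k] + s * x[i]) / c
            x[i] = c * x[i] - s * L[i][k]
    return L
```

Under the stated assumptions the updated factor keeps the same band, because entries outside the band are never touched and would be updated to zero by the full algorithm anyway.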

18.

A Parallel Algorithm for Linear Programs with an Additional Reverse Convex Constraint

Shih-Mim Liu G.P. Papavassilopoulos 《Journal of Parallel and Distributed Computing》1997,45(2):91

A parallel method for globally minimizing a linear program with an additional reverse convex constraint is proposed which combines the outer approximation technique and the cutting plane method. Basically, p (≤ n) processors are used for a problem with n variables, and a globally optimal solution is found effectively in a finite number of steps. Computational results are presented for test problems with up to 80 variables and 63 linear constraints (plus nonnegativity constraints). These results were obtained on a distributed-memory MIMD parallel computer, DELTA, by running both serial and parallel algorithms with double precision. Also, based on 40 randomly generated problems of the same size, with 16 variables and 32 linear constraints (plus x ≥ 0), the numerical results for different numbers of processors are reported, including those of the serial algorithm.

19.

Processor-time optimal parallel algorithms for digitized images on mesh-connected processor arrays

Hussein M. Alnuweiri V. K. Prasanna Kumar 《Algorithmica》1991,6(1):698-733

We present processor-time optimal parallel algorithms for several problems on n × n digitized image arrays, on a mesh-connected array having p processors and a memory of size O(n²) words. The number of processors p can vary over the range [1, n^(3/2)] while providing optimal speedup for these problems. The class of image problems considered here includes labeling the connected components of an image; computing the convex hull, the diameter, and a smallest enclosing box of each component; and computing all closest neighbors. Such problems arise in medium-level vision and require global operations on image pixels. To achieve optimal performance, several efficient data-movement and reduction techniques are developed for the proposed organization. This research was supported in part by the National Science Foundation under Grant IRI-8710836 and in part by DARPA under Contract F33615-87-C-1436 monitored by the Wright Patterson Airforce Base.

20.

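An efficient minimum spanning tree algorithm on message-passing parallel computers

王光荣 (Wang Guangrong) 顾乃杰 (Gu Naijie) 《Journal of Software》2000,11(7):889-898

Based on the traditional sequential Borůvka minimum spanning tree algorithm, an efficient minimum spanning tree algorithm for message-passing parallel computers is proposed. Three techniques are used to improve the efficiency of the algorithm: two-pass merging and packed contraction reduce the communication overhead, and a balanced data distribution evens out the computational load across processors. The computation and communication complexities of the algorithm are O(n²/p) and O((t_s p + t_w n)n/p), respectively. In practice, on the Dawning-1000 parallel machine, the speedup on 16 nodes is 12 for a sparse graph with 10,000 vertices.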
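As background for this entry, the following is a minimal sketch of the sequential Borůvka algorithm that the paper takes as its starting point. It does not implement the message-passing version or the merging, packing, and data-balancing techniques described in the abstract; the function name `boruvka_mst` and the `(weight, u, v)` edge format are assumptions for illustration. In each round, every component selects its cheapest outgoing edge independently, which is the step that lends itself to parallel execution.

```python
def boruvka_mst(n, edges):
    """Sequential Boruvka MST sketch.

    n: number of vertices, labelled 0..n-1.
    edges: list of (weight, u, v) tuples for an undirected connected graph.
    Returns (total_weight, list_of_chosen_edges).
    """
    parent = list(range(n))

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    mst = []
    components = n
    while components > 1:
        # cheapest[r] = index of the cheapest edge leaving component r.
        cheapest = [None] * n
        for idx, (w, u, v) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge lies inside one component
            for r in (ru, rv):
                best = cheapest[r]
                # Ties are broken by edge index so that no cycle can form.
                if best is None or (w, idx) < (edges[best][0], best):
                    cheapest[r] = idx
        merged = False
        for idx in cheapest:
            if idx is None:
                continue
            w, u, v = edges[idx]
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst.append(edges[idx])
                components -= 1
                merged = True
        if not merged:
            raise ValueError("graph is not connected")
    return sum(w for w, _, _ in mst), mst
```

Since the number of components at least halves in every round, the outer loop runs at most about log₂ n times.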