1.

OpenMP based parallel normalized direct methods for sparse finite element linear systems (total citations: 1; self-citations: 0; citations by others: 1)

George A. Gravvanis 《The Journal of supercomputing》2009,47(1):44-52

A new parallel normalized exact inverse algorithm is presented for solving sparse symmetric finite element linear systems on symmetric multiprocessor systems (SMP), based upon an antidiagonal motion approach (“wave”-like pattern) for overcoming the data dependencies. The proposed algorithm was implemented using OpenMP directives. Numerical results, such as speedups and efficiency, are presented illustrating the efficient performance on a symmetric multiprocessor computer system, where the proposed algorithmic solution method achieves good speedups.
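
The antidiagonal ("wave") ordering mentioned in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the recurrence below (a simple path count) and the function names `antidiagonals` and `wavefront_fill` are assumptions chosen only to show why cells on one antidiagonal can be computed independently when each cell depends on its upper and left neighbours.

```python
def antidiagonals(rows, cols):
    """Yield the cells of a rows x cols grid grouped by antidiagonal (i + j = d)."""
    for d in range(rows + cols - 1):
        i_min = max(0, d - cols + 1)
        i_max = min(rows - 1, d)
        yield [(i, d - i) for i in range(i_min, i_max + 1)]


def wavefront_fill(rows, cols):
    """Fill a grid in which cell (i, j) depends on cells (i-1, j) and (i, j-1).

    The recurrence is a stand-in for illustration, not the paper's computation.
    """
    grid = [[0] * cols for _ in range(rows)]
    for wave in antidiagonals(rows, cols):
        # Cells within one wave never read each other, so this inner loop is
        # the part a shared-memory runtime could split across threads.
        for i, j in wave:
            if i == 0 or j == 0:
                grid[i][j] = 1
            else:
                grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid
```

In a shared-memory setting, the outer loop over waves stays sequential, while the inner loop is the natural candidate for a parallel loop directive.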

Similar articles

2.

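Simulation study of a parallel J-variant block Cholesky factorization algorithm

Gu Yaolin, Liu Wanlong, Liu Qiang, Hu Shouwei 《计算机仿真》 (Computer Simulation) 2006,23(8):82-85

This paper presents a parallel distributed algorithm for a linear solver for large dense real symmetric positive definite systems or dense complex symmetric non-Hermitian systems. Unlike ScaLAPACK, it uses a J-variant block Cholesky factorization algorithm and a one-dimensional block-cyclic column data distribution. With MPI as the message-passing library, on clusters of up to 16 processors the algorithm delivers floating-point performance close to that of ScaLAPACK for dense real symmetric positive definite systems, and it can solve some electromagnetic scattering problems that involve dense complex symmetric non-Hermitian systems. Its advantage is that the storage needed to perform the Cholesky factorization is only half of that required by the standard parallel library ScaLAPACK. Numerical simulation results show that the algorithm is correct and effective.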
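
The one-dimensional block-cyclic column distribution named in the abstract can be sketched as a mapping from column index to owning process. The function names and the parameters `block_size` and `num_procs` are hypothetical; the abstract does not give the paper's actual block size or process layout.

```python
def owner_of_column(j, block_size, num_procs):
    """Process that owns global column j under a 1D block-cyclic column layout.

    Columns are grouped into blocks of block_size consecutive columns, and the
    blocks are dealt out to processes in round-robin order.
    """
    return (j // block_size) % num_procs


def columns_of_process(p, n, block_size, num_procs):
    """All global column indices of an n-column matrix owned by process p."""
    return [j for j in range(n) if owner_of_column(j, block_size, num_procs) == p]
```

For example, with 8 columns, blocks of 2 columns, and 2 processes, process 0 holds columns 0, 1, 4, 5 and process 1 holds columns 2, 3, 6, 7.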

3.

Efficient External Memory Algorithms by Simulating Coarse-Grained Parallel Algorithms

Dehne Dittrich Hutchinson 《Algorithmica》2008,36(2):97-122

External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posed as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.

4.

Parallel nested dissection

John M. Conroy 《Parallel Computing》1990,16(2-3):139-156

Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine grain algorithm for a square array of processors. Their algorithm uses O(N²) processors, each with O(N) memory, to factor an N² by N² sparse matrix whose graph is an N × N mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6] [8] give a medium grain algorithm for hypercube architecture, while George et al. [7] give an algorithm for shared memory machines. These papers present a column oriented approach which can exploit O(N) parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column oriented scheme which achieves up to 75% efficiency and O(N) parallelism. In this paper, we present a medium to fine grain algorithm for a P × P array of processors with local memory. This algorithm can exploit up to O(N²) parallelism. The efficiency of the fine grain version is comparable to [16], while as a medium grain algorithm it achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, overlapping computation and communication, and the use of level 3 BLAS like primitives. In addition to its high efficiency, its memory requirement is optimal: only O(N² log N/P²) words of memory are needed per processor.

5.

The Role of Arithmetic in Fast Parallel Matrix Inversion

B. Codenotti M. Leoncini F. P. Preparata 《Algorithmica》2001,30(4):685-707

In the last two decades several NC algorithms for solving basic linear algebraic problems have appeared in the literature. This interest was clearly motivated by the emergence of a parallel computing technology and by the wide applicability of matrix computations. The traditionally adopted computation model, however, ignores the arithmetic aspects of the applications, and no analysis is currently available demonstrating the concrete feasibility of many of the known fast methods. In this paper we give strong evidence to the contrary, on the sole basis of the issue of robustness, indicating that some theoretically brilliant solutions fail the severe test of the “Engineering of Algorithms.” We perform a comparative analysis of several well-known numerical matrix inversion algorithms under both fixed- and variable-precision models of arithmetic. We show that, for most methods investigated, a typical input leads to poor numerical performance, and that in the exact-arithmetic setting no benefit derives from conditions usually deemed favorable in standard scientific computing. Under these circumstances, the only algorithm admitting sufficiently accurate NC implementations is Newton's iterative method, and the word size required to guarantee worst-case correctness appears to be the critical complexity measure. Our analysis also accounts for the observed instability of the considered superfast methods when implemented with the same floating-point arithmetic that is perfectly adequate for the fixed-precision approach. Received March 28, 1998; revised February 2, 1999, and April 21, 1999.
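
As a rough illustration of Newton's iterative method for matrix inversion, the sketch below applies the standard update X ← X(2I − AX) in ordinary floating-point arithmetic. The starting guess Aᵀ/(‖A‖₁‖A‖∞), the fixed iteration count, and the helper names are assumptions made for this example; the abstract does not specify the variant or the arithmetic model analysed in the paper.

```python
def matmul(a, b):
    """Dense matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def newton_inverse(a, iterations=30):
    """Approximate the inverse of a nonsingular square matrix a.

    Uses the Newton iteration X <- X (2I - A X), started from the scaled
    transpose A^T / (||A||_1 * ||A||_inf), a common convergent initial guess.
    """
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x
```

Each step consists only of matrix products, which is what makes this iteration attractive for parallel implementations; the residual I − AX is squared at every step.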

6.

Parallel, iterative solution of sparse linear systems: Models and architectures

Daniel A Reed Merrell L Patrick 《Parallel Computing》1985,2(1):45-67

Solving large, sparse, linear systems of equations is a fundamental problem in large scale scientific and engineering computation. A model of a general class of asynchronous, iterative solution methods for linear systems is developed. In the model, the system is solved by creating several cooperating tasks that each compute a portion of the solution vector. A data transfer model predicting both the probability that data must be transferred between two tasks and the amount of data to be transferred is presented. This model is used to derive an execution time model for predicting parallel execution time and an optimal number of tasks given the dimension and sparsity of the coefficient matrix and the costs of computation, synchronization, and communication. The suitability of different parallel architectures for solving randomly sparse linear systems is discussed. Based on the complexity of task scheduling, one parallel architecture, based on a broadcast bus, is presented and analyzed.

7.

Structure-adaptive parallel solution of sparse triangular linear systems

Ehsan Totoni Michael T. Heath Laxmikant V. Kale 《Parallel Computing》2014

Solving sparse triangular systems of linear equations is a performance bottleneck in many methods for solving more general sparse systems. Both for direct methods and for many iterative preconditioners, it is used to solve the system or improve an approximate solution, often across many iterations. Solving triangular systems is notoriously resistant to parallelism, however, and existing parallel linear algebra packages appear to be ineffective in exploiting significant parallelism for this problem.
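
To show where the limited parallelism in a sparse triangular solve comes from, here is a minimal sketch of level scheduling, a standard textbook technique. It is not the structure-adaptive method of the paper, and the row-dictionary storage and function names are assumptions made for the example.

```python
def level_sets(lower_rows):
    """Group the rows of a sparse lower-triangular matrix into dependency levels.

    lower_rows[i] is a dict {column: value} holding the nonzeros of row i,
    including the diagonal entry. Row i depends on every row j < i for which
    it has a nonzero in column j.
    """
    level = [0] * len(lower_rows)
    for i, row in enumerate(lower_rows):
        deps = [level[j] for j in row if j < i]
        level[i] = 1 + max(deps) if deps else 0
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


def solve_lower(lower_rows, b):
    """Forward substitution for L x = b, processed one level at a time."""
    x = [0.0] * len(b)
    for group in level_sets(lower_rows):
        # Rows in one group only read x-values produced by earlier groups,
        # so they could be solved concurrently.
        for i in group:
            row = lower_rows[i]
            partial = sum(v * x[j] for j, v in row.items() if j < i)
            x[i] = (b[i] - partial) / row[i]
    return x
```

The number of levels bounds the sequential depth of the solve: a matrix with long dependency chains yields many small levels and little work to share, which is one way to see why the problem resists parallelization.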

8.

Efficient Computation of the Characteristic Polynomial of a Tree and Related Tasks

Martin Fürer 《Algorithmica》2014,68(3):626-642

An O(n log² n) algorithm is presented to compute all coefficients of the characteristic polynomial of a tree on n vertices improving on the previously best quadratic time. With the same running time, the algorithm can be generalized in two directions. The algorithm is a counting algorithm for matchings, and the same ideas can be used to count other objects. For example, one can count the number of independent sets of all possible sizes simultaneously with the same running time. These counting algorithms not only work for trees, but can be extended to arbitrary graphs of bounded tree-width.

9.

Design and Implementation of a Practical Parallel Delaunay Algorithm (total citations: 1; self-citations: 0; citations by others: 1)

G. E. Blelloch G. L. Miller J. C. Hardwick D. Talmor 《Algorithmica》1999,24(3-4):243-269

This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to find the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from O(n log² n) to O(n log n). The depth (parallel time) is O(log³ n) on a CREW PRAM. The algorithm is simpler than previous O(n log n) work parallel algorithms leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work from the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster. Received June 1, 1997; revised March 10, 1998.
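
The reduction to points on a paraboloid rests on the lifting map (x, y) → (x, y, x² + y²). A minimal sketch of the resulting in-circle test is given below; it is only the geometric predicate behind the reduction, not the parallel algorithm of the paper, and the function names are assumptions.

```python
def lift(p):
    """Lift a 2D point onto the paraboloid z = x^2 + y^2."""
    x, y = p
    return (x, y, x * x + y * y)


def det3(r0, r1, r2):
    """Determinant of the 3x3 matrix with rows r0, r1, r2."""
    return (r0[0] * (r1[1] * r2[2] - r1[2] * r2[1])
            - r0[1] * (r1[0] * r2[2] - r1[2] * r2[0])
            + r0[2] * (r1[0] * r2[1] - r1[1] * r2[0]))


def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circle through a, b, c.

    Assumes a, b, c are given in counter-clockwise order. The test checks
    whether the lifted point d lies below the plane through the lifted
    points a, b, c.
    """
    la, lb, lc, ld = lift(a), lift(b), lift(c), lift(d)
    rows = [tuple(p[i] - ld[i] for i in range(3)) for p in (la, lb, lc)]
    return det3(*rows) > 0
```

A triangle belongs to the Delaunay triangulation exactly when no other input point passes this test, which corresponds to the lifted triangle being a face of the lower convex hull.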

10.

Distributed Matrix-Free Solution of Large Sparse Linear Systems over Finite Fields

E. Kaltofen A. Lobo 《Algorithmica》1999,24(3-4):331-348

We describe a coarse-grain parallel approach for the homogeneous solution of linear systems. Our solutions are symbolic, i.e., exact rather than numerical approximations. We have performed an outer loop parallelization that works well in conjunction with a black box abstraction for the coefficient matrix. Our implementation can be run on a network cluster of UNIX workstations as well as on an SP-2 multiprocessor. Task distribution and management are effected through MPI and other packages. Fault tolerance, checkpointing, and recovery are incorporated. Detailed timings are presented for experiments with systems that arise in RSA challenge integer factoring efforts. For example, we can solve a 252,222 × 252,222 system with about 11.04 million nonzero entries over the Galois field with two elements using four processors of an SP-2 multiprocessor, in about 26.5 hours CPU time. Received June 1, 1997; revised March 10, 1998.

11.

An Efficient Output-Size Sensitive Parallel Algorithm for Hidden-Surface Removal for Terrains

N. Gupta S. Sen 《Algorithmica》2001,31(2):179-207

We describe an efficient parallel algorithm for hidden-surface removal for terrain maps. The algorithm runs in O(log⁴ n) steps on the CREW PRAM model with a work bound of O((n+k) polylog(n)) where n and k are the input and output sizes, respectively. In order to achieve the work bound we use a number of techniques, among which our use of persistent data structures is somewhat novel in the context of parallel algorithms. Received July 29, 1998; revised October 5, 1999.

12.

Efficient Algorithms for Dynamic Allocation of Distributed Memory

T. Leighton E. J. Schwabe 《Algorithmica》1999,24(2):139-171

We consider the problem of dynamically allocating and deallocating local memory resources among multiple users in a parallel or distributed system. Given a group of independent users and a collection of interconnected local memory devices, we want to render the fragmentation of the memory resources irrelevant by allowing any user to allocate space for his or her purposes as long as there is space available anywhere in the system. In effect, we would like it to appear to the users as though they are allocating memory from a single central pool of memory, even though the space is distributed throughout the system. Our goal is to devise an on-line allocation algorithm that minimizes two cost measures: first, the fraction of unused space, which arises due to fragmentation of the memory; second, the slowdown needed by the system to service user requests, which arises due to the contention for access to the memory devices. We solve this distributed dynamic allocation problem in near-optimal fashion by devising an algorithm that allows the memory to be used to 100% of capacity despite the fragmentation and guarantees that service delays will always be within a constant factor of optimal. The algorithm is completely on-line (no foreknowledge of user activity is assumed) and can accommodate any sequence of allocations and deallocations by the users that does not violate global memory bounds. We also consider the distributed dynamic allocation problem in the more restrictive setting where the local memory devices are connected by a low-degree fixed-connection network, rather than being fully interconnected. In this case, communication costs must be more explicitly considered in our allocation algorithms. We give allocation algorithms for butterfly and hypercube networks, and prove necessary and sufficient conditions on the total amount of memory space needed for near-optimal algorithms to exist. Received November 5, 1996; revised December 10, 1997.

13.

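Design of a PVM-based parallel computing model for heuristic search (total citations: 2; self-citations: 1; citations by others: 2)

Wang Jinghui, Liu Caihong, Qiao Weimin 《计算机工程》 (Computer Engineering) 2005,31(1):68-70

By analyzing the A and A* heuristic searches used in artificial intelligence, this paper proposes a parallel computing model for A and A* heuristic search that is designed and implemented with the PVM toolkit. Evaluation functions are computed concurrently during the heuristic search, which speeds up the computation. This addresses the slow computation on a single computer when the search space is huge and the evaluation function is expensive to compute. The paper implements a PVM-based heuristic search procedure, and the model can serve as a parallel computing model for general heuristic search problems.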
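
For orientation, a minimal sequential A* search is sketched below. It is not the PVM-based model of the paper; the graph representation, the function name `a_star`, and the `heuristic` callback are assumptions. The comment marks the evaluation-function call, which is the step the abstract describes as being computed concurrently.

```python
import heapq


def a_star(graph, start, goal, heuristic):
    """Return the cost of a cheapest path from start to goal, or None.

    graph maps a node to a list of (neighbor, edge_weight) pairs, and
    heuristic(node) estimates the remaining cost to the goal. With an
    admissible heuristic the returned cost is optimal.
    """
    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            return cost
        if cost > best_cost.get(node, float("inf")):
            continue  # stale heap entry; a cheaper route was found later
        for neighbor, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best_cost.get(neighbor, float("inf")):
                best_cost[neighbor] = new_cost
                # The heuristic evaluation below is the potentially expensive
                # step that a parallel model could hand to worker processes.
                priority = new_cost + heuristic(neighbor)
                heapq.heappush(open_heap, (priority, new_cost, neighbor))
    return None
```

When the evaluation function dominates the running time, evaluating all successors of an expanded node at once is the obvious unit of work to distribute.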

14.

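Implementation of a sparse matrix multiplication calculator in C++

ZHOU Min 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) 2008,(14)

A sparse matrix is a matrix in which most elements are zero. Exploiting this sparsity in storage and computation greatly reduces storage space and improves computational efficiency. This paper designs and implements a sparse matrix multiplication calculator in standard C++.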
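
The idea of storing and multiplying only the nonzero entries can be sketched as follows. The abstract does not describe the article's data structure, so the dictionary-of-coordinates storage and the function names here are assumptions, and the sketch is in Python rather than the language used in the article.

```python
def to_sparse(dense):
    """Convert a dense list-of-lists matrix to a {(row, col): value} dict."""
    return {(i, j): v
            for i, row in enumerate(dense)
            for j, v in enumerate(row) if v != 0}


def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {(row, col): value} dicts.

    Only pairs of nonzeros with matching inner index are combined, so the
    work depends on the number of nonzeros rather than on the dimensions.
    """
    b_rows = {}
    for (k, j), v in b.items():
        b_rows.setdefault(k, []).append((j, v))
    result = {}
    for (i, k), a_val in a.items():
        for j, b_val in b_rows.get(k, []):
            result[(i, j)] = result.get((i, j), 0) + a_val * b_val
    # Drop entries that cancelled to zero so the result stays sparse.
    return {key: v for key, v in result.items() if v != 0}
```

For instance, multiplying the sparse forms of [[1, 0], [0, 2]] and [[0, 3], [4, 0]] touches only two pairs of nonzeros and yields the entries (0, 1) → 3 and (1, 0) → 8.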

15.

Optimal Parallel Randomized Algorithms for the Voronoi Diagram of Line Segments in the Plane

Rajasekaran Ramaswami 《Algorithmica》2002,33(4):436-460

We present an optimal parallel randomized algorithm for the Voronoi diagram of a set of n nonintersecting (except possibly at endpoints) line segments in the plane. Our algorithm runs in O(log n) time with high probability using O(n) processors on a CRCW PRAM. This algorithm is optimal in terms of work done since the sequential time bound for this problem is Ω(n log n). Our algorithm improves by an O(log n) factor the previously best known deterministic parallel algorithm, given by Goodrich, Ó'Dúnlaing, and Yap, which runs in O(log² n) time using O(n) processors. We obtain this result by using a new “two-stage” random sampling technique. By choosing large samples in the first stage of the algorithm, we avoid the hurdle of problem-size “blow-up” that is typical in recursive parallel geometric algorithms. We combine the two-stage sampling technique with efficient search and merge procedures to obtain an optimal algorithm. This technique gives an alternative optimal algorithm for the Voronoi diagram of points as well (all other optimal parallel algorithms for this problem use the transformation to three-dimensional half-space intersection).

16.

Finding the k Shortest Paths in Parallel

E. Ruppert 《Algorithmica》2000,28(2):242-254

A concurrent-read exclusive-write PRAM algorithm is developed to find the k shortest paths between pairs of vertices in an edge-weighted directed graph. Repetitions of vertices along the paths are allowed. The algorithm computes an implicit representation of the k shortest paths to a given destination vertex from every vertex of a graph with n vertices and m edges, using O(m + nk log² k) work and O(log³ k log* k + log n (log log k + log* n)) time, assuming that a shortest path tree rooted at the destination is pre-computed. The paths themselves can be extracted from the implicit representation in O(log k + log n) time, and O(n log n + L) work, where L is the total length of the output. Received July 2, 1997; revised June 18, 1998.

17.

Efficient decomposition and performance of parallel PDE, FFT, Monte Carlo simulations, simplex, and Sparse solvers

Zarka Cvetanovic Edward G. Freedman Charles Nofsinger 《The Journal of supercomputing》1991,5(2-3):219-238

In this paper, we describe the decomposition of six algorithms: two partial differential equations (PDE) solvers (successive over-relaxation [SOR] and alternating direction implicit [ADI]), fast Fourier transform (FFT), Monte Carlo simulations, Simplex linear programming, and Sparse solvers. The algorithms were selected not only because of their importance in scientific applications, but also because they represent a variety of computational (structured to irregular) and communication (low to high) requirements. We present the performance results of these algorithms on two shared-memory VAX/VMS™ multiprocessor prototypes: the VAX 6300 series with up to 8 processors and the M31 with up to 22 processors. We demonstrate that by efficient decomposition it is possible to achieve high performance for all algorithms on both prototypes. We describe the efficient decomposition techniques applied to optimize the performance of parallel algorithms. Also, we discuss the performance implications due to different cache designs on two multiprocessors. An earlier version of this paper was presented at Supercomputing '90. At the time of writing, all three authors were with Digital Equipment Corporation, VMS Systems and Servers Group.

18.

Parallel Complexity of Matrix Multiplication

Santos Eunice E. 《The Journal of supercomputing》2003,25(2):155-175

Effective design of parallel matrix multiplication algorithms relies on the consideration of many interdependent issues based on the underlying parallel machine or network upon which such algorithms will be implemented, as well as the type of methodology utilized by an algorithm. In this paper, we determine the parallel complexity of multiplying two (not necessarily square) matrices on parallel distributed-memory machines and/or networks. In other words, we provide an achievable parallel run-time that cannot be beaten by any algorithm (known or unknown) for solving this problem. In addition, any algorithm that claims to be optimal must attain this run-time. In order to obtain results that are general and useful throughout a span of machines, we base our results on the well-known LogP model. Furthermore, three important criteria must be considered in order to determine the running time of a parallel algorithm; namely, (i) local computational tasks, (ii) the initial data layout, and (iii) the communication schedule. We provide optimality results by first proving general lower bounds on parallel run-time. These lower bounds lead to significant insights on (i)–(iii) above. In particular, we present what types of data layouts and communication schedules are needed in order to obtain optimal run-times. We prove that no one data layout can achieve optimal running times for all cases. Instead, optimal layouts depend on the dimensions of each matrix, and on the number of processors. Lastly, optimal algorithms are provided.

19.

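Implementation of a sparse matrix multiplication calculator in C++

Zhou Min 《电脑编程技巧与维护》 (Computer Programming Skills & Maintenance) 2008,(11):19-19,42

A sparse matrix is a matrix in which most elements are zero. Exploiting this “sparsity” in storage and computation greatly reduces storage space and improves computational efficiency. This paper designs and implements a sparse matrix multiplication calculator in standard C++.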

20.

A Parallel Implementation of the Simplex Function Minimization Routine

Donghoon Lee Matthew Wiswall 《Computational Economics》2007,30(2):171-187

This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. Our parallel simplex algorithm assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that our algorithm is generic and can be applied, without re-writing computer code, to any optimization problem to which the non-parallel Nelder–Mead is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings in some experiments up to three times the number of processors.