Similar literature
 20 similar documents found
1.
A new parallel normalized exact inverse algorithm is presented for solving sparse symmetric finite element linear systems on symmetric multiprocessor systems (SMP), based upon an antidiagonal motion approach (“wave”-like pattern) for overcoming the data dependencies. The proposed algorithm was implemented using OpenMP directives. Numerical results, such as speedups and efficiency, are presented illustrating the efficient performance on a symmetric multiprocessor computer system, where the proposed algorithmic solution method achieves good speedups.
George A. Gravvanis
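
The antidiagonal ("wave"-like) ordering mentioned in the abstract above can be illustrated on a generic two-dimensional recurrence. This is only a sketch, not the authors' normalized exact inverse algorithm: it assumes a hypothetical recurrence in which cell (i, j) depends on cells (i-1, j) and (i, j-1), and the names `antidiagonals` and `wavefront_fill` are illustrative.

```python
def antidiagonals(n_rows, n_cols):
    """Yield the cells of an n_rows x n_cols grid grouped by antidiagonal i + j = d."""
    for d in range(n_rows + n_cols - 1):
        lo = max(0, d - n_cols + 1)
        hi = min(d, n_rows - 1)
        yield [(i, d - i) for i in range(lo, hi + 1)]


def wavefront_fill(n_rows, n_cols):
    """Fill a grid where each interior cell is the sum of its upper and left neighbours.

    The recurrence is a stand-in (an assumption) for any computation with the
    same dependency pattern. Border cells are fixed to 1.
    """
    grid = [[1] * n_cols for _ in range(n_rows)]
    for wave in antidiagonals(n_rows, n_cols):
        # Cells within one wave do not depend on each other, so this inner
        # loop is the part a parallel runtime could split across threads.
        for i, j in wave:
            if i > 0 and j > 0:
                grid[i][j] = grid[i - 1][j] + grid[i][j - 1]
    return grid
```

Each wave only reads values produced by earlier waves, which is what removes the data dependency inside a wave.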

2.
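This paper presents a parallel distributed algorithm for a linear solver of large dense real symmetric positive definite systems or dense complex symmetric non-Hermitian systems. Unlike ScaLAPACK, it uses a J-variant block Cholesky factorization algorithm and a one-dimensional block-cyclic column data distribution. With MPI as the message-passing library, the algorithm delivers floating-point performance close to that of ScaLAPACK for dense real symmetric positive definite systems on clusters of up to 16 processors, and it can solve some electromagnetic scattering problems involving dense complex symmetric non-Hermitian systems. Its advantage is that the storage required for the Cholesky factorization is only half of that needed by the standard parallel library ScaLAPACK. Numerical simulation results show that the algorithm is correct and effective.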
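
The one-dimensional block-cyclic column distribution named above can be sketched as a mapping from column index to owning process. This is a generic illustration of the layout, not code from the paper; the function names and the parameters `block_size` and `n_procs` are assumptions.

```python
def owner_of_column(col, block_size, n_procs):
    """Return the rank that owns a column under a 1D block-cyclic layout.

    Columns are grouped into blocks of `block_size` consecutive columns,
    and the blocks are dealt out to the processes round-robin.
    """
    return (col // block_size) % n_procs


def columns_of_process(rank, n_cols, block_size, n_procs):
    """List the global column indices stored by one process."""
    return [c for c in range(n_cols)
            if owner_of_column(c, block_size, n_procs) == rank]
```

For example, with 10 columns, blocks of 2 columns, and 3 processes, rank 0 holds columns 0, 1, 6, and 7. Dealing blocks out cyclically keeps the load balanced as the factorization moves across the matrix.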

3.
Dehne  Dittrich  Hutchinson 《Algorithmica》2008,36(2):97-122
Abstract. External memory (EM) algorithms are designed for large-scale computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Typical EM algorithms are specially crafted for the EM situation. In the past, several attempts have been made to relate the large body of work on parallel algorithms to EM, but with limited success. The combination of EM computing, on multiple disks, with multiprocessor parallelism has been posed as a challenge by the ACM Working Group on Storage I/O for Large-Scale Computing. In this paper we provide a simulation technique which produces efficient parallel EM algorithms from efficient BSP-like parallel algorithms. The techniques obtained can accommodate one or multiple processors on the EM target machine, each with one or more disks, and they also adapt to the disk blocking factor of the target machine. When applied to existing BSP-like algorithms, our simulation technique produces improved parallel EM algorithms for a large number of problems.

4.
Nested dissection is a very popular direct method for solving sparse linear systems that arise from finite difference and finite element methods. Worley and Schreiber [16] give a fine grain algorithm for a square array of processors. Their algorithm uses $O(N^2)$ processors, each with $O(N)$ memory, to factor an $N^2 \times N^2$ sparse matrix whose graph is an $N \times N$ mesh. The efficiency of their method is between 1/46 and 1/12. George et al. [6] [8] give a medium grain algorithm for hypercube architecture, while George et al. [7] give an algorithm for shared memory machines. These papers present a column oriented approach which can exploit $O(N)$ parallelism and yield efficiencies up to 50%. Lucas [11] also gives a column oriented scheme which achieves up to 75% efficiency and $O(N)$ parallelism. In this paper, we present a medium to fine grain algorithm for a $P \times P$ array of processors with local memory. This algorithm can exploit up to $O(N^2)$ parallelism. The efficiency of the fine grain version is comparable to [16], while as a medium grain algorithm it achieves about 49% efficiency. The strength of the method is due to three factors: its ability to pipeline much of the computation, overlapping computation and communication, and the use of level 3 BLAS-like primitives. In addition to its high efficiency, its memory requirement is optimal: only $O(N^2 \log N / P^2)$ words of memory are needed per processor.

5.
In the last two decades several NC algorithms for solving basic linear algebraic problems have appeared in the literature. This interest was clearly motivated by the emergence of a parallel computing technology and by the wide applicability of matrix computations. The traditionally adopted computation model, however, ignores the arithmetic aspects of the applications, and no analysis is currently available demonstrating the concrete feasibility of many of the known fast methods. In this paper we give strong evidence to the contrary, on the sole basis of the issue of robustness, indicating that some theoretically brilliant solutions fail the severe test of the "Engineering of Algorithms." We perform a comparative analysis of several well-known numerical matrix inversion algorithms under both fixed- and variable-precision models of arithmetic. We show that, for most methods investigated, a typical input leads to poor numerical performance, and that in the exact-arithmetic setting no benefit derives from conditions usually deemed favorable in standard scientific computing. Under these circumstances, the only algorithm admitting sufficiently accurate NC implementations is Newton's iterative method, and the word size required to guarantee worst-case correctness appears to be the critical complexity measure. Our analysis also accounts for the observed instability of the considered superfast methods when implemented with the same floating-point arithmetic that is perfectly adequate for the fixed-precision approach. Received March 28, 1998; revised February 2, 1999, and April 21, 1999.
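
Newton's iterative method for matrix inversion, singled out in the abstract above, repeats the update $X_{k+1} = X_k (2I - A X_k)$. The sketch below is a plain sequential floating-point version, not the NC implementation analyzed in the paper. The starting guess $X_0 = A^T / (\|A\|_1 \|A\|_\infty)$ is a standard choice assumed here, and the function names and fixed iteration count are illustrative.

```python
def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def newton_inverse(a, iterations=30):
    """Approximate the inverse of a nonsingular square matrix by Newton iteration."""
    n = len(a)
    norm_1 = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    norm_inf = max(sum(abs(v) for v in row) for row in a)
    scale = 1.0 / (norm_1 * norm_inf)
    # Initial guess: scaled transpose of A (an assumed, commonly used choice).
    x = [[a[j][i] * scale for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        ax = matmul(a, x)
        # correction = 2I - A X
        correction = [[(2.0 if i == j else 0.0) - ax[i][j] for j in range(n)]
                      for i in range(n)]
        x = matmul(x, correction)
    return x
```

Each step consists only of matrix products, which is why the method parallelizes well. The residual $I - A X_k$ is squared at every step, so convergence is quadratic once the residual norm is below 1.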

6.
Solving large, sparse, linear systems of equations is a fundamental problem in large scale scientific and engineering computation. A model of a general class of asynchronous, iterative solution methods for linear systems is developed. In the model, the system is solved by creating several cooperating tasks that each compute a portion of the solution vector. A data transfer model predicting both the probability that data must be transferred between two tasks and the amount of data to be transferred is presented. This model is used to derive an execution time model for predicting parallel execution time and an optimal number of tasks given the dimension and sparsity of the coefficient matrix and the costs of computation, synchronization, and communication. The suitability of different parallel architectures for solving randomly sparse linear systems is discussed. Based on the complexity of task scheduling, one parallel architecture, based on a broadcast bus, is presented and analyzed.

7.
Solving sparse triangular systems of linear equations is a performance bottleneck in many methods for solving more general sparse systems. Both for direct methods and for many iterative preconditioners, it is used to solve the system or improve an approximate solution, often across many iterations. Solving triangular systems is notoriously resistant to parallelism, however, and existing parallel linear algebra packages appear to be ineffective in exploiting significant parallelism for this problem.
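
The following sketch shows why a sparse triangular solve resists parallelism. The abstract does not describe its own method, so level scheduling is used here purely as a well-known illustration of the dependency structure. The row-as-dictionary storage and the function names are assumptions.

```python
def forward_substitution(rows, b):
    """Solve L x = b for a sparse lower triangular L.

    rows[i] is a dict {column: value} holding the nonzeros of row i;
    the diagonal entry rows[i][i] must be nonzero.
    """
    x = [0.0] * len(b)
    for i, row in enumerate(rows):
        acc = b[i] - sum(v * x[j] for j, v in row.items() if j < i)
        x[i] = acc / row[i]
    return x


def dependency_levels(rows):
    """Assign each row a level; rows on the same level are independent."""
    level = [0] * len(rows)
    for i, row in enumerate(rows):
        deps = [level[j] for j in row if j < i]
        level[i] = 1 + max(deps) if deps else 0
    return level
```

Rows that share a level can be solved concurrently, but the number of levels bounds the parallel time. A long dependency chain forces the solve to be nearly sequential.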

8.
Martin Fürer 《Algorithmica》2014,68(3):626-642
An $O(n \log^2 n)$ algorithm is presented to compute all coefficients of the characteristic polynomial of a tree on $n$ vertices, improving on the previous best quadratic time. With the same running time, the algorithm can be generalized in two directions. The algorithm is a counting algorithm for matchings, and the same ideas can be used to count other objects. For example, one can count the number of independent sets of all possible sizes simultaneously with the same running time. These counting algorithms not only work for trees, but can be extended to arbitrary graphs of bounded tree-width.
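
The link between matchings and the characteristic polynomial is that, for a tree on $n$ vertices with $m_k$ matchings of size $k$, the polynomial equals $\sum_k (-1)^k m_k x^{n-2k}$. The sketch below computes the counts $m_k$ with a simple quadratic-time dynamic program. It is not the $O(n \log^2 n)$ algorithm of the paper, and all names are illustrative.

```python
def poly_mul(p, q):
    """Multiply two polynomials stored as coefficient lists."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_add(p, q):
    """Add two polynomials stored as coefficient lists."""
    out = [0] * max(len(p), len(q))
    for i, a in enumerate(p):
        out[i] += a
    for i, b in enumerate(q):
        out[i] += b
    return out


def matching_counts(n, edges):
    """Return [m_0, m_1, ...] for a tree on vertices 0..n-1 rooted at vertex 0."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent, order, seen, stack = [-1] * n, [], [False] * n, [0]
    seen[0] = True
    while stack:  # iterative DFS: every vertex is listed after its parent
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if not seen[w]:
                seen[w], parent[w] = True, v
                stack.append(w)
    free = [[1] for _ in range(n)]  # matchings of the subtree, root unmatched
    used = [[0] for _ in range(n)]  # matchings of the subtree, root matched
    for v in reversed(order):       # children are merged before their parents
        p = parent[v]
        if p < 0:
            continue
        total = poly_add(free[v], used[v])
        # Either p is already matched, or it is matched to v now (one more edge).
        used[p] = poly_add(poly_mul(used[p], total),
                           [0] + poly_mul(free[p], free[v]))
        free[p] = poly_mul(free[p], total)
    result = poly_add(free[0], used[0])
    while len(result) > 1 and result[-1] == 0:
        result.pop()
    return result
```

For the path on four vertices the counts are 1, 3, 1, which matches the characteristic polynomial $x^4 - 3x^2 + 1$.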

9.
Design and Implementation of a Practical Parallel Delaunay Algorithm
This paper describes the design and implementation of a practical parallel algorithm for Delaunay triangulation that works well on general distributions. Although there have been many theoretical parallel algorithms for the problem, and some implementations based on bucketing that work well for uniform distributions, there has been little work on implementations for general distributions. We use the well known reduction of 2D Delaunay triangulation to find the 3D convex hull of points on a paraboloid. Based on this reduction we developed a variant of the Edelsbrunner and Shi 3D convex hull algorithm, specialized for the case when the point set lies on a paraboloid. This simplification reduces the work required by the algorithm (number of operations) from $O(n \log^2 n)$ to $O(n \log n)$. The depth (parallel time) is $O(\log^3 n)$ on a CREW PRAM. The algorithm is simpler than previous $O(n \log n)$ work parallel algorithms, leading to smaller constants. Initial experiments using a variety of distributions showed that our parallel algorithm was within a factor of 2 in work from the best sequential algorithm. Based on these promising results, the algorithm was implemented using C and an MPI-based toolkit. Compared with previous work, the resulting implementation achieves significantly better speedups over good sequential code, does not assume a uniform distribution of points, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for the IBM SP2, Cray T3D, SGI Power Challenge, and DEC AlphaCluster. Received June 1, 1997; revised March 10, 1998.
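
The reduction to a paraboloid can be made concrete with the classic in-circle test. A point lies inside the circumcircle of a triangle exactly when its lifted image $(x, y, x^2 + y^2)$ lies below the plane through the lifted triangle vertices. The sketch below is a generic predicate, not part of the paper's convex hull algorithm, and the function names are illustrative.

```python
def lift(p):
    """Lift a 2D point onto the paraboloid z = x^2 + y^2."""
    x, y = p
    return (x, y, x * x + y * y)


def det3(r0, r1, r2):
    """Determinant of the 3x3 matrix with rows r0, r1, r2."""
    return (r0[0] * (r1[1] * r2[2] - r1[2] * r2[1])
            - r0[1] * (r1[0] * r2[2] - r1[2] * r2[0])
            + r0[2] * (r1[0] * r2[1] - r1[1] * r2[0]))


def in_circumcircle(a, b, c, d):
    """Return True if d lies strictly inside the circle through a, b, c."""
    la, lb, lc, ld = lift(a), lift(b), lift(c), lift(d)
    rows = [tuple(p[k] - ld[k] for k in range(3)) for p in (la, lb, lc)]
    # Multiplying by the 2D orientation makes the test independent of
    # whether a, b, c are given clockwise or counterclockwise.
    orientation = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return det3(*rows) * orientation > 0
```

With integer coordinates the predicate is exact. With floating-point input a robust implementation would need more careful arithmetic than this sketch uses.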

10.
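鞠九滨 (Ju Jiubin) 《软件学报》 (Journal of Software) 1999,10(2):117-120
This paper introduces PCSE (parallel computation supporting environment), a support environment for parallel computing. By introducing FORK and JOIN primitives, it allows user source programs written in either procedural or logic programming languages to run in parallel on the system, giving users a sound software basis for high-speed computation. PCSE currently supports the parallel execution of FORTRAN, C, and PROLOG applications.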

11.
N. Gupta  S. Sen 《Algorithmica》2001,31(2):179-207
We describe an efficient parallel algorithm for hidden-surface removal for terrain maps. The algorithm runs in $O(\log^4 n)$ steps on the CREW PRAM model with a work bound of $O((n+k)\,\mathrm{polylog}(n))$, where $n$ and $k$ are the input and output sizes, respectively. In order to achieve the work bound we use a number of techniques, among which our use of persistent data structures is somewhat novel in the context of parallel algorithms. Received July 29, 1998; revised October 5, 1999.

12.
E. Kaltofen  A. Lobo 《Algorithmica》1999,24(3-4):331-348
We describe a coarse-grain parallel approach for the homogeneous solution of linear systems. Our solutions are symbolic, i.e., exact rather than numerical approximations. We have performed an outer loop parallelization that works well in conjunction with a black box abstraction for the coefficient matrix. Our implementation can be run on a network cluster of UNIX workstations as well as on an SP-2 multiprocessor. Task distribution and management are effected through MPI and other packages. Fault tolerance, checkpointing, and recovery are incorporated. Detailed timings are presented for experiments with systems that arise in RSA challenge integer factoring efforts. For example, we can solve a 252,222 × 252,222 system with about 11.04 million nonzero entries over the Galois field with two elements using four processors of an SP-2 multiprocessor, in about 26.5 hours CPU time. Received June 1, 1997; revised March 10, 1998.

13.
LogP is becoming a practical parallel computation model that meets the demands of parallel computers and parallel algorithms, so it is important to redesign parallel algorithms for the LogP model. This paper studies a parallel algorithm for computing the inverse matrix on the simplified LogP model and presents simulation results.

14.
We consider the problem of dynamically allocating and deallocating local memory resources among multiple users in a parallel or distributed system. Given a group of independent users and a collection of interconnected local memory devices, we want to render the fragmentation of the memory resources irrelevant by allowing any user to allocate space for his or her purposes as long as there is space available anywhere in the system. In effect, we would like it to appear to the users as though they are allocating memory from a single central pool of memory, even though the space is distributed throughout the system. Our goal is to devise an on-line allocation algorithm that minimizes two cost measures: first, the fraction of unused space, which arises due to fragmentation of the memory; second, the slowdown needed by the system to service user requests, which arises due to the contention for access to the memory devices. We solve this distributed dynamic allocation problem in near-optimal fashion by devising an algorithm that allows the memory to be used to 100% of capacity despite the fragmentation and guarantees that service delays will always be within a constant factor of optimal. The algorithm is completely on-line (no foreknowledge of user activity is assumed) and can accommodate any sequence of allocations and deallocations by the users that does not violate global memory bounds. We also consider the distributed dynamic allocation problem in the more restrictive setting where the local memory devices are connected by a low-degree fixed-connection network, rather than being fully interconnected. In this case, communication costs must be more explicitly considered in our allocation algorithms. We give allocation algorithms for butterfly and hypercube networks, and prove necessary and sufficient conditions on the total amount of memory space needed for near-optimal algorithms to exist. Received November 5, 1996; revised December 10, 1997.

15.
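Design of a PVM-Based Parallel Computation Model for Heuristic Search
By analyzing the A and A* heuristic searches used in artificial intelligence, this paper proposes a parallel computation model for A and A* heuristic search, designed and implemented with the PVM toolkit. Evaluation-function calculations are carried out concurrently during the heuristic search, which speeds up the computation. This addresses the slow performance of a single computer when the search space is huge and the evaluation function is expensive to compute. The paper implements a PVM-based heuristic search procedure, and the model can serve as a parallel computation model for general heuristic search problems.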
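
For reference, a minimal sequential A* search is sketched below. It is not the PVM-based parallel model of the paper. The grid world, the unit step cost, and the Manhattan-distance evaluation function are assumptions chosen to keep the example small.

```python
import heapq


def a_star(grid, start, goal):
    """Return the length of a shortest path on a grid, or None if unreachable.

    grid is a list of equal-length strings in which '#' marks a wall;
    start and goal are (row, column) tuples on free cells.
    """
    def h(p):
        # Manhattan distance: an admissible, consistent heuristic here.
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, node)
    best = {start: 0}
    while open_heap:
        _, g, node = heapq.heappop(open_heap)
        if node == goal:
            return g
        if g > best.get(node, float("inf")):
            continue  # stale heap entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#"):
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None
```

In a parallel variant such as the one described above, the calls to the evaluation function for the successors of a node are the natural units of work to hand to separate workers.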

16.
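A sparse matrix is a matrix in which most elements are zero. Exploiting this sparsity in both storage and computation can greatly reduce storage space and improve computational efficiency. A sparse matrix multiplier is designed and implemented in standard C.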
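
The idea of storing and multiplying only the nonzero entries can be sketched as follows. The paper's implementation is in C and its storage scheme is not given, so the dictionary-of-rows layout and the name `sparse_matmul` are assumptions made for illustration.

```python
def sparse_matmul(a, b):
    """Multiply two sparse matrices stored as {row: {column: value}}.

    Only nonzero entries are stored, and only products of two stored
    entries are ever computed.
    """
    result = {}
    for i, row in a.items():
        acc = {}
        for k, a_ik in row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        # Drop entries that cancelled to zero so the result stays sparse.
        acc = {j: v for j, v in acc.items() if v != 0}
        if acc:
            result[i] = acc
    return result
```

The work is proportional to the number of nonzero products rather than to the cube of the matrix dimension.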

17.
Abstract. We present an optimal parallel randomized algorithm for the Voronoi diagram of a set of $n$ nonintersecting (except possibly at endpoints) line segments in the plane. Our algorithm runs in $O(\log n)$ time with high probability using $O(n)$ processors on a CRCW PRAM. This algorithm is optimal in terms of work done since the sequential time bound for this problem is $\Omega(n \log n)$. Our algorithm improves by an $O(\log n)$ factor the previously best known deterministic parallel algorithm, given by Goodrich, Ó'Dúnlaing, and Yap, which runs in $O(\log^2 n)$ time using $O(n)$ processors. We obtain this result by using a new "two-stage" random sampling technique. By choosing large samples in the first stage of the algorithm, we avoid the hurdle of problem-size "blow-up" that is typical in recursive parallel geometric algorithms. We combine the two-stage sampling technique with efficient search and merge procedures to obtain an optimal algorithm. This technique gives an alternative optimal algorithm for the Voronoi diagram of points as well (all other optimal parallel algorithms for this problem use the transformation to three-dimensional half-space intersection).

18.
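宋庆增 (Song Qingzeng)  顾军华 (Gu Junhua) 《计算机工程》 (Computer Engineering) 2011,37(23):214-216
To address the low efficiency of sparse matrix-vector multiplication on traditional general-purpose processor (GPP) platforms, this paper proposes an SpMXV coprocessor design based on a reconfigurable computing platform. The design uses a highly pipelined binary-tree data flow, the IEEE-754 32-bit floating-point data format, and a diagonal storage format. The data path is organized as a pipeline to optimize computational performance. Simulation results show that, compared with a software implementation on a GPP platform, the hardware design achieves a speedup of up to 2.69 times.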
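
A software sketch of sparse matrix-vector multiplication in a diagonal storage format is given below. It does not model the pipelined hardware design. The row-indexed convention, in which `data[d][i]` holds $A[i][i + \text{offsets}[d]]$, is one of several in use and is an assumption here, as is the function name.

```python
def dia_matvec(offsets, data, x):
    """Compute y = A x for a square matrix stored by diagonals.

    offsets[d] is the diagonal's offset (0 = main, positive = above);
    data[d][i] stores A[i][i + offsets[d]], and entries that would fall
    outside the matrix are padding and are never read.
    """
    n = len(x)
    y = [0.0] * n
    for offset, diagonal in zip(offsets, data):
        for i in range(n):
            j = i + offset
            if 0 <= j < n:
                y[i] += diagonal[i] * x[j]
    return y
```

The inner loop walks memory with a fixed stride and needs no column index array, which is what makes the format attractive for a streaming data path.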

19.
E. Ruppert 《Algorithmica》2000,28(2):242-254
A concurrent-read exclusive-write PRAM algorithm is developed to find the $k$ shortest paths between pairs of vertices in an edge-weighted directed graph. Repetitions of vertices along the paths are allowed. The algorithm computes an implicit representation of the $k$ shortest paths to a given destination vertex from every vertex of a graph with $n$ vertices and $m$ edges, using $O(m + nk \log^2 k)$ work and $O(\log^3 k \log^* k + \log n (\log \log k + \log^* n))$ time, assuming that a shortest path tree rooted at the destination is pre-computed. The paths themselves can be extracted from the implicit representation in $O(\log k + \log n)$ time and $O(n \log n + L)$ work, where $L$ is the total length of the output. Received July 2, 1997; revised June 18, 1998.

20.
In this paper, we describe the decomposition of six algorithms: two partial differential equations (PDE) solvers (successive over-relaxation [SOR] and alternating direction implicit [ADI]), fast Fourier transform (FFT), Monte Carlo simulations, Simplex linear programming, and Sparse solvers. The algorithms were selected not only because of their importance in scientific applications, but also because they represent a variety of computational (structured to irregular) and communication (low to high) requirements. We present the performance results of these algorithms on two shared-memory VAX/VMS™ multiprocessor prototypes: the VAX 6300 series with up to 8 processors and the M31 with up to 22 processors. We demonstrate that by efficient decomposition it is possible to achieve high performance for all algorithms on both prototypes. We describe the efficient decomposition techniques applied to optimize the performance of parallel algorithms. Also, we discuss the performance implications due to different cache designs on two multiprocessors. An earlier version of this paper was presented at Supercomputing '90. At the time of writing, all three authors were with Digital Equipment Corporation, VMS Systems and Servers Group.
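
Of the six algorithms listed above, successive over-relaxation is the shortest to write down. The sketch below is a sequential dense-matrix version given only to recall the update rule, not the decomposed parallel solver of the paper. The relaxation factor, the iteration count, and the function name are assumptions.

```python
def sor_solve(a, b, omega=1.1, iterations=200):
    """Approximate the solution of A x = b by successive over-relaxation.

    Convergence is guaranteed for symmetric positive definite A when
    0 < omega < 2; the fixed iteration count is an illustrative choice.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            sigma = sum(a[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel = (b[i] - sigma) / a[i][i]
            # Blend the old value with the Gauss-Seidel update.
            x[i] = (1.0 - omega) * x[i] + omega * gauss_seidel
    return x
```

Each update uses values already refreshed in the same sweep, and that ordering dependency is what a parallel decomposition has to work around.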
