
1.

Parallelism in multigrid methods: How much is too much?

Lesley R. Matheson Robert E. Tarjan 《International journal of parallel programming》1996,24(5):397-432

Multigrid methods are powerful techniques to accelerate the solution of computationally intensive problems arising in a broad range of applications. Used in conjunction with iterative processes for solving partial differential equations, multigrid methods speed up iterative methods by moving the computation from the original mesh covering the problem domain through a series of coarser meshes. But this hierarchical structure leaves domain-parallel versions of the standard multigrid algorithms with a deficiency of parallelism on coarser grids. To compensate, several parallel multigrid strategies with more parallelism, but also more work, have been designed. We examine these parallel strategies and compare them to simpler standard algorithms to try to determine which techniques are more efficient and practical. We consider three parallel multigrid strategies: (1) domain-parallel versions of the standard V-cycle and F-cycle algorithms; (2) a multiple coarse grid algorithm, proposed by Frederickson and McBryan, which generates several coarse grids for each fine grid; and (3) two Rosendale algorithms, which allow computation on all grids simultaneously. We study an elliptic model problem on simple domains, discretized with finite difference techniques on block-structured meshes in two or three dimensions with up to 10⁶ or 10⁹ points, respectively. We analyze performance using three models of parallel computation: the PRAM and two bridging models. The bridging models reflect the salient characteristics of two kinds of parallel computers: SIMD fine-grain computers, which contain a large number of small (bit-serial) processors, and SPMD medium-grain computers, which have a more modest number of powerful (single-chip) processors. Our analysis suggests that the standard algorithms are substantially more efficient than algorithms utilizing either parallel strategy. Both parallel strategies need too much extra work to compensate for their extra parallelism. They require a highly impractical number of processors to be competitive with simpler, standard algorithms. The analysis also suggests that the F-cycle, with the appropriate optimization techniques, is more efficient than the V-cycle under a broad range of problem, implementation, and machine characteristics, despite the fact that it exhibits even less parallelism than the V-cycle. Research at Princeton University partially supported by the National Science Foundation, Grant No. CCR-8920505, and the Office of Naval Research, Contract No. N0014-91-J-1463.
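
The shortage of parallelism on coarse grids mentioned in this abstract can be illustrated with a small sketch. It assumes a square 2D mesh coarsened by a factor of two per dimension and at most one busy processor per grid point; the function name and parameters are hypothetical and are not taken from the paper.

```python
def level_utilization(n, processors):
    """Return a list of (points, busy_fraction) for each grid level.

    Assumptions (illustrative only): the finest grid has n x n points,
    each coarser grid halves n, and a processor is busy on a level only
    if at least one point of that level is assigned to it.
    """
    levels = []
    size = n
    while size >= 1:
        points = size * size
        busy_fraction = min(points, processors) / processors
        levels.append((points, busy_fraction))
        size //= 2  # standard coarsening: half as many points per dimension
    return levels


if __name__ == "__main__":
    for points, busy in level_utilization(1024, 4096):
        print(f"{points:>8} points: {busy:7.2%} of processors busy")
```

With 4096 processors and a 1024 x 1024 finest grid, every level coarser than 64 x 64 leaves some processors idle in this simplified picture, which is the effect the more parallel strategies try to offset.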

2.

A multigrid algorithm for parallel computers: CPMG

S. N. Gupta M. Zubair C. E. Grosch 《Journal of scientific computing》1992,7(3):263-279

In this article, we present a multigrid algorithm for parallel computers, the chopped parallel multigrid (CPMG) algorithm. The CPMG algorithm improves the processor utilization by reducing the work load on coarse grids without affecting the convergence rate of the algorithm. This is in contrast to earlier approaches (Gannon and van Rosendale, 1986; Frederickson and McBryan, 1989), where unutilized processors are used to improve the convergence rate. The CPMG algorithm reduces the coarse grid work by chopping the alternate cycles of multigrid. Using analytical results and simulations on sequential machines, we show that the CPMG can achieve almost the same convergence rate as standard MG for many cases. Analytically we show that the advantage gained by CPMG over standard MG on a mesh-connected massively parallel machine is 33% in hardware utilization, 50% in communication overheads and 38% in overall execution time. We have also evaluated the performance of CPMG on an actual massively parallel machine, the DAP-510. The advantage gained by CPMG over standard MG is 35% in overall execution time. Moreover, the CPMG can be integrated with other parallel multigrid algorithms, such as the PSMG algorithm (Frederickson and McBryan, 1989) and Decker's algorithm (Decker, 1990).

3.

Parallelizing genetic linkage analysis: a case study for applying parallel computation in molecular biology (cited by: 1 in total; self-citations: 0; citations by others: 1)

P L Miller P Nadkarni J E Gelernter N Carriero A J Pakstis K K Kidd 《Computers and biomedical research》1991,24(3):234-248

Parallel computers offer a solution to improve the lengthy computation time of many conventional, sequential programs used in molecular biology. On a parallel computer, different pieces of the computation are performed simultaneously on different processors. LINKMAP is a sequential program widely used by scientists to perform genetic linkage analysis. We have converted LINKMAP to run on a parallel computer, using the machine-independent parallel programming language, Linda. Using the parallelization of LINKMAP as a case study, the paper outlines an approach to converting existing highly iterative programs to a parallel form. The paper describes the steps involved in converting the sequential program to a parallel program. It presents performance benchmarks comparing the sequential version of LINKMAP with the parallel version running on different parallel machines. The paper also discusses alternative approaches to the problem of "load balancing," making sure the computational load is shared as evenly as possible among the available processors.

4.

Parallel scalability analysis of the multigrid solvers in HYPRE

Xiao-Wen Xu Ze-Yao Mo Xiao-Lin Cao 《软件学报 (Journal of Software)》2009,20(Z1):8-14

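We test and analyze the scalability of SMG and BoomerAMG, the multigrid solvers in the high-performance preconditioner library HYPRE, on several thousand processors of a large-scale parallel machine built in China, and obtain several conclusions that are instructive for research on linear solver algorithms and for the development of parallel implementation techniques. These conclusions offer some guidance for the application and development of linear solvers in the numerical simulation of real, complex physical systems.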

5.

Massively parallel quantum computer simulator

K. De Raedt H. De Raedt B. Trieu 《Computer Physics Communications》2007,176(2):121-136

We describe portable software to simulate universal quantum computers on massively parallel computers. We illustrate the use of the simulation software by running various quantum algorithms on different computer architectures, such as an IBM BlueGene/L, an IBM Regatta p690+, a Hitachi SR11000/J1, a Cray X1E, an SGI Altix 3700 and clusters of PCs running Windows XP. We study the performance of the software by simulating quantum computers containing up to 36 qubits, using up to 4096 processors and up to 1 TB of memory. Our results demonstrate that the simulator exhibits nearly ideal scaling as a function of the number of processors and suggest that the simulation software described in this paper may also serve as a benchmark for testing high-end parallel computers.

6.

Relationships between efficiency and execution time of full multigrid methods on parallel computers

Martin I. Tirado F. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):562-573

The large number of processing elements in current parallel systems necessitates the development of more comprehensive and realistic tools for the scalability analysis of algorithms on those architectures. This paper presents a simple analytical tool with which to study the scalability of parallel algorithm-architecture combinations. Our practical method studies execution time, efficiency, and memory usage separately in the accuracy-critical scaling model, in which the problem size (the input data set size) increases with the number of processors and which is the relevant model in many situations. The paper defines quantitative and qualitative measurements of the scalability and derives important relationships between execution time and efficiency. For example, results show that the best way to scale the system (that is, to degrade its properties as little as possible) is by maintaining constant execution time. These analytical results are verified with one candidate application for massively parallel computers: the full multigrid method. We study the scalability of a general d-dimensional full multigrid method on an r-dimensional mesh of processors. The analytical expressions are verified through experimental results obtained by implementing the full multigrid method on a Transputer-based machine and on the CRAY T3D.

7.

Parallel processing in finite element structural analysis

Ahmed K. Noor 《Engineering with Computers》1988,3(4):225-241

A brief review is made of the fundamental concepts and basic issues of parallel processing. Discussion focuses on mechanisms for parallel processing, construction and implementation of parallel numerical algorithms, performance evaluation of parallel processing machines and numerical algorithms, and parallelism in finite element computations. A novel partitioning strategy is outlined for maximizing the degree of parallelism on computers with a small number of powerful processors.

8.

Parallel Pascal: An extended Pascal for parallel computers

Anthony P. Reeves 《Journal of Parallel and Distributed Computing》1984,1(1):64-80

Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

9.

Proteus: A reconfigurable computational network for computer vision

Robert M. Haralick Arun K. Somani Craig Wittenbrink Robert Johnson Kenneth Cooper Linda G. Shapiro Ihsin T. Phillips Jenq-Neng Hwang William Cheung Yung Hsi Yao Chung-Ho Chen Larry Yang Brian Daugherty Bob Lorbeski Kent Loving Tom Miller Larye Parkins Steve Soos 《Machine Vision and Applications》1995,8(2):85-100

The Proteus architecture is a highly parallel, multiple instruction, multiple data machine (MIMD) optimized for large granularity tasks such as machine vision and image processing. The system can achieve 20 gigaflops (80 gigaflops peak). It accepts data via multiple serial links at a rate of up to 640 MB/s. The system employs a hierarchical reconfigurable interconnection network with the highest level being a circuit-switched enhanced hypercube, serial interconnection network for internal data transfers. The system is designed to use 256 to 1024 RISC processors. The processors use 1-MB external read/write allocating caches for reduced multiprocessor contention. The system detects, locates, and replaces faulty subsystems using redundant hardware to facilitate fault tolerance. The parallelism is directly controllable through an advanced software system for partitioning, scheduling, and development. System software includes a translator for the INSIGHT language, a parallel debugger, low- and high-level simulators, and a message-passing system for all control needs. Image-processing application software includes a variety of point operators, neighborhood operators, convolution, and the mathematical morphology operations of binary and gray-scale dilation, erosion, opening, and closing.

10.

Practical parallel Union-Find algorithms for transitive closure and clustering

G. Cybenko T. G. Allen J. E. Polito 《International journal of parallel programming》1988,17(5):403-423

Practical parallel algorithms, based on classical sequential Union-Find algorithms for computing transitive closures of binary relations, are described and implemented for both shared memory and distributed memory parallel computers. By practical algorithms, we mean algorithms that are efficient for parallel systems with bounded numbers of processors as opposed to algorithms where the number of processors grows with the problem size. Transitive closures are useful for decomposing many application problems into independent subproblems. The implementations were on an ENCORE Multimax shared memory machine and an NCUBE hypercube. Our implementations indicate that transitive closure computations are intrinsically difficult for distributed memory parallel machines because of the need for global information. By contrast, our results for shared memory machines exhibited excellent speedups. Supported in part by NSF Grant DCR-8619103, ONR contract N000-86-G-0202 and DOE Grant DE-FG02-85ER25001. Supported in part by RADC contract F30602-85-C-0303.
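
For readers unfamiliar with the sequential starting point, the following is a minimal sketch of a classical Union-Find structure (union by rank with path compression) used to group elements into the classes induced by a set of pairs. It assumes the relation is treated as symmetric, so the classes are connected components; the class and function names are hypothetical, and this is not the parallel algorithm from the paper.

```python
class UnionFind:
    """Classical sequential Union-Find with union by rank and path compression."""

    def __init__(self, n):
        self.parent = list(range(n))  # each element starts as its own root
        self.rank = [0] * n

    def find(self, x):
        # First pass: locate the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: point every node on the path directly at the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, a, b):
        """Merge the classes of a and b; return True if they were distinct."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra  # attach the shallower tree under the deeper one
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def closure_classes(n, pairs):
    """Group 0..n-1 into classes connected by the given (a, b) pairs."""
    uf = UnionFind(n)
    for a, b in pairs:
        uf.union(a, b)
    classes = {}
    for x in range(n):
        classes.setdefault(uf.find(x), []).append(x)
    return sorted(classes.values())


if __name__ == "__main__":
    print(closure_classes(6, [(0, 1), (1, 2), (4, 5)]))
```

Each resulting class is an independent subproblem in the sense used in the abstract: no pair relates elements of two different classes.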

11.

The Connection Machine: PDE solution on 65 536 processors

《Parallel Computing》1988,9(1):1-24

The Connection Machine is a massively parallel architecture with 65 536 single-bit processors and 32 Mbytes of memory, organized as a high-dimensional hypercube. A sophisticated router system provides efficient communication between remote processors. A rich software environment, including a parallel extension of COMMON LISP, provides access to the processors and network. Virtual processor capability extends the degree of fine-grained parallelism beyond 1 000 000. We describe the hardware and the parallel programming environment. We then present implementations of SOR, Multigrid and Conjugate Gradient algorithms for solving Partial Differential Equations on the Connection Machine. Measurements of computational efficiency are provided as well as an analysis of opportunities for achieving better performance. Despite the lack of floating-point hardware, computation rates above 100 Mflops have been achieved in PDE solution. Virtual processors prove to be a real advantage, easing the effort of software development while improving system performance significantly.

12.

Fast bio-inspired computation using a GPU-based systemic computer

Marjan Rouhipour Peter J. Bentley Hooman Shayani 《Parallel Computing》2010,36(10-11):591-617

Biology is inherently parallel. Models of biological systems and bio-inspired algorithms also share this parallelism, although most are simulated on serial computers. Previous work created the systemic computer – a new model of computation designed to exploit many natural properties observed in biological systems, including parallelism. The approach has been proven through two existing implementations and many biological models and visualizations. However, to date the systemic computer implementations have all been sequential simulations that do not exploit the true potential of the model. In this paper the first ever parallel implementation of systemic computation is introduced. The GPU Systemic Computation Architecture is the first implementation that enables parallel systemic computation by exploiting the multiple cores available in graphics processors. Comparisons with the serial implementation when running two programs at different scales show that as the number of systems increases, the parallel architecture is several hundred times faster than the existing implementations, making it feasible to investigate systemic models of more complex biological systems.

13.

Finite element analysis on the BBN butterfly multiprocessor

《Computers & Structures》1987,27(1):13-21

Applications of the finite element method can easily exhaust CPU and memory resources on conventional uniprocessor machines, and are thus prime candidates for multiprocessor application. In order to investigate various parallel processing methodologies and to evaluate their performance, we have developed a finite element program for the Butterfly Parallel Processor. The overall organization of the program and its data structures are described along with parallel implementations of well-known algorithms for transient and static analysis. Results are presented for two implementations of explicit time integration, and for static analysis using a parallel version of Gaussian elimination. For comparison, results from running the same program on a DEC VAX 11/785 are also included. The transient cases yielded near-linear (within 2.5%) speedup in the full range of 1–64 processors; the static case results showed significant nonlinearity over the same range.

14.

FFTs in external or hierarchical memory (cited by: 2 in total; self-citations: 0; citations by others: 2)

David H. Bailey 《The Journal of supercomputing》1990,4(1):23-35

Conventional algorithms for computing large one-dimensional fast Fourier transforms (FFTs), even those algorithms recently developed for vector and parallel computers, are largely unsuitable for systems with external or hierarchical memory. The principal reason for this is the fact that most FFT algorithms require at least m complete passes through the data set to compute a 2^m-point FFT. This paper describes some advanced techniques for computing an ordered FFT on a computer with external or hierarchical memory. These algorithms (1) require as few as two passes through the external data set, (2) employ strictly unit stride, long vector transfers between main memory and external storage, (3) require only a modest amount of scratch space in main memory, and (4) are well suited for vector and parallel computation. Performance figures are included for implementations of some of these algorithms on Cray supercomputers. Of interest is the fact that a main memory version outperforms the current Cray library FFT routines on the CRAY-2, the CRAY X-MP, and the CRAY Y-MP systems. Using all eight processors on the CRAY Y-MP, this main memory routine runs at nearly two gigaflops. A condensed version of this paper previously appeared in the Proceedings of Supercomputing '89.
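
One well-known way to cut the number of sweeps over the data is to factor the transform length as N = n1 · n2 and replace one long transform by two rounds of short transforms with a twiddle-factor multiplication in between. The sketch below shows only that arithmetic structure, in memory and with a direct O(n²) DFT standing in for the short transforms; it is an illustration under those assumptions, not the algorithms of the paper, and the function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct discrete Fourier transform (reference implementation)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def factored_dft(x, n1, n2):
    """DFT of length n1 * n2 computed as two rounds of short transforms.

    Input index  j = j1 + n1 * j2, output index  k = k2 + n2 * k1.
    """
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")

    # Round 1: n1 transforms of length n2 over the strided slices x[j1::n1].
    rows = [dft(x[j1::n1]) for j1 in range(n1)]

    # Twiddle factors: multiply entry (j1, k2) by exp(-2*pi*i*j1*k2 / n).
    for j1 in range(n1):
        for k2 in range(n2):
            rows[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)

    # Round 2: n2 transforms of length n1 down the columns.
    out = [0j] * n
    for k2 in range(n2):
        column = dft([rows[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[k2 + n2 * k1] = column[k1]
    return out


if __name__ == "__main__":
    data = [complex(v) for v in range(8)]
    error = max(abs(a - b) for a, b in zip(factored_dft(data, 2, 4), dft(data)))
    print(f"max difference from direct DFT: {error:.2e}")
```

Each round reads and writes every element once, so the data set is swept twice regardless of how large N is, provided the short transforms fit in fast memory.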

15.

More practical parallel computation models (cited by: 7 in total; self-citations: 0; citations by others: 7)

Guoliang Chen 《小型微型计算机系统 (Mini-Micro Systems)》1995,16(2):1-9

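Many of the parallel algorithms reported in the past run well on small-scale parallel machines, yet perform poorly when ported to large-scale parallel machines. One reason is that parallel computation models such as the PRAM are too abstract and omit factors, such as communication and synchronization, that cannot be ignored when an algorithm actually runs. This paper introduces several more practical parallel computation models proposed so far that better reflect the performance of modern parallel machines, including the asynchronous PRAM, BSP, LogP and C3 models. These models remain open to debate with respect to how closely they match real parallel machines, how usable they are, and how tractable they are when analyzing more complex algorithms, but they have indeed opened new avenues for research on parallel computation models and have become one of the hot topics in current parallel algorithm research.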
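
To make the contrast with the PRAM concrete, here is a minimal sketch of the usual BSP cost accounting, in which a superstep with maximum local work w and maximum per-processor message volume h costs w + h·g + l, where g is the communication gap and l the barrier synchronization cost. The function names and the example numbers are hypothetical; the PRAM-style count is simplified to local work only.

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost of a program given as a list of (w, h) supersteps.

    w: maximum local computation on any processor in the superstep
    h: maximum number of messages sent or received by any processor
    g: cost per message (communication gap), l: barrier synchronization cost
    """
    return sum(w + h * g + l for w, h in supersteps)


def pram_style_cost(supersteps):
    """Cost when communication and synchronization are treated as free."""
    return sum(w for w, _ in supersteps)


if __name__ == "__main__":
    program = [(100, 10), (50, 5)]  # hypothetical two-superstep program
    print("PRAM-style cost:", pram_style_cost(program))
    print("BSP cost:       ", bsp_cost(program, g=4, l=20))
```

The gap between the two totals grows with g and l, which is one way to see why an algorithm that looks efficient under the PRAM can perform poorly on a large machine.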

16.

Simulated Annealing Based Parallel State Assignment of Finite State Machines

《Journal of Parallel and Distributed Computing》1997,43(1):21-35

Simulated annealing is an effective tool in many optimization problems in VLSI CAD, but its time requirements are prohibitive. In this paper, we report parallel algorithms for a well-established simulated annealing based algorithm for the state assignment problem for finite state machines. Our parallel annealing strategy uses parallel moves by multiple processes, each performing local moves within its assigned subspace of the state encoding space. The novelty is in the dynamic repartitioning of the state space among processors, so that each processor gets to perform moves on the entire space over time. This is important to keep the quality of the parallel algorithm comparable to the serial algorithm. Our algorithm gives quality results within 0.05% of the serial algorithm on 64 processors. Our algorithms, ProperJEDI and PartJEDI, are portable across a wide range of MIMD machines. PartJEDI is memory scalable and is incrementally developed from ProperJEDI, which is data replicated. For a large circuit, ProperJEDI reduces the runtime from 11 h to 10 min on a 64-processor machine. For the same circuit, PartJEDI reduces the runtime from 11 h to 20 min and memory requirement from 114 to 2 MB.

17.

Factoring: Algorithms, computations, and computers

Duncan A. Buell 《The Journal of supercomputing》1987,1(2):191-216

This article discusses the computational structure of the most effective methods for factoring integers and the computer architectures (existing and used, proposed, and under construction) which efficiently perform the computations of these various methods. New developments in technology and in pricing of computers are making it possible to build powerful parallel machines, at relatively low cost, which can substantially outperform standard computers on specific types of computations. The intent of this article is to use factoring and computers for factoring to provoke general thought about this matching of computer architectures to algorithms and computations. The author's research at Louisiana State University was supported in part by the National Science Foundation and the National Security Agency under grants NSF DCR 83-115-80 and NSA MDA904-85-H-0006.

18.

Parallel pivoting algorithms for sparse symmetric matrices

Frans J. Peters 《Parallel Computing》1984,1(1):99-110

This paper investigates which pivots may be processed simultaneously when solving a set of linear equations. It is shown that for dense sets of equations all the pivots must necessarily be processed one at a time; only if the set is sufficiently sparse may some pivots be processed simultaneously. We present parallel pivoting algorithms for MIMD computers with sufficiently many processors and a common memory. Moreover, we present algorithms for MIMD computers with an arbitrary but fixed number of processors. For both types of computers, algorithms embodying an ordering strategy are given.

19.

Scalability aspects of parallel multigrid

J. Linden G. Lonsdale H. Ritzdorf A. Schüller 《Future Generation Computer Systems》1994,10(4):429-439

This paper summarizes theoretical and practical investigations into the effect of parallelization by grid-partitioning on the performance of multigrid methods for the solution of partial differential equations on general two-dimensional domains. Particular emphasis will be placed on the algorithmic scalability for MIMD distributed memory systems. Experimental results for two Navier-Stokes test problems, presented in the last section of the paper, show that the theoretically predicted dependency of the combined numerical and parallel efficiencies of multigrid methods on the number of processors employed is in fact very weak. This leads to the conclusion that multigrid is an appropriate candidate for solving partial differential equations on massively parallel machines.

20.

Reliable performance prediction for multigrid software on distributed memory systems

Giuseppe Romanazzi Peter K. Jimack Christopher E. Goodyer 《Advances in Engineering Software》2011,42(5):247-258

We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.