1.

New architectures: Performance highlights and new algorithms

Oliver A. McBryan 《Parallel Computing》1988,7(3):477-499

Parallel computers are having a profound impact on computational science. Recently highly parallel machines have taken the lead as the fastest supercomputers, a trend that is likely to accelerate in the future. We describe some of these new computers, and issues involved in using them. We present elliptic PDE solutions currently running at 3.8 gigaflops, and an atmospheric dynamics model running at 1.7 gigaflops, on a 65 536-processor computer.

One intrinsic disadvantage of a parallel machine is the need to perform inter-processor communication. It is important to ensure that such communication time is maintained at a small fraction of computation time. We analyze standard multigrid algorithms in two and three dimensions from this point of view, indicating that performance efficiencies in excess of 95% are attainable under suitable conditions on moderately parallel machines. We also demonstrate that such performance is not attainable for multigrid on massively parallel computers, as indicated by an example of poor multigrid efficiency on 65 536 processors. The fundamental difficulty is the inability to keep 65 536 processors busy when operating on very coarse grids.

Most algorithms used for implementing applications on parallel machines have been derived directly from algorithms designed for serial machines. The previously mentioned multigrid example indicates that such ‘parallelized’ algorithms may not always be optimal. Parallel machines open the possibility of finding totally new approaches to solving standard tasks: intrinsically parallel algorithms. In particular, we present a class of superconvergent multiple scale methods that were motivated directly by massively parallel machines. These methods differ from standard multigrid methods in an intrinsic way, and allow all processors to be used at all times, even when processing on the coarsest grid levels. Their serial versions are not sensible algorithms. The idea that parallel hardware (the Connection Machine in this case) can lead to discovery of new mathematical algorithms was surprising for us.
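The coarse-grid difficulty described in this abstract can be illustrated with a small counting sketch. It is not taken from the paper: the function name, the factor-of-4 coarsening (typical for 2D grids), and the assumption that a processor is "busy" whenever it owns at least one grid point are all illustrative simplifications.

```python
def busy_fraction_per_level(finest_points: int, processors: int, coarsening: int = 4):
    """Return (points, busy_fraction) for each grid level, finest level first.

    Hypothetical model: a processor counts as busy if it owns at least one
    grid point, and each coarser grid has 1/coarsening as many points.
    """
    levels = []
    points = finest_points
    while points >= 1:
        # At most one processor per point can do useful work on this level.
        busy = min(points, processors) / processors
        levels.append((points, busy))
        points //= coarsening
    return levels


levels = busy_fraction_per_level(finest_points=4**10, processors=65536)
for points, busy in levels:
    print(f"{points:>8d} points: {busy:9.4%} of processors busy")
```

Under these assumptions, every level with fewer points than processors leaves part of the machine idle, which is the effect the abstract attributes to standard multigrid on 65 536 processors.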

2.

Relationships between efficiency and execution time of full multigrid methods on parallel computers

Martin I. Tirado F. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):562-573

The large number of processing elements in current parallel systems necessitates the development of more comprehensive and realistic tools for the scalability analysis of algorithms on those architectures. This paper presents a simple analytical tool with which to study the scalability of parallel algorithm-architecture combinations. Our practical method studies separately execution time, efficiency, and memory usage in the accuracy-critical scaling model, where the problem size (the input data set size) increases with the number of processors, which is the relevant one in many situations. The paper defines quantitative and qualitative measurements of the scalability and derives important relationships between execution time and efficiency. For example, results show that the best way to scale the system (to deteriorate as little as possible the properties of the system) is by maintaining constant execution time. These analytical results are verified with one candidate application for massively parallel computers: the full multigrid method. We study the scalability of a general d-dimensional full multigrid method on an r-dimensional mesh of processors. The analytical expressions are verified through experimental results obtained by implementing the full multigrid method on a Transputer-based machine and on the CRAY T3D.
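The quantities this abstract relates (execution time and efficiency as the number of processors grows) can be made concrete with a toy cost model. The model below is an assumption made only for illustration and is not the paper's analytical tool: `exec_time`, its parameters, and the communication term (a startup cost growing with log2 of the processor count plus a boundary term proportional to the square root of the local problem size) are all hypothetical.

```python
import math


def exec_time(n: int, p: int, t_calc: float = 1.0,
              t_word: float = 10.0, t_startup: float = 100.0) -> float:
    """Hypothetical execution time for problem size n on p processors."""
    work = t_calc * n / p
    if p == 1:
        return work  # no communication on a single processor
    # Assumed communication cost: startup per exchange stage plus boundary data.
    comm = t_startup * math.log2(p) + t_word * math.sqrt(n / p)
    return work + comm


def efficiency(n: int, p: int) -> float:
    """Parallel efficiency: serial time divided by p times parallel time."""
    return exec_time(n, 1) / (p * exec_time(n, p))


n_fixed = 2**20
for p in (1, 4, 16, 64):
    print(f"p={p:3d}  fixed n: E={efficiency(n_fixed, p):.3f}  "
          f"scaled n=p*2^14: E={efficiency(p * 2**14, p):.3f}")
```

In this sketch, holding the problem size fixed lets efficiency fall as processors are added, while growing the problem with the machine slows that decline. How fast the problem should grow is the question the paper studies.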

3.

Scalability aspects of parallel multigrid

J. Linden G. Lonsdale H. Ritzdorf A. Schüller 《Future Generation Computer Systems》1994,10(4):429-439

This paper summarizes theoretical and practical investigations into the effect of parallelization by grid-partitioning on the performance of multigrid methods for the solution of partial differential equations on general two-dimensional domains. Particular emphasis will be placed on the algorithmic scalability for MIMD distributed memory systems. Experimental results for two Navier-Stokes test problems, presented in the last section of the paper, show that the theoretically predicted dependency of the combined numerical and parallel efficiencies of multigrid methods on the number of processors employed is in fact very weak. This leads to the conclusion that multigrid is an appropriate candidate for solving partial differential equations on massively parallel machines.

4.

Numerical and computational efficiency of solvers for two-phase problems

O. Axelsson P. Boyanova M. Kronbichler M. Neytcheva X. Wu 《Computers & Mathematics with Applications》2013,65(3):301-314

We consider two-phase flow problems, modelled by the Cahn–Hilliard equation. In this work, the nonlinear fourth-order equation is decomposed into a system of two coupled second-order equations for the concentration and the chemical potential. We analyse solution methods based on an approximate two-by-two block factorization of the Jacobian of the nonlinear discrete problem. We propose a preconditioning technique that reduces the problem of solving the non-symmetric discrete Cahn–Hilliard system to a problem of solving systems with symmetric positive definite matrices where off-the-shelf multilevel and multigrid algorithms are directly applicable. The resulting solution methods exhibit optimal convergence and computational complexity properties and are suitable for parallel implementation. We illustrate the efficiency of the proposed methods by various numerical experiments, including parallel results for large scale three dimensional problems.

5.

Performance estimations for SUPRENUM systems

O. Kolp H. Mierendorff 《Parallel Computing》1988,7(3):357-366

A SUPRENUM system consists of many independent processors connected by a hierarchical bus system. Application problems are usually parallelized by decomposition into processes which are mapped onto the processors. Standard multigrid methods for the Poisson equation are considered as a model problem. An abstract model of the SUPRENUM system is developed consisting of five essential components. Their performance is approximated by linear functions. The efficiency and speedup of the considered parallel algorithms are estimated for several system and problem sizes. Parameter studies show the influence of the most important system parameters. The results are extended to some other multigrid algorithms.

6.

Improving the efficiency of preconditioning for iterative methods

M. Papadrakakis M. C. Dracopoulos 《Computers & Structures》1991,41(6):1263-1272

Techniques based on the idea of preconditioning and on the element-by-element concept have significantly improved the efficiency of classical iterative methods in conventional as well as parallel hardware environments. In this work two preconditioning approaches based on the incomplete Choleski factorization have been further refined, with the result that both storage requirements and solution times have been greatly improved when processing large structural problems. The partial preconditioning method is also employed to develop a framework for constructing a global preconditioner, without the need to store the complete coefficient matrix, for accelerating an iterative element-by-element solution procedure.

7.

Preconditioned CG methods for sparse matrices on massively parallel machines

《Parallel Computing》1997,23(3):381-398

Conjugate gradient (CG) methods to solve sparse systems of linear equations play an important role in numerical methods for solving discretized partial differential equations. The large size and the condition of many technical or physical applications in this area result in the need for efficient parallelization and preconditioning techniques of the CG method, in particular on massively parallel machines. Here, the data distribution and the communication scheme for the sparse matrix operations of the preconditioned CG are based on the analysis of the indices of the non-zero elements. Polynomial preconditioning is shown to reduce global synchronizations considerably, and a fully local incomplete Cholesky preconditioner is presented. On a PARAGON XP/S 10 with 138 processors, the developed parallel methods outperform diagonally scaled CG markedly with respect to both scaling behavior and execution time for many matrices from real finite element applications.
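The two preconditioners this abstract contrasts, diagonal scaling and polynomial preconditioning, can be sketched serially in NumPy. This is a minimal illustration under stated assumptions, not the paper's parallel implementation: the truncated Neumann series is only one possible choice of polynomial, and the function names and the 1D Poisson test matrix are hypothetical. The point of the polynomial variant is that it uses only matrix-vector products and no extra inner products, and inner products are what force global synchronization on a distributed machine.

```python
import numpy as np


def pcg(A, b, apply_prec, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradients for a symmetric positive definite A."""
    x = np.zeros_like(b)
    r = b.copy()                      # residual for the zero initial guess
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, k
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter


def jacobi_prec(A):
    """Diagonal scaling: z = D^{-1} r."""
    d_inv = 1.0 / np.diag(A)
    return lambda r: d_inv * r


def neumann_prec(A, degree):
    """Truncated Neumann series: z = sum_{k=0}^{degree} (I - D^{-1}A)^k D^{-1} r."""
    d_inv = 1.0 / np.diag(A)

    def apply(r):
        term = d_inv * r
        z = term.copy()
        for _ in range(degree):
            term = term - d_inv * (A @ term)   # multiply by (I - D^{-1} A)
            z = z + term
        return z

    return apply


# Hypothetical test problem: 1D Poisson matrix (tridiagonal 2, -1).
n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x_j, its_j = pcg(A, b, jacobi_prec(A))
x_p, its_p = pcg(A, b, neumann_prec(A, degree=2))
print("diagonal scaling iterations:", its_j)
print("polynomial (degree 2) iterations:", its_p)
```

Each polynomial-preconditioned iteration costs more matrix-vector products, so whether it pays off depends on how expensive synchronization is relative to local work on the target machine.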

8.

A non-intrusive parallel-in-time adjoint solver with the XBraid library

Günther Stefanie Gauger Nicolas R. Schroder Jacob B. 《Computing and Visualization in Science》2018,19(3-4):85-95

In this paper, an adjoint solver for the multigrid-in-time software library XBraid is presented. XBraid provides a non-intrusive approach for simulating unsteady dynamics on multiple processors while parallelizing not only in space but also in the time domain (XBraid: Parallel multigrid in time, http://llnl.gov/casc/xbraid). It applies an iterative multigrid reduction in time algorithm to existing spatially parallel classical time propagators and computes the unsteady solution parallel in time. Techniques from Automatic Differentiation are used to develop a consistent discrete adjoint solver which provides sensitivity information of output quantities with respect to design parameter changes. The adjoint code runs backwards through the primal XBraid actions and accumulates gradient information parallel in time. It is highly non-intrusive as existing adjoint time propagators can easily be integrated through the adjoint interface. The adjoint code is validated on advection-dominated flow with periodic upstream boundary condition. It provides similar strong scaling results as the primal XBraid solver and offers great potential for speeding up the overall computational costs for sensitivity analysis using multiple processors.

9.

Parallel iterative solvers for sparse linear systems in circuit simulation

A. U. M. K. 《Future Generation Computer Systems》2005,21(8):1275-1284

For the solution of sparse linear systems from circuit simulation whose coefficient matrices include a few dense rows and columns, a parallel iterative algorithm with distributed Schur complement preconditioning is presented. The parallel efficiency of the solver is increased by transforming the equation system into a problem without dense rows and columns as well as by exploitation of parallel graph partitioning methods. The costs of local, incomplete LU decompositions are decreased by fill-in reducing reordering methods of the matrix and a threshold strategy for the factorization. The efficiency of the parallel solver is demonstrated with real circuit simulation problems on PC clusters.

10.

Parallel decomposition of unstructured FEM-meshes

Ralf Diekmann Derk Meyer Burkhard Monien 《Concurrency and Computation》1998,10(1):53-72

We present a parallel algorithm for static and dynamic partitioning of unstructured FEM-meshes. The method consists of two parts. First a fast but inaccurate sequential clustering is determined which is used, together with a simple mapping heuristic, to map the mesh initially onto the processors of a parallel system. The second part of the method uses a massively parallel algorithm to remap and optimize the mesh decomposition, taking several cost functions into account which reflect the characteristics of the underlying hardware and the requirements of the numerical solution method supposed to run after the decomposition. The parallel algorithm first calculates the number of nodes that have to be migrated between pairs of clusters in order to obtain an optimal load balancing. In a second step, nodes to be migrated are chosen according to cost functions optimizing the amount of necessary communication and the shapes of subdomains. The latter criterion is extremely important for the convergence behavior of certain numerical solution methods, especially for preconditioned conjugate gradient methods. The parallel parts of the method are implemented in C under Parix to run on the Parsytec GC systems. Results on up to 64 processors are presented and compared to those of other existing methods. © 1998 John Wiley & Sons, Ltd.

11.

Research and implementation of an interconnection network controller supporting stream data transfer

Ma Chiyuan, Chen Shuming, Xing Zuocheng, Hao Yue 《Computer Engineering & Science》 (计算机工程与科学) 2008,30(9):103-106

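This paper proposes the design of an interconnection network controller that supports stream data transfer. The design is used in the FT64 stream processor and allows multiple stream processors to exchange data over a high-performance network in order to perform parallel stream computation. The design adopts a two-dimensional torus network, uses virtual channels to avoid deadlock, and supports the simultaneous transfer of multiple streams. Post-tape-out test results show that the design functions correctly, with a core frequency of 500 MHz and a link clock frequency of 400 MHz, meeting the design requirements.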
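The two-dimensional torus (wraparound mesh) topology named in this abstract can be sketched with plain coordinate arithmetic. This is only a generic illustration of torus addressing: the function names and the (x, y) node numbering are assumptions and say nothing about how the FT64 controller actually addresses or routes packets.

```python
def torus_neighbors(x: int, y: int, width: int, height: int) -> dict:
    """Return the four neighbors of node (x, y) in a width x height 2D torus."""
    return {
        "east": ((x + 1) % width, y),
        "west": ((x - 1) % width, y),    # wraps from column 0 to width - 1
        "north": (x, (y - 1) % height),  # wraps from row 0 to height - 1
        "south": (x, (y + 1) % height),
    }


def torus_hops(src, dst, width: int, height: int) -> int:
    """Minimal hop count between two nodes, using wraparound links if shorter."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


print(torus_neighbors(0, 0, 4, 4))
print(torus_hops((0, 0), (3, 3), 4, 4))
```

The wraparound links shorten worst-case paths, but they also close each row and column into a ring. Rings can produce cyclic channel dependencies, which is the usual reason torus networks rely on virtual channels to avoid deadlock.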

12.

On parallelizing the multiprocessor scheduling problem

Ahmad I. Yu-Kwong Kwok 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(4):414-431

Existing heuristics for scheduling a node and edge weighted directed task graph to multiple processors can produce satisfactory solutions but incur high time complexities, which tend to be exacerbated in more realistic environments with relaxed assumptions. Consequently, these heuristics do not scale well and cannot handle problems of moderate sizes. A natural approach to reducing complexity, while aiming for a similar or potentially better solution, is to parallelize the scheduling algorithm. This can be done by partitioning the task graphs and concurrently generating partial schedules for the partitioned parts, which are then concatenated to obtain the final schedule. The problem, however, is nontrivial as there exist dependencies among the nodes of a task graph which must be preserved for generating a valid schedule. Moreover, the time clock for scheduling is global for all the processors (that are executing the parallel scheduling algorithm), making the inherent parallelism invisible. In this paper, we introduce a parallel algorithm that is guided by a systematic partitioning of the task graph to perform scheduling using multiple processors. The algorithm schedules both the tasks and messages, is suitable for graphs with arbitrary computation and communication costs, and is applicable to systems with arbitrary network topologies using homogeneous or heterogeneous processors. We have implemented the algorithm on the Intel Paragon and compared it with three closely related algorithms. The experimental results indicate that our algorithm yields higher quality solutions while using an order of magnitude smaller scheduling times. The algorithm also exhibits an interesting trade-off between the solution quality and speedup while scaling well with the problem size.

13.

Parallelism in multigrid methods: How much is too much?

Lesley R. Matheson Robert E. Tarjan 《International journal of parallel programming》1996,24(5):397-432

Multigrid methods are powerful techniques to accelerate the solution of computationally-intensive problems arising in a broad range of applications. Used in conjunction with iterative processes for solving partial differential equations, multigrid methods speed up iterative methods by moving the computation from the original mesh covering the problem domain through a series of coarser meshes. But this hierarchical structure leaves domain-parallel versions of the standard multigrid algorithms with a deficiency of parallelism on coarser grids. To compensate, several parallel multigrid strategies with more parallelism, but also more work, have been designed. We examine these parallel strategies and compare them to simpler standard algorithms to try to determine which techniques are more efficient and practical. We consider three parallel multigrid strategies: (1) domain-parallel versions of the standard V-cycle and F-cycle algorithms; (2) a multiple coarse grid algorithm, proposed by Fredrickson and McBryan, which generates several coarse grids for each fine grid; and (3) two Rosendale algorithms, which allow computation on all grids simultaneously. We study an elliptic model problem on simple domains, discretized with finite difference techniques on block-structured meshes in two or three dimensions with up to 10⁶ or 10⁹ points, respectively. We analyze performance using three models of parallel computation: the PRAM and two bridging models. The bridging models reflect the salient characteristics of two kinds of parallel computers: SIMD fine-grain computers, which contain a large number of small (bitserial) processors, and SPMD medium-grain computers, which have a more modest number of powerful (single chip) processors. Our analysis suggests that the standard algorithms are substantially more efficient than algorithms utilizing either parallel strategy. Both parallel strategies need too much extra work to compensate for their extra parallelism. They require a highly impractical number of processors to be competitive with simpler, standard algorithms. The analysis also suggests that the F-cycle, with the appropriate optimization techniques, is more efficient than the V-cycle under a broad range of problem, implementation, and machine characteristics, despite the fact that it exhibits even less parallelism than the V-cycle. Research at Princeton University partially supported by the National Science Foundation, Grant No. CCR-8920505, and the Office of Naval Research, Contract No. N0014-91-J-1463.
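The remark that the F-cycle exhibits less parallelism than the V-cycle can be illustrated by counting how often each grid level is visited in one cycle. The sketch assumes one common recursive definition of the F-cycle (an F-cycle on the next coarser level followed by a V-cycle on that level). The function name and the convention that level 0 is the coarsest grid are illustrative, and the paper's own cost models are considerably more detailed.

```python
from collections import Counter


def cycle_visits(level: int, kind: str, visits: Counter = None) -> Counter:
    """Count visits to each grid level in one multigrid cycle (level 0 = coarsest)."""
    if visits is None:
        visits = Counter()
    visits[level] += 1
    if level == 0:
        return visits
    if kind == "V":
        cycle_visits(level - 1, "V", visits)
    elif kind == "F":
        # Assumed definition: F-cycle, then V-cycle, on the next coarser level.
        cycle_visits(level - 1, "F", visits)
        cycle_visits(level - 1, "V", visits)
    else:
        raise ValueError(f"unknown cycle type: {kind}")
    return visits


finest = 3
for kind in ("V", "F"):
    counts = cycle_visits(finest, kind)
    print(kind, [counts[lvl] for lvl in range(finest, -1, -1)])
```

Under this definition the F-cycle returns to the coarse levels more often than the V-cycle does. Those are the levels with too few grid points to occupy a large machine.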

14.

A comparison of parallel solvers for the incompressible Navier–Stokes equations

V. John 《Computing and Visualization in Science》1999,1(4):193-200

The paper compares coupled multigrid methods and pressure Schur complement schemes (operator splitting schemes) for the solution of the steady state and time dependent incompressible Navier–Stokes equations. We consider pressure Schur complement schemes with multigrid as well as single grid methods for the solution of the Schur complement problem for the pressure. The numerical tests have been carried out on benchmark problems using a MIMD parallel computer. They show the superiority of the coupled multigrid methods for the considered class of problems. Received: 14 October 1997 / Accepted: 11 February 1998

15.

Reliable performance prediction for multigrid software on distributed memory systems

Giuseppe Romanazzi Peter K. Jimack Christopher E. Goodyer 《Advances in Engineering Software》2011,42(5):247-258

We propose a model for describing and predicting the parallel performance of a broad class of parallel numerical software on distributed memory architectures. The purpose of this model is to allow reliable predictions to be made for the performance of the software on large numbers of processors of a given parallel system, by only benchmarking the code on small numbers of processors. Having described the methods used, and emphasized the simplicity of their implementation, the approach is tested on a range of engineering software applications that are built upon the use of multigrid algorithms. Despite their simplicity, the models are demonstrated to provide both accurate and robust predictions across a range of different parallel architectures, partitioning strategies and multigrid codes. In particular, the effectiveness of the predictive methodology is shown for a practical engineering software implementation of an elastohydrodynamic lubrication solver.

16.

Mapping Conjugate Gradient Algorithms for Neutron Diffusion Applications onto SIMD, MIMD, and Mixed-Mode Machines

John John E. So Thomas J. Downar Raghunandan Janardhan Howard Jay Siegel 《International journal of parallel programming》1998,26(2):183-207

The performance of conjugate gradient (CG) algorithms for the solution of the system of linear equations that results from the finite-differencing of the neutron diffusion equation was analyzed on SIMD, MIMD, and mixed-mode parallel machines. A block preconditioner based on the incomplete Cholesky factorization was used to accelerate the conjugate gradient search. The issues involved in mapping both the unpreconditioned and preconditioned conjugate gradient algorithms onto the mixed-mode PASM prototype, the SIMD MasPar MP-1, and the MIMD Intel Paragon XP/S are discussed. On PASM, the mixed-mode implementation outperformed either SIMD or MIMD alone. Theoretical performance predictions were analyzed and compared with the experimental results on the MasPar MP-1 and the Paragon XP/S. Other issues addressed include the impact on execution time of the number of processors used, the effect of the interprocessor communication network on performance, and the relationship of the number of processors to the quality of the preconditioning. Applications studies such as this are necessary in the development of software tools for mapping algorithms onto either a single parallel machine or a heterogeneous suite of parallel machines.

17.

The auxiliary space method and optimal multigrid preconditioning techniques for unstructured grids

J. Xu 《Computing》1996,56(3):215-235

An abstract framework of the auxiliary space method is proposed and, as an application, an optimal multigrid technique is developed for general unstructured grids. The auxiliary space method is a (nonnested) two level preconditioning technique based on a simple relaxation scheme (smoother) and an auxiliary space (that may be roughly understood as a nonnested coarser space). An optimal multigrid preconditioner is then obtained for a discretized partial differential operator defined on an unstructured grid by using an auxiliary space defined on a more structured grid in which a further nested multigrid method can be naturally applied. This new technique makes it possible to apply multigrid methods to general unstructured grids without too much more programming effort than traditional solution methods. Some simple examples are also given to illustrate the abstract theory; for instance, the Morley finite element space is used as an auxiliary space to construct a preconditioner for the Argyris element for biharmonic equations. Some numerical results are also given to demonstrate the efficiency of using a structured grid as the auxiliary space to precondition unstructured grids.

18.

Adaptive multigrid for finite element computations in plasticity

Torbjörn Ekevid Per Kettil Nils-Erik Wiberg 《Computers & Structures》2004,82(28):2413-2424

The solution of the system of equilibrium equations is the most time-consuming part in large-scale finite element computations of plasticity problems. The development of efficient solution methods is therefore of utmost importance to the field of computational plasticity. Traditionally, direct solvers have most frequently been used. However, recent developments of iterative solvers and preconditioners may impose a change. In particular, preconditioning by the multigrid technique is especially favorable in FE applications. The multigrid preconditioner uses a number of nested grid levels to improve the convergence of the iterative solver. Fine-grid residual forces are restricted to coarser grids, and the computed corrections are interpolated back to the fine grid so that the fine-grid solution is successively improved. By this technique, large 3D problems that are intractable for solvers based on direct methods can be solved in acceptable time at low memory requirements. By means of a posteriori error estimates the computational grid can be successively refined (adapted) until the solution fulfils a predefined accuracy level. In contrast to procedures where the preceding grids are erased, the previously generated grids are used in the multigrid algorithm to speed up the solution process. The paper presents results from applying the adaptive multigrid procedure to plasticity problems. In particular, different error indicators are tested.
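The transfer of residuals to coarser grids and of corrections back to the fine grid, as described in this abstract, is the core of a multigrid V-cycle. The sketch below shows that mechanism on a deliberately simple stand-in problem, the 1D Poisson equation with finite differences. It is not the paper's finite element plasticity solver and omits the adaptivity. The weighted Jacobi smoother, full-weighting restriction, linear interpolation, and all function names are assumptions chosen for brevity.

```python
import numpy as np


def apply_A(u, h):
    """Apply the 1D operator -u'' (zero Dirichlet boundaries) on spacing h."""
    padded = np.pad(u, 1)
    return (2 * padded[1:-1] - padded[:-2] - padded[2:]) / h**2


def smooth(u, f, h, sweeps=2, omega=2 / 3):
    """Weighted Jacobi smoothing; the diagonal of A is 2 / h^2."""
    for _ in range(sweeps):
        u = u + omega * (h**2 / 2) * (f - apply_A(u, h))
    return u


def restrict(r):
    """Full weighting: coarse point j sits at fine index 2j + 1."""
    return 0.25 * r[0:-2:2] + 0.5 * r[1:-1:2] + 0.25 * r[2::2]


def prolong(e):
    """Linear interpolation of a coarse correction to the fine grid."""
    fine = np.zeros(2 * len(e) + 1)
    fine[1::2] = e
    padded = np.pad(e, 1)
    fine[0::2] = 0.5 * (padded[:-1] + padded[1:])
    return fine


def v_cycle(u, f, h):
    if len(u) == 1:
        return f * h**2 / 2                    # exact solve on the coarsest grid
    u = smooth(u, f, h)                        # pre-smoothing
    residual = f - apply_A(u, h)
    coarse_guess = np.zeros((len(u) - 1) // 2)
    correction = v_cycle(coarse_guess, restrict(residual), 2 * h)
    u = u + prolong(correction)                # coarse-grid correction
    return smooth(u, f, h)                     # post-smoothing


n = 63                                         # interior points, 2^k - 1
h = 1.0 / (n + 1)
x = np.linspace(h, 1 - h, n)
f = np.pi**2 * np.sin(np.pi * x)               # exact solution is sin(pi x)
u = np.zeros(n)
r0 = np.linalg.norm(f - apply_A(u, h))
for _ in range(10):
    u = v_cycle(u, f, h)
r_final = np.linalg.norm(f - apply_A(u, h))
print("residual reduction:", r_final / r0)
print("max error vs exact solution:", np.max(np.abs(u - np.sin(np.pi * x))))
```

Used as a preconditioner rather than a standalone solver, one such cycle would be applied to the residual inside each step of an outer Krylov iteration.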

19.

Array-based, parallel hierarchical mesh refinement algorithms for unstructured meshes

《Computer aided design》2017

In this paper, we describe an array-based hierarchical mesh refinement capability through uniform refinement of unstructured meshes for efficient solution of PDEs using finite element methods and multigrid solvers. A multi-degree, multi-dimensional and multi-level framework is designed to generate the nested hierarchies from an initial coarse mesh that can be used for a variety of purposes, such as in multigrid solvers/preconditioners, to do solution convergence and verification studies, and to improve overall parallel efficiency by decreasing I/O bandwidth requirements (by loading smaller meshes and refining in memory). We also describe a high-order boundary reconstruction capability that can be used to project the new points after refinement using high-order approximations instead of linear projection, in order to minimize and provide more control over geometrical errors introduced by curved boundaries. The capability is developed under the parallel unstructured mesh framework “Mesh Oriented dAtaBase” (MOAB; Tautges et al., 2004). We describe the underlying data structures and algorithms to generate such hierarchies in parallel and present numerical results for computational efficiency and effect on mesh quality. We also present results to demonstrate the applicability of the developed capability to study convergence properties of different point projection schemes for various mesh hierarchies and to a multigrid finite-element solver for elliptic problems.

20.

Parallel methods for optimality criteria-based topology optimization

《Computer Methods in Applied Mechanics and Engineering》2005,194(34-35):3637-3667

Topology optimization problems require the repeated solution of finite element problems that are often extremely ill-conditioned due to highly heterogeneous material distributions. This makes the use of iterative linear solvers inefficient unless appropriate preconditioning is used. Even then, the solution time for topology optimization problems is typically very high. These problems are addressed by considering the use of non-overlapping domain decomposition-based parallel methods for the solution of topology optimization problems. The parallel algorithms presented here are based on the solid isotropic material with penalization (SIMP) formulation of the topology optimization problem and use the optimality criteria method for iterative optimization. We consider three parallel linear solvers to solve the equilibrium problem at each step of the iterative optimization procedure. These include two preconditioned conjugate gradient (PCG) methods: one using a diagonal preconditioner and one using an incomplete LU factorization preconditioner with a drop tolerance. A third substructuring solver that employs a hybrid of direct and iterative (PCG) techniques is also studied. This solver is found to be the most effective of the three solvers studied, both in terms of parallel efficiency and in terms of its ability to mitigate the effects of ill-conditioning. In addition to examining parallel linear solvers, we consider the parallelization of the iterative optimality criteria method. To tackle checkerboarding and mesh dependence, we propose a multi-pass filtering technique that limits the number of “ghost” elements that need to be exchanged across interprocessor boundaries.
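The optimality criteria step mentioned in this abstract is often written as a heuristic density update with a bisection on the Lagrange multiplier of the volume constraint. The serial sketch below shows that widely used form. It is an assumption that the paper's parallel variant has the same shape, and the function name, move limit, bounds, and the synthetic sensitivities are all hypothetical.

```python
import numpy as np


def oc_update(x, dc, volfrac, move=0.2, x_min=1e-3):
    """One optimality-criteria update of element densities x.

    dc holds compliance sensitivities (expected to be non-positive);
    the multiplier is found by bisection so that mean(x_new) ~= volfrac.
    """
    lower = np.maximum(x_min, x - move)   # move limit and lower density bound
    upper = np.minimum(1.0, x + move)     # move limit and upper density bound
    lo, hi = 0.0, 1e5
    while hi - lo > 1e-6:
        mid = 0.5 * (lo + hi)
        x_new = np.clip(x * np.sqrt(-dc / mid), lower, upper)
        if x_new.mean() > volfrac:
            lo = mid                      # too much material: raise the multiplier
        else:
            hi = mid
    return x_new


# Synthetic example: uniform start, sensitivities growing in magnitude.
x0 = np.full(10, 0.5)
dc = -np.linspace(1.0, 2.0, 10)
x1 = oc_update(x0, dc, volfrac=0.5)
print(np.round(x1, 3))
```

In a distributed setting, the element-wise update is purely local. The volume sum inside the bisection needs a global reduction, which is one reason the parallelization of this step merits separate attention.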