Similar Literature
20 similar documents retrieved.
1.
A three-dimensional parallel unstructured non-nested multigrid solver for unsteady incompressible viscous flow is developed and validated. The finite-volume Navier–Stokes solver is based on the artificial compressibility approach with a high-resolution characteristics-based scheme for handling the convection terms. The unsteady flow is calculated with a matrix-free implicit dual time stepping scheme. The parallelization of the multigrid solver is achieved by a multigrid domain decomposition approach (MG-DD), using the single program multiple data (SPMD) and multiple instruction multiple data (MIMD) programming paradigms. Two parallelization strategies are proposed in this work: the first is a one-level strategy using the geometric domain decomposition technique alone; the second is a two-level strategy that combines geometric domain decomposition with data decomposition. The Message-Passing Interface (MPI) and the OpenMP standard are used to communicate data between processors and to decompose loop iteration arrays, respectively. The parallel multigrid code is used to simulate both steady and unsteady incompressible viscous flows over a circular cylinder and a lid-driven cavity flow. A maximum speedup of 22.5 is achieved on 32 processors, for example for the lid-driven cavity flow at Re = 1000. The results agree well with numerical solutions obtained by other researchers as well as with experimental measurements. A detailed study of the time step size and of the number of pseudo-sub-iterations per time step required for simulating unsteady flow is presented in this paper.
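The dual time stepping structure described above can be illustrated with a minimal sketch: each physical time step is converged by an inner loop of pseudo-time sub-iterations (the quantity studied at the end of the abstract). The pseudo_residual and pseudo_update callbacks below are hypothetical placeholders for the artificial-compressibility residual and the implicit relaxation step; this is a sketch of the loop structure, not the authors' solver.

    import numpy as np

    def dual_time_march(q0, dt, n_steps, n_sub, pseudo_residual, pseudo_update, tol=1e-8):
        """Outer loop over physical time steps; inner loop of pseudo-time
        sub-iterations that drive the unsteady residual of each step to zero."""
        q = q0.copy()
        history = [q.copy()]
        for _ in range(n_steps):
            q_old = q.copy()                          # solution at the previous physical level
            for _ in range(n_sub):                    # pseudo-time sub-iterations
                r = pseudo_residual(q, q_old, dt)     # residual including the dq/dt term
                q = pseudo_update(q, r)               # one implicit relaxation in pseudo-time
                if np.linalg.norm(r) < tol:           # sub-iterations converged for this step
                    break
            history.append(q.copy())
        return history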

2.
This paper generalizes the widely used Nelder and Mead (Comput J 7:308–313, 1965) simplex algorithm to parallel processors. Unlike most previous parallelization methods, which are based on parallelizing the tasks required to compute a specific objective function given a vector of parameters, our parallel simplex algorithm uses parallelization at the parameter level. It assigns to each processor a separate vector of parameters corresponding to a point on a simplex. The processors then conduct the simplex search steps for an improved point, communicate the results, and a new simplex is formed. The advantage of this method is that the algorithm is generic and can be applied, without rewriting computer code, to any optimization problem to which the non-parallel Nelder–Mead algorithm is applicable. The method is also easily scalable to any degree of parallelization up to the number of parameters. In a series of Monte Carlo experiments, we show that this parallel simplex method yields computational savings, in some experiments up to three times the number of processors.
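A minimal sketch of the parameter-level idea, assuming mpi4py is available and a user-supplied objective f: every rank reflects its own simplex vertex through the centroid of the remaining vertices, evaluates f at that trial point, and the gathered best trial replaces the worst vertex. This is a simplified illustration under those assumptions, not the authors' complete algorithm (expansion, contraction and shrink steps are omitted).

    import numpy as np
    from mpi4py import MPI

    def parallel_simplex_step(f, simplex, fvals, alpha=1.0):
        """One parameter-level parallel step; `simplex` is an (n+1, n) array of
        vertices and `fvals` holds the objective values at those vertices
        (both kept in sync on every rank)."""
        comm = MPI.COMM_WORLD
        i = comm.Get_rank() % len(simplex)                  # vertex owned by this rank
        centroid = np.mean(np.delete(simplex, i, axis=0), axis=0)
        trial = centroid + alpha * (centroid - simplex[i])  # reflect own vertex
        value = f(trial)                                    # local objective evaluation
        results = comm.allgather((value, trial))            # share all trial results
        best_val, best_trial = min(results, key=lambda r: r[0])
        worst = int(np.argmax(fvals))                       # candidate vertex to replace
        if best_val < fvals[worst]:
            simplex[worst], fvals[worst] = best_trial, best_val
        return simplex, fvals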

3.
Jürgen Bruder, Computing 59(2):139–151, 1997
For the parallelization of implicit Runge-Kutta methods for stiff ODEs, a parallel computation of the stages is the obvious approach. In this paper we consider the parallelization of the stages of linearly-implicit Runge-Kutta methods. The construction and implementation of a parallel linearly-implicit Runge-Kutta method is described. The numerical results are compared with the code PSODE of van der Houwen and Sommeijer [6] and with a straightforward parallelization of RADAU5 [5]. All methods are based on the 3-stage implicit Radau-IIA method.

4.
We have parallelized the FASTA algorithm for biological sequence comparison using Linda, a machine-independent parallel programming language. The resulting parallel program runs on a variety of different parallel machines. A straightforward parallelization strategy works well if the amount of computation to be done is relatively large. When the amount of computation is reduced, however, disk I/O becomes a bottleneck which may prevent additional speed-up as the number of processors is increased. The paper describes the parallelization of FASTA and uses FASTA to illustrate the I/O bottleneck problem that may arise when performing parallel database search with a fast sequence comparison algorithm. The paper also describes several program design strategies that can help with this problem, and discusses how this bottleneck is an example of a general problem that may occur when parallelizing, or otherwise speeding up, a time-consuming computation.
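A minimal sketch of the straightforward strategy, assuming the database is an in-memory list of (name, sequence) pairs and using a toy shared-k-mer count as a hypothetical stand-in for the FASTA similarity score: the query is compared against database chunks in parallel worker processes. In a real setting each worker would read its chunk from disk, which is exactly where the I/O bottleneck described above appears.

    from multiprocessing import Pool

    def score(query, target, k=4):
        """Hypothetical stand-in for the FASTA score: count of shared k-mers."""
        kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
        return sum(target[i:i + k] in kmers for i in range(len(target) - k + 1))

    def score_chunk(args):
        query, chunk = args
        return [(name, score(query, seq)) for name, seq in chunk]

    def parallel_search(query, database, n_workers=4):
        """Split the database into one chunk per worker and score the chunks in parallel."""
        chunks = [database[w::n_workers] for w in range(n_workers)]
        with Pool(n_workers) as pool:
            parts = pool.map(score_chunk, [(query, c) for c in chunks])
        hits = [hit for part in parts for hit in part]
        return sorted(hits, key=lambda h: h[1], reverse=True)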

5.
A hybrid message passing and shared memory parallelization technique is presented for improving the scalability of the adaptive integral method (AIM), an FFT-based algorithm, on clusters of identical multi-core processors. The proposed hybrid MPI/OpenMP parallelization scheme is based on a nested one-dimensional (1-D) slab decomposition of the 3-D auxiliary regular grid and the associated AIM calculations: if there are M processors and T cores per processor, the scheme (i) divides the regular grid into M slabs and MT sub-slabs, (ii) assigns each slab/sub-slab and the associated operations to one of the processors/cores, and (iii) uses MPI for inter-processor data communication and OpenMP for intra-processor data exchange. The MPI/OpenMP parallel AIM is used to accelerate the solution of the combined-field integral equation pertinent to the analysis of time-harmonic electromagnetic scattering from perfectly conducting surfaces. The scalability of the scheme is investigated theoretically and verified on a state-of-the-art multi-core cluster for benchmark scattering problems. Timing and speedup results on up to 1024 quad-core processors show that the hybrid MPI/OpenMP parallelization of AIM exhibits better strong scalability (fixed-problem-size speedup) than a pure MPI parallelization when multiple cores are used on each processor.
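A minimal sketch of the nested slab decomposition, assuming mpi4py and a hypothetical auxiliary-grid size: each of the M MPI processes owns one contiguous slab of grid planes, and that slab is further split into T sub-slabs, one per core (the role played by OpenMP threads in the paper). Only the index arithmetic is shown.

    from mpi4py import MPI

    def slab_bounds(n, parts, part):
        """Split n grid planes into `parts` contiguous pieces; return [lo, hi) of piece `part`."""
        base, rem = divmod(n, parts)
        lo = part * base + min(part, rem)
        return lo, lo + base + (1 if part < rem else 0)

    NZ, T = 1024, 4                                  # hypothetical grid depth, cores per processor
    comm = MPI.COMM_WORLD
    M, rank = comm.Get_size(), comm.Get_rank()

    z0, z1 = slab_bounds(NZ, M, rank)                # slab owned by this MPI process
    sub = [slab_bounds(z1 - z0, T, t) for t in range(T)]
    sub_slabs = [(z0 + a, z0 + b) for a, b in sub]   # sub-slabs handled by the T cores
    print(f"rank {rank}: slab [{z0}, {z1}), sub-slabs {sub_slabs}")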

6.
Solution of independent sets of linear banded systems is a core part of implicit numerical algorithms. In this study we propose a novel pipelined Thomas algorithm with low parallelization penalty. We formally introduce two-step pipelined algorithms (PAs) and show that the idle processor time is invariant with respect to the order of the backward and forward steps. Therefore, the parallelization efficiency of the PA cannot be improved directly. However, the processor idle time can be used if some lines have already been computed by the time processors become idle. We develop the immediate backward pipelined Thomas algorithm (IB-PTA). The backward step is computed immediately after the forward step has been completed for the first portion of lines. The advantage of the IB-PTA over the basic PTA is the presence of solved lines, which are available for other computations, by the time processors become idle. Implementation of the IB-PTA is based on a proposed static processor schedule that switches between forward and backward computations and controls communication between processors. Computations are performed on the Cray T3E MIMD computer. Combination of the proposed IB-PTA with the “burn from two ends” algorithm shows low parallelization penalty.
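For reference, a minimal serial Thomas algorithm in numpy: this is the per-line forward/backward sweep that the pipelined algorithms above schedule across processors. It is a textbook sketch, not the pipelined IB-PTA itself.

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a tridiagonal system with sub-diagonal a, diagonal b, super-diagonal c
        and right-hand side d (a[0] and c[-1] are unused)."""
        n = len(d)
        cp, dp = np.empty(n), np.empty(n)
        cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
        for i in range(1, n):                          # forward elimination sweep
            m = b[i] - a[i] * cp[i - 1]
            cp[i] = c[i] / m if i < n - 1 else 0.0
            dp[i] = (d[i] - a[i] * dp[i - 1]) / m
        x = np.empty(n)
        x[-1] = dp[-1]
        for i in range(n - 2, -1, -1):                 # back substitution sweep
            x[i] = dp[i] - cp[i] * x[i + 1]
        return x

In the pipelined setting each processor owns a contiguous portion of every line, the forward sweep is passed downstream and the backward sweep upstream; the IB-PTA described above starts the backward sweep on the first lines as soon as their forward sweeps are complete.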

7.
A functional programming language supporting implicit parallelization of programs is described. The language is based on four composition operations, three of which can perform parallel processing. Functional programs are represented schematically so that a dynamic parallelization algorithm can be used. The implemented algorithms make it possible to dynamically distribute the load between processors and to control the grain of parallelism. Experimental results on the efficiency of the implemented system, obtained on typical example problems, are presented.

8.
The advent of multicores presents a promising opportunity for speeding up the execution of sequential programs through their parallelization. In this paper we present a novel solution for efficiently supporting software-based speculative parallelization of sequential loops on multicore processors. The execution model we employ is based upon state separation, an approach for separately maintaining the speculative state of parallel threads and the non-speculative state of the computation. If speculation is successful, the results produced by parallel threads in speculative state are committed by copying them into the computation's non-speculative state. If misspeculation is detected, no costly state recovery mechanisms are needed, as the speculative state can simply be discarded. Techniques are proposed to reduce the cost of data copying between non-speculative and speculative state and to carry out misspeculation detection efficiently. We apply this approach to speculative parallelization of loops in several sequential programs, which results in significant speedups on a Dell PowerEdge 1900 server with two Intel Xeon quad-core processors.
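A minimal conceptual sketch of state separation, assuming independent Python worker processes, a dictionary as the non-speculative state, and a hypothetical loop body: each chunk executes against a private snapshot and reports its reads and writes; chunks whose reads conflict with already-committed writes are simply discarded (re-execution of discarded chunks is omitted for brevity).

    from multiprocessing import Pool

    def run_speculative(args):
        """Execute one loop chunk against a private snapshot of the state and
        return the keys it read plus the writes it produced."""
        chunk, snapshot = args
        reads, writes = set(), {}
        for i in chunk:                               # hypothetical loop body:
            src = i - 1 if i % 7 == 0 else i          # occasional cross-iteration read
            reads.add(src)
            writes[i] = snapshot.get(src, 0) + i
        return reads, writes

    def speculative_loop(state, iterations, n_chunks=4):
        chunks = [iterations[c::n_chunks] for c in range(n_chunks)]
        with Pool(n_chunks) as pool:
            results = pool.map(run_speculative, [(c, dict(state)) for c in chunks])
        committed = set()
        for reads, writes in results:                 # commit in chunk order
            if reads & committed:                     # misspeculation: a read raced
                continue                              # with a committed write -> discard
            state.update(writes)                      # success: copy into the real state
            committed |= set(writes)
        return state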

9.
Parallel Computing 23(9):1261–1277, 1997
This paper describes a strategy for the parallelization of a finite element code for the numerical simulation of shallow water flow. The numerical scheme adopted for the discretization of the equations in the scalar algorithm is briefly described, with emphasis on the aspects concerning its porting to a parallel architecture. The parallelization strategy is of the domain decomposition type: the implicit computational kernel of the scheme, a Poisson problem, is solved by an additive Schwarz preconditioning technique within conjugate gradient iterations. Both the theoretical and the implementation aspects of the domain decomposition method are described as applied in the present context. Finally, some computational examples are shown and discussed.
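A minimal sketch of a one-level additive Schwarz preconditioner as it would be applied to the residual inside each conjugate gradient iteration, assuming a dense numpy matrix and a given list of (possibly overlapping) subdomain index sets; a production code would use sparse local factorizations with one subdomain per processor.

    import numpy as np

    def additive_schwarz_apply(A, r, subdomains):
        """Apply z = M^{-1} r = sum_i R_i^T A_i^{-1} R_i r, where R_i restricts a
        vector to subdomain i and A_i is the corresponding diagonal block of A."""
        z = np.zeros_like(r)
        for idx in subdomains:                        # independent local solves
            A_i = A[np.ix_(idx, idx)]                 # local (overlapping) block
            z[idx] += np.linalg.solve(A_i, r[idx])    # add the local correction
        return z

In the parallel code each local solve lives on its own processor, and the summation over subdomains reduces to exchanging values on the overlap between neighbouring subdomains.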

10.
Two-level parallelization is introduced to solve a massive block-tridiagonal matrix system. One level is used for distributing blocks, whose size is as large as the number of block rows due to the spectral basis, and the other level is used for parallelizing along the block-row dimension. The purpose of the added parallelization dimension is to retard the saturation of the scaling caused by communication overhead and by the inefficiencies of a single-level parallelization that only distributes blocks. As the technique for parallelizing the tridiagonal matrix, a combination of the “Partitioned Thomas method” and “Cyclic Odd–Even Reduction” is implemented in an MPI-Fortran90 based finite element-spectral code (TORIC) that calculates the propagation of electromagnetic waves in a tokamak. The two-level parallel solver using thousands of processors shows a more than five-fold improvement in computation speed with the optimized processor grid, compared to the single-level parallel solver under the same conditions. Three-dimensional RF field reconstructions in a tokamak are shown as examples of the physics simulations that have been enabled by this algorithmic advance.

11.
Direct numerical simulation (DNS) of turbulent flows is widely recognized to demand fine spatial meshes, small timesteps, and very long runtimes to properly resolve the flow field. To overcome these limitations, most DNS is performed on supercomputing machines. With the rapid development of terascale (and, eventually, petascale) computing on thousands of processors, it has become imperative to consider the development of DNS algorithms and parallelization methods that are capable of fully exploiting these massively parallel machines. A highly parallelizable algorithm for the simulation of turbulent channel flow that allows for efficient scaling on several thousand processors is presented. A model that accurately predicts the performance of the algorithm is developed and compared with experimental data. The results demonstrate that the proposed numerical algorithm is capable of scaling well on petascale computing machines and thus will allow for the development and analysis of high Reynolds number channel flows.

12.
Averbuch A., Epstein B., Ioffe L., Yavneh I., The Journal of Supercomputing 17(2):123–142, 2000
We present an efficient parallelization strategy for speeding up the computation of a high-accuracy 3-dimensional serial Navier-Stokes solver that treats turbulent transonic high-Reynolds-number flows. The code solves the full compressible Navier-Stokes equations and is applicable to realistic, large aerodynamic configurations, and as such requires huge computational resources in terms of computer memory and execution time. The solver can resolve the flow properly on relatively coarse grids. Since the serial code contains the complex infrastructure typical of an industrial code (which ensures its flexibility and applicability to complex configurations), the parallelization task is not straightforward. We obtain a scalable implementation on massively parallel machines by keeping efficiency fixed while simultaneously increasing the number of processors and the size of the problem. The 3-D Navier-Stokes solver was implemented on three MIMD message-passing multiprocessors (a 64-processor IBM SP2, a 20-processor MOSIX cluster, and a 64-processor Origin 2000). The same code, written with the PVM and MPI software packages, was executed on all of these platforms. The examples in the paper demonstrate that we can achieve an efficiency of about 60% on as many as 64 processors of the Origin 2000 for a full-size 3-D aerodynamic problem solved on realistic computational grids.

13.
In this paper we consider the problem of computing the connected components of the complement of a given graph. We describe a simple sequential algorithm for this problem, which works on the input graph and not on its complement, and which for a graph on n vertices and m edges runs in optimal O(n+m) time. Moreover, unlike previous linear co-connectivity algorithms, this algorithm admits efficient parallelization, leading to an optimal O(log n)-time and O((n+m)/log n)-processor algorithm on the EREW PRAM model of computation. It is worth noting that, for the related problem of computing the connected components of a graph, no optimal deterministic parallel algorithm is currently available. The co-connectivity algorithms find applications in a number of problems. In fact, we also include a parallel recognition algorithm for weakly triangulated graphs, which takes advantage of the parallel co-connectivity algorithm and achieves O(log² n) time complexity using O((n+m²)/log n) processors on the EREW PRAM model of computation.
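A minimal sketch of the sequential idea, assuming an adjacency-set representation {vertex: set(neighbours)}: a BFS is run in the complement without ever building it, by keeping the set of still-unassigned vertices and, for each dequeued vertex, moving every unassigned vertex that is not a neighbour in G into the current component. The sketch favours clarity over the O(n+m) bound (the partition step scans the whole unassigned set) and is not the authors' exact algorithm.

    from collections import deque

    def co_connected_components(adj):
        """Connected components of the complement of the graph `adj`,
        computed directly on the input graph."""
        unassigned = set(adj)
        components = []
        while unassigned:
            start = unassigned.pop()
            comp, queue = [start], deque([start])
            while queue:
                v = queue.popleft()
                # G-neighbours of v stay unassigned; every other unassigned
                # vertex is adjacent to v in the complement and joins the component.
                keep = {u for u in unassigned if u in adj[v]}
                for u in unassigned - keep:
                    comp.append(u)
                    queue.append(u)
                unassigned = keep
            components.append(comp)
        return components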

14.
A previously presented hybrid finite volume/particle method for the solution of the joint velocity-frequency-composition probability density function (JPDF) transport equation in complex 3D geometries is extended for parallel computing. The parallelization strategy is based on domain decomposition. The finite volume method (FVM) and the particle method (PM) are parallelized separately, and the algorithm is fully synchronous. For the FVM a standard method based on transferring data in ghost cells is used. Moreover, a subdomain interior decomposition algorithm to efficiently solve the implicit time integration for hyperbolic systems is described. The parallelization of the PM is more complicated due to the use of a sub-time stepping algorithm for the particle trajectory integration: each particle obeys its local CFL criterion, so the distances covered per global time step can vary significantly. Therefore, an efficient algorithm which deals with this issue with minimum communication effort was devised and implemented. Numerical tests validating the parallel against the serial algorithm are presented, in which the effectiveness of the subdomain interior decomposition for the implicit time integration is also investigated. A 3D dump-combustor configuration test case with about 2.5 × 10⁵ cells was used to demonstrate the good performance of the parallel algorithm. The hybrid algorithm scales well, and the maximum speedup on 60 processors for this configuration was 50 (≈80% parallel efficiency).
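A minimal sketch of the local-CFL sub-time stepping for a single particle, assuming a hypothetical 1-D velocity field u(x) and cell size dx: the particle takes as many local sub-steps as needed to cover the global time step, so the number of sub-steps differs from particle to particle, which is what makes the parallel load balancing described above non-trivial.

    def advance_particle(x, u, dt_global, dx, cfl=0.5):
        """Integrate one particle trajectory over a global time step using
        local sub-steps limited by the particle's own CFL criterion."""
        t = 0.0
        while t < dt_global:
            dt_local = min(cfl * dx / (abs(u(x)) + 1e-30),  # local CFL limit
                           dt_global - t)                   # do not overshoot the global step
            x += dt_local * u(x)                            # one explicit sub-step
            t += dt_local
        return x

    # Example: particles in a non-uniform velocity field need different sub-step counts.
    print(advance_particle(0.2, lambda x: 1.0 + 10.0 * x, dt_global=0.1, dx=0.01))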

15.
In this paper, we study the parallelization of the Jacobi method for the symmetric eigenvalue problem on a mesh of processors. To solve this problem with a theoretical efficiency of 100%, it is necessary to exploit the symmetry of the matrix. The only previous algorithm we know of that exploits the symmetry on multicomputers is that of van de Geijn (1991), but that algorithm uses a storage scheme suited to a logical ring of processors and therefore has low scalability. In this paper we show how matrix symmetry can be exploited on a logical mesh of processors, obtaining higher scalability than van de Geijn's algorithm. In addition, we show how the symmetry-exploiting storage scheme can be combined with a block scheme to obtain a highly efficient and scalable Jacobi method for solving the symmetric eigenvalue problem on distributed memory parallel computers. We report performance results from the Intel Touchstone Delta, the iPSC/860, the Alliant FX/80 and the PARSYS SN-1040.
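For reference, a minimal serial cyclic Jacobi sweep for a symmetric matrix using numpy: these are the plane rotations that the mesh-parallel, symmetry-exploiting schemes distribute and block. The full-matrix rotation used here favours clarity over efficiency; it is a textbook sketch, not the paper's parallel algorithm.

    import numpy as np

    def jacobi_eigen(A, sweeps=20, tol=1e-12):
        """Cyclic Jacobi method: repeatedly annihilate off-diagonal entries of the
        symmetric matrix A with plane rotations; returns (eigenvalues, eigenvectors)."""
        A = A.astype(float).copy()
        n = A.shape[0]
        V = np.eye(n)
        for _ in range(sweeps):
            if np.sqrt(np.sum(np.tril(A, -1) ** 2)) < tol:     # off-diagonal norm
                break
            for p in range(n - 1):
                for q in range(p + 1, n):                      # one cyclic sweep
                    if abs(A[p, q]) < tol:
                        continue
                    tau = (A[q, q] - A[p, p]) / (2.0 * A[p, q])
                    t = 1.0 / (tau + np.sqrt(1.0 + tau * tau)) if tau >= 0 \
                        else 1.0 / (tau - np.sqrt(1.0 + tau * tau))
                    c = 1.0 / np.sqrt(1.0 + t * t)
                    s = t * c
                    J = np.eye(n)                              # rotation in the (p, q) plane
                    J[p, p] = J[q, q] = c
                    J[p, q], J[q, p] = s, -s
                    A = J.T @ A @ J                            # zeroes A[p, q]
                    V = V @ J
        return np.diag(A), V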

16.
Existing ciphertext-policy attribute-based encryption (CP-ABE) algorithms generally suffer from excessive computation and long running times during decryption, which makes CP-ABE difficult to apply and deploy. To address this problem, computation outsourcing is introduced into the scheme design, and a fast CP-ABE decryption scheme for public clouds based on the Spark big data platform is proposed. In this scheme, a decryption parallelization algorithm is designed specifically around the decryption characteristics of CP-ABE; using this algorithm, the computationally heavy leaf-node and root-node decryptions are parallelized, and the parallelized tasks are then handed to a Spark cluster for processing. Computation outsourcing lets the cloud server perform the vast majority of the decryption work, so the user's client only needs to perform a single exponentiation, while the parallel processing improves decryption speed. Security analysis shows that the proposed scheme resists chosen-plaintext attacks in the generic group model and the random oracle model.
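A minimal sketch of the outsourcing idea, assuming PySpark and using decrypt_leaf() and combine_at_root() as hypothetical toy stand-ins for the scheme's actual pairing-based primitives: the cloud maps the per-leaf decryptions over a Spark RDD and combines them at the root, handing back a partial result so that the client finishes with a single exponentiation.

    from pyspark import SparkContext

    def decrypt_leaf(leaf):
        """Hypothetical per-leaf computation of the CP-ABE access tree;
        a real scheme would evaluate a bilinear pairing here."""
        ciphertext_part, key_part = leaf
        return pow(ciphertext_part, key_part, 2**61 - 1)     # toy placeholder

    def combine_at_root(a, b):
        """Hypothetical root-node combination of leaf results."""
        return (a * b) % (2**61 - 1)

    if __name__ == "__main__":
        sc = SparkContext(appName="cpabe-outsourced-decrypt")
        leaves = [(3, 5), (7, 11), (13, 17)]                 # toy leaf-node data
        partial = sc.parallelize(leaves).map(decrypt_leaf).reduce(combine_at_root)
        sc.stop()
        # The cloud returns `partial`; the client finishes with one exponentiation.
        print("partially decrypted value handed back to the client:", partial)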

17.
18.
The paper describes how to modify the two-sided Hari–Zimmermann algorithm for computation of the generalized eigenvalues of a matrix pair (A, B), where B is positive definite, into an implicit algorithm that computes the generalized singular values of a pair (F, G). In addition, we present blocking and parallelization techniques for speeding up the computation. For triangular matrix pairs of moderate size, numerical tests show that the double precision sequential pointwise algorithm is several times faster than the LAPACK routine DTGSJA, while the accuracy is slightly better, especially for small generalized singular values. Cache-aware algorithms, implemented either as the block-oriented or as the full block algorithm, are several times faster than the pointwise algorithm. The algorithm is almost perfectly parallelizable, so parallel shared memory versions of the algorithm are perfectly scalable and their speedup depends almost solely on the number of cores used. A hybrid shared/distributed memory algorithm is intended for huge matrices that do not fit into shared memory.

19.
The development and validation of a parallel unstructured tetrahedral non-nested multigrid (MG) method for the simulation of unsteady 3D incompressible viscous flow is presented. The Navier-Stokes solver is based on the artificial compressibility method (ACM) and a higher-order characteristics-based finite-volume scheme on unstructured MG. Unsteady flow is calculated with an implicit dual time stepping scheme. The parallelization of the solver is achieved by an MG domain decomposition approach (MG-DD), using the Single Program Multiple Data (SPMD) programming paradigm. The Message-Passing Interface (MPI) library is used for data communication, and loop arrays are decomposed using the OpenMP standard. The parallel codes using a single grid and MG are used to simulate steady and unsteady incompressible viscous flows for a 3D lid-driven cavity flow, for validation and performance evaluation purposes. The speedups and efficiencies obtained by both the parallel single-grid and MG solvers are reasonably good for all test cases, using up to 32 processors on the SGI Origin 3400. The parallel results obtained agree well with those of the serial solvers and with numerical solutions obtained by other researchers, as well as with experimental measurements.

20.
The development of computational fluid dynamics (CFD) depends heavily on high-performance computers. Computer hardware has evolved rapidly, yet scalable CFD parallel software remains scarce. In this article, we design a highly scalable CFD parallel paradigm for both homogeneous and heterogeneous supercomputers. The paradigm achieves the separation of communication and computation and automatically adapts to various solvers and hardware environments, thus reducing programming difficulty and increasing automatic parallelization. Meanwhile, the number of communications is greatly reduced and the scalability of the program is improved by implementing centralized communication and two-level partitioning techniques. Complex flow problems for real aircraft were then computed on different hardware platforms with a grid size of ten billion. The homogeneous computer hardware includes Intel Xeon Gold 6258R and Phytium 2000+ processors, and the heterogeneous computer platforms include NVIDIA Tesla V100 and SW26010 processors. High parallel efficiency was obtained on all computer platforms, verifying that the paradigm has good automatic parallelization, scalability, and stability. The paradigm in this article is an important reference for CFD massively parallel computing and can promote the development and application of CFD technology.
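A minimal sketch of one common way to separate communication from computation, assuming mpi4py and a 1-D periodic halo exchange between neighbouring ranks: non-blocking halo sends and receives are posted first, the interior cells (which need no remote data) are updated while the messages are in flight, and the two boundary cells are updated only after the waits complete. This illustrates the general overlap idea, not the specific paradigm of the paper.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size

    n = 1000
    u = np.random.rand(n + 2)                          # local field with two ghost cells
    recv_l, recv_r = np.empty(1), np.empty(1)

    # 1) Post the non-blocking halo exchange with both neighbours.
    reqs = [comm.Isend(u[1:2].copy(), dest=left, tag=0),
            comm.Isend(u[n:n + 1].copy(), dest=right, tag=1),
            comm.Irecv(recv_l, source=left, tag=1),
            comm.Irecv(recv_r, source=right, tag=0)]

    # 2) Update interior cells (no remote data needed) while messages are in flight.
    u_new = u.copy()
    u_new[2:n] = 0.5 * (u[1:n - 1] + u[3:n + 1])

    # 3) Complete the communication, then update the two boundary cells.
    MPI.Request.Waitall(reqs)
    u[0], u[n + 1] = recv_l[0], recv_r[0]
    u_new[1] = 0.5 * (u[0] + u[2])
    u_new[n] = 0.5 * (u[n - 1] + u[n + 1])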
