This work is devoted to the development of efficient parallel algorithms for the direct numerical simulation (DNS) of incompressible flows on modern supercomputers. In such simulations, a Poisson equation must be solved at each time step to project the velocity field onto a divergence-free space. Due to the non-local nature of its solution, this elliptic system is the part of the algorithm that is most difficult to parallelize. The Poisson solver presented here is restricted to problems with one uniform periodic direction. It combines a block preconditioned Conjugate Gradient (PCG) method with an FFT diagonalization. The latter decomposes the original system into a set of mutually independent 2D systems that are solved by means of the PCG algorithm. For the most ill-conditioned systems, which correspond to the lowest Fourier frequencies, the PCG is replaced by a direct Schur-complement-based solver. The previous version of the Poisson solver was conceived for single-core (and dual-core) processors and therefore used the distributed memory model with the Message Passing Interface (MPI). The advent of multi-core architectures motivated the use of a two-level hybrid MPI + OpenMP parallelization, with the shared memory model on the second level. Advantages and implementation details of the additional OpenMP parallelization are presented and discussed in this paper. Numerical experiments show that, within its range of efficient scalability, the previous MPI-only parallelization is slightly outperformed by the MPI + OpenMP approach. More importantly, the hybrid parallelization has significantly extended the range of efficient scalability. Here, the solver has been successfully tested on up to 12,800 CPU cores for meshes with up to 10^9 grid points. However, estimates based on the presented results suggest that this range could potentially be stretched to approximately 200,000 cores. Finally, several DNS examples are briefly presented to illustrate some potential applications of the solver.
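
The FFT diagonalization described in this abstract relies on a standard fact: a discrete operator along a uniform periodic direction is circulant, so discrete Fourier modes are its eigenvectors and each frequency decouples from the others. The following is a minimal sketch of that fact only, assuming a second-order 3-point stencil with unit spacing; the stencil choice and the function names are illustrative assumptions, not taken from the paper.

```python
import cmath
import math

def periodic_second_difference(u):
    """Apply the stencil u[j-1] - 2*u[j] + u[j+1] with periodic wrap-around."""
    n = len(u)
    return [u[(j - 1) % n] - 2 * u[j] + u[(j + 1) % n] for j in range(n)]

def fourier_mode(k, n):
    """Discrete Fourier mode exp(2*pi*i*k*j/n) for j = 0..n-1."""
    return [cmath.exp(2j * math.pi * k * j / n) for j in range(n)]

def eigenvalue(k, n):
    """Eigenvalue of the periodic stencil for frequency k (assumed stencil)."""
    return 2.0 * math.cos(2.0 * math.pi * k / n) - 2.0

def max_residual(k, n):
    """Largest entry of |A v - lambda v| for mode k; expected at rounding level."""
    v = fourier_mode(k, n)
    av = periodic_second_difference(v)
    lam = eigenvalue(k, n)
    return max(abs(a - lam * x) for a, x in zip(av, v))
```

In this sketch each frequency k contributes only a scalar shift `eigenvalue(k, n)` to its own 2D system. The shift is zero for k = 0 and smallest in magnitude for the lowest frequencies, which is consistent with the abstract's remark that those systems are the most ill-conditioned.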

In this note, we introduce a simple, effective numerical method, the local tangential lifting method, for solving partial differential equations for scalar- and vector-valued data defined on surfaces. Even though we follow the traditional way to approximate the regular surfaces under consideration by triangular meshes, the key idea of our algorithm is to develop an intrinsic and unified way to compute directly the partial derivatives of functions defined on triangular meshes. We present examples in computer graphics and image processing applications.

Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in “flat” three-dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three-dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion, and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
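
The matrix-free idea in this abstract can be sketched on a toy problem: the operator is never stored, and each product recomputes the local stencil. The sketch below is a sequential, stdlib-only illustration under stated assumptions: a 1D stencil (-1, 2, -1) with zero boundary values stands in for the article's 3D anisotropic operator, and a plain Jacobi (diagonal) preconditioner stands in for its anisotropy-aware preconditioner. All names are hypothetical.

```python
def apply_stencil(u):
    """Matrix-free product A u for the stencil (-1, 2, -1) with zero boundary values."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * u[i] - left - right)
    return out

def dot(a, b):
    """Inner product of two equally long lists."""
    return sum(x * y for x, y in zip(a, b))

def pcg(b, tol=1e-10, max_iter=1000):
    """Preconditioned CG; the Jacobi preconditioner divides by the diagonal entry 2."""
    x = [0.0] * len(b)
    r = list(b)                       # residual b - A x for the zero initial guess
    z = [ri / 2.0 for ri in r]        # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, iteration
        ap = apply_stencil(p)         # stencil recomputed, no stored matrix
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pv for xi, pv in zip(x, p)]
        r = [ri - alpha * av for ri, av in zip(r, ap)]
        z = [ri / 2.0 for ri in r]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pv for zi, pv in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

The separate list comprehensions that update `x`, `r`, `z` and `p` each traverse the vectors once; merging such traversals is the kind of loop fusion the abstract mentions, although this sketch keeps them separate for readability.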

This paper describes changes made to a previous implementation of an N-body tree code developed for a fine-grained SIMD computer architecture. These changes include (1) switching from a balanced binary tree to a balanced oct tree, (2) adding quadrupole corrections, and (3) having the particles search the tree in groups rather than individually. An algorithm for limiting errors is also discussed. In aggregate, these changes have led to a performance increase of over a factor of 10 compared to the previous code. For problems several times larger than the processor array, the code now achieves performance levels of ~1 Gflop on the MasPar MP-2, or roughly 20% of the quoted peak performance of this machine. This percentage is competitive with other parallel implementations of tree codes on MIMD architectures, which is significant considering the low relative cost of SIMD architectures.

The present work investigates the feasibility of finite element methods and topology optimization for unstructured meshes in massively parallel computer architectures, more specifically on Graphics Processing Units or GPUs. Challenges in the parallel implementation, like the parallel assembly race condition, are discussed and solved with simple algorithms, in this case greedy graph coloring. The parallel implementation for every step involved in the topology optimization process is benchmarked and compared against an equivalent sequential implementation. The ultimate goal of this work is to speed up the topology optimization process by means of parallel computing using off-the-shelf hardware. Examples are compared with both a standard sequential version of the implementation and a massively parallel version to better illustrate the advantages and disadvantages of this approach.
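
Greedy graph coloring avoids the assembly race condition by grouping elements so that no two elements of the same color share a node; elements of one color can then add their contributions to the global system concurrently. The sketch below is one common way to do this, assuming elements are given as tuples of distinct node indices. The function names and the data layout are illustrative assumptions, not the authors' implementation.

```python
def greedy_element_coloring(elements):
    """Give each element the smallest color unused by elements that share a node with it."""
    node_to_elements = {}
    for e, nodes in enumerate(elements):
        for node in nodes:
            node_to_elements.setdefault(node, []).append(e)

    colors = [-1] * len(elements)     # -1 marks an element that is not colored yet
    for e, nodes in enumerate(elements):
        used = set()
        for node in nodes:
            for other in node_to_elements[node]:
                if colors[other] >= 0:
                    used.add(colors[other])
        color = 0
        while color in used:
            color += 1
        colors[e] = color
    return colors

def has_conflict(elements, colors):
    """Return True if two elements with the same color share a node."""
    seen = set()
    for e, nodes in enumerate(elements):
        for node in nodes:
            key = (node, colors[e])
            if key in seen:
                return True
            seen.add(key)
    return False
```

For four triangles that all meet at one central node, every element needs its own color; for two disjoint triangles, a single color suffices.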

The increasing gap between the speeds of processors and main memory has led to hardware architectures with an increasing number of caches to reduce average memory access times. Such deep memory hierarchies make the sequential and parallel efficiency of computer programs strongly dependent on their memory access pattern. In this paper, we consider embedded Runge–Kutta methods for the solution of ordinary differential equations and study their efficient implementation on different parallel platforms. In particular, we focus on ordinary differential equations which are characterized by a special access pattern as it results from the spatial discretization of partial differential equations by the method of lines. We explore how the potential parallelism in the stage vector computation of such equations can be exploited in a pipelining approach leading to a better locality behavior and a higher scalability. Experiments show that this approach results in efficiency improvements on several recent sequential and parallel computers.
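
An embedded Runge–Kutta method computes two results of different order from the same stage values and uses their difference as an error estimate for step size control. The sketch below uses the simple Heun-Euler pair (orders 2 and 1) on a scalar equation as an illustration. The abstract does not name a specific pair, so this choice, the step size rule, and the function names are assumptions; the pipelined stage computation studied in the paper is not shown.

```python
def heun_euler_step(f, t, y, h):
    """One step of the embedded Heun-Euler pair: (second-order value, error estimate)."""
    k1 = f(t, y)
    k2 = f(t + h, y + h * k1)
    y_low = y + h * k1                 # first-order (Euler) result
    y_high = y + 0.5 * h * (k1 + k2)   # second-order (Heun) result, reusing k1
    return y_high, abs(y_high - y_low)

def integrate(f, t0, y0, t_end, h, tol):
    """Adaptive integration: halve h on rejection, enlarge it after an accepted step."""
    t, y = t0, y0
    while t < t_end:
        h = min(h, t_end - t)          # do not step past the end point
        y_new, err = heun_euler_step(f, t, y, h)
        if err <= tol:
            t, y = t + h, y_new
            h *= 1.5
        else:
            h *= 0.5
    return y
```

Both results share the stage values `k1` and `k2`, so the error estimate costs no additional evaluations of `f`.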

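The way data structures are organized plays an important role in the programmed implementation of algorithms. This paper discusses the organization of data structures in mesh data processing, analyzes the advantages and disadvantages of different data organizations in terms of time and space, and proposes a flexible, highly adaptable organization structure for mesh data. Several examples verify the proposed data structure's real-time efficiency, its adaptive use of storage space, and its simplicity of implementation. The proposed mesh data organization structure can be used for computations on all kinds of mesh data.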

Mesh analysis using geodesic mean-shift   Total citations: 4, self-citations: 0, citations by others: 4
In this paper, we introduce a versatile and robust method for analyzing the feature space associated with a given mesh surface. The method is based on the mean-shift operator, which has been shown to be successful in image and video processing. Its strength lies in the fact that it works in a single joint space of geometry and attributes called the feature space. The mean-shift procedure works as a gradient ascent that finds maxima of an estimated probability density function in feature space. Our method for using the mean-shift technique on surfaces resolves several difficulties. First, meshes, as opposed to images, do not present a regular and uniform sampling of the domain. Second, on surface meshes the shifting procedure must be constrained to stay on the surface and preserve geodesic distances. We define a special local geodesic parameterization scheme, and use it to generalize the mean-shift procedure to unstructured surface meshes. Our method can support piecewise linear attribute definitions as well as piecewise constant attributes.
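
The basic mean-shift iteration that this abstract builds on can be sketched in a plain Euclidean setting: a point is repeatedly moved to the kernel-weighted mean of the samples around it, which climbs toward a local maximum of the estimated density. The sketch below assumes 1D samples and a Gaussian kernel; it does not include the geodesic parameterization that the paper introduces for surface meshes, and the function name and parameters are hypothetical.

```python
import math

def mean_shift_point(x, samples, bandwidth, tol=1e-9, max_iter=500):
    """Move x to the Gaussian-weighted mean of the samples until it stops moving."""
    for _ in range(max_iter):
        weights = [math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for s in samples]
        new_x = sum(w * s for w, s in zip(weights, samples)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x
```

Starting points on either side of two well-separated clusters converge to the mode of the nearer cluster. The starting point is assumed to lie within a few bandwidths of some sample, since otherwise all weights may underflow to zero.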

In this paper, the two-dimensional multi-term time-space fractional diffusion-wave equation on an irregular convex domain is considered as a much more general case for wider applications in fluid mechanics. A novel unstructured mesh finite element method is proposed for the considered equation. In most existing works, the finite element method is applied on regular domains using uniform meshes. The case of irregular convex domains, which would require subdivision using unstructured meshes, is mostly still open. Furthermore, the orders of the multi-term time-fractional derivatives have been considered to belong to (0, 1] or (1, 2] separately in existing models. In this paper, we consider two-dimensional multi-term time-space fractional diffusion-wave equations with the time fractional orders belonging to the whole interval (0, 2) on an irregular convex domain. We propose to use a mixed difference scheme in time and an unstructured mesh finite element method in space. Detailed implementation and the stability and convergence analyses of the proposed numerical scheme are given. Numerical examples are presented to verify the theoretical analysis.

In this paper, we present a novel volumetric mesh representation suited for parallel computing on modern GPU architectures. The data structure is based on a compact, ternary sparse matrix storage of boundary operators. Boundary operators correspond to the first‐order top‐down relations of k‐faces to their (k − 1)‐face facets. The compact, ternary matrix storage format is based on compressed sparse row matrices with signed indices and allows for efficient parallel computation of indirect and bottom‐up relations. This representation is then used in the implementation of several parallel volumetric mesh algorithms including Laplacian smoothing and volumetric Catmull‐Clark subdivision. We compare these algorithms with their counterparts based on OpenVolumeMesh and achieve speedups from 3× to 531×, for sufficiently large meshes, while reducing memory consumption by up to 36%.
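
A ternary boundary operator has entries in {-1, 0, +1}, so a compressed sparse row layout needs no separate value array if the sign of each stored index carries the coefficient. The sketch below illustrates this for a single triangle, assuming 1-based indices (so that every index can carry a sign) and an orientation convention chosen for the example. The array names and both conventions are assumptions, not the paper's exact format.

```python
# One triangle with vertices 1, 2, 3 and edges e1 = (1->2), e2 = (2->3), e3 = (1->3).
# Face 1 is bounded by +e1, +e2, -e3; each edge is bounded by -tail, +head.
FACE_EDGE_PTR = [0, 3]
FACE_EDGE_IDX = [1, 2, -3]
EDGE_VERT_PTR = [0, 2, 4, 6]
EDGE_VERT_IDX = [-1, 2, -2, 3, -1, 3]

def boundary_of(row_ptr, signed_idx, row):
    """Return {column: coefficient} for a 1-based row; the index sign is the coefficient."""
    start, end = row_ptr[row - 1], row_ptr[row]
    return {abs(s): (1 if s > 0 else -1) for s in signed_idx[start:end]}

def boundary_of_boundary(ptr_a, idx_a, ptr_b, idx_b, row):
    """Compose two boundary operators for one row and drop zero coefficients."""
    total = {}
    for col, coeff in boundary_of(ptr_a, idx_a, row).items():
        for col2, coeff2 in boundary_of(ptr_b, idx_b, col).items():
            total[col2] = total.get(col2, 0) + coeff * coeff2
    return {c: v for c, v in total.items() if v != 0}
```

Composing the face-to-edge and edge-to-vertex operators gives the indirect face-to-vertex relation with orientation taken into account; for a consistently oriented mesh all coefficients cancel, which serves as a quick consistency check on the stored signs.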

Concepts and implementation of parallel finite element analysis   Total citations: 1, self-citations: 0, citations by others: 1
The design of complex engineering systems such as advanced aircraft structures and offshore platforms requires continually increasing levels of detail in supporting analysis. The finite element method is widely used as a computational method with which to model physical systems in various engineering problems. For detailed analyses of complex designs, structural models composed of several thousands of degrees of freedom are no longer uncommon. Such design activities require large-order finite element and/or finite difference models and place excessive demands on both calculation speed and information management. The computer simulation of the nonlinear dynamic response of structures and the implementation of parallel FEM systems on a high-speed multiprocessor have received considerable attention in recent years. The driving forces of these activities include the reliable simulation of automotive and aircraft crash phenomena and the increased performance of computers. Most existing major structural analysis software systems were designed 10–20 years ago and have been optimized for current sequential computers. Such systems are often not well structured to take maximum advantage of the recent and continuing revolution in parallel vector computing capabilities. These parallel vector computer architectures not only occur in the form of large supercomputers, but are now also appearing in minicomputers and even engineering workstations. To benefit from advances in parallel computers, software must be developed which takes maximum advantage of the parallel processing feature.

This paper analyzes the performance and scalability of an iteration of the preconditioned conjugate gradient algorithm on parallel architectures with a variety of interconnection networks, such as the mesh, the hypercube, and that of the CM-5 parallel computer. It is shown that for block-tridiagonal matrices resulting from two-dimensional finite difference grids, the communication overhead due to vector inner products dominates the communication overheads of the remainder of the computation on a large number of processors. However, with a suitable mapping, the parallel formulation of a PCG iteration is highly scalable for such matrices on a machine like the CM-5, whose fast control network practically eliminates the overheads due to inner product computation. The use of the truncated Incomplete Cholesky (IC) preconditioner can lead to further improvement in scalability on the CM-5 by a constant factor. As a result, a parallel formulation of the PCG algorithm with the IC preconditioner may execute faster than one with a simple diagonal preconditioner, even if the latter runs faster in a serial implementation.

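0. Introduction. In recent years, the mathematical theory and numerical approximation of Hamilton-Jacobi equations (H-J equations for short) have attracted growing attention. H-J equations not only have very important applications in established fields such as control theory and differential geometry [8], but are also steadily opening up new areas of application, for example mesh generation [5] and level set computations of fluid interfaces [9,12,13,15]. Because the derivatives of solutions of H-J equations can develop discontinuities, which causes cusps or kinks to appear in the solution surface (or curve) [7], an important question is how to save computation time while still achieving high accuracy in smooth regions and resolving discontinuities well. The work in […] constructs, along each coordinate direction, a univariate …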

This work concerns the implementation of a finite volume method to solve the 2D Shallow Water Equations on Graphics Processing Units (GPUs). The strategy is fully oriented to work efficiently with unstructured meshes, which are widely used in many fields of engineering. Due to the design of GPU cards, structured meshes are better suited to them than unstructured meshes. In order to overcome this limitation, some strategies based on introducing a certain ordering on the unstructured meshes are proposed and analyzed in terms of computational gain. The necessity of performing the simulations using unstructured instead of structured meshes is also justified by means of some test cases with analytical solutions.

A general 2D hp-adaptive Finite Element (FE) implementation in Fortran 90 is described. The implementation is based on an abstract data structure that allows the full hp-adaptivity of triangular and quadrilateral finite elements to be incorporated. The h-refinement strategies are based on h2-refinement of quadrilaterals and h4-refinement of triangles. For p-refinement we allow the approximation order to vary within any element. The mesh refinement algorithms are restricted to 1-irregular meshes. Anisotropic and geometric refinement of quadrilateral meshes is made possible by additionally allowing double constrained nodes in rectangles. The capabilities of this hp-adaptive FE package are demonstrated on various test problems. Received: 18 December 1997 / Accepted: 17 April 1998

This paper presents an interpolating ternary butterfly subdivision scheme for triangular meshes based on a 1–9 splitting operator. The regular rules are derived from a C^2 interpolating subdivision curve, and the irregular rules are established through the Fourier analysis of the regular case. By analyzing the eigenstructures and characteristic maps, we show that the subdivision surfaces generated by this scheme are C^1 continuous up to valence 100. In addition, the curvature of regular regions is bounded. Finally, we demonstrate the visual quality of our subdivision scheme with several examples.

A mesh-vertex finite volume scheme for solving the Euler equations on triangular unstructured meshes is implemented on a MIMD (multiple instruction/multiple data stream) parallel computer. Three partitioning strategies for distributing the work load onto the processors are discussed. Issues pertaining to the communication costs are also addressed. We find that the spectral bisection strategy yields the best performance. The performance of this unstructured computation on the Intel iPSC/860 compares very favorably with that on a one-processor CRAY Y-MP/1 and an earlier implementation on the Connection Machine. The authors are employees of Computer Sciences Corporation. This work was funded under contract NAS 2-12961.
