Similar Documents (20 results)
1.
Global magnetohydrodynamic (MHD) models play a major role in investigating the solar wind–magnetosphere interaction, but the huge computational cost of global MHD simulations remains the main problem to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations more efficiently. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used to drive multiple GPUs: all data transfers and kernel launches are managed through the CUDA 4.0 API instead of MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards and show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.
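The key implementation point, managing all inter-GPU traffic through the CUDA 4.0 API with GPUDirect 2.0 rather than MPI or OpenMP, rests on peer-to-peer device copies. A minimal sketch of that mechanism follows, assuming a two-GPU exchange with illustrative buffer names; it is not code from the paper.

    #include <cuda_runtime.h>

    /* Hedged sketch: one-time peer-access setup, then a direct GPU-to-GPU
       copy (GPUDirect 2.0) with no host staging and no MPI. */
    void exchange_halo(const double *d_src_on_gpu0, double *d_dst_on_gpu1,
                       size_t bytes, cudaStream_t stream)
    {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);     /* let GPU 0 address GPU 1 */
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);     /* and vice versa (one-time calls) */

        cudaMemcpyPeerAsync(d_dst_on_gpu1, 1, /* dst pointer, dst device */
                            d_src_on_gpu0, 0, /* src pointer, src device */
                            bytes, stream);
        cudaStreamSynchronize(stream);
    }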

2.
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Because the memory of a single GPU is limited, distributed multi-GPU systems need to be explored for large-scale MHD simulations; however, data transfer between GPUs bottlenecks the efficiency of simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem and enhance overall performance. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs within a single node and to reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of memory copies to speed up the fragmented data exchange arising in a three-dimensional (3D) domain decomposition. 3D decomposition is usually not preferred on distributed multi-GPU systems because of the low efficiency of this fragmented exchange; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS of a conventional MPI-only implementation with 2D decomposition. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.
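The second strategy, replacing many small memory copies with a CUDA packing kernel for the fragmented faces of the 3D decomposition, can be sketched as follows. The row-major layout (i + nx·(j + ny·k)) and all names are illustrative assumptions, not code from the MGPU–MHD program.

    /* Hedged sketch: gather one non-contiguous x-boundary face of a 3D field
       into a contiguous send buffer in a single kernel launch, instead of
       issuing ny*nz tiny cudaMemcpy calls. */
    __global__ void pack_x_face(const double *field, double *sendbuf,
                                int i_face, int nx, int ny, int nz)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (j < ny && k < nz)
            sendbuf[j + (size_t)ny * k] =
                field[i_face + (size_t)nx * (j + (size_t)ny * k)];
    }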

3.
We present several algorithms to compute the solution of a linear system of equations on a graphics processor (GPU), as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We compare the single and double precision performance of a modern GPU with unified architecture, and show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems, exploiting the processor's potential for single precision arithmetic. Experimental results on a GTX280 using CUBLAS 2.0, the implementation of BLAS for NVIDIA GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.
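Mixed-precision iterative refinement has a simple structure: perform the expensive solve in single precision while forming the residual and updating the solution in double precision. Below is a minimal host-side sketch; residual_double, solve_single, and norm2 are hypothetical placeholders standing in for the paper's CUBLAS-based routines.

    #include <stdlib.h>

    /* Hypothetical building blocks standing in for the paper's GPU routines. */
    void   residual_double(int n, const double *A, const double *x,
                           const double *b, double *r);  /* r = b - A*x, double */
    void   solve_single(int n, const double *A, const double *rhs,
                        double *d);            /* solve A*d = rhs in single */
    double norm2(int n, const double *v);

    /* Mixed-precision iterative refinement: cheap single-precision solves,
       double-precision residuals and updates recover full accuracy. */
    void refine(int n, const double *A, const double *b, double *x,
                int max_iter, double tol)
    {
        double *r = malloc(n * sizeof *r);
        double *d = malloc(n * sizeof *d);
        for (int it = 0; it < max_iter; ++it) {
            residual_double(n, A, x, b, r);
            if (norm2(n, r) < tol) break;   /* converged in double precision */
            solve_single(n, A, r, d);       /* correction from cheap solve */
            for (int i = 0; i < n; ++i)
                x[i] += d[i];               /* accumulate in double precision */
        }
        free(r);
        free(d);
    }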

4.
Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively parallel "co-processors" to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible-flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA's Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU, and use Pthreads to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm for solving the incompressible fluid flow equations. Kernels are placed in different memory spaces on the GPU depending on their arithmetic intensity, and this memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and compare single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective, small-footprint parallel computing platform to substantially accelerate computational fluid dynamics (CFD) simulations.
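The projection step is typically dominated by an iterative pressure Poisson solve, which maps naturally onto one CUDA thread per grid point. The following sketch of a single Jacobi sweep on a uniform 2D grid illustrates the pattern under assumed naming and layout; the paper's actual kernels, memory-space placement, and boundary handling are not reproduced here.

    /* Hedged sketch: one Jacobi sweep for the pressure Poisson equation on a
       uniform 2D grid with spacing h; div holds the divergence of the
       intermediate velocity field. Row-major layout, one thread per point. */
    __global__ void jacobi_pressure(const float *p, float *p_new,
                                    const float *div, int nx, int ny, float h)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int c = i + nx * j;
            p_new[c] = 0.25f * (p[c - 1] + p[c + 1] + p[c - nx] + p[c + nx]
                                - h * h * div[c]);
        }
    }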

5.
Zhang Zhe, 《微型机与应用》 (Microcomputer & Its Applications), 2012, 31(10): 85–88
For two-dimensional one-layer shallow-water systems on GPUs supporting NVIDIA's CUDA programming model, this paper shows how to accelerate the numerical solution of a well-balanced finite-volume scheme, and presents and implements algorithms that exploit the latent data parallelism under the CUDA model in both single and double floating-point precision. Numerical experiments show that the CUDA-based solver is more efficient than a parallel CPU implementation.

6.
Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes of up to about 1 million particles. Particular emphasis is put on numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single floating-point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to match a parallelized MD simulation using 64 distributed cores.
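Double-single emulation represents one value as an unevaluated sum of two floats so that accumulated quantities, such as the total energy, retain nearly double precision on single-precision hardware. The device function below sketches the standard Knuth/Dekker two-sum that underlies this technique (as in DSFUN90-style libraries); it is a sketch of the general method, not code from the paper, and must be compiled without value-unsafe floating-point optimizations.

    /* Hedged sketch of double-single addition: a value is carried as a
       float2 (hi, lo) pair; the rounding error of the high-order sum is
       recovered and folded into the low-order part. */
    __device__ float2 ds_add(float2 a, float2 b)
    {
        float t1 = a.x + b.x;                     /* high-order sum */
        float e  = t1 - a.x;
        float t2 = ((b.x - e) + (a.x - (t1 - e))) /* rounding error of t1 */
                   + a.y + b.y;                   /* plus low-order parts */
        float2 r;
        r.x = t1 + t2;                            /* renormalize */
        r.y = t2 - (r.x - t1);
        return r;
    }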

7.
This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids from the CPU to the graphics processing unit (GPU; NVIDIA's GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracies of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method suitable for GPUs is proposed and demonstrated on inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs), though each with different evaluation tools. The low-level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high-level EA, which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.

8.
We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth onto NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited to the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25×.
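The coloring scheme removes race conditions from the summation over shared nodes: elements are partitioned into colors such that no two elements of one color touch the same node, so each color can be summed by a plain kernel launch without atomics. The sketch below shows the pattern only; the data structures are illustrative assumptions, not those of the seismic code.

    /* Hedged sketch: sum the contributions of the elements of ONE color into
       the global nodal array. Within a color no node is shared between
       elements, so the read-modify-write below needs no atomic operations.
       The host loops over colors, launching this kernel once per color. */
    __global__ void sum_color(const int *elem_ids, int n_elem_in_color,
                              const int *elem_nodes, int nodes_per_elem,
                              const float *elem_contrib, float *node_accel)
    {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= n_elem_in_color) return;
        int elem = elem_ids[e];                 /* global element index */
        for (int a = 0; a < nodes_per_elem; ++a) {
            int node = elem_nodes[elem * nodes_per_elem + a];
            node_accel[node] += elem_contrib[elem * nodes_per_elem + a];
        }
    }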

9.
An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced and its portability to graphics processing units (GPUs) is explored. The computations are performed on the NVIDIA GTX480 GPU, and the results are compared with those obtained on a single core of an Intel Core i7-920 (2.67 GHz) in terms of calculation runtime. The BCR linear solver achieves a maximum speedup of 5.84× with a block size of 32 over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via an alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves a maximum total speedup of 7.45× for the piecewise quadratic solution over the CPU platform in double precision.
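Cyclic reduction parallelizes a tridiagonal solve by letting each even-numbered equation eliminate its two odd-numbered neighbors, halving the system at every level; in the block variant (BCR) the scalar coefficients become small blocks and the divisions become block solves. The kernel below is a hedged sketch of one scalar forward-reduction step with illustrative indexing, not the paper's blocked, optimized implementation.

    /* Hedged sketch: one forward-reduction level of scalar cyclic reduction
       for a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i]. Each thread updates
       one equation at the current stride; the host repeats with
       stride = 1, 2, 4, ... until one equation remains, then back-substitutes. */
    __global__ void cr_forward_step(double *a, double *b, double *c, double *d,
                                    int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x + 1) * 2 * stride - 1;
        if (i >= n) return;
        int im = i - stride, ip = i + stride;
        double k1 = a[i] / b[im];                       /* eliminate x[im] */
        double k2 = (ip < n) ? c[i] / b[ip] : 0.0;      /* eliminate x[ip] */
        b[i] -= c[im] * k1 + ((ip < n) ? a[ip] * k2 : 0.0);
        d[i] -= d[im] * k1 + ((ip < n) ? d[ip] * k2 : 0.0);
        a[i]  = -a[im] * k1;
        c[i]  = (ip < n) ? -c[ip] * k2 : 0.0;
    }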

10.
Sheet forming simulation is very important for vehicle body design. Owing to the increasing complexity and scale of CAE models, the trade-off between accuracy and efficiency has become a bottleneck in practice. Therefore, a parallel explicit finite element (FE) solver for sheet forming based on the graphics processing unit (GPU) architecture is developed. Implementation details with the Compute Unified Device Architecture (CUDA) are considered in this work. A pre-index strategy is suggested for parallelizing the assembly of nodal forces, and a parallel reduction method is introduced for calculating the global time step. To ensure the reliability and accuracy of the GPU-based program, double-precision floating-point arithmetic and intrinsic functions are used in the explicit FE computations. Simulations on a commercial NVIDIA GTX285 device obtain about a 27× speedup over an Intel Q8200 CPU, which demonstrates the efficiency of the parallel sheet forming simulation system.
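The stable global time step of an explicit FE code is the minimum over all per-element critical time steps, which is exactly a parallel min-reduction. The sketch below shows the standard shared-memory block stage and is an illustration of the pattern, not the paper's kernel; a second pass over block_min (or a host reduction) finishes the job.

    /* Hedged sketch: block-level min-reduction of per-element critical time
       steps. Launch as
         min_dt<<<nblocks, nthreads, nthreads * sizeof(double)>>>(...)
       with nthreads a power of two. */
    __global__ void min_dt(const double *elem_dt, double *block_min, int n)
    {
        extern __shared__ double sdt[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        sdt[tid] = (i < n) ? elem_dt[i] : 1.0e30;   /* pad with a huge value */
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && sdt[tid + s] < sdt[tid])
                sdt[tid] = sdt[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_min[blockIdx.x] = sdt[0];         /* per-block minimum */
    }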

11.
Dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs by using NVIDIA's Compute Unified Device Architecture (CUDA) in this paper. Data communication between the GPUs is executed based on POSIX threads. Compared with a single-GPU implementation, this implementation provides faster computation and more storage space, allowing simulations of significantly larger systems. In benchmarks, the performance of the GPUs is compared with that of Material Studio running on a single CPU core: we achieve more than a 90× speedup by using three C2050 GPUs to perform simulations of an 80×80×80 system. This implementation is applied to the study of the dispersancy of lubricant succinimide dispersants. A series of simulations is performed on lubricant–soot–dispersant systems to study the factors influencing dispersancy, including concentration and interaction with the lubricant, and the simulation results agree with the findings of our present work.
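The one-POSIX-thread-per-GPU pattern binds each host thread to one device, after which every CUDA call from that thread targets that device. A minimal sketch follows; run_dpd_partition is a hypothetical placeholder for the per-device simulation loop, and the count of three GPUs simply mirrors the benchmark above.

    #include <pthread.h>
    #include <cuda_runtime.h>

    void run_dpd_partition(int dev);    /* hypothetical per-device DPD loop */

    /* Hedged sketch: each host thread binds to one GPU with cudaSetDevice,
       so all subsequent CUDA calls in that thread run on that device. */
    static void *worker(void *arg)
    {
        int dev = *(int *)arg;
        cudaSetDevice(dev);
        run_dpd_partition(dev);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        int id[3] = {0, 1, 2};              /* e.g. three Tesla C2050 GPUs */
        for (int g = 0; g < 3; ++g)
            pthread_create(&t[g], NULL, worker, &id[g]);
        for (int g = 0; g < 3; ++g)
            pthread_join(t[g], NULL);       /* wait before host-side output */
        return 0;
    }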

12.
13.
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. In this paper, we develop a general-purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single precision floating point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction.

14.
Objective: In recent years, research in binocular stereo vision has increasingly shifted toward real-time strategies, and stereo cost aggregation is the most complex and time-consuming step in stereo vision. We therefore propose a near-real-time stereo cost aggregation algorithm based on general-purpose GPU computing (GPGPU). Method: A local method whose matching accuracy approaches that of global matching algorithms, linear stereo matching, is chosen as the cost aggregation strategy; following the principles of linear cost aggregation, the computation flows of its main steps (cost computation, mean filtering, coefficient solving, etc.) are parallelized and optimized in a targeted way. Results: For the same test samples, the proposed method computes the cost volume in less time on an NVIDIA GTX 780 platform; compared with the original CPU implementation, the efficiency of cost aggregation improves by tens of times on average. Conclusion: This near-real-time stereo cost aggregation method provides an efficient and reliable way to obtain high-quality stereo depth information in real time on ordinary PC platforms.
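Of the steps listed above, the mean filter maps particularly well onto the GPU. The sketch below shows a horizontal running-sum box filter with one thread per row (a vertical pass of the same form completes the 2D mean, assuming the filter radius is smaller than the image width); it illustrates the parallelization pattern only and is not the paper's implementation.

    /* Hedged sketch: horizontal box (mean) filter, one thread per row,
       using a sliding-window running sum so each output costs O(1). */
    __global__ void box_filter_rows(const float *in, float *out,
                                    int width, int height, int radius)
    {
        int y = blockIdx.x * blockDim.x + threadIdx.x;
        if (y >= height) return;
        const float *row = in + y * width;
        float *dst = out + y * width;
        float sum = 0.f;
        for (int x = 0; x <= radius; ++x)
            sum += row[x];                        /* prime the window at x = 0 */
        for (int x = 0; x < width; ++x) {
            int lo = x - radius - 1, hi = x + radius;
            if (hi < width && x > 0) sum += row[hi];   /* enter right sample */
            if (lo >= 0)             sum -= row[lo];   /* drop left sample */
            int n = min(hi, width - 1) - max(lo + 1, 0) + 1;  /* clamped size */
            dst[x] = sum / n;
        }
    }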

15.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the Shallow Water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

16.
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.

17.
We present an implementation approach for Marching Cubes (MC) on graphics hardware for OpenGL 2.0 or comparable graphics APIs. It currently outperforms all other known graphics processing unit (GPU)-based iso-surface extraction algorithms in direct rendering for sparse or large volumes, even those using the recently introduced geometry shader (GS) capabilities. To achieve this, we outfit the Histogram Pyramid (HP) algorithm, previously only used in GPU data compaction, with the capability for arbitrary data expansion. After reformulation of MC as a data compaction and expansion process, the HP algorithm becomes the core of a highly efficient and interactive MC implementation. For graphics hardware lacking GSs, such as mobile GPUs, the concept of HP data expansion is easily generalized, opening new application domains in mobile visual computing. Further, to serve recent developments, we present how the HP can be implemented in the parallel programming language CUDA (Compute Unified Device Architecture) by using a novel 1D chunk/layer construction.

18.
To improve the simulation efficiency of turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based lattice Boltzmann solution for large eddy simulation (LES) using multiple graphics processing units (GPUs) is proposed. Our solution adopts the "collision after propagation" lattice evolution order and folds the misaligned propagation phase into the global-memory read. The latest GPU platform allows a single CPU thread to control up to four GPUs running in parallel; to make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement the Smagorinsky model and the Vreman model respectively to verify our multi-GPU solution. These two LES models have different relaxation-time calculation behavior and lead to different CUDA implementation characteristics. The implementation based on the Smagorinsky model achieves a 190× speedup over the sequential implementation on the CPU, while the implementation based on the Vreman model achieves more than a 90× speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well on multiple GPUs. Large-scale LES–LBM simulation (up to 10,240 × 10,240 lattices) therefore becomes possible at low cost, even using double-precision floating-point calculation.
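In the "collision after propagation" (pull) ordering, each thread fetches the post-streaming populations from its neighbors during the global-memory read and then collides locally, confining the misaligned accesses to reads. The D2Q9/BGK kernel below is a hedged sketch of this ordering under an assumed array layout and periodic boundaries; the Smagorinsky and Vreman models of the paper would compute a local relaxation time in place of the constant tau.

    /* Hedged sketch: D2Q9 pull-scheme LBM step with BGK collision.
       Populations stored as f[q * nx * ny + y * nx + x]. */
    __global__ void lbm_pull_step(const float *f_in, float *f_out,
                                  int nx, int ny, float tau)
    {
        const int   cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        const int   cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
        const float w[9]  = {4.f/9, 1.f/9, 1.f/9, 1.f/9, 1.f/9,
                             1.f/36, 1.f/36, 1.f/36, 1.f/36};
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= nx || y >= ny) return;
        float f[9], rho = 0.f, ux = 0.f, uy = 0.f;
        for (int q = 0; q < 9; ++q) {     /* propagation: pull from upstream */
            int xs = (x - cx[q] + nx) % nx, ys = (y - cy[q] + ny) % ny;
            f[q] = f_in[q * nx * ny + ys * nx + xs];
            rho += f[q]; ux += cx[q] * f[q]; uy += cy[q] * f[q];
        }
        ux /= rho; uy /= rho;
        float usq = 1.5f * (ux * ux + uy * uy);
        for (int q = 0; q < 9; ++q) {     /* collision: BGK relaxation */
            float cu  = 3.f * (cx[q] * ux + cy[q] * uy);
            float feq = w[q] * rho * (1.f + cu + 0.5f * cu * cu - usq);
            f_out[q * nx * ny + y * nx + x] = f[q] - (f[q] - feq) / tau;
        }
    }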

19.
The Stochastic Simulation Algorithm (SSA) developed by Gillespie provides a powerful mechanism for exploring the behavior of chemical systems with small species populations or with important noise contributions. Gene circuit simulations for systems biology commonly employ the SSA method, as do ecological applications. This algorithm tends to be computationally expensive, so researchers seek efficient implementations of SSA. In this program package, the Accelerated Exact Stochastic Simulation Algorithm (AESS) contains optimized implementations of Gillespie's SSA that improve the performance of individual simulation runs or of ensembles of simulations used for sweeping parameters or providing statistically significant results.

Program summary

Program title: AESS
Catalogue identifier: AEJW_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEJW_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: University of Tennessee copyright agreement
No. of lines in distributed program, including test data, etc.: 10 861
No. of bytes in distributed program, including test data, etc.: 394 631
Distribution format: tar.gz
Programming language: C for processors, CUDA for NVIDIA GPUs
Computer: Developed and tested on various x86 computers and NVIDIA C1060 Tesla and GTX 480 Fermi GPUs. The system targets x86 workstations, optionally with multicore processors or NVIDIA GPUs as accelerators.
Operating system: Tested under Ubuntu Linux OS and CentOS 5.5 Linux OS
Classification: 3, 16.12
Nature of problem: Simulation of chemical systems, particularly with low species populations, can be accurately performed using Gillespie's method of stochastic simulation. Numerous variations on the original stochastic simulation algorithm have been developed, including approaches that produce results with statistics that exactly match the chemical master equation (CME) as well as other approaches that approximate the CME.
Solution method: The Accelerated Exact Stochastic Simulation (AESS) tool provides implementations of a wide variety of popular variations on the Gillespie method. Users can select the specific algorithm considered most appropriate. Comparisons between the methods and with other available implementations indicate that AESS provides the fastest known implementation of Gillespie's method for a variety of test models. Users may wish to execute ensembles of simulations to sweep parameters or to obtain better statistical results, so AESS supports acceleration of ensembles of simulations using parallel processing with MPI, SSE vector units on x86 processors, and/or NVIDIA GPUs with CUDA.
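For reference, one step of Gillespie's direct method, the core that AESS and its variants accelerate, can be sketched in plain C as follows. The propensities a[] are assumed precomputed, and rand() stands in for whatever random number generator an optimized implementation would actually use.

    #include <math.h>
    #include <stdlib.h>

    /* Hedged sketch of one step of Gillespie's direct method: draw the time
       to the next reaction from an exponential distribution with rate a0,
       then pick the firing reaction in proportion to its propensity.
       Returns the reaction index, or -1 if no reaction can fire. */
    int ssa_step(const double *a, int n_reactions, double *t)
    {
        double a0 = 0.0;
        for (int i = 0; i < n_reactions; ++i)
            a0 += a[i];                              /* total propensity */
        if (a0 <= 0.0) return -1;

        double r1 = (rand() + 1.0) / (RAND_MAX + 2.0);   /* uniform in (0,1) */
        double r2 = rand() / (double)RAND_MAX;

        *t += -log(r1) / a0;                         /* time of next event */

        double threshold = r2 * a0, partial = 0.0;   /* select reaction mu */
        int mu = 0;
        for (; mu < n_reactions - 1; ++mu) {
            partial += a[mu];
            if (partial >= threshold) break;
        }
        return mu;
    }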

20.
We present a highly general implementation of fast multipole methods on graphics processing units (GPUs). Our two-dimensional double precision code features an asymmetric type of adaptive space discretization leading to a particularly elegant and flexible implementation. All steps of the multipole algorithm are efficiently performed on the GPU, including the initial phase, which assembles the topological information of the input data. Through careful timing experiments, we investigate the effects of the various peculiarities of the GPU architecture.
