期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A parallel lattice Boltzmann method for large eddy simulation on multiple GPUs

Qinjian Li Chengwen Zhong Kai Li Guangyong Zhang Xiaowei Lu Qing Zhang Kaiyong Zhao Xiaowen Chu 《Computing》2014,96(6):479-501

To improve the simulation efficiency of turbulent fluid flows at high Reynolds numbers with large eddy dynamics, a CUDA-based simulation solution of lattice Boltzmann method for large eddy simulation (LES) using multiple graphics processing units (GPUs) is proposed. Our solution adopts the “collision after propagation” lattice evolution way and puts the misaligned propagation phase at global memory read process. The latest GPU platform allows a single CPU thread to control up to four GPUs that run in parallel. In order to make use of multiple GPUs, the whole working set is evenly partitioned into sub-domains. We implement Smagorinsky model and Vreman model respectively to verify our multi-GPU solution. These two LES models have different relaxation time calculation behavior and lead to different CUDA implementation characteristics. The implementation based on Smagorinsky model achieves 190 times speedup over the sequential implementation on CPU, while the implementation based on Vreman model archives more than 90 times speedup. The experimental results show that the parallel performance of our multi-GPU solution scales very well on multiple GPUs. Therefore large-scale (up to 10,240 $\times $ 10,240 lattices) LES–LBM simulation becomes possible at a low cost, even using double-precision floating point calculation. 相似文献

2.

CUDA体系结构上的一层浅水系统的模拟

张哲《微型机与应用》2012,31(10):85-88

对于使用支持NVIDACUDA程序设计模型的GPU的二维一层浅水系统,给出了如何加速平衡性良好的有限体积模式的数值解,同时给出并实现了在单双浮点精度下使用CUDA模型利用潜在数据并行的算法。数值实验表明,CUDA体系结构的求解程序比CPU并行实现求解程序高效。相似文献

3.

Sailfish: A flexible multi-GPU implementation of the lattice Boltzmann method

M. Januszewski M. Kostur 《Computer Physics Communications》2014

We present Sailfish, an open source fluid simulation package implementing the lattice Boltzmann method (LBM) on modern Graphics Processing Units (GPUs) using CUDA/OpenCL. We take a novel approach to GPU code implementation and use run-time code generation techniques and a high level programming language (Python) to achieve state of the art performance, while allowing easy experimentation with different LBM models and tuning for various types of hardware. We discuss the general design principles of the code, scaling to multiple GPUs in a distributed environment, as well as the GPU implementation and optimization of many different LBM models, both single component (BGK, MRT, ELBM) and multicomponent (Shan–Chen, free energy). The paper also presents results of performance benchmarks spanning the last three NVIDIA GPU generations (Tesla, Fermi, Kepler), which we hope will be useful for researchers working with this type of hardware and similar codes. 相似文献

4.

Memory transfer optimization for a lattice Boltzmann solver on Kepler architecture nVidia GPUs

Mark J. Mawson Alistair J. Revell 《Computer Physics Communications》2014

The Lattice Boltzmann method (LBM) for solving fluid flow is naturally well suited to an efficient implementation for massively parallel computing, due to the prevalence of local operations in the algorithm. This paper presents and analyses the performance of a 3D lattice Boltzmann solver, optimized for third generation nVidia GPU hardware, also known as ‘Kepler’. We provide a review of previous optimization strategies and analyse data read/write times for different memory types. In LBM, the time propagation step (known as streaming), involves shifting data to adjacent locations and is central to parallel performance; here we examine three approaches which make use of different hardware options. Two of which make use of ‘performance enhancing’ features of the GPU; shared memory and the new shuffle instruction found in Kepler based GPUs. These are compared to a standard transfer of data which relies instead on optimized storage to increase coalesced access. It is shown that the more simple approach is most efficient; since the need for large numbers of registers per thread in LBM limits the block size and thus the efficiency of these special features is reduced. Detailed results are obtained for a D3Q19 LBM solver, which is benchmarked on nVidia K5000M and K20C GPUs. In the latter case the use of a read-only data cache is explored, and peak performance of over 1036 Million Lattice Updates Per Second (MLUPS) is achieved. The appearance of a periodic bottleneck in the solver performance is also reported, believed to be hardware related; spikes in iteration-time occur with a frequency of around 11 Hz for both GPUs, independent of the size of the problem. 相似文献

5.

Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision

Peter H. Colberg Felix Höfling 《Computer Physics Communications》2011,(5):1120-1129

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores. 相似文献

6.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implemented these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author?s knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. Accuracy evaluation by comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement depending on the graphics card used, the problem size, and the precision when comparing to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support the visualization of the simulation results and thus the whole MHD simulation and visualization process can be performed entirely on GPUs. 相似文献

7.

Simulation of one-layer shallow water systems on multicore and CUDA architectures

Marc de la Asunción José M. Mantas Manuel J. Castro 《The Journal of supercomputing》2011,58(2):206-214

The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the big dimensions of the domains suggests the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. An algorithm which exploits the potential data parallelism of this method is presented and implemented using the CUDA model in single and double floating point precision. Numerical experiments show the high efficiency of this CUDA solver in comparison with a CPU parallel implementation of the solver and with respect to a previously existing GPU solver based on a shading language. 相似文献

8.

An area efficient multi-mode quadruple precision floating point adder

《Microprocessors and Microsystems》2016

Most of the scientific and engineering applications require accurate computations. Double precision floating point computations are not enough for many applications like climate modelling, computational physics, etc. Efficient design of quadruple precision floating point adder is needed for these applications. The proposed multi-mode quadruple precision floating point adder architecture supports four single precision operations in parallel, as well as two double precision operations in parallel and also supports one quadruple precision operation. Compared to existing Quadruple precision floating point adders and Dual mode Quadruple precision floating point adder, the proposed architecture can perform more computations with less area because of resource sharing among different precision operands. The proposed Multi-mode quadruple precision adder supports both normal and subnormal operations and also the exceptional case handling such as infinity, Not a Number (NaN) and zero cases. The proposed adder has been designed and implemented in both ASIC and FPGA. During ASIC implementation with 90 nm technology using the synopsis tool, the proposed Multi-mode quadruple precision floating point adder has a 38.57% smaller area compared to the existing quadruple precision floating point adder. Similarly, the proposed design reduces the area by 29.28% and 35.68% when implemented on Virtex 4 and Virtex 5 FPGAs respectively. 相似文献

9.

A new approach to the lattice Boltzmann method for graphics processing units

Christian Obrecht Frédéric Kuznik Bernard Tourancheau Jean-Jacques Roux 《Computers & Mathematics with Applications》2011,61(12):3628-3638

Emerging many-core processors, like CUDA capable nVidia GPUs, are promising platforms for regular parallel algorithms such as the Lattice Boltzmann Method (LBM). Since the global memory for graphic devices shows high latency and LBM is data intensive, the memory access pattern is an important issue for achieving good performances. Whenever possible, global memory loads and stores should be coalescent and aligned, but the propagation phase in LBM can lead to frequent misaligned memory accesses. Most previous CUDA implementations of 3D LBM addressed this problem by using low latency on chip shared memory. Instead of this, our CUDA implementation of LBM follows carefully chosen data transfer schemes in global memory. For the 3D lid-driven cavity test case, we obtained up to 86% of the global memory maximal throughput on nVidia’s GT200. We show that as a consequence highly efficient implementations of LBM on GPUs are possible, even for complex models. 相似文献

10.

Scientific computation for simulations on programmable graphics hardware

Robert Strzodka Michael Doggett Andreas Kolb 《Simulation Modelling Practice and Theory》2005,13(8):667-680

Graphics processor units (GPUs) have emerged as powerful parallel processors in recent years. Although floating point computations and high level programming languages are now available, the efficient use of the enormous computing power of GPUs still requires a significant amount of graphics specific knowledge.The paper explains how to use GPUs for scientific computations without graphics specific terminology. It offers an algorithmic view on GPUs with comparisons to cache aware and parallel programming of CPUs. Two typical simulation techniques, namely grid based and particle based methods are discussed. 相似文献

11.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

12.

GPU-accelerated minimum distance and clearance queries

Krishnamurthy A McMains S Haller K 《IEEE transactions on visualization and computer graphics》2011,17(6):729-742

We present practical algorithms for accelerating distance queries on models made of trimmed NURBS surfaces using programmable Graphics Processing Units (GPUs). We provide a generalized framework for using GPUs as coprocessors in accelerating CAD operations. By supplementing surface data with a surface bounding-box hierarchy on the GPU, we answer distance queries such as finding the closest point on a curved NURBS surface given any point in space and evaluating the clearance between two solid models constructed using multiple NURBS surfaces. We simultaneously output the parameter values corresponding to the solution of these queries along with the model space values. Though our algorithms make use of the programmable fragment processor, the accuracy is based on the model space precision, unlike earlier graphics algorithms that were based only on image space precision. In addition, we provide theoretical bounds for both the computed minimum distance values as well as the location of the closest point. Our algorithms are at least an order of magnitude faster and about two orders of magnitude more accurate than the commercial solid modeling kernel ACIS. 相似文献

13.

A floating point conversion algorithm for mixed precision computations

Choon Lih HOO Sallehuddin Mohamed HARIS Nik Abdullah Nik MOHAMED 《浙江大学学报:C卷英文版》2012,(9):711-718

The floating point number is the most commonly used real number representation for digital computations due to its high precision characteristics. It is used on computers and on single chip applications such as DSP chips. Double precision (64-bit) representations allow for a wider range of real numbers to be denoted. However, single precision (32-bit) operations are more efficient. Recently, there has been an increasing interest in mixed precision computations which take advantage of single precision efficiency on 64-bit numbers. This calls for the ability to interchange between the two formats. In this paper, an algorithm that converts floating point numbers from 64- to 32-bit representations is presented. The algorithm was implemented as a Verilog code and tested on field programmable gate array (FPGA) using the Quartus II DE2 board and Agilent 16821A portable logic analyzer. Results indicate that the algorithm can perform the conversion reliably and accurately within a constant execution time of 25 ns with a 20 MHz clock frequency regardless of the number being converted. 相似文献

14.

Emerging Architectures Enable to Boost Massively Parallel Data Mining Using Adaptive Sparse Grids

Alexander Heinecke Dirk Pflüger 《International journal of parallel programming》2013,41(3):357-399

Gaining knowledge out of vast datasets is a main challenge in data-driven applications nowadays. Sparse grids provide a numerical method for both classification and regression in data mining which scales only linearly in the number of data points and is thus well-suited for huge amounts of data. Due to the recursive nature of sparse grid algorithms and their classical random memory access pattern, they impose a challenge for the parallelization on modern hardware architectures such as accelerators. In this paper, we present the parallelization on several current task- and data-parallel platforms, covering multi-core CPUs with vector units, GPUs, and hybrid systems. We demonstrate that a less efficient implementation from an algorithmical point of view can be beneficial if it allows vectorization and a higher degree of parallelism instead. Furthermore, we analyze the suitability of parallel programming languages for the implementation. Considering hardware, we restrict ourselves to the x86 platform with SSE and AVX vector extensions and to NVIDIA’s Fermi architecture for GPUs. We consider both multi-core CPU and GPU architectures independently, as well as hybrid systems with up to 12 cores and 2 Fermi GPUs. With respect to parallel programming, we examine both the open standard OpenCL and Intel Array Building Blocks, a recently introduced high-level programming approach, and comment on their ease of use. As the baseline, we use the best results obtained with classically parallelized sparse grid algorithms and their OpenMP-parallelized intrinsics counterpart (SSE and AVX instructions), reporting both single and double precision measurements. The huge data sets we use are a real-life dataset stemming from astrophysics and artificial ones, all of which exhibit challenging properties. In all settings, we achieve excellent results, obtaining speedups of up to 188 × using single precision on a hybrid system. 相似文献

15.

Highly interactive computational steering for coupled 3D flow problems utilizing multiple GPUs

Jan Linxweiler Manfred Krafczyk Jonas Tölke 《Computing and Visualization in Science》2010,13(7):299-314

Most computational fluid dynamics (CFD) simulations require massive computational power which is usually provided by traditional High Performance Computing (HPC) environments. Although interactivity of the simulation process is highly appreciated by scientists and engineers, due to limitations of typical HPC environments, present CFD simulations are usually executed non interactively. A recent trend is to harness the parallel computational power of graphics processing units (GPUs) for general purpose applications. As an alternative to traditional massively parallel computing, GPU computing has also gained popularity in the CFD community, especially for its application to the lattice Boltzmann method (LBM). For instance, Tölke and others presented very efficient implementations of the LBM for 2D as well as 3D space (Toelke J, in Comput Visual Sci. (2008); Toelke J and Krafczk M, in Int J Comput Fluid Dyn 22(7): 443–456 (2008)). In this work we motivate the use of GPU computing to facilitate interactive CFD simulations. In our approach, the simulation is executed on multiple GPUs instead of traditional HPC environments, which allows the integration of the complete simulation process into a single desktop application. To demonstrate the feasibility of our approach, we show a fully bidirectional fluid-structure-interaction for self induced membrane oscillations in a turbulent flow. The efficiency of the approach allows a 3D simulation close to realtime. 相似文献

16.

TightCCD: Efficient and Robust Continuous Collision Detection using Tight Error Bounds

下载免费PDF全文

Zhendong Wang Min Tang Ruofeng Tong Dinesh Manocha 《Computer Graphics Forum》2015,34(7):289-298

http://gamma.cs.unc.edu/BSC/ We present a realtime and reliable continuous collision detection (CCD) algorithm between triangulated models that exploits the floating point hardware capability of current CPUs and GPUs. Our formulation is based on Bernstein Sign Classification that takes advantage of the geometry properties of Bernstein basis and Bézier curves to perform Boolean collision queries. We derive tight numerical error bounds on the computations and employ those bounds to design an accurate algorithm using finite‐precision arithmetic. Compared with prior floatingpoint CCD algorithms, our approach eliminates all the false negatives and 90–95% of the false positives. We integrated our algorithm (TightCCD) with physically‐based simulation system and observe speedups in collision queries of 5–15X compared with prior reliable CCD algorithms. Furthermore, we demonstrate its benefits in terms of improving the performance or robustness of cloth simulation systems. 相似文献

17.

VelociTI结构浮点DSPs寄存器堆读写的流水线设计

下载免费PDF全文

胡正伟仲顺安陈禾《计算机工程》2007,33(21):237-239

研究了VelociTI结构浮点数字信号处理器寄存器堆的流水线读写原理并提出了一种设计方法。该方法对单操作数双精度浮点指令采用2个32位数据通路用1个流水线周期读取源操作数，双操作数双精度浮点指令采用锁定译码单元，利用若干流水线周期读取源操作数。采用写控制向量的方法实现了流水线多个周期执行写操作。该方法正确实现了基于IEEE754标准的双精度浮点数据在寄存器堆与功能单元之间的32位数据通路上的传输，仿真结果验证了其正确性。相似文献

18.

Quantum Monte Carlo on graphical processing units

Amos G. Anderson William A. Goddard III 《Computer Physics Communications》2007,177(3):298-306

Quantum Monte Carlo (QMC) is among the most accurate methods for solving the time independent Schrödinger equation. Unfortunately, the method is very expensive and requires a vast array of computing resources in order to obtain results of a reasonable convergence level. On the other hand, the method is not only easily parallelizable across CPU clusters, but as we report here, it also has a high degree of data parallelism. This facilitates the use of recent technological advances in Graphical Processing Units (GPUs), a powerful type of processor well known to computer gamers. In this paper we report on an end-to-end QMC application with core elements of the algorithm running on a GPU. With individual kernels achieving as much as 30× speed up, the overall application performs at up to 6× faster relative to an optimized CPU implementation, yet requires only a modest increase in hardware cost. This demonstrates the speedup improvements possible for QMC in running on advanced hardware, thus exploring a path toward providing QMC level accuracy as a more standard tool. The major current challenge in running codes of this type on the GPU arises from the lack of fully compliant IEEE floating point implementations. To achieve better accuracy we propose the use of the Kahan summation formula in matrix multiplications. While this drops overall performance, we demonstrate that the proposed new algorithm can match CPU single precision. 相似文献

19.

Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct–MPI hybrid approach

Un-Hong Wong Takayuki Aoki Hon-Cheng Wong 《Computer Physics Communications》2014

Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs. 相似文献

20.

Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters

《Parallel Computing》2015

Computational fluid dynamic simulations are in general very compute intensive. Only by parallel simulations on modern supercomputers the computational demands of complex simulation tasks can be satisfied. Facing these computational demands GPUs offer high performance, as they provide the high floating point performance and memory to processor chip bandwidth. To successfully utilize GPU clusters for the daily business of a large community, usable software frameworks must be established on these clusters. The development of such software frameworks is only feasible with maintainable software designs that consider performance as a design objective right from the start. For this work we extend the software design concepts to achieve more efficient and highly scalable multi-GPU parallelization within our software framework waLBerla for multi-physics simulations centered around the lattice Boltzmann method. Our software designs now also support a pure-MPI and a hybrid parallelization approach capable of heterogeneous simulations using CPUs and GPUs in parallel. For the first time weak and strong scaling performance results obtained on the Tsubame 2.0 cluster for more than 1000 GPUs are presented using waLBerla. With the help of a new communication model the parallel efficiency of our implementation is investigated and analyzed in a detailed and structured performance analysis. The suitability of the waLBerla framework for production runs on large GPU clusters is demonstrated. As one possible application we show results of strong scaling experiments for flows through a porous medium. 相似文献