期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implemented these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author?s knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. Accuracy evaluation by comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement depending on the graphics card used, the problem size, and the precision when comparing to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support the visualization of the simulation results and thus the whole MHD simulation and visualization process can be performed entirely on GPUs. 相似文献

2.

Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct–MPI hybrid approach

Un-Hong Wong Takayuki Aoki Hon-Cheng Wong 《Computer Physics Communications》2014

Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs. 相似文献

3.

Massively parallelized replica-exchange simulations of polymers on GPUs

Jonathan Gross Wolfhard Janke Michael Bachmann 《Computer Physics Communications》2011,(8):1638-1644

We discuss the advantages of parallelization by multithreading on graphics processing units (GPUs) for parallel tempering Monte Carlo computer simulations of an exemplified bead-spring model for homopolymers. Since the sampling of a large ensemble of conformations is a prerequisite for the precise estimation of statistical quantities such as typical indicators for conformational transitions like the peak structure of the specific heat, the advantage of a strong increase in performance of Monte Carlo simulations cannot be overestimated. Employing multithreading and utilizing the massive power of the large number of cores on GPUs, being available in modern but standard graphics cards, we find a rapid increase in efficiency when porting parts of the code from the central processing unit (CPU) to the GPU. 相似文献

4.

Multi-agent simulation on multiple GPUs

《Simulation Modelling Practice and Theory》2015

Multi-agent simulation is widely used in many areas including biology, economic, political, and environmental science to study complex systems. Unfortunately, it is computationally expensive. In this paper, we shall explore the implementation of a general multi-agent simulation system on a system with multiple GPUs acting as accelerators. In particular, we have ported the popular Java multi-agent simulation framework MASON to a nVidia CUDA-based multi-GPU setting. We evaluated our implementation over different numbers and types of nVidia GPUs. For our evaluation, we ported three models in the original version of MASON. On the well-known Boids model, we achieved a speedup of

187 \times

. Using a fictional model, we showed that speedup of up to

468 \times

is possible. In the paper, we shall also describe the detailed internals of our system, and the various issues we encountered and how they were solved. 相似文献

5.

Designing fast LTL model checking algorithms for many-core GPUs

Jiří BarnatAuthor Vitae Petr BauchAuthor VitaeLuboš BrimAuthor Vitae Milan Češka 《Journal of Parallel and Distributed Computing》2012

Recent technological developments made various many-core hardware platforms widely accessible. These massively parallel architectures have been used to significantly accelerate many computation demanding tasks. In this paper, we show how the algorithms for LTL model checking can be redesigned in order to accelerate LTL model checking on many-core GPU platforms. Our detailed experimental evaluation demonstrates that using the NVIDIA CUDA technology results in a significant speedup of the verification process. Together with state space generation based on shared hash-table and DFS exploration, our CUDA accelerated model checker is the fastest among state-of-the-art shared memory model checking tools. 相似文献

6.

Accelerating Advanced MRI Reconstructions on GPUs 总被引：1，自引：0，他引：1

Stone SS Haldar JP Tsao SC Hwu WM Sutton BP Liang ZP 《Journal of Parallel and Distributed Computing》2008,68(10):1307-1318

Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128(3) voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%. 相似文献

7.

Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision

Peter H. Colberg Felix Höfling 《Computer Physics Communications》2011,(5):1120-1129

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores. 相似文献

8.

A parallel computational model for GATE simulations

F.R. Rannou N. Vega-Acevedo Z. El Bitar 《Computer methods and programs in biomedicine》2013

GATE/Geant4 Monte Carlo simulations are computationally demanding applications, requiring thousands of processor hours to produce realistic results. The classical strategy of distributing the simulation of individual events does not apply efficiently for Positron Emission Tomography (PET) experiments, because it requires a centralized coincidence processing and large communication overheads. We propose a parallel computational model for GATE that handles event generation and coincidence processing in a simple and efficient way by decentralizing event generation and processing but maintaining a centralized event and time coordinator. The model is implemented with the inclusion of a new set of factory classes that can run the same executable in sequential or parallel mode. A Mann–Whitney test shows that the output produced by this parallel model in terms of number of tallies is equivalent (but not equal) to its sequential counterpart. Computational performance evaluation shows that the software is scalable and well balanced. 相似文献

9.

基于GPU的实时超分辨率算法实现

章拓王知衍《广东电脑与电讯》2009,(3)

高分辨率显示设备的发展意味着需要高分辨率的图象与之匹配。本文通过GPU,实现了一种实时超分辨率,使分辨率较低的视频资料在高分辨率显示设备上有较好的显示效果。相似文献

10.

Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU

George Stantchev William Dorland Nail Gumerov 《Journal of Parallel and Distributed Computing》2008

Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular we focus on fast algorithms for the performance bottleneck operation of Particle-To-Grid interpolation. 相似文献

11.

Accelerating dissipative particle dynamics simulations on GPUs: Algorithms,numerics and applications

Yu-Hang Tang George Em Karniadakis 《Computer Physics Communications》2014

We present a scalable dissipative particle dynamics simulation code, fully implemented on the Graphics Processing Units (GPUs) using a hybrid CUDA/MPI programming model, which achieves 10–30 times speedup on a single GPU over 16 CPU cores and almost linear weak scaling across a thousand nodes. A unified framework is developed within which the efficient generation of the neighbor list and maintaining particle data locality are addressed. Our algorithm generates strictly ordered neighbor lists in parallel, while the construction is deterministic and makes no use of atomic operations or sorting. Such neighbor list leads to optimal data loading efficiency when combined with a two-level particle reordering scheme. A faster in situ generation scheme for Gaussian random numbers is proposed using precomputed binary signatures. We designed custom transcendental functions that are fast and accurate for evaluating the pairwise interaction. The correctness and accuracy of the code is verified through a set of test cases simulating Poiseuille flow and spontaneous vesicle formation. Computer benchmarks demonstrate the speedup of our implementation over the CPU implementation as well as strong and weak scalability. A large-scale simulation of spontaneous vesicle formation consisting of 128 million particles was conducted to further illustrate the practicality of our code in real-world applications. 相似文献

12.

Long-range interactions and parallel scalability in molecular simulations

Michael Patra Emma Falck Ilpo Vattulainen Mikko Karttunen 《Computer Physics Communications》2007,176(1):14-22

Typical biomolecular systems such as cellular membranes, DNA, and protein complexes are highly charged. Thus, efficient and accurate treatment of electrostatic interactions is of great importance in computational modeling of such systems. We have employed the GROMACS simulation package to perform extensive benchmarking of different commonly used electrostatic schemes on a range of computer architectures (Pentium-4, IBM Power 4, and Apple/IBM G5) for single processor and parallel performance up to 8 nodes—we have also tested the scalability on four different networks, namely Infiniband, GigaBit Ethernet, Fast Ethernet, and nearly uniform memory architecture, i.e. communication between CPUs is possible by directly reading from or writing to other CPUs' local memory. It turns out that the particle-mesh Ewald method (PME) performs surprisingly well and offers competitive performance unless parallel runs on PC hardware with older network infrastructure are needed. Lipid bilayers of sizes 128, 512 and 2048 lipid molecules were used as the test systems representing typical cases encountered in biomolecular simulations. Our results enable an accurate prediction of computational speed on most current computing systems, both for serial and parallel runs. These results should be helpful in, for example, choosing the most suitable configuration for a small departmental computer cluster. 相似文献

13.

Accelerating dissipative particle dynamics with multiple GPUs

Sibo Wang Junbo Xu Hao Wen 《Computer Physics Communications》2013

Dissipative particle dynamics (DPD) simulation is implemented on multiple GPUs by using NVIDIA’s Compute Unified Device Architecture (CUDA) in this paper. Data communication between each GPU is executed based on the POSIX thread. Compared with the single-GPU implementation, this implementation can provide faster computation speed and more storage space to perform simulations on a significant larger system. In benchmark, the performance of GPUs is compared with that of Material Studio running on a single CPU core. We can achieve more than 90x speedup by using three C2050 GPUs to perform simulations on an 80∗80∗80 system. This implementation is applied to the study on the dispersancy of lubricant succinimide dispersants. A series of simulations are performed on lubricant–soot–dispersant systems to study the impact factors including concentration and interaction with lubricant on the dispersancy, and the simulation results are agreed with the study in our present work. 相似文献

14.

《Mathematics and computers in simulation》2013

For the calibration of the parameters in static and dynamic SABR stochastic volatility models, we propose the application of the GPU technology to the Simulated Annealing global optimization algorithm and to the Monte Carlo simulation. This calibration has been performed for EURO STOXX 50 index and EUR/USD exchange rate with an asymptotic formula for volatility or Monte Carlo simulation. Moreover, in the dynamic model we propose an original more general expression for the functional parameters, specially well suited for the EUR/USD exchange rate case. Numerical results illustrate the expected behavior of both SABR models and the accuracy of the calibration. In terms of computational time, when the asymptotic formula for volatility is used the speedup with respect to CPU computation is around 200 with one GPU. Furthermore, GPU technology allows the use of Monte Carlo simulation for calibration purposes, the computational time with CPU being prohibitive. 相似文献

15.

高光谱遥感蚀变填图SCM并行算法设计与实现

程宾洋王茂芝罗耀华郭科《软件》2012,(8):144-146

由于空间和波谱分辨率的不断提高,高光谱遥感影像的海量数据特性导致高光谱遥感影像并行处理成为遥感影像处理技术的发展趋势。本文基于CUDA和GPU环境,设计并实现了高光谱遥感蚀变填图的SCM并行算法。实验结果表明,并行加速比可达到25,SCM并行算法能有效改善算法性能。相似文献

16.

Polymer field-theory simulations on graphics processing units

Kris T. Delaney Glenn H. Fredrickson 《Computer Physics Communications》2013

We report the first CUDA™ graphics-processing-unit (GPU) implementation of the polymer field-theoretic simulation framework for determining fully fluctuating expectation values of equilibrium properties for periodic and select aperiodic polymer systems. Our implementation is suitable both for self-consistent field theory (mean-field) solutions of the field equations, and for fully fluctuating simulations using the complex Langevin approach. Running on NVIDIA^® Tesla T20 series GPUs, we find double-precision speedups of up to 30×

30 \times

compared to single-core serial calculations on a recent reference CPU, while single-precision calculations proceed up to 60×

60 \times

faster than those on the single CPU core. Due to intensive communications overhead, an MPI implementation running on 64 CPU cores remains two times slower than a single GPU. 相似文献

17.

GPU通用计算及其在计算智能领域的应用

丁科谭营《智能系统学报》2015,(1):1-11

在日趋复杂的图形处理任务的推动下,GPU已经演化成为具有众多计算核心、计算能力强大的通用计算设备,并被越来越多地应用于图形处理之外的计算领域。GPU具有高并行、低能耗和低成本的特点,在数据并行度高的计算任务中,相比与传统的CPU平台有着显著的优势。随着GPU体系结构的不断演进以及开发平台的逐步完善,GPU已经进入到高性能计算的主流行列。GPU通用计算的普及,使个人和小型机构能有机会获得以往昂贵的大型、超级计算机才能提供的计算能力,并一定程度上改变了科学计算领域的格局和编程开发模式。GPU提供的强大计算能力极大地推动了计算智能的发展,并且已经在深度学习和群体智能优化方法等子领域获得了巨大的成功,更是在图像、语音等领域取得了突破性的进展。随着人工智能技术和方法的不断进步,GPU将在更多的领域获得更加广泛的应用。相似文献

18.

Solving knapsack problems on GPU

V. Boyer D. El Baz M. Elkihel 《Computers & Operations Research》2012,39(1):42-47

A parallel implementation via CUDA of the dynamic programming method for the knapsack problem on NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for computational tests and processing times obtained with the parallel code are compared to the sequential one on a CPU with an Intel Xeon 3.0 GHz. The results show a speedup factor of 26 for large size problems. Furthermore, in order to limit the communication between the CPU and the GPU, a compression technique is presented which decreases significantly the memory occupancy. 相似文献

19.

Multiparticle collision dynamics: GPU accelerated particle-based mesoscale hydrodynamic simulations

E. Westphal S.P. Singh C.-C. Huang G. Gompper R.G. Winkler 《Computer Physics Communications》2014

相似文献

20.

Molecular dynamics simulations of the relaxation processes in the condensed matter on GPUs

I.V. Morozov A.M. Kazennov R.G. Bystryi G.E. Norman V.V. Pisarev V.V. Stegailov 《Computer Physics Communications》2011,(9):1974-1978

We report on simulation technique and benchmarks for molecular dynamics simulations of the relaxation processes in solids and liquids using the graphics processing units (GPUs). The implementation of a many-body potential such as the embedded atom method (EAM) on GPU is discussed. The benchmarks obtained by LAMMPS and HOOMD packages for simple Lennard-Jones liquids and metals using EAM potentials are presented for both Intel CPUs and Nvidia GPUs. As an example the crystallization rate of the supercooled Al melt is computed. 相似文献