期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU

George Stantchev William Dorland Nail Gumerov 《Journal of Parallel and Distributed Computing》2008

Particle-In-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular we focus on fast algorithms for the performance bottleneck operation of Particle-To-Grid interpolation. 相似文献

2.

Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct–MPI hybrid approach

Un-Hong Wong Takayuki Aoki Hon-Cheng Wong 《Computer Physics Communications》2014

Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs. 相似文献

3.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implemented these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the author?s knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. Accuracy evaluation by comparing single and double precision computation results is also given. Performance measurements of both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement depending on the graphics card used, the problem size, and the precision when comparing to the original serial CPU MHD implementation. In addition, we extend GPU-MHD to support the visualization of the simulation results and thus the whole MHD simulation and visualization process can be performed entirely on GPUs. 相似文献

4.

Highly accelerated simulations of glassy dynamics using GPUs: Caveats on limited floating-point precision

Peter H. Colberg Felix Höfling 《Computer Physics Communications》2011,(5):1120-1129

Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on the numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single-floating point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial implementation for the CPU, and a single GPU was found to compare with a parallelised MD simulation using 64 distributed cores. 相似文献

5.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations

《Parallel Computing》2014,40(5-6):86-99

Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. 相似文献

6.

A graphics processor-based intranuclear cascade and evaporation simulation

H. Wan Chan Tseung C. Beltran 《Computer Physics Communications》2014

Monte Carlo simulations of the transport of protons in human tissue have been deployed on graphics processing units (GPUs) with impressive results. To provide a more complete treatment of non-elastic nuclear interactions in these simulations, we developed a fast intranuclear cascade-evaporation simulation for the GPU. This can be used to model non-elastic proton collisions on any therapeutically relevant nuclei at incident energies between 20 and 250 MeV. Predictions are in good agreement with Geant4.9.6p2. It takes approximately 2 s to calculate 1×10⁶ 200 MeV proton–¹⁶O interactions on a NVIDIA GTX680 GPU. A speed-up factor of ∼20 relative to one Intel i7-3820 core processor thread was achieved. 相似文献

7.

Improvements of the particle-in-cell code EUTERPE for petascaling machines

Xavier Sáez Alejandro Soba Edilberto Sánchez Ralf Kleiber Francisco Castejón José M. Cela 《Computer Physics Communications》2011,(9):2047-2051

In the present work we report some performance measures and computational improvements recently carried out using the gyrokinetic code EUTERPE (Jost, 2000 [1] and Jost et al., 1999 [2]), which is based on the general particle-in-cell (PIC) method. The scalability of the code has been studied for up to sixty thousand processing elements and some steps towards a complete hybridization of the code were made. As a numerical example, non-linear simulations of Ion Temperature Gradient (ITG) instabilities have been carried out in screw-pinch geometry and the results are compared with earlier works. A parametric study of the influence of variables (step size of the time integrator, number of markers, grid size) on the quality of the simulation is presented. 相似文献

8.

Application of a total variation diminishing scheme to electromagnetic hybrid particle-in-cell plasma simulation

Masaharu Matsumoto Yoshihiro Kajimura Hideyuki Usui Ikkoh Funaki Iku Shinohara 《Computer Physics Communications》2012,183(10):2027-2034

A discretization procedure for a total variation diminishing (TVD) scheme is introduced to an electromagnetic hybrid particle-in-cell (PIC) plasma simulation code in order to improve the numerical stability and resolution when calculating the plasma flow field in which magnetic field discontinuities (for example, Rankine–Hugoniot jump conditions for shock waves) are generated. In the hybrid PIC code used in this study, ions are treated as particles and electrons are assumed to be an inertia-less (mass-less) fluid. In the numerical results of one-dimensional test simulations, the TVD scheme significantly prevents non-physical, numerical oscillations, which would ordinarily be produced in the solution when the convection term of the magnetic induction equation in the hybrid PIC code is discretized by central difference schemes at magnetic field discontinuities. Furthermore, a two-dimensional simulation of the global structure of a collision-less bow shock, which is suitable for practical use, makes it possible to clearly capture the bow shock by using the hybrid PIC code with the TVD scheme. 相似文献

9.

Providing Source Code Level Portability Between CPU and GPU with MapCG

下载免费PDF全文

洪春涛陈德颢陈羽北陈文光郑纬民林海波《计算机科学技术学报》2012,27(1):42-56

Graphics processing units (GPU) have taken an important role in the general purpose computing market in recent years.At present,the common approach to programming GPU units is to write GPU specific cod... 相似文献

10.

Three-dimensional electromagnetic particle-in-cell with Monte Carlo collision simulations on three MIMD parallel computers

J. Wang P. Liewer E. Huang 《The Journal of supercomputing》1997,10(4):331-348

A three-dimensional electromagnetic particle-in-cell code with Monte Carlo collision (PIC-MCC) is developed for MIMD parallel supercomputers. This code uses a standard relativistic leapfrog scheme incorporating Monte Carlo calculations to push plasma particles and to include collisional effects on particle orbits. A local finite-difference time-domain method is used to update the self-consistent electromagnetic fields. The code is implemented using the General Concurrent PIC (GCPIC) algorithm, which uses domain decomposition to divide the computation among the processors. Particles must be exchanged between processors as they move among subdomains. Message passing is implemented using the Express Cubix library and the PVM. We evaluate the performance of this code using a 512-processor Intel Touchstone Delta, a 512-processor Intel Paragon, and a 256-processor CRAY T3D. It is shown that a high parallel efficiency exceeding 95% has been achieved on all three machines for large problems. We have run PIC-MCC simulations using several hundred million particles with several million collisions per time step. For these large-scale simulations the particle push time achieved is in the range of 90–115 ns/particle/time step, and the collision calculation time in the range of a few hundred nanoseconds per collision. 相似文献

11.

A global collisionless PIC code in magnetic coordinates

S. Jolliet A. Bottino R. Hatzky B.F. Mcmillan K. Appert L. Villard 《Computer Physics Communications》2007,177(5):409-425

A global plasma turbulence simulation code, ORB5, is presented. It solves the gyrokinetic electrostatic equations including zonal flows in axisymmetric magnetic geometry. The present version of the code assumes a Boltzmann electron response on magnetic surfaces. It uses a Particle-In-Cell (PIC), δf scheme, 3D cubic B-splines finite elements for the field solver and several numerical noise reduction techniques. A particular feature is the use of straight-field-line magnetic coordinates and a field-aligned Fourier filtering technique that dramatically improves the performance of the code in terms of both the numerical noise reduction and the maximum time step allowed. Another feature is the capability to treat arbitrary axisymmetric ideal MHD equilibrium configurations. The code is heavily parallelized, with scalability demonstrated up to 4096 processors and 10⁹ marker particles. Various numerical convergence tests are performed. The code is validated against an analytical theory of zonal flow residual, geodesic acoustic oscillations and damping, and against other codes for a selection of linear and nonlinear tests. 相似文献

12.

Swan: A tool for porting CUDA programs to OpenCL 总被引：1，自引：0，他引：1

M.J. Harvey G. De Fabritiis 《Computer Physics Communications》2011,(4):1093-1099

The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: SwanCatalogue identifier: AEIH_v1_0Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.htmlProgram obtainable from: CPC Program Library, Queen's University, Belfast, N. IrelandLicensing provisions: GNU Public License version 2No. of lines in distributed program, including test data, etc.: 17 736No. of bytes in distributed program, including test data, etc.: 131 177Distribution format: tar.gzProgramming language: CComputer: PCOperating system: LinuxRAM: 256 MbytesClassification: 6.5External routines: NVIDIA CUDA, OpenCLNature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore ×86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.Solution method:Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.Restrictions: No support for CUDA C++ featuresRunning time: Nominal 相似文献

13.

The GPU on the simulation of cellular computing models

José M. Cecilia José M. García Ginés D. Guerrero Miguel A. Martínez-del-Amor Mario J. Pérez-Jiménez Manuel Ujaldón 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2012,16(2):231-246

Membrane Computing is a discipline aiming to abstract formal computing models, called membrane systems or P systems, from the structure and functioning of the living cells as well as from the cooperation of cells in tissues, organs, and other higher order structures. This framework provides polynomial time solutions to NP-complete problems by trading space for time, and whose efficient simulation poses challenges in three different aspects: an intrinsic massively parallelism of P systems, an exponential computational workspace, and a non-intensive floating point nature. In this paper, we analyze the simulation of a family of recognizer P systems with active membranes that solves the Satisfiability problem in linear time on different instances of Graphics Processing Units (GPUs). For an efficient handling of the exponential workspace created by the P systems computation, we enable different data policies to increase memory bandwidth and exploit data locality through tiling and dynamic queues. Parallelism inherent to the target P system is also managed to demonstrate that GPUs offer a valid alternative for high-performance computing at a considerably lower cost. Furthermore, scalability is demonstrated on the way to the largest problem size we were able to run, and considering the new hardware generation from Nvidia, Fermi, for a total speed-up exceeding four orders of magnitude when running our simulations on the Tesla S2050 server. 相似文献

14.

Parallel particle simulation in high voltage diodes (Algorithms and concepts for implementation on SUPRENUM)

David Seldner Manfred Alef Thomas Westermann Eberhard Halter 《Parallel Computing》1988,7(3):445-449

A two-dimensional Particle-In-Cell (PIC) code has been developed in order to support the experimental investigation of pulsed power ion diodes by numerical simulation. The PIC code serves to simulate the orbits of electrically charged particles in electro-magnetic fields. Most modules of the PIC code have been vectorized; for parallelization the logical structure of the code is being considered. In this paper we introduce the idea of a PIC code, outline the single modules, and present concepts for implementation on SUPRENUM. 相似文献

15.

Towards a parallel component in a GPU–CUDA environment: a case study with the L-BFGS Harwell routine

L. D'Amore G. Laccetti D. Romano G. Scotti A. Murli 《国际计算机数学杂志》2015,92(1):59-76

Modern graphics processing units (GPUs) have been at the leading edge of increasing parallelism over the last 10 years. This fact has encouraged the use of GPUs in a broader range of applications, where developers are required to lever age this technology with new programming models which ease the task of writing programs to run efficiently on GPUs. In this paper, we discuss the main guidelines to assist the developer when porting sequential scientific code on modern GPUs. These guidelines were carried out by porting the L-BFGS, the (Limited memory-) BFGS algorithm for large scale optimization, available as Harwell routine VA15. The specific interest in the L-BFGS algorithm arises from the fact that this is the computational module with the longest running time of a Oceanographic Data Assimilation application software, on which some of the authors are working. 相似文献

16.

Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster

Wang Xian Aoki Takayuki 《Parallel Computing》2011,37(9):521-535

GPGPU has drawn much attention on accelerating non-graphic applications. The simulation by D3Q19 model of the lattice Boltzmann method was executed successfully on multi-node GPU cluster by using CUDA programming and MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of Tokyo Institute of Technology, in which a total of 680 GPUs of NVIDIA Tesla are equipped. For multi-GPU computation, domain partitioning method is used to distribute computational load to multiple GPUs and GPU-to-GPU data transfer becomes severe overhead for the total performance. Comparison and analysis were made among the parallel results by 1D, 2D and 3D domain partitionings. As a result, with 384 × 384 × 384 mesh system and 96 GPUs, the performance by 3D partitioning is about 3-4 times higher than that by 1D partitioning. The performance curve is deviated from the idealistic line due to the long communicational time between GPUs. In order to hide the communication time, we introduced the overlapping technique between computation and communication, in which the data transfer process and computation were done in two streams simultaneously. Using 8-96 GPUs, the performances increase by a factor about 1.1-1.3 with a overlapping mode. As a benchmark problem, a large-scaled computation of a flow around a sphere at Re = 13,000 was carried on successfully using the mesh system 2000 × 1000 × 1000 and 100 GPUs. For such a computation with 2 Giga lattice nodes, 6.0 h were used for processing 100,000 time steps. Under this condition, the computational time (2.79 h) and the data communication time (3.06 h) are almost the same. 相似文献

17.

VForce: An environment for portable applications on high performance systems with accelerators

Nicholas Moore Miriam Leeser Laurie Smith King 《Journal of Parallel and Distributed Computing》2012

Special Purpose Processors (SPPs), including Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs), are increasingly being used to accelerate scientific applications. VForce aims to aid application programmers in using such accelerators with minimal changes in user code. VForce is an extensible middleware framework that enables VSIPL++ (the Vector Signal Image Processing Library extension) programs to transparently use Special Purpose Processors (SPPs) while maintaining portability across platforms with and without SPP hardware. The framework is designed to maintain a VSIPL++-like environment and hide hardware-specific details from the application programmer while preserving performance and productivity. VForce focuses on the interface between application code and accelerator code. The same application code can run in software on a general purpose processor or take advantage of SPPs if they are available. VForce is unique in that it supports calls to both FPGAs and GPUs while requiring no changes in user code. Results on systems with NVIDIA Tesla GPUs and Xilinx FPGAs are presented. This paper describes VForce, illustrates its support for portability, and discusses lessons learned for providing support for different hardware configurations at run time. Key considerations involve global knowledge about the relationship between processing steps for defining application mapping, memory allocation, and task parallelism. 相似文献

18.

Vortex methods for incompressible flow simulations on the GPU

Diego Rossinelli Petros Koumoutsakos 《The Visual computer》2008,24(7-9):699-708

We present a remeshed vortex particle method for incompressible flow simulations on GPUs. The particles are convected in a Lagrangian frame and are periodically reinitialized on a regular grid. The grid is used in addition to solve for the velocity–vorticity Poisson equation and for the computation of the diffusion operators. In the present GPU implementation of particle methods, the remeshing and the solution of the Poisson equation rely on fast and efficient mesh-particle interpolations. We demonstrate that particle remeshing introduces minimal artificial dissipation, enables a faster computation of differential operators on particles over grid-free techniques and can be efficiently implemented on GPUs. The results demonstrate that, contrary to common practice in particle simulations, it is necessary to remesh the (vortex) particle locations in order to solve accurately the equations they discretize, without compromising the speed of the method. The present method leads to simulations of incompressible vortical flows on GPUs with unprecedented accuracy and efficiency. 相似文献

19.

Implementation of particle-in-cell stellar dynamics codes on the connection machine-2

R. G. Hohlfeld N. F. Comins D. Shalit P. A. Shorey R. C. Giles 《The Journal of supercomputing》1993,7(4):417-436

The development of massively parallel supercomputers provides a unique opportunity to advance the state of the art inN-body simulations. TheseN-body codes are of great importance for simulations in stellar dynamics and plasma physics. For systems with long-range forces, such as gravity or electromagnetic forces, it is important to increase the number of particles toN 10⁷ particles. Significantly improved modeling ofN body systems can be expected by increasingN, arising from a more realistic representation of physical transport processes involving particle diffusion and energy and momentum transport. In addition, it will be possible to guarantee that physically significant portions of complex physical systems, such as Lindblad resonances of galaxies or current sheets in magnetospheres, will have an adequate population of particles for a realistic simulation. Particle-mesh (PM) and particle-particle particle-mesh (P³M) algorithms present the best prospects for the simulation of large-scaleN-body systems. As an example we present a two-dimensional PM simulation of a disk galaxy that we have developed on the Connection Machine-2, a massively parallel boolean hypercube supercomputer. The code is scalable to any CM-2 configuration available and, on the largest configuration, simulations withN = 128 M = 2²⁷ particles are possible in reasonable run times. 相似文献

20.

Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs

R.G. Edgar M.A. Clark K. Dale 《Computer Physics Communications》2010,181(10):1707-1714

The Murchison Widefield Array (MWA) is a next-generation radio telescope currently under construction in the remote Western Australia Outback. Raw data will be generated continuously at 5 GiB s⁻¹, grouped into 8 s cadences. This high throughput motivates the development of on-site, real time processing and reduction in preference to archiving, transport and off-line processing. Each batch of 8 s data must be completely reduced before the next batch arrives. Maintaining real time operation will require a sustained performance of around 2.5 TFLOP s⁻¹ (including convolutions, FFTs, interpolations and matrix multiplications). We describe a scalable heterogeneous computing pipeline implementation, exploiting both the high computing density and FLOP-per-Watt ratio of modern GPUs. The architecture is highly parallel within and across nodes, with all major processing elements performed by GPUs. Necessary scatter-gather operations along the pipeline are loosely synchronized between the nodes hosting the GPUs. The MWA will be a frontier scientific instrument and a pathfinder for planned peta- and exa-scale facilities. 相似文献