1.

Acoustic scattering solver based on single level FMM for multi-GPU systems

Miguel López-Portugués Jesús A. López-Fernández Jonatan Menéndez-Canal Alberto Rodríguez-Campa José Ranilla 《Journal of Parallel and Distributed Computing》2012

In this paper, we present a heterogeneous parallel solver of a high frequency single level Fast Multipole Method (FMM) for the Helmholtz equation applied to acoustic scattering. The developed solution uses multiple GPUs to tackle the compute bound steps of the FMM (aggregation, disaggregation, and near interactions) while the CPU handles a memory bound step (translation) using OpenMP. The proposed solver performance is measured on a workstation with two GPUs (NVIDIA GTX 480) and is compared with that of a distributed memory solver run on a cluster of 32 nodes (HP BL465c) with an Infiniband network. Some energy efficiency results are also presented in this work.

2.

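Simulation of one-layer shallow water systems on the CUDA architecture

Zhang Zhe 《Microcomputer & Its Applications》2012,31(10):85-88

This paper describes how to accelerate the numerical solution of a well-balanced finite volume scheme for 2D one-layer shallow water systems using GPUs that support the NVIDIA CUDA programming model. An algorithm that exploits the potential data parallelism of the scheme is presented and implemented with the CUDA model in single and double floating-point precision. Numerical experiments show that the CUDA solver is more efficient than a parallel CPU implementation of the solver.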

3.

Simulation of one-layer shallow water systems on multicore and CUDA architectures

Marc de la Asunción José M. Mantas Manuel J. Castro 《The Journal of supercomputing》2011,58(2):206-214

The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the large size of the domains suggests the use of powerful accelerators to obtain numerical results in a reasonable time. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. An algorithm which exploits the potential data parallelism of this method is presented and implemented using the CUDA model in single and double floating point precision. Numerical experiments show the high efficiency of this CUDA solver in comparison with a CPU parallel implementation of the solver and with respect to a previously existing GPU solver based on a shading language.

4.

Evaluation of fermion loops applied to the calculation of the η′ mass and the nucleon scalar and electromagnetic form factors

C. Alexandrou K. Hadjiyiannakou G. Koutsou A. OʼCais A. Strelchenko 《Computer Physics Communications》2012,183(6):1215-1224

The exact evaluation of the disconnected diagram contributions to the flavor-singlet pseudo-scalar meson mass, the nucleon σ-term and the nucleon electromagnetic form factors is carried out utilizing GPGPU technology with the NVIDIA CUDA platform. The disconnected loops are also computed using stochastic methods with several noise reduction techniques. Various dilution schemes as well as the truncated solver method are studied. We make a comparison of these stochastic techniques to the exact results and show that the number of noise vectors depends on the operator insertion in the fermion loop.

5.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip, initially designed for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operations is demonstrated, resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.

6.

Single- and multi-GPU computing on NVIDIA- and AMD-based server platforms for solidification modeling application

Kamil Halbiniak Norbert Meyer Krzysztof Rojek 《Concurrency and Computation》2024,36(9):e8000

This work explores the performance of single- and multi-GPU computing on state-of-the-art NVIDIA- and AMD-based server-class hardware using various programming interfaces to accelerate a real-world scientific application for solidification modeling based on the phase-field method. The main computations of this memory-bound application correspond to 20 stencils computed across grid nodes. We investigate the application's scalability for two basic schemes of organizing computation: without and with hiding data transfers behind computation, combined with using either peer-to-peer inter-GPU data transfers through NVIDIA NVLink and AMD Infinity interconnects or communication over the PCIe and main memory. The studied programming interfaces are CUDA, HIP, and the OpenMP Accelerator Model. While the first two are designed for writing code for a specific hardware platform, OpenMP enables code portability between NVIDIA and AMD GPUs. The resulting performance is experimentally assessed on computing platforms containing NVIDIA V100 (up to 8 GPUs) and A100 (one GPU), as well as AMD MI210 (one device) and MI250 (up to 8 logical GPUs).

7.

Computing matrix trigonometric functions with GPUs through Matlab

Alonso Pedro Peinado Jesús Ibáñez Javier Sastre Jorge Defez Emilio 《The Journal of supercomputing》2019,75(3):1227-1240

This paper presents an implementation of one of the most up-to-date algorithms proposed to compute the matrix trigonometric functions sine and cosine. The method used is based on Taylor series approximations, which make intensive use of matrix multiplications. To accelerate matrix products, our application can use from one to four NVIDIA GPUs by using the NVIDIA cuBLAS and cuBLASXt libraries. The application, implemented in C++, can be used from the Matlab command line thanks to the mex files provided. We experimentally assess our implementation on modern and very high-performance NVIDIA GPUs.
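
As a rough illustration of why such a method is dominated by matrix products, the truncated series cos(A) ≈ Σ (-1)^k A^(2k) / (2k)! can be sketched in plain Python. This is only a minimal sketch under stated assumptions: the function names (`matmul`, `cos_taylor`) and the fixed number of terms are hypothetical, and the paper's algorithm is more elaborate than this plain truncation and runs its products on GPUs.

```python
import math


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def cos_taylor(a, terms=12):
    """Approximate cos(A) with the first `terms` terms of its Taylor series."""
    n = len(a)
    a2 = matmul(a, a)       # A^2 is reused by every term
    result = identity(n)    # k = 0 term
    term = identity(n)      # holds (-1)^k A^(2k) / (2k)!
    for k in range(1, terms):
        # One matrix product per term: multiply by A^2, then rescale.
        term = matmul(term, a2)
        scale = -1.0 / ((2 * k - 1) * (2 * k))
        term = [[scale * x for x in row] for row in term]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    return result


if __name__ == "__main__":
    # For a diagonal matrix, cos(A) is the cosine of each diagonal entry.
    c = cos_taylor([[0.5, 0.0], [0.0, 1.0]])
    print(abs(c[0][0] - math.cos(0.5)), abs(c[1][1] - math.cos(1.0)))
```

Each loop iteration costs one matrix product, which is the operation the abstract says is offloaded to the GPU libraries.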

8.

A discontinuous Galerkin method with block cyclic reduction solver for simulating compressible flows on GPUs

Behzad Baghapour Mohammad Torabzadeh Hossein Mahmoodi Darian 《International Journal of Computer Mathematics》2015,92(1):110-131

An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced and its portability to graphics processing units (GPUs) is explored. The computations are performed on the NVIDIA GTX480 GPU. The results are compared with those obtained on a single core of an Intel Core i7-920 (2.67 GHz) in terms of calculation runtime. The BCR linear solver achieves a maximum speedup of 5.84x with a block size of 32 over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via an alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves a maximum total speedup of 7.45x for the piecewise quadratic solution over the CPU platform in double precision.
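
For readers unfamiliar with the CPU baseline named above, the Thomas algorithm can be sketched for the scalar tridiagonal case as follows. This is a minimal sketch, not the paper's code: the paper works with block tridiagonal systems, and the function name `thomas_solve` and its argument layout are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a scalar tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    No pivoting is done, so the matrix is assumed well conditioned
    (for example, diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n   # modified super-diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: row i depends on the result of row i-1.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: x[i] depends on x[i+1].
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both loops carry a dependence from one row to the next, so the method is inherently sequential along the system. Cyclic reduction reorganizes the elimination so that many rows can be processed at once, which is the usual motivation for preferring it on GPUs.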

9.

A GPGPU solution of the FMM near interactions for acoustic scattering problems

Miguel López-Portugués Jesús A. López-Fernández Alberto Rodríguez-Campa José Ranilla 《The Journal of supercomputing》2011,58(3):283-291

The Fast Multipole Method (FMM) is especially suitable for applications in which it is necessary to predict the acoustic scattering, e.g., aircraft noise control. This accelerated iterative method has two main parts, far interactions and near interactions. Near interactions are computationally intensive and they fit properly in the Single Instruction Multiple Threads paradigm. In this work, we present a heterogeneous parallel solution in which the near interactions are computed using Graphical Processing Units (GPUs). The performance of the proposed solution is evaluated using a workstation with one NVIDIA GTX480 GPU and a cluster of 32 HP BL465c nodes with an Infiniband network.

10.

Parallelization of Full Search Motion Estimation Algorithm for Parallel and Distributed Platforms

Eduarda Monteiro Bruno Vizzotto Cláudio Diniz Marilena Maule Bruno Zatt Sergio Bampi 《International journal of parallel programming》2014,42(2):239-264

This work presents an efficient method to map the Full Search algorithm for Motion Estimation (ME) onto General Purpose Graphic Processing Unit (GPGPU) architectures using the Compute Unified Device Architecture (CUDA) programming model. Our method jointly exploits the massive parallelism available in current GPGPU devices and the parallelism potential of the Full Search algorithm. Our main goal is to evaluate the feasibility of implementing video codecs on GPGPUs and its advantages and drawbacks compared to other platforms. Therefore, for comparison, three solutions were developed using distinct programming paradigms for distinct underlying hardware architectures: (i) a sequential solution for a general-purpose processor (GPP); (ii) a parallel solution for a multi-core GPP using the OpenMP library; (iii) a distributed solution for cluster/grid machines using the Message Passing Interface (MPI) library. The CUDA-based solution for GPGPUs achieves a speed-up consistent with that indicated by the theoretical model for different search areas. Our GPGPU Full Search Motion Estimation provides 2×, 20× and 1664× speed-up when compared to the MPI, OpenMP and sequential implementations, respectively. Compared to the state of the art, our solution reaches up to 17× speed-up.
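
The Full Search algorithm itself is simple to state: for a block of the current frame, every displacement inside a search window of the reference frame is tested and the best match is kept. The sketch below is a sequential illustration only, under stated assumptions: it uses the sum of absolute differences (SAD) as the matching cost, which is a common choice but is not confirmed by the abstract, and the names `sad` and `full_search` are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Test every displacement in the window; return (best_cost, (dx, dy))."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + size > width or y0 + size > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

Every candidate displacement, and every block, can be evaluated independently of the others, which is the parallelism potential the abstract refers to.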

11.

Speeding up solving of differential matrix Riccati equations using GPGPU computing and MATLAB

Jesús Peinado Jacinto J. Ibáñez Enrique Arias Vicente Hernández 《Concurrency and Computation》2012,24(12):1334-1348

In this work, we developed a parallel algorithm to speed up the resolution of differential matrix Riccati equations using a backward differentiation formula algorithm based on a fixed‐point method. The role and use of differential matrix Riccati equations is especially important in several applications such as optimal control, filtering, and estimation. In some cases, the problem could be large, and it is interesting to speed it up as much as possible. Recently, modern graphics processing units (GPUs) have been used as a way to improve performance. In this paper, we used an approach based on general‐purpose computing on graphics processing units. We used NVIDIA GPUs with unified architecture. To do this, a special version of basic linear algebra subprograms for GPUs, called CUBLAS, and a package (three different packages were studied) to solve linear systems using GPUs have been used. Moreover, we developed a MATLAB toolkit to use our implementation from MATLAB in such a way that if the user has a graphics card, the performance of the implementation is improved. If the user does not have such a card, the algorithm can also be run using the machine CPU. Experimental results on an NVIDIA Quadro FX 5800 are shown.

12.

GPU-accelerated indirect boundary element method for voxel model analyses with fast multipole method

Shoji Hamada 《Computer Physics Communications》2011,(5):1162-1168

An indirect boundary element method (BEM) that uses the fast multipole method (FMM) was accelerated using graphics processing units (GPUs) to reduce the time required to calculate a three-dimensional electrostatic field. The BEM is designed to handle cubic voxel models and is specialized to consider square voxel walls as boundary surface elements. The FMM handles the interactions among the surface charge elements and directly outputs surface integrals of the fields over each individual element. The CPU code was originally developed for field analysis in human voxel models derived from anatomical images. FMM processes are programmed using the NVIDIA Compute Unified Device Architecture (CUDA) with double-precision floating-point arithmetic on the basis of a shared pseudocode template. The electric field induced by DC-current application between two electrodes is calculated for two models with 499,629 (model 1) and 1,458,813 (model 2) surface elements. The calculation times were measured with a four-GPU configuration (two NVIDIA GTX295 cards) with four CPU cores (an Intel Core i7-975 processor). The times required by a linear system solver are 31 s and 186 s for models 1 and 2, respectively. The speed-up ratios of the FMM range from 5.9 to 8.2 for model 1 and from 5.0 to 5.6 for model 2. The calculation speed for element-interaction in this BEM analysis was comparable to that of particle-interaction using FMM on a GPU.

13.

Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

Scott Rostrup Hans De Sterck 《Computer Physics Communications》2010,181(12):2164-2179

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.

Program summary

Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on Problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.

14.

CFD-based analysis and two-level aerodynamic optimization on graphics processing units

I.C. Kampolis X.S. Trompoukis V.G. Asouti K.C. Giannakoglou 《Computer Methods in Applied Mechanics and Engineering》2010,199(9-12):712-722

This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids, from the CPU to the graphics processing unit (GPU; NVIDIA’s GeForce GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code. Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs), though each with different evaluation tools. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software.

15.

Multi-GPU performance of incompressible flow computation by lattice Boltzmann method on GPU cluster

Wang Xian Aoki Takayuki 《Parallel Computing》2011,37(9):521-535

GPGPU has drawn much attention for accelerating non-graphic applications. A simulation using the D3Q19 model of the lattice Boltzmann method was executed successfully on a multi-node GPU cluster by using CUDA programming and the MPI library. The GPU code runs on the multi-node GPU cluster TSUBAME of Tokyo Institute of Technology, which is equipped with a total of 680 NVIDIA Tesla GPUs. For multi-GPU computation, a domain partitioning method is used to distribute the computational load to multiple GPUs, and GPU-to-GPU data transfer becomes a severe overhead for the total performance. The parallel results obtained with 1D, 2D and 3D domain partitionings were compared and analyzed. As a result, with a 384 × 384 × 384 mesh system and 96 GPUs, the performance with 3D partitioning is about 3-4 times higher than that with 1D partitioning. The performance curve deviates from the ideal line because of the long communication time between GPUs. In order to hide the communication time, we introduced an overlapping technique between computation and communication, in which the data transfer process and computation were done in two streams simultaneously. Using 8-96 GPUs, the performance increases by a factor of about 1.1-1.3 in the overlapping mode. As a benchmark problem, a large-scale computation of a flow around a sphere at Re = 13,000 was carried out successfully using a 2000 × 1000 × 1000 mesh system and 100 GPUs. For such a computation with 2 giga lattice nodes, 6.0 h were used for processing 100,000 time steps. Under this condition, the computational time (2.79 h) and the data communication time (3.06 h) are almost the same.
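
One way to see why the partitioning dimension matters is to count the cells on the faces that an interior subdomain has to exchange with its neighbours. The sketch below does this for the 384 × 384 × 384 mesh and 96 GPUs mentioned above. It rests on assumptions that are not in the abstract: the 3D layout is taken to be 4 × 4 × 6 subdomains, the helper name `halo_cells` is hypothetical, and the count ignores which D3Q19 distributions actually cross each face, so it indicates a trend rather than modelling the measured performance.

```python
def halo_cells(mesh, parts):
    """Cells on the exchanged faces of one interior subdomain.

    mesh  -- global mesh size (nx, ny, nz)
    parts -- number of subdomains along each axis (px, py, pz)
    """
    sx, sy, sz = (m // p for m, p in zip(mesh, parts))
    faces = 0
    # An axis that is split contributes two faces normal to that axis.
    if parts[0] > 1:
        faces += 2 * sy * sz
    if parts[1] > 1:
        faces += 2 * sx * sz
    if parts[2] > 1:
        faces += 2 * sx * sy
    return faces


mesh = (384, 384, 384)
one_d = halo_cells(mesh, (96, 1, 1))    # slabs of 4 x 384 x 384 cells
three_d = halo_cells(mesh, (4, 4, 6))   # assumed layout: 96 x 96 x 64 cells

print(one_d, three_d, one_d / three_d)
```

Under these assumptions a 1D slab exchanges 294,912 face cells per step while a 3D subdomain exchanges 43,008, roughly 6.9 times fewer, which is consistent in direction with the advantage reported for 3D partitioning.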

16.

Designing OP2 for GPU architectures

M.B. Giles G.R. Mudalige B. Spencer C. Bertolli I. Reguly 《Journal of Parallel and Distributed Computing》2013

OP2 is an “active” library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient code targeting contemporary GPU platforms. In this we focus on some of the software architecture design choices and low-level optimizations to maximize performance on NVIDIA’s Fermi architecture GPUs. The performance impact of these design choices is quantified on two NVIDIA GPUs (GTX560Ti, Tesla C2070) using the end-to-end performance of an industrial representative CFD application developed using the OP2 API. Results show that for each system, a number of key configuration parameters need to be set carefully in order to gain good performance. Utilizing a recently developed auto-tuning framework, we explore the effect of these parameters, their limitations and insights into optimizations for improved performance.

17.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components. Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

• method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
• method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
• method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
• approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively.

18.

Multi-core-CPU and GPU-accelerated radiative transfer models based on the discrete ordinate method

Dmitry S. Efremenko Diego G. Loyola Adrian Doicu Robert J.D. Spurr 《Computer Physics Communications》2014

The operational processing of remote sensing data usually requires high-performance radiative transfer model (RTM) simulations. To date, multi-core CPUs and also Graphical Processing Units (GPUs) have been used for highly intensive parallel computations. In this paper, we have compared multi-core and GPU implementations of an RTM based on the discrete ordinate solution method. For the GPU implementation, the original CPU code has been redesigned using the C-oriented Compute Unified Device Architecture (CUDA) developed by NVIDIA.

19.

Systematic detection of memory related performance bottlenecks in GPGPU programs

《Journal of Systems Architecture》2016

Graphics processing units (GPUs) are an attractive choice for designing high-performance and energy-efficient software systems. This is because GPUs are capable of executing massively parallel applications. However, the performance of GPUs is limited by contention in the memory subsystems, often resulting in substantial delays and effectively reducing the parallelism. In this paper, we propose GRAB, an automated debugger to aid the development of efficient GPU kernels. GRAB systematically detects, classifies and discovers the root causes of memory-performance bottlenecks in GPUs. We have implemented GRAB and evaluated it with several open-source GPU kernels, including two real-life case studies. We show the usage of GRAB through improvement of GPU kernels on real NVIDIA Tegra K1 hardware, a widely used GPU for mobile and handheld devices. The guidance obtained from GRAB leads to an overall improvement of up to 64%.

20.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.