期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Performance modeling and analysis of heterogeneous lattice Boltzmann simulations on CPU–GPU clusters

《Parallel Computing》2015

Computational fluid dynamic simulations are in general very compute intensive. Only by parallel simulations on modern supercomputers the computational demands of complex simulation tasks can be satisfied. Facing these computational demands GPUs offer high performance, as they provide the high floating point performance and memory to processor chip bandwidth. To successfully utilize GPU clusters for the daily business of a large community, usable software frameworks must be established on these clusters. The development of such software frameworks is only feasible with maintainable software designs that consider performance as a design objective right from the start. For this work we extend the software design concepts to achieve more efficient and highly scalable multi-GPU parallelization within our software framework waLBerla for multi-physics simulations centered around the lattice Boltzmann method. Our software designs now also support a pure-MPI and a hybrid parallelization approach capable of heterogeneous simulations using CPUs and GPUs in parallel. For the first time weak and strong scaling performance results obtained on the Tsubame 2.0 cluster for more than 1000 GPUs are presented using waLBerla. With the help of a new communication model the parallel efficiency of our implementation is investigated and analyzed in a detailed and structured performance analysis. The suitability of the waLBerla framework for production runs on large GPU clusters is demonstrated. As one possible application we show results of strong scaling experiments for flows through a porous medium. 相似文献

2.

Scalable hybrid implementation of the Schur complement method for multi-GPU systems

Sergey Kopysov Igor Kuzmin Nikita Nedozhogin Alexander Novikov Yulia Sagdeeva 《The Journal of supercomputing》2014,69(1):81-88

This paper presents a new hybrid solver based on the Schur complement method, in which computations are distributed between multiple CPUs and GPUs. In this solver, the Schur complement is formed either on CPUs (for small problems) or on GPUs (for large problems). The interface system is solved by a new multi-GPU algorithm implementing the conjugate gradient method with explicit preconditioning. Numerical simulations performed on a hybrid multi-core multi-GPU cluster demonstrate scalability and efficiency of the proposed algorithms. 相似文献

3.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

4.

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference

Jun Chai Huayou Su Mei Wen Xing Cai Nan Wu Chunyuan Zhang 《The Journal of supercomputing》2013,66(1):364-380

Bayesian inference is one of the most important methods for estimating phylogenetic trees in bioinformatics. Due to the potentially huge computational requirements, several parallel algorithms of Bayesian inference have been implemented to run on CPU-based clusters, multicore CPUs, or small clusters of CPUs and GPUs. To the best of our knowledge, however, none of the existing methods is able to simultaneously and fully utilize both CPUs and GPUs for the computations, leaving idle either the CPU part or the GPU part of modern heterogeneous supercomputers. Aiming at an optimized utilization of heterogeneous computing resources, which is a promising hardware architecture for future bioinformatics applications, we present a new hybrid parallel algorithm and implementation of Bayesian phylogenetic inference, which combines MPI, OpenMP, and CUDA programming. The novelty of our algorithm, denoted as oMC³, is its ability of using CPU cores simultaneously with GPUs for the computations, while ensuring a fair work division between the two types of hardware components. We have implemented oMC³ based on MrBayes, which is one of the most popular software packages for Bayesian phylogenetic inference. Numerical experiments show that oMC³ obtains 2.5× speedup over nMC³, which is a cutting-edge GPU implementation of MrBayes, on a single server consisting of two GPUs and sixteen CPU cores. Moreover, oMC³ scales nicely when 128 GPUs and 1536 CPU cores are in use. 相似文献

5.

Simulation of reaction diffusion processes over biologically relevant size and time scales using multi-GPU workstations

《Parallel Computing》2014,40(5-6):86-99

Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems. 相似文献

6.

A Distributed PTX Virtual Machine on Hybrid CPU/GPU Clusters

《Journal of Systems Architecture》2016

Hybrid CPU/GPU cluster recently has drawn lots of attention from high performance computing because of excellent execution performance and energy efficiency. Many supercomputing sites in the newest TOP 500 and Green 500 are built by hybrid CPU/GPU clusters instead of CPU clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high such that most of users usually hesitate to move toward to this new cluster computing platform. To resolve this problem, we propose a distributed PTX virtual machine called BigGPU on heterogeneous clusters in this paper. As named, this virtual machine physically is a distributed system which is aimed at parallel re-compiling and executing the PTX codes by aggregating CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU as a single large-scale GPU. Consequently, they can develop applications by using only CUDA without combining MPI and multithreading APIs while can simultaneously use distributed CPUs and GPUs for resolving the same problem. Moreover, they need not handle the problem of load balance among heterogeneous processors and the constraints of device memory and thread configuration existing in physical GPUs because BigGPU supports large-scale virtual device memory space and thread configuration. On the other hand, we have evaluated the execution performance of BigGPU in this paper. Our experimental results have shown that BigGPU indeed can effectively exploit the computational power of CPUs and GPUs for enhancing the execution performance of user's CUDA programs. 相似文献

7.

Design and analysis of scheduling strategies for multi-CPU and multi-GPU architectures

《Parallel Computing》2015

In this paper, we present a comparison of scheduling strategies for heterogeneous multi-CPU and multi-GPU architectures. We designed and evaluated four scheduling strategies on top of XKaapi runtime: work stealing, data-aware work stealing, locality-aware work stealing, and Heterogeneous Earliest-Finish-Time (HEFT). On a heterogeneous architecture with 12 CPUs and 8 GPUs, we analysed our scheduling strategies with four benchmarks: a BLAS-1 AXPY vector operation, a Jacobi 2D iterative computation, and two linear algebra algorithms Cholesky and LU. We conclude that the use of work stealing may be efficient if task annotations are given along with a data locality strategy. Furthermore, our experimental results suggests that HEFT scheduling performs better on applications with very regular computations and low data locality. 相似文献

8.

Multi-core-CPU and GPU-accelerated radiative transfer models based on the discrete ordinate method

Dmitry S. Efremenko Diego G. Loyola Adrian Doicu Robert J.D. Spurr 《Computer Physics Communications》2014

The operational processing of remote sensing data usually requires high-performance radiative transfer model (RTM) simulations. To date, multi-core CPUs and also Graphical Processing Units (GPUs) have been used for highly intensive parallel computations. In this paper, we have compared multi-core and GPU implementations of an RTM based on the discrete ordinate solution method. To implement GPUs, the original CPU code has been redesigned using the C-oriented Compute Unified Device Architecture (CUDA) developed by NVIDIA. 相似文献

9.

Exploiting hierarchy parallelism for molecular dynamics on a petascale heterogeneous system

Qiang Wu Canqun Yang Tao Tang Liquan Xiao 《Journal of Parallel and Distributed Computing》2013

Heterogeneous systems with nodes containing more than one type of computation units, e.g., central processing units (CPUs) and graphics processing units (GPUs), are becoming popular because of their low cost and high performance. In this paper, we have developed a Three-Level Parallelization Scheme (TLPS) for molecular dynamics (MD) simulation on heterogeneous systems. The scheme exploits multi-level parallelism combining (1) inter-node parallelism using spatial decomposition via message passing, (2) intra-node parallelism using spatial decomposition via dynamically scheduled multi-threading, and (3) intra-chip parallelism using multi-threading and short vector extension in CPUs, and employing multiple CUDA threads in GPUs. By using a hierarchy of parallelism with optimizations such as communication hiding intra-node, and memory optimizations in both CPUs and GPUs, we have implemented and evaluated a MD simulation on a petascale heterogeneous supercomputer TH-1A. The results show that MD simulations can be efficiently parallelized with our TLPS scheme and can benefit from the optimizations. 相似文献

10.

MPtostream:an OpenMP compiler for CPU-GPU heterogeneous parallel systems

《中国科学:信息科学(英文版)》2012,(9):1961-1971

In light of GPUs’ powerful floating-point operation capacity,heterogeneous parallel systems incorporating general purpose CPUs and GPUs have become a highlight in the research field of high performance computing(HPC).However,due to the complexity of programming on GPUs,porting a large number of existing scientific computing applications to the heterogeneous parallel systems remains a big challenge.The OpenMP programming interface is widely adopted on multi-core CPUs in the field of scientific computing.To effectively inherit existing OpenMP applications and reduce the transplant cost,we extend OpenMP with a group of compiler directives,which explicitly divide tasks among the CPU and the GPU,and map time-consuming computing fragments to run on the GPU,thus dramatically simplifying the transplantation.We have designed and implemented MPtoStream,a compiler of the extended OpenMP for AMD’s stream processing GPUs.Our experimental results show that programming with the extended directives deviates from programming with OpenMP by less than 11% modification and achieves significant speedup ranging from 3.1 to 17.3 on a heterogeneous system,incorporating an Intel Xeon E5405 CPU and an AMD FireStream 9250 GPU,over the execution on the Xeon CPU alone. 相似文献

11.

KernelHive: a new workflow‐based framework for multilevel high performance computing using clusters and workstations with CPUs and GPUs

Pawe&#x; Rociszewski Pawe&#x; Czarnul Rafa&#x; Lewandowski Marcel Schally‐Kacprzak 《Concurrency and Computation》2016,28(9):2586-2607

The paper presents a new open‐source framework called KernelHive for multilevel parallelization of computations among various clusters, cluster nodes, and finally, among both CPUs and GPUs for a particular application. An application is modeled as an acyclic directed graph with a possibility to run nodes in parallel and automatic expansion of nodes (called node unrolling) depending on the number of computation units available. A methodology is proposed for parallelization and mapping of an application to the environment that includes selection of devices using a chosen optimizer, selection of best grid configurations for compute devices, optimization of data partitioning and the execution. One of possibly many scheduling algorithms can be selected considering execution time, power consumption, and so on. An easy‐to‐use GUI is provided for modeling and monitoring with a repository of ready‐to‐use constructs and computational kernels. The methodology, execution times, and scalability have been demonstrated for a distributed and parallel password‐breaking example run in a heterogeneous environment with a cluster and servers with different numbers of nodes and both CPUs and GPUs. Additionally, performance of the framework has been compared with an MPI + OpenCL implementation using a parallel geospatial interpolation application employing up to 40 cluster nodes and 320 cores. Copyright © 2015 John Wiley & Sons, Ltd. 相似文献

12.

Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model

《Parallel Computing》2016

We present a method for parallel block-sparse matrix-matrix multiplication on distributed memory clusters. By using a quadtree matrix representation, data locality is exploited without prior information about the matrix sparsity pattern. A distributed quadtree matrix representation is straightforward to implement due to our recent development of the Chunks and Tasks programming model [Parallel Comput. 40, 328 (2014)]. The quadtree representation combined with the Chunks and Tasks model leads to favorable weak and strong scaling of the communication cost with the number of processes, as shown both theoretically and in numerical experiments.Matrices are represented by sparse quadtrees of chunk objects. The leaves in the hierarchy are block-sparse submatrices. Sparsity is dynamically detected by the matrix library and may occur at any level in the hierarchy and/or within the submatrix leaves. In case graphics processing units (GPUs) are available, both CPUs and GPUs are used for leaf-level multiplication work, thus making use of the full computing capacity of each node.The performance is evaluated for matrices with different sparsity structures, including examples from electronic structure calculations. Compared to methods that do not exploit data locality, our locality-aware approach reduces communication significantly, achieving essentially constant communication per node in weak scaling tests. 相似文献

13.

A scalable approach to solving dense linear algebra problems on hybrid CPU‐GPU systems

Fengguang Song Jack Dongarra 《Concurrency and Computation》2015,27(14):3702-3723

Aiming to fully exploit the computing power of all CPUs and all graphics processing units (GPUs) on hybrid CPU‐GPU systems to solve dense linear algebra problems, we design a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs. The new heterogeneous tile algorithms are executed upon our decentralized dynamic scheduling runtime system, which schedules a task graph dynamically and transfers data between compute nodes automatically. The runtime system uses a new distributed task assignment protocol to solve data dependencies between tasks without any coordination between processing units. By overlapping computation and communication through dynamic scheduling, we are able to attain scalable performance for the double‐precision Cholesky factorization and QR factorization. Our approach demonstrates a performance comparable to Intel MKL on shared‐memory multicore systems and better performance than both vendor (e.g., Intel MKL) and open source libraries (e.g., StarPU) in the following three environments: heterogeneous clusters with GPUs, conventional clusters without GPUs, and shared‐memory systems with multiple GPUs. Copyright © 2014 John Wiley & Sons, Ltd. 相似文献

14.

Efficient magnetohydrodynamic simulations on distributed multi-GPU systems using a novel GPU Direct–MPI hybrid approach

Un-Hong Wong Takayuki Aoki Hon-Cheng Wong 《Computer Physics Communications》2014

Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems are needed to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) We exploit GPU Direct 2.0 to speedup the data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) We design Compute Unified Device Architecture (CUDA) kernels instead of using memory copy to speedup the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferable for distributed multi-GPU systems due to its low efficiency of the fragmented data exchange. Our approach has made a breakthrough to make 3D decomposition available on distributed multi-GPU systems. As a result, it can reduce the memory usage and computation time of each partition of the computational domain. Experiment results show twice the FLOPS comparing to common 2D decomposition MPI-only implementation method. The proposed approach has been developed in an efficient implementation for MHD simulations on distributed multi-GPU systems, called MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements are conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for the problem with 1200³ grid points using 216 GPUs. 相似文献

15.

A study of graphics hardware accelerated particle swarm optimization with digital pheromones

Vijay Kalivarapu Eliot Winer 《Structural and Multidisciplinary Optimization》2015,51(6):1281-1304

Programmable Graphics Processing Units (GPUs) have lately become a promising means to perform scientific computations. Modern GPUs have proven to outperform the number of floating point operations when compared to traditional Central Processing Units (CPUs) through inherent data parallel architecture and higher bandwidth capabilities. They allow scientific computations to be performed without noticeable degradation in accuracy in a fraction of the time compared to traditional CPUs at substantially reduced costs, making them viable alternatives to expensive computer clusters or workstations. GPU programmability however, has fostered the development of a variety of programming languages making it challenging to select a computing language and use it consistently without the pitfall of being obsolete. Some GPU languages are hardware specific and are designed to rake in performance boosts when used with their host GPUs (e.g., Nvidia Cuda). Others are operating system specific (e.g., Microsoft HLSL). A few are platform agnostic lending themselves to be used on a workstation with any CPU and a GPU (e.g., GLSL, OpenCL).Of a number of companies and organizations that implement formal optimization into their processes, only a few utilize GPUs. It is either because the others are either vested much into CPU based computing or they are not fully aware of the benefits of implementing population based optimization routines in GPUs. Literature shows a large number of research publications specifically in the field of optimization utilizing GPUs. However, most of them are limited to a specific GPU hardware or addressed specific problems. The diversity in current GPU hardware and software APIs present overwhelming number of choices making it challenging to decide where and how to begin transitioning to GPU based computing, impeding promising computing avenues that relatively is very cost effective. In this paper, the authors precisely intend to address some of these issues by broadly classifying GPU APIs into three categories: 1) Hardware vendor dependent GPU APIs, 2) Graphical in context APIs, and 3) Platform agnostic APIs. Prior work by the authors demonstrated the capability of digital pheromones within Particle Swarm Optimization (PSO) for searching n-dimensional design spaces with improved accuracy, efficiency and reliability in serial and parallel CPU computing environments. To study the impact of GPUs, the authors have taken this digital pheromone variant of PSO and implemented it on three GPU APIs, each representing a category listed above, in a simplistic sense – delegate unconstrained explicit objective function evaluations to GPUs. While this approach itself cannot be considered novel, the takeaways from implementing it on different GPU APIs provided a wealth of information that the authors believe can help optimization companies and organizations make informed decisions in implementing GPUs in their processes. 相似文献

16.

The FLAME approach: From dense linear algebra algorithms to high-performance multi-accelerator implementations

Francisco D. Igual Ernie Chan Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Robert A. van de Geijn Field G. Van Zee 《Journal of Parallel and Distributed Computing》2012

Parallel accelerators are playing an increasingly important role in scientific computing. However, it is perceived that their weakness nowadays is their reduced “programmability” in comparison with traditional general-purpose CPUs. For the domain of dense linear algebra, we demonstrate that this is not necessarily the case. We show how the libflame library carefully layers routines and abstracts details related to storage and computation, so that extending it to take advantage of multiple accelerators is achievable without introducing platform specific complexity into the library code base. We focus on the experience of the library developer as he develops a library routine for a new operation, reduction of a generalized Hermitian positive definite eigenvalue problem to a standard Hermitian form, and configures the library to target a multi-GPU platform. It becomes obvious that the library developer does not need to know about the parallelization or the details of the multi-accelerator platform. Excellent performance on a system with four NVIDIA Tesla C2050 GPUs is reported. This makes libflame the first library to be released that incorporates multi-GPU functionality for dense matrix computations, setting a new standard for performance. 相似文献

17.

GPU-accelerated Hungarian algorithms for the Linear Assignment Problem

《Parallel Computing》2016

In this paper, we describe parallel versions of two different variants (classical and alternating tree) of the Hungarian algorithm for solving the Linear Assignment Problem (LAP). We have chosen Compute Unified Device Architecture (CUDA) enabled NVIDIA Graphics Processing Units (GPU) as the parallel programming architecture because of its ability to perform intense computations on arrays and matrices. The main contribution of this paper is an efficient parallelization of the augmenting path search phase of the Hungarian algorithm. Computational experiments on problems with up to 25 million variables reveal that the GPU-accelerated versions are extremely efficient in solving large problems, as compared to their CPU counterparts. Tremendous parallel speedups are achieved for problems with up to 400 million variables, which are solved within 13 seconds on average. We also tested multi-GPU versions of the two variants on up to 16 GPUs, which show decent scaling behavior for problems with up to 1.6 billion variables and dense cost matrix structure. 相似文献

18.

Scientific computation for simulations on programmable graphics hardware

Robert Strzodka Michael Doggett Andreas Kolb 《Simulation Modelling Practice and Theory》2005,13(8):667-680

Graphics processor units (GPUs) have emerged as powerful parallel processors in recent years. Although floating point computations and high level programming languages are now available, the efficient use of the enormous computing power of GPUs still requires a significant amount of graphics specific knowledge.The paper explains how to use GPUs for scientific computations without graphics specific terminology. It offers an algorithmic view on GPUs with comparisons to cache aware and parallel programming of CPUs. Two typical simulation techniques, namely grid based and particle based methods are discussed. 相似文献

19.

Global magnetohydrodynamic simulations on multiple GPUs

Un-Hong Wong Hon-Cheng Wong Yonghui Ma 《Computer Physics Communications》2014

Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind–magnetosphere interaction. However, the huge computation requirement in global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global magnetohydrodynamic (MHD) simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transferring and kernel processing are managed with CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision. 相似文献

20.

Tile-based level of detail for the parallel age

Niski K Cohen JD 《IEEE transactions on visualization and computer graphics》2007,13(6):1352-1359

Today's PCs incorporate multiple CPUs and GPUs and are easily arranged in clusters for high-performance, interactive graphics. We present an approach based on hierarchical, screen-space tiles to parallelizing rendering with level of detail. Adapttiles, render tiles, and machine tiles are associated with CPUs, GPUs, and PCs, respectively, to efficiently parallelize the workload with good resource utilization. Adaptive tile sizes provide load balancing while our level of detail system allows total and independent management of the load on CPUs and GPUs. We demonstrate our approach on parallel configurations consisting of both single PCs and a cluster of PCs 相似文献