期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Efficient 3D stencil computations using CUDA

Marcin Krotkiewski Marcin Dabrowski 《Parallel Computing》2013

We present an efficient implementation of 7-point and 27-point stencils on high-end Nvidia GPUs. A new method of reading the data from the global memory to the shared memory of thread blocks is developed. The method avoids conditional statements and requires only two coalesced instructions to load the tile data with the halo (ghost zone). Additional optimizations include storing only one XY tile of data at a time in the shared memory to lower shared memory requirements, common subexpression elimination to reduce the number of instructions, and software prefetching to overlap arithmetic and memory instructions, and enhance latency hiding. The efficiency of our implementation is analyzed using a simple stencil memory footprint model that takes into account the actual halo overhead due to the minimum memory transaction size on the GPUs. Through experiments we demonstrate that in our implementation the memory overhead due to the halos is largely eliminated by good reuse of the halo data in the memory caches, and that our method of reading the data is close to optimal in terms of memory bandwidth usage. Detailed performance analysis for single precision stencil computations, and performance results for single and double precision arithmetic on two Tesla cards are presented. Our stencil implementations are more efficient than any other implementation described in the literature to date. On Tesla C2050 with single and double precision arithmetic our 7-point stencil achieves an average throughput of 12.3 and 6.5 Gpts/s, respectively (98 GFLOP/s and 52 GFLOP/s, respectively). The symmetric 27-point stencil sustains a throughput of 10.9 and 5.8 Gpts/s, respectively. 相似文献

2.

Compile-time optimization of near-neighbor communication for scalable shared-memory multiprocessors

David E. Hudak Santosh G. Abraham 《Journal of Parallel and Distributed Computing》1992,15(4)

Scalable shared-memory multiprocessor systems are typically NUMA (nonuniform memory access) machines, where the exploitation of the memory hierarchy is critical to achieving high performance. Iterative data parallel loops with near-neighbor communication account for many important numerical applications. In such loops, the communication of partial results stresses the memory system performance. In this paper, we develop data placement schemes that minimize communication time where the near-neighbor interaction is determined by a stencil. Under a given loop partition, our compile-time algorithm partitions global data into four classes for each processor, with each class requiring specific consistency maintenance requirements. The ADAPT (Automatic Data Allocation and Partitioning Tool) system was implemented to automatically partition parallel code segments for the BBN TC2000, a scalable shared-memory multiprocessor. ADAPT caches global arrays and maintains data consistency in software through instructions that flush data from private caches. Restructuring of a fluid flow code segment by ADAPT improved performance by a factor of more than 3 on the BBN TC2000. Features in current generation pipelined processors with multiple functional units permit the overlap of memory accesses with computation. Our experiments on the BBN TC2000 show that the degree of overlap is limited by architectural parameters, such as the number of CPU registers. 相似文献

3.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

4.

A CPU-GPU hybrid approach for the unsymmetric multifrontal method 总被引：1，自引：0，他引：1

Chenhan D. YuWeichung Wang Dan’l Pierce 《Parallel Computing》2011,37(12):759-770

Multifrontal is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphic processing unit (GPU). We analyze the unsymmetric multifrontal method from both an algorithmic and implementational perspective to see how a GPU, in particular the NVIDIA Tesla C2070, can be used to accelerate the computations. Our main accelerating strategies include (i) performing BLAS on both CPU and GPU, (ii) improving the communication efficiency between the CPU and GPU by using page-locked memory, zero-copy memory, and asynchronous memory copy, and (iii) a modified algorithm that reuses the memory between different GPU tasks and sets thresholds to determine whether certain tasks be performed on the GPU. The proposed acceleration strategies are implemented by modifying UMFPACK, which is an unsymmetric multifrontal linear system solver. Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems. 相似文献

5.

Accelerating Single Iteration Performance of CUDA-Based 3D Reaction–Diffusion Simulations

John K. Holmen David L. Foster 《International journal of parallel programming》2014,42(2):343-363

The most commonly used approach for solving reaction–diffusion systems relies upon stencil computations. Although stencil computations feature low compute intensity, they place high demands on memory bandwidth. Fortunately, GPU computing allows for the heavy reliance of stencil computations on neighboring data points to be exploited to significantly increase simulation speeds by reducing these memory bandwidth demands. Upon reviewing previously published works, a wide-variety of efforts have been made to optimize NVIDIA CUDA-based stencil computations. However, a critical aspect contributing to algorithm performance is commonly glossed over: the halo region loading technique utilized in conjunction with a given spatial blocking technique. This paper presents an in-depth examination of this aspect and the associated single iteration performance impacts when using symmetric, nearest neighbor 19-point stencils. This is accomplished by closely examining how the simulated space is partitioned into thread blocks and the balance between memory accesses, divergence, and computing threads. The resulting optimization strategy for accelerating 3-dimensional reaction–diffusion simulations offers up to 2.45 times speedup for single-precision floating point numbers in reference to GPU-based speedups found within the previously published work that this paper directly extends. In reference to our multithreaded CPU-based implementation, the resulting optimization strategy offers up to 8.69 times speedup for single-precision floating point numbers. 相似文献

6.

A real-time multigrid finite hexahedra method for elasticity simulation using CUDA

Christian Dick Joachim Georgii Rüdiger Westermann 《Simulation Modelling Practice and Theory》2011,19(2):801-816

We present a multigrid approach for simulating elastic deformable objects in real time on recent NVIDIA GPU architectures. To accurately simulate large deformations we consider the co-rotated strain formulation. Our method is based on a finite element discretization of the deformable object using hexahedra. It draws upon recent work on multigrid schemes for the efficient numerical solution of partial differential equations on such discretizations. Due to the regular shape of the numerical stencil induced by the hexahedral regime, and since we use matrix-free formulations of all multigrid steps, computations and data layout can be restructured to avoid execution divergence of parallel running threads and to enable coalescing of memory accesses into single memory transactions. This enables to effectively exploit the GPU’s parallel processing units and high memory bandwidth via the CUDA parallel programming API. We demonstrate performance gains of up to a factor of 27 and 4 compared to a highly optimized CPU implementation on a single CPU core and 8 CPU cores, respectively. For hexahedral models consisting of as many as 269,000 elements our approach achieves physics-based simulation at 11 time steps per second. 相似文献

7.

基于CUDA的直升机旋翼桨叶挥舞角快速测量方法

熊邦书汪建勇黄建萍余磊《测控技术》2016,35(6):30-32

针对基于立体视觉的直升机旋翼桨叶挥舞角测量CPU串行算法耗时多、效率不高的问题,利用图像处理单元(GPU)并行计算的优势,提出一种基于CUDA统一计算设备构架的并行处理快速算法.首先,对算法中最耗时的图像去噪、阈值分割、连通域标记三部分进行并行化设计;然后,采用多层次并行策略将大量密集运算分配到不同的图像处理单元上并行执行,利用共享内存和共享寄存器加速数据访问;最后,进行多次测量实验,结果表明该方法执行效率明显高于CPU串行方法,可满足旋翼桨叶挥舞角快速测量的要求. 相似文献

8.

Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs

Eike Müller Xu Guo Robert Scheichl Sinan Shi 《Computing and Visualization in Science》2013,16(2):41-58

Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in “flat” three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth. 相似文献

9.

Hybrid CPU/GPU/APU accelerated query,insert, update and erase operations in hash tables with string keys

Groth Tobias Groppe Sven Pionteck Thilo Valdiek Franz Koppehel Martin 《Knowledge and Information Systems》2023,65(10):4359-4377

Modern computer systems can use different types of hardware acceleration to achieve massive performance improvements. Some accelerators like FPGA and dedicated GPU (dGPU) need optimized data structures for the best performance and often use dedicated memory. In contrast, APUs, which are a combination of a CPU and an integrated GPU (iGPU), support shared memory and allow the iGPU to work together with the CPU on pointer-based data structures. First, we develop an approach for dGPU to accelerate queries in libcuckoo and robin-map and when looking at accelerating insert, updates and erase operations in the original libcuckoo using OneAPI on an APU. We evaluate the dGPU against the CPU variants and our dGPU approach adapted for the CPU and also in a hybrid context by using longer keys on the CPU and shorter keys on the dGPU. In comparison with the original libcuckoo algorithm, our dGPU approach achieves a speed-up of 2.1, and in comparison with the robin-map a speed-up of 1.5. For hybrid workloads, our approach is efficient if long keys are processed on the CPU and short keys are processed on the dGPU. By processing a mixture of 20% long keys on the CPU and 80% short keys on dGPU, our hybrid approach has a 40% higher throughput than the CPU only approach. In addition, we develop a hybrid APU approach for insert, update and erase operations in the original libcuckoo structure focusing on shared memory with iGPU accelerated look-ups of the positions for insert, update and erase operations.

相似文献

10.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components.Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

•method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
•method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
•method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
•approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively. 相似文献

11.

Ray Accelerator: Efficient and Flexible Ray Tracing on a Heterogeneous Architecture

下载免费PDF全文

R. Barringer M. Andersson T. Akenine‐Möller 《Computer Graphics Forum》2017,36(8):166-177

We present a hybrid ray tracing system, where the work is divided between the CPU cores and the GPU in an integrated chip, and communication occurs via shared memory. Rays are organized in large packets that can be distributed among the two units as needed. Testing visibility between rays and the scene is mostly performed using an optimized kernel on the GPU, but the CPU can help as necessary. The CPU cores typically handle most or all shading, which makes it easy to support complex appearances. For efficiency, the CPU cores shade whole batches of rays by sorting them on material and shading each material using a vectorized kernel. In addition, we introduce a method to support light paths with arbitrary recursion, such as multiple recursive Whitted‐style ray tracing and adaptive sampling where the result of a ray is examined before sending the next, while still batching up rays for the benefit of GPU‐accelerated traversal and vectorized shading. This allows our system to achieve high rendering performance while maintaining the flexibility to accommodate different rendering algorithms. 相似文献

12.

Decentralized semaphore support in a virtual shared-memory system

Mahendra Ramachandran Mukesh Singhal 《The Journal of supercomputing》1995,9(1-2):51-70

Distributed shared memory has increasingly become a desirable programming model on which to program multicomputer systems. Such systems strike a balance between the performance attainable in distributed-memory multiprocessors and the ease of programming on shared-memory systems. In shared-memory systems, concurrent tasks communicate through shared variables, and synchronization of access to shared data is an important issue. Semaphores have been traditionally used to provide this synchronization. In this paper we propose a decentralized scheme to support semaphores in a virtual shared-memory system. Our method of grouping semaphores into semaphore pages and caching a semaphore at a processor on demand eliminates the reliability problems and bottlenecks associated with centralized schemes. We compare the performance of our scheme with a centralized implementation of semaphores and conclude that our system performs better under high semaphore access rates as well as larger numbers of processors. 相似文献

13.

A comparative evaluation of hybrid distributed shared-memory systems

Adrian Moga Michel Dubois 《Journal of Systems Architecture》2009,55(1):43-52

Distributed Shared-Memory (DSM) systems are shared-memory multiprocessor architectures in which each processor node contains a partition of the shared memory. In hybrid DSM systems coherence among caches is maintained by a software-implemented coherence protocol relying on some hardware support. Hardware support is provided to satisfy every node hit (the common case) and software is invoked only for accesses to remote nodes.In this paper we compare the design and performance of four hybrid distributed shared memory (DSM) organizations by detailed simulation of the same hardware platform. We have implemented the software protocol handlers for the four architectures. The handlers are written in C and assembly code. Coherence transactions are executed in trap and interrupt handlers. Together with the application, the handlers are executed in full detail in execution-driven simulations of six complete benchmarks with coarse-grain and fine-grain sharing. We relate our experience implementing and simulating the software protocols for the four architectures.Because the overhead of remote accesses is very high in hybrid systems, the system of choice is different than for purely hardware systems. 相似文献

14.

Anonymous and fault-tolerant shared-memory computing

Rachid Guerraoui Eric Ruppert 《Distributed Computing》2007,20(3):165-177

The vast majority of papers on distributed computing assume that processes are assigned unique identifiers before computation begins. But is this assumption necessary? What if processes do not have unique identifiers or do not wish to divulge them for reasons of privacy? We consider asynchronous shared-memory systems that are anonymous. The shared memory contains only the most common type of shared objects, read/write registers. We investigate, for the first time, what can be implemented deterministically in this model when processes can fail. We give anonymous algorithms for some fundamental problems: time-stamping, snapshots and consensus. Our solutions to the first two are wait-free and the third is obstruction-free. We also show that a shared object has an obstruction-free implementation if and only if it satisfies a simple property called idempotence. To prove the sufficiency of this condition, we give a universal construction that implements any idempotent object. 相似文献

15.

Parallel DSP algorithms on TurboNet: an experimental system with hybrid message-passing/shared-memory architecture

Xi Li Sotirios G. Ziavras Constantine N. Manikopoulos 《Concurrency and Computation》1996,8(5):387-411

The paper presents several parallel DSP (digital signal processing) algorithms and their performance analysis, targetting a hybrid message-passing and shared-memory architecture that has been built at New Jersey Institute of Technology. The current version of our system contains eight powerful TMS320C40 processors. The algorithms are implemented on our system using message-passing only, shared-memory only, and, if possible, a combination of both of these parallel processing paradigms. Comparisons show that TurboNet's robust, hybrid architecture results in significant performance gains because of the flexibility it introduces. 相似文献

16.

Accelerating computation of Euclidean distance map using the GPU with efficient memory access

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(5):383-406

Recent graphics processing units (GPUs), which have many processing units, can be used for general purpose parallel computation. To utilise the powerful computing ability, GPUs are widely used for general purpose processing. Since GPUs have very high memory bandwidth, the performance of GPUs greatly depends on memory access. The main contribution of this paper is to present a GPU implementation of computing Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system such as coalesced access of global memory and shared memory bank conflicts, and so on. To be concrete, by transposing 2D arrays, which are temporal data stored in the global memory, with the shared memory, the main access from/to the global memory enables to be performed by coalesced access. In practice, we have implemented our parallel algorithm in the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results have shown that, for an input binary image with size of 9216 × 9216, our implementation can achieve a speedup factor of 54 over the sequential algorithm implementation. 相似文献

17.

Exploiting parallelism on a fine-grained MIMD architecture based upon channel queues

Rajiv Gupta Sunah Lee 《International journal of parallel programming》1992,21(3):169-192

We present techniques for exploiting fine-grained parallelism extracted from sequential programs on a fine-grained MIMD system. The system exploits fine-grained parallelism through parallel execution of instructions on multiple processors as well as pipelined nature of individual processors. The processors can communicate data values via globally shared registers as well as dedicated channel queues. Compilation techniques are presented to utilize these mechanisms. A scheduling algorithm has been developed to distribute operations among the processors in a manner that reduces communication among the processors. The compiler identifies data dependencies which require synchronization and enforces them using channel queues. Delays that may result by attempting write operations to a full channel queue are avoided by spilling values from channels to local registers. If an interprocessor data dependency does not require synchronization, then the data value is passed through a shared register or shared memory.Partially supported by National Science Foundation Presidential Young Investigator Award CCR-9157371 (CCR-9249143) to the University of Pittsburgh. 相似文献

18.

CUDA-Zero:a framework for porting shared memory GPU applications to multi-GPUs

CHEN DeHao CHEN WenGuang & ZHENG WeiMin 《中国科学:信息科学(英文版)》2012,(3):663-676

As the prevalence of general purpose computations on GPU, shared memory programming models were proposed to ease the pain of GPU programming. However, with the demanding needs of more intensive workloads, it’s desirable to port GPU programs to more scalable distributed memory environment, such as multi-GPUs. To achieve this, programs need to be re-written with mixed programming models (e.g. CUDA and message passing). Programmers not only need to work carefully on workload distribution, but also on scheduling mechanisms to ensure the efficiency of the execution. In this paper, we studied the possibilities of automating the process of parallelization to multi-GPUs. Starting from a GPU program written in shared memory model, our framework analyzes the access patterns of arrays in kernel functions to derive the data partition schemes. To acquire the access pattern, we proposed a 3-tiers approach: static analysis, profile based analysis and user annotation. Experiments show that most access patterns can be derived correctly by the first two tiers, which means that zero efforts are needed to port an existing application to distributed memory environment. We use our framework to parallelize several applications, and show that for certain kinds of applications, CUDA-Zero can achieve efficient parallelization in multi-GPU environment. 相似文献

19.

CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms

Lee D Dinov I Dong B Gutman B Yanovsky I Toga AW 《Computer methods and programs in biomedicine》2012,106(3):175-187

As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. 相似文献

20.

A survey of recoverable distributed shared virtual memory systems

Morin C. Puaut I. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(9):959-969

Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure 相似文献