Similar Documents
20 similar documents found.
1.
Parallel Computing, 2014, 40(5-6): 86-99
Simulation of in vivo cellular processes with the reaction–diffusion master equation (RDME) is a computationally expensive task. Our previous software enabled simulation of inhomogeneous biochemical systems for small bacteria over long time scales using the MPD-RDME method on a single GPU. Simulations of larger eukaryotic systems exceed the on-board memory capacity of individual GPUs, and long time simulations of modest-sized cells such as yeast are impractical on a single GPU. We present a new multi-GPU parallel implementation of the MPD-RDME method based on a spatial decomposition approach that supports dynamic load balancing for workstations containing GPUs of varying performance and memory capacity. We take advantage of high-performance features of CUDA for peer-to-peer GPU memory transfers and evaluate the performance of our algorithms on state-of-the-art GPU devices. We present parallel efficiency and performance results for simulations using multiple GPUs as system size, particle counts, and number of reactions grow. We also demonstrate multi-GPU performance in simulations of the Min protein system in E. coli. Moreover, our multi-GPU decomposition and load balancing approach can be generalized to other lattice-based problems.
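As a concrete illustration of the peer-to-peer transfers mentioned above, here is a minimal CUDA sketch of one halo-slab copy between two GPUs in a 1D spatial decomposition; the buffer names and the slab size are invented for the example, not taken from the paper.

```cpp
// Sketch: exchange a lattice halo slab between two GPUs with a CUDA
// peer-to-peer copy, avoiding a staging trip through host memory.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    if (!can01 || !can10) { printf("no P2P path between GPUs 0 and 1\n"); return 1; }

    const size_t haloBytes = 256 * 256 * sizeof(float);  // one boundary slab
    float *halo0, *halo1;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // let device 0 reach device 1
    cudaMalloc(&halo0, haloBytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&halo1, haloBytes);

    cudaStream_t s;
    cudaStreamCreate(&s);
    // Copy device 0's boundary slab directly into device 1's halo region.
    cudaMemcpyPeerAsync(halo1, 1, halo0, 0, haloBytes, s);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    cudaFree(halo1);
    cudaSetDevice(0);
    cudaFree(halo0);
    return 0;
}
```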

2.
The numerical solution of two-layer shallow water systems is required to accurately simulate stratified fluids, which are ubiquitous in nature: they appear in atmospheric flows, ocean currents, oil spills, etc. Moreover, the implementation of the numerical schemes to solve these models in realistic scenarios imposes huge demands on computing power. In this paper, we tackle the acceleration of these simulations in triangular meshes by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. For that purpose, an improvement of a path conservative Roe-type finite volume scheme which is particularly well suited to GPU implementation is presented, and a distributed implementation of this scheme which uses CUDA and MPI to exploit the potential of a GPU cluster is developed. This implementation overlaps MPI communication with CPU–GPU memory transfers and GPU computation to increase efficiency. Several numerical experiments, performed on a cluster of modern CUDA-enabled GPUs, show the efficiency of the distributed solver.
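The overlap pattern the abstract describes can be sketched as follows; `update_cells`, the halo width, and the 1D layout are placeholders for the paper's actual finite-volume scheme, and only the left-neighbour exchange is shown.

```cpp
// Sketch: overlap halo exchange with interior computation using two streams.
// h_halo must be pinned (cudaHostAlloc) for the copies to truly overlap.
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void update_cells(float* u, int first, int last) {
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last) u[i] += 1.0f;          // placeholder for the FV update
}

void step(float* d_u, float* h_halo, int n, int haloN, int leftRank,
          cudaStream_t haloStream, cudaStream_t compStream) {
    // 1) Stage our leftmost owned slab to the host while ...
    cudaMemcpyAsync(h_halo, d_u + haloN, haloN * sizeof(float),
                    cudaMemcpyDeviceToHost, haloStream);
    // 2) ... the interior cells, which need no remote data, are updated.
    int interior = n - 4 * haloN;
    update_cells<<<(interior + 255) / 256, 256, 0, compStream>>>(
        d_u, 2 * haloN, n - 2 * haloN);
    // 3) Exchange slabs with the left neighbour once the copy has landed.
    cudaStreamSynchronize(haloStream);
    MPI_Sendrecv_replace(h_halo, haloN, MPI_FLOAT, leftRank, 0,
                         leftRank, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    // 4) Upload the received ghost cells, then finish the boundary cells.
    cudaMemcpyAsync(d_u, h_halo, haloN * sizeof(float),
                    cudaMemcpyHostToDevice, haloStream);
    cudaStreamSynchronize(haloStream);
    update_cells<<<(haloN + 255) / 256, 256, 0, compStream>>>(
        d_u, haloN, 2 * haloN);
    cudaStreamSynchronize(compStream);
}
```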

3.
Real-time rendering of large-scale engineering computer-aided design (CAD) models has been recognized as a challenging task. Because of the constraints of limited graphics processing unit (GPU) memory size and computation capacity, a massive model with hundreds of millions of triangles cannot be loaded and rendered in real time by most modern GPUs. In this paper, an efficient GPU out-of-core framework is proposed for interactively visualizing large-scale CAD models. To improve the efficiency of data fetching from CPU host memory to GPU device memory, a parallel offline geometry compression scheme is introduced that minimizes the storage cost of each primitive by compressing the level-of-detail (LOD) geometries into a highly compact format. At the rendering stage, occlusion culling and LOD processing algorithms are integrated and implemented in an efficient GPU-based approach to determine a minimal set of primitives to be transferred for each frame. A prototype software system is developed to preprocess and render massive CAD models with the proposed framework. Experimental results show that users can walk through massive CAD models with hundreds of millions of triangles at high frame rates using our framework.
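A minimal sketch of one common ingredient of such a compact format, quantizing vertex positions to 16 bits per component inside the model's bounding box; the paper's exact encoding is not shown here, so this is illustrative only.

```cpp
// Sketch: lossy position quantization, 6 bytes per vertex instead of 12.
// Decoding on the GPU is one fused multiply-add per component:
//   p = lo + (q / 65535.0f) * (hi - lo)
#include <cstdint>
#include <algorithm>

struct QVertex { uint16_t x, y, z; };

QVertex quantize(float px, float py, float pz,
                 const float lo[3], const float hi[3]) {
    auto q = [](float v, float a, float b) {
        float t = (v - a) / (b - a);                 // normalize to [0,1]
        t = std::min(1.0f, std::max(0.0f, t));
        return static_cast<uint16_t>(t * 65535.0f + 0.5f);
    };
    return { q(px, lo[0], hi[0]), q(py, lo[1], hi[1]), q(pz, lo[2], hi[2]) };
}
```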

4.
We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth on NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it but at the cost of reduced performance. We discuss and compare two implementations of the code: one that has maximum efficiency but is limited to the memory size of the card, and one that can handle larger problems but is less efficient. We use a coloring scheme to efficiently handle summation operations over nodes on a topology with variable valence. We perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25.
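The coloring idea can be sketched as follows: elements are partitioned into color sets such that no two elements in a set share a node, so each set can be accumulated in its own kernel launch without atomics. The 4-node element layout and all names are illustrative, not the paper's data structures.

```cpp
// Sketch: race-free nodal summation via element coloring.
#include <cuda_runtime.h>

__global__ void accumulate_color(const int* elemNodes,    // 4 nodes/element
                                 const float* elemContrib,
                                 const int* colorElems,   // element ids of one color
                                 int numInColor, float* nodeSum) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numInColor) return;
    int elem = colorElems[e];
    for (int k = 0; k < 4; ++k) {
        int node = elemNodes[4 * elem + k];
        // Plain add is safe: no other element in this color touches this node.
        nodeSum[node] += elemContrib[4 * elem + k];
    }
}

// Host side: one launch per color, colors processed sequentially:
//   for (int c = 0; c < numColors; ++c)
//       accumulate_color<<<blocks(c), 256>>>(elemNodes, elemContrib,
//                                            colorElems[c], colorCount[c], nodeSum);
```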

5.
Computing systems should be designed to exploit parallelism in order to improve performance. In general, a GPU (graphics processing unit) can provide more parallelism than a CPU (central processing unit), which has led to the wide adoption of heterogeneous computing systems that use the CPU and the GPU together. In such systems, the efficiency of the scheduling scheme, which selects between the CPU and the GPU to execute an application, is one of the most critical factors in determining overall performance. This paper proposes a dynamic scheduling scheme that selects the device on which to execute each application based on estimated-execution-time information. The proposed scheme chooses between the CPU and the GPU so as to minimize the completion time, yielding better system performance, although it requires a training period to collect the execution history. According to our simulations, the proposed estimated-execution-time scheduling improves the utilization of the CPU and the GPU compared to existing scheduling schemes, reducing execution time and enhancing the energy efficiency of heterogeneous computing systems.
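A minimal host-side sketch of such an estimated-execution-time scheduler, assuming a per-application running average as the estimator; the paper's estimator and training policy may differ.

```cpp
// Sketch: pick CPU or GPU by comparing estimated completion times built
// from a running average of past executions per application.
#include <algorithm>
#include <map>
#include <string>

enum Device { CPU = 0, GPU = 1 };

struct History { double avgMs[2] = {0, 0}; int runs[2] = {0, 0}; };

struct Scheduler {
    std::map<std::string, History> hist;
    double busyUntilMs[2] = {0, 0};        // when each device becomes free

    Device pick(const std::string& app, double nowMs) {
        History& h = hist[app];
        // Training period: exercise the device with fewer samples.
        if (h.runs[CPU] < 3 || h.runs[GPU] < 3)
            return h.runs[CPU] <= h.runs[GPU] ? CPU : GPU;
        double doneCPU = std::max(nowMs, busyUntilMs[CPU]) + h.avgMs[CPU];
        double doneGPU = std::max(nowMs, busyUntilMs[GPU]) + h.avgMs[GPU];
        return doneCPU <= doneGPU ? CPU : GPU;  // minimize completion time
    }

    void record(const std::string& app, Device d, double elapsedMs) {
        History& h = hist[app];
        h.avgMs[d] = (h.avgMs[d] * h.runs[d] + elapsedMs) / (h.runs[d] + 1);
        h.runs[d] += 1;
    }
};
```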

6.
Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in 3- (or 2-) dimensional Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to its at least O(M²) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets on the order of 10⁴ objects, on computer systems ranging from PCs with multicore CPUs and graphics processing unit (GPU) boards to midrange MPI clusters. To explore data sets of that size interactively, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that our MDS algorithms, implemented in the Compute Unified Device Architecture environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480), run considerably faster than their MPI/OpenMP parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that the hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can additionally provide a linear speedup.
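A sketch of one force evaluation in the virtual-particle-dynamics formulation: each embedded point is pulled toward configurations whose pairwise distances match the input dissimilarities, with the O(M²) inner loop reflecting the complexity noted above. The spring-type force law is an illustrative choice, not the paper's exact dynamics.

```cpp
// Sketch: one MDS force evaluation, one embedded point per thread.
#include <cuda_runtime.h>
#include <math.h>

__global__ void mds_forces(const float* pos,   // n x 3 embedded coordinates
                           const float* D,     // n x n target dissimilarities
                           float* force, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float fx = 0, fy = 0, fz = 0;
    for (int j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = pos[3*j]   - pos[3*i];
        float dy = pos[3*j+1] - pos[3*i+1];
        float dz = pos[3*j+2] - pos[3*i+2];
        float d  = sqrtf(dx*dx + dy*dy + dz*dz) + 1e-12f;
        float s  = (d - D[i*n + j]) / d;   // spring toward the target distance
        fx += s * dx; fy += s * dy; fz += s * dz;
    }
    force[3*i] = fx; force[3*i+1] = fy; force[3*i+2] = fz;
}
```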

7.
Modern graphics processing units (GPUs) have been widely utilized in magnetohydrodynamic (MHD) simulations in recent years. Due to the limited memory of a single GPU, distributed multi-GPU systems need to be explored for large-scale MHD simulations. However, the data transfer between GPUs bottlenecks the efficiency of the simulations on such systems. In this paper we propose a novel GPU Direct–MPI hybrid approach to address this problem for overall performance enhancement. Our approach consists of two strategies: (1) we exploit GPU Direct 2.0 to speed up data transfers between multiple GPUs in a single node and reduce the total number of message passing interface (MPI) communications; (2) we design Compute Unified Device Architecture (CUDA) kernels instead of memory copies to speed up the fragmented data exchange in the three-dimensional (3D) decomposition. 3D decomposition is usually not preferred for distributed multi-GPU systems because the fragmented data exchange is inefficient; our approach makes 3D decomposition practical on such systems, reducing the memory usage and computation time of each partition of the computational domain. Experimental results show twice the FLOPS compared to a common MPI-only 2D-decomposition implementation. The proposed approach has been developed into an efficient implementation for MHD simulations on distributed multi-GPU systems, called the MGPU–MHD code. The code realizes the GPU parallelization of a total variation diminishing (TVD) algorithm for solving the multidimensional ideal MHD equations, extending our work from single-GPU computation (Wong et al., 2011) to multiple GPUs. Numerical tests and performance measurements were conducted on the TSUBAME 2.0 supercomputer at the Tokyo Institute of Technology. Our code achieves 2 TFLOPS in double precision for a problem with 1200³ grid points using 216 GPUs.
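Strategy (2) can be illustrated with a small packing kernel: instead of issuing many strided copies, one kernel gathers a non-contiguous face of the 3D block into a contiguous send buffer. The x-fastest array layout is an assumption made for the example.

```cpp
// Sketch: pack a y-z face of a 3D block into a contiguous buffer, replacing
// ny*nz strided memcpy calls with a single kernel launch. The packed buffer
// is then sent in one MPI message (or one peer-to-peer copy); a symmetric
// kernel unpacks it on the receiver.
#include <cuda_runtime.h>

__global__ void pack_x_face(const float* u, float* buf,
                            int nx, int ny, int nz, int xFixed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // flattened (y,z)
    if (idx >= ny * nz) return;
    int y = idx % ny, z = idx / ny;
    // u is stored x-fastest: u[(z*ny + y)*nx + x]
    buf[idx] = u[(z * ny + y) * nx + xFixed];
}
```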

8.
Starting from the single graphics processing unit (GPU) version of the Smoothed Particle Hydrodynamics (SPH) code DualSPHysics, a multi-GPU SPH program is developed for free-surface flows. The approach is based on a spatial decomposition technique, whereby different portions (sub-domains) of the physical system under study are assigned to different GPUs. Communication between devices is achieved with the use of Message Passing Interface (MPI) application programming interface (API) routines. The use of the sorting algorithm radix sort for inter-GPU particle migration and sub-domain “halo” building (which enables interaction between SPH particles of different sub-domains) is described in detail. With the resulting scheme it is possible, on the one hand, to carry out simulations that could also be performed on a single GPU, but they can now be performed even faster than on one of these devices alone. On the other hand, accelerated simulations can be performed with up to 32 million particles on the current architecture, which is beyond the limitations of a single GPU due to memory constraints. A study of weak and strong scaling behaviour, speedups and efficiency of the resulting program is presented including an investigation to elucidate the computational bottlenecks. Last, possibilities for reduction of the effects of overhead on computational efficiency in future versions of our scheme are discussed.
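The cell-index sort at the heart of the migration and halo-building steps can be written in a few lines with Thrust, whose `sort_by_key` uses a radix sort for integer keys; the key/value naming below is illustrative.

```cpp
// Sketch: reorder particles by cell index with Thrust's radix-based sort.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void reorder_particles(thrust::device_vector<unsigned int>& cellKey,
                       thrust::device_vector<int>& particleId) {
    // After the sort, particles of the same cell are contiguous, and
    // particles whose cell lies in another sub-domain form one run that
    // can be shipped to the neighbouring GPU as the "halo".
    thrust::sort_by_key(cellKey.begin(), cellKey.end(), particleId.begin());
}
```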

9.
The Building-Cube Method (BCM) based on equally-spaced Cartesian meshes has been proposed as a next-generation CFD method. Due to the equally-spaced meshes, it is well suited to highly parallel computation. This paper proposes a parallel implementation scheme of BCM on a GPU cluster system, which needs efficient hierarchical parallel processing to exploit the potential of the cluster system. The proposed scheme employs the Red-Black SOR method for the pressure calculations, the most time-consuming part of BCM, to obtain massive data parallelism. By exploiting the coarse-grain and fine-grain parallelism of BCM, the proposed scheme hierarchically assigns equally-divided tasks to the GPU cluster system. Furthermore, to exploit the computational power of the GPUs in the cluster system, the proposed scheme employs efficient data management techniques such as coalesced data transfers and data reuse in on-chip memory. Experimental results show that the single-GPU implementation can achieve about three times higher performance than the single-CPU one. Moreover, the multiple-GPU implementation can achieve almost ideal scalability. Finally, the possibility of further accelerating not only the pressure calculation but also the whole BCM is discussed.
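For reference, a minimal Red-Black SOR half-sweep for a 2D Poisson-type pressure equation: all cells of one color update independently, which is exactly the data parallelism exploited above. The relaxation factor is an illustrative value, and the 2D setting is a simplification of BCM's cubes.

```cpp
// Sketch: one color sweep of Red-Black SOR for lap(p) = rhs on an nx*ny grid
// with spacing h (h2 = h*h). Host launches color = 0, then color = 1.
#include <cuda_runtime.h>

__global__ void rb_sor_sweep(float* p, const float* rhs,
                             int nx, int ny, float h2, int color) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;
    if (((i + j) & 1) != color) return;        // update one color per launch
    const float omega = 1.8f;                  // illustrative relaxation factor
    float gs = 0.25f * (p[(j-1)*nx + i] + p[(j+1)*nx + i] +
                        p[j*nx + i-1] + p[j*nx + i+1] - h2 * rhs[j*nx + i]);
    p[j*nx + i] += omega * (gs - p[j*nx + i]);
}
```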

10.
Recent graphics processing units (GPUs), which have many processing units, can be used for general-purpose parallel computation and are widely employed for general-purpose processing. Since GPUs have very high memory bandwidth, their performance depends greatly on memory access patterns. The main contribution of this paper is a GPU implementation of Euclidean distance map (EDM) computation with efficient memory access. Given a two-dimensional (2D) binary image, the EDM is a 2D array of the same size in which each element stores the Euclidean distance to the nearest black pixel. The proposed GPU implementation addresses many programming issues of the GPU system, such as coalesced access to global memory and shared-memory bank conflicts. Concretely, by transposing 2D arrays of temporary data in global memory through shared memory, the main accesses to and from global memory can be performed in a coalesced manner. We implemented our parallel algorithm on three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results show that, for an input binary image of size 9216 × 9216, our implementation achieves a speedup factor of 54 over the sequential algorithm implementation.
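The transposition trick reads as follows in its standard form: a tiled transpose staged through shared memory so that both the global-memory read and the write are coalesced, with one element of padding to avoid shared-memory bank conflicts.

```cpp
// Sketch: tiled matrix transpose with coalesced global-memory access.
// Launch with blockDim = (TILE, TILE) and a grid covering the w x h input.
#include <cuda_runtime.h>

#define TILE 32

__global__ void transpose(const float* in, float* out, int w, int h) {
    __shared__ float tile[TILE][TILE + 1];     // +1 breaks bank conflicts
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];   // coalesced read
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;       // transposed block origin
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}
```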

11.
This paper investigates the speed improvements available when using a graphics processing unit (GPU) for evaluation of individuals in a genetic programming (GP) environment. An existing GP system is modified to enable parallel evaluation of individuals on a GPU device. Several issues related to implementing GP on the GPU are discussed, including how to perform tree-based GP on a device without recursion support, as well as the effect that proper memory layout can have on speed increases when using CUDA-enabled NVIDIA GPU devices. The specific GP implementation is designed to evolve stock trading strategies using technical analysis indicators. The second goal of this research is to investigate the possible improvement in performance when training individuals on a larger number of stocks and training days. This increased training size (nearly 100,000 training points) is enabled by the speedups realized through GPU evaluation. Several different scenarios were used to test various speed optimizations of GP evaluation on the GPU device, with a peak speedup factor of over 600 (compared to sequential evaluation on a 2.4 GHz CPU). Also, it is found that increasing the number of stocks and the length of the training period can result in higher out-of-training testing profitability.
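Tree-based GP without recursion is commonly done by storing each individual in postfix (reverse Polish) order and evaluating it with an explicit stack, one fitness case per thread; the opcode set below is an illustrative stand-in for the paper's instruction set.

```cpp
// Sketch: stack-based evaluation of a postfix-encoded GP individual.
#include <cuda_runtime.h>

enum Op { PUSH_VAR, PUSH_CONST, ADD, SUB, MUL, DIV };

struct Instr { int op; float val; int varIdx; };

__global__ void eval_postfix(const Instr* prog, int progLen,
                             const float* inputs, int numVars,
                             float* out, int numCases) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // one fitness case/thread
    if (c >= numCases) return;
    float stack[32];                                 // bounded tree depth
    int sp = 0;
    for (int k = 0; k < progLen; ++k) {
        Instr in = prog[k];
        switch (in.op) {
            case PUSH_VAR:   stack[sp++] = inputs[c * numVars + in.varIdx]; break;
            case PUSH_CONST: stack[sp++] = in.val; break;
            case ADD: sp--; stack[sp-1] += stack[sp]; break;
            case SUB: sp--; stack[sp-1] -= stack[sp]; break;
            case MUL: sp--; stack[sp-1] *= stack[sp]; break;
            case DIV: sp--; stack[sp-1] = fabsf(stack[sp]) > 1e-9f
                                ? stack[sp-1] / stack[sp]
                                : 1.0f;              // protected division
                      break;
        }
    }
    out[c] = stack[0];
}
```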

12.
Support Vector Machine (SVM) regression is an important technique in data mining. The SVM training is expensive and its cost is dominated by: (i) the kernel value computation, and (ii) a search operation which finds extreme training data points for adjusting the regression function in every training iteration. Existing training algorithms for SVM regression are not scalable to large datasets because: (i) each training iteration repeatedly performs expensive kernel value computations, which is inefficient and requires holding the whole training dataset in memory; (ii) the search operation used in each training iteration considers the whole search space which is very expensive. In this article, we significantly improve the scalability and efficiency of SVM regression by exploiting the high performance of Graphics Processing Units (GPUs) and solid state drives (SSDs). Our key ideas are as follows. (i) To reduce the cost of repeated kernel value computations and avoid holding the whole training dataset in the GPU memory, we precompute all the kernel values and store them in the CPU memory extended by the SSD; together with an efficient strategy to read the precomputed kernel values, reusing precomputed kernel values with an efficient retrieval is much faster than computing them on-the-fly. This also alleviates the restriction that the training dataset has to fit into the GPU memory, and hence makes our algorithm scalable to large datasets, especially for large datasets with very high dimensionality. (ii) To enhance the performance of the frequently used search operation, we design an algorithm that minimizes the search space and the number of accesses to the GPU global memory; this optimized search algorithm also avoids branch divergence (one of the causes for poor performance) among GPU threads to achieve high utilization of the GPU resources. Our proposed techniques together form a scalable solution to the SVM regression which we call SIGMA. Our extensive experimental results show that SIGMA is highly efficient and can handle very large datasets which the state-of-the-art GPU-based algorithm cannot handle. On the datasets of size that the state-of-the-art GPU-based algorithm can handle, SIGMA consistently outperforms the state-of-the-art GPU-based algorithm by an order of magnitude and achieves up to 86 times speedup.
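The divergence-conscious search can be illustrated with a block-level argmax reduction over violation scores in which the inner selection is a pair of predicated selects rather than divergent branches; all names and the 256-thread block size are assumptions, not SIGMA's actual code.

```cpp
// Sketch: block-level argmax of per-point scores; threads of a warp follow
// the same path throughout the tree reduction (for strides >= warp size).
#include <cuda_runtime.h>

__global__ void argmax_block(const float* score, int n,
                             float* bestVal, int* bestIdx) {
    __shared__ float sVal[256];
    __shared__ int   sIdx[256];
    int t = threadIdx.x;
    float v = -1e30f; int id = -1;
    // Grid-stride load: each thread keeps its running maximum.
    for (int i = blockIdx.x * blockDim.x + t; i < n; i += blockDim.x * gridDim.x)
        if (score[i] > v) { v = score[i]; id = i; }
    sVal[t] = v; sIdx[t] = id;
    __syncthreads();
    // Tree reduction with ternary selects instead of data-dependent branches.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) {
            bool take = sVal[t + s] > sVal[t];
            sVal[t] = take ? sVal[t + s] : sVal[t];
            sIdx[t] = take ? sIdx[t + s] : sIdx[t];
        }
        __syncthreads();
    }
    if (t == 0) { bestVal[blockIdx.x] = sVal[0]; bestIdx[blockIdx.x] = sIdx[0]; }
}
```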

13.
We consider three high-resolution schemes for computing shallow-water waves as described by the Saint-Venant system and discuss how to develop highly efficient implementations using graphical processing units (GPUs). The schemes are well-balanced for lake-at-rest problems, handle dry states, and support linear friction models. The first two schemes handle dry states by switching variables in the reconstruction step, so that bilinear reconstructions are computed using physical variables for small water depths and conserved variables elsewhere. In the third scheme, reconstructed slopes are modified in cells containing dry zones to ensure non-negative values at integration points. We discuss how single- and double-precision arithmetic affects the accuracy, efficiency, scalability, and resource utilization of our implementations, and demonstrate that all three schemes map very well to current GPU hardware. We have also implemented direct and close-to-photo-realistic visualization of simulation results on the GPU, giving visual simulations with interactive speeds for reasonably sized grids.
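The variable-switching idea for dry states can be sketched as a small device helper: below a depth threshold the reconstruction works on physical variables, with a desingularized velocity, otherwise on conserved variables. The threshold handling and the particular desingularization below are illustrative, not the paper's exact formulas.

```cpp
// Sketch: choose reconstruction variables per cell from the water depth h
// and discharge hu; hEps is the dry-state threshold.
#include <cuda_runtime.h>

__device__ void reconstruct_vars(float h, float hu, float hEps,
                                 float* q1, float* q2, bool* physical) {
    *q1 = h;
    if (h < hEps) {
        // Near-dry cell: switch to physical variables (h, u), with a
        // desingularized velocity that stays bounded as h -> 0.
        float hm = fmaxf(h, hEps);
        *q2 = 2.0f * h * hu / (h * h + hm * hm);   // ~= hu / h when h >= hEps
        *physical = true;
    } else {
        *q2 = hu;         // wet cell: reconstruct conserved momentum directly
        *physical = false;
    }
}
```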

14.
张德好, 刘青昆. 《计算机工程》 (Computer Engineering), 2012, 38(18): 262-264
In GPU (graphics processing unit) computing, device memory is typically much smaller than host memory, so the data to be processed often cannot be copied from host memory to device memory in a single transfer. To address this, an overlapping algorithm for Cholesky decomposition is proposed. Using a prefetching technique, data copying is overlapped with computation to reduce device idle time: device memory is divided into two buffers that alternately hold the data for the current computation and the data for the next one, so that the data exchange between device memory and host memory completes while the device is computing. Experimental results show that the algorithm effectively improves computational efficiency.
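The two-buffer prefetch loop described in the abstract looks roughly like this; `process_tile` is a hypothetical stand-in for the Cholesky panel update, and `hostData` must be pinned (`cudaHostAlloc`) for the copy to overlap with computation.

```cpp
// Sketch: double-buffered streaming of tiles through limited device memory.
#include <cuda_runtime.h>

__global__ void process_tile(float* tile, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) tile[i] *= 2.0f;                // placeholder computation
}

void stream_tiles(const float* hostData, size_t numTiles, size_t tileElems) {
    float* buf[2];
    cudaMalloc(&buf[0], tileElems * sizeof(float));
    cudaMalloc(&buf[1], tileElems * sizeof(float));
    cudaStream_t copyS, compS;
    cudaStreamCreate(&copyS); cudaStreamCreate(&compS);

    cudaMemcpyAsync(buf[0], hostData, tileElems * sizeof(float),
                    cudaMemcpyHostToDevice, copyS);
    cudaStreamSynchronize(copyS);

    for (size_t t = 0; t < numTiles; ++t) {
        int cur = t & 1;
        // Compute on the current buffer while the next tile streams in.
        process_tile<<<(int)((tileElems + 255) / 256), 256, 0, compS>>>(
            buf[cur], tileElems);
        if (t + 1 < numTiles)
            cudaMemcpyAsync(buf[1 - cur], hostData + (t + 1) * tileElems,
                            tileElems * sizeof(float),
                            cudaMemcpyHostToDevice, copyS);
        cudaStreamSynchronize(copyS);          // next tile ready ...
        cudaStreamSynchronize(compS);          // ... and current tile done
    }
    cudaStreamDestroy(copyS); cudaStreamDestroy(compS);
    cudaFree(buf[0]); cudaFree(buf[1]);
}
```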

15.
Modern graphics processing units (GPUs) provide impressive computing resources, which can be accessed conveniently through the CUDA programming interface. We describe how GPUs can be used to considerably speed up molecular dynamics (MD) simulations for system sizes ranging up to about 1 million particles. Particular emphasis is put on numerical long-time stability in terms of energy and momentum conservation, and caveats on limited floating-point precision are issued. Strict energy conservation over 10⁸ MD steps is obtained by double-single emulation of the floating-point arithmetic in accuracy-critical parts of the algorithm. For the slow dynamics of a supercooled binary Lennard-Jones mixture, we demonstrate that the use of single floating-point precision may result in quantitatively and even physically wrong results. For simulations of a Lennard-Jones fluid, the described implementation shows speedup factors of up to 80 compared to a serial CPU implementation, and a single GPU was found to be comparable to a parallelised MD simulation using 64 distributed cores.
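Double-single emulation keeps each value as an unevaluated sum hi + lo of two floats, recovering roughly 48 mantissa bits on hardware without (fast) double precision; the accuracy-critical addition is the classical Knuth/Dekker two-sum, sketched below. Compile with -fmad=false (or use __fadd_rn) so the compiler does not contract the error-extraction steps.

```cpp
// Sketch: double-single ("df64"-style) addition for accuracy-critical
// accumulations, as used for energy conservation in the paper.
#include <cuda_runtime.h>

struct dsfloat { float hi, lo; };

__device__ dsfloat ds_add(dsfloat a, dsfloat b) {
    float s = a.hi + b.hi;
    float v = s - a.hi;
    // e collects the rounding error of the leading sum (Knuth two-sum).
    float e = (a.hi - (s - v)) + (b.hi - v);
    e += a.lo + b.lo;
    float hi = s + e;                // renormalize the pair
    float lo = e - (hi - s);
    dsfloat r; r.hi = hi; r.lo = lo;
    return r;
}
```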

16.
Hybrid CPU/GPU clusters have recently drawn much attention in high-performance computing because of their excellent execution performance and energy efficiency. Many supercomputing sites in the latest TOP500 and Green500 lists are built from hybrid CPU/GPU clusters rather than CPU-only clusters. However, the programming complexity of hybrid CPU/GPU clusters is so high that most users hesitate to move to this new cluster computing platform. To resolve this problem, we propose a distributed PTX virtual machine called BigGPU for heterogeneous clusters. As its name suggests, this virtual machine is physically a distributed system aimed at recompiling and executing PTX code in parallel by aggregating the CPUs and GPUs available in a computational cluster. With the support of this virtual machine, users can regard a hybrid CPU/GPU cluster as a single large-scale GPU: they can develop applications using only CUDA, without combining MPI and multithreading APIs, while simultaneously using distributed CPUs and GPUs to solve the same problem. Moreover, they need not handle load balancing among heterogeneous processors or the device-memory and thread-configuration constraints of physical GPUs, because BigGPU supports a large-scale virtual device memory space and thread configuration. We also evaluate the execution performance of BigGPU. Our experimental results show that BigGPU can effectively exploit the computational power of CPUs and GPUs to enhance the execution performance of users' CUDA programs.

17.
Molecular dynamics (MD) simulation is a widely used class of molecular simulation methods and provides an important route to modeling biological systems. Because of its computational intensity, the spatial and temporal scales currently accessible to MD still cannot meet the needs of real physical processes. As accelerator devices for the CPU, GPUs have in recent years opened new possibilities for increasing MD computing power. The main difficulties of GPU programming lie in decomposing the computational task and mapping it onto the GPU, organizing threads and memory sensibly, and carefully balancing data transfer against instruction throughput to extract the GPU's maximum computational performance. Electrostatics is a long-range interaction present in every aspect of biological phenomena, and its accurate treatment is an essential component of MD. The Particle-Mesh Ewald (PME) method is one of the recognized algorithms for treating electrostatic interactions accurately. This paper describes a strategy for implementing the PME algorithm on the GPU with NVIDIA CUDA, building on GMD, the GPU-accelerated molecular dynamics program previously developed in our laboratory. For the three components of the electrostatic interaction, namely the real-space term, the Fourier-space term, and the energy correction term, different task-organization strategies are adopted to improve overall performance. Tests with the de facto standard benchmark dhfr show that the GMD program with PME achieves 3.93 times the performance of single-core CPU Gromacs 4.5.3, 1.5 times that of 8-core CPU Gromacs 4.5.3, and 1.87 times that of the OpenMM 2.0-accelerated GPU version of Gromacs 4.5.3.
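Of the three terms, the real-space contribution is the most straightforward to map onto threads: each thread accumulates q_i·q_j·erfc(βr)/r over neighbours within the cutoff. The all-pairs loop below stands in for GMD's neighbour-list machinery, which is not shown, and units/prefactors are omitted.

```cpp
// Sketch: real-space part of the Ewald/PME electrostatic energy.
#include <cuda_runtime.h>
#include <math.h>

__global__ void ewald_real_space(const float4* xq,   // x, y, z, charge
                                 float* energy, int n,
                                 float beta, float rcut2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float4 pi = xq[i];
    float e = 0.0f;
    for (int j = 0; j < n; ++j) {               // stand-in for a neighbour list
        if (j == i) continue;
        float4 pj = xq[j];
        float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
        float r2 = dx*dx + dy*dy + dz*dz;
        if (r2 < rcut2) {
            float r = sqrtf(r2);
            e += pi.w * pj.w * erfcf(beta * r) / r;
        }
    }
    energy[i] = 0.5f * e;   // halve: each pair is visited twice
}
```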

18.
Molecular dynamics simulation models the motion of microscopic molecular and atomic systems in time and space and is a powerful method for understanding the macroscopic properties of a system from its microscopic nature. Addressing the question of how to improve the performance of parallel molecular dynamics simulation, this paper takes the well-known package GROMACS as an example, analyzes its implementation strategies for parallel molecular dynamics computation, and, combining key principles of molecular dynamics simulation with benchmark cases, proposes strategies for optimizing computational performance in an MPI+OpenMP parallel environment, providing a theoretical and practical reference for achieving optimal molecular dynamics performance on parallel platforms. The paper also gives theoretical guidance and worked examples on how to choose the combination of MPI ranks, OpenMP threads, and GPUs in a heterogeneous GPU environment to obtain the best performance.

19.
We report on a source-code modification of the density-functional program suite VASP which benefits from the use of graphics processing units (GPUs). For the electronic minimization needed to reach the ground state using an implementation of the blocked Davidson iteration scheme (EDDAV), speed-ups of up to 3.39 on S1070 devices and 6.97 on a C2050 device were observed when calculating an ion-conductor system of actual research interest. Concerning the GPU specialty, memory throughput, the low double-precision performance forms the bottleneck on the S1070, whereas on Fermi cards the code reaches 61.7% efficiency while not suffering any accuracy losses compared to well-established calculations performed on a central processing unit (CPU). The algorithmic bottleneck was found to be the multiplication of rectangular matrices. An initial idea for addressing this problem is given.
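The rectangular multiplication named as the bottleneck is, in cuBLAS terms, a single GEMM call; below is a minimal sketch with illustrative dimensions (cuBLAS assumes column-major storage), not VASP's actual call site.

```cpp
// Sketch: C(m x n) = A(m x k) * B(k x n) in double precision with cuBLAS;
// A is the tall-skinny rectangular factor. All pointers are device pointers.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void rect_gemm(cublasHandle_t h, const double* A, const double* B, double* C,
               int m, int k, int n) {
    const double one = 1.0, zero = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &one, A, m, B, k, &zero, C, m);
}
```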

20.
We present a geometry compression scheme for restricted quadtree meshes and use this scheme for the compression of adaptively triangulated digital elevation models (DEMs). A compression factor of 8–9 is achieved by employing a generalized strip representation of quadtree meshes to incrementally encode vertex positions. In combination with adaptive error-controlled triangulation, this allows us to significantly reduce bandwidth requirements in the rendering of large DEMs that have to be paged from disk. The compression scheme is specifically tailored for GPU-based decoding, since it minimizes dependent memory access operations. We can thus trade CPU operations and CPU–GPU data transfer for GPU processing, resulting in twice faster streaming of DEMs from main memory into GPU memory. A novel storage format for decoded DEMs on the GPU facilitates a sustained rendering throughput of about 300 million triangles per second. Due to these properties, the proposed scheme enables scalable rendering with respect to the display resolution independent of the data size. For a maximum screen-space error below 1 pixel it achieves frame rates of over 100 fps, even on high-resolution displays. We validate the efficiency of the proposed method by presenting experimental results on scanned elevation models of several hundred gigabytes.
