Similar Documents
20 similar documents found.
1.
Gadget is a simulation application for N-body and smoothed particle hydrodynamics (SPH) problems in cosmology, and it is widely applied to a range of cosmological problems. N-body methods focus on the motion of N mutually interacting particles, and SPH is a fluid simulation algorithm that studies fluid motion through particle simulation. Most scholars have focused on accelerating Gadget on multi-core CPU or graphics processing unit (GPU) platforms. However, these efforts did not achieve CPU-GPU hybrid computing, leaving CPU computing resources largely idle. In this paper, we propose a CPU-GPU hybrid parallel strategy to accelerate Gadget-2, a massively parallel structure formation code for cosmological simulations. This strategy uses both the CPU and the GPU to process the short-range force calculation. To keep the CPU and GPU workloads balanced, a dynamic task allocation scheme is proposed based on the performance difference between the CPU and the GPU. Experimental results showed that our CPU-GPU hybrid parallel strategy achieved an overall speedup factor of 18.6, and a partial speedup factor of 28.35 for the short-range force calculation, compared with a single-core CPU implementation for million-particle problems. Moreover, compared with a GPU platform containing 12 CPU cores and one GPU, our hybrid parallel strategy improved the overall and partial speedups by 6% and 20%, respectively. Furthermore, the hybrid strategy scales well: its performance advantage grows as the problem size increases. Its limitation is that the performance gain shrinks as the ratio of CPU cores to GPU cards decreases. Finally, with our hybrid strategy, CPU utilization improved by at least 17.14%. Copyright © 2013 John Wiley & Sons, Ltd.
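The abstract does not reproduce the allocation formula; the following is a minimal sketch of a performance-ratio-based dynamic split of the kind it describes. All names (`computeForcesCPU`, `computeForcesGPU`, `hybridForceStep`, the rate variables) are hypothetical illustrations, not Gadget-2 API.

```cuda
#include <chrono>
#include <thread>

// Hypothetical worker stubs; in the real code these would be Gadget-2's
// CPU tree walk and a CUDA kernel launch followed by a device sync.
static void computeForcesCPU(int first, int count) { /* ... */ }
static void computeForcesGPU(int first, int count) { /* ... */ }

// Split one short-range force step between CPU and GPU in proportion to
// their measured throughputs, then refine the estimates for the next step.
void hybridForceStep(int nParticles, double& cpuRate, double& gpuRate) {
    int nGpu = static_cast<int>(nParticles * gpuRate / (cpuRate + gpuRate));
    int nCpu = nParticles - nGpu;

    auto t0 = std::chrono::steady_clock::now();
    std::thread gpuWorker(computeForcesGPU, nCpu, nGpu);  // GPU part, async
    computeForcesCPU(0, nCpu);                            // CPU part, here
    auto tCpu = std::chrono::steady_clock::now();
    gpuWorker.join();
    auto tGpu = std::chrono::steady_clock::now();

    // Exponential moving average keeps the split adaptive across time steps.
    double sCpu = std::chrono::duration<double>(tCpu - t0).count();
    double sGpu = std::chrono::duration<double>(tGpu - t0).count();
    if (sCpu > 0 && nCpu > 0) cpuRate = 0.5 * cpuRate + 0.5 * nCpu / sCpu;
    if (sGpu > 0 && nGpu > 0) gpuRate = 0.5 * gpuRate + 0.5 * nGpu / sGpu;
}
```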

2.
With the surge in industrial computing demand, the Computational Fluid Dynamics (CFD) community is paying ever more attention to computational efficiency. Building on an in-house Navier-Stokes solver, the authors introduce a multigrid algorithm to accelerate convergence and combine it with the NVIDIA GPU computing platform, accelerating CFD from both the numerical-methods side and the high-performance-computing side. Benchmark results show that the multigrid-based GPU solver achieves a speedup of more than 45x in double precision over the CPU version of the code.
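As an illustration of the multigrid machinery such a solver needs on the GPU, here is a sketch of a generic full-weighting restriction kernel for a 2D vertex-centered grid hierarchy. This is a textbook kernel under our own assumptions, not the authors' code.

```cuda
// Full-weighting restriction of a 2D residual onto the next coarser grid.
// Vertex-centered coarsening: the fine grid has nxf = 2*nxc - 1 columns.
__global__ void restrictResidual(const float* fine, float* coarse,
                                 int nxc, int nyc) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= nxc - 1 || j <= 0 || j >= nyc - 1) return;

    int nxf = 2 * nxc - 1;
    const float* f = &fine[(2 * j) * nxf + (2 * i)];
    coarse[j * nxc + i] =
        0.25f   *  f[0] +
        0.125f  * (f[1] + f[-1] + f[nxf] + f[-nxf]) +
        0.0625f * (f[nxf + 1] + f[nxf - 1] + f[-nxf + 1] + f[-nxf - 1]);
}
```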

3.
Graphics processing units (GPUs) offer parallel computing power that would usually require a cluster of networked computers or a supercomputer. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble: a single-CPU-core implementation and a parallel GPU implementation. Using the Compute Unified Device Architecture (CUDA), the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load across the device resources, and experimental results demonstrate that the efficiency of this algorithm is bounded by the available GPU resources. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single-core implementation for a system of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail.
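The heart of such a GPU Monte Carlo code is evaluating the energy of a trial move in parallel. A minimal CUDA sketch (reduced Lennard-Jones units, cubic periodic box; the kernel name and signature are our own illustration, and `atomicAdd` on `double` requires compute capability 6.0 or later):

```cuda
#include <cuda_runtime.h>

// One thread per interaction partner: accumulate the Lennard-Jones energy
// of a trial position for particle i against all n particles, using the
// minimum-image convention in a cubic box of side L.
__global__ void trialEnergyLJ(const float4* pos, int n, int i,
                              float3 trial, float L, float rcut2,
                              double* energy) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n || j == i) return;

    float dx = trial.x - pos[j].x;
    float dy = trial.y - pos[j].y;
    float dz = trial.z - pos[j].z;
    dx -= L * rintf(dx / L);   // minimum image
    dy -= L * rintf(dy / L);
    dz -= L * rintf(dz / L);

    float r2 = dx * dx + dy * dy + dz * dz;
    if (r2 < rcut2) {
        float inv6 = 1.0f / (r2 * r2 * r2);         // (sigma/r)^6, sigma = 1
        float u = 4.0f * inv6 * (inv6 - 1.0f);      // 4(r^-12 - r^-6)
        atomicAdd(energy, (double)u);               // global accumulation
    }
}
```

The Metropolis accept/reject step then compares the accumulated energy difference against a uniform random number, either on the host or in a follow-up kernel.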

4.
We present a scalable dissipative particle dynamics simulation code, fully implemented on Graphics Processing Units (GPUs) using a hybrid CUDA/MPI programming model, which achieves a 10-30 times speedup on a single GPU over 16 CPU cores and almost linear weak scaling across a thousand nodes. A unified framework is developed that addresses both efficient neighbor-list generation and particle data locality. Our algorithm generates strictly ordered neighbor lists in parallel; the construction is deterministic and makes no use of atomic operations or sorting. Such a neighbor list leads to optimal data loading efficiency when combined with a two-level particle reordering scheme. A faster in situ generation scheme for Gaussian random numbers is proposed using precomputed binary signatures. We designed custom transcendental functions that are fast and accurate for evaluating the pairwise interaction. The correctness and accuracy of the code are verified through a set of test cases simulating Poiseuille flow and spontaneous vesicle formation. Computer benchmarks demonstrate the speedup of our implementation over the CPU implementation, as well as strong and weak scalability. A large-scale simulation of spontaneous vesicle formation consisting of 128 million particles was conducted to further illustrate the practicality of our code in real-world applications.
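The paper's deterministic, atomics-free neighbor-list construction is its contribution and is not reproduced here; the sketch below shows only the standard cell-list binning that such schemes build on (it uses an atomic counter for brevity, so insertion order is not deterministic).

```cuda
// Bin particles into uniform cells of side >= the cutoff radius.
// Fixed-capacity bins; positions in [0, L)^3 packed as float4.
__global__ void binParticles(const float4* pos, int n, float cellSize,
                             int3 dims, int* cellCount, int* cellBody,
                             int maxPerCell) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int cx = min((int)(pos[i].x / cellSize), dims.x - 1);
    int cy = min((int)(pos[i].y / cellSize), dims.y - 1);
    int cz = min((int)(pos[i].z / cellSize), dims.z - 1);
    int cell = (cz * dims.y + cy) * dims.x + cx;
    int slot = atomicAdd(&cellCount[cell], 1);   // non-deterministic order
    if (slot < maxPerCell)
        cellBody[cell * maxPerCell + slot] = i;
}
```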

5.
Graphics processor units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel "co-processors" to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can substantially accelerate compute-intensive simulation science applications. In this study, we describe the implementation of an incompressible flow Navier-Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA's Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. This memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.
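The dominant kernel in such a projection solver is the pressure Poisson iteration. A minimal CUDA sketch of one Jacobi sweep on a 2D uniform grid (a generic illustration under our own naming, not the authors' kernel):

```cuda
// One Jacobi sweep of the pressure Poisson equation  lap(p) = rhs,
// with rhs precomputed from the divergence of the intermediate velocity.
// Interior points only; row-major nx*ny layout, grid spacing h.
__global__ void pressureJacobi(const float* pOld, float* pNew,
                               const float* rhs, int nx, int ny, float h) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;

    int id = j * nx + i;
    pNew[id] = 0.25f * (pOld[id - 1] + pOld[id + 1] +
                        pOld[id - nx] + pOld[id + nx] -
                        h * h * rhs[id]);
}
```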

6.
A parallel implementation, via CUDA, of the dynamic programming method for the knapsack problem on an NVIDIA GPU is presented. A GTX 260 card with 192 cores (1.4 GHz) is used for computational tests, and the processing times of the parallel code are compared with those of the sequential code on an Intel Xeon 3.0 GHz CPU. The results show a speedup factor of 26 for large problems. Furthermore, in order to limit communication between the CPU and the GPU, a compression technique is presented that significantly decreases memory occupancy.
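The DP recurrence parallelizes naturally over capacities. A minimal CUDA sketch of one stage (our own naming; the paper's compression technique is not shown):

```cuda
// One 0/1-knapsack DP stage for item k (weight w, profit p):
//   fNew[c] = max(fOld[c], fOld[c - w] + p)  for c >= w.
// Double buffering avoids the read-after-write hazard of the in-place
// sequential loop; the host swaps fOld/fNew and relaunches per item.
__global__ void knapsackStage(const int* fOld, int* fNew,
                              int capacity, int w, int p) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c > capacity) return;
    int best = fOld[c];
    if (c >= w && fOld[c - w] + p > best) best = fOld[c - w] + p;
    fNew[c] = best;
}
```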

7.
The simulation of electromagnetic (EM) wave propagation in dielectric media is presented using a Compute Unified Device Architecture (CUDA) implementation of the finite-difference time-domain (FDTD) method on a graphics processing unit (GPU). The FDTD formulation in dielectric media is derived in detail, and the GPU-accelerated FDTD method based on the CUDA programming model is described in a flowchart. The accuracy and speedup of the presented CUDA-implemented FDTD method are validated by numerical simulations of EM waves propagating from free space into lossless and lossy dielectric media on the GPU, by comparison with the original FDTD method on the CPU. The comparison of the numerical results demonstrates that the CUDA-implemented FDTD method on the GPU obtains a good speedup with reasonable accuracy. © 2016 Wiley Periodicals, Inc. Int J RF and Microwave CAE 26:512-518, 2016.
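For concreteness, here is a minimal 1D version of the lossy-medium FDTD update pair, with the material coefficients `ca`/`cb` folded per cell as in the standard formulation. This simplified sketch is ours, not the paper's code:

```cuda
// E-field update in a (possibly lossy) dielectric:  Ez = ca*Ez + cb*dHy/dx.
__global__ void updateE(float* ez, const float* hy,
                        const float* ca, const float* cb, int nx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < nx)
        ez[i] = ca[i] * ez[i] + cb[i] * (hy[i] - hy[i - 1]);
}

// H-field update; dtmudx = dt / (mu * dx).
__global__ void updateH(float* hy, const float* ez, float dtmudx, int nx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nx - 1)
        hy[i] += dtmudx * (ez[i + 1] - ez[i]);
}
```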

8.
In this paper, we present the analysis and development of a cross-platform OpenCL implementation of the box-counting algorithm, one of the most widely used methods for estimating the fractal dimension. The fractal dimension is a relevant image analysis measure used in several disciplines, but computing it is in general a time-consuming process, especially when working with 3D images. Unlike parallel programming models that strictly depend on the hardware type and manufacturer, like CUDA, OpenCL allows us to provide an implementation suitable for execution on both GPUs and multi-core CPUs, whatever the hardware manufacturer. Sorting is a key part of the fast box-counting algorithm, and the final speedup is highly conditioned by the efficiency of the sorting algorithm used. Our study reveals that current OpenCL implementations of sorting algorithms are clearly slower than both CUDA for GPUs and specific multi-core CPU implementations. Our OpenCL algorithm has been specifically optimized according to the type of the target device, and the results show average speedups of up to 7.46x and 4x when executed on the GPU and the multi-core CPU respectively, both compared with the single-threaded (sequential) CPU implementation.
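The counting step itself is simple to parallelize. The paper's cross-platform code is OpenCL, but for consistency with the other sketches in this listing, here is the same idea in CUDA (our own illustration):

```cuda
// Mark which boxes of side s contain at least one foreground pixel.
// A reduction over boxOccupied then yields N(s); the fractal dimension is
// estimated from the slope of log N(s) versus log(1/s).
__global__ void markBoxes(const unsigned char* img, int w, int h,
                          int s, unsigned int* boxOccupied) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h || img[y * w + x] == 0) return;

    int boxesPerRow = (w + s - 1) / s;
    int box = (y / s) * boxesPerRow + (x / s);
    atomicExch(&boxOccupied[box], 1u);   // idempotent mark; no count needed
}
```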

9.
This paper first introduces the characteristics of the CUDA architecture, implements matrix multiplication on the GPU with CUDA in two different ways, and optimizes the implementation according to CUDA's specific hardware and software architecture. The ratio of achieved to peak GPU performance is then computed and analyzed. Experimental results show that CUDA-based matrix multiplication achieves very high speedups over the CPU version, up to 1079.64x, and that the GPU's floating-point capability is used effectively, reaching up to 30.85% of peak performance.
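The kind of CUDA-specific optimization the abstract refers to is the classic shared-memory tiling of matrix multiplication; a minimal sketch:

```cuda
#define TILE 16

// Tiled C = A * B for n x n row-major matrices: each block stages
// TILE x TILE tiles of A and B in shared memory to cut global-memory traffic.
__global__ void matMulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```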

10.
陈颖, 林锦贤, 吕暾. Journal of Computer Applications (计算机应用), 2011, 31(3): 851-855
With the dramatic improvement in graphics processing unit (GPU) performance and programmability, many algorithms have been successfully ported to the GPU. LU decomposition and the Laplace algorithm are at the core of scientific computing, but their computational cost is often high, so a GPU-accelerated approach is proposed. Both algorithms are implemented with NVIDIA's Compute Unified Device Architecture (CUDA) programming model; the work is partitioned between the CPU and the GPU, while exploiting the GP…

11.
Simulating 2D seismic elastic/viscoelastic wave fields with a staggered-grid finite-difference method takes a great deal of computation time. To address this, the GPU's parallel processing features and rendering pipeline are exploited: the computational domain is divided into an interior region and a PML boundary region, the whole computation is handled by vertex and fragment programs, and FBO (framebuffer object) techniques are used to write the result of each difference iteration back to textures. Experimental results show that, compared with the CPU implementation, the GPU method improves simulation efficiency, and that its efficiency keeps improving as the grid size grows, enabling efficient large-scale simulation.

12.
Open Computing Language (OpenCL) is a parallel processing language that is ideally suited to running parallel algorithms on Graphical Processing Units (GPUs). In the present work we report on the development of a generic parallel single-GPU code, based on the OpenCL model, for the numerical solution of a system of first-order ordinary differential equations (ODEs). We have applied the code to the time-dependent Schrödinger equation of atomic hydrogen in a strong laser field and studied its performance on NVIDIA and AMD GPUs against the serial performance on a CPU. We found excellent scalability and a significant speedup of the GPU over the CPU device. The speedup in the benchmark tended towards a value of about 40, with significant speedups expected against multi-core CPUs. Furthermore, though we do not present the detailed benchmarks here, we have also achieved speedups of around 75 by slightly optimising the described algorithm.
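The paper's code is OpenCL; for consistency with the other sketches here, the per-thread integration pattern it describes looks like this in CUDA, with a toy harmonic-oscillator right-hand side standing in for the Schrödinger coupling (entirely our own illustration):

```cuda
// One thread advances one independent ODE instance by a classical RK4 step.
// Toy system: x' = v, v' = -w2 * x. State doubles as the derivative holder
// (f returns {dx/dt, dv/dt} in the {x, v} fields).
struct State { float x, v; };

__device__ State f(State s, float w2) { return { s.v, -w2 * s.x }; }

__global__ void rk4Step(State* y, const float* w2, float dt, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    State s = y[i];
    float w = w2[i];
    State k1 = f(s, w);
    State k2 = f({ s.x + 0.5f * dt * k1.x, s.v + 0.5f * dt * k1.v }, w);
    State k3 = f({ s.x + 0.5f * dt * k2.x, s.v + 0.5f * dt * k2.v }, w);
    State k4 = f({ s.x + dt * k3.x, s.v + dt * k3.v }, w);
    y[i].x = s.x + dt / 6.0f * (k1.x + 2 * k2.x + 2 * k3.x + k4.x);
    y[i].v = s.v + dt / 6.0f * (k1.v + 2 * k2.v + 2 * k3.v + k4.v);
}
```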

13.
An optimized implementation of a block tridiagonal solver based on the block cyclic reduction (BCR) algorithm is introduced, and its portability to graphics processing units (GPUs) is explored. The computations are performed on an NVIDIA GTX480 GPU. The results are compared, in terms of runtime, with those obtained on a single core of an Intel Core i7-920 (2.67 GHz). The BCR linear solver achieves a maximum speedup of 5.84x, with a block size of 32, over the CPU Thomas algorithm in double precision. The proposed BCR solver is applied to discontinuous Galerkin (DG) simulations on structured grids via an alternating direction implicit (ADI) scheme. The GPU performance of the entire computational fluid dynamics (CFD) code is studied for different compressible inviscid flow test cases. For a general mesh with quadrilateral elements, the ADI-DG solver achieves a maximum total speedup of 7.45x for the piecewise quadratic solution over the CPU platform in double precision.

14.
Today, there is a growing demand for computer vision and image processing in different areas and applications, such as military surveillance and biological and medical imaging. Edge detection is a vital image processing technique used as a pre-processing step in many computer vision algorithms. However, the presence of noise makes the edge detection task more challenging; therefore, an image restoration technique is needed to tackle this obstacle by presenting an adaptive solution. As processing complexity rises with recent high-definition technologies, the volume of data carried by an image is increasing dramatically, so more processing power is needed to complete such tasks quickly. In this paper, we present a parallel implementation of a hybrid algorithm comprising edge detection and image restoration, along with other processing stages, using the Compute Unified Device Architecture (CUDA) platform, exploiting the Single Instruction Multiple Thread (SIMT) execution model of a Graphical Processing Unit (GPU). The performance of the proposed method is tested and evaluated using well-known images from various applications. We evaluated the computation time of the parallel implementation on the GPU and of sequential execution on the Central Processing Unit (CPU), both natively and with Hyper-Threading (HT). The naïve GPU implementation of the proposed edge detection with direct global-memory access is up to 37 times faster than the native CPU implementation, while the shared-memory approach reaches speedups of up to 25 times over the native CPU implementation and 1.5 times over the HT implementation.
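As a concrete reference point for the naïve global-memory variant the abstract benchmarks, here is a minimal Sobel gradient-magnitude kernel (a standard textbook kernel under our own naming, not the authors' hybrid pipeline):

```cuda
// Sobel edge detection, one thread per interior pixel, 8-bit grayscale.
__global__ void sobel(const unsigned char* in, unsigned char* out,
                      int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || x >= w - 1 || y <= 0 || y >= h - 1) return;

    #define P(dx, dy) ((int)in[(y + (dy)) * w + (x + (dx))])
    int gx = -P(-1,-1) - 2*P(-1,0) - P(-1,1) + P(1,-1) + 2*P(1,0) + P(1,1);
    int gy = -P(-1,-1) - 2*P(0,-1) - P(1,-1) + P(-1,1) + 2*P(0,1) + P(1,1);
    #undef P

    int mag = abs(gx) + abs(gy);            // cheap L1 stand-in for sqrt
    out[y * w + x] = (unsigned char)(mag > 255 ? 255 : mag);
}
```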

15.
Graphics processing units (GPUs) have taken on an important role in the general-purpose computing market in recent years. At present, the common approach to programming GPU units is to write GPU-specific cod…

16.
To address the poor GPU acceleration of implicit algorithms on unstructured grids, this work analyzes the GPU architecture and its parallel execution model, and implements GPU-parallel acceleration of the implicit LU-SGS algorithm for a vertex-centered unstructured-grid scheme. RCM and Metis grid reordering (regrouping) methods are adopted to optimize the data locality of the unstructured grid and thus improve the GPU parallel performance of the implicit algorithm. The correctness and efficiency of the implementation are verified on a 3D wing test case. The results show that the two reordering (regrouping) methods improve the acceleration by 63% and 69%, respectively, and the optimized implicit LU-SGS GPU parallel algorithm achieves a 27x speedup over the serial CPU algorithm, fully demonstrating the efficiency of the proposed approach.
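RCM itself is a host-side graph reordering. A minimal sketch of the standard algorithm (BFS from a minimum-degree vertex, neighbors visited in increasing-degree order, result reversed), under our own naming:

```cuda
#include <algorithm>
#include <queue>
#include <vector>

// Reverse Cuthill-McKee ordering of a mesh's vertex adjacency graph.
// Returns order such that order[k] is the old index placed at position k.
std::vector<int> rcmOrder(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    std::vector<int> order;
    std::vector<char> seen(n, 0);

    for (;;) {                       // restart per connected component
        int start = -1;
        for (int v = 0; v < n; ++v)
            if (!seen[v] && (start < 0 || adj[v].size() < adj[start].size()))
                start = v;
        if (start < 0) break;

        std::queue<int> q;
        q.push(start); seen[start] = 1;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            std::vector<int> nb;
            for (int u : adj[v])
                if (!seen[u]) { nb.push_back(u); seen[u] = 1; }
            std::sort(nb.begin(), nb.end(), [&](int a, int b) {
                return adj[a].size() < adj[b].size();
            });
            for (int u : nb) q.push(u);
        }
    }
    std::reverse(order.begin(), order.end());
    return order;
}
```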

17.
Acceleration of computational fluid dynamics on the GPU
This paper ports the compressible Navier-Stokes, incompressible Navier-Stokes, and Euler equations of computational fluid dynamics to NVIDIA GPUs. Three test cases are simulated: a 2D Riemann problem, the cavity flow problem, and flow around an RAE2822 airfoil. Compared with the CPU, we obtain a maximum speedup of 33.2x on the GPU platform. In order to maximally impr…

18.
Centroidal Voronoi tessellations (CVTs) are widely used in computational science and engineering. The most commonly used method is Lloyd's method, and recently the L-BFGS method has been shown to be faster than Lloyd's method for computing the CVT. However, these methods run on the CPU and are still too slow for many practical applications. We present techniques to implement these methods on the GPU for computing the CVT on 2D planes and on surfaces, and demonstrate significant speedup of these GPU-based methods over their CPU counterparts. For CVT computation on a surface, we use a geometry image stored on the GPU to represent the surface when computing the Voronoi diagram on it. In our implementation, a new technique is proposed for parallel regional reduction on the GPU for evaluating integrals over Voronoi cells.
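A minimal CUDA sketch of one Lloyd iteration in 2D, using a brute-force nearest-seed search and atomic centroid accumulation (the paper's methods compute actual Voronoi cells and a parallel regional reduction; this simplification is ours):

```cuda
#include <cuda_runtime.h>

// Pass 1: bin each sample point into the centroid sum of its nearest seed.
__global__ void accumulate(const float2* pts, int nPts,
                           const float2* seeds, int nSeeds,
                           float* sumX, float* sumY, int* cnt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPts) return;
    float2 p = pts[i];
    int best = 0; float bestD = 1e30f;
    for (int s = 0; s < nSeeds; ++s) {
        float dx = p.x - seeds[s].x, dy = p.y - seeds[s].y;
        float d = dx * dx + dy * dy;
        if (d < bestD) { bestD = d; best = s; }
    }
    atomicAdd(&sumX[best], p.x);
    atomicAdd(&sumY[best], p.y);
    atomicAdd(&cnt[best], 1);
}

// Pass 2: move each seed to the centroid of the points assigned to it.
__global__ void moveSeeds(float2* seeds, const float* sumX,
                          const float* sumY, const int* cnt, int nSeeds) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < nSeeds && cnt[s] > 0)
        seeds[s] = make_float2(sumX[s] / cnt[s], sumY[s] / cnt[s]);
}
```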

19.
张丹丹, 徐莹, 徐磊. Computer Science (计算机科学), 2012, 39(4): 296-298, 303
Several parallel programming models for heterogeneous CPU+GPU platforms are studied, and CUDA, MPI+CUDA, and MPI+OpenMP+CUDA multi-level parallel algorithms are implemented for the lattice Boltzmann method. The results show that the algorithms achieve good acceleration; the proposed method of tuning the CPU-GPU load balance through a workload-ratio parameter offers a useful reference for implementing multi-level parallelism and effective resource utilization on heterogeneous platforms.

20.
Sequence segmentation has gained popularity in bioinformatics, particularly in the study of DNA sequences. Information theoretic models have been used to provide accurate solutions for the segmentation of DNA sequences. Existing dynamic programming approaches provide an optimal solution to the segmentation problem. However, their quadratic time complexity prohibits their applicability to long sequences. In this paper, we propose a parallel approach to improve the performance of a quasilinear sequence segmentation algorithm. The target segmentation technique is a divide-and-conquer recursive algorithm based on information theory principles and models. We present three parallel implementations that aim to reduce the segmentation time. The first implementation uses the multithreading capabilities of CPUs. The second is a hybrid implementation that exploits the synergy between the CPU and the multithreading power of GPUs. The third is a variation of the hybrid approach that uses unified memory between the CPU and the GPU instead of the standard memory-copy approach. We demonstrate the applicability of the parallel implementations by testing them on real DNA sequences and on randomly generated sequences with different lengths and different numbers of unique elements. The results show that the hybrid CPU-GPU approach outperforms the sequential implementation with a speedup of up to 5.9x, while the CPU parallel implementation provides a modest speedup of only 1.7x.
