期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially. 相似文献

2.

基于GPU的多重网格Navier-Stokes解算器并行优化方法研究

刘冰陆忠华李新亮胡晓东《数据与计算发展前沿》2013,4(3):56-67

随着工业计算需求的激增,计算流体力学 (Computational Fluid Dynamics, CFD) 学科对计算效率问题越来越重视。作者基于自行开发的 Navier-Stokes 解算器,引入多重网格加速收敛算法,并结合NVIDIA GPU 计算平台,从数值方法和高性能计算两个方面为 CFD 实现加速。数值加速算例测试结果表明,基于多重网格算法的 GPU 解算器相对 CPU 版本代码双精度可获得 45 倍以上的加速。相似文献

3.

基于GPU的LARED-P算法加速

下载免费PDF全文

刘来国徐炜遐杨灿群陈娟《计算机工程与科学》2009,31(Z1)

GPU拥有几百GFlops甚至上TFlops的浮点计算能力,将GPU应用于粒子模拟,可有效提高大规模粒子模拟的速度,降低计算成本。本文利用GPU加速三维激光等离子体模拟算法LARED-P,提出了基于CPU+GPU的任务划分、GPU上任务分解、大规模计算核心的分解方法,结合使用了寄存器、纹理内存对算法进行加速。在双精度条件下,移植后的算法在工作频率为1.44GHz的NVIDIA Tesla S1070的单个GPU上获得了相当于主频2.4GHz的Intel(R)Core(TM)2 Quad CPU Q6600单核的6倍加速比。相似文献

4.

GPU上计算流体力学的加速 总被引：1，自引：0，他引：1

董廷星李新亮李森迟学斌《计算机系统应用》2011,20(1):104-109

本文将计算流体力学中的可压缩的纳维叶-斯托克斯(Navier-Stokes),不可压缩的Navier-Stokes和欧拉(Euler)方程移植到NVIDIA GPU上.模拟了3个测试例子,2维的黎曼问题,方腔流问题和RAE2822型的机翼绕流.相比于CPU,我们在GPU平台上最高得到了33.2倍的加速比.为了最大程度提... 相似文献

5.

基于CUDA的SKINNY加密算法并行实现与分析

解文博韦永壮刘争红《计算机应用》2021,41(4):1136-1141

针对SKINNY加密算法在中央处理器（CPU）下实现效率偏低的问题,提出一种基于图形处理器（GPU）的快速实现方法。首先,结合SKINNY算法的结构特征提出优化方案,将5个分步操作优化整合为1个整体运算;然后,分析该算法的电子密码本（ECB）模式和计数器（CTR）模式的特性,并给出并行粒度、内存分配等并行设计方案。实验结果表明,与传统的CPU实现方法下的SKINNY算法相比,基于计算统一设备架构（CUDA）实现的SKINNY算法的效率和吞吐量得到很大提升。具体来说,当处理的数据达到16 MB及以上时,在所提实现方法下,SKINNY算法的ECB模式的加速效率提升峰值为99.85%,加速比峰值为671,CTR模式的加速效率提升峰值为99.87%,加速比峰值为765;而与已有AES-256（ECB）和SKINNY_ECB并行算法比较,新提出的SKINNY-256（ECB）并行算法的吞吐量分别是它们的吞吐量的1.29倍和2.55倍。相似文献

6.

CFD-based analysis and two-level aerodynamic optimization on graphics processing units

I.C. Kampolis X.S. Trompoukis V.G. Asouti K.C. Giannakoglou 《Computer Methods in Applied Mechanics and Engineering》2010,199(9-12):712-722

This paper presents the porting of 2D and 3D Navier–Stokes equations solvers for unstructured grids, from the CPU to the graphics processing unit (GPU; NVIDIA’s Ge-Force GTX 280 and 285), using the CUDA language. The performance of the GPU implementations, with single, double or mixed precision arithmetic operations, is compared to that of the CPU code.Issues regarding the optimal handling of the unstructured grid topology on the GPU, particularly for vertex-centered CFD algorithms, are discussed. Restructuring the existing codes was necessary in order to maximize the parallel efficiency of the GPU implementations. The mixed precision implementation, in which the left-hand-side operators are computed with single precision, was shown to bridge the gap between the single and double precision speed-ups. Based on the different speed-ups and prediction accuracy of the aforementioned GPU implementations of the Navier–Stokes equations solver, a hierarchical optimization method which is suitable for GPUs is proposed and demonstrated in inviscid and turbulent 2D flow problems. The search for the optimal solution(s) splits into two levels, both relying upon evolutionary algorithms (EAs) though with different evaluation tools each. The low level EA uses the very fast single precision GPU implementation with relaxed convergence criteria for the inexpensive evaluation of candidate solutions. Promising solutions are regularly broadcast to the high level EA which uses the mixed precision GPU implementation of the same flow solver. Single- and two-objective aerodynamic shape optimization problems are solved using the developed software. 相似文献

7.

Efficient parallel implementation of three‐point viterbi decoding algorithm on CPU,GPU, and FPGA

Rongchun Li Yong Dou Dan Zou 《Concurrency and Computation》2014,26(3):821-840

In wireless communication, Viterbi decoding algorithm (VDA) is the one of most popular channel decoding algorithms, which is widely used in WLAN, WiMAX, or 3G communications. However, the throughput of Viterbi decoder is constrained by the convolutional characteristic. Recently, the three‐point VDA (TVDA) was proposed to solve this problem. In TVDA, the whole procedure can be divided into three phases, the forward, trace‐back, and decoding phases. In this paper, we analyze the parallelism of TVDA and propose parallel TVDA on the multi‐core CPU, graphics processing unit (GPU), and field programmable gate array (FPGA). We demonstrate approaches that fully exploit its performance potential on CPU, GPU, and FPGA computing platforms. For CPU platforms, we perform two optimization methods, single instruction multiple data and multithreading to gain over 145 × speedup over the naive CPU version on a quad‐core CPU platform. For GPU platforms, we propose the combination of cached memory optimization, coalesced global memory accesses, codeword packing scheme, and asynchronous data transition, achieving the throughput of 404.65 Mbps and 12 × speedup over initial GPU versions on an NVIDIA GeForce GTX580 card and 7 × speedup over Intel quad‐core CPU i5‐2300, under the same manufacturing year and both with fully optimized schemes. In addition, for FPGA platforms, we customize a radix‐4 pipelined architecture for the TVDA in a 45‐nm FPGA chip from Xilinx (XC6VLX760). Under 209.15‐MHz clock rate, it achieves a throughput of 418.30 Mbps. Finally, we also discuss the performance evaluation and efficiency comparison of different flexible architectures for real‐time Viterbi decoding in terms of the decoding throughput, power consumption, optimization schemes, programming costs, and price costs.Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

8.

A high performance hardware accelerator for dynamic texture segmentation

《Journal of Systems Architecture》2015,61(10):639-645

Hardware accelerators such as general-purpose GPUs and FPGAs have been used as an alternative to conventional CPU architectures in scientific computing applications, and have achieved good speed-up results. Within this context, the present study presents a heterogeneous architecture for high-performance computing based on CPUs and FPGAs, which efficiently explores the maximum parallelism degree for processing video segmentation using the concept of dynamic textures. The video segmentation algorithm includes processing the 3-D FFT, calculating the phase spectrum and the 2-D IFFT operation. The performance of the proposed architecture based on CPU and FPGA is compared with the reference implementation of FFTW in CPU and with the cuFFT library in GPU. The performance report of the prototyped architecture in a single Stratix IV FPGA obtained an overall speedup of 37x over the FFTW software library. 相似文献

9.

LU分解和Laplace算法在GPU上的实现

陈颖林锦贤吕暾《计算机应用》2011,31(3):851-855

随着图形处理器(GPU)性能的大幅度提升以及可编程性的发展,已经有许多算法成功地移植到GPU上.LU分解和Laplace算法是科学计算的核心,但计算量往往很大,由此提出了一种在GPU上加速计算的方法.使用Nvidia公司的统一计算设备架构(CUDA)编程模型实现这两个算法,通过对CPU与GPU进行任务划分,同时利用GP... 相似文献

10.

Simulation of one-layer shallow water systems on multicore and CUDA architectures

Marc de la Asunción José M. Mantas Manuel J. Castro 《The Journal of supercomputing》2011,58(2):206-214

The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the big dimensions of the domains suggests the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. An algorithm which exploits the potential data parallelism of this method is presented and implemented using the CUDA model in single and double floating point precision. Numerical experiments show the high efficiency of this CUDA solver in comparison with a CPU parallel implementation of the solver and with respect to a previously existing GPU solver based on a shading language. 相似文献

11.

基于OpenCL的Kmeans算法的优化研究

吴再龙张云泉徐建良贾海鹏颜深根解庆春《计算机科学与探索》2014,(10):1162-1176

Kmeans算法是无监督机器学习中一种典型的聚类算法,是对已知数据集进行划分和分组的重要方法,在图像处理、数据挖掘、生物学领域有着广泛的应用。随着实际应用中数据规模的不断变大,对Kmeans算法的性能也提出了更高的要求。在充分考虑不同硬件平台体系架构差异的基础上,系统地研究了Kmeans算法在GPU和APU平台上实现与优化的关键技术：片上全局同步高效实现,冗余计算减少全局同步次数,线程任务重映射,局部内存重用等,实现了Kmeans算法在不同硬件平台上的高性能与性能移植。实验结果表明,优化后的算法在考虑数据传输时间的前提下,在AMD HD7970 GPU上相对于CPU版本取得136.975～170.333倍的加速比,在AMD A10-5800K APU上相对于CPU版本取得22.2365～24.3865倍的加速比,有效验证了优化方法的有效性和平台的可移植性。相似文献

12.

基于OpenCL的图像模糊化算法优化研究

张樱张云泉龙国平《计算机科学》2012,39(3):260-264

现代GPU一般都提供特定硬件(如纹理部件、光栅化部件及各种片上缓存)以加速二维图像的处理和显示过程,相应的编程模型(CUDA、OpenCL)都定义了特定程序设计接口(CUDA的纹理内存,OpenCL的图像对象)以便图像应用能利用相关硬件支持。以典型图像模糊化处理算法在AMD平台GPU的优化为例,探讨了OpenCL的图像对象在图像算法优化上的适用范围,尤其是分析了其相对于更通用的基于全局内存加片上局部存储进行性能优化的方法的优劣。实验结果表明,图像对象只有在图像为四通道且计算过程中需要缓存的数据量较小时才能带来较好的性能改善,其余情况采用全局内存加局部存储都能获得较好性能。优化后的算法性能相对于精心实现的CPU版加速比为200～1000;相对于NVIDIA NPP库相应函数的性能加速比为1.3～5。相似文献

13.

CUDA-enabled Sparse Matrix–Vector Multiplication on GPUs using atomic operations

Hoang-Vu Dang Bertil Schmidt 《Parallel Computing》2013

Existing formats for Sparse Matrix–Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR format for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers the SCOO implementation for double-precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy-Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700 K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. 相似文献

14.

Parallel Monte Carlo simulation in the canonical ensemble on the graphics processing unit

Eyad Hailat Vincent Russo Kamel Rushaidat Jason Mick Loren Schwiebert Jeffrey Potoff 《International Journal of Parallel, Emergent and Distributed Systems》2014,29(4):379-400

Graphics processing units (GPUs) offer parallel computing power that usually requires a cluster of networked computers or a supercomputer to accomplish. While writing kernel code is fairly straightforward, achieving efficiency and performance requires very careful optimisation decisions and changes to the original serial algorithm. We introduce a parallel canonical ensemble Monte Carlo (MC) simulation that runs entirely on the GPU. In this paper, we describe two MC simulation codes of Lennard-Jones particles in the canonical ensemble, a single CPU core and a parallel GPU implementations. Using Compute Unified Device Architecture, the parallel implementation enables the simulation of systems containing over 200,000 particles in a reasonable amount of time, which allows researchers to obtain more accurate simulation results. A remapping algorithm is introduced to balance the load of the device resources and demonstrate by experimental results that the efficiency of this algorithm is bounded by available GPU resource. Our parallel implementation achieves an improvement of up to 15 times on a commodity GPU over our efficient single core implementation for a system consisting of 256k particles, with the speedup increasing with the problem size. Furthermore, we describe our methods and strategies for optimising our implementation in detail. 相似文献

15.

点云重建的并行算法

下载免费PDF全文

杨捷吴素萍《计算机工程与应用》2020,56(6):213-219

在三维重建问题中,为了提高重建模型的精确度和完整性,需要增大三维重建的数据量,由此会增加重建的计算量和运行时间。针对该问题,对点云重建过程进行并行设计,降低耗时、提高三维重建的效率,提出在多核CPU、GPU架构和CPU/GPU异构环境下点云重建的并行算法,并在不同实验平台上对Kermit和hallFeng数据集进行了点云重建的并行实验。实验结果表明,相比于串行的点云重建算法,点云重建并行算法在保证重建精度的条件下,取得了较好的加速比,并且并行算法具有实验平台和数据规模的可扩展性。相似文献

16.

交替方向隐式CFD解法器的GPU并行计算及其优化

邓亮徐传福刘巍张理论《计算机应用》2013,33(10):2783-2786

交替方向隐格式(ADI)是常见的偏微分方程离散格式之一,目前对ADI格式在计算流体力学（CFD）实际应用中的GPU并行工作开展较少。从一个有限体积CFD应用出发,通过分析ADI解法器的特点和计算流程,基于统一计算架构(CUDA)编程模型设计了基于网格点与网格线的两类细粒度GPU并行算法,讨论了若干性能优化方法。在天河-1A系统上,采用128×128×128网格规模的单区结构网格算例,无粘项、粘性项及ADI迭代计算的GPU并行性能相对于单CPU核,分别取得了100.1、40.1和10.3倍的加速比,整体ADI CFD解法器的GPU并行加速比为17.3 相似文献

17.

CPU–GPU hybrid parallel strategy for cosmological simulations

Yueqing Wang Yong Dou Song Guo Yuanwu Lei Dan Zou 《Concurrency and Computation》2014,26(3):748-765

Gadget is a simulation application for N‐body and smoothed particle hydrodynamics problems in cosmology, and it is widely applied in solving series of cosmological problems. N‐body focuses on the motion of the interaction of N particles, and smoothed particle hydrodynamics is a fluid simulation algorithm that studies the movement of fluid through particle simulation. Most scholars focus their attention on accelerating Gadget on multi‐core CPU or graphics processing units (GPUs) platforms. However, these research activities failed to achieve CPU–GPU hybrid computing, which resulted in tremendous waste of CPU computing resources. In this paper, we propose a CPU–GPU hybrid parallel strategy to accelerate Gadget‐2, a massively parallel structure formation code for cosmological simulations. This strategy uses CPU and GPU to process the calculation of short‐range force. To ensure CPU and GPU workload balance, a dynamic task allocation scheme is proposed according to the computational performance difference between the CPU and GPU. Experimental results showed that our CPU–GPU hybrid parallel strategy achieved an overall speedup factor of 18.6 and a partial speedup factor for short‐range force calculation of 28.35 compared with a single‐core CPU implementation for particles in million‐size magnitudes. Moreover, compared with a GPU platform that contained 12 CPU cores and one GPU, our hybrid parallel strategy obtained overall speedup and partial speedup factors of 6% and 20%, respectively. Furthermore, the scalability of the hybrid strategy is very fine – its performance will be enhanced when the problem scale is increasing. However, this strategy also has its limitation that the performance enhancement will be decreasing if the ratio(the number of CPU cores divides that of the GPU cards) reduces. Finally, in our hybrid strategy, the CPU coefficient of utilization improved by 17.14% or better. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

18.

Simulation of shallow-water systems using graphics processing units

Miguel Lastra José M. Mantas Carlos Ureña Manuel J. Castro José A. García-Rodríguez 《Mathematics and computers in simulation》2009

This paper addresses the speedup of the numerical solution of shallow-water systems in 2D domains by using modern graphics processing units (GPUs). A first order well-balanced finite volume numerical scheme for 2D shallow-water systems is considered. The potential data parallelism of this method is identified and the scheme is efficiently implemented on GPUs for one-layer shallow-water systems. Numerical experiments performed on several GPUs show the high efficiency of the GPU solver in comparison with a highly optimized implementation of a CPU solver. 相似文献

19.

双重并行环境下最短路径的研究

下载免费PDF全文

孙玉强李银银顾玉宛《计算机测量与控制》2017,25(3):195-196, 230

并行问题和最短路径问题已成为一个热点研究课题,传统的最短路径算法已不能满足数据爆炸式增长的处理需求,尤其当网络规模很大时,所需的计算时间和存储空间也大大的增加;MapReduce模型的出现,带来了一种新的解决方法来解决最短路径;GPU具有强大的并行计算能力和存储带宽,与CPU相比具有明显的优势;通过研究MapReduce模型和GPU执行过程的分析,指出单独基于MapReduce模型的最短路径并行方法存在的问题,降低了系统的性能;论文的创新点是结合MapReduce和GPU形成双并行模型,并行预处理数据,针对最短路径中的数据传输和同步开销,增加数据动态处理器;最后实验从并行算法的性能评价指标平均加速比进行比较,结果表明,双重并行环境下的最短路径的计算,提高了加速比。相似文献

20.

Transparent partial page migration between CPU and GPU

Shiqing ZHANG Zheng QIN Yaohua YANG Li SHEN Zhiying WANG 《Frontiers of Computer Science》2020,14(3):143101-13

Despite the increasing investment in integrated GPU and next-generation interconnect research,discrete GPU connected by PCIe still account for the dominant position of the market,the management of data communication between CPU and GPU continues to evolve.Initially,the programmer explicitly controls the data transfer between CPU and GPU.To simplify programming and enable systemwide atomic memory operations,GPU vendors have developed a programming model that provides a single,virtual address space for accessing all CPU and GPU memories in the system.The page migration engine in this model automatically migrates pages between CPU and GPU on demand.To meet the needs of high-performance workloads,the page size tends to be larger.Limited by low bandwidth and high latency interconnects compared to GDDR,larger page migration has longer delay,which may reduce the overlap of computation and transmission,waste time to migrate unrequested data,block subsequent requests,and cause serious performance decline.In this paper,we propose partial page migration that only migrates the requested part of a page to reduce the migration unit,shorten the migration latency,and avoid the performance degradation of the full page migration when the page becomes larger.We show that partial page migration is possible to largely hide the performance overheads of full page migration.Compared with programmer controlled data transmission,when the page size is 2MB and the PCIe bandwidth is 16GB/sec,full page migration is 72.72×slower,while our partial page migration achieves 1.29×speedup.When the PCIe bandwidth is changed to 96GB/sec,full page migration is 18.85×slower,while our partial page migration provides 1.37×speedup.Additionally,we examine the performance impact that PCIe bandwidth and migration unit size have on execution time,enabling designers to make informed decisions. 相似文献