Similar Literature
 20 similar records found (search time: 15 ms)
1.
This paper introduces how to optimize a practical prestack Kirchhoff time migration program with the Compute Unified Device Architecture (CUDA) on a general-purpose GPU (GPGPU). Several useful GPGPU optimization methods are demonstrated, such as increasing the number of kernel threads on the GPU cores and using memory streams to overlap GPU kernel execution. The floating-point errors on CUDA and NVidia's GPUs are discussed in detail, and effective methods for reducing these errors are introduced. The images generated by the practical prestack Kirchhoff time migration programs on CPU and GPU for the same real-world seismic data are shown. The final GPGPU version on an NVidia GTX 260 is more than 17 times faster than the original CPU version on an Intel P4 3.0G.
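The abstract mentions effective methods for reducing floating-point error on the GPU but does not name them. One widely used technique in long accumulation loops such as Kirchhoff trace summation is Kahan (compensated) summation; the sketch below is a generic CUDA illustration of that idea, not the paper's actual kernel, and all identifiers (accumulate_kahan, contrib, image) are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical illustration: compensated (Kahan) summation inside a migration-style
// accumulation loop, one thread per output image sample. This is NOT the paper's
// kernel; it only sketches one well-known way to curb float round-off on GPUs.
__global__ void accumulate_kahan(const float* __restrict__ contrib, // [nTraces x nSamples]
                                 float* __restrict__ image,
                                 int nSamples, int nTraces)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nSamples) return;

    float sum = 0.0f;   // running sum of trace contributions
    float c   = 0.0f;   // running compensation for lost low-order bits
    for (int t = 0; t < nTraces; ++t) {
        float y = contrib[(size_t)t * nSamples + i] - c;
        float s = sum + y;          // low-order bits of y are lost here...
        c = (s - sum) - y;          // ...and recovered into c
        sum = s;
    }
    image[i] = sum;
}
```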

2.
With fierce competition between CPU and graphics processing unit (GPU) platforms, performance evaluation has become the focus of various sectors. In this paper, we take a well-known algorithm in the field of biosequence matching and database searching, the Smith–Waterman (S-W) algorithm, as an example, and demonstrate approaches that fully exploit its performance potential on CPU, GPU, and field-programmable gate array (FPGA) computing platforms. For CPU platforms, we perform two optimizations, single instruction, multiple data and multithreading, with compiler options, to gain over 70x speedups over naive CPU versions on quad-core CPU platforms. For GPU platforms, we propose the combination of coalesced global memory accesses, shared memory tiles, and loop unfolding, achieving 50x speedups over initial GPU versions on an NVIDIA GeForce GTX 470 card. Experimental results show that the GPU GTX 470 gains 12x speedups, instead of the 100x reported by some studies, over the Intel quad-core CPU Q9400, under the same manufacturing technology and with both platforms fully optimized. In addition, for FPGA platforms, we customize a linear systolic array for the S-W algorithm in a 45-nm FPGA chip from Xilinx (XC6VLX760), with up to 1024 processing elements. At only a 133 MHz clock rate, the FPGA platform reaches the highest performance and becomes the most power-efficient platform, using only 25 W compared with 190 W for the GPU GTX 470. Copyright © 2011 John Wiley & Sons, Ltd.
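As a rough illustration of the kind of GPU kernel being tuned, here is a minimal CUDA Smith–Waterman scoring sketch using inter-task parallelism (one thread per subject sequence) with the query cached in a shared-memory tile. It assumes a linear gap penalty and a bounded query length, and uses a simple contiguous sequence layout; the coalescing optimization described in the paper would instead interleave residues of neighboring sequences. It is not the authors' implementation, and all identifiers are hypothetical.

```cuda
#define MAX_QUERY_LEN 256   // assumed upper bound on query length for the per-thread buffer

// One thread scores one subject sequence against the query; linear gap penalty.
// Launch with: sw_score_kernel<<<grid, block, queryLen>>>(...).
__global__ void sw_score_kernel(const char* __restrict__ db,       // concatenated subject sequences
                                const int*  __restrict__ offsets,  // start offset of each sequence
                                const int*  __restrict__ lengths,
                                const char* __restrict__ query, int queryLen, int nSeqs,
                                int match, int mismatch, int gap,  // mismatch: negative score; gap: positive penalty
                                int* __restrict__ bestScore)
{
    extern __shared__ char q[];                                    // shared-memory tile holding the query
    for (int k = threadIdx.x; k < queryLen; k += blockDim.x) q[k] = query[k];
    __syncthreads();

    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= nSeqs || queryLen > MAX_QUERY_LEN) return;

    const char* subj = db + offsets[s];
    int len = lengths[s];

    int Hprev[MAX_QUERY_LEN + 1];                                  // column j-1 of the DP matrix
    for (int i = 0; i <= queryLen; ++i) Hprev[i] = 0;

    int best = 0;
    for (int j = 0; j < len; ++j) {
        char d = subj[j];
        int diag = 0, up = 0;                                      // H[i-1][j-1] and H[i-1][j]
        for (int i = 1; i <= queryLen; ++i) {
            int sc   = (q[i - 1] == d) ? match : mismatch;
            int left = Hprev[i];                                   // H[i][j-1]
            int h = diag + sc;
            if (up   - gap > h) h = up   - gap;
            if (left - gap > h) h = left - gap;
            if (h < 0) h = 0;                                      // local-alignment floor
            diag = left;  up = h;  Hprev[i] = h;                   // column j becomes "left" for column j+1
            if (h > best) best = h;
        }
    }
    bestScore[s] = best;
}
```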

3.
Improving the performance of sparse matrix-vector multiplication (SpMV) is important but difficult because of its irregular memory access. General-purpose GPUs (GPGPU) provide high computing power and substantial bandwidth that SpMV cannot fully exploit due to this irregularity. In this paper, we propose two novel methods to optimize memory bandwidth for SpMV on GPGPU. First, a new storage format is proposed that packs as many non-zeros as possible into a layout suited to exploiting the memory bandwidth of the GPU architecture. Second, we propose a cache-blocking method to improve SpMV performance on the GPU: the sparse matrix is partitioned into sub-blocks stored in CSR format, so the corresponding part of the vector x can be reused in the GPU cache and the number of global-memory accesses for x is greatly reduced. Experiments are carried out on three GPU platforms, GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. The results show that both methods efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU.
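For reference, this is the baseline one-thread-per-row CSR kernel whose irregular, poorly coalesced access pattern motivates formats and cache-blocking schemes like those proposed in the paper; it is a textbook sketch, not the paper's code.

```cuda
// Baseline one-thread-per-row CSR SpMV (y = A*x). Neighboring threads read row ranges
// of different lengths, so global loads are poorly coalesced and accesses to x are
// irregular -- exactly what the paper's new format and cache blocking try to fix.
__global__ void spmv_csr_scalar(int nRows,
                                const int*   __restrict__ rowPtr,
                                const int*   __restrict__ colIdx,
                                const float* __restrict__ vals,
                                const float* __restrict__ x,
                                float*       __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;

    float dot = 0.0f;
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        dot += vals[k] * x[colIdx[k]];   // x is reused across rows -> candidate for cache blocking
    y[row] = dot;
}
```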

4.
Flux splitting is the main means of achieving upwind behavior for systems of equations. To analyze the performance of typical flux-splitting schemes on a CPU/GPU heterogeneous platform, a one-dimensional Euler solver was implemented with the Compute Unified Device Architecture (CUDA) programming model on an NVIDIA GTX 1660 Super. Taking the shock-tube Riemann problem as the test case, the flux-vector-splitting scheme of van Leer, the flux-difference-splitting scheme of Roe, and the hybrid flux-splitting scheme AUSMPW+ were computed and analyzed. The numerical results show that all three schemes yield reasonable and usable results on the heterogeneous computing system. The Roe scheme has the highest shock resolution and the best CPU/GPU speedup; the van Leer scheme resolves shocks less sharply than Roe and AUSMPW+ and is computationally efficient, but its construction contains many conditional branches, which hurts its acceleration; the AUSMPW+ scheme matches Roe in shock resolution and accelerates slightly better than van Leer.
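For concreteness, a textbook CUDA sketch of van Leer flux-vector splitting for the 1D Euler equations is given below, one thread per cell. It is a generic formulation assumed for illustration, not the authors' solver, and the array layout and names are hypothetical; the interface flux between cells i and i+1 would then be F_{i+1/2} = Fp[i] + Fm[i+1].

```cuda
#include <math.h>

// Textbook van Leer flux-vector splitting for the 1D Euler equations, one thread per cell.
__global__ void vanleer_split(int n, float gamma,
                              const float* __restrict__ rho,
                              const float* __restrict__ u,
                              const float* __restrict__ p,
                              float* __restrict__ Fp,   // 3*n: split "+" flux (mass, momentum, energy)
                              float* __restrict__ Fm)   // 3*n: split "-" flux
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = sqrtf(gamma * p[i] / rho[i]);     // speed of sound
    float M = u[i] / a;                          // Mach number
    float E = p[i] / (gamma - 1.0f) + 0.5f * rho[i] * u[i] * u[i];

    float f1 = rho[i] * u[i];                    // full physical flux components
    float f2 = rho[i] * u[i] * u[i] + p[i];
    float f3 = u[i] * (E + p[i]);

    if (M >= 1.0f) {            // supersonic to the right: everything travels downstream
        Fp[i] = f1;  Fp[i + n] = f2;  Fp[i + 2 * n] = f3;
        Fm[i] = 0.f; Fm[i + n] = 0.f; Fm[i + 2 * n] = 0.f;
    } else if (M <= -1.0f) {    // supersonic to the left
        Fp[i] = 0.f; Fp[i + n] = 0.f; Fp[i + 2 * n] = 0.f;
        Fm[i] = f1;  Fm[i + n] = f2;  Fm[i + 2 * n] = f3;
    } else {                    // subsonic: smooth polynomial split in Mach number
        float fmp =  0.25f * rho[i] * a * (M + 1.0f) * (M + 1.0f);
        float fmm = -0.25f * rho[i] * a * (M - 1.0f) * (M - 1.0f);
        float up  = ((gamma - 1.0f) * u[i] + 2.0f * a) / gamma;
        float um  = ((gamma - 1.0f) * u[i] - 2.0f * a) / gamma;
        Fp[i] = fmp;  Fp[i + n] = fmp * up;
        Fp[i + 2 * n] = fmp * up * up * gamma * gamma / (2.0f * (gamma * gamma - 1.0f));
        Fm[i] = fmm;  Fm[i + n] = fmm * um;
        Fm[i + 2 * n] = fmm * um * um * gamma * gamma / (2.0f * (gamma * gamma - 1.0f));
    }
}
```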

5.
We present a general-purpose computing on graphics processing units (GPGPU) based computational program and framework for the electronic dynamics of atomic systems under intense laser fields. We present our results using the case of hydrogen; however, the code is trivially extensible to problems within the single-active-electron (SAE) approximation. Building on our previous work, we introduce the first available GPGPU-based implementation of the Taylor, Runge–Kutta, and Lanczos methods created specifically with strong-field ab initio simulations in mind: CLTDSE. The code makes use of finite-difference methods and the OpenCL framework for GPU acceleration. The specific example system is the classic test system, hydrogen. After introducing the standard theory and the specific quantities that are calculated, the code, including installation and usage, is discussed in depth. This is followed by some examples and a short benchmark between an 8-hardware-thread (i.e., logical core) Intel Xeon CPU and an AMD 6970 GPU, where the parallel algorithm runs 10 times faster on the GPU than on the CPU.

6.
Face tracking is an important computer vision technology that has been widely adopted in many areas, from cell phone applications to industrial robots. In this paper, we introduce a novel way to parallelize a face contour detection application based on the color-entropy preprocessed Chan–Vese model utilizing a total variation G-norm. This application is a complicated, unsupervised computational method requiring a large amount of calculation, and several core parts are difficult to parallelize because of heavily correlated data processing among iterations and pixels. We develop a novel approach to parallelize these data-dependent core parts and significantly improve the runtime performance of the model computation. We implement the parallelized program in OpenCL for both multi-core CPU and GPU. For 640 * 480 input images, the parallelized program on an NVidia GTX970 GPU, an NVidia GTX660 GPU, and an AMD FX8530 8-core CPU is on average 18.6, 12.0, and 4.40 times faster than its single-thread C version on the AMD FX8530 CPU, respectively. Some parallelized routines show much higher improvement than the whole program; for instance, on the NVidia GTX970 GPU, the parallelized entropy filter routine is on average 74.0 times faster than its single-thread C version on the AMD FX8530 8-core CPU. We discuss the parallelization methodologies in detail, including scalability, thread models, and synchronization methods for both multi-core CPU and GPU.
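The entropy-filter routine is the part with the largest reported speedup; the sketch below shows a generic per-pixel local-entropy filter of the kind such a preprocessing step computes. It is written in CUDA rather than OpenCL for uniformity with the other sketches in this list, uses an assumed window radius and bin count, and is not the paper's code.

```cuda
#include <math.h>

// Generic per-pixel local entropy filter over a (2R+1)x(2R+1) window of an 8-bit
// grayscale image, quantized to BINS intensity levels. Illustrative only.
#define R    4
#define BINS 32

__global__ void entropy_filter(const unsigned char* __restrict__ in,
                               float* __restrict__ out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int hist[BINS] = {0};
    int count = 0;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx) {
            int xx = min(max(x + dx, 0), width  - 1);   // clamp at the image border
            int yy = min(max(y + dy, 0), height - 1);
            hist[in[yy * width + xx] * BINS / 256]++;
            ++count;
        }

    float H = 0.0f;
    for (int b = 0; b < BINS; ++b)
        if (hist[b] > 0) {
            float pr = (float)hist[b] / count;
            H -= pr * log2f(pr);                        // Shannon entropy of the neighborhood
        }
    out[y * width + x] = H;
}
```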

7.
The use of Graphics Processing Units (GPUs) for high-performance computing has gained growing momentum in recent years. Unfortunately, GPU-programming platforms like the Compute Unified Device Architecture (CUDA) are complex and user-unfriendly, and they increase the complexity of developing high-performance parallel applications. In addition, runtime systems that execute those applications often fail to fully utilize the parallelism of modern CPU-GPU systems: typically, parallel kernels run entirely on the most powerful device available, leaving other devices idle. These observations sparked research in two directions: (1) high-level approaches to software development for GPUs, which strike a balance between performance and ease of programming; and (2) task partitioning to fully utilize the available devices. In this paper, we propose a framework, called PSkel, that provides a single high-level abstraction for stencil programming on heterogeneous CPU-GPU systems, while allowing the programmer to partition and assign data and computation to both CPU and GPU. Our current implementation uses parallel skeletons to transparently leverage Intel Threading Building Blocks (Intel Corporation, Santa Clara, CA, USA) and NVIDIA CUDA (Nvidia Corporation, Santa Clara, CA, USA). In our experiments, we observed that parallel applications with task partitioning can improve average performance by up to 76% and 28% compared with CPU-only and GPU-only parallel applications, respectively. Copyright © 2015 John Wiley & Sons, Ltd.
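The core idea, splitting one stencil computation between CPU and GPU, can be sketched by hand as below. This is not PSkel's API; the fixed split ratio, the use of OpenMP in place of Intel TBB, and all names are illustrative assumptions, halo exchange between the two partitions is omitted, and h_in/d_in are assumed to hold identical copies of the input.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

// GPU partition of a 1D 3-point stencil: interior cells [lo, hi).
__global__ void stencil_gpu(const float* in, float* out, int lo, int hi, int n)
{
    int i = lo + blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= hi || i >= n - 1) return;
    out[i] = 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

void stencil_partitioned(const float* d_in, float* d_out,    // device copies
                         const float* h_in, float* h_out,    // host copies
                         int n, float gpuFraction)           // e.g. 0.7 sends 70% of cells to the GPU
{
    int split = (int)(n * gpuFraction);                      // cells [1, split) -> GPU, [split, n-1) -> CPU

    // Launch the GPU partition asynchronously so the CPU partition overlaps with it.
    int threads = 256, blocks = (split + threads - 1) / threads;
    stencil_gpu<<<blocks, threads>>>(d_in, d_out, 1, split, n);

    #pragma omp parallel for                                  // CPU partition (stand-in for TBB)
    for (int i = split; i < n - 1; ++i)
        h_out[i] = 0.25f * h_in[i - 1] + 0.5f * h_in[i] + 0.25f * h_in[i + 1];

    cudaDeviceSynchronize();                                  // join the two partitions
}
```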

8.
An optimization method for computing the principal eigenvector of large sparse matrices
Computing the principal eigenvector of a matrix (principal eigenvector computation, PEC) is an important problem in scientific and engineering computing. With the rise of general-purpose computing on graphics processing units (GPGPU), using GPUs to accelerate PEC for large sparse matrices has attracted wide attention. This paper analyzes the performance bottlenecks of PEC from the perspectives of both application characteristics and GPU architecture, proposes a GPU-oriented sparse matrix storage format, GPU-ELL, together with a thread-mapping optimization strategy for the GPU, and designs the corresponding optimized PEC algorithm. Experimental results on an ATI HD Radeon 5850 show that the scheme achieves speedups of up to about 200x over a conventional CPU and about 2x over an existing GPU implementation.
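The paper's GPU-ELL format is not spelled out in the abstract, but it builds on the standard ELLPACK layout, whose coalesced access pattern the following generic CUDA kernel illustrates; inside a power-iteration PEC loop this kernel would supply the repeated matrix-vector products. Names are hypothetical.

```cuda
// Standard ELLPACK-style SpMV: each row is padded to a fixed number of non-zeros and the
// arrays are stored column-major, so thread `row` reads vals[k*nRows + row] and neighboring
// threads access consecutive addresses (coalesced). Generic kernel, not the paper's GPU-ELL.
__global__ void spmv_ell(int nRows, int maxNnzPerRow,
                         const int*   __restrict__ colIdx,   // nRows * maxNnzPerRow, -1 = padding
                         const float* __restrict__ vals,
                         const float* __restrict__ x,
                         float*       __restrict__ y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;

    float dot = 0.0f;
    for (int k = 0; k < maxNnzPerRow; ++k) {
        int col = colIdx[k * nRows + row];
        if (col >= 0)                      // skip padded slots
            dot += vals[k * nRows + row] * x[col];
    }
    y[row] = dot;
}
```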

9.
Heterogeneous computing platforms composed of multi-core CPUs and GPUs have become an important direction in high-performance computing. To effectively improve the query performance of column-oriented databases and make full use of the computing resources of such heterogeneous platforms, a primitive-scheduling approach is proposed on top of a predefined set of column-database primitives. The approach consists of a primitive execution mechanism, a dynamic-programming-based CPU primitive scheduling method, and a GPU primitive scheduling method based on GPU memory management. It allows the system to make reasonable use of multi-core CPU resources and to exploit data locality in GPU device memory, thereby improving overall performance. Several typical queries from the TPC-H benchmark were tested; the results show that the CPU scheduling method makes queries more stable, while the GPU scheduling method makes them faster. The experiments also reveal shortcomings of the column-database scheduling approach on this heterogeneous platform, pointing out directions for future work.

10.
Many high-performance computing applications require computing both the sparse matrix-vector product (SMVP) and the sparse matrix-transpose vector product (SMTVP) for good overall performance. Under such circumstances, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both products on multi-core CPUs with nearly identical throughputs. On the other hand, a direct port of CSB to graphics processing units (GPUs), which have recently been recognized as a powerful general-purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated expanded CSB (eCSB), to minimize the throughput gap between SMVP and SMTVP computations on GPUs while enabling a high computing throughput. We also use a hybrid storage format to store the elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU deliver similar throughput on both SMVP and SMTVP, up to 13 times faster than the CPU-based CSB implementation. In addition, our eCSB procedure outperforms previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, respectively, and we validate the effectiveness of eCSB by the wall-clock time of the bi-conjugate gradient algorithm, where eCSB is 25% faster than compressed sparse row (CSR) and 6% faster than HYB. Copyright © 2014 John Wiley & Sons, Ltd.
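To see why CSB-like formats are needed, the sketch below shows the straightforward way to compute SMTVP (y = A^T x) from a plain CSR matrix on a GPU: each row's contributions scatter into y through atomics, which is exactly the throughput gap eCSB is designed to close. Generic illustration, not the paper's code.

```cuda
// Why SMTVP is hard with plain CSR on a GPU: walking A row by row turns the transpose
// product into a scatter, so every update to y needs an atomic and writes are contended.
__global__ void smtvp_csr_atomic(int nRows,
                                 const int*   __restrict__ rowPtr,
                                 const int*   __restrict__ colIdx,
                                 const float* __restrict__ vals,
                                 const float* __restrict__ x,
                                 float*       __restrict__ y)   // length nCols, pre-zeroed
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nRows) return;

    float xr = x[row];                       // column `row` of A^T is row `row` of A
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        atomicAdd(&y[colIdx[k]], vals[k] * xr);   // scattered, contended writes
}
```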

11.
Sparse matrix-vector multiplication (SpMV) is a key operation in solving sparse linear systems, but the sparsity of the non-zero elements leads to low computational density and thus low efficiency. Given the irregularity of sparse matrices, using a hybrid storage format for SpMV improves compression efficiency and widens the range of matrices that can be handled; HYB is a widely used hybrid format with relatively stable performance. As GPU parallel computing becomes widespread and CPUs grow increasingly multi-core, building heterogeneous parallel systems from GPUs and multi-core CPUs has gained broad acceptance. Exploiting the ELL and COO components of the HYB format, the two parts of the data are assigned to the GPU and the CPU for cooperative parallel computation, which makes full use of the computing resources of both devices and matches each part to the device it suits, thereby improving resource utilization. Based on an analysis of the CPU+GPU heterogeneous computing model, the data partitioning and sharing of the hybrid format are further optimized, which better exploits the heterogeneous environment and improves computing performance.
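The abstract does not state which HYB component runs on which device; a natural split, assumed in the sketch below, sends the regular ELL part to the GPU and the leftover COO entries to the CPU, overlapping the two and summing the partial results. Everything here (names, OpenMP on the CPU side, use of the default stream) is an illustrative assumption, not the paper's implementation.

```cuda
#include <omp.h>
#include <cuda_runtime.h>

// GPU side: regular ELL part of the HYB matrix (column-major arrays, c < 0 marks padding).
__global__ void spmv_ell_part(int nRows, int width,
                              const int* col, const float* val,
                              const float* x, float* y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= nRows) return;
    float dot = 0.0f;
    for (int k = 0; k < width; ++k) {
        int c = col[k * nRows + r];
        if (c >= 0) dot += val[k * nRows + r] * x[c];
    }
    y[r] = dot;
}

void spmv_hyb_cpu_gpu(int nRows, int ellWidth,
                      const int* d_ellCol, const float* d_ellVal,   // ELL part, device memory
                      const float* d_x, float* d_y, float* h_yGpu,  // h_yGpu: ideally pinned host buffer
                      int nnzCoo, const int* h_cooRow, const int* h_cooCol,
                      const float* h_cooVal, const float* h_x, float* h_y)
{
    // GPU handles the regular ELL part asynchronously on the default stream.
    int threads = 256, blocks = (nRows + threads - 1) / threads;
    spmv_ell_part<<<blocks, threads>>>(nRows, ellWidth, d_ellCol, d_ellVal, d_x, d_y);
    cudaMemcpyAsync(h_yGpu, d_y, nRows * sizeof(float), cudaMemcpyDeviceToHost, 0);

    // CPU handles the irregular COO tail concurrently.
    for (int r = 0; r < nRows; ++r) h_y[r] = 0.0f;
    #pragma omp parallel for
    for (int e = 0; e < nnzCoo; ++e) {
        #pragma omp atomic
        h_y[h_cooRow[e]] += h_cooVal[e] * h_x[h_cooCol[e]];
    }

    cudaDeviceSynchronize();                              // wait for the GPU partial result
    for (int r = 0; r < nRows; ++r) h_y[r] += h_yGpu[r];  // combine the two partial products
}
```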

12.
张佳康  陈庆奎 《计算机工程》2010,36(15):179-181
Addressing the suitability of GPUs, stream-processing devices with high floating-point throughput, for neural networks, a parallel recognition algorithm for convolutional neural networks is proposed. It adopts the Compute Unified Device Architecture (CUDA), defines parallel data structures on top of it, and describes how the computing tasks are mapped onto CUDA. Experimental results show that the parallel recognition algorithm implemented on a GTX200-architecture GPU achieves a peak average floating-point throughput nearly 60 times that of the serial algorithm on a CPU, making the GPU well suited to neural network applications.

13.
Exploiting the powerful parallel processing capability of general-purpose GPUs (GPGPU), an existing sparse magnetic resonance imaging (Sparse MRI) reconstruction algorithm is parallelized on the NVIDIA CUDA framework so that it can meet the demands of practical use. The reconstruction algorithm involves a large amount of floating-point computation and is so time-consuming that it is difficult to apply in practice, making acceleration and optimization necessary. Experimental results show that an NVIDIA GTX275 GPU reduces the computation time from more than four minutes to about 3.4 seconds, a 76x speedup compared with an Intel Q8200 CPU.

14.
Ray tracing of dynamic scenes on a hybrid CPU-GPU platform
张健  焦良葆  陈瑞 《计算机工程与应用》2012,48(21):151-154,159
A new method for ray tracing dynamic scenes is proposed that effectively schedules work between the CPU and the GPU to improve rendering speed. Based on the characteristics of the kd-tree acceleration structure, the tree is split into an upper part and a lower part: the upper part, which offers little parallelism, is built on the CPU, while the lower part, which is highly parallel, is built on the GPU, speeding up construction of the acceleration structure for dynamic scenes. By fully exploiting the characteristics of both processors, scheduling their execution effectively, and hiding part of the computation time, the rendering speed of dynamic scenes is further improved. Experimental results show that, on a PC equipped with a GeForce 285 GTX, the dynamic Kitchen scene with 11k triangles is rendered interactively with high realism.

15.
String matching is one of the most widely studied problems in computing and has become a core operation in fields such as information retrieval and biological computing. Constrained by CPU compute power and memory bandwidth, however, traditional serial string-matching algorithms are hard to speed up further. GPUs offer large gains in both compute power and memory bandwidth and have achieved excellent results in many applications. gAC, a GPU-based parallel Aho-Corasick (AC) algorithm, targets the GPU's SIMT (Single-Instruction Multiple-Thread) execution and coalesced memory access by reducing conditional branches and coalescing global memory accesses, reaching a string-scanning throughput of 51 Gb/s on a C1060 GPU, 28 times faster than the serial CPU algorithm.
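A common way to structure such a GPU Aho-Corasick scan is sketched below: the automaton is flattened into a dense state-transition table, each thread scans one chunk of the text, and a warm-up region of maxPatternLen-1 bytes before the chunk rebuilds the correct automaton state at the chunk boundary. This is a generic formulation, not gAC's actual kernel; it only counts matches rather than reporting positions, and all names are hypothetical.

```cuda
// Generic GPU Aho-Corasick scan: a dense DFA table (nStates x 256, goto/fail merged) plus a
// per-state count of patterns that end there. Each thread counts matches ending in its chunk.
__global__ void ac_scan(const unsigned char* __restrict__ text, long textLen,
                        const int* __restrict__ nextState,   // nStates * 256 transition table
                        const int* __restrict__ matchCount,  // patterns ending at each state
                        long chunkSize, int overlap,         // overlap = maxPatternLen - 1
                        unsigned int* __restrict__ hits)     // per-thread match counter
{
    long tid   = blockIdx.x * (long)blockDim.x + threadIdx.x;
    long begin = tid * chunkSize;
    if (begin >= textLen) return;
    long end   = (begin + chunkSize < textLen) ? begin + chunkSize : textLen;
    long start = (begin > overlap) ? begin - overlap : 0;    // warm-up rebuilds the state at `begin`

    int state = 0;                                           // AC root state
    unsigned int count = 0;
    for (long i = start; i < end; ++i) {
        state = nextState[state * 256 + text[i]];            // one table lookup per input byte
        if (i >= begin)                                      // count only matches ending in this chunk
            count += matchCount[state];
    }
    hits[tid] = count;
}
```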

16.
Spatial interpolation is a computationally complex and time-consuming operation in geographic information system (GIS) spatial analysis, and therefore cannot meet real-time requirements. With the dramatic increase in GPU floating-point performance, general-purpose GPU computing has become a research hotspot for the complex computations of GIS and provides a good opportunity to bring traditionally inefficient algorithms to real time. Exploiting the GPU's advantage in parallel computing, the inverse distance weighting (IDW) interpolation algorithm is mapped onto the Compute Unified Device Architecture (CUDA) parallel programming model. A two-level index is first built on the GPU to partition the computation sensibly, and a multi-threaded blocking strategy is then used to perform the interpolation in parallel. Experiments show that the interpolation error of this method, compared with the CPU method, stays on the order of 10^-6, and that with a large interpolation radius and many data points the algorithm achieves speedups of more than 40x, fully demonstrating its correctness and efficiency.
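A minimal CUDA sketch of the underlying IDW computation is shown below: one thread per grid node with a brute-force scan of the sample points within the search radius. The two-level index and blocking strategy described in the abstract are omitted, and all names and parameters are hypothetical.

```cuda
#include <math.h>

// Generic inverse-distance-weighted (IDW) interpolation: one thread per output grid node.
__global__ void idw_interpolate(const float* __restrict__ sx,   // sample x coordinates
                                const float* __restrict__ sy,   // sample y coordinates
                                const float* __restrict__ sv,   // sample values
                                int nSamples, float power, float radius,
                                const float* __restrict__ gx,   // grid-node coordinates
                                const float* __restrict__ gy,
                                float* __restrict__ gv, int nNodes)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nNodes) return;

    float num = 0.0f, den = 0.0f;
    for (int j = 0; j < nSamples; ++j) {
        float dx = gx[i] - sx[j], dy = gy[i] - sy[j];
        float d2 = dx * dx + dy * dy;
        if (d2 > radius * radius) continue;                   // outside the search radius
        if (d2 < 1e-12f) { num = sv[j]; den = 1.0f; break; }  // node coincides with a sample
        float w = 1.0f / powf(sqrtf(d2), power);              // weight = 1 / distance^p
        num += w * sv[j];
        den += w;
    }
    gv[i] = (den > 0.0f) ? num / den : 0.0f;                  // 0 if no sample within the radius
}
```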

17.
As GPUs have evolved, their compute power and memory bandwidth have surpassed those of CPUs, and general-purpose computing on GPUs has become increasingly popular, giving rise to the new CPU-GPGPU heterogeneous architecture. Although this architecture shows strong performance advantages and has received wide attention from academia and industry, writing and running programs efficiently on it remains a major challenge. This paper surveys existing programmability, reliability, and low-power techniques for the architecture and, based on these techniques, looks ahead at the development trends of CPU-GPGPU heterogeneous systems.

18.
CPU/GPU heterogeneous systems have great development potential, and in-depth study of parallel optimization on CPU/GPU heterogeneous platforms can maximize the overall computing capability of the system. Optimizing CPU/GPU task partitioning to balance the load between the CPU and the GPU improves the utilization of computing resources and shortens task execution time, while optimizing GPU thread partitioning makes the GPU itself faster, thereby improving overall system performance.

19.
This paper studies the implementation of a dynamic pattern recognition algorithm on a GPU parallel computing platform. With the development of GPGPU (general-purpose GPU) hardware, GPU-based massively parallel computing can effectively handle the huge amount of computation that dynamic pattern recognition entails. The paper introduces the dynamic pattern recognition algorithm, analyzes its large computational load, decomposes its compute-intensive parts for parallelization, and removes the execution dependencies present in the original algorithm, finally obtaining a parallel implementation on a specific GPU platform, Jacket. Case studies show that, compared with the original serial CPU program, the parallelized GPU program achieves a clear speedup and thus has good engineering value.

20.
In biological research, computer alignment of protein sequences is often needed to find similarities between them. Although results can be computed in a reasonable time for the alignment of two sequences, massive sequence-alignment problems such as protein database search remain very central processing unit (CPU) time-consuming. In this paper, an optimized protein database search method is presented and tested with the Swiss-Prot database on graphics processing unit (GPU) devices; further, CPU multi-threaded computing is also used to realize a GPU-based heterogeneous parallelism. In the proposed method, a hybrid alignment approach combines the Smith–Waterman local alignment algorithm with the Needleman–Wunsch global alignment algorithm, and parallel database search is realized with the compute unified device architecture (CUDA) parallel computing framework. In the experiments, the algorithm is tested on a lower-end and a higher-end personal computer equipped with GeForce GTX 750 Ti and GeForce GTX 1070 graphics cards, respectively. The results show that the proposed parallel method achieves a speedup of up to 138.86 times over its serial counterpart, significantly improving the efficiency and convenience of protein database search.
