19 similar documents found (search time: 62 ms)
1.
In recent years, the rapidly expanding programmability of the graphics processing unit (GPU), combined with the high speed and parallelism of its rendering pipeline, has made general-purpose computation on GPUs (GPGPU) a hot research topic. To address the low efficiency of the back-propagation (BP) algorithm for large-scale neural networks, a GPU-accelerated BP algorithm is proposed. The forward computation and backward learning of the BP network are mapped onto GPU texture-rendering passes, so that the GPU's powerful floating-point capability and highly parallel architecture can be exploited to evaluate the BP algorithm. Experimental results show that, with no loss of accuracy in the results, the method runs significantly faster.
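A minimal CUDA sketch of the data-parallel structure behind such a forward pass, with one thread per output neuron of a fully connected layer. The original work drives this through texture-rendering passes rather than CUDA kernels, and all names here (bpForwardLayer, W, b, x, y) are illustrative assumptions rather than the authors' code.

// One thread per output neuron: y[j] = sigmoid(sum_i W[j*nIn + i] * x[i] + b[j]).
__global__ void bpForwardLayer(const float* W, const float* b,
                               const float* x, float* y,
                               int nIn, int nOut)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nOut) return;
    float acc = b[j];
    for (int i = 0; i < nIn; ++i)
        acc += W[j * nIn + i] * x[i];       // row j of the weight matrix
    y[j] = 1.0f / (1.0f + expf(-acc));      // sigmoid activation
}

A launch such as bpForwardLayer<<<(nOut + 255) / 256, 256>>>(dW, db, dx, dy, nIn, nOut) evaluates one layer; the backward (learning) pass follows the same one-thread-per-neuron pattern.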
2.
3.
4.
5.
6.
7.
Molecular dynamics (MD) simulation is the primary method for studying the thermodynamic properties of silicon nanofilms, but the large volume of data, compute-intensive workload, and complex interatomic potential models limit its wider application. To address the non-contiguous memory accesses and heavy branching in crystalline-silicon MD simulation algorithms, which waste parallel resources and cause thread stalls, the MD algorithm is redesigned around the architectural features of the Nvidia Tesla V100 GPU. Optimizations such as coalesced global-memory access, loop unrolling, and atomic operations exploit the GPU's massive parallelism and floating-point throughput, reducing device-memory traffic as well as branch divergence and conditional instructions, and thereby improving overall performance. Test results show that the optimized crystalline-silicon MD simulation runs 1.69x-1.97x faster than before optimization, and 3.20x-3.47x and 17.40x-38.04x faster than the mainstream GPU-accelerated MD packages HOOMD-blue and LAMMPS, respectively, demonstrating good simulation acceleration.
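As a rough illustration of the access pattern being optimized, the sketch below computes pair forces with one thread per atom reading a CSR-style neighbor list; the float4 loads illustrate coalesced global-memory reads and the unroll pragma stands in for the paper's loop unrolling. A Lennard-Jones force replaces the silicon many-body potential actually used, and all identifiers are assumptions, not the paper's code.

// One thread per atom: accumulate pair forces from a CSR-style neighbour list
// (nbrStart has nAtoms + 1 entries; nbrList holds neighbour indices).
__global__ void pairForces(const float4* pos, float4* force,
                           const int* nbrStart, const int* nbrList,
                           float epsilon, float sigma, int nAtoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 pi = pos[i];                              // coalesced float4 load
    float fx = 0.f, fy = 0.f, fz = 0.f;

    #pragma unroll 4                                 // partial unrolling hint
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
        float4 pj = pos[nbrList[k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s2 = sigma * sigma / r2;
        float s6 = s2 * s2 * s2;
        float fr = 24.f * epsilon * s6 * (2.f * s6 - 1.f) / r2;  // (-dU/dr)/r
        fx += fr * dx; fy += fr * dy; fz += fr * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);         // each thread owns its output
}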
8.
Progressive transmission of remote-sensing images greatly improves data response efficiency, but it also increases the computational load on the receiving side. To further improve transmission efficiency, a parallel acceleration method based on programmable graphics hardware (GPU) is studied: image reconstruction is accelerated by parallelizing the inverse wavelet transform on the GPU, data-read efficiency is improved with texture lookup tables, and the intermediate results of the multi-level wavelet transform are kept in off-screen rendering buffers (Pbuffers), further improving parallel efficiency. Finally, experiments verify the effectiveness of the method.
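The paper performs the inverse wavelet transform through the graphics pipeline (textures and Pbuffers); the CUDA kernel below is only a stand-in showing how one synthesis step of a 1-D Haar transform parallelizes, assuming the forward transform stored pairwise averages and differences. All names are illustrative.

// One inverse (synthesis) step of a 1-D Haar wavelet, one thread per output pair.
// Assumes forward transform: approx = (x0 + x1) / 2, detail = (x0 - x1) / 2.
__global__ void inverseHaarStep(const float* approx, const float* detail,
                                float* out, int halfLen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= halfLen) return;
    float a = approx[i];
    float d = detail[i];
    out[2 * i]     = a + d;   // reconstruct the even sample
    out[2 * i + 1] = a - d;   // reconstruct the odd sample
}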
9.
As an important branch of neural networks, the convolutional neural network (CNN) is better suited than other neural network methods to learning and representing image features. As CNNs continue to develop, they face new challenges: parameter counts keep growing, making CNNs extremely demanding in computation, so many techniques have been proposed to compress CNN models. Compressed models, however, often contain sparse data structures, and this sparsity hurts CNN performance on GPUs. To address this problem, a direct sparse convolution algorithm is adopted to accelerate GPU processing of sparse data. Based on the characteristics of this algorithm, the convolution is converted into inner products between sparse vectors and dense vectors and implemented on the GPU platform. The proposed optimization exploits data sparsity and network structure to assign threads and schedule work, and exploits data locality to manage memory replacement, so that the GPU can still process convolutional layers efficiently in sparse convolutional neural networks (SCNNs). Compared with a cuBLAS implementation, speedups of 1.07x-1.23x, 1.17x-3.51x, and 1.32x-5.00x are achieved on AlexNet, GoogleNet, and ResNet, respectively; compared with a cuSPARSE implementation, the speedups are 1.31x-1.42x, 1.09x-2.00x, and 1.07x-3.22x.
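A sketch of the sparse-times-dense inner product that direct sparse convolution reduces to: the nonzero kernel weights are stored as values plus precomputed flattened input offsets, and each thread produces one output element. The layout and names (wVal, wOff, inputCol) are assumptions for illustration, not the authors' implementation.

// Direct sparse convolution as sparse (weights) x dense (activations) dot products.
// wVal[k] is the k-th nonzero weight, wOff[k] its flattened offset into the input;
// each thread computes one output element o by sliding over the dense input.
__global__ void sparseConv(const float* wVal, const int* wOff, int nnz,
                           const float* inputCol, float* output, int nOut)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= nOut) return;
    float acc = 0.f;
    for (int k = 0; k < nnz; ++k)                 // visit only the nonzero weights
        acc += wVal[k] * inputCol[wOff[k] + o];   // dense activation, strided read
    output[o] = acc;
}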
10.
Traditional crowd-evacuation methods that run serially on the central processing unit (CPU) give good simulation results for scenes with small crowds, but they cannot meet real-time requirements when crowd density is high. To overcome this problem, a crowd-evacuation simulation method based on the graphics processing unit (GPU) is implemented. By optimizing the per-individual path-finding algorithm, the method not only lets individuals find paths quickly, accurately, and intelligently, but also combines the independence of per-individual path-finding with the GPU's high-performance computing characteristics, fully exploiting its powerful parallel computing capability. This greatly increases the crowd size that can be simulated and improves the real-time performance of crowd-evacuation simulation.
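A minimal sketch of the one-thread-per-agent mapping the paper exploits. The abstract does not detail the optimized path-finding, so a straight-line step toward each agent's assigned exit stands in for it; all identifiers are illustrative.

// One thread per agent: advance each agent one simulation step toward its goal.
__global__ void stepAgents(float2* pos, const float2* goal,
                           float speed, float dt, int nAgents)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAgents) return;
    float2 p = pos[i], g = goal[i];
    float dx = g.x - p.x, dy = g.y - p.y;
    float len = sqrtf(dx * dx + dy * dy);
    if (len < 1e-6f) return;                       // agent already at the exit
    float s = fminf(speed * dt, len) / len;        // do not overshoot the goal
    pos[i] = make_float2(p.x + dx * s, p.y + dy * s);
}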
11.
A GPU-accelerated parallel algorithm for connected-component labeling of binary images. Total citations: 1 (self-citations: 0, by others: 1)
Exploiting the parallel architecture and hardware features of GPUs under NVIDIA's Compute Unified Device Architecture (CUDA), a new parallel algorithm for connected-component labeling of binary images is proposed. It identifies the position and size of connected components quickly and effectively, greatly reducing labeling time. The algorithm labels components by having each pixel search for the minimum label value in its neighborhood; the pixels can be processed in any order and do not depend on one another, so the work can be executed in parallel. The efficiency of the algorithm is unaffected by the shape or number of connected components, giving it good robustness. Experimental results show that the parallel algorithm makes full use of the GPU's parallel processing power: on high-resolution images with many connected components it is about 300 times faster than a typical CPU labeling algorithm and nearly 17 times faster than OpenCV's optimized (CPU) function.
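The labeling rule described above can be written as an iterative CUDA kernel: labels are initialized on the host to each pixel's linear index, and the kernel is relaunched until the changed flag stays zero. This is a generic minimum-label-propagation sketch under those assumptions, not the authors' exact kernel.

// One propagation pass: every foreground pixel takes the minimum label among
// itself and its 4-neighbours. Pixel order does not matter, so threads are free
// to run in parallel; repeat until no label changes.
__global__ void propagateLabels(const unsigned char* img, int* label,
                                int* changed, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int idx = y * w + x;
    if (img[idx] == 0) return;                     // background pixel

    int best = label[idx];
    if (x > 0     && img[idx - 1] && label[idx - 1] < best) best = label[idx - 1];
    if (x < w - 1 && img[idx + 1] && label[idx + 1] < best) best = label[idx + 1];
    if (y > 0     && img[idx - w] && label[idx - w] < best) best = label[idx - w];
    if (y < h - 1 && img[idx + w] && label[idx + w] < best) best = label[idx + w];

    if (best < label[idx]) {                       // shrink label, request new pass
        label[idx] = best;
        atomicExch(changed, 1);
    }
}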
12.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
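As a sketch of the structured-grid, explicit-time-integration pattern the paper targets, the kernel below performs one first-order Lax-Friedrichs update of a single advected scalar with one thread per interior cell. The real SWsolver code solves the full shallow-water system with a high-resolution scheme, so this only illustrates the stencil structure; all names are assumptions.

// One explicit time step on a regular Cartesian grid (nx by ny cells, row-major).
// lambdaX = a*dt/dx and lambdaY = b*dt/dy for advection speeds (a, b).
__global__ void laxFriedrichsStep(const float* u, float* uNew,
                                  float lambdaX, float lambdaY, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // skip ghost cells
    int c = j * nx + i;
    float uE = u[c + 1],  uW = u[c - 1];
    float uN = u[c + nx], uS = u[c - nx];
    // neighbour average minus the upwind-averaged flux differences
    uNew[c] = 0.25f * (uE + uW + uN + uS)
            - 0.5f * lambdaX * (uE - uW)
            - 0.5f * lambdaY * (uN - uS);
}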
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
13.
The use of graphics processing unit (GPU) parallel processing is becoming part of mainstream statistical practice. The reliance of Bayesian statistics on Markov chain Monte Carlo (MCMC) methods makes the applicability of parallel processing not immediately obvious. It is illustrated that computing the likelihood with GPU parallel processing yields substantial gains in computational time for MCMC and other methods of evaluation. Examples use data from the Global Terrorism Database to model terrorist activity in Colombia from 2000 through 2010 and a likelihood based on the explicit convolution of two negative-binomial processes. Results show decreases in computational time by a factor of over 200. Factors influencing these improvements and guidelines for programming parallel implementations of the likelihood are discussed.
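A sketch of the GPU likelihood evaluation this approach relies on: each thread computes the log-likelihood contribution of one observation and the contributions are summed on the device. A single negative-binomial log-pmf stands in for the paper's convolution of two negative-binomial processes, and the parameter names are assumptions.

// One thread per observation: add its negative-binomial log-likelihood term to
// *logLik (which must be zeroed before the launch). r is the size parameter and
// p the success probability proposed by the MCMC sampler on the host.
__global__ void nbLogLik(const int* y, int n, float r, float p, float* logLik)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float k = (float)y[i];
    // log P(Y = k) = lgamma(k + r) - lgamma(r) - lgamma(k + 1) + r*log(p) + k*log(1 - p)
    float ll = lgammaf(k + r) - lgammaf(r) - lgammaf(k + 1.0f)
             + r * logf(p) + k * logf(1.0f - p);
    atomicAdd(logLik, ll);   // simple reduction; a tree reduction scales better
}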
14.
To suppress speckle in large synthetic aperture radar (SAR) images, a gradient inverse weighted (GIW) algorithm with an adaptive window is proposed, and the implementation is optimized on the graphics processing unit (GPU). This effectively addresses the poor real-time performance of large-format radar image processing caused by the high complexity of pixel-level computation. Results on high-resolution SAR images show that the algorithm suppresses speckle noise effectively while preserving edge detail. With GPU acceleration, processing of large images is more than two orders of magnitude faster than the central processing unit (CPU) implementation.
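A simplified sketch of gradient inverse weighted filtering with a fixed 3x3 window, following the classical GIW formulation in which each neighbor is weighted by the inverse of its absolute intensity difference from the center and the center keeps half of the total weight. The paper's adaptive window selection and GPU-specific tuning are omitted; all names are illustrative.

// GIW smoothing of a single-channel image, one thread per interior pixel.
__global__ void giwFilter(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;  // skip the border
    float c = in[y * w + x];
    float wsum = 0.f, acc = 0.f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float v  = in[(y + dy) * w + (x + dx)];
            float wt = 1.0f / (fabsf(v - c) + 1e-3f);   // inverse-gradient weight
            acc  += wt * v;
            wsum += wt;
        }
    out[y * w + x] = 0.5f * c + 0.5f * acc / wsum;      // centre keeps weight 1/2
}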
15.
16.
Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization) implements a Bayesian algorithm for cryo-EM structure determination and is one of the most widely used software packages in this field. Many researchers have devoted effort to improving the performance of RELION to keep up with the analysis of ever-increasing volumes of data. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, accelerated computation of symmetries, and memory-affinity optimization. The experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline-model analysis to understand the effectiveness of our optimizations.
17.
Journal of Systems Architecture, 2015, 61(10): 576-583
Graphics Processing Units (GPUs) have a large and complex design space that needs to be explored in order to optimize the performance of future GPUs. Statistical techniques are useful tools that help computer architects predict the performance of complex processors. In this study, these methods are used to build a model that predicts GPU performance efficiently. The design space of the targeted Fermi GPU has more than 8 million points, which makes exploring it a challenging process. To build an accurate model, we propose a two-tier algorithm that builds a multiple linear regression model from a small set of simulated data: the Plackett-Burman design is used to find the key parameters of the GPU, and further simulations are guided by a fractional factorial design over the most important parameters. Our algorithm constructs a GPU performance predictor that can predict the performance of any point in the design space with an average prediction error between 1% and 5% for different benchmark applications. In addition, compared with other methods that need a large number of sampling points, our method achieves this accuracy by sampling only between 0.0003% and 0.0015% of the full design space.
18.
An online beam dynamics simulator is being developed for use in the operation of an ion linear particle accelerator. By employing Graphics Processing Unit (GPU) technology, the simulator's performance has been increased significantly over that of a single CPU, making it viable in the demanding accelerator operations environment. Once connected to the accelerator control system, it can rapidly respond to any control set-point changes and predict beam properties along an ion linear accelerator in pseudo-real time. This simulator will serve as a virtual beam diagnostic tool, which is especially useful when direct beam measurements are not available. Details of the code structure design, physics algorithms, GPU implementations, and performance are presented.
19.
To address the sensitivity of background-subtraction algorithms to illumination changes and the salt-and-pepper noise that appears in the extracted foreground, a background-subtraction method based on a coupled hidden Markov model (CHMM) is proposed. The Markov property of each pixel is analyzed and a coupled hidden Markov model is built per pixel; the transition probabilities of the hidden states are estimated by temporal statistics, suitable foreground and background standard deviations are chosen experimentally, and the Viterbi algorithm is used to solve for the optimal hidden-state sequence of the coupled HMM. Applying the algorithm to a traffic-surveillance video shows that it effectively suppresses the influence of illumination changes and, to some extent, suppresses foreground noise.
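A simplified sketch of the per-pixel decoding step: one thread runs a two-state (background/foreground) Viterbi recursion with Gaussian emissions over that pixel's intensity sequence. The coupling between neighboring pixels and the backtracking of the full state path are omitted, the means, standard deviations, and transition probabilities are assumed to have been estimated beforehand, and all names are illustrative.

// One thread per pixel: two-state Viterbi scores over T frames; only the
// final-frame decision is reported (no backtracking of the whole path).
__global__ void viterbiPerPixel(const float* frames,      // T frames of nPixels each
                                unsigned char* fgMask,
                                const float* muBg, const float* muFg,
                                float sigmaBg, float sigmaFg,
                                float pStay, int T, int nPixels)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPixels) return;

    float logStay = logf(pStay), logSwitch = logf(1.0f - pStay);
    float d0 = 0.f, d1 = 0.f;                      // running log-scores: bg, fg
    for (int t = 0; t < T; ++t) {
        float x  = frames[t * nPixels + p];
        float zB = (x - muBg[p]) / sigmaBg;
        float zF = (x - muFg[p]) / sigmaFg;
        float eBg = -0.5f * zB * zB - logf(sigmaBg);   // Gaussian log-density (up to a constant)
        float eFg = -0.5f * zF * zF - logf(sigmaFg);
        float n0 = fmaxf(d0 + logStay, d1 + logSwitch) + eBg;
        float n1 = fmaxf(d1 + logStay, d0 + logSwitch) + eFg;
        d0 = n0; d1 = n1;
    }
    fgMask[p] = (d1 > d0) ? 255 : 0;               // foreground if the fg path wins
}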