19 similar documents found (search time: 62 ms)
1.
In recent years, the rapidly expanding programmability of the graphics processing unit (GPU), combined with the high speed and parallelism of its rendering pipeline, has made general-purpose computation on GPUs (GPGPU) a hot research topic. To address the low efficiency of the back-propagation (BP) algorithm for large-scale neural networks, a GPU-accelerated BP algorithm is proposed. The forward computation and backward learning of the BP network are mapped onto GPU texture-rendering passes, so that the GPU's powerful floating-point capability and highly parallel architecture can be exploited to evaluate the BP algorithm. Experimental results show that, with no loss of accuracy in the results, the method runs significantly faster.
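A minimal CUDA sketch of the data-parallel structure behind such a forward pass, with one thread per output neuron of a fully connected layer. The original work drives this through texture-rendering passes rather than CUDA kernels, and all names here (bpForwardLayer, W, b, x, y) are illustrative assumptions rather than the authors' code.

// One thread per output neuron: y[j] = sigmoid(sum_i W[j*nIn + i] * x[i] + b[j]).
__global__ void bpForwardLayer(const float* W, const float* b,
                               const float* x, float* y,
                               int nIn, int nOut)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nOut) return;
    float acc = b[j];
    for (int i = 0; i < nIn; ++i)
        acc += W[j * nIn + i] * x[i];       // row j of the weight matrix
    y[j] = 1.0f / (1.0f + expf(-acc));      // sigmoid activation
}

A launch such as bpForwardLayer<<<(nOut + 255) / 256, 256>>>(dW, db, dx, dy, nIn, nOut) evaluates one layer; the backward (learning) pass follows the same one-thread-per-neuron pattern.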
2.
3.
4.
5.
6.
7.
Molecular dynamics (MD) simulation is the primary method for studying the thermodynamic properties of silicon nanofilms, but the large volume of data, compute-intensive workload, and complex interatomic potential models limit its wider application. To address the non-contiguous memory accesses and heavy branching in crystalline-silicon MD simulation algorithms, which waste parallel resources and cause thread stalls, the MD algorithm is redesigned around the architectural features of the Nvidia Tesla V100 GPU. Optimizations such as coalesced global-memory access, loop unrolling, and atomic operations exploit the GPU's massive parallelism and floating-point throughput, reducing device-memory traffic as well as branch divergence and conditional instructions, and thereby improving overall performance. Test results show that the optimized crystalline-silicon MD simulation runs 1.69x-1.97x faster than before optimization, and 3.20x-3.47x and 17.40x-38.04x faster than the mainstream GPU-accelerated MD packages HOOMD-blue and LAMMPS, respectively, demonstrating good simulation acceleration.
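As a rough illustration of the access pattern being optimized, the sketch below computes pair forces with one thread per atom reading a CSR-style neighbor list; the float4 loads illustrate coalesced global-memory reads and the unroll pragma stands in for the paper's loop unrolling. A Lennard-Jones force replaces the silicon many-body potential actually used, and all identifiers are assumptions, not the paper's code.

// One thread per atom: accumulate pair forces from a CSR-style neighbour list
// (nbrStart has nAtoms + 1 entries; nbrList holds neighbour indices).
__global__ void pairForces(const float4* pos, float4* force,
                           const int* nbrStart, const int* nbrList,
                           float epsilon, float sigma, int nAtoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAtoms) return;

    float4 pi = pos[i];                              // coalesced float4 load
    float fx = 0.f, fy = 0.f, fz = 0.f;

    #pragma unroll 4                                 // partial unrolling hint
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {
        float4 pj = pos[nbrList[k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        float s2 = sigma * sigma / r2;
        float s6 = s2 * s2 * s2;
        float fr = 24.f * epsilon * s6 * (2.f * s6 - 1.f) / r2;  // (-dU/dr)/r
        fx += fr * dx; fy += fr * dy; fz += fr * dz;
    }
    force[i] = make_float4(fx, fy, fz, 0.f);         // each thread owns its output
}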
8.
Progressive transmission of remote-sensing images greatly improves data response efficiency, but it also increases the computational load on the receiving side. To further improve transmission efficiency, a parallel acceleration method based on programmable graphics hardware (GPU) is studied: image reconstruction is accelerated by parallelizing the inverse wavelet transform on the GPU, data-read efficiency is improved with texture lookup tables, and the intermediate results of the multi-level wavelet transform are kept in off-screen rendering buffers (Pbuffers), further improving parallel efficiency. Finally, experiments verify the effectiveness of the method.
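The paper performs the inverse wavelet transform through the graphics pipeline (textures and Pbuffers); the CUDA kernel below is only a stand-in showing how one synthesis step of a 1-D Haar transform parallelizes, assuming the forward transform stored pairwise averages and differences. All names are illustrative.

// One inverse (synthesis) step of a 1-D Haar wavelet, one thread per output pair.
// Assumes forward transform: approx = (x0 + x1) / 2, detail = (x0 - x1) / 2.
__global__ void inverseHaarStep(const float* approx, const float* detail,
                                float* out, int halfLen)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= halfLen) return;
    float a = approx[i];
    float d = detail[i];
    out[2 * i]     = a + d;   // reconstruct the even sample
    out[2 * i + 1] = a - d;   // reconstruct the odd sample
}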
9.
As an important branch of neural networks, the convolutional neural network (CNN) is better suited than other neural network methods to learning and representing image features. As CNNs continue to develop, they face new challenges: parameter counts keep growing, making CNNs extremely demanding in computation, so many techniques have been proposed to compress CNN models. Compressed models, however, often contain sparse data structures, and this sparsity hurts CNN performance on GPUs. To address this problem, a direct sparse convolution algorithm is adopted to accelerate GPU processing of sparse data. Based on the characteristics of this algorithm, the convolution is converted into inner products between sparse vectors and dense vectors and implemented on the GPU platform. The proposed optimization exploits data sparsity and network structure to assign threads and schedule work, and exploits data locality to manage memory replacement, so that the GPU can still process convolutional layers efficiently in sparse convolutional neural networks (SCNNs). Compared with a cuBLAS implementation, speedups of 1.07x-1.23x, 1.17x-3.51x, and 1.32x-5.00x are achieved on AlexNet, GoogleNet, and ResNet, respectively; compared with a cuSPARSE implementation, the speedups are 1.31x-1.42x, 1.09x-2.00x, and 1.07x-3.22x.
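A sketch of the sparse-times-dense inner product that direct sparse convolution reduces to: the nonzero kernel weights are stored as values plus precomputed flattened input offsets, and each thread produces one output element. The layout and names (wVal, wOff, inputCol) are assumptions for illustration, not the authors' implementation.

// Direct sparse convolution as sparse (weights) x dense (activations) dot products.
// wVal[k] is the k-th nonzero weight, wOff[k] its flattened offset into the input;
// each thread computes one output element o by sliding over the dense input.
__global__ void sparseConv(const float* wVal, const int* wOff, int nnz,
                           const float* inputCol, float* output, int nOut)
{
    int o = blockIdx.x * blockDim.x + threadIdx.x;
    if (o >= nOut) return;
    float acc = 0.f;
    for (int k = 0; k < nnz; ++k)                 // visit only the nonzero weights
        acc += wVal[k] * inputCol[wOff[k] + o];   // dense activation, strided read
    output[o] = acc;
}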
10.
Traditional crowd-evacuation methods that run serially on the central processing unit (CPU) give good simulation results for scenes with small crowds, but they cannot meet real-time requirements when crowd density is high. To overcome this problem, a crowd-evacuation simulation method based on the graphics processing unit (GPU) is implemented. By optimizing the per-individual path-finding algorithm, the method not only lets individuals find paths quickly, accurately, and intelligently, but also combines the independence of per-individual path-finding with the GPU's high-performance computing characteristics, fully exploiting its powerful parallel computing capability. This greatly increases the crowd size that can be simulated and improves the real-time performance of crowd-evacuation simulation.
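A minimal sketch of the one-thread-per-agent mapping the paper exploits. The abstract does not detail the optimized path-finding, so a straight-line step toward each agent's assigned exit stands in for it; all identifiers are illustrative.

// One thread per agent: advance each agent one simulation step toward its goal.
__global__ void stepAgents(float2* pos, const float2* goal,
                           float speed, float dt, int nAgents)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nAgents) return;
    float2 p = pos[i], g = goal[i];
    float dx = g.x - p.x, dy = g.y - p.y;
    float len = sqrtf(dx * dx + dy * dy);
    if (len < 1e-6f) return;                       // agent already at the exit
    float s = fminf(speed * dt, len) / len;        // do not overshoot the goal
    pos[i] = make_float2(p.x + dx * s, p.y + dy * s);
}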
11.
A GPU-accelerated parallel algorithm for connected-component labeling of binary images. Total citations: 1 (self-citations: 0, by others: 1)
Exploiting the parallel architecture and hardware features of GPUs under NVIDIA's Compute Unified Device Architecture (CUDA), a new parallel algorithm for connected-component labeling of binary images is proposed. It identifies the position and size of connected components quickly and effectively, greatly reducing labeling time. The algorithm labels components by having each pixel search for the minimum label value in its neighborhood; the pixels can be processed in any order and do not depend on one another, so the work can be executed in parallel. The efficiency of the algorithm is unaffected by the shape or number of connected components, giving it good robustness. Experimental results show that the parallel algorithm makes full use of the GPU's parallel processing power: on high-resolution images with many connected components it is about 300 times faster than a typical CPU labeling algorithm and nearly 17 times faster than OpenCV's optimized (CPU) function.
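The labeling rule described above can be written as an iterative CUDA kernel: labels are initialized on the host to each pixel's linear index, and the kernel is relaunched until the changed flag stays zero. This is a generic minimum-label-propagation sketch under those assumptions, not the authors' exact kernel.

// One propagation pass: every foreground pixel takes the minimum label among
// itself and its 4-neighbours. Pixel order does not matter, so threads are free
// to run in parallel; repeat until no label changes.
__global__ void propagateLabels(const unsigned char* img, int* label,
                                int* changed, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    int idx = y * w + x;
    if (img[idx] == 0) return;                     // background pixel

    int best = label[idx];
    if (x > 0     && img[idx - 1] && label[idx - 1] < best) best = label[idx - 1];
    if (x < w - 1 && img[idx + 1] && label[idx + 1] < best) best = label[idx + 1];
    if (y > 0     && img[idx - w] && label[idx - w] < best) best = label[idx - w];
    if (y < h - 1 && img[idx + w] && label[idx + w] < best) best = label[idx + w];

    if (best < label[idx]) {                       // shrink label, request new pass
        label[idx] = best;
        atomicExch(changed, 1);
    }
}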
12.
Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications.
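As a sketch of the structured-grid, explicit-time-integration pattern the paper targets, the kernel below performs one first-order Lax-Friedrichs update of a single advected scalar with one thread per interior cell. The real SWsolver code solves the full shallow-water system with a high-resolution scheme, so this only illustrates the stencil structure; all names are assumptions.

// One explicit time step on a regular Cartesian grid (nx by ny cells, row-major).
// lambdaX = a*dt/dx and lambdaY = b*dt/dy for advection speeds (a, b).
__global__ void laxFriedrichsStep(const float* u, float* uNew,
                                  float lambdaX, float lambdaY, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return;  // skip ghost cells
    int c = j * nx + i;
    float uE = u[c + 1],  uW = u[c - 1];
    float uN = u[c + nx], uS = u[c - nx];
    // neighbour average minus the upwind-averaged flux differences
    uNew[c] = 0.25f * (uE + uW + uN + uS)
            - 0.5f * lambdaX * (uE - uW)
            - 0.5f * lambdaY * (uN - uS);
}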
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel computing clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of the shallow water equations using a high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D shallow water equation solver on regular Cartesian grids, for CPU, Cell processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
13.
The use of graphics processing unit (GPU) parallel processing is becoming part of mainstream statistical practice. The reliance of Bayesian statistics on Markov chain Monte Carlo (MCMC) methods makes the applicability of parallel processing not immediately obvious. It is illustrated that computing the likelihood with GPU parallel processing yields substantial gains in computational time for MCMC and other methods of evaluation. Examples use data from the Global Terrorism Database to model terrorist activity in Colombia from 2000 through 2010 and a likelihood based on the explicit convolution of two negative-binomial processes. Results show decreases in computational time by a factor of over 200. Factors influencing these improvements and guidelines for programming parallel implementations of the likelihood are discussed.
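A sketch of the GPU likelihood evaluation this approach relies on: each thread computes the log-likelihood contribution of one observation and the contributions are summed on the device. A single negative-binomial log-pmf stands in for the paper's convolution of two negative-binomial processes, and the parameter names are assumptions.

// One thread per observation: add its negative-binomial log-likelihood term to
// *logLik (which must be zeroed before the launch). r is the size parameter and
// p the success probability proposed by the MCMC sampler on the host.
__global__ void nbLogLik(const int* y, int n, float r, float p, float* logLik)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float k = (float)y[i];
    // log P(Y = k) = lgamma(k + r) - lgamma(r) - lgamma(k + 1) + r*log(p) + k*log(1 - p)
    float ll = lgammaf(k + r) - lgammaf(r) - lgammaf(k + 1.0f)
             + r * logf(p) + k * logf(1.0f - p);
    atomicAdd(logLik, ll);   // simple reduction; a tree reduction scales better
}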
14.
To suppress speckle in large synthetic aperture radar (SAR) images, a gradient inverse weighted (GIW) algorithm with an adaptive window is proposed, and the implementation is optimized on the graphics processing unit (GPU). This effectively addresses the poor real-time performance of large-format radar image processing caused by the high complexity of pixel-level computation. Results on high-resolution SAR images show that the algorithm suppresses speckle noise effectively while preserving edge detail. With GPU acceleration, processing of large images is more than two orders of magnitude faster than the central processing unit (CPU) implementation.
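A simplified sketch of gradient inverse weighted filtering with a fixed 3x3 window, following the classical GIW formulation in which each neighbor is weighted by the inverse of its absolute intensity difference from the center and the center keeps half of the total weight. The paper's adaptive window selection and GPU-specific tuning are omitted; all names are illustrative.

// GIW smoothing of a single-channel image, one thread per interior pixel.
__global__ void giwFilter(const float* in, float* out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < 1 || y < 1 || x >= w - 1 || y >= h - 1) return;  // skip the border
    float c = in[y * w + x];
    float wsum = 0.f, acc = 0.f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            float v  = in[(y + dy) * w + (x + dx)];
            float wt = 1.0f / (fabsf(v - c) + 1e-3f);   // inverse-gradient weight
            acc  += wt * v;
            wsum += wt;
        }
    out[y * w + x] = 0.5f * c + 0.5f * acc / wsum;      // centre keeps weight 1/2
}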
15.
16.
Cryo-electron microscopy (cryo-EM) is one of the most powerful technologies available today for structural biology. RELION (Regularized Likelihood Optimization) implements a Bayesian algorithm for cryo-EM structure determination and is one of the most widely used software packages in this field. Many researchers have devoted effort to improving the performance of RELION to keep up with the analysis of ever-increasing volumes of data. In this paper, we focus on performance analysis of the most time-consuming computation steps in RELION and identify their performance bottlenecks for specific optimizations. We propose several performance optimization strategies to improve the overall performance of RELION, including optimization of the expectation step, parallelization of the maximization step, accelerated computation of symmetries, and memory-affinity optimization. The experimental results show that our proposed optimizations achieve significant speedups of RELION across representative datasets. In addition, we perform roofline-model analysis to understand the effectiveness of our optimizations.
17.
Journal of Systems Architecture, 2015, 61(10): 576-583
Graphics Processing Units (GPUs) have a large and complex design space that needs to be explored in order to optimize the performance of future GPUs. Statistical techniques are useful tools that help computer architects predict the performance of complex processors. In this study, these methods are used to build a model that predicts GPU performance efficiently. The design space of the targeted Fermi GPU has more than 8 million points, which makes exploring it a challenging process. To build an accurate model, we propose a two-tier algorithm that builds a multiple linear regression model from a small set of simulated data: the Plackett-Burman design is used to find the key parameters of the GPU, and further simulations are guided by a fractional factorial design over the most important parameters. Our algorithm constructs a GPU performance predictor that can predict the performance of any point in the design space with an average prediction error between 1% and 5% for different benchmark applications. In addition, compared with other methods that need a large number of sampling points, our method achieves this accuracy by sampling only between 0.0003% and 0.0015% of the full design space.
18.
An online beam dynamics simulator is being developed for use in the operation of an ion linear particle accelerator. By employing Graphics Processing Unit (GPU) technology, the simulator's performance has been increased significantly over that of a single CPU, making it viable in the demanding accelerator operations environment. Once connected to the accelerator control system, it can rapidly respond to any control set-point changes and predict beam properties along an ion linear accelerator in pseudo-real time. This simulator will serve as a virtual beam diagnostic tool, which is especially useful when direct beam measurements are not available. Details of the code structure design, physics algorithms, GPU implementations, and performance are presented.
19.
To address the sensitivity of background-subtraction algorithms to illumination changes and the salt-and-pepper noise that appears in the extracted foreground, a background-subtraction method based on a coupled hidden Markov model (CHMM) is proposed. The Markov property of each pixel is analyzed and a coupled hidden Markov model is built per pixel; the transition probabilities of the hidden states are estimated by temporal statistics, suitable foreground and background standard deviations are chosen experimentally, and the Viterbi algorithm is used to solve for the optimal hidden-state sequence of the coupled HMM. Applying the algorithm to a traffic-surveillance video shows that it effectively suppresses the influence of illumination changes and, to some extent, suppresses foreground noise.
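A simplified sketch of the per-pixel decoding step: one thread runs a two-state (background/foreground) Viterbi recursion with Gaussian emissions over that pixel's intensity sequence. The coupling between neighboring pixels and the backtracking of the full state path are omitted, the means, standard deviations, and transition probabilities are assumed to have been estimated beforehand, and all names are illustrative.

// One thread per pixel: two-state Viterbi scores over T frames; only the
// final-frame decision is reported (no backtracking of the whole path).
__global__ void viterbiPerPixel(const float* frames,      // T frames of nPixels each
                                unsigned char* fgMask,
                                const float* muBg, const float* muFg,
                                float sigmaBg, float sigmaFg,
                                float pStay, int T, int nPixels)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPixels) return;

    float logStay = logf(pStay), logSwitch = logf(1.0f - pStay);
    float d0 = 0.f, d1 = 0.f;                      // running log-scores: bg, fg
    for (int t = 0; t < T; ++t) {
        float x  = frames[t * nPixels + p];
        float zB = (x - muBg[p]) / sigmaBg;
        float zF = (x - muFg[p]) / sigmaFg;
        float eBg = -0.5f * zB * zB - logf(sigmaBg);   // Gaussian log-density (up to a constant)
        float eFg = -0.5f * zF * zF - logf(sigmaFg);
        float n0 = fmaxf(d0 + logStay, d1 + logSwitch) + eBg;
        float n1 = fmaxf(d1 + logStay, d0 + logSwitch) + eFg;
        d0 = n0; d1 = n1;
    }
    fgMask[p] = (d1 > d0) ? 255 : 0;               // foreground if the fg path wins
}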