Found 19 similar documents (search time: 312 ms)
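On a heterogeneous parallel computing platform consisting of a cluster and GPUs, the MPI+CUDA hybrid programming model is used to compute the charge distribution in molecular dynamics simulations based on the ABEEMσπ model. The computational part of the charge-distribution solver is ported to the GPU, and the algorithm's high communication overhead and underused resources are addressed with the asynchronous concurrency of the heterogeneous platform, which improves solving efficiency. Performance tests show that, compared with a pure MPI parallel algorithm, the optimized GPU-accelerated heterogeneous parallel algorithm has a clear performance advantage when computing charge distributions for large chemical molecule models.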

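With the rapid development of graphics hardware, general-purpose computing on GPUs has become a new research field. This paper analyzes the GPU programming model, introduces methods for general-purpose computing on graphics hardware, and maps several commonly used algorithms onto the GPU. By comparing these algorithms with their CPU counterparts, it analyzes the advantages and disadvantages of using GPUs for general-purpose computing.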

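张宇, 张延松, 陈红, 王珊. 《软件学报》 (Journal of Software), 2016, 27(5): 1246-1265
Thanks to its strong parallel computing capability, the general-purpose GPU has become an emerging high-performance computing platform and, in recent years, a research focus for high-performance database implementation techniques. Current GPU database research, however, still follows the ROLAP (relational OLAP) multidimensional analysis model. It concentrates on algorithms and performance optimizations for relational operators on the GPU, with parallel GPU hash-join algorithms at its center. A GPU has thousands of parallel computing units but few logic control units. Compared with a CPU, it offers stronger parallel computing capability but weaker logic control and complex memory management, so in-memory database query processing algorithms that depend on complex data structures and complex memory management mechanisms are not suitable for direct porting to the GPU. This paper proposes semi-MOLAP, a hybrid OLAP multidimensional analysis model designed for the vector computing characteristics of the GPU. It combines the direct array access and computation of the MOLAP (multidimensional OLAP) model with the storage efficiency of the ROLAP model. The resulting GPU semi-MOLAP model is built entirely on array structures, which simplifies GPU data management, reduces the complexity of the GPU semi-MOLAP algorithms, and raises their code execution rate. In addition, based on the respective computing characteristics of the GPU and the CPU, the semi-MOLAP operators are split into cooperative computations across the two platforms, which improves CPU and GPU utilization as well as overall OLAP query performance.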

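冯高锋. 《计算机应用》 (Journal of Computer Applications), 2007, 27(Z2): 281-282
With the rapid development of GPUs, using them for high-performance computing beyond graphics has become a research focus. This paper proposes treating the GPU as a coprocessor, inserting it into general-purpose computing nodes to build a GPU-CPU cluster system, partitioning the computation matrix with a corresponding blocking algorithm, and then using the function offload programming model to map a dynamic programming algorithm onto the GPU for accelerated computation. Experiments show that optimizing the dynamic programming algorithm on this system yields a substantial performance improvement and speedup.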

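As the general-purpose computing capability of GPUs continues to develop, new and more efficient processing techniques are being applied to image processing. Some image processing algorithms have already been ported to the GPU with good speedups, but they do not fully use the computing power of every processing unit in a heterogeneous CPU/GPU system. Building on a study of the GPU programming model and parallel algorithm design, this paper proposes a cooperative parallel image processing model for heterogeneous CPU/GPU environments. The model takes full account of the computing power of each processing unit in the heterogeneous system, and an image median filtering algorithm is used to verify its effectiveness on high-resolution grayscale images. Experimental results show that the model is fairly general in heterogeneous CPU/GPU environments and extends easily to other image processing algorithms.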
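
The row-wise split behind a cooperative CPU/GPU median filter can be sketched in plain Python. This is only an illustrative CPU-side sketch, not the model from the paper: the names `median_filter_rows`, `cooperative_median_filter`, and `gpu_share` are hypothetical, the 3x3 window and border replication are assumptions, and both partitions run on the CPU here, with one of them standing in for the part a real system would hand to the GPU.

```python
from statistics import median


def median_filter_rows(img, row_start, row_end):
    """Apply a 3x3 median filter to rows [row_start, row_end) of a grayscale image.

    Border pixels are handled by clamping coordinates (border replication),
    which is an assumption made for this sketch.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(row_start, row_end):
        row = []
        for x in range(w):
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(median(window))
        out.append(row)
    return out


def cooperative_median_filter(img, gpu_share):
    """Split the rows between two workers according to gpu_share (0.0 to 1.0).

    The first partition is a stand-in for the work a GPU would receive;
    in this sketch both partitions are computed by the same CPU function.
    """
    h = len(img)
    split = int(h * gpu_share)
    gpu_part = median_filter_rows(img, 0, split)   # hypothetical device share
    cpu_part = median_filter_rows(img, split, h)   # host share
    return gpu_part + cpu_part
```

Because every output pixel reads only the input image, the two partitions are independent, and the result does not depend on where the split falls. That independence is what allows the share to be tuned to the relative speed of each processing unit.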

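CUDA Parallel Programming for High-Performance Computing   Cited by: 1 (self-citations: 0, citations by others: 1)
Given the computing power of GPUs, this paper proposes using them to solve high-performance computing problems and describes CUDA programming methods and optimization principles in detail. Comparative experiments show that CUDA is highly capable at parallel computing and offers new methods and ideas for general-purpose computing on GPUs.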

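GPU-accelerated algorithms for extracting the Normalized Difference Vegetation Index (NDVI) usually adopt the GPU multi-thread parallel model. They spend considerable time between weakly dependent computations and on data transfers between the CPU and the GPU, which limits further speedup. To address these problems, and based on the characteristics of NDVI extraction, this paper proposes an NDVI extraction algorithm built on the GPU multi-stream concurrent parallel model. Using CUDA streams and the Hyper-Q feature, the multi-stream model overlaps data transfers with weakly dependent computations, and weakly dependent computations with one another, which further raises the algorithm's parallelism and GPU resource utilization. The paper first optimizes the NDVI extraction algorithm with the GPU multi-thread parallel model and decomposes the optimized computation to identify the parts that involve data transfer and weakly dependent computation. It then restructures those parts and optimizes them with the GPU multi-stream concurrent parallel model, so that weakly dependent computations overlap with one another and with data transfers. Finally, the two GPU-based NDVI extraction algorithms are evaluated on remote sensing images captured by the Gaofen-1 satellite. The results show that, for images larger than 12000 × 12000 pixels, the proposed algorithm achieves an average speedup of about 1.5 times over the conventional multi-thread GPU algorithm and about 260 times over the serial extraction algorithm, giving better acceleration and parallelism.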
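
For reference, the per-pixel NDVI computation can be sketched as below, assuming the standard definition NDVI = (NIR - Red) / (NIR + Red). The abstract does not spell out the formula, so this definition, the zero-denominator convention, and the names `ndvi` and `ndvi_chunked` are assumptions of this sketch. The chunked variant only mimics on the CPU how an image can be cut into independent row blocks, which is the property a multi-stream GPU version would rely on to overlap transfers and computation.

```python
def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red) for two equally sized 2D bands.

    Pixels whose denominator is zero are set to 0.0 (a convention assumed here).
    """
    return [
        [(n - r) / (n + r) if (n + r) != 0 else 0.0 for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir, red)
    ]


def ndvi_chunked(nir, red, n_chunks):
    """Compute NDVI block by block over rows; each block is independent.

    In a multi-stream GPU version each block could be assigned to its own
    stream; here the blocks are simply processed one after another.
    """
    rows = len(nir)
    step = max(1, -(-rows // n_chunks))  # ceiling division, at least one row
    out = []
    for start in range(0, rows, step):
        out.extend(ndvi(nir[start:start + step], red[start:start + step]))
    return out
```

Since no pixel depends on any other, `ndvi_chunked` returns the same values as `ndvi` for any number of chunks.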

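A Parallel Bilinear Interpolation Algorithm Using GPU Computing   Cited by: 1 (self-citations: 0, citations by others: 1)
Bilinear interpolation is widely used in digital image processing, but it is slow to compute. To speed it up, this paper proposes a parallel bilinear interpolation algorithm accelerated by the graphics processor. The blocks in the bilinear interpolation of the Wallis transform are independent of one another, which suits the GPU's parallel processing architecture. Exploiting this, the conventional serial bilinear interpolation algorithm is mapped onto the CUDA parallel programming model and optimized in terms of thread allocation, memory usage, and hardware resource partitioning to make full use of the GPU's computing power. Experimental results show that, as image resolution increases, the parallel bilinear interpolation algorithm speeds up computation by a factor of 28.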
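
A plain serial bilinear interpolation can be sketched as follows. This is generic image resampling rather than the Wallis-transform variant described in the abstract, and the function names `bilinear_sample` and `resize_bilinear` as well as the corner-aligned coordinate mapping are assumptions of the sketch.

```python
def bilinear_sample(img, x, y):
    """Sample a 2D image at fractional coordinates (x, y) by bilinear interpolation.

    Coordinates are clamped to the image bounds.
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def resize_bilinear(img, new_h, new_w):
    """Resize an image so that its corner pixels map onto the output corners."""
    h, w = len(img), len(img[0])
    sy = (h - 1) / (new_h - 1) if new_h > 1 else 0.0
    sx = (w - 1) / (new_w - 1) if new_w > 1 else 0.0
    return [
        [bilinear_sample(img, j * sx, i * sy) for j in range(new_w)]
        for i in range(new_h)
    ]
```

Each output pixel is computed from four input pixels only, so the double loop in `resize_bilinear` has no dependencies between iterations. That is the kind of loop that maps naturally onto one GPU thread per output pixel.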

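Terrain Visualization Based on GPU Programming   Cited by: 5 (self-citations: 1, citations by others: 4)
Because terrain models are inherently complex, computer hardware has long struggled to meet the demands of real-time display for large-scale terrain models. To render terrain models quickly on existing hardware, this paper improves the conventional ROAM algorithm and proposes a terrain visualization algorithm based on GPU programming that achieves fast, view-dependent visualization of large-scale terrain. The algorithm first generates a view-dependent, optimized continuous LOD model using the improved ROAM (real-time optimally adaptive meshes) algorithm. It then uses GPU programming to compute vertex transformations, normal vectors, texture coordinates, texture sampling, and per-fragment lighting. Finally, it shades the terrain. Experimental results show that GPU programming not only speeds up the algorithm effectively but also enables real-time roaming over fairly large terrain.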

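Advances in flat-panel detector technology have made cone beam computed tomography (CBCT) an important imaging technique with a wide range of applications. C-arm CBCT has the technical advantages of CBCT and is also particularly well suited to image-guided interventional surgery. However, obtaining high-resolution, high-quality three-dimensional tomographic images while meeting the real-time requirements of surgery remains a very challenging problem. This paper proposes a fast three-dimensional image reconstruction method for C-arm CBCT based on GPU acceleration. At the algorithm level, GPU parallel acceleration is applied to optimize the reconstruction algorithm. At the system level, a distributed system and a latency-hiding mechanism greatly improve the efficiency of reconstructing three-dimensional volume data from two-dimensional projection images. With reconstruction accuracy preserved, the optimized GPU-accelerated FDK algorithm greatly improves the computational efficiency of the reconstruction process, and the latency-hiding mechanism further improves the running efficiency of the system. With 90 projection frames, system efficiency improves by 26% and reconstruction latency improves by a factor of 2.1. With 120 projection frames, system efficiency improves by 39% and reconstruction latency improves by a factor of 3.3.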

Hyperspectral imaging, which records a detailed spectrum of light arriving in each pixel, has many potential uses in remote sensing as well as other application areas. Practical applications will typically require real-time processing of large data volumes recorded by a hyperspectral imager. This paper investigates the use of graphics processing units (GPUs) for such real-time processing. In particular, the paper studies a hyperspectral anomaly detection algorithm based on normal mixture modelling of the background spectral distribution, a computationally demanding task relevant to military target detection and numerous other applications. The algorithm parts are analysed with respect to complexity and potential for parallelization. The computationally dominating parts are implemented on an Nvidia GeForce 8800 GPU using the Compute Unified Device Architecture programming interface. GPU computing performance is compared to a multi-core central processing unit implementation. Overall, the GPU implementation runs significantly faster, particularly for highly data-parallelizable and arithmetically intensive algorithm parts. For the parts related to covariance computation, the speed gain is less pronounced, probably due to a smaller ratio of arithmetic to memory access. Detection results on an actual data set demonstrate that the total speedup provided by the GPU is sufficient to enable real-time anomaly detection with normal mixture models even for an airborne hyperspectral imager with high spatial and spectral resolution.

The parallel preconditioned conjugate gradient method (CGM) is used in many applications of scientific computing and often has a critical impact on their performance and energy consumption. This article investigates the energy-aware execution of the CGM on multi-core CPUs and GPUs used in an adaptive FEM. Based on experiments, an application-specific execution time and energy model is developed. The model considers the execution speed of the CPU and the GPU, their electrical power, voltage and frequency scaling, the energy consumption of the memory as well as the time and energy needed for transferring the data between main memory and GPU memory. The model makes it possible to predict how to distribute the data to the processing units for achieving the most energy-efficient execution: the execution might deploy the CPU only, the GPU only or both simultaneously using a dynamic and adaptive collaboration scheme. The dynamic collaboration enables an execution minimising the execution time. By measuring execution times for every FEM iteration, the data distribution is adapted automatically to changing properties, e.g. the data sizes.

Graphics Processing Units (GPUs) have emerged impressively as general-purpose coprocessors in high performance computing applications since the launch of the Compute Unified Device Architecture (CUDA). However, they present an inherent performance bottleneck in the fact that communication between two separate address spaces (the main memory of the CPU and the memory of the GPU) is unavoidable. The CUDA Application Programming Interface (API) provides asynchronous transfers and streams, which permit a staged execution, as a way to overlap communication and computation. Nevertheless, there is neither a precise way to estimate the possible improvement due to overlapping nor a rule to determine the optimal number of stages or streams into which computation should be divided. In this work, we present a methodology that is applied to model the performance of asynchronous data transfers of CUDA streams on different GPU architectures. We illustrate this methodology by deriving performance expressions for two different consumer graphics architectures belonging to the more recent generations. These models permit programmers to estimate the optimal number of streams into which the computation on the GPU should be broken up in order to obtain the highest performance improvements. Finally, we have checked the suitability of our performance models with three applications based on codes from the CUDA Software Development Kit (SDK) with successful results.
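
To make the trade-off concrete, here is a toy timing model of staged execution. It is not the performance model derived in the paper. It is a textbook two-stage pipeline (one host-to-device transfer stage followed by one kernel stage) with a fixed cost per stream, and the names `pipelined_time`, `best_stream_count`, and `per_stream_overhead` are hypothetical. Device-to-host transfers and architecture-specific effects are ignored.

```python
def pipelined_time(transfer_time, kernel_time, n_streams, per_stream_overhead=0.0):
    """Estimated total time when the work is cut into n_streams equal chunks.

    Chunk i can be computed only after its own transfer and after the
    previous chunk's kernel, so the slower stage sets the pipeline rhythm.
    """
    t = transfer_time / n_streams   # transfer time of one chunk
    k = kernel_time / n_streams     # kernel time of one chunk
    pipeline = t + (n_streams - 1) * max(t, k) + k
    return pipeline + per_stream_overhead * n_streams


def best_stream_count(transfer_time, kernel_time, per_stream_overhead, max_streams):
    """Return the stream count in 1..max_streams that minimises the estimate."""
    return min(
        range(1, max_streams + 1),
        key=lambda n: pipelined_time(transfer_time, kernel_time, n, per_stream_overhead),
    )
```

With equal transfer and kernel times of 8 time units and an assumed overhead of 0.5 units per stream, the estimate falls from 16.5 units with one stream to 12 units with four, then rises again as the per-stream overhead outweighs the extra overlap. A minimum of this kind is what a performance model needs to locate when choosing the number of streams.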

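In seismic data processing, techniques such as reverse time migration demand enormous computing resources and therefore cannot be widely adopted in production. Introducing GPUs and the CUDA programming architecture greatly improves their computing performance and is an effective way to make such techniques practical. At the same time, the distinctive physical characteristics of the GPU mean that some applications gain no performance and may even slow down sharply. This paper uses a reverse time migration application to illustrate the acceleration achieved by the GPU, compares and analyzes it against the conventional workflow, and presents a method for evaluating how well application software is suited to the GPU.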

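As hardware features have grown richer and software development environments have matured, GPUs have come to be used in general-purpose computing, helping the CPU accelerate program execution. In pursuit of high performance, a GPU often contains hundreds or thousands of core computing units. This high density of computing resources gives it far higher performance than a CPU but also higher power consumption, and power has become one of the major constraints on GPU development. Based on an in-depth study of the Fermi GPU architecture, this paper proposes a high-accuracy architecture-level power model. The model first calculates the power consumed by the different native instructions and by each memory access. It then analyzes and predicts an application's power consumption from the instructions it executes on the hardware and from results obtained with a sampling tool. Finally, measured results and model predictions are compared across 13 benchmark applications. The prediction accuracy of the model reaches about 90%.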

The GPU (Graphics Processing Unit) has recently become one of the most power efficient processors in embedded and many other environments, and has been integrated into more and more SoCs (Systems on Chip). Thus modern GPUs play a very important role in power aware computing. Strongly Connected Component (SCC) decomposition is a fundamental graph algorithm which has wide applications in model checking, electronic design automation, social network analysis and other fields. GPUs have been shown to have great potential in accelerating many types of computations including graph algorithms. Recent work has demonstrated the plausibility of GPU SCC decomposition, but the implementation is inefficient due to insufficient consideration of the distinguishing GPU programming model, which leads to poor performance on irregular and sparse graphs. This paper presents a new GPU SCC decomposition algorithm that focuses on full utilization of the contemporary embedded and desktop GPU architecture. In particular, a subgraph numbering scheme is proposed to facilitate the safe and efficient management of the subgraph IDs and to serve as the basis of efficient source selection. Furthermore, we adopt a multi-source partition procedure that greatly reduces the recursion depth and use a vertex labeling approach that can highly optimize the GPU memory access. The evaluation results show that the proposed approach achieves up to 41× speedup over Tarjan’s algorithm, one of the most efficient sequential SCC decomposition algorithms, and up to 3.8× speedup over the previous GPU algorithms.

Graphics Processing Units (GPUs) have a large and complex design space that needs to be explored in order to optimize the performance of future GPUs. Statistical techniques are useful tools to help computer architects predict the performance of complex processors. In this study, these methods are utilized to build a model which predicts GPU performance efficiently. The design space of the targeted Fermi GPU has more than 8 million points, which makes exploring it a challenging process. To build an accurate model, we propose a two-tier algorithm that builds a multiple linear regression model from a small set of simulated data. In this algorithm the Plackett–Burman design is used to find the key parameters of the GPU, and further simulations are guided by a fractional factorial design for the most important parameters. Our algorithm is able to construct a GPU performance predictor which can predict the performance of any point in the design space with an average prediction error between 1% and 5% for different benchmark applications. In addition, in comparison to other methods which need a large number of sampling points, the accuracy of our method is achieved by sampling only between 0.0003% and 0.0015% of the full design space.

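Migrating the computationally dense parts of an algorithm to the GPU is an effective way to accelerate classic data mining algorithms. This paper first introduces the characteristics of the GPU and the main GPU programming models. It then surveys GPU-accelerated work for each major type of data mining task, including classification, clustering, association analysis, time series analysis, and deep learning. Finally, two classic families of collaborative filtering recommendation algorithms are implemented on both the CPU and the GPU, and experiments on the classic MovieLens dataset verify the significant effect of the GPU in accelerating data mining applications and give further insight into how GPU acceleration works and why it matters in practice.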

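李海燕, 张春元, 李礼, 任巨. 《计算机工程》 (Computer Engineering), 2008, 34(22): 258-260
The very high stream computing capability of graphics processors makes them an effective option for implementing real-time streaming applications. This paper abstracts a stream execution model for the graphics processor, describes the execution process of its stream processing mechanism, and implements the two-dimensional discrete cosine transform on the graphics processor. Experimental results show that the graphics processor can compress and encode standard-definition video at up to 70 fps.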
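
As a reference for the transform itself, a serial two-dimensional DCT can be sketched as below. The abstract does not say which DCT variant or block size the paper uses, so the orthonormal DCT-II, the row-then-column (separable) evaluation, and the names `dct_1d` and `dct_2d` are assumptions of this sketch.

```python
import math


def dct_1d(v):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(v)
    out = []
    for k in range(n):
        s = sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """2D DCT of a rectangular block: 1D DCT on every row, then on every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]   # transform each column
    return [list(r) for r in zip(*cols)]           # transpose back to row-major
```

Every row transform is independent of the others, as is every column transform, so each pass can be spread over many parallel workers. That structure is what makes the transform a good fit for a stream execution model.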
