1.
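Exploiting the powerful parallel processing capability of the GPGPU (General Purpose GPU), an existing sparse magnetic resonance imaging (Sparse MRI) reconstruction algorithm was parallelized on the NVIDIA CUDA framework so that it can meet the requirements of practical applications. Sparse MRI reconstruction involves a large number of floating-point operations and is very time-consuming, which makes it hard to use in practice, so it must be accelerated and optimized. Experimental results show that an NVIDIA GTX275 GPU reduces the computation time from over 4 minutes to about 3.4 seconds, a 76× speedup over an Intel Q8200 CPU.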
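As a quick consistency check on the reported figures: speedup is the ratio of CPU time to GPU time, so a 76× speedup at a GPU time of 3.4 s implies a CPU time of roughly 4.3 minutes, which agrees with "over 4 minutes". All inputs below come from the abstract; only the rounding is ours.

```latex
S = \frac{T_{\text{CPU}}}{T_{\text{GPU}}}
\quad\Rightarrow\quad
T_{\text{CPU}} = S \cdot T_{\text{GPU}} \approx 76 \times 3.4\ \text{s} = 258.4\ \text{s} \approx 4.3\ \text{min}
```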
2.
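General-purpose graphics processing units (GPGPUs) are increasingly used for general-purpose computing aimed at high performance and high throughput; their SIMD (single instruction multiple data) execution model gives them powerful parallel computing capability. Mainstream GPGPUs execute computational tasks efficiently by running large numbers of highly parallel threads. When handling control flow with conditional branches, however, a GPGPU processes the different branch paths serially, one after another, which degrades its parallel computing capability. Building on an analysis of earlier thread block compaction scheduling methods that target this inefficient branch handling, this paper proposes TSTBC (two-stage synchronization based thread block compaction scheduling). Using a logic unit that judges whether a thread block is suitable for compaction, TSTBC analyzes the effectiveness of compaction in two stages, further reducing the number of ineffective compactions. Simulation results show that the method improves the effectiveness of thread block compaction, disrupts data locality within warps less than comparable methods, and effectively lowers the miss rate of the on-chip L1 data cache; compared with the baseline architecture, system performance improves by 19.27%.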
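To make the branch-divergence cost concrete, here is a minimal model of the baseline thread block compaction idea that TSTBC builds on; it is not TSTBC's two-stage suitability check, which the abstract does not detail. Assumptions: a single two-way branch, all warps of the block reach it together, cost is counted as warp issues, and compaction preserves each thread's SIMD lane (as in the published thread block compaction scheme).

```python
from typing import List


def issues_without_compaction(taken: List[bool], warp_size: int) -> int:
    """Count warp issues when each divergent warp runs both paths one after the other."""
    issues = 0
    for start in range(0, len(taken), warp_size):
        warp = taken[start:start + warp_size]
        issues += int(any(warp))      # taken path, if at least one thread takes it
        issues += int(not all(warp))  # fall-through path, if at least one thread takes it
    return issues


def issues_with_compaction(taken: List[bool], warp_size: int) -> int:
    """Count warp issues after lane-preserving compaction over the whole thread block.

    A thread keeps its SIMD lane (thread id % warp_size), so the number of dense
    warps a path needs equals the largest number of threads on that path in any lane.
    """
    issues = 0
    for path in (True, False):
        per_lane = [0] * warp_size
        for tid, t in enumerate(taken):
            if t == path:
                per_lane[tid % warp_size] += 1
        issues += max(per_lane)
    return issues
```

For `taken = [True, False, True, False, False, True, False, True]` with `warp_size = 4`, the two divergent warps need 4 issues without compaction but 2 with it, because the taken threads occupy different lanes. With `[True, False] * 4`, compaction gains nothing, since every taken thread sits in lane 0 or 2; deciding when compaction is actually worthwhile is the kind of question a suitability check has to answer.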
5.
GPGPU-Based Parallel Preprocessing of Digital Images  Total citations: 2, self-citations: 0, citations by others: 2
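This paper first briefly introduces the background, characteristics, and memory model of CUDA (Compute Unified Device Architecture). Using general-purpose GPUs (GPGPU, General Purpose GPU) and CUDA, it implements parallel versions of image histogram equalization and thin-cloud removal. Compared with traditional CPU-based methods, the two GPGPU-based image preprocessing operations run roughly 40× and 80× faster, respectively, which gives them considerable practical value for large-scale, real-time image processing.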
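A serial reference for histogram equalization helps show which steps a GPU version has to parallelize: building the histogram (commonly done with atomic increments on a GPU), computing the cumulative distribution (a prefix sum), and remapping every pixel independently. This sketch uses the common CDF-normalization variant; the paper's exact mapping is not given in the abstract, and the function name `equalize` is our own.

```python
from typing import List


def equalize(pixels: List[int], levels: int = 256) -> List[int]:
    """Histogram-equalize a flat list of gray levels in [0, levels)."""
    # Step 1: histogram (on a GPU, typically one atomic increment per pixel).
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Step 2: cumulative distribution (on a GPU, a parallel prefix sum).
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        return list(pixels)  # constant image: nothing to stretch
    # Step 3: lookup table, then an independent per-pixel remap.
    lut = [round((c - cdf_min) * (levels - 1) / (n - cdf_min)) for c in cdf]
    return [lut[p] for p in pixels]
```

For example, `equalize([0, 0, 0, 1], 4)` stretches the single bright pixel to the top level and returns `[0, 0, 0, 3]`.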
6.
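Using the CUDA platform, this paper proposes a basic method for solving systems of linear equations by Gauss-Jordan elimination with complete (full) pivoting, accelerated by performing the pivot selection, normalization, and elimination operations in parallel on a general-purpose graphics processor (GPGPU). The solution vector is recovered on the CPU. Through a series of optimizations tailored to the characteristics of NVIDIA's latest Fermi-architecture GPUs, the GPGPU achieves a speedup of more than 6.5× over Intel's latest-architecture CPU and more than 10× over the previous generation of Intel CPUs.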
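The steps the abstract names (complete pivoting, normalization, elimination, then recovering the solution vector) can be sketched serially as follows. This is a plain reference implementation, not the paper's CUDA code; the function name is hypothetical. The final loop is the "recovery" step: complete pivoting swaps columns, so the solution must be un-permuted before it is returned.

```python
from typing import List


def gauss_jordan_full_pivot(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with complete pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [A | b]
    col_perm = list(range(n))  # col_perm[j] = original variable now in column j
    for k in range(n):
        # Complete pivoting: largest |entry| in the remaining submatrix.
        p, q = max(((i, j) for i in range(k, n) for j in range(k, n)),
                   key=lambda ij: abs(m[ij[0]][ij[1]]))
        if abs(m[p][q]) < 1e-12:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]  # row swap
        for row in m:            # column swap (b column untouched)
            row[k], row[q] = row[q], row[k]
        col_perm[k], col_perm[q] = col_perm[q], col_perm[k]
        pivot = m[k][k]
        m[k] = [v / pivot for v in m[k]]  # normalization
        for i in range(n):                # elimination above and below the pivot
            if i != k and m[i][k] != 0.0:
                f = m[i][k]
                m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    # Recovery: undo the column permutation to place each value at its variable.
    x = [0.0] * n
    for j in range(n):
        x[col_perm[j]] = m[j][n]
    return x
```

On a GPU, the pivot search is a parallel max-reduction and each elimination row update is independent, which is where the parallelism in such a method comes from.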
8.
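General-purpose computing on GPUs has flourished in recent years, and the numbers of developers and of general-purpose GPU applications are growing rapidly. To meet the requirements of different applications and the different habits of developers, NVIDIA and its partners have jointly developed many programming technologies around CUDA-architecture GPUs. This article describes their characteristics and intended users in detail, with the aim of helping developers choose the programming technology best suited to their own programming habits and application requirements.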
9.
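As the graphics processing unit (GPU) has evolved from a device used only for graphics and image rendering into a general-purpose parallel computing platform (GPGPU), its computing power has grown steadily. Building on a study of GPGPU architecture, this paper investigates thread scheduling for GPGPU parallel computing in depth, explains the principles of GPU thread scheduling, and exposes the shortcomings of the SIMT scheduling model. It also derives a formula relating system power consumption to system operating frequency.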
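The abstract does not reproduce its power-frequency formula. For orientation only, the standard CMOS dynamic-power model is shown below, where α is the activity factor, C the switched capacitance, V_dd the supply voltage, and f the clock frequency. The cubic relation holds only under the assumption that supply voltage is scaled roughly in proportion to frequency (as in DVFS); the paper's own derivation may differ.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f,
\qquad
V_{dd} \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3}
```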
10.
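To extract the surfaces of three-dimensional solids in real time, a real-time surface extraction method based on GPGPU parallel computing is proposed. After analyzing the principle of the depth peeling algorithm and the GPU graphics rendering pipeline, the paper gives an algorithm that uses depth peeling on the GPU to extract 3D solid surfaces in real time. The algorithm is implemented by controlling the GPU rendering pipeline with GLSL, the OpenGL high-level shading language, and its pseudocode is given. Application examples using models of a dragon, an impeller, and a tool swept volume show that the algorithm works well; in particular, surface extraction for the tool swept volume meets real-time requirements.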
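The core of depth peeling is a second depth test: each rendering pass keeps, at every pixel, the nearest fragment that lies strictly behind the depth recorded by the previous pass. The sketch below emulates that per-pixel logic on the CPU; fragments are given as hypothetical per-pixel depth lists rather than rasterized geometry, and it is not the paper's GLSL pseudocode. On a GPU, the "behind the previous layer" comparison would be done in a fragment shader against the previous pass's depth texture.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def depth_peel(fragments: Dict[Pixel, List[float]], max_layers: int) -> List[Dict[Pixel, float]]:
    """Return successive depth layers, nearest first (smaller depth = nearer)."""
    layers: List[Dict[Pixel, float]] = []
    prev = {px: float("-inf") for px in fragments}  # depth peeled by the previous pass
    for _ in range(max_layers):
        layer: Dict[Pixel, float] = {}
        for px, depths in fragments.items():
            # Second depth test: discard fragments at or in front of the last layer.
            behind = [d for d in depths if d > prev[px]]
            if behind:
                layer[px] = min(behind)  # ordinary depth test: nearest survivor wins
        if not layer:
            break  # nothing left to peel (a GPU would detect this with an occlusion query)
        layers.append(layer)
        prev.update(layer)
    return layers
```

Because the comparison is strict, coplanar fragments with identical depth collapse into one layer; real implementations usually add a small depth bias for that case.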
11.
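An Improved MDx Differential Attack Algorithm and Its Efficient Implementation on GPGPU  Total citations: 1, self-citations: 0, citations by others: 1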
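Hash functions are widely used in commerce, security, and other fields, and the MDx family of hash algorithms is the most widely used among them. Attacks on MDx hash algorithms are therefore of great theoretical and practical significance. Since Professor Wang Xiaoyun proposed the differential attack and broke MD5, MD4, and other MDx algorithms, research on this attack has attracted increasing attention. Taking the differential attack on MD5 as an example, this paper improves the MD5 tunneling differential attack proposed by Klima, analyzes the feasibility and technical requirements of implementing it on a GPGPU, and implements it in CUDA under Visual Studio 6.0. Running on a GeForce 9800 GX2, the CUDA program finds an MD5 collision pair every 1.35 s on average. Compared with an implementation on a PC with a quad-core Core 2 Quad Q9000 (2.0 GHz), the GeForce 9800 GX2 implementation achieves 11.5 times the price-performance.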
12.
Carlos Reaño, Federico Silla, Adrián Castelló, Antonio J. Peña, Rafael Mayo, Enrique S. Quintana-Ortí, José Duato. Concurrency and Computation, 2015, 27(14): 3746-3770
Graphics processing units (GPUs) are being increasingly embraced by the high-performance computing community as an effective way to reduce execution time by accelerating parts of their applications. Remote CUDA (rCUDA) was recently introduced as a software solution to address the high acquisition costs and energy consumption of GPUs that constrain further adoption of this technology. Specifically, rCUDA is a middleware that allows a reduced number of GPUs to be transparently shared among the nodes in a cluster. Although the initial prototype versions of rCUDA demonstrated its functionality, they also revealed concerns with respect to usability, performance, and support for new CUDA features. In response, in this paper, we present a new rCUDA version that (1) improves usability by including a new component that allows an automatic transformation of any CUDA source code so that it conforms to the needs of the rCUDA framework, (2) consistently features low overhead when using remote GPUs thanks to an improved new communication architecture, and (3) supports multithreaded applications and CUDA libraries. As a result, for any CUDA-compatible program, rCUDA now allows the use of remote GPUs within a cluster with low overhead, so that a single application running in one node can use all GPUs available across the cluster, thereby extending the single-node capability of CUDA. Copyright © 2014 John Wiley & Sons, Ltd.
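The general technique behind GPU-sharing middleware of this kind is API remoting: a client-side stub intercepts each GPU API call, serializes it, and forwards it to a server process on the node that owns the GPU, which executes it and sends back the result. The sketch below illustrates only that pattern. Every name here (`GpuServer`, `RemoteGpu`, `alloc`, `upload`, `vector_add`, `download`) is hypothetical and is not the rCUDA or CUDA API; the "network" is a direct function call, and the "GPU" is plain Python lists.

```python
import json
from typing import Any, Callable, Dict, List


class GpuServer:
    """Stand-in for the daemon on the node that owns the GPU (hypothetical)."""

    def __init__(self) -> None:
        self.buffers: Dict[int, List[float]] = {}
        self.next_handle = 1
        self.ops: Dict[str, Callable[..., Any]] = {
            "alloc": self.alloc,
            "upload": self.upload,
            "vector_add": self.vector_add,
            "download": self.download,
        }

    def alloc(self, n: int) -> int:
        handle = self.next_handle
        self.next_handle += 1
        self.buffers[handle] = [0.0] * n
        return handle  # clients only ever see opaque handles, never server memory

    def upload(self, handle: int, data: List[float]) -> None:
        self.buffers[handle][:] = data

    def vector_add(self, a: int, b: int, out: int) -> None:
        # A real server would launch a kernel on its local GPU here.
        self.buffers[out] = [x + y for x, y in zip(self.buffers[a], self.buffers[b])]

    def download(self, handle: int) -> List[float]:
        return self.buffers[handle]

    def handle_request(self, request: str) -> str:
        msg = json.loads(request)
        result = self.ops[msg["op"]](*msg["args"])
        return json.dumps({"result": result})


class RemoteGpu:
    """Client-side stub: every method call is serialized and forwarded."""

    def __init__(self, transport: Callable[[str], str]) -> None:
        self.transport = transport  # in a cluster this would be a network round trip

    def __getattr__(self, op: str) -> Callable[..., Any]:
        def call(*args: Any) -> Any:
            reply = self.transport(json.dumps({"op": op, "args": list(args)}))
            return json.loads(reply)["result"]
        return call
```

Usage: `gpu = RemoteGpu(GpuServer().handle_request)`, then allocate, upload, add, and download through `gpu` as if the device were local. Since each call is a round trip, the communication layer dominates overhead, which is consistent with the paper's emphasis on its improved communication architecture.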
13.
The Grover quantum search algorithm, one of only a few representative quantum algorithms, can speed up many classical algorithms that use search heuristics. No true quantum computer has yet been developed. For the present, simulation is one effective means of verifying the search algorithm. In this work, we focus on the simulation workflow using a compute unified device architecture (CUDA). Two simulation workflow schemes are proposed. These schemes combine the characteristics of the Grover algorithm and the parallelism of general-purpose computing on graphics processing units (GPGPU). We also analyzed the optimization of memory space and memory access from this perspective. We implemented four programs on CUDA to evaluate the performance of the schemes and the optimization. Through experimentation, we analyzed the organization of threads suited to Grover algorithm simulations, compared the storage costs of the four programs, and validated the effectiveness of the optimization. Experimental results also showed that the best-performing CUDA program outperformed the serial libquantum program on a CPU with a speedup of up to 23 times (12 times on average), depending on the scale of the simulation.
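A minimal state-vector simulation shows why Grover's algorithm maps well to GPUs: each iteration is an oracle that flips the sign of the marked amplitude, followed by a diffusion step (inversion about the mean), and every amplitude update after the mean is computed is independent. A natural CUDA mapping would be one thread per amplitude plus a parallel reduction for the mean. This is a generic sketch, not either of the paper's two workflow schemes; the function name `grover` is our own.

```python
import math
from typing import List


def grover(n_qubits: int, marked: int, iterations: int) -> List[float]:
    """Return measurement probabilities after the given number of Grover iterations."""
    size = 2 ** n_qubits
    amp = [1 / math.sqrt(size)] * size  # uniform superposition
    for _ in range(iterations):
        amp[marked] = -amp[marked]            # oracle: phase-flip the marked state
        mean = sum(amp) / size                # reduction (parallel on a GPU)
        amp = [2 * mean - a for a in amp]     # diffusion: inversion about the mean
    return [a * a for a in amp]
```

With 3 qubits and the optimal 2 iterations (about π/4·√8), the marked state is measured with probability 121/128 ≈ 0.945.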
14.
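Region filling is a common operation in graphics processing. Taking advantage of current multi-core CPUs and the general-purpose computing capability of NVIDIA graphics cards, this paper implements a method for filling a specified region in parallel. The algorithm uses multiple seeds and multithreading to complete the fill quickly, and it avoids the drawback of traditional algorithms that require the seed position to be set manually. After filling, the results are checked and invalid filled regions are discarded to obtain the final result. Experiments show that for larger images the multi-core CPU provides a clear speedup.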
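A minimal sketch of multi-seed region filling with a post-check, under two assumptions the abstract does not spell out: seeds are placed automatically on a regular lattice (parameter `step`), and a filled region counts as "invalid" if it leaks to the image border (i.e. it is background rather than an enclosed region). Seeds are processed one after another here; in a multithreaded version each seed's fill could run on its own thread, with extra care when two seeds land in the same region.

```python
from collections import deque
from typing import List


def multi_seed_fill(grid: List[List[int]], step: int) -> List[List[int]]:
    """Fill a binary image from automatically placed seeds.

    grid[y][x] == 0 marks a fillable pixel, 1 marks a boundary pixel.
    Returns a label image: 0 = boundary or discarded, >0 = kept region id.
    """
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    invalid = set()
    next_label = 1
    for sy in range(0, h, step):
        for sx in range(0, w, step):
            if grid[sy][sx] != 0 or labels[sy][sx] != 0:
                continue  # boundary pixel, or already filled from an earlier seed
            label = next_label
            next_label += 1
            labels[sy][sx] = label
            queue = deque([(sy, sx)])
            while queue:  # 4-connected breadth-first fill
                y, x = queue.popleft()
                if y in (0, h - 1) or x in (0, w - 1):
                    invalid.add(label)  # assumption: reaching the border means invalid
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 0 and labels[ny][nx] == 0:
                        labels[ny][nx] = label
                        queue.append((ny, nx))
    # Post-check: discard regions judged invalid.
    for y in range(h):
        for x in range(w):
            if labels[y][x] in invalid:
                labels[y][x] = 0
    return labels
```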
17.
Jinwoong Kim, Sul-Gi Kim, Beomseok Nam. Journal of Parallel and Distributed Computing, 2013
General-purpose computing on graphics processing units (GP-GPU) has emerged as a new cost-effective parallel computing paradigm in high-performance computing research that enables large amounts of data to be processed in parallel. Large-scale, data-intensive scientific applications have been playing an important role in modern high-performance computing research. A common access pattern in such scientific data analysis applications is the multi-dimensional range query, but not much research has been conducted on multi-dimensional range queries on the GPU. Inherently multi-dimensional indexing trees such as R-trees are not well suited to the GPU environment because of their irregular tree traversal. Traversing irregular tree search paths makes it hard to maximize the utilization of massively parallel architectures. In this paper, we propose a novel MPTS (Massively Parallel Three-phase Scanning) R-tree traversal algorithm for multi-dimensional range queries that converts recursive access to tree nodes into sequential access. Our extensive experimental study shows that the MPTS R-tree traversal algorithm on an NVIDIA Tesla M2090 GPU consistently outperforms the traditional recursive R-tree search algorithm on Intel Xeon E5506 processors.
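To illustrate what "converting recursive access into sequential access" can look like, the sketch below answers a range query level by level: the nodes to visit at each level form a flat frontier array that is scanned sequentially, with no recursion or per-thread stack. This is a generic level-synchronous traversal, not MPTS itself, whose three phases the abstract does not describe; the `Node` layout (nodes stored in one array, children referenced by index) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Rect, b: Rect) -> bool:
    """True if two axis-aligned rectangles intersect (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    mbrs: List[Rect]     # bounding rectangle of each entry
    children: List[int]  # child node ids (internal node) or data ids (leaf)
    leaf: bool


def range_query(nodes: List[Node], root: int, query: Rect) -> List[int]:
    """Level-by-level range query: no recursion, one flat frontier per level."""
    frontier = [root]
    hits: List[int] = []
    while frontier:
        next_frontier: List[int] = []
        # Every (node, entry) pair in this loop is independent, so a GPU version
        # could assign one thread per entry and scan them in parallel.
        for nid in frontier:
            node = nodes[nid]
            for mbr, child in zip(node.mbrs, node.children):
                if overlaps(mbr, query):
                    (hits if node.leaf else next_frontier).append(child)
        frontier = next_frontier
    return sorted(hits)
```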
18.
We introduce a new GPGPU-based real-time dense stereo matching algorithm. The algorithm is based on a progressive multi-resolution pipeline which includes background modeling and dense matching with adaptive windows. For applications in which only moving objects are of interest, this approach significantly reduces the overall computation cost and preserves high-definition details. Running on an off-the-shelf commodity graphics card, our implementation achieves 36 fps stereo matching on 1024 × 768 stereo video with a fine 256-pixel disparity range. This is equivalent to about 7200 M disparity evaluations per second. For scenes where the static background assumption holds, our approach outperforms all published alternative algorithms in speed by a large margin. We envision a number of potential applications such as real-time motion capture, as well as tracking, recognition, and identification of moving objects in multi-camera networks.
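The "7200 M disparity evaluations per second" figure follows from treating every pixel as evaluated at every disparity in every frame. Since the background modeling actually skips static pixels, this is an equivalent throughput rather than a count of work performed; the inputs are those stated in the abstract.

```latex
\underbrace{1024 \times 768}_{786\,432\ \text{pixels}}
\times 256\ \text{disparities} \times 36\ \text{frames/s}
= 7\,247\,757\,312 \approx 7.2 \times 10^{9}\ \text{disparity evaluations/s}
```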
19.
Dennis Weyland, Roberto Montemanni, Luca Maria Gambardella. Journal of Parallel and Distributed Computing, 2013
In this work we propose a general metaheuristic framework for solving stochastic combinatorial optimization problems based on general-purpose computing on graphics processing units (GPGPU). This framework is applied to the probabilistic traveling salesman problem with deadlines (PTSPD) as a case study. Computational studies reveal significant improvements over state-of-the-art methods for the PTSPD. Additionally, our results reveal the huge potential of the proposed framework and sampling-based methods for stochastic combinatorial optimization problems.
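Sampling-based evaluation is the piece of such a framework that parallelizes most naturally: the expected cost of an a priori tour is estimated by averaging its cost over randomly sampled scenarios, and each scenario can be evaluated independently. The cost model below is an assumption for illustration, not necessarily the paper's PTSPD definition: node 0 is the depot, each customer is present with its own probability, absent customers are skipped, travel time equals Euclidean distance, and lateness past a deadline is charged at a hypothetical `penalty` rate.

```python
import math
import random
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def scenario_cost(tour: List[int], coords: List[Point], deadlines: List[float],
                  present: Dict[int, bool], penalty: float) -> float:
    """Cost of one sampled scenario: travel distance plus weighted lateness."""
    pos, t, cost = coords[0], 0.0, 0.0
    for c in tour:
        if not present[c]:
            continue  # absent customers are skipped in this scenario
        d = math.dist(pos, coords[c])
        t += d
        cost += d + penalty * max(0.0, t - deadlines[c])
        pos = coords[c]
    return cost + math.dist(pos, coords[0])  # return to the depot


def estimate_expected_cost(tour: List[int], coords: List[Point], probs: List[float],
                           deadlines: List[float], penalty: float,
                           samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the a priori tour's expected cost."""
    rng = random.Random(seed)
    total = 0.0
    # Samples are independent, so on a GPU each could be evaluated by its own thread.
    for _ in range(samples):
        present = {c: rng.random() < probs[c] for c in tour}
        total += scenario_cost(tour, coords, deadlines, present, penalty)
    return total / samples
```

A metaheuristic would call `estimate_expected_cost` to compare candidate tours; the sample count trades estimation noise against evaluation time.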