Similar Literature
20 similar documents found (search time: 15 ms)
1.
Accelerating Advanced MRI Reconstructions on GPUs
Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. This paper describes the acceleration of such an algorithm on NVIDIA's Quadro FX 5600. The reconstruction of a 3D image with 128³ voxels achieves up to 180 GFLOPS and requires just over one minute on the Quadro, while reconstruction on a quad-core CPU is twenty-one times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%.

2.
This paper presents implementation strategies and optimization approaches for a D3Q19 lattice Boltzmann flow solver on nVIDIA graphics processing units (GPUs). Using the STREAM benchmarks we demonstrate the GPU parallelization approach and obtain an upper limit for the flow solver performance. We discuss the GPU-specific implementation of the solver with a focus on memory alignment and register shortage. The optimized code is up to an order of magnitude faster than standard two-socket x86 servers with AMD Barcelona or Intel Nehalem CPUs. We further analyze data transfer rates for the PCI-express bus to evaluate the potential benefits of multi-GPU parallelism in a cluster environment.

3.
周伟  安虹  刘谷  李小强  吴石磊 《计算机科学》2012,39(12):295-299
Clustering, a classic data-mining algorithm, is frequently used in the analysis of radar echo data. For input data sets of large size and high dimensionality, however, the algorithm is very time-consuming. Although much prior work has parallelized and optimized clustering algorithms on GPU platforms, it has ignored the influence of the input data set on the optimization. This paper therefore proposes a novel fast radar clustering implementation on the GPU/CUDA platform. The implementation inspects the incoming echo data at runtime to obtain its distribution, which is then used to guide thread-block scheduling when the clustering computation executes on the GPU; the runtime module itself incurs very little overhead. Experiments show that this input-aware runtime scheduling greatly reduces the GPU's computational load, yields a 20%-40% performance improvement over a CUDA implementation with a generic strategy, and strengthens the real-time behavior of the algorithm.
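The input-aware scheduling is specific to this paper, but the data-parallel structure that makes clustering worth offloading to a GPU can be sketched with a minimal k-means loop. The abstract does not name its clustering algorithm, so k-means, the deterministic seeding, and the sample data below are all illustrative assumptions:

```python
import numpy as np

def kmeans(points, init_idx, iters=20):
    """Minimal k-means. The assignment step (distance from every point to
    every centre) is the embarrassingly parallel workload a GPU accelerates;
    the paper's input-aware thread-block scheduling is not reproduced here."""
    centers = points[list(init_idx)].astype(float)
    for _ in range(iters):
        # Assignment: label each point with its nearest centre.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update: move each centre to the mean of its assigned points.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0.0, 0.1, (50, 2)),   # cluster around (0, 0)
                 rng.normal(5.0, 0.1, (50, 2))])  # cluster around (5, 5)
labels, centers = kmeans(pts, init_idx=(0, 99))
```

On a GPU, the distance matrix `d` is what the thread blocks compute; the paper's contribution is deciding how to schedule those blocks based on the observed data distribution.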

4.
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40, 135 and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed precision approach for Krylov solvers using reliable updates which allows for full double precision accuracy while using only single or half precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
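The mixed precision idea, bulk arithmetic in low precision with periodic full-precision "reliable updates" of the residual, can be sketched in a few lines of NumPy. This is a conceptual sketch on an assumed small SPD test matrix, not the paper's Wilson-Dirac solver:

```python
import numpy as np

def mixed_precision_cg(A, b, tol=1e-5, max_iter=500, reliable_every=10):
    """Conjugate gradients with the bulk arithmetic (the matrix-vector
    product and search direction) held in float32, plus a periodic float64
    'reliable update' that recomputes the true residual. A sketch of the
    idea in the abstract above, not the paper's implementation."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)
    r = b.astype(np.float64).copy()          # residual tracked in double
    p = r.astype(np.float32)
    rr = float(r @ r)
    nb = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = (A32 @ p).astype(np.float64)    # low-precision matvec
        alpha = rr / float(p.astype(np.float64) @ Ap)
        x = x + alpha * p.astype(np.float64)
        if k % reliable_every == 0:
            r = b - A @ x                    # reliable update in full precision
        else:
            r = r - alpha * Ap               # cheap recurrence (accumulates float32 rounding)
        rr_new = float(r @ r)
        if np.sqrt(rr_new) < tol * nb:
            return x, k
        p = (r + (rr_new / rr) * p.astype(np.float64)).astype(np.float32)
        rr = rr_new
    return x, max_iter

rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)   # assumed SPD, moderately conditioned test matrix
b = rng.standard_normal(50)
x, iters = mixed_precision_cg(A, b)
```

The reliable update is what prevents the float32 rounding in the residual recurrence from silently corrupting the convergence test.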

5.
Swan: A tool for porting CUDA programs to OpenCL
The use of modern, high-performance graphical processing units (GPUs) for acceleration of scientific computation has been widely reported. The majority of this work has used the CUDA programming model supported exclusively by GPUs manufactured by NVIDIA. An industry standardisation effort has recently produced the OpenCL specification for GPU programming. This offers the benefits of hardware-independence and reduced dependence on proprietary tool-chains. Here we describe a source-to-source translation tool, “Swan” for facilitating the conversion of an existing CUDA code to use the OpenCL model, as a means to aid programmers experienced with CUDA in evaluating OpenCL and alternative hardware. While the performance of equivalent OpenCL and CUDA code on fixed hardware should be comparable, we find that a real-world CUDA application ported to OpenCL exhibits an overall 50% increase in runtime, a reduction in performance attributable to the immaturity of contemporary compilers. The ported application is shown to have platform independence, running on both NVIDIA and AMD GPUs without modification. We conclude that OpenCL is a viable platform for developing portable GPU applications but that the more mature CUDA tools continue to provide best performance.

Program summary

Program title: Swan
Catalogue identifier: AEIH_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIH_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GNU Public License version 2
No. of lines in distributed program, including test data, etc.: 17 736
No. of bytes in distributed program, including test data, etc.: 131 177
Distribution format: tar.gz
Programming language: C
Computer: PC
Operating system: Linux
RAM: 256 Mbytes
Classification: 6.5
External routines: NVIDIA CUDA, OpenCL
Nature of problem: Graphical Processing Units (GPUs) from NVIDIA are preferentially programmed with the proprietary CUDA programming toolkit. An alternative programming model promoted as an industry standard, OpenCL, provides similar capabilities to CUDA and is also supported on non-NVIDIA hardware (including multicore x86 CPUs, AMD GPUs and IBM Cell processors). The adaptation of a program from CUDA to OpenCL is relatively straightforward but laborious. The Swan tool facilitates this conversion.
Solution method: Swan performs a translation of CUDA kernel source code into an OpenCL equivalent. It also generates the C source code for entry point functions, simplifying kernel invocation from the host program. A concise host-side API abstracts the CUDA and OpenCL APIs. A program adapted to use Swan has no dependency on the CUDA compiler for the host-side program. The converted program may be built for either CUDA or OpenCL, with the selection made at compile time.
Restrictions: No support for CUDA C++ features
Running time: Nominal
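The flavor of Swan's source-to-source translation can be suggested by the identifier correspondences such a tool must handle. The toy mapper below uses an illustrative subset of CUDA-to-OpenCL rules and is in no way Swan's actual implementation, which also rewrites kernel signatures and generates host-side entry points:

```python
import re

# A few of the CUDA -> OpenCL correspondences a translator must handle
# (an illustrative subset, not Swan's actual rule table).
CUDA_TO_OPENCL = {
    r"\bthreadIdx\.x\b": "get_local_id(0)",
    r"\bblockIdx\.x\b": "get_group_id(0)",
    r"\bblockDim\.x\b": "get_local_size(0)",
    r"\b__global__\b": "__kernel",
    r"\b__shared__\b": "__local",
    r"\b__syncthreads\(\)": "barrier(CLK_LOCAL_MEM_FENCE)",
}

def cuda_to_opencl(src: str) -> str:
    """Rewrite CUDA builtins in kernel source to their OpenCL equivalents."""
    for pattern, replacement in CUDA_TO_OPENCL.items():
        src = re.sub(pattern, replacement, src)
    return src

kernel = ("__global__ void add(float *a) "
          "{ int i = blockIdx.x * blockDim.x + threadIdx.x; a[i] += 1.0f; }")
translated = cuda_to_opencl(kernel)
```

Real kernels also need pointer address-space qualifiers and launch-configuration plumbing, which is exactly the laborious part the abstract says Swan automates.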

6.
A Parallel Softassign Algorithm for 3D Human Ear Point Cloud Registration
This paper proposes a new parallel Softassign registration algorithm for human ear point clouds. Building on a CUDA-based parallel acceleration of the Softassign algorithm, it combines discrete curvature estimation on the 3D point cloud with a 3D kd-tree to simplify the ear point cloud while retaining sufficient geometric features. Softassign registration is then performed on the simplified cloud, improving the registration accuracy of Softassign on ear point clouds and avoiding defects such as convergence to purely local alignments. The efficiency and accuracy of the algorithm are verified in practical applications.

7.
Fast GPU-Based Rigid Registration of 3D Medical Images
Automatic 3D registration maps multiple image data sets into a common coordinate system and is widely used in medical image analysis. However, current mainstream 3D rigid registration algorithms (such as FLIRT) are slow: rigid registration of a 256³ data set takes about 300 s, which cannot meet the needs of fast clinical application. This paper therefore proposes a fast 3D registration technique based on the CUDA (compute unified device architecture) platform, using GPU (graphics processing unit) parallel computation for the coordinate transformation, linear interpolation and similarity-measure calculation in the registration. Experiments on clinical 3D medical images show that, while maintaining registration accuracy, the technique increases registration speed …

8.
Sparse matrix–vector multiplication (SpMV) is one of the most important high level operations for basic linear algebra. Nowadays, the GPU has evolved into a highly parallel coprocessor which is suited to compute-intensive, highly parallel computation. Achieving high performance of SpMV on GPUs is relatively challenging, especially when the matrix has no specific structure. For these general sparse matrices, a new data structure based on the bisection ELLPACK format, BiELL, is designed to realize the load balance better, and thus improve the performance of the SpMV. Besides, based on the same idea of JAD format, the BiJAD format can be obtained. Experimental results on various matrices show that the BiELL and BiJAD formats perform better than other similar formats, especially when the number of non-zero elements per row varies a lot.

9.
This paper proposes a parallel scheme for accelerating parameter sweep applications on a graphics processing unit. Using hundreds of GPU cores, our scheme processes multiple parameters simultaneously rather than one at a time. The simultaneous sweeps exploit the similarity of computing behaviors shared by different parameters, allowing memory accesses to be coalesced into a single access when similar irregularities appear among the parameters' computational tasks. In addition, our scheme reduces the amount of off-chip memory access by unifying the data that are commonly referenced by multiple parameters and by placing the unified data in the fast on-chip memory. In several experiments, we applied our scheme to practical applications and found that it can perform up to 8.5 times faster than a naive scheme that processes a single parameter at a time. We also include a discussion of the application characteristics required for our scheme to outperform the naive scheme. Copyright © 2013 John Wiley & Sons, Ltd.

10.
In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPU by carefully balancing usage of registers and shared memory. Unlike ea…

11.
The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devise data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared on the scene. This work proposes and evaluates a new implementation of SpMV for NVIDIA GPUs based on a new format, ELLPACK‐R, that allows storage of the sparse matrix in a regular manner. A comparative evaluation against a variety of storage formats previously proposed has been carried out based on a representative set of test matrices. The results show that, although the performance strongly depends on the specific pattern of the matrix, the implementation based on ELLPACK‐R achieves higher overall performance. Moreover, a comparison with standard state‐of‐the‐art superscalar processors reveals that significant speedup factors are achieved with GPUs. Copyright © 2010 John Wiley & Sons, Ltd.
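ELLPACK-R pads rows to a common width but also stores each row's true non-zero count, so a GPU thread assigned to a row stops at its own length instead of iterating over padding. A minimal NumPy model of the format and the corresponding SpMV, with a sequential loop standing in for the per-row GPU threads:

```python
import numpy as np

def to_ellpack_r(dense):
    """Build ELLPACK-R arrays: values and column indices padded to the
    longest row, plus rl, the true per-row non-zero count that lets each
    GPU thread stop early instead of reading padding."""
    n = dense.shape[0]
    cols = [np.nonzero(row)[0] for row in dense]
    rl = np.array([len(c) for c in cols])
    width = rl.max()
    val = np.zeros((n, width))
    idx = np.zeros((n, width), dtype=int)
    for i, c in enumerate(cols):
        val[i, :len(c)] = dense[i, c]
        idx[i, :len(c)] = c
    return val, idx, rl

def spmv_ellpack_r(val, idx, rl, x):
    """SpMV over the ELLPACK-R layout; the outer loop stands in for the
    one-thread-per-row GPU mapping."""
    y = np.zeros(val.shape[0])
    for i in range(val.shape[0]):
        for j in range(rl[i]):
            y[i] += val[i, j] * x[idx[i, j]]
    return y

A = np.array([[4.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])
x = np.array([1.0, 2.0, 3.0])
y = spmv_ellpack_r(*to_ellpack_r(A), x)  # equals A @ x
```

On the GPU, `val` and `idx` are stored column-major so consecutive threads read consecutive addresses; that layout detail is omitted from this sketch.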

12.
Fourier methods have revolutionized many fields of science and engineering, such as astronomy, medical imaging, seismology and spectroscopy, and the fast Fourier transform (FFT) is a computationally efficient method of generating a Fourier transform. The emerging class of high performance computing architectures, such as GPUs, seeks to achieve much higher performance and efficiency by exposing a hierarchy of distinct memories to software. However, the complexity of GPU programming poses a significant challenge to developers. In this paper, we propose an automatic performance tuning framework for FFT on various OpenCL GPUs, and implement a high performance library named MPFFT based on this framework. For power-of-two length FFTs, our library substantially outperforms the clAmdFft library on AMD GPUs and achieves comparable performance to the CUFFT library on NVIDIA GPUs. Furthermore, our library also supports non-power-of-two sizes. For 3D non-power-of-two FFTs, our library runs 1.5x to 28x faster than FFTW with 4 threads and achieves an average speedup of 20.01x over CUFFT 4.0 on a Tesla C2050.

13.
Histogram generation is a sequential, irregular, data-dependent loop computation that is widely used in many fields. Because of its irregular memory accesses, shared-memory accesses from multiple threads cause many bank conflicts, which hinder parallel efficiency. Implementing an efficient histogram generation algorithm on parallel processors, in particular on today's most advanced graphics processing units (GPUs), is therefore of considerable research value. To reduce bank conflicts during histogram generation, memory padding is used to spread the threads' shared-memory accesses evenly across the banks, greatly reducing the memory access latency of histogram generation on the GPU. In addition, an effective and reliable near-optimal configuration search model is proposed to guide users in setting GPU execution parameters for higher performance. Experiments verify that, in real applications, the improved algorithm performs 42%-88% better than the original.
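The padding trick can be modeled directly. Assuming 32 shared-memory banks (the count on NVIDIA GPUs of this era) and thread-private sub-histograms, a stride that is a multiple of 32 puts every thread of a warp in the same bank, while one word of padding staggers them; the layout and sizes below are illustrative, not the paper's configuration:

```python
import numpy as np

NUM_BANKS = 32  # shared-memory banks per multiprocessor (assumed, era-typical)

def max_bank_conflicts(addresses):
    """Worst-case serialization for one warp's shared-memory access: the
    largest number of threads whose word addresses fall in the same bank."""
    banks = np.asarray(addresses) % NUM_BANKS
    return int(np.bincount(banks, minlength=NUM_BANKS).max())

warp = np.arange(32)          # the 32 threads of one warp
bins = 64                     # bins in each thread-private sub-histogram

# Unpadded: thread t's histogram starts at t * bins. Since 64 % 32 == 0,
# all threads incrementing the same bin hit the same bank: a 32-way conflict.
unpadded = warp * bins + 5
# Padded: one extra word per sub-histogram (stride 65, coprime with 32)
# staggers the banks, so the same update is conflict-free.
padded = warp * (bins + 1) + 5

print(max_bank_conflicts(unpadded), max_bank_conflicts(padded))  # 32 1
```

The 32-fold serialization in the unpadded layout is the latency the abstract's padding scheme removes.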

14.
Self-organising neural models have the ability to provide a good representation of the input space. In particular, the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation. However, this type of learning is time-consuming, especially for high-dimensional input data. Since real applications often work under time constraints, it is necessary to adapt the learning process in order to complete it in a predefined time. This paper proposes a Graphics Processing Unit (GPU) parallel implementation of the GNG with Compute Unified Device Architecture (CUDA). In contrast to existing algorithms, the proposed GPU implementation allows the acceleration of the learning process while keeping a good quality of representation. Comparative experiments using iterative, parallel and hybrid implementations are carried out to demonstrate the effectiveness of the CUDA implementation. The results show that GNG learning with the proposed implementation achieves a speed-up of 6× compared with the single-threaded CPU implementation. The GPU implementation has also been applied to a real application with time constraints, acceleration of 3D scene reconstruction for egomotion, in order to validate the proposal.

15.
Graphics Processing Units (GPUs) have evolved into a highly parallel and fully programmable architecture over the past five years, and the advent of CUDA has facilitated their application to many real-world applications. In this paper, we deal with a GPU implementation of Ant Colony Optimization (ACO), a population-based optimization method which comprises two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns. Our contribution within this context is threefold: (1) a data parallelism scheme for tour construction tailored to GPUs, (2) novel GPU programming strategies for the pheromone update stage, and (3) a new mechanism called I-Roulette to replicate the classic roulette wheel while improving GPU parallelism. Our implementation leads to factor gains exceeding 20x for either of the two stages of the ACO algorithm as applied to the TSP, compared to its sequential counterpart running on a similar single-threaded high-end CPU. Moreover, an extensive discussion focused on different implementation paths on GPUs shows the way to deal with parallel graph connected components. This, in turn, suggests a broader area of inquiry, where algorithm designers may learn to adapt similar optimization methods to GPU architecture.
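I-Roulette itself is the paper's contribution, but the classic roulette wheel it replicates is simple to state: select index i with probability proportional to its weight. A sequential sketch with illustrative weights (in ACO's tour construction the weights would be pheromone times heuristic values):

```python
import numpy as np

def roulette_wheel(weights, rng):
    """Classic roulette-wheel selection: pick index i with probability
    weights[i] / sum(weights). The cumulative-sum scan is the sequential
    dependency that the paper's I-Roulette mechanism reworks for GPUs."""
    cum = np.cumsum(weights)
    return int(np.searchsorted(cum, rng.uniform(0.0, cum[-1]), side="right"))

rng = np.random.default_rng(42)
w = np.array([0.1, 0.1, 0.8])   # illustrative selection weights
picks = np.bincount([roulette_wheel(w, rng) for _ in range(10_000)], minlength=3)
```

Over many draws, index 2 should be chosen roughly 80% of the time, matching its share of the total weight.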

16.
While Graphics Processing Units (GPUs) show high performance for problems with regular structures, they do not perform well for irregular tasks due to the mismatches between irregular problem structures and SIMD-like GPU architectures. In this paper, we introduce a new library, CUIRRE, for improving performance of irregular applications on GPUs. CUIRRE reduces the load imbalance of GPU threads resulting from irregular loop structures. In addition, CUIRRE can characterize irregular applications for their irregularity, thread granularity and GPU utilization. We employ this library to characterize and optimize both synthetic and real-world applications. The experimental results show that a 1.63× on average and up to 2.76× performance improvement can be achieved with the centralized task pool approach in the library at a 4.57% average overhead with static loading ratios. To avoid the cost of exhaustive searches of loading ratios, an adaptive loading ratio method is proposed to derive appropriate loading ratios for different inputs automatically at runtime. Our task pool approach outperforms other load balancing schemes such as the task stealing method and the persistent threads method. The CUIRRE library can easily be applied on many other irregular problems.

17.
A Quadtree-Based LOD Algorithm for 3D Terrain Simulation
荆涛 《计算机仿真》2005,22(11):123-126
Level-of-detail (LOD) display and simplification is one of the most widely applied techniques in real-time realistic rendering: it considerably reduces scene complexity while losing little visual fidelity, and it satisfies real-time constraints. Among the LOD algorithms in the literature, one of the most common is the quadtree-based LOD algorithm. Its basic idea is very simple: a distance threshold controls the recursion depth of the quadtree, so a larger threshold yields fewer triangles and a smaller threshold yields more. The experiments in this paper also use this method to compute the LOD. The paper further describes the principle and implementation of the LOD technique, discusses problems in implementing the LOD algorithm and ways to improve it, studies improvements to the node-evaluation scheme, and finally looks ahead to the further development of LOD techniques.
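The threshold mechanism the abstract describes, a distance threshold controlling quadtree recursion depth, can be sketched as follows. The split criterion (tile size divided by viewer distance) and all sizes are illustrative assumptions, not the paper's node-evaluation system:

```python
import math

def quadtree_lod(cx, cy, size, viewer, threshold, depth=0, max_depth=6):
    """Split a square terrain tile while its apparent size (tile size over
    distance to the viewer) exceeds the threshold. Each returned leaf would
    be rendered as two triangles, so a larger threshold stops recursion
    earlier and yields fewer triangles, as the abstract describes."""
    dist = math.hypot(cx - viewer[0], cy - viewer[1]) + 1e-9  # avoid /0 at the viewer
    if depth >= max_depth or size / dist < threshold:
        return [(cx, cy, size)]
    h = size / 2
    leaves = []
    for dx in (-h / 2, h / 2):
        for dy in (-h / 2, h / 2):
            leaves += quadtree_lod(cx + dx, cy + dy, h, viewer, threshold,
                                   depth + 1, max_depth)
    return leaves

coarse = quadtree_lod(0, 0, 1024, viewer=(10, 0), threshold=2.0)    # large threshold
fine = quadtree_lod(0, 0, 1024, viewer=(10, 0), threshold=0.25)     # small threshold
far = quadtree_lod(0, 0, 1024, viewer=(50_000, 0), threshold=0.25)  # distant viewer
```

Tiles near the viewer subdivide deeply while distant tiles stay whole, which is exactly the adaptive triangle budget the quadtree LOD scheme aims for.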

18.
This paper presents two schemes for parallel 2D discrete wavelet transform (DWT) on Compute Unified Device Architecture graphics processing units. In the first scheme, the image and filter are transformed to the spectral domain using the Fast Fourier Transform (FFT), multiplied, and then transformed back to the spatial domain using the inverse FFT. In the second scheme, the image pixels are convolved directly with the filters. Because there is no data dependence, the convolutions for data points at different positions can be executed concurrently. To reduce data transfer, boundary extension and down-sampling are processed during the data loading stage, and transposition is completed implicitly during data storage. A similar technique is adopted when parallelizing the inverse 2D DWT. To further speed up data access, the filter coefficients are stored in the constant memory. We have parallelized the 2D DWT for dozens of wavelet types and achieved a speedup factor of over 380 times compared with its CPU version. We applied the parallel 2D DWT in a ring artifact removal procedure; execution was accelerated nearly 200 times compared with the CPU version. The experimental results show that the proposed parallel 2D DWT on graphics processing units can significantly improve performance for a wide variety of wavelet types and is promising for various applications. Copyright © 2015 John Wiley & Sons, Ltd.
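Neither scheme's GPU code can be reproduced here, but the separable 2D DWT itself (filter the rows, then the columns) is easy to show with the Haar wavelet. The paper supports dozens of wavelet types; Haar is chosen only to keep the sketch short:

```python
import numpy as np

def haar_dwt2(img):
    """One level of the separable 2D Haar DWT: low/high-pass filter and
    down-sample the rows, then the columns, giving the LL, LH, HL, HH
    sub-bands. Input sides must be even."""
    s = 1.0 / np.sqrt(2.0)
    lo = (img[:, 0::2] + img[:, 1::2]) * s   # row low-pass + down-sample
    hi = (img[:, 0::2] - img[:, 1::2]) * s   # row high-pass + down-sample
    ll = (lo[0::2, :] + lo[1::2, :]) * s
    lh = (lo[0::2, :] - lo[1::2, :]) * s
    hl = (hi[0::2, :] + hi[1::2, :]) * s
    hh = (hi[0::2, :] - hi[1::2, :]) * s
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2 (the Haar transform is orthonormal)."""
    s = 1.0 / np.sqrt(2.0)
    n, m = ll.shape
    lo = np.empty((2 * n, m)); hi = np.empty((2 * n, m))
    lo[0::2], lo[1::2] = (ll + lh) * s, (ll - lh) * s
    hi[0::2], hi[1::2] = (hl + hh) * s, (hl - hh) * s
    img = np.empty((2 * n, 2 * m))
    img[:, 0::2], img[:, 1::2] = (lo + hi) * s, (lo - hi) * s
    return img

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8))
sub = haar_dwt2(img)
```

The row pass followed by the column pass is the pattern the paper parallelizes, with the transposition between passes folded into the data-storage step.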

19.
To address the low efficiency of SAR imaging algorithms on general-purpose computing platforms, this paper proposes a CUDA-based parallel implementation of a SAR imaging algorithm. After analysing how CUDA works and the parallelism inherent in the chirp scaling (CS) algorithm, the CUDA implementation of each step of the algorithm is described in detail. Experimental results demonstrate the efficiency of the approach: the optimized CS algorithm achieves a speedup of 10-20x.

20.
Connected component labeling on a 2D grid using CUDA
Connected component labeling is an important but computationally expensive operation required in many fields of research. The goal in the present work is to label connected components on a 2D binary map. Two different iterative algorithms for this task are presented. The first algorithm (Row-Col Unify) is based on directional propagation labeling, whereas the second uses the Label Equivalence technique. The Row-Col Unify algorithm uses a local array of references and the reduction technique intrinsically; extensive use of shared memory makes the code efficient. The Label Equivalence algorithm is an extended version of the one presented by Hawick et al. (2010) [3]. Finally, a performance comparison of the two algorithms is presented.
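The iterative, data-parallel flavor shared by both algorithms can be sketched with a minimum-label propagation pass on a 2D binary map. This uses 4-connectivity and is a generic sketch of the iteration pattern, not the Row-Col Unify or Label Equivalence code:

```python
import numpy as np

def label_components(binary):
    """Minimum-label propagation CCL on a 2D binary map (4-connectivity).
    Every foreground cell starts with a unique label and repeatedly adopts
    the smallest label among itself and its foreground neighbours until a
    fixed point is reached; background cells get label -1. Each sweep maps
    naturally to one data-parallel GPU kernel launch."""
    h, w = binary.shape
    labels = np.where(binary, np.arange(h * w).reshape(h, w), -1)
    while True:
        new = labels.copy()
        for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            # shifted[y, x] holds the neighbour label labels[y + dy, x + dx]
            shifted = np.full_like(labels, -1)
            ys = slice(max(dy, 0), h + min(dy, 0))
            yd = slice(max(-dy, 0), h + min(-dy, 0))
            xs = slice(max(dx, 0), w + min(dx, 0))
            xd = slice(max(-dx, 0), w + min(-dx, 0))
            shifted[yd, xd] = labels[ys, xs]
            take = (new >= 0) & (shifted >= 0) & (shifted < new)
            new[take] = shifted[take]
        if np.array_equal(new, labels):
            return labels
        labels = new

grid = np.array([[1, 1, 0, 1],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1]])
lab = label_components(grid.astype(bool))
```

The number of sweeps grows with the diameter of the largest component, which is exactly the cost that the paper's equivalence-table techniques reduce.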

