Similar Documents
20 similar documents found.
1.
For microdosimetric calculations, event-by-event Monte Carlo (MC) methods are considered the most accurate. Their main shortcoming is the extensive computational time they require. In this work we present an event-by-event MC code for low-projectile-energy electron and proton tracks, enabling accelerated microdosimetric MC simulations on a graphics processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA such that the GPU and the multi-core CPU are utilized simultaneously. The two implementation schemes were tested and compared with the sequential single-threaded MC code on the CPU, with performance measured as the speedup on a set of benchmark cases of electron and proton tracks. A maximum speedup of 67.2x was achieved for the GPU-based MC code, and the hybrid approach improved this speedup by up to a further 20%. The results demonstrate the capability of our CPU-GPU implementation to accelerate MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.
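
As an illustration of the hybrid scheme described above, the following is a minimal sketch with illustrative names and placeholder physics (the actual event-by-event transport is far more involved): a fixed fraction of the histories is dispatched to the GPU while OpenMP threads on the CPU process the remainder concurrently.

```cuda
// Minimal sketch of a hybrid CUDA+OpenMP split. All names and the
// "physics" are placeholders, not the paper's code.
#include <cuda_runtime.h>

__global__ void simulate_tracks_gpu(const float *energies, float *deposits, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        deposits[i] = 0.1f * energies[i];   // stand-in for event-by-event transport
}

void simulate_track_cpu(float energy, float *deposit) {
    *deposit = 0.1f * energy;               // same stand-in physics on the CPU
}

void simulate_hybrid(const float *h_energies, float *h_deposits, int n,
                     float gpu_fraction) {
    int n_gpu = (int)(gpu_fraction * n);    // histories assigned to the GPU
    float *d_energies, *d_deposits;
    cudaMalloc(&d_energies, n_gpu * sizeof(float));
    cudaMalloc(&d_deposits, n_gpu * sizeof(float));
    cudaMemcpy(d_energies, h_energies, n_gpu * sizeof(float),
               cudaMemcpyHostToDevice);

    // Kernel launch is asynchronous: the GPU works on its share...
    simulate_tracks_gpu<<<(n_gpu + 255) / 256, 256>>>(d_energies, d_deposits, n_gpu);

    // ...while OpenMP threads process the remaining histories on the CPU.
    #pragma omp parallel for
    for (int i = n_gpu; i < n; ++i)
        simulate_track_cpu(h_energies[i], &h_deposits[i]);

    // Blocking copy: waits for the kernel, then gathers the GPU results.
    cudaMemcpy(h_deposits, d_deposits, n_gpu * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_energies);
    cudaFree(d_deposits);
}
```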

2.
This work studies GPU-accelerated solution of sparse linear systems with the preconditioned conjugate gradient (PCG) method. A program was developed on the Compute Unified Device Architecture (CUDA) platform, and its performance was tested and analyzed on an NVIDIA GT430 GPU. Sparse matrices are stored in compressed sparse row (CSR) format. Targeting the characteristics of the PCG algorithm, GPU-side performance optimization of sparse matrix-vector multiplication and measures to accelerate CPU-to-GPU data transfer were investigated. The hand-written sparse matrix-vector multiplication kernel was compared against the cusparseDcsrmv function from the CUSPARSE library, achieving a best-case speedup of 2.1x. For the complete PCG solver, the implementation based on custom kernels holds a slight advantage over one built on the CUBLAS and CUSPARSE libraries, and achieves a best-case speedup of 7.4x over the CPU-side PCG method.
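
For context on the kernel-vs-cusparseDcsrmv comparison, here is a minimal sketch (illustrative names, not the paper's code) of the standard one-thread-per-row CSR sparse matrix-vector product that such a hand-written kernel typically starts from.

```cuda
// One-thread-per-row CSR sparse matrix-vector product y = A*x.
#include <cuda_runtime.h>

__global__ void csr_spmv(int n_rows, const int *row_ptr, const int *col_idx,
                         const double *vals, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n_rows) {
        double sum = 0.0;
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];   // accumulate this row's nonzeros
        y[row] = sum;
    }
}

// Launch: csr_spmv<<<(n_rows + 255) / 256, 256>>>(n_rows, row_ptr, col_idx, vals, x, y);
```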

3.
To address the difficulty of porting applications to the GPU, a method for mapping serial source programs to parallel source programs is proposed. The method extracts the nesting structure of parallelizable loops from the serial source, establishes a correspondence between loop bodies and GPU threads, and generates the GPU-side kernel code; host-side control code is generated from the read/write attributes of referenced variables. A prototype compiler based on this method was implemented, automatically translating C source programs into CUDA source programs. Functional and performance tests of the prototype show that the generated CUDA programs are functionally equivalent to the original C programs while performing significantly better, which to a degree alleviates the difficulty of porting compute-intensive applications to heterogeneous CPU-GPU multi-core systems.
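
The abstract does not show the generated code; the following hypothetical example illustrates the kind of mapping described: a parallelizable C loop becomes a kernel in which the loop index is replaced by a thread index, and host-side control code is derived from each variable's read/write attributes.

```cuda
// Illustration (not the prototype system's actual output) of the mapping
// such a translator performs: each iteration of a parallelizable loop
// becomes one GPU thread.
#include <cuda_runtime.h>

// Original serial C loop:
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];

// Generated kernel: the loop index i is replaced by the thread index.
__global__ void loop_body_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // bounds guard replaces the loop condition
        c[i] = a[i] + b[i];
}

// Generated host-side control code: allocation, transfer, launch, copy-back,
// driven by the read (a, b) and write (c) attributes of the variables.
void run(const float *a, const float *b, float *c, int n) {
    float *da, *db, *dc;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);
    loop_body_kernel<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da); cudaFree(db); cudaFree(dc);
}
```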

4.
General-Purpose Graphics Processing Unit (GPGPU) computing with CUDA has been used effectively in scientific applications, where huge accelerations have been achieved. However, while today's traditional GPGPU can reduce the execution time of parallel code by many times, it comes at the expense of significant power and energy consumption. In this paper, we propose a ubiquitous parallel computing approach for constructing decision trees on the GPU. In our approach, we exploit the parallelism of the well-known ID3 algorithm for decision tree learning at two levels: at the outer level of building the tree node by node, and at the inner level of sorting data records within a single node. Thus, our approach not only accelerates the construction of decision trees via GPU computing, but does so while taking care of the power and energy consumption of the GPU. Experimental results show that our approach outperforms both a purely GPU-based implementation and a sequential CPU-based implementation by several times.
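
The abstract parallelizes per-node sorting; as a simpler stand-in for that inner level, this minimal sketch (illustrative names) counts (attribute value, class) pairs for one node's records in parallel, the histogram from which ID3's information gain is computed.

```cuda
// Inner-level parallelism stand-in for GPU-based ID3: histogram of
// (attribute value, class) pairs for the records in one tree node.
// counts must be zero-initialized first (e.g., with cudaMemset).
#include <cuda_runtime.h>

__global__ void count_value_class(const int *attr_vals, const int *labels,
                                  int n_records, int n_classes, int *counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_records)
        // counts[v][c] accumulates records with attribute value v and class c
        atomicAdd(&counts[attr_vals[i] * n_classes + labels[i]], 1);
}
```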

5.
In this paper, a graphics processing unit (GPU)-accelerated particle filtering algorithm is presented, together with a novel resampling technique. The aim is to mitigate particle impoverishment as well as the computational burden, problems commonly associated with classical (systematic) resampled particle filtering. The proposed algorithm employs an a priori space-dependent distribution in addition to the likelihood, and is hence christened the dual-distribution-dependent (D3) resampling method. Simulation results exhibit lower root mean square error (RMSE) than systematic resampling, and D3 resampling is shown to improve particle diversity after each iteration, thereby improving the overall quality of estimation. However, the computational burden increases significantly owing to a few additional computations within the newly formulated resampling framework. To obtain parallel speedup, we introduce a CUDA version of the proposed method for acceleration on the GPU. The GPU programming model is detailed in the context of this paper, and implementation issues are discussed along with an illustration of the empirical computational efficiency obtained by executing the CUDA code on a Quadro 2000 GPU. The GPU-enabled code achieves speedups of 3x and 4x over the sequential executions of the systematic and D3 resampling methods, respectively. Performance, both in terms of RMSE and running time, is elaborated with respect to different choices of threads per block toward an effective implementation. In this context, we further introduce a cost-to-performance metric (CPM) for assessing the algorithmic efficiency of the estimator, combining quality of estimation and running time into a unified assessment parameter. CPM values are determined for the estimators obtained from all the different choices of threads per block, and a final value of the parameter is selected to yield a holistically effective estimator.
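
The D3 resampling itself is specific to the paper and is not reproduced here. For context, this minimal sketch shows the classical systematic-resampling baseline it is compared against: given normalized cumulative weights (e.g., from thrust::inclusive_scan), each thread locates one ancestor index by binary search.

```cuda
// GPU systematic resampling baseline: cum_weights is the normalized
// cumulative weight sum (last element == 1), u0 is a single uniform
// draw in [0,1) made on the host. Names are illustrative.
#include <cuda_runtime.h>

__global__ void systematic_resample(const float *cum_weights, int n,
                                    float u0, int *indices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float u = (u0 + i) / n;            // stratified positions along [0,1)
        int lo = 0, hi = n - 1;            // binary search in the weight CDF
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cum_weights[mid] < u) lo = mid + 1; else hi = mid;
        }
        indices[i] = lo;                   // ancestor index for particle i
    }
}
```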

6.
张杰, 柴志雷, 喻津. 《计算机科学》 2015, 42(10): 297-300, 324
Feature extraction and description underpin many computer vision applications. Local feature extraction and description are computationally complex because of the high-dimensional, pixel-level processing involved, and their poor real-time performance limits their use in practical systems. This work examines the key computational modules shared by local feature extraction and description: the image pyramid mechanism and image gradient computation. Parallel versions of these shared modules were designed and implemented on the NVIDIA GPU/CUDA architecture, and their efficiency was further improved by optimizing accesses to global, texture, and shared memory. Experimental results show that the GPU-based image pyramid and image gradient computations achieve roughly 30x speedup over the CPU; applied within the HOG feature extraction and description algorithm, they achieve roughly 40x speedup over the CPU. This work has practical significance for high-speed GPU-based local feature extraction and description.
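
A minimal global-memory sketch of the image-gradient module discussed above (the paper's versions further optimize global-, texture-, and shared-memory accesses):

```cuda
// Central-difference image gradient: one thread per interior pixel,
// producing gradient magnitude and orientation. Names are illustrative.
#include <cuda_runtime.h>

__global__ void image_gradient(const float *img, int w, int h,
                               float *mag, float *ori) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x > 0 && x < w - 1 && y > 0 && y < h - 1) {   // interior pixels only
        float dx = img[y * w + x + 1] - img[y * w + x - 1];
        float dy = img[(y + 1) * w + x] - img[(y - 1) * w + x];
        mag[y * w + x] = sqrtf(dx * dx + dy * dy);    // gradient magnitude
        ori[y * w + x] = atan2f(dy, dx);              // gradient orientation
    }
}
```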

7.

Efficient collision detection is critical in 3D geometric modeling. In this paper, we first implement three parallel triangle-triangle intersection algorithms on a GPU and then compare their computational efficiency in an application that detects collisions between triangulated models. The presented GPU-based parallel collision detection method has two stages: first, a straightforward and efficient parallel approach reduces the number of potentially intersecting triangle pairs based on AABBs; second, intersection tests are conducted on the remaining triangle pairs in parallel using three triangle-triangle intersection algorithms, i.e., Möller's algorithm, Devillers' and Guigue's algorithm, and Shen's algorithm. To evaluate the performance of the method, we conduct four groups of benchmarks. The experimental results show the following: (1) the time required to detect collisions for a triangulated model comprising approximately 1.5 billion triangle pairs is less than 0.5 s; (2) the speedup of the GPU-based parallel collision detection method over the corresponding serial version is 50x to 60x; and (3) Devillers' and Guigue's algorithm is overall the best of the three GPU-based parallel triangle-triangle intersection algorithms. The presented GPU-accelerated method efficiently detects the potential collisions of triangulated models; overall, the GPU-accelerated parallel Devillers' and Guigue's triangle-triangle intersection algorithm is recommended for practical collision detection between large triangulated models.
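
A minimal sketch of the first stage described above, with illustrative names: one thread per candidate triangle pair tests AABB overlap, and only surviving pairs proceed to the exact triangle-triangle tests.

```cuda
// Broad-phase AABB filter: one thread per candidate pair.
#include <cuda_runtime.h>

struct AABB { float min[3], max[3]; };

__global__ void aabb_filter(const AABB *a, const AABB *b, const int2 *pairs,
                            int n_pairs, int *overlaps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_pairs) {
        AABB p = a[pairs[i].x], q = b[pairs[i].y];
        bool hit = true;
        for (int k = 0; k < 3; ++k)   // boxes overlap iff they overlap on every axis
            hit = hit && p.min[k] <= q.max[k] && q.min[k] <= p.max[k];
        overlaps[i] = hit;            // survivors go on to the exact intersection test
    }
}
```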

8.
GPU Acceleration of the Two-Dimensional Diffusion Equation
In recent years, GPUs have attracted much attention for offering far greater floating-point performance than CPUs, and NVIDIA's CUDA architecture has made general-purpose computation on the GPU a reality. This paper ports the two-dimensional diffusion equation, a benchmark problem in computational fluid dynamics, to the GPU using two approaches: global memory and texture memory. The results show a 34x speedup once the grid reaches the order of one million cells.
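
A minimal sketch of the global-memory variant: one explicit forward-Euler step of the 2D diffusion equation u_t = alpha*(u_xx + u_yy) on a uniform grid (the texture-memory variant essentially differs only in how u is read).

```cuda
// One explicit time step of 2D diffusion with a 5-point stencil.
// Stability requires dt <= dx*dx / (4*alpha). Names are illustrative.
#include <cuda_runtime.h>

__global__ void diffusion_step(const float *u, float *u_new, int nx, int ny,
                               float alpha, float dt, float dx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        float c = u[j * nx + i];
        float lap = (u[j * nx + i - 1] + u[j * nx + i + 1]
                   + u[(j - 1) * nx + i] + u[(j + 1) * nx + i]
                   - 4.0f * c) / (dx * dx);        // 5-point Laplacian
        u_new[j * nx + i] = c + alpha * dt * lap;  // forward-Euler update
    }
}
```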

9.
This paper presents a Graphics Processing Unit (GPU)-based implementation of the Bellman-Ford (BF) routing algorithm using NVIDIA's Compute Unified Device Architecture (CUDA). In the proposed GPU-based approach, multiple threads run concurrently over numerous streaming processors in the GPU to dynamically update routing information. Instead of computing the individual vertex distances one by one, a large number of threads update the vertex distances concurrently, with each vertex distance handled by a single thread. This paper compares the performance of the GPU-based approach to an equivalent CPU implementation while varying the number of vertices. Experimental results show that the proposed GPU-based approach outperforms the equivalent sequential CPU implementation in execution time by exploiting the massive parallelism inherent in the BF routing algorithm. In addition, the reduction in energy consumption (about 99%) achieved by using the GPU reflects the overall merits of deploying GPUs across the entire landscape of IP routing for emerging multimedia communications.
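
A minimal sketch of the per-vertex mapping described above, assuming (this is not stated in the abstract) that incoming edges are stored in CSR form: each thread updates one vertex distance per relaxation round, and the host re-launches the kernel until no distance changes.

```cuda
// One Bellman-Ford relaxation round, one thread per vertex (pull style).
// Host initializes dist to 0 at the source and FLT_MAX elsewhere, and
// swaps dist/dist_new between rounds. Names are illustrative.
#include <cuda_runtime.h>

__global__ void bf_relax(int n_vertices, const int *in_ptr, const int *in_src,
                         const float *in_w, const float *dist, float *dist_new,
                         int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n_vertices) {
        float best = dist[v];
        for (int e = in_ptr[v]; e < in_ptr[v + 1]; ++e) {
            float cand = dist[in_src[e]] + in_w[e];   // relax edge (u -> v)
            if (cand < best) best = cand;
        }
        dist_new[v] = best;
        if (best < dist[v]) *changed = 1;             // tells the host loop to continue
    }
}
```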

10.
Techniques for applying graphics processing units (GPUs) to general-purpose non-graphics computations, proposed in recent years by ATI (AMD FireStream, 2006) and NVIDIA (CUDA: Compute Unified Device Architecture, 2007), have given an impetus to developing algorithms and software packages for solving problems of diffractive optics on the GPU. Computations based on the widespread ray tracing method were among the first to be implemented on the GPU. The method attracted the attention of the CUDA technology architects, who proposed a GPU-based implementation of it at the NVISION08 conference (2008). The potential of this direction is associated both with research into the general issues of mapping the ray tracing method onto the GPU architecture (involving the use of various grid domains and trees) and with the development of dedicated software packages (the RTE and Linzik projects). In this work, special attention is given to an overview of techniques for GPU-aided implementation of the FDTD (finite-difference time-domain) method, which offers an instrument for solving problems of micro- and nano-optics using rigorous electromagnetic theory. The review of related papers ranges from the initial research (based on the use of textures) to complete software solutions (such as FDTD Software and FastFDTD).

11.
Magnetic resonance imaging (MRI) is a widely deployed medical imaging technique used for applications such as neuroimaging, cardiovascular imaging, and musculoskeletal imaging. However, MR images degrade in quality due to noise; magnitude MRI data in the presence of noise generally follows a Rician distribution when acquired with single-coil systems. Several methods have been proposed in the literature for denoising MR images corrupted with Rician noise; among them, the non-local maximum likelihood (NLML) method and its variants are popular. In spite of its performance and denoising quality, the NLML algorithm suffers from a tremendous time complexity of \(O(m^3 N^3)\), where \(m^3\) and \(N^3\) represent the search-window and image size, respectively, for a 3D image. This makes the algorithm challenging to deploy in real-time applications where fast and prompt results are required. A viable solution to this shortcoming is a data-parallel processing framework such as Nvidia CUDA, which turns the mutually independent and computationally intensive calculations to our advantage. The GPU-based implementation of NLML-based image denoising achieves significant speedup compared to the serial implementation. This paper describes the first successful attempt to implement a GPU-accelerated version of the NLML algorithm. The main focus of the research was the parallelization and acceleration of one computationally intensive section of the algorithm, so as to demonstrate the execution-time improvement obtained by applying parallel processing concepts on a GPU. Our results suggest the possibility of practical deployment of NLML and its variants for MRI denoising.

12.
In this paper, we analyze the potential of asynchronous relaxation methods on Graphics Processing Units (GPUs). We develop asynchronous iteration algorithms in CUDA and compare them with parallel implementations of synchronous relaxation methods on CPU- and GPU-based systems. For a set of test matrices from UFMC we investigate convergence behavior, performance, and tolerance to hardware failure. We observe that even our most basic asynchronous relaxation scheme can efficiently leverage the GPU's computing power and, despite a convergence rate lower than that of Gauss-Seidel relaxation, is still able to provide solution approximations of a given accuracy in considerably shorter time than Gauss-Seidel running on CPUs or Jacobi running on the GPU. Hence, it overcompensates for the slower convergence by exploiting the scalability and the good fit of asynchronous schemes to highly parallel GPU architectures. Further, by enhancing the most basic asynchronous approach with hybrid schemes (using multiple iterations within the "subdomain" handled by a GPU thread block), we manage not only to recover the loss of global convergence but often to accelerate convergence by up to two times, while keeping the execution time of a global iteration practically the same. Combined with the advantageous properties of asynchronous iteration methods with respect to hardware failure, this identifies the high potential of asynchronous methods for exascale computing.
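
A minimal sketch of the most basic asynchronous scheme the paper builds on: a Jacobi-type component update that reads whatever values of x are currently in global memory and writes its result back in place, with no synchronization between component updates (illustrative names; CSR storage assumed).

```cuda
// Asynchronous relaxation sweep: components of x are updated in place,
// so each thread may read a mix of old and freshly updated values.
#include <cuda_runtime.h>

__global__ void async_relax(int n, const int *row_ptr, const int *col_idx,
                            const double *vals, const double *b, double *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double diag = 1.0, sum = b[i];
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j) {
            int c = col_idx[j];
            if (c == i) diag = vals[j];
            else        sum -= vals[j] * x[c];   // possibly stale neighbor values
        }
        x[i] = sum / diag;                        // written back without a barrier
    }
}
```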

13.
We present algorithms for evaluating and performing modeling operations on NURBS surfaces using the programmable fragment processor on the Graphics Processing Unit (GPU). We extend our GPU-based NURBS evaluator to compute exact normals for either standard or rational B-spline surfaces for use in rendering and geometric modeling. We build on these calculations in our new GPU algorithms to perform standard modeling operations such as inverse evaluations, ray intersections, and surface-surface intersections on the GPU. Our modeling algorithms run in real time, enabling the user to sketch on the actual surface to create new features. In addition, the designer can edit the surface by interactively trimming it without the need for retessellation. Our GPU-accelerated algorithm for surface-surface intersection operations with NURBS surfaces can output intersection curves in the model space, as well as in the parametric spaces of both intersecting surfaces, at interactive rates. We also extend the surface-surface intersection algorithm to evaluate self-intersections in NURBS surfaces.

14.
The operational processing of remote sensing data usually requires high-performance radiative transfer model (RTM) simulations. To date, both multi-core CPUs and Graphics Processing Units (GPUs) have been used for highly intensive parallel computations. In this paper, we compare multi-core and GPU implementations of an RTM based on the discrete ordinate solution method. For the GPU implementation, the original CPU code was redesigned using NVIDIA's C-oriented Compute Unified Device Architecture (CUDA).

15.
Design and Implementation of a CUDA-Based Parallel Particle Swarm Optimization Algorithm
To address the long computation times of particle swarm optimization (PSO) when processing large amounts of data or solving large, complex problems, this work studies a fine-grained parallel PSO algorithm on the GPU. Based on an analysis of the classical PSO algorithm, combined with the widely used GPU parallel computing techniques, a parallel PSO method was designed and implemented on the Compute Unified Device Architecture (CUDA). The method uses a large number of GPU threads to process the search of individual particles in parallel, accelerating the convergence of the whole swarm, and makes full use of CUDA's built-in math libraries to keep the program stable and easy to write. Tests on several benchmark optimization functions show that, given identical convergence, the CUDA-based parallel PSO method achieves speedups of up to 90x over a sequential CPU implementation.
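
A minimal sketch of the fine-grained mapping described above: one GPU thread advances one particle's velocity and position per PSO iteration (one dimension shown; names and the per-launch RNG initialization are illustrative simplifications, and in practice RNG states would persist across iterations).

```cuda
// One PSO update step, one thread per particle.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void pso_update(int n_particles, float *pos, float *vel,
                           const float *pbest, float gbest, float w,
                           float c1, float c2, unsigned long long seed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_particles) {
        curandState s;
        curand_init(seed, i, 0, &s);              // per-thread RNG stream
        float r1 = curand_uniform(&s), r2 = curand_uniform(&s);
        vel[i] = w * vel[i]
               + c1 * r1 * (pbest[i] - pos[i])    // cognitive pull toward personal best
               + c2 * r2 * (gbest - pos[i]);      // social pull toward global best
        pos[i] += vel[i];
    }
}
```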

16.
Network coding allows network nodes to process data in addition to storing and forwarding it, and has become an effective way to increase network throughput, balance network load, and improve bandwidth utilization; however, the computational complexity of network coding severely degrades system performance. A system accelerated by many-core GPUs can fully exploit the GPU's computing power and make effective use of its memory hierarchy to optimize and accelerate network coding. Based on the CUDA architecture, this work proposes a fragment-parallel technique for accelerating encoding and a texture-cache-based parallel decoding method. Random linear coding was implemented using the proposed methods and optimized for the architecture. Experimental results show that the many-core GPU parallelization of network coding is effective, improving system performance significantly.
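
A minimal sketch of the encoding step in random linear network coding, simplified to GF(2) (pure XOR) instead of the GF(2^8) arithmetic production coders typically use; the paper's fragment-parallel layout and texture-cache decoding are not shown.

```cuda
// Random linear network coding over GF(2): each thread produces one byte
// of one coded packet. Names are illustrative.
#include <cuda_runtime.h>

__global__ void rlnc_encode_gf2(const unsigned char *packets, // n_src x len
                                const unsigned char *coeffs,  // n_out x n_src, 0/1
                                unsigned char *coded,         // n_out x len
                                int n_src, int n_out, int len) {
    int byte = blockIdx.x * blockDim.x + threadIdx.x;
    int out  = blockIdx.y;                        // one grid row per output packet
    if (byte < len && out < n_out) {
        unsigned char acc = 0;
        for (int s = 0; s < n_src; ++s)
            if (coeffs[out * n_src + s])
                acc ^= packets[s * len + byte];   // GF(2) multiply-accumulate is XOR
        coded[out * len + byte] = acc;
    }
}

// Launch: dim3 grid((len + 255) / 256, n_out); rlnc_encode_gf2<<<grid, 256>>>(...);
```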

17.
Dynamic programming is a classical approach to the cascade pump station scheduling problem, but it suffers from the "curse of dimensionality" in computation; GPU parallel computing can accelerate the repetitive calculations and improve algorithmic performance. This paper analyzes the dynamic programming formulation of cascade pump station scheduling, improves the scheduling algorithm using CUDA (Compute Unified Device Architecture), presents an implementation of the improved dynamic programming method, and compares the running time of the scheduling algorithm at different problem scales. Experimental results show that the CUDA-based improved dynamic programming implementation of cascade pump station scheduling alleviates the dimensionality problem and achieves good acceleration at larger problem scales.
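
A minimal sketch of the parallelization pattern this suggests (the paper's cost model and state discretization are not reproduced): within one dynamic-programming stage, each thread evaluates one discretized state against all states of the previous stage, while the stages themselves remain sequential on the host.

```cuda
// One DP stage: thread s computes the best cost of reaching state s from
// any state p of the previous stage. trans_cost is a placeholder cost model.
#include <cuda_runtime.h>
#include <float.h>

__global__ void dp_stage(int n_states, const float *prev_cost,
                         const float *trans_cost,   // n_states x n_states
                         float *cur_cost, int *argmin) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s < n_states) {
        float best = FLT_MAX;
        int best_p = -1;
        for (int p = 0; p < n_states; ++p) {
            float c = prev_cost[p] + trans_cost[p * n_states + s];
            if (c < best) { best = c; best_p = p; }
        }
        cur_cost[s] = best;
        argmin[s] = best_p;    // kept for backtracking the optimal schedule
    }
}
```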

18.
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In recent years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA), a parallel computing architecture, has been developed by NVIDIA to harness this performance for general-purpose computations. Here we show for the first time a possible application of the GPU to environmental studies serving as a basis for decision-making strategies. A stochastic Lagrangian particle model has been developed in CUDA to estimate the transport and transformation of radionuclides from a single point source during an accidental release. Our results show that the parallel implementation achieves typical acceleration values on the order of 80-120 times compared to a single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very small differences were found between the results obtained from GPU and CPU simulations, comparable with the effect of stochastic transport phenomena in the atmosphere. The relatively high speedup, with no additional cost to maintain this parallel architecture, could lead to wide usage of GPUs for diversified environmental applications in the near future.
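
A minimal sketch of a stochastic Lagrangian step of the kind described: each thread advances one particle by a mean wind plus a random turbulent increment. The single-level wind field, the constant sigma, and the simple random-walk closure are illustrative assumptions, not the paper's model.

```cuda
// One time step of a stochastic Lagrangian particle model, one thread
// per particle, using cuRAND for the turbulent (diffusive) increments.
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void advect_particles(int n, float *x, float *y,
                                 float u_wind, float v_wind,
                                 float sigma, float dt,
                                 unsigned long long seed, int step) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        curandState s;
        curand_init(seed, i, step, &s);   // per-particle stream, advanced per step
        // mean advection + Gaussian random walk (variance grows with dt)
        x[i] += u_wind * dt + sigma * sqrtf(dt) * curand_normal(&s);
        y[i] += v_wind * dt + sigma * sqrtf(dt) * curand_normal(&s);
    }
}
```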

19.
This work studies global stiffness matrix assembly and linear system solution in GPU-based finite element analysis. The global stiffness matrix is assembled by coloring and grouping the elements and is stored in compressed sparse row (CSR) format, and the resulting large sparse linear system is solved with the preconditioned conjugate gradient method. The program was implemented on the CUDA (Compute Unified Device Architecture) platform and tested on plane and spatial problems of elasticity using a GT430 GPU. The results show peak speedups of 11.7x for stiffness matrix assembly and 8x for the solver.
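
A minimal sketch of the coloring idea, with illustrative names: elements that share no node are grouped into one color, so all elements of a color can scatter their local stiffness contributions into the CSR value array without write conflicts. The local matrices and the CSR slot mapping are assumed precomputed, with one degree of freedom per node for simplicity.

```cuda
// Assemble one color group of 3-node elements into the global CSR matrix.
// Because elements within a color share no node, no two threads write the
// same slot, so no atomics are needed.
#include <cuda_runtime.h>

__global__ void assemble_color(int n_elems_in_color, const int *elem_ids,
                               const double *local_K,   // n_elems x 9 local matrices
                               const int *csr_slot,     // n_elems x 9 -> index into vals
                               double *vals) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n_elems_in_color) {
        int e = elem_ids[t];
        for (int k = 0; k < 9; ++k)   // scatter the 3x3 local stiffness matrix
            vals[csr_slot[e * 9 + k]] += local_K[e * 9 + k];
    }
}
```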

20.
A complete and efficient CUDA-sharing solution for HPC clusters
In this paper we detail the key features, architectural design, and implementation of rCUDA, an advanced framework to enable remote and transparent GPGPU acceleration in HPC clusters. rCUDA allows GPUs to be decoupled from nodes, forming pools of shared accelerators, which brings enhanced flexibility to cluster configurations. This opens the door to configurations with fewer accelerators than nodes, and also permits a single node to exploit the whole set of GPUs installed in the cluster. In our proposal, CUDA applications can seamlessly interact with any GPU in the cluster, independently of its physical location. Thus, GPUs can be either distributed among compute nodes or concentrated in dedicated GPGPU servers, depending on the cluster administrator's policy. This proposal leads to savings not only in space but also in energy, acquisition, and maintenance costs. The performance evaluation in this paper, with a series of benchmarks and a production application, clearly demonstrates the viability of the proposal. Concretely, experiments with the matrix-matrix product reveal excellent performance compared with regular executions on the local GPU; on a much more complex application, the GPU-accelerated LAMMPS, we attain up to an 11x speedup employing 8 remote accelerators from a single node, with respect to a 12-core CPU-only execution. GPGPU service interaction in compute nodes, remote acceleration in dedicated GPGPU servers, and the data transfer performance of similar GPU virtualization frameworks are also evaluated.
