Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
The radiosity method is usually employed for the rendering of highly realistic synthetic images. In this paper we present an implementation of the Monte Carlo radiosity algorithm on the GPU using CUDA. Our proposal is based on the partition of the scene into sub-scenes to be processed in parallel to exploit the graphics card structure. The convex partition method employed permits the exploitation of data locality and the optimization of the ray shooting procedure due to the minimization of the number of objects to be tested in the intersection calculation. The results are good in terms of execution times, increasing the flexibility of previous solutions and demonstrating that the GPU can outperform the CPU results even for non-regular algorithms.

2.
For microdosimetric calculations, event-by-event Monte Carlo (MC) methods are considered the most accurate. The main shortcoming of those methods is the extensive requirement for computational time. In this work we present an event-by-event MC code of low projectile energy electron and proton tracks for accelerated microdosimetric MC simulations on a graphics processing unit (GPU). Additionally, a hybrid implementation scheme was realized by employing OpenMP and CUDA in such a way that both the GPU and the multi-core CPU were utilized simultaneously. The two implementation schemes have been tested and compared with the sequential single-threaded MC code on the CPU. Performance was compared in terms of speedup for a set of benchmarking cases of electron and proton tracks. A maximum speedup of 67.2 was achieved for the GPU-based MC code, while a further speedup improvement of up to 20% was achieved for the hybrid approach. The results indicate the capability of our CPU–GPU implementation for accelerated MC microdosimetric calculations of both electron and proton tracks without loss of accuracy.

3.
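Plane-wave pseudopotential density functional theory (PWP-DFT) calculation is the most widely used method in computational materials science, and the projection calculation is an important part of the self-consistent iteration in the PWP-DFT method. Because the projection potential calculation has become the bottleneck for accelerating the software, a graphics processing unit (GPU) acceleration algorithm for this part is proposed. It takes the characteristics of the GPU into account: 1) a new parallel mechanism is used to compute the nonlocal projection potential; 2) the data distribution structure is redesigned; 3) memory usage is reduced; 4) a method is proposed to resolve the data dependency problem in the algorithm. A speedup of 18 to 57 times is obtained, reducing each molecular dynamics step to 12 s. The measured times of this module on the GPU platform are analyzed in detail, and the computational bottleneck of the algorithm on GPU clusters is discussed.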

4.
With the development of steganography, high-dimensional feature spaces are required to detect sophisticated steganographic schemes. However, the huge time cost prevents the practical deployment of high-dimensional features for steganalysis. SRM and DCTR are important steganalysis feature sets in the spatial domain and the JPEG domain, respectively. It is necessary to accelerate the extraction of DCTR and SRM to make them more usable in practice, especially for some real-time applications. In this paper, both DCTR and SRM are implemented on the GPU to exploit its parallel power, and some optimization methods are presented. For the implementation of DCTR, we first utilize the separability and symmetry of the two-dimensional discrete cosine transform in decompression and convolution. Then, in order to make phase-aware histograms favorable for parallel GPU processing, we convert them into ordinary 256-dimensional histograms. For SRM, in computing residuals, we specify the computation sequence and split the inseparable two-dimensional kernel into several row vectors. When computing the four-dimensional co-occurrences, we convert them into one-dimensional histograms, which are more suitable for parallel computing. The experimental results show that the proposed methods can greatly accelerate the extraction of DCTR and SRM, especially for images of large size. Our methods can be applied to real-time steganalysis systems.
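The separability of the two-dimensional DCT mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' GPU code: it is a plain-Python illustration, assuming the orthonormal DCT-II, and the function names (`dct_1d`, `dct_2d_separable`, `dct_2d_direct`) are hypothetical. It shows that applying a 1-D transform to the rows and then to the columns gives the same result as the direct double sum.

```python
import math

def _scale(k, n):
    # Orthonormal DCT-II scaling factor (an assumed normalization).
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct_1d(v):
    """Orthonormal 1-D DCT-II of a sequence."""
    n = len(v)
    return [
        _scale(k, n)
        * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        for k in range(n)
    ]

def dct_2d_separable(block):
    """2-D DCT as row transforms followed by column transforms."""
    rows = [dct_1d(r) for r in block]                 # transform along each row
    cols = [dct_1d(list(c)) for c in zip(*rows)]      # transform along each column
    return [list(r) for r in zip(*cols)]              # transpose back to [u][v]

def dct_2d_direct(block):
    """2-D DCT as a direct double sum, for comparison only."""
    n, m = len(block), len(block[0])
    out = [[0.0] * m for _ in range(n)]
    for u in range(n):
        for v in range(m):
            s = 0.0
            for x in range(n):
                for y in range(m):
                    s += (block[x][y]
                          * math.cos(math.pi * (2 * x + 1) * u / (2 * n))
                          * math.cos(math.pi * (2 * y + 1) * v / (2 * m)))
            out[u][v] = _scale(u, n) * _scale(v, m) * s
    return out
```

For an N×N block the separable form needs on the order of N³ multiplications instead of N⁴, and each row or column transform is independent of the others, which is what makes it a natural fit for parallel hardware.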

5.
The one‐step leapfrog alternating‐direction‐implicit finite‐difference time‐domain (ADI‐FDTD) method, free from the Courant‐Friedrichs‐Lewy (CFL) stability condition and sub‐step computations, is efficient when dealing with fine grid problems. However, the solution of the numerous tridiagonal systems still imposes a great computational burden and makes the method hard to execute in parallel. In this paper, we propose an efficient graphics processing unit (GPU)‐based parallel implementation of the one‐step leapfrog ADI‐FDTD for the far‐field EM scattering simulation of objects. We present and analyze the calculation area division and thread allocation schemes, and propose a data layout transformation of the z components to achieve a better memory access pattern, which is a key factor affecting GPU execution efficiency. A simulation experiment is carried out to verify the accuracy and efficiency of the GPU‐based implementation. The simulation results show good agreement between the proposed one‐step leapfrog ADI‐FDTD method and Yee's FDTD in solving the far‐field scattering problem, and substantial performance gains were obtained when the method was accelerated using GPU technology.
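To see why tridiagonal systems are awkward to parallelize, it helps to look at the standard sequential solver. The sketch below is the textbook Thomas algorithm in plain Python; the abstract does not say which solver the authors use, so this is only an illustration, and the function name and argument layout are assumptions. Each step of both loops depends on the result of the previous step, so a single system cannot be split across threads without changing the algorithm; parallelism has to come from solving many independent systems at once.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix
    is assumed to be diagonally dominant (an assumption of this sketch).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: step i needs c[i-1] and d[i-1].
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution: step i needs x[i+1].
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```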

6.
This work assesses the use of GPU computation in dynamic texture segmentation under the mixture of dynamic textures (MDT) model. In this generative video model, the observed texture is a time-varying process driven by a hidden state process. The use of mixtures in this model allows simultaneous handling of different visual processes. The use of GPU computing is growing in high-performance applications, but adapting existing algorithms so that they benefit from it is not an easy task. In this paper, we made two implementations, one on the CPU and the other on the GPU, of a known segmentation algorithm based on MDT. In the MDT algorithm, there is a matrix inversion process that is highly demanding in terms of computing power. We compare the performance gain obtained by porting only this matrix inversion process to the GPU with the gain obtained by porting the whole MDT segmentation process. We also study real-time motion segmentation performance by separating the learning part of the algorithm from the segmentation part, leaving the learning stage as an off-line process and keeping the segmentation as an online process. The results of the performance analyses allow us to decide the cases in which the full GPU implementation of the motion segmentation process is worthwhile.

7.
The numerical solution of two-layer shallow water systems is required to accurately simulate stratified fluids, which are ubiquitous in nature: they appear in atmospheric flows, ocean currents, oil spills, etc. Moreover, the implementation of the numerical schemes to solve these models in realistic scenarios imposes huge demands on computing power. In this paper, we tackle the acceleration of these simulations on triangular meshes by exploiting the combined power of several CUDA-enabled GPUs in a GPU cluster. For that purpose, an improvement of a path-conservative Roe-type finite volume scheme that is especially suitable for GPU implementation is presented, and a distributed implementation of this scheme, which uses CUDA and MPI to exploit the potential of a GPU cluster, is developed. This implementation overlaps MPI communication with CPU–GPU memory transfers and GPU computation to increase efficiency. Several numerical experiments, performed on a cluster of modern CUDA-enabled GPUs, show the efficiency of the distributed solver.

8.
In this paper, we present a graphics processing unit (GPU)‐based parallel implementation of visibility calculation from multiple viewpoints on raster terrain grids. Two levels of parallelism are introduced in the GPU kernels: parallel traversal of visibility rays from a single viewpoint and parallel processing of viewpoints. The obtained visibility maps are combined in parallel using the selected logical operator. A comparison with a multi‐threaded CPU implementation is performed to establish the expected speed‐ups of viewshed construction when the source and destination types are sets of scattered locations, paths, or regions. The results demonstrate that, using the GPU, an acceleration of an order of magnitude can be achieved on average with both point sampling and bilinear filtering of the elevation map. Copyright © 2011 John Wiley & Sons, Ltd.

9.
One of the main challenges in real-time rendering is to enable more and more effects that were previously available in offline rendering only. An important effect among these is physically correct reflections of arbitrary objects in curved reflectors like windshields. In this paper we propose fragment tracing on the GPU as a solution to interactively realizing this effect for large scenes as employed in industrial applications. For each rasterized fragment, a ray is traced through an octree representing the original geometry and surface material. By introducing a GPU implementation of an octree traversal, for the first time hierarchical data structures can efficiently be used on the GPU. As a result, the approach allows both handling of large geometries such as those employed in virtual prototyping and accurate rendering. Several examples show the generality and achievable rendering quality of our method.

10.
This work parallelized a widely used structural analysis platform called OpenSees using graphics processing units (GPUs). This paper presents task decomposition diagrams with data flow and the sequential and parallel flowcharts for element matrix/vector calculations. It introduces a Bulk Model to ease the parallelization of the element matrix/vector calculations. An implementation of this model for shell elements is presented. Three versions of the Bulk Model (sequential, OpenMP multi-threaded, and CUDA GPU parallelized) were implemented in this work. Nonlinear dynamic analyses of two building models subjected to a tri-axial earthquake were tested. The results demonstrate speedups higher than four on a 4-core system, while the GPU parallelism achieves speedups higher than 7.6 on a single GPU device in comparison to the original sequential implementation.

11.
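In the field of object recognition, the deformable part model (DPM) algorithm currently achieves the highest accuracy in human detection. To address the heavy computational load of DPM, a parallel solution based on the graphics processing unit (GPU) is proposed. Using the GPU programming model OpenCL, the implementation details of the entire DPM algorithm are redesigned and reimplemented in a parallel fashion, and the memory model and thread allocation of the implementation are optimized. A comparison between the OpenCV library implementation and the GPU reimplementation shows that the execution efficiency of the algorithm improves nearly 8-fold while the detection quality is preserved.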

12.
Existing formats for Sparse Matrix–Vector Multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision.
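For readers unfamiliar with the coordinate (COO) layout that SCOO builds on, here is a minimal sequential sketch of COO SpMV. It is plain COO, not the sliced SCOO format from the paper, whose slicing scheme the abstract does not describe; the function name `spmv_coo` is hypothetical. The accumulation line marks where a GPU version, with one thread per nonzero, would need an atomic add, since several nonzeros can target the same output row.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """Compute y = A @ x for a sparse matrix A stored in COO form.

    rows[k], cols[k], vals[k] describe the k-th nonzero of A.
    """
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        # Several nonzeros may share the same row r, so a parallel
        # version would have to make this update atomic.
        y[r] += v * x[c]
    return y
```

For example, the matrix [[1, 0, 2], [0, 3, 0]] is stored as `rows = [0, 0, 1]`, `cols = [0, 2, 1]`, `vals = [1.0, 2.0, 3.0]`.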

13.
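FFT and convolution algorithms for GPU image processing and their performance analysis   Total citations: 2 (self-citations: 0, citations by others: 2)
Image filters are an important component of most current image processing software. Image filtering, however, is computationally very demanding, so using the GPU (programmable graphics processing unit) to accelerate it is a good way to improve the interactive performance of image processing software. The implementation of two image processing tools on the GPU is discussed, the fast Fourier transform (FFT) in the frequency domain and convolution in the spatial domain, and their performance on the GPU is evaluated. Convolution generally performs better than the FFT; the cases in which the FFT performs better are also discussed.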

14.
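Normalized Differential Vegetation Index (NDVI) extraction algorithms accelerated on the GPU usually adopt the GPU multi-threaded parallel model, in which data transfers between weakly dependent computations and between the CPU and GPU take considerable time, limiting further improvement of the speedup. To address these problems, and based on the characteristics of the NDVI extraction algorithm, this paper proposes an NDVI extraction algorithm based on a GPU multi-stream concurrent parallel model. Through CUDA streams and the Hyper-Q feature, the multi-stream concurrent parallel model allows data transfers to overlap with weakly dependent computations, and weakly dependent computations to overlap with one another, further increasing the parallelism of the algorithm and the utilization of GPU resources. The paper first optimizes the NDVI extraction algorithm with the GPU multi-threaded parallel model and decomposes the optimized computation to identify the parts that contain data transfers and weakly dependent computations. Next, these parts are restructured and optimized with the GPU multi-stream concurrent parallel model so that weakly dependent computations overlap with one another and with data transfers. Finally, remote sensing images captured by the Gaofen-1 satellite are used as experimental data to validate the two GPU-based NDVI extraction algorithms. The experimental results show that, compared with the conventional NDVI extraction algorithm based on the GPU multi-threaded parallel model, the proposed algorithm achieves an average speedup of about 1.5 times for images larger than 12000 × 12000 pixels, and a speedup of about 260 times over the serial extraction algorithm, demonstrating better acceleration and parallelism.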
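The per-pixel computation behind NDVI is simple, which is why data transfer rather than arithmetic tends to dominate. The index is commonly defined as NDVI = (NIR − Red) / (NIR + Red), where NIR and Red are the near-infrared and red band values. The sketch below is a serial reference in plain Python, not the paper's CUDA code; the function names and the convention of returning 0 when both bands are zero are assumptions made for this illustration.

```python
def ndvi(nir, red):
    """NDVI for one pixel from its near-infrared and red band values."""
    total = nir + red
    if total == 0:
        return 0.0  # assumed convention for empty pixels
    return (nir - red) / total

def ndvi_image(nir_band, red_band):
    """Apply NDVI pixel by pixel to two equally sized 2-D bands."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Every pixel is independent of every other pixel, so the image can be cut into tiles whose transfers and computations are overlapped, which is the opportunity the multi-stream model in the abstract exploits.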

15.
Modern GPUs excel in parallel computations, so they are an interesting target for matrix transformations such as the DCT, a fundamental part of MPEG video coding algorithms. For a system that encodes synthetic video (e.g., computer-generated frames), this approach becomes even more appealing, since the images to encode are already on the GPU, eliminating the cost of transferring raw video from the CPU to the GPU. However, after a raw frame has been transformed and quantized by the GPU, the resulting coefficients must be reordered, entropy encoded and framed into the resulting MPEG bitstream. These last steps are essentially sequential, and their straightforward GPU implementation is inefficient compared to CPU-based implementations. We present different approaches to implementing part of these steps on the GPU, aiming for better usage of the memory bus and compensating for the suboptimal use of the GPU with gains in transfer time. We analyze three approaches to performing the zigzag scan and Huffman coding that combine the GPU and CPU, and two approaches to assembling the results into the actual output bitstream, in GPU memory and in CPU memory. Our experiments show that reducing the amount of data transferred from the GPU to the CPU, by implementing the last sequential compression steps on the GPU, and using a parallel fast scan implementation of the zigzag scanning improve the overall performance of the system. Savings in transfer time outweigh the extra cost incurred on the GPU.
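The zigzag scan mentioned in the abstract reorders a block of quantized coefficients along anti-diagonals so that low-frequency values come first. The sketch below is a sequential plain-Python illustration of that ordering, not the parallel scan from the paper; the function names are hypothetical. Because the order depends only on the block size, it can be computed once and then used as a lookup table, so each coefficient can be moved independently.

```python
def zigzag_order(n=8):
    """Return the (row, col) visiting order of the zigzag scan for an n x n block."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes one anti-diagonal
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        # Even anti-diagonals are walked from bottom-left to top-right,
        # odd ones from top-right to bottom-left.
        if s % 2 == 0:
            diag.reverse()
        order.extend(diag)
    return order

def zigzag_scan(block):
    """Flatten a square block into a list following the zigzag order."""
    return [block[i][j] for i, j in zigzag_order(len(block))]
```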

16.
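Conventional rendering techniques use the CPU for color computation and blending of data volumes, which is inefficient and slow when rendering large-scale data volumes. To address this, a method that uses the GPU for color computation and blending of data volumes is proposed. The method fully exploits the powerful parallel processing capability of the GPU: the data to be rendered are submitted to the GPU as textures, and the GPU renders them directly after performing the necessary color interpolation and blending. Experimental results show that the method can fuse multiple attributes into one, combining the strengths of each attribute, supports comprehensive evaluation of oil and gas reservoirs, improves the accuracy of reservoir analysis and interpretation, and renders quickly thanks to hardware acceleration.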

17.
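[Objective] Direct numerical simulation (DNS) of hypersonic turbulence demands high spatial and temporal resolution and is computationally very expensive. The excessive computational cost and long run times are major reasons why DNS is difficult to apply widely in engineering. To speed up the computation, the authors designed and developed OpenCFD-SCU, a high-performance computational fluid dynamics program for the CPU/GPU heterogeneous system architecture (HSA). [Methods] The program is based on OpenCFD-SC, a high-accuracy finite difference solver previously developed by the authors, and was obtained by porting it to GPU systems and optimizing it. The computational part of the GPU program is written in CUDA, ensuring that all arithmetic operations are performed on the GPU. [Results] Using the GPU program OpenCFD-SCU, a direct numerical simulation of boundary-layer transition over a blunt cone at a freestream Mach number of 6 and a 6° angle of attack was carried out, yielding the spatiotemporal evolution of the flow field during transition. For this case, OpenCFD-SCU achieved a 60-fold speedup over the CPU program OpenCFD-SC (one GPU card versus one CPU core), greatly accelerating the DNS computation. [Conclusions] In the future, more hypersonic turbulence simulations are expected to be carried out on GPUs.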

18.

Deep learning techniques based on Convolutional Neural Networks (CNNs) are extensively used for the classification of hyperspectral images. These techniques have a high computational cost. In this paper, a GPU (Graphics Processing Unit) implementation of a spatial-spectral supervised classification scheme based on CNNs and applied to remote sensing datasets is presented. In particular, two deep learning libraries, Caffe and CuDNN, are used and compared. In order to achieve an efficient mapping onto the GPU, different techniques and optimizations have been applied. The implemented scheme comprises Principal Component Analysis (PCA) to extract the main features, a patch extraction around each pixel to take the spatial information into account, one convolutional layer for processing the spectral information, and fully connected layers to perform the classification. To improve the accuracy of the initial GPU implementation, a second convolutional layer has been added. High speedups are obtained together with competitive classification accuracies.


19.

The Louvain community detection algorithm is a hierarchical clustering method that addresses an NP-hard problem. Its execution time when finding communities in large graphs is therefore a challenge. Parallelization is an effective way to reduce Louvain's execution time. In this paper, we propose an adaptive CUDA Louvain method (ACLM) algorithm that benefits from the graphics processing unit (GPU). ACLM uses the shared memory of the GPU, as well as the optimal number of threads in the GPU blocks. These features minimize parallelization overhead and accelerate the calculation of modularity parameters. The proposed algorithm allocates threads to each block based on the number of required streaming multiprocessors (SMs) and warps on the GPU. The implementation results show that ACLM improves execution time by 77% compared to the competing method on large graph benchmarks.
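The quantity the Louvain method tries to maximize is modularity. For an unweighted, undirected graph with m edges it can be written as Q = Σ_c [ L_c / m − (d_c / 2m)² ], where L_c is the number of edges inside community c and d_c is the sum of the degrees of its nodes. The sketch below evaluates this formula in plain Python. It is not the ACLM kernel from the paper; the function name and the edge-list input format are assumptions, and self-loops are assumed to be absent.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    intra = {}   # number of edges with both ends in the community
    degree = {}  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())
```

For two triangles joined by a single edge, placing each triangle in its own community gives Q = 5/14, while placing all six nodes in one community gives Q = 0.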


20.
The Building-Cube Method (BCM), based on equally-spaced Cartesian meshes, has been proposed as a next-generation CFD method. Due to the equally-spaced meshes, it is well suited for highly parallel computation. This paper proposes a parallel implementation scheme of BCM on a GPU cluster system, which needs efficient hierarchical parallel processing to exploit the potential of the cluster system. The proposed scheme employs the Red-Black SOR method for the pressure calculations, which are the most time-consuming part of BCM, to obtain massive data parallelism. By exploiting the coarse-grain and fine-grain parallelism of BCM, the proposed scheme hierarchically assigns equally-divided tasks to the GPU cluster system. Furthermore, to exploit the computational power of the GPUs in the cluster system, the proposed scheme employs efficient data management such as coalesced data transfer and data reuse in on-chip memory. Experimental results show that the single-GPU implementation can achieve about three times higher performance than the single-CPU one. Moreover, the multiple-GPU implementation can achieve almost ideal scalability. Finally, the possibility of further accelerating not only the pressure calculation but also the whole BCM is discussed.
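The Red-Black SOR method named in the abstract can be sketched on a small model problem. The code below is a plain-Python illustration, not the BCM pressure solver: it assumes a 2-D Poisson equation ∇²u = f on a uniform grid with a 5-point stencil and fixed (Dirichlet) boundary values, and the function name, the relaxation factor `omega=1.5` and the sweep count are choices made for this example. Cells are colored like a checkerboard; every red cell depends only on black neighbors and vice versa, so all cells of one color can be updated at the same time.

```python
def red_black_sor(u, f, h, omega=1.5, sweeps=200):
    """Relax the interior of u in place for laplacian(u) = f.

    u: 2-D list whose boundary entries hold the fixed boundary values.
    f: 2-D list of right-hand-side values, same shape as u.
    h: grid spacing.
    """
    n, m = len(u), len(u[0])
    for _ in range(sweeps):
        for color in (0, 1):  # 0 = "red" cells, 1 = "black" cells
            for i in range(1, n - 1):
                for j in range(1, m - 1):
                    if (i + j) % 2 != color:
                        continue
                    # Gauss-Seidel value from the four neighbors,
                    # which all have the opposite color.
                    gs = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                 + u[i][j - 1] + u[i][j + 1]
                                 - h * h * f[i][j])
                    # Over-relax toward the Gauss-Seidel value.
                    u[i][j] += omega * (gs - u[i][j])
    return u
```

In a parallel version, the two inner loops for a given color become one data-parallel pass, with a synchronization between the red pass and the black pass.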


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号