Similar Articles
20 similar articles were found.
1.
General-purpose computation on graphics processing units (GPUs) is rapidly entering various scientific and engineering fields, and many applications are being ported to GPUs for better performance. Various optimizations, frameworks, and tools are being developed for effective GPU programming. As part of communication and computation optimizations for GPUs, this paper proposes and implements an optimization method called kernel coalesce that further enhances GPU performance and also reduces CPU-to-GPU communication time. With the kernel coalesce methods proposed in this paper, kernel launch overheads are reduced by coalescing concurrent kernels, and data transfers are reduced when intermediate data is generated and used among kernels. Computation on the device (GPU) is optimized by tuning the number of blocks and threads launched to the architecture. The block-level kernel coalesce method yields a prominent performance improvement on devices without support for concurrent kernels. The thread-level kernel coalesce method is better than the block-level method when the grid structure (i.e., the number of blocks and threads) is not optimal for the device architecture, leading to underutilization of device resources. Both methods perform similarly when the number of threads per block is approximately the same across kernels and the total number of threads across blocks fills the streaming multiprocessor (SM) capacity of the device. The thread multi-clock-cycle coalesce method can be chosen when more than two concurrent kernels are to be coalesced that together or individually exceed the thread capacity of the device. If the kernels have lightweight thread computations, the multi-clock-cycle kernel coalesce method outperforms the thread-level and block-level methods. If the kernels to be coalesced combine compute-intensive and memory-intensive kernels, warp interleaving gives higher device occupancy and improves performance. The multi-clock-cycle kernel coalesce method applied to micro-benchmark 1 in this paper yielded 10-40% and 80-92% improvement over separate kernel launches, without and with shared input and intermediate data among the kernels, respectively, on a Fermi-architecture device (GTX 470). A nearest neighbor (NN) kernel from the Rodinia benchmark coalesced with itself using the thread-level kernel coalesce method and warp interleaving gave 131.9% and 152.3% improvement over separate kernel launches and 39.5% and 36.8% improvement over the block-level kernel coalesce method, respectively. Copyright © 2013 John Wiley & Sons, Ltd.
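The paper's methods target CUDA on Fermi hardware; as a loose illustration only (Python via Numba's CUDA backend, an assumption, not the authors' code), the sketch below fuses two independent element-wise kernels into a single launch, with each thread selecting its work item from a combined index space — the core idea behind thread-level coalescing.

```python
import numpy as np
from numba import cuda

@cuda.jit
def coalesced(a, b, n_a, n_b):
    i = cuda.grid(1)            # one global index space covers both kernels
    if i < n_a:                 # threads [0, n_a) do the work of "kernel 1"
        a[i] *= 2.0
    elif i < n_a + n_b:         # threads [n_a, n_a + n_b) do "kernel 2"
        b[i - n_a] += 1.0

a = cuda.to_device(np.ones(1024, dtype=np.float32))
b = cuda.to_device(np.ones(2048, dtype=np.float32))
total = 1024 + 2048
coalesced[(total + 255) // 256, 256](a, b, 1024, 2048)  # single launch
```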

2.
In this work, we propose the use of the neural gas (NG), a neural network with an unsupervised Competitive Hebbian Learning (CHL) rule, to develop a reverse-engineering process. This is a simple and accurate method for reconstructing objects from point clouds obtained from multiple overlapping views using low-cost sensors. In contrast to other methods that may need several stages, including downsampling, noise filtering and many other tasks, the NG automatically obtains the 3D model of the scanned objects. To demonstrate the validity of our proposal we tested our method with several models, studied the parameterization of the network by computing the quality of representation, and compared the results with other neural methods such as growing neural gas and Kohonen maps, as well as classical methods such as Voxel Grid. We also reconstructed models acquired by low-cost sensors that can be used in virtual and augmented reality environments for redesign or manipulation purposes. Since the NG algorithm has a high computational cost, we propose its acceleration: we have redesigned and implemented the NG learning algorithm to fit Graphics Processing Units using CUDA. A speed-up of 180× over the sequential CPU version is obtained.
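For reference, one adaptation step of the classical neural gas rule — the inner loop that such a GPU port parallelizes — looks as follows in NumPy (a sketch; the learning rate and neighbourhood decay are illustrative, not the paper's values):

```python
import numpy as np

def ng_step(weights, x, eps=0.1, lam=2.0):
    """One NG update. weights: (n_units, dim) codebook; x: (dim,) sample."""
    dists = np.linalg.norm(weights - x, axis=1)
    ranks = np.argsort(np.argsort(dists))    # rank 0 = closest unit
    h = np.exp(-ranks / lam)[:, None]        # rank-based neighbourhood function
    return weights + eps * h * (x - weights)
```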

3.
Computing on graphics processors is perhaps one of the most important developments in computational science to happen in decades. Not since the arrival of the Beowulf cluster, which combined open-source software with commodity hardware to truly democratize high-performance computing, has the community been so electrified. As then, the opportunity comes with challenges. Formulating scientific algorithms to take advantage of the performance offered by the new architecture requires rethinking core methods. Here, we have tackled fast summation algorithms (the fast multipole method and the fast Gauss transform) and applied algorithmic redesign to attain performance on GPUs. The progression of performance improvements attained illustrates the exercise of formulating algorithms for the massively parallel architecture of the GPU. The end result is GPU kernels that run at over 500 Gop/s on one NVIDIA Tesla C1060 card, thereby coming close to the practical peak.
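As a baseline for what the fast algorithms accelerate, the direct O(NM) Gauss transform can be written in a few lines of NumPy (a sketch of the reference computation; the paper's GPU kernels restructure this entirely):

```python
import numpy as np

def direct_gauss_transform(sources, targets, weights, h):
    """G(y_j) = sum_i w_i * exp(-|y_j - x_i|^2 / h^2), computed directly."""
    d2 = ((targets[:, None, :] - sources[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / h**2) @ weights
```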

4.
A vertex reconstruction algorithm based on the Gaussian-sum filter (GSF) was developed and implemented in the framework of the CMS reconstruction program. While linear least-squares estimators are optimal when all observation errors are Gaussian distributed, the GSF offers a better treatment of non-Gaussian distributions of track-parameter errors when these are modeled by Gaussian mixtures. The algorithm has been verified and evaluated with simulated data, and the results are compared to the Kalman filter and to an adaptive vertex estimator.
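A minimal sketch of one GSF measurement update for a scalar linear-Gaussian mixture (assumed simplified notation; the CMS implementation works on full track-parameter vectors and also collapses the mixture): each component receives a standard Kalman update, and its weight is rescaled by the measurement likelihood.

```python
import numpy as np

def gsf_update(means, variances, weights, z, r):
    """Scalar GSF update. means/variances/weights: mixture arrays; z: measurement; r: noise variance."""
    s = variances + r                            # per-component innovation variances
    k = variances / s                            # per-component Kalman gains
    new_means = means + k * (z - means)
    new_vars = (1.0 - k) * variances
    lik = np.exp(-0.5 * (z - means) ** 2 / s) / np.sqrt(2.0 * np.pi * s)
    w = weights * lik
    return new_means, new_vars, w / w.sum()      # renormalized mixture weights
```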

5.
Computational Visual Media - The concept of using multiple deep images, under a variety of different names, has been explored as a possible acceleration approach for finding ray-geometry...

6.
Pattern Analysis and Applications - Accurate brain tumor segmentation plays a significant role in the area of radiotherapy diagnosis and in the proper treatment for brain tumor detection...

7.
8.
Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
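The matrix-free idea can be illustrated with a small NumPy sketch (a 1D three-point stencil stands in for the paper's 3D anisotropic operator; preconditioning and loop fusion are omitted): the operator is recomputed on the fly at every application, so the matrix A is never stored.

```python
import numpy as np

def apply_stencil(x):
    """y = A x for the 1D Poisson stencil [-1, 2, -1] (Dirichlet BCs)."""
    y = 2.0 * x
    y[1:] -= x[:-1]
    y[:-1] -= x[1:]
    return y

def cg_matrix_free(b, tol=1e-8, max_iter=500):
    """Unpreconditioned CG where A is only ever applied, never assembled."""
    x = np.zeros_like(b)
    r = b - apply_stencil(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        ap = apply_stencil(p)
        alpha = rs / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```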

9.
Objective: To address the over-segmentation of the grey-level watershed algorithm and the difficulty of applying it directly to colour image segmentation, an adaptive gradient-reconstruction watershed segmentation algorithm is proposed. Method: The colour image is first reduced in dimensionality using PCA; the gradient image of the reduced image is then computed and corrected with an adaptive reconstruction algorithm; finally, the watershed transform is applied to the optimized gradient image to segment the colour image correctly. Results: Segmentation quality was evaluated using a performance index combining colour distance, mean squared error and region information, together with the number of segmented regions. In experiments on various types of colour images, the proposed algorithm segmented the images correctly while achieving high performance indices. Compared with existing watershed segmentation algorithms, the proposed method effectively removes spurious minima and reduces the number of minima in the image, thereby resolving the over-segmentation problem and improving segmentation quality. Conclusion: The algorithm is widely applicable and highly robust.
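A hedged sketch of this pipeline using scikit-image and scikit-learn building blocks (h-minima suppression stands in for the paper's adaptive gradient reconstruction, which is not reproduced here):

```python
import numpy as np
from sklearn.decomposition import PCA
from skimage.filters import sobel
from skimage.measure import label
from skimage.morphology import h_minima
from skimage.segmentation import watershed

def pca_watershed(rgb, h=0.05):
    """rgb: (H, W, 3) float image in [0, 1]; returns integer region labels."""
    flat = rgb.reshape(-1, 3)
    gray = PCA(n_components=1).fit_transform(flat).reshape(rgb.shape[:2])
    grad = sobel(gray)                   # gradient image of the reduced image
    markers = label(h_minima(grad, h))   # suppress shallow (spurious) minima
    return watershed(grad, markers)
```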

10.
To address system-model errors and the uncertainty of projection data in positron emission tomography (PET) reconstruction, a robust adaptive Kalman filtering method based on a state-space framework is proposed. A state equation is built from pharmacokinetic prior information and combined with the PET measurement equation to form a state-space model. After virtual noise is introduced to represent errors in the system matrix, the robust adaptive Kalman filter estimates the unknown system noise and observation noise while reconstructing the PET activity concentration. Experimental results show that the algorithm is more robust than the conventional maximum-likelihood and filtered back-projection methods and is suitable for practical PET systems.
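For reference, one predict/update cycle of the underlying linear Kalman filter in NumPy (a baseline sketch; the paper's virtual-noise terms and online noise estimation are omitted):

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One cycle of x_{k} = F x_{k-1} + w, z_k = H x_k + v."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q               # predicted state covariance
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```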

11.
Electrical capacitance tomography (ECT) is a process-imaging technique based on capacitance sensing; it is non-invasive, inexpensive, easy to install and offers good real-time performance. Image reconstruction, the key technology of an ECT system, essentially infers the phase distribution inside a pipeline from the spatial distribution of the permittivity. To address the nonlinearity and ill-posedness of the reconstruction problem, an ECT image-reconstruction algorithm based on a BP neural network is adopted, and median filtering is introduced to enhance the reconstructed images. Simulation results show that the algorithm achieves effective reconstruction and satisfactory enhancement, greatly improving the quality of the reconstructed images; it is an effective ECT image-reconstruction algorithm.
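A hedged CPU sketch of the scheme, with scikit-learn's MLP standing in for the paper's BP network and a median filter as the enhancement step (grid size and layer width are illustrative assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from scipy.ndimage import median_filter

def train_ect_net(capacitances, images):
    """capacitances: (N, n_meas); images: (N, H*W) training pairs."""
    net = MLPRegressor(hidden_layer_sizes=(128,), max_iter=2000)
    net.fit(capacitances, images)
    return net

def reconstruct(net, c, grid=(32, 32)):
    img = net.predict(c[None, :]).reshape(grid)
    return median_filter(img, size=3)   # median-filter enhancement step
```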

12.
Optical tomography is an ill-posed reconstruction problem, and suitable prior information must be incorporated to reduce this ill-posedness. Most current reconstructions are based on the diffusion equation, which fails in some cases. Gradient-based iterative reconstruction built directly on the Boltzmann transport model, with image entropy as the regularization term, is an effective alternative; in this method, computing the gradient is the main difficulty. A gradient-tree-based solution is therefore proposed, which reduces the ill-posedness of optical tomographic reconstruction and reconstructs optical tomographic images effectively.

13.
The age of Internet technology has introduced new types of attacks against assets that did not exist before. Databases, which represent information assets, are subject to attacks with malicious intentions, such as stealing sensitive data, deleting records or violating the integrity of the database. Many countermeasures have been designed and implemented to protect databases and the information they host from attacks. While preventive measures can be overcome, and detection measures may detect an attack only after damage has occurred, there is a need for a recovery algorithm that restores the database to its correct state before the attack. Numerous damage assessment and recovery algorithms have been proposed by researchers. In this work, we present an efficient lightweight detection and recovery algorithm based on the matrix approach that can be used to recover from malicious attacks. We compare our algorithm with other approaches and show the performance results.

14.
15.
Multimedia Tools and Applications - Low-dose Computed Tomography (CT) reconstruction techniques have been implemented to minimize the X-ray radiation in a human body. Many researchers have designed...

16.
This paper deals with the classical problem of regulating a feedback-linearizable system when only samples of the output are available for measurement. The feedback linearization is done via an impulsive observer. A realizable reconstruction filter, a minimum-order generalized hold device, connects the discrete controller with the continuous input of the system. The rest of the problem is similar to classical regulator theory. An example of a single-link manipulator with flexible joints concludes the paper.

17.
Cone beam computed tomography (CBCT) enables volumetric image reconstruction from 2D projection data and plays an important role in image-guided radiation therapy (IGRT). Filtered back projection is still the most frequently used algorithm in applications. The algorithm discretizes the scanning process (forward projection) into a system of linear equations, which must then be solved to recover images from measured projection data. The conjugate gradient (CG) algorithm and its variants can be used to solve (possibly regularized) linear systems of equations Ax = b and linear least-squares problems min_x ‖b − Ax‖_2, especially when the matrix A is very large and sparse. Their applications can be found in a general CT context, but in tomography problems (e.g. CBCT reconstruction) they have not been widely used. Hence, CBCT reconstruction using the CG-type algorithm LSQR was implemented and studied in this paper. In CBCT reconstruction, the main computational challenge is that the matrix A is usually very large, and storing it in full requires an amount of memory well beyond the reach of commodity computers. Because of these memory constraints, typically only a small fraction of the weighting matrix A is used, leading to poor reconstruction. In this paper, to overcome this difficulty, the matrix A is partitioned and stored blockwise, and blockwise matrix-vector multiplications are implemented within LSQR. This implementation allows us to use the full weighting matrix A for CBCT reconstruction without further enhancing computer hardware. Tikhonov regularization can also be implemented in this fashion, and produces significant improvement in the reconstructed images.
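A sketch of the blockwise strategy using SciPy (in-memory row blocks stand in for the paper's partitioned weighting matrix, which could equally be memory-mapped from disk): LSQR only ever sees block-assembled matrix-vector products, and the damp parameter supplies Tikhonov regularization.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

def blockwise_operator(blocks, n_cols):
    """blocks: list of (rows_i, n_cols) arrays whose rows partition A."""
    m = sum(b.shape[0] for b in blocks)

    def matvec(x):                        # A @ x, one block at a time
        return np.concatenate([b @ x for b in blocks])

    def rmatvec(y):                       # A.T @ y, accumulated blockwise
        out = np.zeros(n_cols)
        start = 0
        for b in blocks:
            out += b.T @ y[start:start + b.shape[0]]
            start += b.shape[0]
        return out

    return LinearOperator((m, n_cols), matvec=matvec, rmatvec=rmatvec)

# x = lsqr(blockwise_operator(blocks, n), b, damp=0.01)[0]  # damp = Tikhonov
```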

18.
The Support Vector Machine (SVM) is an efficient machine learning tool with high accuracy. However, to achieve the highest accuracy, n-fold cross validation is commonly used to identify the best hyperparameters for the SVM. This becomes a weak point of the SVM because of the extremely long training time over the various hyperparameters of different kernel functions. In this paper, a novel parallel SVM training implementation is proposed to accelerate the cross-validation procedure by running multiple training tasks simultaneously on a Graphics Processing Unit (GPU). All of these tasks, with different hyperparameters, share the same cache memory, which stores the kernel matrix of the support vectors; this heavily reduces redundant computations of kernel values across different training tasks. Since the computation of kernel values is the most time-consuming operation in SVM training, the total time cost of the cross-validation procedure decreases significantly. Experimental tests indicate that the time cost of the multitask cross-validation training is very close to that of the slowest task trained alone. Comparison tests show that the proposed method is 10 to 100 times faster than the state-of-the-art LIBSVM tool.
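A minimal CPU analogue of the shared-cache idea using scikit-learn (an illustration, not the paper's GPU implementation): the RBF kernel matrix is computed once and reused across all values of C in cross validation, so kernel evaluations are not repeated per training task.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics.pairwise import rbf_kernel

def cv_shared_kernel(X, y, Cs=(0.1, 1.0, 10.0), gamma=0.5):
    K = rbf_kernel(X, X, gamma=gamma)   # computed once, shared by all tasks
    return {C: cross_val_score(SVC(C=C, kernel="precomputed"), K, y, cv=5).mean()
            for C in Cs}
```

Note that only C can vary against a fixed kernel matrix; changing gamma would require recomputing K.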

19.
Results of numerical experiments are discussed for large batches of membership queries: determining whether a set of points lies within a set of arbitrarily shaped regions that cover a domain or intersect one another, in a space of arbitrary dimension. The problems are solved with geometric techniques on graphics processors. The proposed solution can outperform the fastest classical algorithms by a factor of 6 to 700 in speed. As an example, the construction of grids for computations within a geophysical model of the Earth is used. Such problems are typical of all numerical computations involving geometric modeling where coverings or triangulations are used or rendering problems are solved.
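The membership test has a natural data-parallel formulation; the NumPy sketch below (spheres stand in for "arbitrary shapes", an assumption) maps directly onto one GPU thread per point-shape pair:

```python
import numpy as np

def points_in_spheres(points, centers, radii):
    """points: (N, d); centers: (M, d); radii: (M,). Returns (N,) bool mask."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return (d2 <= radii[None, :] ** 2).any(axis=1)   # inside any shape?
```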

20.
To obtain high-quality reconstructed images quickly, a symmetric conjugate gradient imaging algorithm is proposed, which greatly reduces the number of iterations. In addition, the ERT physical model is normalized and Tikhonov-regularized, and the idea of QR decomposition is introduced into the solution of the ERT equations, yielding a QR-based symmetric conjugate gradient algorithm that achieves single-step image reconstruction. Theoretical analysis shows that the algorithm converges well. Simulations on typical flow patterns demonstrate that the algorithm can...
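A hedged sketch of the regularized solve in NumPy (the paper's symmetric conjugate gradient iteration is not reproduced): Tikhonov regularization augments the ERT system, and a QR factorization solves the resulting least-squares problem in a single step.

```python
import numpy as np

def tikhonov_qr(S, b, alpha):
    """Solve min ||S g - b||^2 + alpha^2 ||g||^2 via QR of the stacked system."""
    n = S.shape[1]
    A = np.vstack([S, alpha * np.eye(n)])       # augmented (regularized) system
    rhs = np.concatenate([b, np.zeros(n)])
    Q, R = np.linalg.qr(A)                      # reduced QR factorization
    return np.linalg.solve(R, Q.T @ rhs)
```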
