Similar Literature
20 similar documents found (search time: 46 ms).
1.
We describe the porting of the Lattice Boltzmann component of MUPHY, a multi-physics/scale simulation software, to multiple graphics processing units using the Compute Unified Device Architecture. The novelty of this work is the development of ad hoc techniques for optimizing the indirect addressing that MUPHY uses for efficient simulations of irregular domains. Copyright © 2009 John Wiley & Sons, Ltd.
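The abstract's central idea, indirect addressing for irregular domains, can be illustrated with a small sketch. This is not MUPHY's code; the function names and the toy domain are invented. Only active (fluid) cells are stored in a flat array, and a precomputed neighbour table replaces structured-grid index arithmetic, which is the gather pattern a CUDA kernel would execute with one thread per cell:

```python
def build_neighbour_table(active):
    # Map each active (fluid) cell to the flat indices of its E/W/N/S
    # neighbours; -1 marks a neighbour outside the irregular domain.
    index_of = {cell: i for i, cell in enumerate(active)}
    table = []
    for (x, y) in active:
        table.append([index_of.get((x + dx, y + dy), -1)
                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))])
    return table

def stream(values, table):
    # Gather step: each node pulls values from its neighbours through the
    # indirection table instead of computing grid offsets.
    return [sum(values[j] for j in row if j != -1) for row in table]

# A 3-cell L-shaped domain embedded in a 2x2 grid; cell (1, 1) is solid.
active = [(0, 0), (1, 0), (0, 1)]
table = build_neighbour_table(active)
result = stream([1.0, 2.0, 3.0], table)
```

Storing only active cells saves memory on sparse geometries; the cost is the extra indexed load, which is what the paper's optimizations target.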

2.
Exponentially growing number of devices on Internet incurs an ever-increasing load on the network routers in executing network protocols. Parallel processing has recently become an unavoidable means to scale up the router performance. The research effort elaborated in this paper is focused on exploiting the modern trends of general-purpose computing on graphics processing unit computing in speeding up the execution of network protocols. An additional benefit is off-loading the CPU, which can now be fully dedicated to the packet processing and forwarding. To this end, the Shortest Path First algorithm in the Open Shortest Path First protocol and the choice of the best routes in the Border Gateway Protocol are parallelized for efficient execution on Compute Unified Device Architecture platform. An evaluation study was conducted on three different graphics processing units with representative network workload for a varying number of routes and devices. The obtained speedup results confirmed the viability and cost-effectiveness of such an approach. Copyright © 2014 John Wiley & Sons, Ltd.
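The paper does not reproduce its kernels; as a hedged sketch (the graph and function name below are my own), one common data-parallel formulation of shortest-path computation on GPUs is round-based edge relaxation in the Bellman-Ford style, where every edge within a round can be processed by an independent thread:

```python
INF = float("inf")

def parallel_sssp(num_nodes, edges, source):
    # Bellman-Ford-style relaxation: within each round, every edge is
    # examined against a snapshot of the distances, so one round maps to
    # one GPU kernel launch with one thread per edge.
    dist = [INF] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):          # at most |V|-1 rounds
        updated = [(v, dist[u] + w)
                   for (u, v, w) in edges
                   if dist[u] + w < dist[v]]
        if not updated:
            break
        for v, d in updated:                # per-node "min" reduction
            dist[v] = min(dist[v], d)
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
dist = parallel_sssp(4, edges, 0)
```

Unlike Dijkstra's priority queue, this formulation has no serial bottleneck inside a round, which is why it is the usual starting point for GPU shortest-path work.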

3.
4.
Liu Gang, Liang Xiaogeng, He Xuejian. Computer Science, 2012, 39(1): 285-286, 294.
To address the heavy computational load of the fuzzy C-means clustering algorithm for image segmentation, which makes real-time processing difficult, a GPU-based acceleration algorithm is proposed. The stages of the fuzzy C-means algorithm are analysed for parallelizable computation, and, using the Compute Unified Device Architecture hardware and software structure, three parts (computation of the membership matrix, computation of the cluster centres, and assignment of pixels to classes by membership) are recast into forms suited to parallel execution on GPU hardware. Experimental results show that the GPU-based algorithm is markedly more efficient than the serial CPU algorithm. Since most image-processing algorithms contain parallelizable parts, GPU acceleration is broadly applicable.
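As a rough illustration of why the membership-matrix stage parallelizes well (this is the generic fuzzy C-means update with fuzzifier m, not the paper's code), each pixel's membership row depends only on that pixel and the current cluster centres, so every pixel can be one GPU thread:

```python
def membership_row(x, centers, m=2.0):
    # u_k = 1 / sum_j (d(x, c_k) / d(x, c_j)) ** (2 / (m - 1))
    # One pixel's memberships are independent of every other pixel's.
    d = [abs(x - c) for c in centers]
    if 0.0 in d:                      # pixel coincides with a centre
        return [1.0 if dk == 0.0 else 0.0 for dk in d]
    p = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** p for dj in d) for dk in d]

# Scalar (grey-level) pixel equidistant from two centres.
u = membership_row(2.0, centers=[0.0, 4.0])
```

The cluster-centre update, by contrast, is a reduction over all pixels, which is the part that needs a different parallel pattern on the GPU.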

5.
Xiao Han, Xiao Shiyang, Ma Ge, Li Cailin. The Journal of Supercomputing, 2022, 78(14): 16236-16265.
The Journal of Supercomputing - Aiming at the low processing speed of the Sobel edge detection algorithm and the equipment limitations of Compute Unified Device Architecture (CUDA) implementation...

6.
In this article a very efficient implementation of a 2D-Lattice Boltzmann kernel using the Compute Unified Device Architecture (CUDA™) interface developed by nVIDIA® is presented. By exploiting the explicit parallelism exposed in the graphics hardware we obtain more than one order of magnitude in performance gain compared to standard CPUs. A non-trivial example, the flow through a generic porous medium, shows the performance of the implementation.

7.
Global magnetohydrodynamic (MHD) models play the major role in investigating the solar wind–magnetosphere interaction. However, the huge computation requirement in global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global magnetohydrodynamic (MHD) simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transferring and kernel processing are managed with CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.

8.
A CUDA-based bicubic B-spline image scaling method
The CUDA (Compute Unified Device Architecture) technology that NVIDIA introduced with the GeForce 8 series of graphics cards frees general-purpose GPU computing (GPGPU) from the graphics hardware pipeline and high-level shading languages: developers can carry out high-performance parallel computation in single-instruction multiple-data (SIMD) style without mastering graphics programming. This paper studies the design philosophy and programming model of CUDA and improves an image scaling algorithm based on bicubic B-spline surfaces. The time-consuming B-spline resampling step is restructured into SIMD form across many threads, and the complete image scaling pipeline is implemented on CUDA using both global-memory and shared-memory strategies. Experimental results show that the CUDA-based parallel B-spline interpolation achieves hardware acceleration: compared with the CPU B-spline scaling algorithm, execution efficiency improves markedly, the method scales easily, and it shows good real-time capability on large data sets.
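A hedged sketch of the resampling step that the paper moves into SIMD threads: these are the standard uniform cubic B-spline basis weights (not the paper's implementation), and `resample_1d` shows one separable pass; a bicubic scaler applies it along rows and then columns, blending 16 samples per output pixel, one output pixel per thread:

```python
def bspline_weights(t):
    # Uniform cubic B-spline weights for fractional offset t in [0, 1).
    w0 = (1 - t) ** 3 / 6.0
    w1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6.0
    w2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6.0
    w3 = t ** 3 / 6.0
    return w0, w1, w2, w3

def resample_1d(samples, i, t):
    # One separable pass: B-spline blend of the 4 samples around i + t.
    w = bspline_weights(t)
    return sum(wk * samples[i - 1 + k] for k, wk in enumerate(w))
```

The weights always sum to 1, so flat regions are reproduced exactly; in the paper's shared-memory variant the sample window would be staged in shared memory before the blend.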

9.
In this paper, we aim at exploiting the computing power of a graphics processing unit (GPU) cluster for solving large sparse linear systems. We implement the parallel algorithm of the generalized minimal residual iterative method using the Compute Unified Device Architecture programming language and the MPI parallel environment. The experiments show that a GPU cluster is more efficient than a CPU cluster. In order to optimize the performances, we use a compressed storage format for the sparse vectors and the hypergraph partitioning. These solutions improve the spatial and temporal locality of the shared data between the computing nodes of the GPU cluster.
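The abstract mentions a compressed storage format without specifying it; CSR (compressed sparse row) is the usual choice, and the sparse matrix-vector product below is the kernel that dominates GMRES iterations. Rows are mutually independent, which is what makes one GPU thread (or warp) per row effective. This is a generic CSR sketch, not the paper's code:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    # y = A @ x with A in CSR form; each output row is independent.
    y = []
    for r in range(len(row_ptr) - 1):
        start, end = row_ptr[r], row_ptr[r + 1]
        y.append(sum(values[k] * x[col_idx[k]] for k in range(start, end)))
    return y

# 3x3 matrix [[4, 0, 1], [0, 2, 0], [1, 0, 3]] in CSR form:
values  = [4.0, 1.0, 2.0, 1.0, 3.0]   # non-zeros, row by row
col_idx = [0, 2, 1, 0, 2]             # column of each non-zero
row_ptr = [0, 2, 3, 5]                # where each row starts in values
y = spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0])
```

On a cluster, the hypergraph partitioning the paper uses decides which rows (and which entries of x) each node owns, so that the indexed reads of `x` mostly hit local data.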

10.
To study GPU-based high-performance parallel computing, the basic data-processing algorithms of a pulse-compression radar, namely pulse compression and coherent integration, were implemented on an NVIDIA GTX470 GPU with 448 integrated processing cores. Following the GPU's parallel processing architecture, both algorithms were redesigned for parallel execution and mapped efficiently onto the 448 cores of the GTX470, completing a GPU-parallel implementation of the radar's basic processing algorithms. Finally, the results of the parallel computation were verified, and the quality and real-time performance of the processing were evaluated.

11.
The brute-force (BF) algorithm is one of the classic string-matching algorithms, but it is not directly suited to the parallel architecture of a GPU. This paper proposes a solution based on the Compute Unified Device Architecture (CUDA): by adding a fixed proportion of redundant data to the input, a parallel BF algorithm is designed whose work items are mutually independent, matching CUDA's data-independence requirement. Experimental results show that the CUDA-based parallel string-matching algorithm achieves a speedup of about 10x over the equivalent CPU algorithm. The factors that influence its performance are also analysed.
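A plain-Python sketch of the two ideas in the abstract (the function names are mine): brute-force matching checks every starting position independently, so each position can be one GPU thread, and chunks handed to different thread blocks are extended with len(pattern)-1 redundant characters so matches straddling a chunk boundary are not lost:

```python
def bf_match(text, pattern):
    # Every starting position is checked independently of the others.
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if text[i:i + m] == pattern]

def chunks_with_overlap(text, chunk, m):
    # Extend each chunk by m - 1 redundant characters: the proportional
    # redundancy that makes per-chunk matching independent.
    return [text[i:i + chunk + m - 1] for i in range(0, len(text), chunk)]

matches = bf_match("abababa", "aba")
chunks = chunks_with_overlap("abababa", 4, 3)
```

The match at position 2 spans the boundary between the first and second chunk; the 2-character overlap keeps it inside the first chunk, so no cross-chunk communication is needed.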

12.
The subset-sum problem is a well-known non-deterministic polynomial-time complete (NP-complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two-list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide-and-conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector-based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
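A toy sketch of the two-list idea under my own naming (not the paper's CUDA code): the generation stage builds all subset sums of each half of the item set iteratively, each pass doubling a vector, which mirrors the paper's vector-based replacement for device-side recursion; the search stage then scans the two sorted lists from opposite ends:

```python
def subset_sums(weights):
    # Iterative generation: each pass doubles the sorted list of sums,
    # replacing the textbook recursive divide-and-conquer.
    sums = [0]
    for w in weights:
        sums = sorted(sums + [s + w for s in sums])
    return sums

def two_list_search(a_sums, b_sums, target):
    # Scan one sorted list ascending, the other descending, looking for a
    # pair that sums to target; O(|A| + |B|) once the lists are sorted.
    i, j = 0, len(b_sums) - 1
    while i < len(a_sums) and j >= 0:
        s = a_sums[i] + b_sums[j]
        if s == target:
            return True
        if s < target:
            i += 1
        else:
            j -= 1
    return False

a = subset_sums([2, 7])        # sums of the first half of the item set
b = subset_sums([11, 5])       # sums of the second half
found = two_list_search(a, b, 16)
```

The pruning stage (omitted here) discards sums that cannot participate in any solution before the search, which is where much of the paper's GPU-side tuning lives.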

13.
Hyperspectral imaging, which records a detailed spectrum of light arriving in each pixel, has many potential uses in remote sensing as well as other application areas. Practical applications will typically require real-time processing of large data volumes recorded by a hyperspectral imager. This paper investigates the use of graphics processing units (GPU) for such real-time processing. In particular, the paper studies a hyperspectral anomaly detection algorithm based on normal mixture modelling of the background spectral distribution, a computationally demanding task relevant to military target detection and numerous other applications. The algorithm parts are analysed with respect to complexity and potential for parallelization. The computationally dominating parts are implemented on an Nvidia GeForce 8800 GPU using the Compute Unified Device Architecture programming interface. GPU computing performance is compared to a multi-core central processing unit implementation. Overall, the GPU implementation runs significantly faster, particularly for highly data-parallelizable and arithmetically intensive algorithm parts. For the parts related to covariance computation, the speed gain is less pronounced, probably due to a smaller ratio of arithmetic to memory access. Detection results on an actual data set demonstrate that the total speedup provided by the GPU is sufficient to enable real-time anomaly detection with normal mixture models even for an airborne hyperspectral imager with high spatial and spectral resolution.
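As an illustration of the data-parallel part (a generic single-component sketch with two spectral bands, not the paper's mixture-model code): scoring a pixel against a background normal component is a per-pixel Mahalanobis distance, independent across pixels, whereas estimating the covariance itself is a reduction over all pixels, which matches the paper's observation that the covariance-related parts gain less from the GPU:

```python
def mahalanobis2(x, mean, cov):
    # Squared Mahalanobis distance of a 2-band pixel x to N(mean, cov);
    # purely per-pixel work, hence one GPU thread per pixel.
    dx = [x[0] - mean[0], x[1] - mean[1]]
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    t0 = inv[0][0] * dx[0] + inv[0][1] * dx[1]
    t1 = inv[1][0] * dx[0] + inv[1][1] * dx[1]
    return dx[0] * t0 + dx[1] * t1

def flag_anomalies(pixels, mean, cov, threshold):
    # Mark pixels far from the background mode (single-component case).
    return [mahalanobis2(p, mean, cov) > threshold for p in pixels]

flags = flag_anomalies([(0.1, 0.0), (5.0, 5.0)],
                       mean=(0.0, 0.0),
                       cov=((1.0, 0.0), (0.0, 1.0)),
                       threshold=9.0)
```

With a full normal mixture, each pixel would be scored against every component and the minimum (or a likelihood combination) thresholded, but the per-pixel independence is unchanged.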

14.
The AVC video coding standard adopts variable block sizes for inter frame coding to increase compression efficiency, among other new features. As a consequence of this, an AVC encoder has to employ a complex mode decision technique that requires high computational complexity. Several techniques aimed at accelerating the inter prediction process have been proposed in the literature in recent years. Recently, with the emergence of many-core processors or accelerators, a new way of supporting inter frame prediction has presented itself. In this paper, we present a step forward in the implementation of an AVC inter prediction algorithm in a graphics processing unit, using Compute Unified Device Architecture. The results show a negligible drop in rate distortion with a time reduction, on average, of over 98.8 % compared with full search and fast full search, and of over 80 % compared with UMHexagonS search.

15.
A fast CUDA-based scale-invariant feature transform algorithm
Tian Wen, Xu Fan, Wang Hongyuan, Zhou Bo. Computer Engineering, 2010, 36(8): 219-221.
To address the long running time of the scale-invariant feature transform (SIFT), which limits its range of application, a fast SIFT algorithm based on the Compute Unified Device Architecture (CUDA) is proposed. The parallel characteristics of the algorithm are analysed, and it is optimized with respect to the thread and memory models of the graphics processing unit (GPU). Experiments show a 30x to 50x speedup over the CPU; a 640x480 image is processed at 24 frames per second, meeting real-time requirements.

16.
Graphics processor units (GPU) that are originally designed for graphics rendering have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory hierarchy specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.

17.
Three-dimensional curve skeletons are a very compact representation of three-dimensional objects with many uses and applications in fields such as computer graphics, computer vision, and medical imaging. An important problem is that the calculation of the skeleton is a very time-consuming process. Thinning is a widely used technique for calculating the curve skeleton because of the properties it ensures and the ease of implementation. In this paper, we present parallel versions of a thinning algorithm for efficient implementation in both graphics processing units and multicore CPUs. The parallel programming models used in our implementations are Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL). The speedup achieved with the optimized parallel algorithms for the graphics processing unit achieves 106.24x against the CPU single-process version and more than 19x over the CPU multithreaded version. Copyright © 2011 John Wiley & Sons, Ltd.

18.
Fast image denoising on the CUDA architecture
Image processing usually requires substantial computation, and denoising is a frequently used preprocessing step, so fast denoising algorithms are of practical importance. The graphics processing unit offers powerful parallel computing capability but sits idle most of the time; the Compute Unified Device Architecture provides an easy-to-use development environment for general-purpose computation on it. This paper proposes a fast CUDA-based image denoising algorithm that exploits the GPU's computing power to accelerate denoising and reduce computation time significantly.

19.
This paper presents two schemes of parallel 2D discrete wavelet transform (DWT) on Compute Unified Device Architecture graphics processing units. For the first scheme, the image and filter are transformed to the spectral domain using the Fast Fourier Transform (FFT), multiplied, and then transformed back to the space domain using the inverse FFT. For the second scheme, the image pixels are convolved directly with the filters. Because there is no data dependence, the convolutions at different positions can be executed concurrently. To reduce data transfer, the boundary extension and down-sampling are processed during the data-loading stage, and transposition is completed implicitly during data storage. A similar technique is adopted when parallelizing the inverse 2D DWT. To further speed up data access, the filter coefficients are stored in constant memory. We have parallelized the 2D DWT for dozens of wavelet types and achieved a speedup factor of over 380 times compared with its CPU version. We applied the parallel 2D DWT in a ring-artifact removal procedure; execution was accelerated nearly 200 times compared with the CPU version. The experimental results show that the proposed parallel 2D DWT on graphics processing units can significantly improve performance for a wide variety of wavelet types and is promising for various applications. Copyright © 2015 John Wiley & Sons, Ltd.
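A minimal sketch of the second scheme (direct convolution), with Haar filters standing in for the paper's many wavelet types: the low- and high-pass convolutions are evaluated only at even positions, fusing the down-sampling by 2 into the same pass, analogous to the paper folding boundary extension and down-sampling into its data-loading stage. Periodic extension is used here for simplicity:

```python
def dwt_1d(signal, lo, hi):
    # One analysis level of a 1-D DWT by direct convolution; outputs are
    # computed only at even positions (down-sampling fused into the pass),
    # and every output is independent, hence one GPU thread per output.
    n = len(signal)
    approx, detail = [], []
    for i in range(0, n, 2):
        a = sum(lo[k] * signal[(i + k) % n] for k in range(len(lo)))
        d = sum(hi[k] * signal[(i + k) % n] for k in range(len(hi)))
        approx.append(a)
        detail.append(d)
    return approx, detail

# Haar filters; a separable 2-D DWT applies dwt_1d to rows, transposes,
# then applies it to columns (the paper hides the transpose in the store).
s = 2 ** -0.5
approx, detail = dwt_1d([1.0, 1.0, 4.0, 4.0], [s, s], [s, -s])
```

Because the filter taps (`lo`, `hi`) are small, read-only, and shared by every thread, constant memory is the natural home for them on the GPU, as the paper notes.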

20.
To relieve the channel-decoding bottleneck in high-throughput communication systems, and building on CUDA parallel computing and previous explorations of parallel Viterbi decoding, a CUDA-based truncated overlapping Viterbi decoder for convolutional codes is proposed. The algorithm splits the trellis into mutually overlapping truncated sub-trellises and executes the independent forward metric computations and traceback procedures in parallel. Experimental results show that, while the bit-error-rate performance of the decoding algorithm is preserved, throughput improves by a factor of 1.3 to 3.5 over existing implementations, with lower hardware overhead, making the decoder practical for real high-throughput communication systems.
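One plausible way to organize the overlapping truncated sub-trellises (this bookkeeping is my own sketch; the paper's exact overlap handling may differ): each block is decoded by an independent Viterbi instance over a span extended by the overlap, and only a non-overlapping middle region of each traceback is kept, so all blocks can run in parallel:

```python
def overlapped_blocks(n_symbols, block, overlap):
    # Partition a received stream into truncated sub-trellises of length
    # `block` that overlap by `overlap` symbols. Each span is decoded
    # independently; only [keep_from, keep_to) of its traceback is kept.
    spans = []
    step = block - overlap
    start = 0
    while True:
        end = min(start + block, n_symbols)
        keep_from = start if start == 0 else start + overlap // 2
        keep_to = end if end == n_symbols else end - overlap // 2
        spans.append((start, end, keep_from, keep_to))
        if end == n_symbols:
            return spans
        start += step

spans = overlapped_blocks(10, block=6, overlap=2)
```

The kept regions tile the stream exactly once; the discarded overlap symbols give each truncated Viterbi instance time to converge to the survivor path before its kept region begins.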
