期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

GPU-LMDDA: a bit-vector GPU-based deadlock detection algorithm for multi-unit resource systems

Stephen Abell Nhan Do 《International Journal of Parallel, Emergent and Distributed Systems》2016,31(6):562-590

相似文献

2.

Feature tracking and matching in video using programmable graphics hardware 总被引：2，自引：0，他引：2

Sudipta N. Sinha Jan-Michael Frahm Marc Pollefeys Yakup Genc 《Machine Vision and Applications》2011,22(1):207-217

This paper describes novel implementations of the KLT feature tracking and SIFT feature extraction algorithms that run on the graphics processing unit (GPU) and is suitable for video analysis in real-time vision systems. While significant acceleration over standard CPU implementations is obtained by exploiting parallelism provided by modern programmable graphics hardware, the CPU is freed up to run other computations in parallel. Our GPU-based KLT implementation tracks about a thousand features in real-time at 30 Hz on 1,024 × 768 resolution video which is a 20 times improvement over the CPU. The GPU-based SIFT implementation extracts about 800 features from 640 × 480 video at 10 Hz which is approximately 10 times faster than an optimized CPU implementation. 相似文献

3.

PMVS算法的CPU多线程和GPU两级粒度并行策略

刘金硕江庄毅徐亚渤邓娟章岚昕《计算机科学》2017,44(2):296-301

PMVS(Patch-based Multi-View Stereo)三维重建算法被广泛应用于无人机航拍影像的三维场景重建中。针对PMVS三维重建算法计算量大、时间复杂度高的问题,提出了PMVS算法的CPU多线程和GPU两级粒度并行策略(Multithread and GPU Parallel Schema,MGPS),方法具体包括:基于GPU的PMVS算法特征提取和片面扩散的并行设计;多影像的GPU和CPU任务分配机制,以使得部分任务分配给CPU采用多线程并行,部分任务分配给GPU并行时,程序总运行时间最短。实验采用搭载24核CPU和NVIDIA Tesla K20 GPU的高性能服务器作为测试平台,针对分辨率为4081×2993的16幅无人机影像进行三维重建。实验结果表明,相比串行的PMVS算法,基于MGPS的PMVS算法取得4倍左右的加速比,其中特征提取最高加速13倍,计算误差在10%以内,该方法实现了更高效的PMVS三维重建。基于MGPS的PMVS算法还可用于文物保护、医学图像处理、虚拟现实等领域。相似文献

4.

Scalable and fast SVM regression using modern hardware

Zeyi?Wen Email author Rui?Zhang Kotagiri?Ramamohanarao Li?Yang Email author 《World Wide Web》2018,21(2):261-287

Support Vector Machine (SVM) regression is an important technique in data mining. The SVM training is expensive and its cost is dominated by: (i) the kernel value computation, and (ii) a search operation which finds extreme training data points for adjusting the regression function in every training iteration. Existing training algorithms for SVM regression are not scalable to large datasets because: (i) each training iteration repeatedly performs expensive kernel value computations, which is inefficient and requires holding the whole training dataset in memory; (ii) the search operation used in each training iteration considers the whole search space which is very expensive. In this article, we significantly improve the scalability and efficiency of SVM regression by exploiting the high performance of Graphics Processing Units (GPUs) and solid state drives (SSDs). Our key ideas are as follows. (i) To reduce the cost of repeated kernel value computations and avoid holding the whole training dataset in the GPU memory, we precompute all the kernel values and store them in the CPU memory extended by the SSD; together with an efficient strategy to read the precomputed kernel values, reusing precomputed kernel values with an efficient retrieval is much faster than computing them on-the-fly. This also alleviates the restriction that the training dataset has to fit into the GPU memory, and hence makes our algorithm scalable to large datasets, especially for large datasets with very high dimensionality. (ii) To enhance the performance of the frequently used search operation, we design an algorithm that minimizes the search space and the number of accesses to the GPU global memory; this optimized search algorithm also avoids branch divergence (one of the causes for poor performance) among GPU threads to achieve high utilization of the GPU resources. Our proposed techniques together form a scalable solution to the SVM regression which we call SIGMA. Our extensive experimental results show that SIGMA is highly efficient and can handle very large datasets which the state-of-the-art GPU-based algorithm cannot handle. On the datasets of size that the state-of-the-art GPU-based algorithm can handle, SIGMA consistently outperforms the state-of-the-art GPU-based algorithm by an order of magnitude and achieves up to 86 times speedup. 相似文献

5.

Parallel Computing Experiences with CUDA 总被引：2，自引：0，他引：2

Garland Michael Le Grand Scott Nickolls John Anderson Joshua Hardwick Jim Morton Scott Phillips Everett Zhang Yao Volkov Vasily 《Micro, IEEE》2008,28(4):13-27

The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems and the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU. 相似文献

6.

图形处理器在数据管理领域的应用研究综述 总被引：1，自引：0，他引：1

下载免费PDF全文

周国亮冯海军何国明陈红《计算机科学与探索》2010,4(4):289-303

比较了中央处理器和图形处理器体系结构的异同,并简要介绍了最新的图形处理器通用计算平台及不同体系结构间并行算法的异同。详细叙述了图形处理器在空间数据库、关系数据库、数据流和数据挖掘及信息检索等方面应用的技术特点;探讨了基于图形处理器的各种内外存排序算法及性能;描述了基于图形处理器的各种数据结构和索引技术;阐述了图形处理器算法优化方面的工作。最后,展望了图形处理器应用于数据管理的发展前景,并分析了这一领域未来所面临的挑战。相似文献

7.

A CPU-GPU hybrid approach for the unsymmetric multifrontal method 总被引：1，自引：0，他引：1

Chenhan D. YuWeichung Wang Dan’l Pierce 《Parallel Computing》2011,37(12):759-770

Multifrontal is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphic processing unit (GPU). We analyze the unsymmetric multifrontal method from both an algorithmic and implementational perspective to see how a GPU, in particular the NVIDIA Tesla C2070, can be used to accelerate the computations. Our main accelerating strategies include (i) performing BLAS on both CPU and GPU, (ii) improving the communication efficiency between the CPU and GPU by using page-locked memory, zero-copy memory, and asynchronous memory copy, and (iii) a modified algorithm that reuses the memory between different GPU tasks and sets thresholds to determine whether certain tasks be performed on the GPU. The proposed acceleration strategies are implemented by modifying UMFPACK, which is an unsymmetric multifrontal linear system solver. Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems. 相似文献

8.

Accelerating IP routing algorithm using graphics processing unit for high speed multimedia communication

Jia Uddin In-Kyu Jeong Myeongsu Kang Cheol-Hong Kim Jong-Myon Kim 《Multimedia Tools and Applications》2016,75(23):15365-15379

This paper presents a Graphics Processing Unit (GPU)-based implementation of a Bellman-Ford (BF) routing algorithm using NVIDIA’s Compute Unified Device Architecture (CUDA). In the proposed GPU-based approach, multiple threads run concurrently over numerous streaming processors in the GPU to dynamically update routing information. Instead of computing the individual vertex distances one-by-one, a number of threads concurrently update a larger number of vertex distances, and an individual vertex distance is represented in a single thread. This paper compares the performance of the GPU-based approach to an equivalent CPU implementation while varying the number of vertices. Experimental results show that the proposed GPU-based approach outperforms the equivalent sequential CPU implementation in terms of execution time by exploiting the massive parallelism inherent in the BF routing algorithm. In addition, the reduction in energy consumption (about 99 %) achieved by using the GPU is reflective of the overall merits of deploying GPUs across the entire landscape of IP routing for emerging multimedia communications. 相似文献

9.

基于图形处理器的Cuboid算法

周国亮冯海军何国明陈红李翠平王珊《计算机研究与发展》2009,46(Z2)

近年来,基于图形处理器的通用计算获得了广泛关注,并在多个领域取得了进展.内存OLAP减少了磁盘I/O,但基于单核或多核CPU的计算能力及cache miss成为新的性能瓶颈,从而无法保证好的效率.而图形处理器由于其众多核和高带宽能够很好地适应OLAP计算特性.通过图形处理器来加速任一cuboid的计算,从而提高整个内存OLAP系统的性能.提出了基于图形处理器的分块并行算法,并对算法进行了优化及讨论了数据稀疏和数据分布倾斜等不同条件下的算法.算法通过扩展可以突破内存限制,组成磁盘、内存、显存三级流水线,适应海量数据计算;同时算法也可以作为计算整个cube的基础.通过实验比较,基于图形处理器的算法明显优于四核CPU算法. 相似文献

10.

A GPU‐based Streaming Algorithm for High‐Resolution Cloth Simulation

Min Tang Ruofeng Tong Rahul Narain Chang Meng Dinesh Manocha 《Computer Graphics Forum》2013,32(7):21-30

We present a GPU‐based streaming algorithm to perform high‐resolution and accurate cloth simulation. We map all the components of cloth simulation pipeline, including time integration, collision detection, collision response, and velocity updating to GPU‐based kernels and data structures. Our algorithm perform intra‐object and inter‐object collisions, handles contacts and friction, and is able to accurately simulate folds and wrinkles. We describe the streaming pipeline and address many issues in terms of obtaining high throughput on many‐core GPUs. In practice, our algorithm can perform high‐fidelity simulation on a cloth mesh with 2M triangles using 3GB of GPU memory. We highlight the parallel performance of our algorithm on three different generations of GPUs. On a high‐end NVIDIA Tesla K20c, we observe up to two orders of magnitude performance improvement as compared to a single‐threaded CPU‐based algorithm, and about one order of magnitude improvement over a 16‐core CPU‐based parallel implementation. 相似文献

11.

A mixed-precision algorithm for the solution of Lyapunov equations on hybrid CPU-GPU platforms 总被引：1，自引：0，他引：1

Peter Benner 《Parallel Computing》2011,37(8):439-450

We describe a hybrid Lyapunov solver based on the matrix sign function, where the intensive parts of the computation are accelerated using a graphics processor (GPU) while executing the remaining operations on a general-purpose multi-core processor (CPU). The initial stage of the iteration operates in single-precision arithmetic, returning a low-rank factor of an approximate solution. As the main computation in this stage consists of explicit matrix inversions, we propose a hybrid implementation of Gauß-Jordan elimination using look-ahead to overlap computations on GPU and CPU. To improve the approximate solution, we introduce an iterative refinement procedure that allows to cheaply recover full double-precision accuracy. In contrast to earlier approaches to iterative refinement for Lyapunov equations, this approach retains the low-rank factorization structure of the approximate solution. The combination of the two stages results in a mixed-precision algorithm, that exploits the capabilities of both general-purpose CPUs and many-core GPUs and overlaps critical computations. Numerical experiments using real-world data and a platform equipped with two Intel Xeon QuadCore processors and an Nvidia Tesla C1060 show a significant efficiency gain of the hybrid method compared to a classical CPU implementation. 相似文献

12.

Parallelization of 2D MPDATA EULAG algorithm on hybrid architectures with GPU accelerators

《Parallel Computing》2014,40(8):425-447

EULAG (Eulerian/semi-Lagrangian fluid solver) is an established computational model developed for simulating thermo-fluid flows across a wide range of scales and physical scenarios. The dynamic core of EULAG includes the multidimensional positive definite advection transport algorithm (MPDATA) and elliptic solver. In this work we investigate aspects of an optimal parallel version of the 2D MPDATA algorithm on modern hybrid architectures with GPU accelerators, where computations are distributed across both GPU and CPU components.Using the hybrid OpenMP–OpenCL model of parallel programming opens the way to harness the power of CPU–GPU platforms in a portable way. In order to better utilize features of such computing platforms, comprehensive adaptations of MPDATA computations to hybrid architectures are proposed. These adaptations are based on efficient strategies for memory and computing resource management, which allow us to ease memory and communication bounds, and better exploit the theoretical floating point efficiency of CPU–GPU platforms. The main contributions of the paper are:

•method for the decomposition of the 2D MPDATA algorithm as a tool to adapt MPDATA computations to hybrid architectures with GPU accelerators by minimizing communication and synchronization between CPU and GPU components at the cost of additional computations;
•method for the adaptation of 2D MPDATA computations to multicore CPU platforms, based on space and temporal blocking techniques;
•method for the adaptation of the 2D MPDATA algorithm to GPU architectures, based on a hierarchical decomposition strategy across data and computation domains, with support provided by the developed GPU task scheduler allowing for the flexible management of available resources;
•approach to the parametric optimization of 2D MPDATA computations on GPUs using the autotuning technique, which allows us to provide a portable implementation methodology across a variety of GPUs.

Hybrid platforms tested in this study contain different numbers of CPUs and GPUs – from solutions consisting of a single CPU and a single GPU to the most elaborate configuration containing two CPUs and two GPUs. Processors of different vendors are employed in these systems – both Intel and AMD CPUs, as well as GPUs from NVIDIA and AMD. For all the grid sizes and for all the tested platforms, the hybrid version with computations spread across CPU and GPU components allows us to achieve the highest performance. In particular, for the largest MPDATA grids used in our experiments, the speedups of the hybrid versions over GPU and CPU versions vary from 1.30 to 1.69, and from 1.95 to 2.25, respectively. 相似文献

13.

Combining CPU and GPU architectures for fast similarity search

Martin Kruli? Tomá? Skopal Jakub Loko? Christian Beecks 《Distributed and Parallel Databases》2012,30(3-4):179-207

The Signature Quadratic Form Distance on feature signatures represents a flexible distance-based similarity model for effective content-based multimedia retrieval. Although metric indexing approaches are able to speed up query processing by two orders of magnitude, their applicability to large-scale multimedia databases containing billions of images is still a challenging issue. In this paper, we propose a parallel approach that balances the utilization of CPU and many-core GPUs for efficient similarity search with the Signature Quadratic Form Distance. In particular, we show how to process multiple distance computations and other parts of the search procedure in parallel, achieving maximal performance of the combined CPU/GPU system. The experimental evaluation demonstrates that our approach implemented on a common workstation with 2?GPU cards outperforms traditional parallel implementation on a high-end 48-core NUMA server in terms of efficiency almost by an order of magnitude. If we consider also the price of the high-end server that is ten times higher than that of the GPU workstation then, based on price/performance ratio, the GPU-based similarity search beats the CPU-based solution by almost two orders of magnitude. Although proposed for the SQFD, our approach of fast GPU-based similarity search is applicable for any distance function that is efficiently parallelizable in the SIMT execution model. 相似文献

14.

Direct volume rendering of unstructured tetrahedral meshes using CUDA and OpenMP

Erhan Okuyan Uğur Güdükbay 《The Journal of supercomputing》2014,67(2):324-344

Direct volume visualization is an important method in many areas, including computational fluid dynamics and medicine. Achieving interactive rates for direct volume rendering of large unstructured volumetric grids is a challenging problem, but parallelizing direct volume rendering algorithms can help achieve this goal. Using Compute Unified Device Architecture (CUDA), we propose a GPU-based volume rendering algorithm that itself is based on a cell projection-based ray-casting algorithm designed for CPU implementations. We also propose a multicore parallelized version of the cell-projection algorithm using OpenMP. In both algorithms, we favor image quality over rendering speed. Our algorithm has a low memory footprint, allowing us to render large datasets. Our algorithm supports progressive rendering. We compared the GPU implementation with the serial and multicore implementations. We observed significant speed-ups that, together with progressive rendering, enables reaching interactive rates for large datasets. 相似文献

15.

基于GPU的多分辨率体数据重构和渲染 总被引：4，自引：1，他引：3

罗月童薛晔刘晓平《计算机辅助设计与图形学学报》2009,21(1)

基于小波变换的多分辨率压缩算法能够获得很高的压缩比,因而被广泛地用于压缩体数据.针对这种压缩策略,研究基于GPU的数据重构的方法,可以只从CPU向GPU传输少量的压缩数据,从而提高数据传输效率.因为好的数据结构是实现基于GPU的重构算法的关键,所以文中提出适合使用矩形纹理表示的数据结构--Nested Tileboard;然后给出基于该数据结构在GPU上实现多分辨率重构的方法,使用Nested Tileboard保存中间数据及重构结果;还提出了基于Nested Tileboard的多分辨率体绘制方法,直接对重构数据进行体绘制,从而实现数据重构和体绘制的无缝连接. 相似文献

16.

Deployment of parallel linear genetic programming using GPUs on PC and video game console platforms

Garnett Wilson Wolfgang Banzhaf 《Genetic Programming and Evolvable Machines》2010,11(2):147-184

相似文献

17.

基于GPU的LARED-P算法加速

刘来国徐炜遐杨灿群陈娟《计算机工程与科学》2009,31(Z1)

GPU拥有几百GFlops甚至上TFlops的浮点计算能力,将GPU应用于粒子模拟,可有效提高大规模粒子模拟的速度,降低计算成本。本文利用GPU加速三维激光等离子体模拟算法LARED-P,提出了基于CPU+GPU的任务划分、GPU上任务分解、大规模计算核心的分解方法,结合使用了寄存器、纹理内存对算法进行加速。在双精度条件下,移植后的算法在工作频率为1.44GHz的NVIDIA Tesla S1070的单个GPU上获得了相当于主频2.4GHz的Intel(R)Core(TM)2 Quad CPU Q6600单核的6倍加速比。相似文献

18.

基于GPU的遥感图像配准并行程序设计与存储优化

周海芳赵进《计算机研究与发展》2012,(Z1):281-286

遥感图像配准是遥感图像应用的一个重要处理步骤.随着遥感图像数据规模与遥感图像配准算法计算复杂度的增大,遥感图像配准面临着处理速度的挑战.最近几年,GPU计算能力得到极大提升,面向通用计算领域得到了快速发展.结合GPU面向通用计算领域的优势与遥感图像配准面临的处理速度问题,研究了GPU加速处理遥感图像配准的算法.选取计算量大计算精度高的基于互信息小波分解配准算法进行GPU并行设计,提出了GPU并行设计模型;同时选取GPU程序常用面向存储级的优化策略应用于遥感图像配准GPU程序,并利用CUDA(compute unified device architecture)编程语言在nVIDIA Tesla M2050GPU上进行了实验.实验结果表明,提出的并行设计模型与面向存储级的优化策略能够很好地适用于遥感图像配准领域,最大加速比达到了19.9倍.研究表明GPU通用计算技术在遥感图像处理领域具有广阔的应用前景. 相似文献

19.

A hybrid parallel solver for systems of multivariate polynomials using CPUs and GPUs

Cheon-Hyeon Park Gershon Elber Ku-Jin Kim Gye-Young Kim Joon-Kyung Seong 《Computer aided design》2011,43(11):1360-1369

This paper deals with a problem of finding valid solutions to systems of polynomial constraints. Although there have been several quite successful algorithms based on domain subdivision to resolve this problem, some major issues are still demanding further research. Prime obstacles in developing an efficient subdivision-based polynomial constraint solver are the exhaustive, although hierarchical, search of the zero-set in the parameter domain, which is computationally demanding, and their scalability in terms of the number of variables. In this paper, we present a hybrid parallel algorithm for solving systems of multivariate constraints by exploiting both the CPU and the GPU multicore architectures. We dedicate the CPU for the traversal of the subdivision tree and the GPU for the multivariate polynomial subdivision. By decomposing the constraint solving technique into two different components, hierarchy traversal and polynomial subdivision, each of which is more suitable to CPUs and GPUs, respectively, our solver can fully exploit the availability of hybrid, multicore architectures of CPUs and GPUs. Furthermore, our GPU-based subdivision method takes advantage of the inherent parallelism in the multivariate polynomial subdivision. We demonstrate the efficacy and scalability of the proposed parallel solver through several examples in geometric applications, including Hausdorff distance queries, contact point computations, surface–surface intersections, ray trap constructions, and bisector surface computations. In our experiments, the proposed parallel method achieves up to two orders of magnitude improvement in performance compared to the state-of-the-art subdivision-based CPU solver. 相似文献

20.

A fast algorithm for constructing inverted files on heterogeneous platforms

Zheng Wei Joseph JaJa 《Journal of Parallel and Distributed Computing》2012

Given a collection of documents residing on a disk, we develop a new strategy for processing these documents and building the inverted files extremely quickly. Our approach is tailored for a heterogeneous platform consisting of multicore CPUs and highly multithreaded GPUs. Our algorithm is based on a number of novel techniques, including a high-throughput pipelined strategy, a hybrid trie and B-tree dictionary data structure, dynamic work allocation to CPU and GPU threads, and optimized CUDA indexer implementation. We have performed extensive tests of our algorithm on a single node (two Intel Xeon X5560 Quad-core CPUs) with two NVIDIA Tesla C1060 GPUs attached to it, and were able to achieve a throughput of more than 262 MB/s on the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature even when compared to those run on large clusters. 相似文献