1.

Wang Qicong, Wu Zebin, Liu Jianjun, Wei Zhihui, Ye Shun, Liu Jiafu 《Computer Engineering & Science》 2014, 36(12): 2321-2330

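Hyperspectral image classification is a hot topic in remote sensing information processing. Within the kernel sparse representation classification framework, spatial-spectral joint kernel sparse representation classification combines spectral information with the spatial information of pixels and achieves good classification results. However, its high computational complexity and the large data volume of hyperspectral images limit its use where real-time performance is required. This paper proposes a parallel optimization method for spatial-spectral joint kernel sparse representation classification of hyperspectral images on the GPU/CUDA architecture, with three parts:

- A memory-access optimization strategy for data exchange between the host and the device.
- Full use of the GPU's parallel computing power to accelerate computation of the kernel matrix during classification.
- Matrix operations implemented to suit the GPU's parallel characteristics, used to speed up solving the classification model with the alternating direction method of multipliers (ADMM).

Experiments on real hyperspectral image data verify the effectiveness and efficiency of the method.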
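The abstract does not name the kernel function. The following is a minimal sketch of how the kernel matrix between pixel spectra could be computed, assuming a Gaussian (RBF) kernel with a hypothetical width parameter `gamma`. Every entry depends on a single pair of spectra, which is why this step maps well onto independent GPU threads.

```python
import math

def rbf_kernel_matrix(X, Y, gamma):
    """Gram matrix K[i][j] = exp(-gamma * ||X[i] - Y[j]||^2).

    Assumed Gaussian kernel; each (i, j) entry is independent of the
    others, so a GPU version could assign one thread per entry.
    """
    K = []
    for x in X:
        row = []
        for y in Y:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            row.append(math.exp(-gamma * sq_dist))
        K.append(row)
    return K

# Usage: three hypothetical 4-band pixel spectra
pixels = [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.5], [0.9, 0.8, 0.7, 0.6]]
K = rbf_kernel_matrix(pixels, pixels, gamma=1.0)
```

Spectrally similar pixels (the first two) get kernel values close to 1. Dissimilar ones decay toward 0.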

2.

Real-time video super-resolution reconstruction with GPU acceleration

Chen Xiangji, Han Guoqiang, Zhang Zhiyuan 《Journal of Computer Applications》 2013, 33(12): 3540-3543

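Super-resolution algorithms based on sparse representation produce high-quality image reconstructions, but they are complex, and existing serial CPU implementations cannot meet the needs of real-time video processing. To address this, a GPU-accelerated real-time video super-resolution algorithm based on sparse representation is proposed. The algorithm focuses on optimizing the data-parallel processing pipeline and improving GPU resource utilization. It does so through four measures:

- a video frame queue;
- higher concurrency of GPU memory accesses;
- principal component analysis (PCA) for dimensionality reduction;
- optimized dictionary lookup.

As a result, it runs two orders of magnitude faster than the existing serial CPU algorithm and reaches 33 frames per second in a video playback test at a display resolution of 669×546.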

3.

A fast Hough Transform algorithm for straight lines detection in an image using GPU parallel computing with CUDA-C

R. Yam-Uicab, J. L. Lopez-Martinez, J. A. Trejo-Sanchez, H. Hidalgo-Silva, S. Gonzalez-Segura 《The Journal of Supercomputing》 2017, 73(11): 4823-4842

The Hough Transform (HT) is a digital image processing method for detecting shapes, and it has many uses today. A drawback of the method is its high computational cost when executed sequentially, particularly on a single processor. This article presents an optimized HT algorithm for detecting straight lines in an image. The optimization combines two techniques: a decomposition of the input image recently proposed for the central processing unit (CPU), and the technique known as segment decomposition. Optimized algorithms improve execution times significantly. In this paper, the optimization is implemented in parallel using graphics processing unit (GPU) programming. This reduces total run time, with performance more than 20 times better than the sequential method and up to 10 times better than the recently proposed implementation. Additionally, we introduce the concept of Performance Ratio to quantify how much the GPU outperforms the CPUs.
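The abstract does not spell out the voting step itself. For reference, here is a minimal sketch of the textbook line Hough transform in normal form, ρ = x cos θ + y sin θ. It is not the paper's image- or segment-decomposition scheme. The function names and the 1° angular resolution (`n_theta=180`) are assumptions for illustration.

```python
import math

def hough_lines(points, width, height, n_theta=180):
    """Accumulate votes in (rho, theta) space for edge pixels (x, y).

    Uses the normal form rho = x*cos(theta) + y*sin(theta) with theta in
    [0, pi). Each pixel votes independently of the others, which is what a
    GPU version parallelises (e.g. one thread per pixel with atomic adds).
    """
    max_rho = int(math.ceil(math.hypot(width, height)))
    thetas = [math.pi * t / n_theta for t in range(n_theta)]
    # rows are indexed by rho + max_rho so that negative rho values fit
    acc = [[0] * n_theta for _ in range(2 * max_rho + 1)]
    for x, y in points:
        for t, theta in enumerate(thetas):
            rho = int(round(x * math.cos(theta) + y * math.sin(theta)))
            acc[rho + max_rho][t] += 1
    return acc, max_rho, thetas

def strongest_line(acc, max_rho, thetas):
    """Return (rho, theta) of the accumulator cell with the most votes."""
    _, r, t = max((votes, r, t)
                  for r, row in enumerate(acc)
                  for t, votes in enumerate(row))
    return r - max_rho, thetas[t]

# Usage: 100 edge pixels on the vertical line x = 5 of a 100x100 image
edge_pixels = [(5, y) for y in range(100)]
acc, max_rho, thetas = hough_lines(edge_pixels, 100, 100)
rho, theta = strongest_line(acc, max_rho, thetas)
```

All 100 pixels vote for the same cell only at θ = 0, ρ = 5, so that cell is the detected line.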

4.

Algorithmic performance studies on graphics processing units

Olaf Schenk, Matthias Christen, Helmar Burkhart 《Journal of Parallel and Distributed Computing》 2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental scientific computing kernels: sparse direct factorization and nonlinear interior-point optimization. A full re-implementation of these complex kernels is typically not feasible. We therefore identify matrix–matrix multiplication as a natural first entry point for a minimally invasive integration of GPUs. We investigate performance on the NVIDIA GeForce 8800 multicore chip, which was originally designed for intensive gaming applications. We exploit its architectural features to design an efficient GPU-parallel sparse matrix solver. We demonstrate a prototype approach that leverages the bandwidth and computing power of GPUs for these matrix kernel operations. It reaches an overall performance of over 110 GFlop/s on the desktop for large matrices and over 38 GFlop/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.

5.

Large-scale paralleled sparse principal component analysis

W. Liu, H. Zhang, D. Tao, Y. Wang, K. Lu 《Multimedia Tools and Applications》 2016, 75(3): 1481-1493

Principal component analysis (PCA) is a statistical technique commonly used in multivariate data analysis. However, PCA results can be difficult to interpret and explain, since the principal components (PCs) are linear combinations of all the original variables. Sparse PCA (SPCA) aims to balance statistical fidelity and interpretability by approximating sparse PCs whose projections capture the maximal variance of the original data. In this paper we present an efficient parallel SPCA method using graphics processing units (GPUs), which can process large blocks of data in parallel. Specifically, we construct parallel GPU implementations of the four optimization formulations of the generalized power method for SPCA (GP-SPCA), one of the most efficient and effective SPCA approaches. The parallel GPU implementation of GP-SPCA (using CUBLAS) is up to eleven times faster than the corresponding CPU implementation (using CBLAS). It is also up to 107 times faster than a MatLab implementation. Extensive comparative experiments on several real-world datasets confirm that SPCA offers a practical advantage.

6.

GPU-based parallel program design and memory optimization for remote sensing image registration

Zhou Haifang, Zhao Jin 《Journal of Computer Research and Development》 2012, (Z1): 281-286

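Remote sensing image registration is an important processing step in remote sensing applications. As the scale of remote sensing image data and the computational complexity of registration algorithms grow, registration faces a processing-speed challenge. In recent years, GPU computing power has improved greatly, and GPUs have developed rapidly for general-purpose computing. This paper combines these strengths of general-purpose GPU computing with the speed problem faced by remote sensing image registration, and studies GPU-accelerated registration algorithms. The registration algorithm based on mutual information and wavelet decomposition, which is computationally intensive and highly accurate, is selected for GPU parallel design, and a GPU parallel design model is proposed. Memory-level optimization strategies commonly used in GPU programs are also applied to the registration program. Experiments are conducted on an nVIDIA Tesla M2050 GPU using the CUDA (compute unified device architecture) programming language. The results show that the proposed parallel design model and memory-level optimization strategies are well suited to remote sensing image registration, with a maximum speedup of 19.9×. The study indicates that general-purpose GPU computing has broad application prospects in remote sensing image processing.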
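The abstract names a mutual-information criterion but not how it is estimated. A common estimator, sketched here under that assumption, works in two steps. First, it builds a joint histogram of quantized intensities from the reference image and the transformed image. Second, it evaluates MI = Σ p(x,y) log₂(p(x,y) / (p(x)p(y))). A registration search would then try candidate transforms and keep the one with the highest MI. The wavelet-decomposition stage is not shown, and `mutual_information` is a hypothetical helper name.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in bits) between two equally sized sequences of
    quantised intensities, estimated from their joint histogram:
    MI = sum_{x,y} p(x,y) * log2(p(x,y) / (p(x) * p(y))).
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and equally sized")
    n = len(a)
    joint = Counter(zip(a, b))  # joint histogram: the part a GPU parallelises
    pa = Counter(a)             # marginal histogram of the reference image
    pb = Counter(b)             # marginal histogram of the moving image
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * math.log2(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi

# Usage: identical images share 1 bit here; independent ones share none
same = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
indep = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```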

7.

Optimization of minimum volume constrained hyperspectral image unmixing on CPU–GPU heterogeneous platform

Zebin Wu, Jianjun Liu, Shun Ye, Le Sun, Zhihui Wei 《Journal of Real-Time Image Processing》 2018, 15(2): 265-277

Hyperspectral unmixing is essential for efficient hyperspectral image processing. Nonnegative matrix factorization based on a minimum volume constraint (MVC-NMF) is one of the most widely used methods for unsupervised unmixing of hyperspectral images without the pure-pixel assumption. However, the MVC-NMF model is unstable, and the traditional solution based on the projected gradient algorithm (PG-MVC-NMF) converges slowly and with low accuracy. In this paper, a novel parallel method is proposed for minimum volume constrained hyperspectral image unmixing on a CPU–GPU heterogeneous platform. First, an optimized unmixing model, minimum logarithmic volume regularized NMF, is introduced. It is solved using a second-order approximation of the function and the alternating direction method of multipliers (SO-MVC-NMF). Then, a parallel algorithm for the optimized MVC-NMF (PO-MVC-NMF) is proposed for the CPU–GPU heterogeneous platform. It takes advantage of the parallel processing capabilities of GPUs and the logic control abilities of CPUs. Experimental results on both simulated and real hyperspectral images indicate that the proposed algorithm is more accurate and robust than the traditional PG-MVC-NMF. PO-MVC-NMF also achieves a total speedup of more than 50 times over PG-MVC-NMF.

8.

Compressed sensing signal recovery based on neurodynamic optimization

Xiong Fei, Yang Qingshan 《Application Research of Computers》 2015, 32(8)

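To recover sparse signals accurately and in real time, a compressed sensing signal recovery method based on neurodynamic optimization is proposed. A recurrent neural network (RNN) model is introduced to solve the ℓ1-norm minimization problem, and the sparse signal is recovered by computing the steady-state solution of the RNN. Comparative tests show that the proposed method needs the fewest measurements to recover sparse signals. It also extends to recovering compressed images, where it achieves a higher signal-to-noise ratio. The RNN model is well suited to parallel implementation, and GPU parallel computing yields a speedup of more than 100×. Compared with traditional methods, the proposed method not only recovers signals more accurately but also offers stronger real-time processing capability.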

9.

Design and implementation of a GPU-based parallel IHS wavelet fusion algorithm for remote sensing images

Xu Rulin, Zhou Haifang, Jiang Jingfei 《Computer Engineering & Science》 2012, 34(8): 135-141

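Remote sensing image fusion is an important processing step in remote sensing applications. As remote sensing data volumes and the computational complexity of fusion algorithms grow, image fusion faces a processing-speed challenge. In recent years, GPU computing power has improved greatly, and general-purpose GPU applications have developed rapidly. Based on the GPU programming model and hardware characteristics, this paper studies parallel acceleration of remote sensing image fusion in depth. It proposes a parallel mapping model suited to the fusion execution flow. The computationally intensive, highly accurate IHS-enhanced wavelet fusion algorithm is selected for GPU parallel design. It is optimized for mainstream GPU platforms in data transfer, loop optimization, and thread design. Experiments are conducted on an nVIDIA GTX 460 GPU. The results show that the parallel mapping model and optimization strategies are well suited to remote sensing image fusion, with a maximum speedup of 114×. The study indicates that general-purpose GPU computing has broad application prospects in remote sensing image processing.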
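The IHS part of the fusion can be sketched in its additive ("fast IHS") form. Here, replacing the intensity I = (R + G + B)/3 with the panchromatic band is equivalent to adding Pan − I to each band. This sketch makes two simplifications:

- It assumes co-registered, equally sized bands given as flat lists.
- It omits the wavelet stage of the paper's IHS-enhanced wavelet method.

`fast_ihs_fusion` is a hypothetical name.

```python
def fast_ihs_fusion(r, g, b, pan):
    """Fuse three multispectral bands with a panchromatic band, per pixel.

    Additive IHS form: the intensity I = (R + G + B) / 3 is replaced by
    the pan value, which for the linear IHS transform is the same as
    adding (pan - I) to every band. Pixels are independent, so a GPU
    version can use one thread per pixel.
    """
    fused_r, fused_g, fused_b = [], [], []
    for rv, gv, bv, p in zip(r, g, b, pan):
        delta = p - (rv + gv + bv) / 3.0  # pan minus current intensity
        fused_r.append(rv + delta)
        fused_g.append(gv + delta)
        fused_b.append(bv + delta)
    return fused_r, fused_g, fused_b

# Usage: two pixels; the fused intensity equals the pan value
fr, fg, fb = fast_ihs_fusion([30, 60], [60, 90], [90, 120], [70, 80])
```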

10.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou, Yong Dou, Song Guo, Rongchun Li, Lin Deng 《Concurrency and Computation》 2014, 26(16): 2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. Many sparse Cholesky factorization algorithms have been proposed that exploit the architectural features of various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential for significant acceleration relative to desktop performance. However, sparse Cholesky factorization on GPUs has not been explored sufficiently, because an efficient implementation is complex and GPU utilization tends to be low. In this paper, we present a new approach for sparse Cholesky factorization on GPUs with three components:

- an organization of the sparse matrix supernode data structure for the GPU;
- a queue-based approach for generating and scheduling GPU tasks with dense linear algebra operations;
- a subtree-based parallel method for multi-GPU systems.

These approaches increase GPU utilization and thus substantially reduce computation time. We compare them with existing parallel solvers on problems arising from practical applications. The experimental results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs:

- Relative to a highly optimized parallel algorithm on a 12-core node, we obtain speedups of 1.59× to 2.31× using one GPU and 1.80× to 3.21× using two GPUs.
- Relative to a state-of-the-art supernodal solver for CPU-GPU heterogeneous platforms, we obtain speedups of 1.52× to 2.30× using one GPU and 2.15× to 2.76× using two GPUs.
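Supernodal methods group columns with identical sparsity structure into supernodes, so that most of the work becomes dense linear algebra, which GPUs handle well. The following is a minimal sketch of that dense building block only. It has no supernode data structure, task queue, or multi-GPU scheduling. It shows a column-by-column dense Cholesky factorization A = LLᵀ. `cholesky` is a hypothetical helper, and a production code would call an optimized library routine instead.

```python
import math

def cholesky(a):
    """Return lower-triangular L with A = L L^T for a symmetric positive
    definite matrix given as a list of rows. Column j is computed from the
    already finished columns 0..j-1 of L.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: what remains of A[j][j] after earlier columns
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# Usage
L = cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]])
# L == [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```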

11.

Accelerating 2D orthogonal matching pursuit algorithm on GPU

Yuan Dai, Dongjian He, Yong Fang, Long Yang 《The Journal of Supercomputing》 2014, 69(3): 1363-1381

The two-dimensional orthogonal matching pursuit (2D-OMP) algorithm is an extension of one-dimensional OMP (1D-OMP). When applied to 2D sparse signal recovery, its complexity and memory usage are lower than those of 1D-OMP. However, the major shortcoming of 2D-OMP remains its long computing time. To overcome this, we develop a novel parallel design strategy for the 2D-OMP algorithm on a graphics processing unit (GPU). We first analyze the complexity of 2D-OMP and show that the bottlenecks lie in matrix inversion and projection. We reduce the cost of the original matrix inversion with a matrix-inverse update strategy that outperforms traditional methods, after which projection becomes the most time-consuming module. Hence, a parallel matrix–matrix multiplication based on a tiling strategy is used to accelerate the projection on the GPU. Moreover, a fast matrix–vector multiplication, a parallel reduction algorithm, and other parallel techniques further boost the performance of 2D-OMP on the GPU. For a 256 × 256 image, experimental results show speedups over the original C code compiled with the O2 optimization option of:

- 17× to 41× with a sensing matrix of size 128 × 256;
- 24× to 62× with a sensing matrix of size 176 × 256.

Larger images would yield even higher speedups.
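A generic blocked matrix multiplication illustrates the tiling idea the abstract uses for the projection step. This is a sequential sketch assuming square tiles of a hypothetical size `tile`, not the paper's CUDA kernel. On a GPU, each thread block would stage one tile of each operand in shared memory before accumulating.

```python
def tiled_matmul(A, B, tile=2):
    """C = A B for matrices given as lists of rows, computed tile by tile.

    The three outer loops walk over tiles of C and over the shared inner
    dimension; the three inner loops multiply one pair of tiles.
    """
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("inner dimensions do not match")
    C = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # accumulate tile A[i0.., k0..] times tile B[k0.., j0..]
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

# Usage: 3x3 matrices with a tile size that does not divide 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = tiled_matmul(A, B, tile=2)
```

The tile loop bounds use `min(...)`, so matrix sizes that are not multiples of the tile size are handled.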

12.

Optimization of the Kmeans algorithm based on OpenCL

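Wu Zailong, Zhang Yunquan, Xu Jianliang, Jia Haipeng, Yan Shengen, Xie Qingchun 《Journal of Frontiers of Computer Science and Technology》 2014, (10): 1162-1176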

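Kmeans is a typical clustering algorithm in unsupervised machine learning and an important method for partitioning and grouping known data sets. It is widely used in image processing, data mining, and biology. As data sizes in practical applications keep growing, Kmeans must deliver higher performance. Taking the architectural differences among hardware platforms fully into account, this paper systematically studies the key techniques for implementing and optimizing Kmeans on GPU and APU platforms:

- efficient on-chip global synchronization;
- redundant computation to reduce the number of global synchronizations;
- thread task remapping;
- local memory reuse.

Together these achieve high performance and performance portability of Kmeans across hardware platforms. Including data transfer time, the optimized algorithm achieves speedups over the CPU version of:

- 136.975× to 170.333× on an AMD HD7970 GPU;
- 22.2365× to 24.3865× on an AMD A10-5800K APU.

These results verify the effectiveness of the optimization methods and the portability across platforms.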
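The following is a minimal sequential sketch of Lloyd's k-means iteration for readers unfamiliar with the algorithm. The comments mark where the abstract's GPU/APU concerns arise: per-point assignment runs in parallel, and a global synchronization precedes each centroid update. The function name, the fixed initial centroids, and the iteration cap are assumptions.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's k-means on lists of coordinate tuples.

    Returns (centroids, labels). Stops early when the centroids no longer
    change. Empty clusters keep their previous centroid.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # assignment step: independent per point (one work-item per point)
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        # update step: a reduction over all points, which on a GPU needs a
        # global synchronisation after the assignment step
        new = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new.append(tuple(sum(xs) / len(members) for xs in zip(*members)))
            else:
                new.append(centroids[c])
        if new == centroids:
            break
        centroids = new
    return centroids, labels

# Usage: two well-separated pairs of points
cents, labs = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)])
```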

13.

Multiplicative noise removal via adaptive learned dictionaries and TV regularization

《Digital Signal Processing》 2016

Multiplicative noise removal is a key problem in image processing. Much of the literature on this subject consists of total variation (TV)-based and wavelet-based methods. Recently, however, sparse representation of images has been shown to be an efficient approach to image restoration. TV regularization is effective at restoring cartoon images, while learned dictionaries are well adapted to textures and some tricky structures. Following this idea, we propose an approach that combines the advantages of sparse representation over learned dictionaries with TV regularization. The method removes multiplicative noise by minimizing an energy functional with three terms: a data-fidelity term, a sparse representation prior over adaptively learned dictionaries, and a TV regularization term. The optimization problem can be solved efficiently with the split Bregman algorithm. Experimental results show that the proposed model outperforms many recent methods in peak signal-to-noise ratio, mean absolute-deviation error, mean structure similarity, and subjective visual quality.

14.

Fast GPU-based level set image segmentation

Wu Zhongle, Wang Zunliang, Luo Limin 《Journal of Image and Graphics》 2004, 9(6): 679-683

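The level set method is an important approach to image segmentation, but it is computationally expensive and often cannot meet real-time requirements. This paper presents an accelerated level set algorithm on the new generation of programmable graphics processing units (GPUs). It first describes how to perform grid-based linear operations and finite-difference PDE computations with fragment programs on the GPU, mapping the discretized operators of the level set method onto the GPU. The GPU, as a stream processor, offers fast memory access and parallel computing capability. In addition, displaying the evolving level set no longer requires transferring data from the CPU to the GPU. For these reasons, both computation speed and interactive display improve considerably. A two-dimensional level set operator that is independent of the initial state is implemented and tested for image segmentation, and its results and performance are compared. The results show that the method is faster.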

15.

Ternary Sparse Matrix Representation for Volumetric Mesh Subdivision and Processing on GPUs

J. S. Mueller-Roemer, C. Altenhofen, A. Stork 《Computer Graphics Forum》 2017, 36(5): 59-69

In this paper, we present a novel volumetric mesh representation suited for parallel computing on modern GPU architectures. The data structure is based on a compact, ternary sparse matrix storage of boundary operators. Boundary operators correspond to the first-order top-down relations of k-faces to their (k − 1)-face facets. The compact, ternary matrix storage format is based on compressed sparse row matrices with signed indices and allows efficient parallel computation of indirect and bottom-up relations. This representation is then used to implement several parallel volumetric mesh algorithms, including Laplacian smoothing and volumetric Catmull-Clark subdivision. We compare these algorithms with their counterparts based on OpenVolumeMesh. For sufficiently large meshes, we achieve speedups from 3× to 531× while reducing memory consumption by up to 36%.
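The abstract specifies CSR storage with signed indices for ternary (−1/0/+1) boundary operators but not the exact encoding. The sketch below assumes that a negative orientation is stored as the bitwise complement of the column index, so that index 0 can also carry a sign. `TernaryCSR`, `encode`, `decode`, and `compose_row` are hypothetical names. Composing the two boundary operators of one triangle checks the identity ∂∂ = 0.

```python
def encode(col, sign):
    """Fold an orientation sign into a column index (assumed encoding:
    col for +1, bitwise complement ~col = -col - 1 for -1)."""
    return col if sign > 0 else ~col

def decode(entry):
    """Unpack a stored entry into (column, sign)."""
    return (entry, 1) if entry >= 0 else (~entry, -1)

class TernaryCSR:
    """CSR matrix whose non-zeros are all +1 or -1: the value array is
    dropped and each sign lives in its signed column index."""

    def __init__(self, rows):
        # rows: one list of (column, sign) pairs per matrix row
        self.row_ptr = [0]
        self.cols = []
        for row in rows:
            self.cols.extend(encode(c, s) for c, s in row)
            self.row_ptr.append(len(self.cols))

    def row(self, i):
        """Decoded (column, sign) pairs of row i."""
        return [decode(e) for e in self.cols[self.row_ptr[i]:self.row_ptr[i + 1]]]

def compose_row(outer, inner, i):
    """Row i of the matrix product outer * inner, as {column: value}."""
    out = {}
    for k, s_outer in outer.row(i):
        for j, s_inner in inner.row(k):
            out[j] = out.get(j, 0) + s_outer * s_inner
    return {j: v for j, v in out.items() if v != 0}

# Usage: one triangle with vertices 0, 1, 2 and edges
# e0 = 0->1, e1 = 1->2, e2 = 0->2 (boundary of edge a->b is b - a)
d1 = TernaryCSR([[(0, -1), (1, 1)], [(1, -1), (2, 1)], [(0, -1), (2, 1)]])
# boundary of the face (0, 1, 2) is e0 + e1 - e2
d2 = TernaryCSR([[(0, 1), (1, 1), (2, -1)]])
boundary_of_boundary = compose_row(d2, d1, 0)  # -> {}
```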

16.

Fast super-resolution reconstruction of multiple images based on sparse representation

Yang Biao, Di Miao 《Transducer and Microsystem Technologies》 2018, (1): 43-45

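Image super-resolution reconstruction (SRR) based on sparse representation improves reconstruction quality but generally requires heavy computation and long running times. To address this, the method works in three steps:

1. Obtain the sparse representation with a particle swarm optimization (PSO) based sparse representation algorithm.
2. Fuse the sparse coefficients of the multiple images.
3. Reconstruct the high-resolution image from the fused coefficients.

Experimental results show that the method reconstructs faster and with higher quality.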

17.

Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters

Lilia Ziane Khodja, Raphaël Couturier, Arnaud Giersch, Jacques M. Bahi 《The Journal of Supercomputing》 2014, 69(1): 200-224

In this paper, we aim to exploit the computing power of a graphics processing unit (GPU) cluster to solve large sparse linear systems. We implement a parallel generalized minimal residual (GMRES) iterative method using the Compute Unified Device Architecture (CUDA) programming language and the MPI parallel environment. The experiments show that a GPU cluster is more efficient than a CPU cluster. To optimize performance, we use a compressed storage format for the sparse vectors and hypergraph partitioning. These solutions improve the spatial and temporal locality of the data shared between the computing nodes of the GPU cluster.

18.

Sparse representation classification using a CNN and a PCA-constrained optimization model

Shi Liang, Na Tian, Song Xiaoning, Zhu Yuquan 《Journal of Image and Graphics》 2019, 24(4): 503-512

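Objective: Traditional sparse representation classification methods use high-dimensional data to improve sparse classification ability and have long attracted wide attention. However, they ignore the information redundancy between test and training samples, which leads to uncertain classification decisions. This paper therefore proposes a sparse representation classification method based on a convolutional neural network and a PCA-constrained optimization model (EPCNN-SRC).

Method: The method has three steps:

1. A deep convolutional neural network is applied, and the feature maps at its output layer are extracted to represent robust face features of the original samples.
2. On these features, a PCA (principal component analysis) constrained optimization model represents the test sample linearly, and the corresponding PCA coefficients are computed.
3. The sparse representation classification algorithm reconstructs the test sample from the PCA coefficients of each class of training samples to complete classification.

Result: Compared with several typical sparse classification methods, the proposed model achieves better classification performance. On the AR, FERET, FRGC, and LFW face databases, with only one training sample per class, EPCNN-SRC reaches recognition rates of 96.92%, 96.15%, 86.94%, and 42.44%, respectively. All of these are higher than those of traditional representation-based classification methods, which verifies the effectiveness of the algorithm. The method also makes the sparse representation of test samples more robust. While maintaining the recognition rate, it effectively reduces time complexity: its running time on the FERET database is 4.92 s, lower than that of several traditional methods.

Conclusion: The method combines deep learning features with PCA. It achieves good recognition accuracy and good robustness for sparse classification, with a notable advantage on small-sample problems.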
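The final step of sparse representation classification assigns the test sample to the class whose training samples reconstruct it best. Below is a minimal sketch of that classic residual rule. It assumes the representation coefficients have already been computed; EPCNN-SRC computes them from CNN features through its PCA-constrained model, which is not reproduced here. `src_decide` is a hypothetical name.

```python
import math

def src_decide(y, columns, labels, coeffs):
    """Classic SRC decision rule.

    y       -- test sample (list of floats)
    columns -- training samples, one list per dictionary column
    labels  -- class label of each column
    coeffs  -- representation coefficient of each column (assumed given)
    Returns the label whose columns alone reconstruct y with the smallest
    Euclidean residual, together with that residual.
    """
    best_label, best_res = None, math.inf
    for c in sorted(set(labels)):
        recon = [0.0] * len(y)
        for col, lab, a in zip(columns, labels, coeffs):
            if lab == c:  # keep only this class's coefficients
                for d, v in enumerate(col):
                    recon[d] += a * v
        res = math.sqrt(sum((yi - ri) ** 2 for yi, ri in zip(y, recon)))
        if res < best_res:
            best_label, best_res = c, res
    return best_label, best_res

# Usage: two classes with one training column each
label, residual = src_decide([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]], ["a", "b"], [0.9, 0.1])
```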

19.

A high-performance GPU-based optimization for sparse convolutional neural networks

Fang Cheng, Xing Zuocheng, Chen Xuhao, Zhang Yang 《Computer Engineering & Science》 2018, 40(12): 2103-2111

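The convolutional neural network (CNN) is an important branch of neural networks and is better suited than other neural network methods to learning and representing image features. As CNNs continue to develop, they face more challenges. Their parameter scale keeps growing, which makes their computational demand very large. Consequently, many methods have emerged to compress CNNs. However, compressed CNN models often contain many sparse data structures, and these hurt CNN performance on GPUs. To solve this problem, a direct sparse convolution algorithm is adopted to accelerate GPU processing of sparse data. Following the algorithm's characteristics, the convolution is converted into inner products between sparse and dense vectors and implemented on the GPU. The optimization scheme has two parts:

- It exploits data sparsity and network structure to assign threads for task scheduling.
- It uses data locality to manage memory replacement.

As a result, the GPU can still process convolutional layers efficiently in sparse convolutional neural networks (SCNNs). Speedups on AlexNet, GoogleNet, and ResNet, respectively, are:

- versus the cuBLAS implementation: 1.07× to 1.23×, 1.17× to 3.51×, and 1.32× to 5.00×;
- versus the cuSPARSE implementation: 1.31× to 1.42×, 1.09× to 2.00×, and 1.07× to 3.22×.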
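The conversion of convolution into sparse-by-dense inner products can be sketched for a single channel. Only the nonzero kernel weights are kept, together with their offsets. Each output pixel is then the inner product of that sparse list with the dense input window. This is a minimal sequential sketch assuming stride 1, no padding, and one input and output channel. It is not the paper's GPU kernel, and `direct_sparse_conv2d` is a hypothetical name.

```python
def direct_sparse_conv2d(image, kernel):
    """Valid, stride-1 2-D cross-correlation (the CNN convention) that
    touches only the non-zero kernel weights."""
    kh, kw = len(kernel), len(kernel[0])
    # compress the kernel: keep (dy, dx, w) for non-zero weights only
    nonzeros = [(dy, dx, w) for dy, row in enumerate(kernel)
                for dx, w in enumerate(row) if w != 0]
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            # sparse (weights) times dense (input window) inner product
            out[y][x] = sum(w * image[y + dy][x + dx] for dy, dx, w in nonzeros)
    return out

# Usage: a kernel with 2 of 4 weights non-zero
img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
out = direct_sparse_conv2d(img, [[1, 0], [0, -1]])  # -> [[-4, -4], [-4, -4]]
```

The work per output pixel scales with the number of nonzero weights rather than the full kernel size, which is the source of the speedup on pruned layers.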

20.

Optimized Schwarz method without overlap for the gravitational potential equation on cluster of graphics processing unit

Frédéric Magoulès, Abal-Kassim Cheik Ahamed, Roman Putanowicz 《International Journal of Computer Mathematics》 2016, 93(6): 955-980

Many engineering and scientific problems require solving boundary value problems for partial differential equations or systems of them. In most cases, the only practical way to obtain a solution with the desired precision in acceptable time is to harness parallel processing. In this paper, we present some effective applications of parallel processing based on a hybrid CPU/GPU domain decomposition method. Within the family of domain decomposition methods, the so-called optimized Schwarz methods have been shown to converge better than classical Schwarz methods. The price for this feature is the need to transfer more physical information across subdomain interfaces. Finite element discretization of the subproblem in each subdomain produces large systems of linear algebraic equations, and Krylov methods are often a good choice for solving them. The overall efficiency of such methods depends on an efficient sparse matrix–vector product. Approaches that use a graphics processing unit (GPU) instead of a central processing unit (CPU) for this task therefore look very promising. In this paper, we discuss an effective implementation of the algebraic operations of iterative Krylov methods on the GPU. To ensure good performance of the non-overlapping Schwarz method, we propose using optimized conditions obtained by a stochastic technique based on the covariance matrix adaptation evolution strategy. We demonstrate the performance, robustness, and accuracy of the proposed approach by solving the gravitational potential equation for data acquired from the geological survey of the Chicxulub crater.
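The sparse matrix–vector product is the operation on which the abstract says Krylov efficiency depends. It is usually computed on compressed sparse row (CSR) data. A minimal sequential sketch with the hypothetical helper `csr_matvec` shows the row-wise structure that a GPU kernel would parallelize.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """y = A x for A stored in compressed sparse row (CSR) format.

    values[k] and col_idx[k] hold the k-th stored non-zero; row i owns the
    entries row_ptr[i] .. row_ptr[i + 1] - 1. Rows are independent, so a
    GPU kernel can assign one thread (or one warp) per row.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Usage: A = [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0])  # -> [7.0, 6.0, 17.0]
```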