
1.

Solving time-invariant differential matrix Riccati equations using GPGPU computing

Jesús Peinado Pedro Alonso Javier Ibáñez Vicente Hernández Murilo Boratto 《The Journal of supercomputing》2014,70(2):623-636

Differential matrix Riccati equations (DMREs) make it possible to model many physical systems that appear in different branches of science and that, in some cases, involve very large problem sizes. In this paper, we propose an adaptive algorithm for time-invariant DMREs that uses a piecewise-linearized approach based on the Padé approximation of the matrix exponential. The algorithm relies on intensive use of matrix products and linear system solutions, so it can exploit the large computational capability that modern graphics processing units (GPUs) offer for these types of operations through the CUBLAS and CULATOOLS libraries (general-purpose GPU), which are efficient implementations of the BLAS and LAPACK libraries, respectively, for NVIDIA© GPUs. A thorough analysis showed that some parts of the proposed algorithm can be carried out in parallel, which makes it possible to leverage the two GPUs available in many current compute nodes. In addition, our algorithm can be used by any interested researcher through a friendly MATLAB© interface.

2.

Fast clustering of data streams using graphics processors  Total citations: 17, self-citations: 1, citations by others: 16

Cao Feng, Zhou Aoying 《Journal of Software》2007,18(2):291-302

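In a data stream environment, a clustering algorithm needs not only high clustering quality but also real-time processing speed. This paper therefore proposes a family of fast clustering methods based on the graphics processing unit (GPU), including a basic K-means-based clustering method, GPU-based data stream clustering, and a method for analysing the evolution of data stream clusters. What these methods have in common is that they make full use of the GPU's processing power and pipeline characteristics. Unlike earlier data stream clustering algorithms, each of which had its own framework, these GPU-based algorithms share a single framework and offer several clustering analysis functions, providing a unified platform for data stream clustering analysis. The analysis shows that the core operations of data stream clustering are distance computation and comparison. Based on this observation, the GPU's fragment vector processing capability is used for the distance computation. Performance experiments were run on a PC with a Pentium IV 3.4G CPU and an NVIDIA GeForce 6800 GT graphics card. Analysis and experimental results show that the GPU-based data stream clustering algorithms are on average 7 times faster than traditional CPU algorithms, and thus provide good support for high-speed data stream applications.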
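The abstract identifies distance computation and comparison as the core operations of clustering. As a rough illustration only, the following CPU-side Python sketch shows that pair of operations in a K-means-style assignment step. The function name `nearest_centroid` and the sample data are hypothetical and are not taken from the paper, which performs this step on the GPU.

```python
import math

def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    best_index, best_dist = -1, math.inf
    for i, c in enumerate(centroids):
        d = math.dist(point, c)      # distance computation
        if d < best_dist:            # comparison
            best_index, best_dist = i, d
    return best_index

# Hypothetical data: two centroids and three incoming stream points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]
labels = [nearest_centroid(p, centroids) for p in points]
```

Each point is handled independently of the others, which is the property that makes this step a natural fit for data-parallel hardware.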

3.

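Research on general-purpose computation on graphics processors in hierarchical clustering algorithms*  Total citations: 1, self-citations: 0, citations by others: 1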

Li Lin, Li Kenli, Zhu Yali 《Application Research of Computers》2008,25(8):2319-2321

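ROCK is a hierarchical clustering method that measures similarity by the number of common links between data points, and it clusters high-dimensional, sparse categorical data effectively. Computing its adjacency matrix is the key step that determines the time complexity; the strong floating-point and parallel computing capability of the graphics processing unit (GPU) is applied to this step, while the remaining steps are carried out by the CPU. The GPU-based ROCK algorithm was tested experimentally on hardware with an AMD 64 3500+ CPU and an NVIDIA GeForce 6800 GT graphics card, and it ran faster than a CPU-only implementation. The improved hierarchical clustering algorithm is suited to real-time, efficient clustering of large amounts of data in a data stream environment.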
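To make the notion of "common links" concrete, here is a minimal Python sketch of a link computation in the style of ROCK. It assumes Jaccard similarity on sets of categorical values and a neighbour threshold `theta`; the helper names, the threshold value and the sample records are illustrative assumptions, not details from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def link_matrix(records, theta):
    """links[i][j] = number of common neighbours of records i and j."""
    n = len(records)
    # Adjacency matrix: 1 if two distinct records have similarity >= theta.
    adj = [[1 if i != j and jaccard(records[i], records[j]) >= theta else 0
            for j in range(n)] for i in range(n)]
    # Counting common neighbours is the matrix product adj x adj.
    return [[sum(adj[i][k] * adj[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical categorical records.
records = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"x", "y"}]
links = link_matrix(records, theta=0.5)
```

The adjacency and link computations are dense, regular matrix operations, which is consistent with the abstract's choice to offload this step to the GPU.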

4.

SDP-based approximation of stabilising solutions for periodic matrix Riccati differential equations

Sergei V. Gusev Anton S. Shiriaev Leonid B. Freidovich 《International journal of control》2016,89(7):1396-1405

Numerically finding stabilising feedback control laws for linear systems of periodic differential equations is a nontrivial task with no known reliable solutions. The most successful method requires solving matrix differential Riccati equations with periodic coefficients. All previously proposed techniques for solving such equations involve numerical integration of unstable differential equations and consequently fail whenever the period is too large or the coefficients vary too much. Here, a new method for numerical computation of stabilising solutions for matrix differential Riccati equations with periodic coefficients is proposed. Our approach does not involve numerical solution of any differential equations. The approximation for a stabilising solution is found in the form of a trigonometric polynomial, the matrix coefficients of which are found by solving a specially constructed finite-dimensional semidefinite programming (SDP) problem. This problem is obtained using the maximality property of the stabilising solution of the Riccati equation for the associated Riccati inequality and a sampling technique. Our previously published numerical comparisons with other methods show that for a class of problems only this technique provides a working solution. Asymptotic convergence of the computed approximations to the stabilising solution is proved below under the assumption that certain combinations of the key parameters are sufficiently large. Although the rate of convergence is not analysed, it appeared to be exponential in our numerical studies.

5.

Fast point cloud smoothing based on graphics processors

Zhang Lianwei, Liu Daxue, Liu Xiaolin, Li Yan, Xu Xin, He Hangen 《Computer Engineering & Science》2011,33(4):86

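Point cloud smoothing is an important research topic in digital geometry processing of point models. Applications with massive data sets need not only high smoothing quality but also fast processing. Traditional CPU-based smoothing algorithms process the sample points serially, which incurs a large time cost. This paper proposes a fast point cloud smoothing algorithm suited to the graphics processor: the covariance matrices at many sample points are organised into one large sparse matrix, which is stored as a texture image; a pixel program then uses the parallel computing power of the graphics processor to solve iteratively for the smallest eigenvalue and corresponding eigenvector of each covariance matrix, from which the speed and direction of smoothing are computed. Experiments were run on a platform with a GeForce 8600GTS graphics card. The results show that the GPU-based point cloud smoothing algorithm is significantly more efficient than the CPU-based algorithm, and thus provides good support for fast point cloud processing.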

6.

A new approach for sparse matrix vector product on NVIDIA GPUs

F. Vázquez J. J. Fernández E. M. Garzón 《Concurrency and Computation》2011,23(8):815-826

The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devising data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared on the scene. This work proposes and evaluates a new implementation of SpMV for NVIDIA GPUs based on a new format, ELLPACK‐R, that allows storage of the sparse matrix in a regular manner. A comparative evaluation against a variety of storage formats previously proposed has been carried out based on a representative set of test matrices. The results show that, although the performance strongly depends on the specific pattern of the matrix, the implementation based on ELLPACK‐R achieves higher overall performance. Moreover, a comparison with standard state‐of‐the‐art superscalar processors reveals that significant speedup factors are achieved with GPUs. Copyright © 2010 John Wiley & Sons, Ltd.
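The sketch below illustrates the storage idea in plain Python, on the understanding that ELLPACK-R is the padded ELLPACK layout plus an extra array of actual row lengths (called `rl` here). The function names and the row-major list layout are assumptions made for readability; a GPU implementation would typically lay the arrays out for coalesced access and assign one thread per row.

```python
def to_ellpack_r(dense):
    """Convert a dense matrix (list of rows) to padded ELLPACK-R style arrays."""
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    rl = [len(r) for r in rows]               # actual nonzeros per row
    width = max(rl, default=0)                # padded row width
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    cols = [[j for j, _ in r] + [0] * (width - len(r)) for r in rows]
    return values, cols, rl

def spmv_ellpack_r(values, cols, rl, x):
    """Compute y = A x; rl[i] stops each row loop before the padding."""
    y = []
    for i in range(len(rl)):                  # one GPU thread per row, conceptually
        acc = 0.0
        for k in range(rl[i]):
            acc += values[i][k] * x[cols[i][k]]
        y.append(acc)
    return y

A = [[4.0, 0.0, 1.0],
     [0.0, 2.0, 0.0],
     [3.0, 0.0, 5.0]]
values, cols, rl = to_ellpack_r(A)
y = spmv_ellpack_r(values, cols, rl, [1.0, 2.0, 3.0])
```

The regular, padded arrays give every row the same stride, while the row-length array avoids wasted work on padding entries.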

7.

Iterative solution of algebraic matrix Riccati equations in open loop Nash games

Teresa Paula Azevedo‐Perdicoúlis Gerhard Jank 《International Journal of Robust and Nonlinear Control》2005,15(2):55-62

In this note we study a fixed point iteration approach to solve algebraic Riccati equations as they appear in general two player Nash differential games on an infinite time horizon, where the information structure is of open loop type. We obtain conditions for existence and uniqueness of non‐negative solutions. The performance of the numerical algorithm is shown in an example. Copyright © 2005 John Wiley & Sons, Ltd.

8.

A matrix sign function method for solving dual Riccati equations

Luo Zongqian 《Control Theory & Applications》1992,9(5):551-554

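This paper presents the matrix sign function method for finding the positive definite (negative definite) stabilising (anti-stabilising) solutions of dual algebraic Riccati equations, and gives necessary and sufficient conditions for the unique existence of these solutions together with an algorithmic implementation.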
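For orientation, the matrix sign function is commonly computed with the Newton iteration below, and the stabilising solution of a standard continuous-time algebraic Riccati equation can then be read off from the sign of the associated Hamiltonian matrix. This is the textbook formulation, with symbols $A$, $G$, $Q$, $H$, $W_{ij}$ chosen here for illustration; the abstract does not state the exact equations or notation used in the paper for the dual case.

```latex
% Newton iteration for the matrix sign function
% (assumes H has no eigenvalues on the imaginary axis)
Z_0 = H, \qquad Z_{k+1} = \tfrac{1}{2}\left(Z_k + Z_k^{-1}\right), \qquad \lim_{k\to\infty} Z_k = \operatorname{sign}(H).

% Standard ARE  A^{T}X + XA - XGX + Q = 0  with Hamiltonian matrix
H = \begin{pmatrix} A & -G \\ -Q & -A^{T} \end{pmatrix}, \qquad
\operatorname{sign}(H) = \begin{pmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{pmatrix}.

% The stable invariant subspace is spanned by (I; X), so (sign(H) + I)(I; X) = 0:
\begin{pmatrix} W_{12} \\ W_{22} + I \end{pmatrix} X = -\begin{pmatrix} W_{11} + I \\ W_{21} \end{pmatrix}.
```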

9.

Exploiting graphical processing units for data‐parallel scientific applications

A. Leist D. P. Playne K. A. Hawick 《Concurrency and Computation》2009,21(18):2400-2437

Graphical processing units (GPUs) have recently attracted attention for scientific applications such as particle simulations. This is partially driven by low commodity pricing of GPUs but also by recent toolkit and library developments that make them more accessible to scientific programmers. We discuss the application of GPU programming to two significantly different paradigms: regular mesh field equations with unusual boundary conditions and graph analysis algorithms. The differing optimization techniques required for these two paradigms cover many of the challenges faced when developing GPU applications. We discuss the relevance of these application paradigms to simulation engines and games. GPUs were aimed primarily at the accelerated graphics market, but since this is often closely coupled to advanced game products it is interesting to speculate about the future of fully integrated accelerator hardware for both visualization and simulation combined. As well as reporting the speed‐up performance on selected simulation paradigms, we discuss suitable data‐parallel algorithms and present code examples for exploiting GPU features like large numbers of threads and localized texture memory. We find a surprising variation in the performance that can be achieved on GPUs for our applications and discuss how these findings relate to past known effects in parallel computing such as memory speed‐related super‐linear speed up. Copyright © 2009 John Wiley & Sons, Ltd.

10.

Performance, optimization, and fitness: Connecting applications to architectures

Mohammad A. Bhuiyan Melissa C. Smith Vivek K. Pallipuram 《Concurrency and Computation》2011,23(10):1066-1100

Recent trends involving multicore processors and graphical processing units (GPUs) focus on exploiting task‐ and thread‐level parallelism. In this paper, we analyze various aspects of the performance of these architectures, including NVIDIA GPUs and multicore processors such as Intel Xeon, AMD Opteron, and IBM's Cell Broadband Engine. The case study used in this paper is a biological spiking neural network (SNN), implemented with the Izhikevich, Wilson, Morris–Lecar, and Hodgkin–Huxley neuron models. The four SNN models have varying requirements for communication and computation, making them useful for performance analysis of the hardware platforms. We report and analyze the variation of performance with network (problem size) scaling, available optimization techniques and execution configuration. A Fitness performance model, which predicts the suitability of the architecture for accelerating an application, is proposed and verified with the SNN implementation results. The Roofline model, an existing performance model, has also been utilized to determine the hardware bottleneck(s) and attainable peak performance of the architectures. Significant speedups for the four SNN neuron models utilizing these architectures are reported; the maximum speedup of 574x was observed in our GPU implementation. Our results and analysis show that a proper match of architecture with algorithm complexity provides the best performance. Copyright © 2010 John Wiley & Sons, Ltd.
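The Roofline model mentioned above bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. A minimal Python sketch follows; the function name and the hardware figures are hypothetical and are not measurements from the paper.

```python
def roofline_gflops(peak_gflops, bandwidth_gb_s, intensity_flops_per_byte):
    """Attainable GFLOP/s = min(peak compute, bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity_flops_per_byte)

# Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
memory_bound = roofline_gflops(1000.0, 100.0, 2.0)    # low-intensity kernel
compute_bound = roofline_gflops(1000.0, 100.0, 20.0)  # high-intensity kernel
```

In this made-up configuration the first kernel is limited by memory bandwidth and the second by peak compute, which is the kind of bottleneck classification the model is used for.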

11.

Accelerating computation of Euclidean distance map using the GPU with efficient memory access

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(5):383-406

Recent graphics processing units (GPUs), which have many processing units, can be used for general purpose parallel computation, and they are widely used in this way to exploit their computing power. Although GPUs have very high memory bandwidth, their performance depends greatly on the memory access pattern. The main contribution of this paper is to present a GPU implementation of computing the Euclidean distance map (EDM) with efficient memory access. Given a two-dimensional (2D) binary image, the EDM is a 2D array of the same size such that each element stores the Euclidean distance to the nearest black pixel. In the proposed GPU implementation, we have considered many programming issues of the GPU system, such as coalesced access of global memory and shared memory bank conflicts. Concretely, by using the shared memory to transpose 2D arrays, which are temporary data stored in the global memory, most accesses to the global memory can be performed as coalesced accesses. In practice, we have implemented our parallel algorithm on the following three modern GPU systems: Tesla C1060, GTX 480 and GTX 580. The experimental results show that, for an input binary image of size 9216 × 9216, our implementation achieves a speedup factor of 54 over the sequential algorithm implementation.
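The EDM definition given in the abstract can be written down directly as a brute-force reference. The Python sketch below is only that definition, assuming that a value of 1 marks a black pixel; it is not the paper's parallel algorithm, and the function name and sample image are illustrative.

```python
import math

def euclidean_distance_map(image):
    """For each pixel, return the Euclidean distance to the nearest black pixel.

    Assumes image[r][c] == 1 marks a black pixel; brute force, O(pixels x blacks).
    """
    blacks = [(r, c) for r, row in enumerate(image)
              for c, v in enumerate(row) if v == 1]
    return [[min((math.hypot(r - br, c - bc) for br, bc in blacks),
                 default=math.inf)
             for c in range(len(row))]
            for r, row in enumerate(image)]

image = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 1]]
edm = euclidean_distance_map(image)
```

A reference like this is mainly useful for checking the output of a faster implementation on small inputs.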

12.

Accelerating incompressible flow computations with a Pthreads-CUDA implementation on small-footprint multi-GPU platforms

Julien C. Thibault Inanc Senocak 《The Journal of supercomputing》2012,59(2):693-719

Graphics processing units (GPUs), originally designed for graphics rendering, have emerged as massively-parallel “co-processors” to the central processing unit (CPU). Small-footprint multi-GPU workstations with hundreds of processing elements can accelerate compute-intensive simulation science applications substantially. In this study, we describe the implementation of an incompressible flow Navier–Stokes solver for multi-GPU workstation platforms. A shared-memory parallel code with identical numerical methods is also developed for multi-core CPUs to provide a fair comparison between CPUs and GPUs. Specifically, we adopt NVIDIA’s Compute Unified Device Architecture (CUDA) programming model to implement the discretized form of the governing equations on a single GPU. Pthreads are then used to enable communication across multiple GPUs on a workstation. We use separate CUDA kernels to implement the projection algorithm to solve the incompressible fluid flow equations. Kernels are implemented on different memory spaces on the GPU depending on their arithmetic intensity. The memory-hierarchy-specific implementation produces significantly faster performance. We present a systematic analysis of speedup and scaling using two generations of NVIDIA GPU architectures and provide a comparison of single and double precision computational performance on the GPU. Using a quad-GPU platform for single precision computations, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU workstations can serve as a cost-effective small-footprint parallel computing platform to accelerate computational fluid dynamics (CFD) simulations substantially.

13.

An efficient parallelization technique for x264 encoder on heterogeneous platforms consisting of CPUs and GPUs

Youngsub Ko Youngmin Yi Soonhoi Ha 《Journal of Real-Time Image Processing》2014,9(1):5-18

H.264/AVC video encoders have been widely used for their high coding efficiency. Since the computational demand, which is proportional to the frame resolution, is constantly increasing, there has been great interest in accelerating H.264/AVC by parallel processing. Recently, graphics processing units (GPUs) have emerged as a viable target for accelerating general purpose applications by exploiting fine-grain data parallelism. Despite extensive research efforts to use GPUs to accelerate the H.264/AVC algorithm, no speed-up has been achieved over the x264 algorithm, which is known as the fastest CPU implementation, mainly because of the significant communication overhead between the host CPU and the GPU and the intra-frame dependency in the algorithm. In this paper, we propose a novel motion-estimation (ME) algorithm tailored for NVIDIA GPU implementation. It is accompanied by a novel pipelining technique, called sub-frame ME processing, to effectively hide the communication overhead between the host CPU and the GPU. Further, we incorporate a frame-level parallelization technique to improve the overall throughput. Experimental results show that our proposed H.264 encoder has higher performance than the x264 encoder.

14.

Efficient magnetohydrodynamic simulations on graphics processing units with CUDA

Hon-Cheng Wong Un-Hong Wong Xueshang Feng Zesheng Tang 《Computer Physics Communications》2011,(10):2132-2160

Magnetohydrodynamic (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive, and Beowulf clusters or even supercomputers are often used to run the codes that implement these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the best of the authors' knowledge, the first implementation of MHD simulations entirely on GPUs with CUDA, named GPU-MHD, to accelerate the simulation process. GPU-MHD supports both single and double precision computations. A series of numerical tests have been performed to validate the correctness of our code. An accuracy evaluation comparing single and double precision computation results is also given. Performance measurements in both single and double precision are conducted on both the NVIDIA GeForce GTX 295 (GT200 architecture) and GTX 480 (Fermi architecture) graphics cards. These measurements show that our GPU-based implementation achieves between one and two orders of magnitude of improvement over the original serial CPU MHD implementation, depending on the graphics card used, the problem size, and the precision. In addition, we extend GPU-MHD to support the visualization of the simulation results, so that the whole MHD simulation and visualization process can be performed entirely on GPUs.

15.

CUD@SAT: SAT solving on GPUs

Alessandro Dal Palù Agostino Dovier Andrea Formisano Enrico Pontelli 《Journal of Experimental & Theoretical Artificial Intelligence》2013,25(3):293-316

The parallel computing power offered by graphics processing units (GPUs) has recently been exploited to support general purpose applications, by exploiting the availability of general APIs and the single-instruction multiple-thread-style parallelism present in several classes of problems (e.g. numerical simulations and matrix manipulations), where relatively simple computations need to be applied to all items in large sets of data. This paper investigates the use of GPUs in parallelising a class of search problems, where the combinatorial nature leads to large parallel tasks and relatively fewer natural symmetries. Specifically, the investigation focuses on the well-known satisfiability testing (SAT) problem and on the use of the NVIDIA compute unified device architecture, one of the most popular platforms for GPU computing. The paper explores ways to identify strong sources of GPU-style parallelism in SAT solving. The paper describes experiments with different design choices and evaluates the results. The outcomes demonstrate the potential of this approach, leading to one order of magnitude of speedup using a simple NVIDIA platform.

16.

A computational solution for the matrix Riccati equation using Laplace transforms

《International Journal of Computer Mathematics》2012,89(3-4):297-304

The solution for the finite-time matrix Riccati equation is presented in this paper. The solution to the Riccati equation is obtained in terms of the partition of the transition matrix. Matrix differential equations for the partition of the transition matrix are derived and solved using Laplace transforms and the computation is done by the digital computer.

A numerical example for the proposed method is given.

17.

Discrete generalized algebraic Riccati equations and polynomial matrix factorization

F. A. Aliev 《Systems & Control Letters》1992,18(1)

This paper presents an algorithm for solving discrete generalized algebraic Riccati equations with the help of an orthogonal projector. A generalization of the procedure of forming and correcting the orthogonal projector is considered, and also that of correcting the proper solution by the Newton-Raphson scheme. The possibility of using the discrete generalized Riccati equations for polynomial matrix factorization with respect to the unit circle is demonstrated. A numerical example is given.

18.

Accelerating iterative linear solvers using multiple graphical processing units

Zhangxin Chen Bo Yang 《International Journal of Computer Mathematics》2015,92(7):1422-1438

In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphical processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be sped up by a factor of more than 40 using four GPUs. Our linear solvers and preconditioners achieve similar speedups.

19.

Optimized OpenCL implementation of the Elastodynamic Finite Integration Technique for viscoelastic media

M. Molero-Armenta Ursula Iturrarán-Viveros S. Aparicio M.G. Hernández 《Computer Physics Communications》2014

Development of parallel codes that are both scalable and portable for different processor architectures is a challenging task. To overcome this limitation we investigate the acceleration of the Elastodynamic Finite Integration Technique (EFIT) to model 2-D wave propagation in viscoelastic media by using modern parallel computing devices (PCDs), such as multi-core CPUs (central processing units) and GPUs (graphics processing units). For that purpose we choose the industry open standard Open Computing Language (OpenCL) and an open-source toolkit called PyOpenCL. The implementation is platform independent and can be used on AMD or NVIDIA GPUs as well as classical multi-core CPUs. The code is based on the Kelvin–Voigt mechanical model, which has the advantage of not requiring additional field variables. OpenCL performance can, in principle, be improved by using local memory to eliminate global memory access latency. Our main contribution is the implementation of local memory and an analysis of the performance of local versus global memory using eight different computing devices (including Kepler, one of the fastest and most efficient high performance computing technologies) with various operating systems. The full implementation of the code is included.

20.

Global magnetohydrodynamic simulations on multiple GPUs

Un-Hong Wong Hon-Cheng Wong Yonghui Ma 《Computer Physics Communications》2014

Global magnetohydrodynamic (MHD) models play a major role in investigating the solar wind–magnetosphere interaction. However, the huge computational requirement of global MHD simulations is also the main problem that needs to be solved. With the recent development of modern graphics processing units (GPUs) and the Compute Unified Device Architecture (CUDA), it is possible to perform global MHD simulations in a more efficient manner. In this paper, we present a global MHD simulator on multiple GPUs using CUDA 4.0 with GPUDirect 2.0. Our implementation is based on the modified leapfrog scheme, which is a combination of the leapfrog scheme and the two-step Lax–Wendroff scheme. GPUDirect 2.0 is used in our implementation to drive multiple GPUs. All data transfers and kernel processing are managed with the CUDA 4.0 API instead of using MPI or OpenMP. Performance measurements are made on a multi-GPU system with eight NVIDIA Tesla M2050 (Fermi architecture) graphics cards. These measurements show that our multi-GPU implementation achieves a peak performance of 97.36 GFLOPS in double precision.