In this study, we present a novel optimization model that automatically and rapidly generates an optimal parallel preconditioned conjugate gradient (PCG) algorithm for any given linear system on a specific multi-GPU (graphics processing unit) platform. The model has three novel features: (1) it provides a profile-based performance model for each main component of the PCG algorithm, namely the vector operation, the inner product, and sparse matrix-vector multiplication (SpMV); (2) it is general, independent of the problem and dependent only on the resources of the devices; and (3) it is extensible: a vector operation, inner product, or SpMV kernel that is not included in our framework can be incorporated once its performance model has been constructed. The model is constructed only once for each type of GPU. Experiments validate the high efficiency of the proposed model.
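
To make the three PCG components named in this abstract concrete, here is a minimal CPU-side sketch in plain Python. It is not the authors' code or their performance model: the function names (`csr_spmv`, `dot`, `axpy`, `pcg`), the CSR layout, and the choice of a Jacobi (diagonal) preconditioner are all assumptions made for illustration.

```python
import math

def csr_spmv(indptr, indices, data, x):
    """SpMV component: y = A @ x for A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y

def dot(u, v):
    """Inner-product component."""
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Vector-operation component: returns alpha * x + y."""
    return [alpha * a + b for a, b in zip(x, y)]

def pcg(indptr, indices, data, b, inv_diag, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    inv_diag holds 1 / A[i][i]; using it as the preconditioner (Jacobi)
    is an assumption of this sketch, not something the abstract specifies.
    """
    x = [0.0] * len(b)
    r = list(b)                                   # residual for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # z = M^{-1} r
    p = list(z)
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            break
        q = csr_spmv(indptr, indices, data, p)
        alpha = rz / dot(p, q)
        x = axpy(alpha, p, x)                     # x = x + alpha * p
        r = axpy(-alpha, q, r)                    # r = r - alpha * A p
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)               # p = z + beta * p
        rz = rz_new
    return x
```

Each iteration performs one SpMV, two inner products for the step lengths plus one for the stopping test, and three vector updates, which is why per-kernel performance models of exactly these three operations can describe the cost of the whole solver.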

Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores have attracted much attention because of their flexible core configuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm that uses both types of cores in a CPU–GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect result. Then the CPU part of the same chip is triggered to rearrange the predicted partial sums into a correct result vector. On three heterogeneous processors from Intel, AMD and NVIDIA, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains a significant performance improvement over the best existing CSR-based SpMV algorithms.

Improving the performance of sparse matrix-vector multiplication (SpMV) is important, and it is difficult because of the irregular memory access of SpMV. General purpose GPUs (GPGPUs) provide high computing ability and substantial bandwidth that SpMV cannot fully exploit because of this irregularity. In this paper, we propose two novel methods to optimize the memory bandwidth for SpMV on GPGPU. First, a new storage format is proposed to exploit the memory bandwidth of the GPU architecture more efficiently. The new storage format ensures that it holds as many non-zeros as possible, which makes it suitable for exploiting the memory bandwidth of the GPU. Second, we propose a cache blocking method to improve the performance of SpMV on the GPU architecture. The sparse matrix is partitioned into sub-blocks that are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experiments are carried out on three GPU platforms: GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. Experimental results show that both new methods efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU.
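
The cache blocking idea in this abstract can be illustrated with a small plain-Python sketch. It only shows the data layout and access order; it is not the authors' GPU kernel. Partitioning by column blocks of a fixed width is an assumption (the abstract does not say how the sub-blocks are shaped), and the names `split_into_column_blocks` and `blocked_spmv` are hypothetical.

```python
def split_into_column_blocks(indptr, indices, data, n_cols, block_width):
    """Partition a CSR matrix into column blocks, each stored in CSR.

    Returns a list of (col_start, indptr, indices, data) tuples; column
    indices inside each block are relative to col_start.
    """
    n_rows = len(indptr) - 1
    blocks = []
    for col_start in range(0, n_cols, block_width):
        col_end = min(col_start + block_width, n_cols)
        b_indptr, b_indices, b_data = [0], [], []
        for row in range(n_rows):
            for k in range(indptr[row], indptr[row + 1]):
                if col_start <= indices[k] < col_end:
                    b_indices.append(indices[k] - col_start)
                    b_data.append(data[k])
            b_indptr.append(len(b_indices))
        blocks.append((col_start, b_indptr, b_indices, b_data))
    return blocks

def blocked_spmv(blocks, x, n_rows, block_width):
    """Compute y = A @ x one column block at a time."""
    y = [0.0] * n_rows
    for col_start, b_indptr, b_indices, b_data in blocks:
        # Only this slice of x is touched while the block is processed,
        # which is what allows it to stay resident in a cache.
        x_block = x[col_start:col_start + block_width]
        for row in range(n_rows):
            acc = 0.0
            for k in range(b_indptr[row], b_indptr[row + 1]):
                acc += b_data[k] * x_block[b_indices[k]]
            y[row] += acc
    return y
```

The trade-off visible even in this sketch is that every block carries its own row-pointer array and y is updated once per block, so the block width has to balance the reuse of x against that extra traffic.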

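Sparse matrix-vector multiplication (SpMV) is an important problem in solving sparse linear systems, but because the non-zero elements are sparse, the computational density is low, which results in low computational efficiency. To cope with the irregularity of sparse matrices, a hybrid storage format can be used for SpMV, which improves the compression efficiency for sparse matrices and widens the range of matrices the format suits. HYB is a widely used hybrid compression format with fairly stable performance. With the widespread use of GPU parallel computing and the trend toward multi-core CPUs, building heterogeneous parallel computing systems from GPUs and multi-core CPUs has gained broad acceptance. Based on the storage characteristics of the ELL and COO parts of the HYB format, the two parts of the data are split between the CPU and the GPU for cooperative parallel computation, which makes full use of the computing resources of both and exploits the computational characteristics of each, thereby improving the utilization of computing resources. Building on an analysis of the characteristics of the CPU+GPU heterogeneous computing model, the data partitioning and sharing of the hybrid format are optimized, which exploits the advantages of the heterogeneous computing environment and improves computing performance.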
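
The HYB split described in this abstract can be sketched in plain Python as follows. This is an illustrative CPU-only sketch, not the authors' implementation: the input layout (per-row lists of `(column, value)` pairs), the fixed `ell_width`, and all function names are assumptions, and the abstract does not state which part runs on which device.

```python
def to_hyb(rows, ell_width):
    """Split a sparse matrix into an ELL part and a COO part.

    rows[r] is a list of (column, value) pairs for row r. The first
    ell_width entries of each row go to the zero-padded ELL part; any
    overflow entries go to the COO part.
    """
    ell_cols, ell_vals, coo = [], [], []
    for r, entries in enumerate(rows):
        head, tail = entries[:ell_width], entries[ell_width:]
        pad = ell_width - len(head)
        ell_cols.append([c for c, _ in head] + [0] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)  # padding adds 0
        coo.extend((r, c, v) for c, v in tail)
    return ell_cols, ell_vals, coo

def ell_spmv(ell_cols, ell_vals, x):
    """Regular part: every row has exactly ell_width slots."""
    return [sum(v * x[c] for c, v in zip(cols, vals))
            for cols, vals in zip(ell_cols, ell_vals)]

def coo_spmv(coo, x, n_rows):
    """Irregular part: one (row, column, value) triple per overflow entry."""
    y = [0.0] * n_rows
    for r, c, v in coo:
        y[r] += v * x[c]
    return y

def hyb_spmv(ell_cols, ell_vals, coo, x):
    """y = A @ x as the sum of two independent partial products."""
    y_ell = ell_spmv(ell_cols, ell_vals, x)
    y_coo = coo_spmv(coo, x, len(ell_cols))
    return [a + b for a, b in zip(y_ell, y_coo)]
```

Because `ell_spmv` and `coo_spmv` read only x and write separate outputs, the two partial products are independent, which is the property that makes it possible to compute them on different processors and add the results afterwards.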

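This work studies the acceleration of the preconditioned conjugate gradient method for sparse linear systems on the GPU. A program was developed on the Compute Unified Device Architecture (CUDA) platform, and its performance was tested and analyzed on an NVIDIA GT430 GPU. The sparse matrix is stored in compressed sparse row (CSR) format. In view of the algorithmic characteristics of the preconditioned conjugate gradient method, the performance optimization of GPU-based sparse matrix-vector multiplication and measures to accelerate data transfer from the CPU to the GPU were studied. The custom sparse matrix-vector multiplication kernel was compared with the cusparseDcsrmv function of the CUSPARSE library and achieved a speedup of up to 2.1 times. For the whole preconditioned conjugate gradient method, the implementation based on custom kernels has a slight advantage over the implementation based on the CUBLAS and CUSPARSE libraries, and it achieves a speedup of up to 7.4 times over the preconditioned conjugate gradient method on the CPU.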

Existing formats for sparse matrix–vector multiplication (SpMV) on the GPU outperform their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR formats for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Because existing CUDA-enabled GPUs perform atomic operations on double-precision floating-point numbers more slowly, the double-precision SCOO implementation does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, and 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision.

A CPU-GPU hybrid approach for the unsymmetric multifrontal method   (total citations: 1; self-citations: 0; citations by others: 1)
The multifrontal method is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphics processing unit (GPU). We analyze the unsymmetric multifrontal method from both an algorithmic and an implementation perspective to see how a GPU, in particular the NVIDIA Tesla C2070, can be used to accelerate the computations. Our main acceleration strategies include (i) performing BLAS on both CPU and GPU, (ii) improving the communication efficiency between the CPU and GPU by using page-locked memory, zero-copy memory, and asynchronous memory copy, and (iii) a modified algorithm that reuses the memory between different GPU tasks and sets thresholds to determine whether certain tasks should be performed on the GPU. The proposed acceleration strategies are implemented by modifying UMFPACK, which is an unsymmetric multifrontal linear system solver. Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems.

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and concerns about low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for the GPU and propose a queue-based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree-based parallel method for multi-GPU systems. These approaches increase GPU utilization and thus substantially reduce computation time. Comparisons are made with existing parallel solvers by using problems arising from practical applications. The experimental results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12-core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state-of-the-art solver based on the supernodal method for CPU-GPU heterogeneous platforms, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.

Sparse matrix–vector multiplication (SpMV) is one of the most important high-level operations in basic linear algebra. The GPU has evolved into a highly parallel coprocessor that is suited to compute-intensive, highly parallel computation. Achieving high SpMV performance on GPUs is relatively challenging, especially when the matrix has no specific structure. For these general sparse matrices, a new data structure based on the bisection ELLPACK format, BiELL, is designed to achieve better load balance and thus improve the performance of SpMV. In addition, applying the same idea to the JAD format yields the BiJAD format. Experimental results on various matrices show that the BiELL and BiJAD formats perform better than other similar formats, especially when the number of non-zero elements per row varies widely.

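This work studies the GPU-based solution of large-scale sparse linear systems and proposes a blocked storage format for sparse matrices, HMEC (hybrid multiple ELL and CSR). The storage structure of the coefficient matrix is optimized by reordering, the coefficient matrix is stored in blocks at a certain ratio, and the ELL and CSR storage formats are combined to suit the characteristics of the different blocks. Large-scale sparse linear systems are solved with the BiCGStab method preconditioned by incomplete LU factorization, which suits unsymmetric matrices, and with the conjugate gradient method preconditioned by incomplete Cholesky factorization, which suits symmetric positive definite matrices. Experiments show that storing the sparse matrix in HMEC format and implementing the two methods by calling GPU kernels yields, compared with implementations based on other storage formats, speedups of up to 31.89% and 17.50%, respectively.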

In this study, for the two-dimensional Maxwell's equations, an efficient preconditioned generalized minimum residual method on the graphics processing unit (GPUPGMRES) is proposed to obtain numerical solutions of the equations discretized by a multisymplectic Preissmann scheme. In our proposed GPUPGMRES, a novel sparse matrix–vector multiplication (SpMV) kernel is suggested that keeps the compressed sparse row (CSR) format intact. The proposed kernel dynamically assigns different numbers of rows to each thread block and accesses the CSR arrays in a fully coalesced manner. This greatly alleviates the bottleneck of many existing CSR-based algorithms. Furthermore, the vector-operation and inner-product decision trees are automatically constructed. These kernels and their corresponding optimized compute unified device architecture (CUDA) parameter values can be automatically selected from the decision trees for vectors of any size. In addition, with the sparse approximate inverse technique, solving the preconditioner equation reduces to an SpMV. Numerical results show that our proposed kernels have high parallelism. GPUPGMRES outperforms a recently proposed preconditioned GMRES method and a preconditioned GMRES implementation in the AmgX library. Moreover, GPUPGMRES is efficient in solving the two-dimensional Maxwell's equations.

This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise in Markovian modeling of computer and telecommunication networks. In this work, an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors' own implementation, both using the CSR storage scheme and running on Intel Xeon Phi, were investigated. The implementation based on Intel MKL uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP-style programming. We implement the SpMV operation and vector addition using basic optimization techniques and vectorization. We evaluate our approach in native and offload modes for various numbers of cores and thread affinity settings. Both implementations (based on Intel MKL and made by the authors) were compared with respect to time, speedup, and performance. The numerical experiments on Intel Xeon Phi show that the performance of the authors' implementation is very promising: it is up to two times faster than the multithreaded implementation (based on Intel MKL) running on a CPU (Intel Xeon processor) and up to three times faster than the application that uses Intel MKL on Intel Xeon Phi.
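
To show how RK4 reduces to SpMV and vector additions in this setting, here is a minimal plain-Python sketch for a linear system y' = A y with A in CSR form. It is not the authors' Xeon Phi code; the function names and the tiny two-state Markov chain in the usage lines are assumptions chosen for illustration.

```python
def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y

def rk4_step(indptr, indices, data, y, h):
    """One classical RK4 step for y' = A y: four SpMVs plus vector updates."""
    def f(v):
        return csr_spmv(indptr, indices, data, v)

    k1 = f(y)
    k2 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k1)])
    k3 = f([yi + 0.5 * h * ki for yi, ki in zip(y, k2)])
    k4 = f([yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

# Hypothetical two-state chain with generator Q = [[-1, 1], [2, -2]].
# The state probabilities satisfy pi' = Q^T pi, so A = Q^T is stored in CSR.
indptr, indices, data = [0, 2, 4], [0, 1, 0, 1], [-1.0, 2.0, 1.0, -2.0]
pi = [1.0, 0.0]
for _ in range(100):            # integrate from t = 0 to t = 1 with h = 0.01
    pi = rk4_step(indptr, indices, data, pi, 0.01)
```

For this chain the exact solution is pi_0(t) = 2/3 + e^{-3t}/3, which gives a simple way to check the integrator. Each step costs four SpMVs, so with very short rows the run time is dominated by memory traffic rather than arithmetic.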

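Mesoscale numerical simulation is an important tool for studying the properties of concrete, but solving sparse linear systems accounts for a large share of the total simulation time. Because the problem is three-dimensional and very large, preconditioned Krylov subspace iteration is the only viable route. Aztec is one of the internationally available software packages designed specifically for solving sparse linear systems. Since the sparse linear systems that currently arise in mesoscale numerical simulation of concrete are symmetric positive definite, the CG iteration provided in Aztec is used to solve them, and several preconditioner options that preserve symmetry are compared experimentally. The results show that, among preconditioning techniques such as parallel incomplete Cholesky factorization based on domain decomposition, non-overlapping symmetrized GS iteration, and least squares, the first is the most efficient, and it performs best with an overlap of 0 and a fill level of 0. The experiments also show that, for this application, RCM ordering generally leads to longer solution times, so there is no need to use it.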

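Sparse matrix-vector multiplication (SpMV) is the computational core of solving sparse linear systems and is widely used in scientific computing and engineering applications such as economic modeling and signal processing. Research on SpMV and its tuning techniques helps improve computational efficiency in the related fields. Traditional SpMV auto-tuning methods improve SpMV performance based on the architectural parameter settings of the hardware platform, but the huge number of parameter settings enlarges the search space and greatly increases auto-tuning time. Using deep learning, a convolutional-neural-network-based SpMV performance prediction model is built that consists of dual-channel sparse matrix feature fusion and fusion of sparse matrix features with architectural features, which enables fast auto-tuning. To improve the prediction accuracy of SpMV running time, feature data are selected and box plots are used to summarize SpMV timing information. Experiments designed and validated on the Florida sparse matrix collection show that the model predicts SpMV running time with an accuracy above 80% and generalizes well.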

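The product of a sparse matrix and a vector is used frequently in engineering practice and scientific computing. As matrix sizes grow, the large amount of computation limits the performance of the whole system, so the high computing power of the GPU can be used to accelerate SpMV. The problems of existing SpMV implementations on the GPU are analyzed, and two optimization schemes are designed: row-splitting optimization and float4 data type optimization. Experiments show that these schemes improve performance by a factor of 2 to 8.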

In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphics processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be sped up by a factor of more than 40 using four GPUs. Our linear solvers and preconditioners achieve similar speedups.
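
A Neumann polynomial preconditioner is attractive on GPUs because applying it needs only SpMVs and vector updates. The sketch below shows one common form, assuming diagonal scaling: z = sum over k from 0 to m of (I - D^{-1} A)^k D^{-1} r, with D the diagonal of A. The abstract does not give the authors' exact formulation, so this scaling, the names `csr_spmv` and `neumann_apply`, and the CSR layout are assumptions.

```python
def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y

def neumann_apply(indptr, indices, data, inv_diag, r, degree):
    """Return z = sum_{k=0..degree} (I - D^{-1} A)^k D^{-1} r.

    Uses the recurrence z_0 = D^{-1} r and
    z_{k+1} = z_k + D^{-1} (r - A z_k), which expands to the same sum.
    inv_diag holds 1 / A[i][i].
    """
    z = [d * ri for d, ri in zip(inv_diag, r)]
    for _ in range(degree):
        az = csr_spmv(indptr, indices, data, z)
        z = [zi + d * (ri - ai)
             for zi, d, ri, ai in zip(z, inv_diag, r, az)]
    return z
```

The series converges to the inverse of A applied to r only when the spectral radius of I - D^{-1} A is below 1, for example for strictly diagonally dominant matrices. In practice a low degree is used, so z is a cheap approximation whose cost is `degree` SpMVs per application.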

In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Cray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems, both in terms of scalability and overall performance. It is a well-known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Cray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer.

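GPUs can significantly improve the performance of some network functions. However, in GPU-accelerated network function virtualization (NFV) systems, network functions have to be developed and deployed independently in virtualized form, so the CPU stage of their CPU-GPU processing pipeline incurs considerable extra overhead, which makes the benefit of GPU acceleration for network functions less pronounced. To solve this problem, a new NFV system framework that supports GPU acceleration is proposed. Exploiting the fact that network functions in a service chain share data and flow state, a shared state management mechanism is designed to reduce the overhead of repeated protocol stack processing and flow state management in network functions and to increase the benefit of GPU acceleration. Evaluation of a prototype system shows that, compared with existing system frameworks, the framework significantly reduces the time spent in the CPU stage of several GPU-accelerated network functions and achieves up to a twofold throughput improvement on common network function service chains.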

Graphics Processing Units (GPUs) were originally designed to manipulate images, but due to their intrinsically parallel nature, they have turned into a powerful tool for scientific applications. In this article, we evaluate GPU performance in an implementation of a traditional stochastic simulation: correlated Brownian motion. This motion can be described by the Generalized Langevin Equation (GLE), a stochastic integro-differential equation with applications in many areas, such as anomalous diffusion, transport in porous media, noise analysis, and quantum dynamics. Our results show the power inherent in GPU programming when compared to traditional CPUs (Intel): we observed speedups of up to sixty times by using an NVIDIA GPU in place of a single-core Intel CPU.

We present graphics processing unit (GPU) data structures and algorithms to efficiently solve sparse linear systems that are typically required in simulations of multi-body systems and deformable bodies. We introduce an efficient sparse matrix data structure that can handle arbitrary sparsity patterns and outperforms current state-of-the-art implementations of sparse matrix-vector multiplication. Moreover, an efficient method to construct global matrices on the GPU is presented, in which hundreds of thousands of individual element contributions are assembled in a few milliseconds. A finite-element-based method for the simulation of deformable solids and an impulse-based method for rigid bodies are introduced to demonstrate the advantages of the novel data structures and algorithms. These applications share the characteristic that a major part of the computational effort consists of building and solving systems of linear equations in every time step. Our solving method results in a speedup factor of up to 13 in comparison to other GPU methods.
