期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

BiELL: A bisection ELLPACK-based storage format for optimizing SpMV on GPUs

Cong Zheng Shuo Gu Tong-Xiang Gu Bing Yang Xing-Ping Liu 《Journal of Parallel and Distributed Computing》2014

Sparse matrix–vector multiplication (SpMV) is one of the most important high level operations for basic linear algebra. Nowadays, the GPU has evolved into a highly parallel coprocessor which is suited to compute-intensive, highly parallel computation. Achieving high performance of SpMV on GPUs is relatively challenging, especially when the matrix has no specific structure. For these general sparse matrices, a new data structure based on the bisection ELLPACK format, BiELL, is designed to realize the load balance better, and thus improve the performance of the SpMV. Besides, based on the same idea of JAD format, the BiJAD format can be obtained. Experimental results on various matrices show that the BiELL and BiJAD formats perform better than other similar formats, especially when the number of non-zero elements per row varies a lot. 相似文献

2.

Accelerating iterative linear solvers using multiple graphical processing units

Zhangxin Chen Bo Yang 《国际计算机数学杂志》2015,92(7):1422-1438

In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphical processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be sped over 40 times faster using four GPUs. Our linear solvers and preconditioners have similar speedup. 相似文献

3.

A segment‐based sparse matrix–vector multiplication on CUDA

Xiaowen Feng Hai Jin Ran Zheng Zhiyuan Shao Lei Zhu 《Concurrency and Computation》2014,26(1):271-286

The challenge for Sparse Matrix–Vector multiplication (SpMV) performance is memory bandwidth, which mostly depends on input matrices and underlying computing platforms. To solve this challenge, many researchers have explored a variety of optimization techniques. One of the most promising aspects focuses on designing storage formats to represent sparse matrices. However, lots of prior storage formats cannot fully take advantage of the underlying computing platforms, resulting in unsatisfactory performance and large memory footprint. Therefore, a novel storage format, called Segmented Hybrid ELL + Compressed Sparse Row (CSR) (SHEC for short), is proposed to further improve the throughput and lessen memory footprint on Graphics Processing Unit (GPU). SHEC format employs an interleaved combination pattern, which combines certain amount of compressed rows to form a new SHEC row. Segmentation is brought in to balance load and reduce memory footprint. According to the empirical data, an automatic SHEC‐based SpMV is developed to fit for all the matrices. Experimental results show that SHEC approach outperforms the best results of NVIDIA SpMV library and exhibits a comparable performance with state‐of‐the‐art storage formats on the standard dataset. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

4.

基于HYB格式稀疏矩阵与向量乘在CPU+GPU异构系统中的实现与优化

阳王东李肯立《计算机工程与科学》2016,38(2):202-209

稀疏矩阵与向量相乘SpMV是求解稀疏线性系统中的一个重要问题,但是由于非零元素的稀疏性,计算密度较低,造成计算效率不高。针对稀疏矩阵存在的一些不规则性,利用混合存储格式来进行SpMV计算,能够提高对稀疏矩阵的压缩效率,并扩大其适应范围。HYB是一种广泛使用的混合压缩格式,其性能较为稳定。而随着GPU并行计算得到普遍应用以及CPU日趋多核化,因此利用GPU和多核CPU构建异构并行计算系统得到了普遍的认可。针对稀疏矩阵的HYB存储格式中的ELL和COO存储特征,把两部分数据分别分割到CPU和GPU进行协同并行计算,既能充分利用CPU和GPU的计算资源,又能够发挥CPU和GPU的计算特性,从而提高了计算资源的利用效能。在分析CPU+GPU异构计算模式的特征的基础上,对混合格式的数据分割和共享方面进行优化,能够较好地发挥在异构计算环境的优势,提高计算性能。相似文献

5.

选择稀疏矩阵乘法最优存储格式的研究

李佳佳张秀霞谭光明陈明宇《计算机研究与发展》2014,51(4):882-894

稀疏矩阵向量乘法(sparse matrix vector multiplication, SpMV)是科学和工程领域中重要的核心子程序之一,也是稀疏基本线性代数子程序(basic linear algebra subprograms, BLAS)库的重要函数.目前很多SpMV的优化工作在不同程度上获得了性能提升,但大多数优化工作针对特定存储格式或一类具有特定特征的稀疏矩阵缺乏通用性.因此高性能的SpMV实现并没有广泛地应用于实际应用和数值解法器中.另外,稀疏矩阵具有众多存储格式,不同存储格式的SpMV存在较大性能差异.根据以上现象,提出一个SpMV的自动调优器(SpMV auto-tuner, SMAT).对于一个给定的稀疏矩阵,SMAT结合矩阵特征选择并返回其最优的存储格式.应用程序通过调用SMAT来得到合适的存储格式,从而获得性能提升,同时随着SMAT中存储格式的扩展,更多的SpMV优化工作可以将性能优势在实际应用中发挥作用.使用佛罗里达大学的2366个稀疏矩阵作为测试集,在Intel上SMAT分别获得9.11GFLOPS(单精度)和2.44GFLOPS(双精度)的最高浮点性能,在AMD平台上获得了3.36GFLOPS(单精度)和1.52GFLOPS(双精度)的最高浮点性能.相比Intel的核心数学函数库(math kernel library, MKL)数学库,SMAT平均获得1.4~1.5倍的性能提升. 相似文献

6.

CUDA-enabled Sparse Matrix–Vector Multiplication on GPUs using atomic operations

Hoang-Vu Dang Bertil Schmidt 《Parallel Computing》2013

Existing formats for Sparse Matrix–Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR format for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers the SCOO implementation for double-precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy-Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700 K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision. 相似文献

7.

基于RISC-V向量指令的稀疏矩阵向量乘法实现与优化

顾越赵银亮《计算机工程与科学》2022,44(1):1-8

开源指令集架构RISC-V具有高性能、模块化、简易性和易拓展等优势,在物联网、云计算等领域的应用日渐广泛,其向量拓展部分V模块更是很好地支持了矩阵数值计算.稀疏矩阵向量乘法SpM V作为矩阵数值计算的一个重要组成部分,具有深刻的研究意义与价值.利用RISC-V指令集的向量可配置性和寻址特性,分别对基于CSR、ELLPA... 相似文献

8.

SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU

Mayez A. Al-Mouhamed Ayaz H. Khan 《The Journal of supercomputing》2017,73(9):3761-3795

The abundant data parallelism available in many-core GPUs has been a key interest to improve accuracy in scientific and engineering simulation. In many cases, most of the simulation time is spent in linear solver involving sparse matrix–vector multiply. In forward petroleum oil and gas reservoir simulation, the application of a stencil relationship to structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that takes advantage of generalized hepta-diagonal sparsity pattern and stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and biconjugate gradient stabilized (BiCG-Stab) solver. In-kernel is intended to avoid the multiple kernels invocation associated with the use of the numerical library operators. To keep in-kernel, a lock-free inter-block synchronization is used in which completing thread blocks are assigned some independent computations to avoid repeatedly polling the global memory. Other optimizations enable combining reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab for preserving vector data locality and to avoid saving vector data back to memory and reloading on each kernel exit and re-entry. Evaluation uses generalized hepta-diagonal matrices that derives from a range of forward reservoir simulation’s structured grids. Results show the profitability of proposed generalized hepta-diagonal custom storage scheme over standard library storage like compressed sparse row, hybrid sparse, and diagonal formats. Using proposed optimizations, SpMV and BiCG-Stab have been noticeably accelerated compared to other implementations using multiple kernel exit–re-entry when the solver is implemented by invoking numerical library operators. 相似文献

9.

GPU-accelerated preconditioned iterative linear solvers 总被引：1，自引：1，他引：0

Ruipeng Li Yousef Saad 《The Journal of supercomputing》2013,63(2):443-466

This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 Processor. Overall performance of the GPU-accelerated Incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and GPU-accelerated The incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with better suited preconditioning techniques for GPUs, this performance can be further improved. 相似文献

10.

Using sampled information: is it enough for the sparse matrix–vector product locality optimization?

Juan C. Pichel Juan A. Lorenzo Francisco F. Rivera Dora B. Heras Toms F. Pena 《Concurrency and Computation》2014,26(1):98-117

One of the main factors that affect the performance of the sparse matrix–vector product (SpMV) is the low data reuse caused by the irregular and indirect memory access patterns. Different strategies to deal with this problem such as data reordering techniques have been proposed. The computational cost of these techniques is typically high because they consider all the nonzeros of the sparse matrix in order to find an appropriate permutation of rows and columns that improves the SpMV performance. In this paper, we analyze the possibility of increasing the locality of the SpMV using incomplete information in the reordering process. This partial information comes as a consequence of considering only a subset of the nonzero elements of the matrix. These nonzeros are obtained from the original matrix through a sampling process. In particular, two different sampling methods have been considered: a random sampling and an event‐based sampling using hardware counters. We have detected that a small number of samples is enough to obtain quality reorderings. As a consequence, using sampling‐based reorderings leads to noticeable performance improvements with respect to the non‐reordered matrices, reaching speedup values up to 2.1 × . In addition, an important reduction in the computational time required by the reordering technique has been observed. Copyright © 2012 John Wiley & Sons, Ltd. 相似文献

11.

Speeding up solving of differential matrix Riccati equations using GPGPU computing and MATLAB

Jesus Peinado Jacinto J. Ibez Enrique Arias Vicente Hernndez 《Concurrency and Computation》2012,24(12):1334-1348

In this work, we developed a parallel algorithm to speed up the resolution of differential matrix Riccati equations using a backward differentiation formula algorithm based on a fixed‐point method. The role and use of differential matrix Riccati equations is especially important in several applications such as optimal control, filtering, and estimation. In some cases, the problem could be large, and it is interesting to speed it up as much as possible. Recently, modern graphic processing units (GPUs) have been used as a way to improve performance. In this paper, we used an approach based on general‐purpose computing on graphics processing units. We used NVIDIA © GPUs with unified architecture. To do this, a special version of basic linear algebra subprograms for GPUs, called CUBLAS, and a package (three different packages were studied) to solve linear systems using GPUs have been used. Moreover, we developed a MATLAB © toolkit to use our implementation from MATLAB in such a way that if the user has a graphic card, the performance of the implementation is improved. If the user does not have such a card, the algorithm can also be run using the machine CPU. Experimental results on a NVIDIA Quadro FX 5800 are shown. Copyright © 2011 John Wiley & Sons, Ltd. 相似文献

12.

基于深度学习的稀疏矩阵向量乘运算性能预测模型

曹中潇冯仰德王珏闵维潇姚铁锤高岳王丽华高付海《计算机工程》2022,48(2):86-91

稀疏矩阵向量乘（SpMV）是求解稀疏线性方程组的计算核心,被广泛应用在经济学模型、信号处理等科学计算和工程应用中,对于SpMV及其调优技术的研究有助于提升解决相关领域问题的运算效率。传统SpMV自动调优方法基于硬件平台的体系结构参数设置来提升SpMV性能,但巨大的参数设置量导致搜索空间变大且自动调优耗时大幅增加。采用深度学习技术,基于卷积神经网络,构建由双通道稀疏矩阵特征融合以及稀疏矩阵特征与体系结构特征融合组成的SpMV运算性能预测模型,实现快速自动调优。为提高SpMV运算时间的预测精度,选取特征数据并利用箱形图统计SpMV时间信息,同时在佛罗里达稀疏矩阵数据集上进行实验设计与验证,结果表明,该模型的SpMV运算时间预测准确率达到80%以上,并且具有较强的泛化能力。相似文献

13.

Memory bandwidth optimization of SpMV on GPGPUs

Chenggang Clarence YAN Hui YU Weizhi XU Yingping ZHANG Bochuan CHEN Zhu TIAN Yuxuan WANG Jian YIN 《Frontiers of Computer Science》2015,9(3):431

It is an important task to improve performance for sparse matrix vector multiplication (SpMV), and it is a difficult task because of its irregular memory access. General purpose GPU (GPGPU) provides high computing ability and substantial bandwidth that cannot be fully exploited by SpMV due to its irregularity. In this paper, we propose two novel methods to optimize the memory bandwidth for SpMV on GPGPU. First, a new storage format is proposed to exploit memory bandwidth of GPU architecture more efficiently. The new storage format can ensure that there are as many non-zeros as possible in the format which is suitable to exploit the memory bandwidth of the GPU. Second, we propose a cache blocking method to improve the performance of SpMV on GPU architecture. The sparse matrix is partitioned into sub-blocks that are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time to access the global memory for vector x is reduced heavily. Experiments are carried out on three GPU platforms, GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. Experimental results show that both new methods can efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU. 相似文献

14.

GPU accelerated sparse matrix‐vector multiplication and sparse matrix‐transpose vector multiplication

Yuan Tao Yangdong Deng Shuai Mu Zhenzhong Zhang Mingfa Zhu Limin Xiao Li Ruan 《Concurrency and Computation》2015,27(14):3771-3789

Many high performance computing applications require computing both sparse matrix‐vector product (SMVP) and sparse matrix‐transpose vector product (SMTVP) for better overall performance. Under such a circumstance, it is critical to maintain a similarly high throughput for these two computing patterns with the underlying sparse matrix encoded in a single storage format. The compressed sparse block (CSB) format proposed by Buluç et al. allows computing both problems on multi‐core CPUs with nearly identical throughputs. On the other hand, a direct porting of CSB to graphics processing units (GPUs), which have been recently recognized as a powerful general purpose computing platform, turns out to be inefficient. In this work, we propose a new data structure, designated as expanded CSB (eCSB), to minimize the throughput gap between SMVP and SMTVP computations on GPUs, while at the same time enable a high computing throughput. We also use a hybrid storage format to store elements in each block, which can be selected dynamically at runtime. Experimental results show that the proposed techniques implemented on a Kepler GPU delivers similar throughput on both SMVP and SMTVP and the throughput is up to 13 times faster than that of the CPU‐based CSB implementation. In addition, our eCSB procedure outperforms the previous GPU results by up to 188% and 914% in computing SMVP and SMTVP, and we validate the effectiveness of eCSB by means of wall‐clock time of bi‐conjugate gradient algorithm; our eCSB is 25% faster than Compressed Sparse Rows (CSR) and 6% faster than HYB, respectively. Copyright © 2014 John Wiley & Sons, Ltd. 相似文献

15.

Algorithmic performance studies on graphics processing units

Olaf Schenk Matthias Christen Helmar Burkhart 《Journal of Parallel and Distributed Computing》2008

We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications. 相似文献

16.

对角线稀疏矩阵的SpMV自适应性能优化

孙相征张云泉王婷李焱袁良《计算机研究与发展》2013,50(3):648-656

稀疏矩阵向量乘(SpMV)是科学计算中常用的内核之一,其运行速率跟非零元分布相关.针对对角线稀疏矩阵,提出了压缩行片段对角(compressed row segment diagonal, CRSD)存储格式.它利用“对角线格式”有效描述矩阵的对角线分布,区别于以往通用的计算方法,CRSD通过对给定应用的对角线稀疏矩阵采样再进行特定的优化.并且在软件安装阶段,通过自适应的方法选取适合具体运行平台的最优SpMV实现.在CPU端进行多线程并行化实现时,自适应调优过程中收集的信息还被用于线程间任务划分,以实现负载平衡.同时完成CRSD存储格式在GPU端的实现,并根据GPU端计算与访存的特点进行优化.实验结果表明：在Intel和AMD的多核平台使用相同线程数的情况下,与DIA相比,使用CRSD的加速比可以达到2.37X(平均1.7X);与CSR相比,可以达到4.6X(平均2.1X). 相似文献

17.

Two-dimensional cache-oblivious sparse matrix-vector multiplication

A.N. Yzelman Rob H. Bisseling 《Parallel Computing》2011,37(12):806-819

In earlier work, we presented a one-dimensional cache-oblivious sparse matrix-vector (SpMV) multiplication scheme which has its roots in one-dimensional sparse matrix partitioning. Partitioning is often used in distributed-memory parallel computing for the SpMV multiplication, an important kernel in many applications. A logical extension is to move towards using a two-dimensional partitioning. In this paper, we present our research in this direction, extending the one-dimensional method for cache-oblivious SpMV multiplication to two dimensions, while still allowing only row and column permutations on the sparse input matrix. This extension requires a generalisation of the compressed row storage data structure to a block-based data structure, for which several variants are investigated. Experiments performed on three different architectures show further improvements of the two-dimensional method compared to the one-dimensional method, especially in those cases where the one-dimensional method already provided significant gains. The largest gain obtained by our new reordering is over a factor of 3 in SpMV speed, compared to the natural matrix ordering. 相似文献

18.

Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform

Shiming Xu Wei Xue Hai Xiang Lin 《The Journal of supercomputing》2013,63(3):710-721

In this article, we discuss the performance modeling and optimization of Sparse Matrix-Vector Multiplication ( ) on NVIDIA GPUs using CUDA. has a very low computation-data ratio and its performance is mainly bound by the memory bandwidth. We propose optimization of based on ELLPACK from two aspects: (1) enhanced performance for the dense vector by reducing cache misses, and (2) reduce accessed matrix data by index reduction. With matrix bandwidth reduction techniques, both cache usage enhancement and index compression can be enabled. For GPU with better cache support, we propose differentiated memory access scheme to avoid contamination of caches by matrix data. Performance evaluation shows that the combined speedups of proposed optimizations for GT-200 are 16% (single-precision) and 12.6% (double-precision) for GT-200 GPU, and 19% (single-precision) and 15% (double-precision) for GF-100 GPU. 相似文献

19.

Supernodal sparse Cholesky factorization on graphics processing units

Dan Zou Yong Dou Song Guo Rongchun Li Lin Deng 《Concurrency and Computation》2014,26(16):2713-2726

Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd. 相似文献

20.

面向国产申威26010众核处理器的SpMV实现与优化

刘芳芳杨超袁欣辉吴长茂敖玉龙《软件学报》2018,29(12):3921-3932

世界首台峰值性能超过100P的超级计算机——神威太湖之光已经研制完成,该超级计算机采用了国产申威异构众核处理器,该处理器不同于现有的纯CPU,CPU-MIC,CPU-GPU架构,采用了主-从核架构,单处理器峰值计算能力为3TFlops/s,访存带宽为130GB/s.稀疏矩阵向量乘SpMV（sparse matrix-vector multiplication）是科学与工程计算中的一个非常重要的核心函数,众所周知,其是带宽受限型的,且存在间接访存操作.国产申威处理器给稀疏矩阵向量乘的高效实现带来了很大的挑战.针对申威处理器提出了一种CSR格式SpMV操作的通用异构众核并行算法,该算法从任务划分、LDM空间划分方面进行精细设计,提出了一套动静态buffer的缓存机制以提升向量x的访存命中率,提出了一套动静态的任务调度方法以实现负载均衡.另外还分析了该算法中影响SpMV性能的几个关键因素,并开展了自适应优化,进一步提升了性能.采用Matrix Market矩阵集中具有代表性的16个稀疏矩阵进行了测试,相比主核版最高有10倍左右的加速,平均加速比为6.51.通过采用主核版CSR格式SpMV的访存量进行分析,测试矩阵最高可达该处理器实测带宽的86%,平均可达到47%. 相似文献