1.

GPU-accelerated preconditioned GMRES method for two-dimensional Maxwell's equations

Jiaquan Gao Kesong Wu Yushun Wang Panpan Qi Guixia He 《International Journal of Computer Mathematics》 2017,94(10):2122-2144

In this study, an efficient preconditioned generalized minimum residual method on the graphics processing unit (GPUPGMRES) is proposed for two-dimensional Maxwell's equations, to obtain numerical solutions of the equations discretized by a multisymplectic Preissmann scheme. In the proposed GPUPGMRES, a novel sparse matrix–vector multiplication (SpMV) kernel is suggested that keeps the compressed sparse row (CSR) format intact. The kernel dynamically assigns a different number of rows to each thread block and accesses the CSR arrays in a fully coalesced manner. This greatly alleviates the bottleneck of many existing CSR-based algorithms. Furthermore, the vector-operation and inner-product decision trees are constructed automatically. These kernels and their corresponding optimized Compute Unified Device Architecture (CUDA) parameter values can be selected automatically from the decision trees for vectors of any size. In addition, with the sparse approximate inverse technique, solving the preconditioner equation reduces to an SpMV. Numerical results show that the proposed kernels have high parallelism. GPUPGMRES outperforms a recently proposed preconditioned GMRES method and a preconditioned GMRES implementation in the AmgX library. Moreover, GPUPGMRES is efficient in solving the two-dimensional Maxwell's equations.
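For readers unfamiliar with the CSR layout this abstract refers to, the following is a minimal sequential sketch of CSR-based SpMV. It is not the paper's GPU kernel: the function name `csr_spmv` and the array names `row_ptr`, `col_idx`, and `values` are illustrative assumptions, and the dynamic row-to-thread-block assignment described above is not modeled.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero values).
    All names here are hypothetical, chosen for illustration.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access x[col_idx[k]] is the irregular part of SpMV.
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
print(y)
```

Rows can hold very different numbers of nonzeros, which is why a fixed one-row-per-thread mapping tends to load GPU threads unevenly.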

2.

Memory bandwidth optimization of SpMV on GPGPUs

Chenggang Clarence YAN Hui YU Weizhi XU Yingping ZHANG Bochuan CHEN Zhu TIAN Yuxuan WANG Jian YIN 《Frontiers of Computer Science》2015,9(3):431

Improving the performance of sparse matrix-vector multiplication (SpMV) is important, and it is difficult because of SpMV's irregular memory access. General-purpose GPUs (GPGPUs) provide high computing ability and substantial bandwidth that SpMV cannot fully exploit because of this irregularity. In this paper, we propose two novel methods to optimize the memory bandwidth for SpMV on GPGPUs. First, a new storage format is proposed to exploit the memory bandwidth of the GPU architecture more efficiently. The new format ensures that it holds as many nonzeros as possible, which suits the GPU's memory bandwidth. Second, we propose a cache blocking method to improve the performance of SpMV on the GPU architecture. The sparse matrix is partitioned into sub-blocks that are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experiments are carried out on three GPU platforms: GeForce 9800 GX2, GeForce GTX 480, and Tesla K40. Experimental results show that both new methods efficiently improve the utilization of GPU memory bandwidth and the performance of the GPU.

3.

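FPGA design and implementation of large sparse matrix-vector multiplication over GF(2)

苏锦柱 邬贵明 贾迅 《Computer Engineering & Science》 2016,38(8):1530-1535

As the core of the Wiedemann algorithm, sparse matrix-vector multiplication is the main step in solving large sparse linear systems over GF(2). This paper proposes an FPGA-based ring-network hardware architecture for large sparse matrix-vector multiplication over GF(2), together with a new parallel computing structure to handle the repeated sparse matrix-vector multiplications that the Wiedemann algorithm requires. Experimental analysis shows that the proposed architecture increases the parallelism of sparse matrix-vector multiplication in the Wiedemann algorithm while making full use of the FPGA's on-chip memory and gigabit transceivers. Compared with the partially reconfigurable (PR) computing model, currently the best-performing approach, it achieves a 2.65x speedup.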
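As background for the arithmetic involved: over GF(2), multiplication is a logical AND and addition is an XOR, so each output bit is the parity of the input bits selected by a row's nonzero columns. The sketch below is a software illustration only, not the paper's FPGA architecture. The names `gf2_spmv` and `krylov_sequence` and the bit-packed vector layout are assumptions made for the example.

```python
def gf2_spmv(rows, x_bits):
    """Multiply a sparse GF(2) matrix by a bit-packed vector.

    rows[i] lists the column indices of the 1-entries in row i.
    x_bits is a Python int whose bit j holds x[j].
    Returns y as an int whose bit i holds y[i].
    """
    y_bits = 0
    for i, cols in enumerate(rows):
        parity = 0
        for j in cols:
            parity ^= (x_bits >> j) & 1   # AND with a 1-entry, then XOR-accumulate
        y_bits |= parity << i
    return y_bits


def krylov_sequence(rows, b_bits, steps):
    """Return [b, A b, A^2 b, ...] with `steps` entries.

    This is the repeated SpMV pattern that Wiedemann-style solvers rely on.
    """
    seq = [b_bits]
    for _ in range(steps - 1):
        seq.append(gf2_spmv(rows, seq[-1]))
    return seq


# A = [[1, 1, 0],
#      [0, 1, 1],
#      [1, 0, 1]]  over GF(2)
rows = [[0, 1], [1, 2], [0, 2]]
seq = krylov_sequence(rows, 0b011, 3)
print([bin(v) for v in seq])
```

Each product depends on the previous one, so the sequence itself is serial and any parallelism has to come from inside a single multiplication.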

4.

Random access schemes for efficient FPGA SpMV acceleration

《Microprocessors and Microsystems》2016

Utilizing hardware resources efficiently is vital to building the future generation of high-performance computing systems. The sparse matrix-dense vector multiplication (SpMV) kernel, which is notorious for its poor efficiency on conventional processors, is a key component in many scientific computing applications, and increasing SpMV efficiency can contribute significantly to improving overall system efficiency. The major challenge in implementing SpMV efficiently is handling the input-dependent memory access patterns, and reconfigurable logic is a strong candidate for tackling this problem via memory system customization. In this work, we consider three schemes (all off-chip, all on-chip, caching) for servicing the irregular-access component of SpMV and investigate their effects on accelerator efficiency. To combine the strengths of on-chip and off-chip random accesses, we propose a hardware-software caching scheme named NCVCS that combines software preprocessing with a nonblocking cache to enable highly efficient SpMV accelerators with modest on-chip memory requirements. Our comparison of the three schemes, implemented as part of an FPGA SpMV accelerator, shows that our scheme effectively combines the high efficiency of on-chip accesses with the ability of off-chip accesses to handle large matrices.

5.

Accelerating iterative linear solvers using multiple graphical processing units

Zhangxin Chen Bo Yang 《International Journal of Computer Mathematics》 2015,92(7):1422-1438

In this paper, we develop, study and implement iterative linear solvers and preconditioners using multiple graphical processing units (GPUs). Techniques for accelerating sparse matrix–vector (SpMV) multiplication, linear solvers and preconditioners are presented. Four Krylov subspace solvers, a Neumann polynomial preconditioner and a domain decomposition preconditioner are implemented. Our numerical tests with NVIDIA C2050 GPUs show that the SpMV kernel can be sped up by more than 40 times using four GPUs. Our linear solvers and preconditioners achieve similar speedups.

6.

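Implementation and optimization of SpMV for the domestic Sunway 26010 many-core processor

刘芳芳 杨超 袁欣辉 吴长茂 敖玉龙 《Journal of Software》 2018,29(12):3921-3932

Sunway TaihuLight, the world's first supercomputer with a peak performance above 100 PFLOPS, has been completed. It uses the domestic Sunway heterogeneous many-core processor, which differs from existing pure-CPU, CPU-MIC, and CPU-GPU architectures by adopting a master-slave core design. A single processor delivers a peak of 3 TFLOPS with a memory bandwidth of 130 GB/s. Sparse matrix-vector multiplication (SpMV) is a very important kernel in scientific and engineering computing. It is well known to be bandwidth-bound and to involve indirect memory accesses, so the Sunway processor poses great challenges for implementing it efficiently. This paper proposes a general heterogeneous many-core parallel algorithm for CSR-format SpMV on the Sunway processor. The algorithm is carefully designed in terms of task partitioning and LDM space partitioning. It introduces a dynamic-static buffer caching mechanism to raise the hit rate of accesses to vector x, and a dynamic-static task scheduling method to achieve load balance. The paper also analyzes several key factors that affect SpMV performance in this algorithm and applies adaptive optimization to improve performance further. Tests on 16 representative sparse matrices from the Matrix Market collection show a maximum speedup of about 10x over the master-core version, with an average speedup of 6.51. Analyzed in terms of the memory traffic of the master-core CSR SpMV, the test matrices reach up to 86% of the processor's measured bandwidth, and 47% on average.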

7.

Two-dimensional cache-oblivious sparse matrix-vector multiplication

A.N. Yzelman Rob H. Bisseling 《Parallel Computing》2011,37(12):806-819

In earlier work, we presented a one-dimensional cache-oblivious sparse matrix-vector (SpMV) multiplication scheme which has its roots in one-dimensional sparse matrix partitioning. Partitioning is often used in distributed-memory parallel computing for the SpMV multiplication, an important kernel in many applications. A logical extension is to move to two-dimensional partitioning. In this paper, we present our research in this direction, extending the one-dimensional method for cache-oblivious SpMV multiplication to two dimensions, while still allowing only row and column permutations on the sparse input matrix. This extension requires a generalisation of the compressed row storage data structure to a block-based data structure, for which several variants are investigated. Experiments performed on three different architectures show further improvements of the two-dimensional method compared to the one-dimensional method, especially in those cases where the one-dimensional method already provided significant gains. The largest gain obtained by our new reordering is over a factor of 3 in SpMV speed, compared to the natural matrix ordering.

8.

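Automatic performance optimization techniques for SpMV and their application (Cited by: 1; self-citations: 0; citations by others: 1)

袁娥 张云泉 刘芳芳 孙相征 《Journal of Computer Research and Development》 2009,46(7)

In scientific computing, sparse matrix-vector multiplication (SpMV) is a very important computational kernel that is called frequently. Because the ratio of floating-point operations to memory accesses in typical SpMV implementations is very low, and the memory access pattern is highly irregular, actual performance is often poor. With a register blocking algorithm and a heuristic block-size selection algorithm, the sparse matrix is divided into small dense blocks so that elements of vector x held in registers are reused, which improves the kernel's performance. This paper analyzes and summarizes several key optimization techniques used in the OSKI package and tests them in real applications. The tests show that, when these optimizations are applied in practice, an application must call SpMV on the order of hundreds of times to offset the extra time overhead the optimizations introduce and obtain a speedup. On Pentium 4 and AMD Athlon platforms, tests on 10 matrices gave average speedups of 1.69 and 1.48, respectively.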
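The register blocking idea summarized above can be illustrated with a small block-CSR (BCSR) sketch. This is a plain-Python illustration under stated assumptions, not OSKI's implementation or API: the function names `dense_to_bcsr` and `bcsr_spmv` are hypothetical, the block size is passed in by hand instead of being chosen by a heuristic, and matrix dimensions are assumed to be multiples of the block size.

```python
def dense_to_bcsr(dense, r, c):
    """Store a matrix as r-by-c dense blocks, keeping only blocks with a nonzero.

    Zeros inside a kept block are stored explicitly (fill-in), which is the
    price paid for regular inner loops.
    """
    n_rows, n_cols = len(dense), len(dense[0])
    assert n_rows % r == 0 and n_cols % c == 0
    brow_ptr, bcol_idx, blocks = [0], [], []
    for bi in range(n_rows // r):
        for bj in range(n_cols // c):
            block = [dense[bi * r + i][bj * c + j]
                     for i in range(r) for j in range(c)]
            if any(v != 0 for v in block):
                bcol_idx.append(bj)
                blocks.append(block)
        brow_ptr.append(len(bcol_idx))
    return brow_ptr, bcol_idx, blocks


def bcsr_spmv(brow_ptr, bcol_idx, blocks, r, c, x):
    """Compute y = A @ x from the BCSR arrays."""
    n_brows = len(brow_ptr) - 1
    y = [0.0] * (n_brows * r)
    for bi in range(n_brows):
        acc = [0.0] * r                      # stands in for r accumulator registers
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            x0 = bcol_idx[k] * c
            xs = x[x0:x0 + c]                # c entries of x loaded once ...
            blk = blocks[k]
            for i in range(r):               # ... and reused by all r rows of the block
                for j in range(c):
                    acc[i] += blk[i * c + j] * xs[j]
        y[bi * r:(bi + 1) * r] = acc
    return y


dense = [[1, 2, 0, 0],
         [3, 4, 0, 0],
         [0, 0, 5, 0],
         [0, 0, 0, 6]]
brow_ptr, bcol_idx, blocks = dense_to_bcsr(dense, 2, 2)
y = bcsr_spmv(brow_ptr, bcol_idx, blocks, 2, 2, [1.0, 1.0, 1.0, 1.0])
print(y)
```

The conversion step is the kind of one-time overhead the abstract says must be amortized over many SpMV calls.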

9.

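Implementation and optimization of sparse matrix-vector multiplication based on RISC-V vector instructions

顾越 赵银亮 《Computer Engineering & Science》 2022,44(1):1-8

The open-source instruction set architecture RISC-V offers high performance, modularity, simplicity, and easy extensibility, and is increasingly widely used in fields such as the Internet of Things and cloud computing. Its vector extension, the V module, provides good support for matrix numerical computation. Sparse matrix-vector multiplication (SpMV), an important component of matrix numerical computation, is of considerable research significance and value. Exploiting the vector configurability and addressing features of the RISC-V instruction set, the CSR, ELLPA...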

10.

Analyzing the execution of sparse matrix-vector product on the Finisterrae SMP-NUMA system

Juan C. Pichel Juan A. Lorenzo Dora B. Heras Jose C. Cabaleiro Tomás F. Pena 《The Journal of supercomputing》2011,58(2):195-205

In this paper, the sparse matrix-vector product (SpMV) is evaluated on the FinisTerrae SMP-NUMA supercomputer. Its architectural particularities make the tuning of SpMV especially relevant because of their significant impact on performance. First, we have estimated the influence of data and thread allocation. Moreover, because of the indirect and irregular memory access patterns of SpMV, we have also studied the influence of the memory hierarchy on performance. According to the behavior observed in the study, a set of optimizations specially tuned for FinisTerrae was successfully applied to SpMV. Noticeable improvements are obtained in comparison with the naïve SpMV implementation.

11.

A segment‐based sparse matrix–vector multiplication on CUDA

Xiaowen Feng Hai Jin Ran Zheng Zhiyuan Shao Lei Zhu 《Concurrency and Computation》2014,26(1):271-286

The challenge for Sparse Matrix–Vector multiplication (SpMV) performance is memory bandwidth, which mostly depends on input matrices and underlying computing platforms. To address this challenge, many researchers have explored a variety of optimization techniques. One of the most promising directions is designing storage formats to represent sparse matrices. However, many prior storage formats cannot take full advantage of the underlying computing platforms, resulting in unsatisfactory performance and a large memory footprint. Therefore, a novel storage format, called Segmented Hybrid ELL + Compressed Sparse Row (CSR) (SHEC for short), is proposed to further improve throughput and reduce the memory footprint on Graphics Processing Units (GPUs). The SHEC format employs an interleaved combination pattern, which combines a certain number of compressed rows to form a new SHEC row. Segmentation is introduced to balance load and reduce memory footprint. Based on empirical data, an automatic SHEC‐based SpMV is developed to suit all matrices. Experimental results show that the SHEC approach outperforms the best results of the NVIDIA SpMV library and exhibits performance comparable to state‐of‐the‐art storage formats on the standard dataset. Copyright © 2012 John Wiley & Sons, Ltd.

12.

A new approach for sparse matrix vector product on NVIDIA GPUs

F. Vázquez J. J. Fernández E. M. Garzón 《Concurrency and Computation》2011,23(8):815-826

The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devising data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared on the scene. This work proposes and evaluates a new implementation of SpMV for NVIDIA GPUs based on a new format, ELLPACK‐R, that allows storage of the sparse matrix in a regular manner. A comparative evaluation against a variety of previously proposed storage formats has been carried out based on a representative set of test matrices. The results show that, although the performance strongly depends on the specific pattern of the matrix, the implementation based on ELLPACK‐R achieves higher overall performance. Moreover, a comparison with standard state‐of‐the‐art superscalar processors reveals that significant speedup factors are achieved with GPUs. Copyright © 2010 John Wiley & Sons, Ltd.
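A rough sketch of the idea behind an ELLPACK-style format with a row-length array follows. It assumes the commonly described layout (padded arrays stored column-major, plus one length per row so padding is never read). The helper names `csr_to_ellpack_r` and `ellpack_r_spmv` are hypothetical, and the code is a sequential stand-in for a one-thread-per-row GPU kernel, not the authors' implementation.

```python
def csr_to_ellpack_r(row_ptr, col_idx, values):
    """Convert CSR arrays to padded ELLPACK arrays plus a row-length array rl."""
    n_rows = len(row_ptr) - 1
    rl = [row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)]
    width = max(rl) if rl else 0
    # Column-major layout: the k-th nonzero of row i lives at index k * n_rows + i,
    # so neighbouring rows (threads) touch neighbouring memory locations.
    ell_val = [0.0] * (width * n_rows)
    ell_col = [0] * (width * n_rows)
    for i in range(n_rows):
        for k in range(rl[i]):
            ell_val[k * n_rows + i] = values[row_ptr[i] + k]
            ell_col[k * n_rows + i] = col_idx[row_ptr[i] + k]
    return ell_val, ell_col, rl


def ellpack_r_spmv(ell_val, ell_col, rl, x):
    """Compute y = A @ x; each outer iteration models one GPU thread."""
    n_rows = len(rl)
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(rl[i]):               # stop at the row length: padding is skipped
            idx = k * n_rows + i
            acc += ell_val[idx] * x[ell_col[idx]]
        y[i] = acc
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
ell_val, ell_col, rl = csr_to_ellpack_r(
    [0, 2, 3, 5], [0, 2, 1, 0, 2], [4.0, 1.0, 3.0, 2.0, 5.0])
y = ellpack_r_spmv(ell_val, ell_col, rl, [1.0, 2.0, 3.0])
print(rl, y)
```

The padding keeps the layout regular, while the row-length array avoids wasted multiplications on short rows.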

13.

Using sampled information: is it enough for the sparse matrix–vector product locality optimization?

Juan C. Pichel Juan A. Lorenzo Francisco F. Rivera Dora B. Heras Tomás F. Pena 《Concurrency and Computation》2014,26(1):98-117

One of the main factors that affect the performance of the sparse matrix–vector product (SpMV) is the low data reuse caused by the irregular and indirect memory access patterns. Different strategies to deal with this problem, such as data reordering techniques, have been proposed. The computational cost of these techniques is typically high because they consider all the nonzeros of the sparse matrix in order to find an appropriate permutation of rows and columns that improves the SpMV performance. In this paper, we analyze the possibility of increasing the locality of the SpMV using incomplete information in the reordering process. This partial information comes from considering only a subset of the nonzero elements of the matrix. These nonzeros are obtained from the original matrix through a sampling process. In particular, two different sampling methods have been considered: random sampling and event‐based sampling using hardware counters. We have found that a small number of samples is enough to obtain quality reorderings. As a consequence, using sampling‐based reorderings leads to noticeable performance improvements with respect to the non‐reordered matrices, reaching speedups of up to 2.1×. In addition, an important reduction in the computational time required by the reordering technique has been observed. Copyright © 2012 John Wiley & Sons, Ltd.

14.

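Implementation and optimization of HYB-format sparse matrix-vector multiplication on CPU+GPU heterogeneous systems

阳王东 李肯立 《Computer Engineering & Science》 2016,38(2):202-209

Sparse matrix-vector multiplication (SpMV) is an important problem in solving sparse linear systems, but because the nonzero elements are sparse, its computational density is low, which makes it inefficient. To address the irregularity found in sparse matrices, using a hybrid storage format for SpMV can improve the compression efficiency for sparse matrices and broaden the range of matrices it suits. HYB is a widely used hybrid compressed format with fairly stable performance. As GPU parallel computing has become commonplace and CPUs have become increasingly multi-core, building heterogeneous parallel systems from GPUs and multi-core CPUs has gained wide acceptance. Based on the storage characteristics of the ELL and COO parts of the HYB format, the two parts of the data are assigned to the CPU and the GPU for cooperative parallel computation. This makes full use of the computing resources of both devices and exploits their respective computational characteristics, thereby improving resource utilization. Building on an analysis of the CPU+GPU heterogeneous computing model, data partitioning and sharing for the hybrid format are optimized to better exploit the heterogeneous environment and improve performance.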
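The HYB split described above can be sketched as follows. This is a minimal illustration under assumptions: the names `split_hyb`, `ell_part`, `coo_part`, and `hyb_spmv` are hypothetical, the ELL width is passed in by hand, and the two partial products are computed one after the other, whereas the paper distributes them across the CPU and the GPU.

```python
def split_hyb(row_ptr, col_idx, values, width):
    """Split a CSR matrix into an ELL part and a COO part.

    The first `width` nonzeros of each row go to the ELL part (regular),
    and any overflow goes to the COO part (irregular).
    """
    n_rows = len(row_ptr) - 1
    ell = [[] for _ in range(n_rows)]        # per-row (col, val) pairs, at most `width`
    coo = []                                 # (row, col, val) triples
    for i in range(n_rows):
        for pos, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            if pos < width:
                ell[i].append((col_idx[k], values[k]))
            else:
                coo.append((i, col_idx[k], values[k]))
    return ell, coo


def ell_part(ell, x):
    """Partial product from the regular ELL part."""
    return [sum(v * x[j] for j, v in row) for row in ell]


def coo_part(coo, x, n_rows):
    """Partial product from the overflow COO part."""
    y = [0.0] * n_rows
    for i, j, v in coo:
        y[i] += v * x[j]
    return y


def hyb_spmv(ell, coo, x):
    """Combine the two independent partial products into y = A @ x."""
    y_ell = ell_part(ell, x)                 # could run on one device
    y_coo = coo_part(coo, x, len(ell))       # could run concurrently on the other
    return [a + b for a, b in zip(y_ell, y_coo)]


# A = [[1, 2, 3],
#      [0, 4, 0],
#      [0, 0, 5]]
ell, coo = split_hyb([0, 3, 4, 5], [0, 1, 2, 1, 2],
                     [1.0, 2.0, 3.0, 4.0, 5.0], width=1)
y = hyb_spmv(ell, coo, [1.0, 1.0, 1.0])
print(coo, y)
```

Because the two partial products share only the input vector, they can proceed independently until the final addition.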

15.

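Acceleration algorithms for unstructured-mesh computation on Sunway TaihuLight

许乐 安虹 陈俊仕 张鹏飞 武铮 《Computer Engineering》 2022,48(12):45-53

On the domestic heterogeneous many-core platform Sunway TaihuLight, unstructured-mesh computation is characterized by sparse storage, scattered memory access, and data dependencies, which severely limit the performance of the many-core processor. To address sparse storage and scattered memory access, an N-order diagonal coloring algorithm is proposed that effectively balances computation between master and slave cores and uses the slave cores to convert global memory accesses into LDM accesses. For the computation races caused by data dependencies, an adaptive, dependency-free task partitioning method is adopted to avoid data conflicts during parallel computation. To optimize for the processor architecture and for unstructured-mesh computation, the master core and slave cores run asynchronously in parallel and are used in differentiated roles to make full use of hardware resources. In addition, the register communication mechanism provided by the processor is not used, which reduces the synchronization overhead of the slave-core array and eases porting to the next-generation Sunway platform. Furthermore, asynchronous overlap of computation and memory access is used to hide memory latency. Experiments with the SpMV, Integration, and calcLudsFcc operators show that, compared with the master-core implementation, the combined acceleration algorithm achieves an average speedup of 10x across problem sizes and a maximum of 24x. The N-order diagonal coloring algorithm achieves more than 5.8x speedup over a non-colored blocking algorithm, effectively improving data locality and computational parallelism. The algorithm also accelerates operators with dependency-induced computation conflicts well, which validates the adaptive, dependency-free task partitioning method.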

16.

Parallel sparse matrix vector multiply software for matrices with data locality

RAY S. TUMINARO JOHN N. SHADID SCOTT A. HUTCHINSON 《Concurrency and Computation》1998,10(3):229-247

In this paper we describe general software utilities for performing unstructured sparse matrix–vector multiplications on distributed-memory message-passing computers. The matrix–vector multiply is an important kernel in the solution of large sparse linear systems by iterative methods. Our focus is to present the data structures and communication parameters required by these utilities for general sparse unstructured matrices with data locality. These types of matrices are commonly produced by finite difference and finite element approximations to systems of partial differential equations. In this discussion we also present representative examples and timings which demonstrate the utility and performance of the software. © 1998 John Wiley & Sons, Ltd.

17.

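A deep-learning-based performance prediction model for sparse matrix-vector multiplication

曹中潇 冯仰德 王珏 闵维潇 姚铁锤 高岳 王丽华 高付海 《Computer Engineering》 2022,48(2):86-91

Sparse matrix-vector multiplication (SpMV) is the computational core of solving sparse linear systems and is widely used in scientific computing and engineering applications such as economic modeling and signal processing. Research on SpMV and its tuning techniques helps improve computational efficiency in those fields. Traditional SpMV auto-tuning methods improve performance by setting parameters based on the architecture of the hardware platform, but the large number of parameter settings enlarges the search space and substantially increases auto-tuning time. Using deep learning, this work builds a convolutional-neural-network-based SpMV performance prediction model that fuses two-channel sparse matrix features and fuses sparse matrix features with architecture features, enabling fast auto-tuning. To improve the accuracy of predicted SpMV run time, feature data are selected and box plots are used to summarize SpMV timing information, and experiments are designed and validated on the Florida sparse matrix collection. The results show that the model predicts SpMV run time with an accuracy above 80% and generalizes well.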

18.

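GPU-based preconditioned conjugate gradient method for sparse linear systems

张健飞 沈德飞 《Journal of Computer Applications》 2013,33(3):825-829

This work studies accelerating the preconditioned conjugate gradient method for sparse linear systems on GPUs. A program was written on the Compute Unified Device Architecture (CUDA) platform, and its performance was tested and analyzed on an NVIDIA GT430 GPU. The sparse matrix is stored in compressed sparse row (CSR) format. Based on the characteristics of the preconditioned conjugate gradient algorithm, the work studies performance optimization of GPU sparse matrix-vector multiplication and measures for speeding up data transfer from the CPU to the GPU. The custom sparse matrix-vector multiplication kernel was compared with the cusparseDcsrmv function in the CUSPARSE library, with a best-case speedup of 2.1x. For the full preconditioned conjugate gradient method, the implementation using custom kernels has a slight advantage over the one using the CUBLAS and CUSPARSE libraries, and achieves a best-case speedup of 7.4x over the CPU preconditioned conjugate gradient method.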
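To show where SpMV sits inside the preconditioned conjugate gradient iteration, here is a minimal CPU sketch. The abstract does not say which preconditioner the authors used, so the Jacobi (diagonal) preconditioner below is an assumption chosen for simplicity. The function names `csr_spmv`, `dot`, and `pcg` are likewise illustrative and unrelated to the CUSPARSE or CUBLAS APIs.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """y = A @ x for a CSR matrix; the dominant cost of each PCG iteration."""
    return [sum(values[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pcg(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A in CSR form.

    Uses a Jacobi preconditioner M = diag(A); this choice is an assumption
    made for the sketch, not something stated in the source.
    """
    n = len(b)
    diag = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            if col_idx[k] == i:
                diag[i] = values[k]
    x = [0.0] * n
    r = list(b)                                  # residual for the zero initial guess
    z = [ri / di for ri, di in zip(r, diag)]     # apply M^{-1}
    p = list(z)
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:
            break
        Ap = csr_spmv(row_ptr, col_idx, values, p)   # one SpMV per iteration
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [ri / di for ri, di in zip(r, diag)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x


# A = [[4, 1],
#      [1, 3]],  b = [1, 2]; the exact solution is [1/11, 7/11].
x = pcg([0, 2, 4], [0, 1, 0, 1], [4.0, 1.0, 1.0, 3.0], [1.0, 2.0])
print(x)
```

Apart from the single SpMV, each iteration consists of inner products and vector updates, which is why the abstract concentrates on SpMV performance and on CPU-to-GPU data transfer.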

19.

Optimal Sparse Matrix Dense Vector Multiplication in the I/O-Model

Michael A. Bender Gerth Stølting Brodal Rolf Fagerberg Riko Jacob Elias Vicari 《Theory of Computing Systems》2010,47(4):934-962

We study the problem of sparse-matrix dense-vector multiplication (SpMV) in external memory. The task of SpMV is to compute y := Ax, where A is a sparse N×N matrix and x is a vector. We express sparsity by a parameter k, and for each choice of k consider the class of matrices where the number of nonzero entries is kN, i.e., where the average number of nonzero entries per column is k.

20.

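Sparse least squares support vector machine and its application (Cited by: 2; self-citations: 0; citations by others: 2)

宋海鹰 桂卫华 阳春华 《Information and Control》 2008,37(3):334-1

A method for constructing a sparse least squares support vector machine is proposed. The method first reduces the kernel matrix by Schmidt orthogonalization to obtain a set of basis vectors of the kernel matrix, then uses kernel partial least squares to perform the regression for the least squares support vector machine, which gives the machine a degree of sparsity. A nonlinear dynamic prediction model is built on the sparse least squares support vector machine and used for rolling prediction of the blowing time during the slag-making period of a copper converter. Simulation results show that the sparse least squares support vector machine identified by kernel partial least squares is computationally efficient and predicts accurately.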