Similar Documents
 19 similar documents found (search time: 234 ms)
1.
Sparse matrix-vector multiplication (SpMV) with common sparse storage formats suffers from low computational efficiency, and the blocked row-column (BRC) storage format lacks reproducibility and determinism in its results. To address this, an improved BRCP storage format is proposed. It adopts a different two-dimensional blocking strategy that adaptively adjusts the blocking parameters according to the statistical distribution of non-zero elements in each row, improving the parallelism of SpMV on GPU platforms, and a GPU kernel based on a fast segmented-sum algorithm is designed to guarantee deterministic results that are reproducible across different GPU platforms. Experimental results show that the BRCP format achieves high computational efficiency, reduces the SpMV computation error in parallel environments compared with the BRC format, and improves the accuracy of PageRank ranking.
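The BRCP kernel itself is not reproduced in the abstract; the reproducibility idea can be illustrated with a minimal CPU sketch in which each row's products are reduced in a fixed order (a sequential segmented sum), so the rounding is identical across runs. The function name `csr_spmv_segsum` and the CSR layout are illustrative assumptions, not the paper's format.

```cpp
#include <cstdio>
#include <vector>

// Deterministic CSR SpMV: products are accumulated strictly in column order
// within each row (a sequential "segmented sum"), so the floating-point
// result is bit-identical across runs regardless of how rows are scheduled.
void csr_spmv_segsum(int n, const std::vector<int>& rowptr,
                     const std::vector<int>& col, const std::vector<double>& val,
                     const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)  // fixed order => reproducible
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main() {
    // 3x3 example: [[4,1,0],[0,3,2],[1,0,5]]
    std::vector<int>    rowptr = {0, 2, 4, 6};
    std::vector<int>    col    = {0, 1, 1, 2, 0, 2};
    std::vector<double> val    = {4, 1, 3, 2, 1, 5};
    std::vector<double> x = {1, 2, 3}, y(3);
    csr_spmv_segsum(3, rowptr, col, val, x, y);
    for (double v : y) std::printf("%g\n", v);   // 6 12 16
}
```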

2.
Reservoir numerical simulation, like many other scientific computing problems, requires solving large sparse linear systems. In iterative methods for such systems, sparse matrix-vector multiplication (SpMV) is one of the core functions that determines computational efficiency. As hardware architectures become heterogeneous, scientific computing has moved from single-core and multi-core CPU architectures to multi-core CPU plus many-core accelerator (GPU or MIC) architectures, and the efficiency of SpMV is closely tied to both the sparse matrix storage format and the hardware architecture. Targeting the sparsity pattern of the Jacobian matrices common in reservoir simulation, and exploiting coalesced memory access and concurrent execution on GPU cores together with the algorithmic requirements of the reservoir simulator's linear solver, this paper designs a BHYB matrix storage format and a corresponding thread-group parallelization strategy. Numerical experiments show that SpMV based on this format achieves a speedup of up to 19x over serial SpMV with the BCSR format, and is 30% to 80% faster than the most efficient HYB-format SpMV in the cuSPARSE library. In addition, the proposed BHYB format for storing block matrices on the GPU and the thread-group parallelization strategy can serve as a reference for designing and optimizing kernel functions in other GPU parallel programs.
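The BHYB layout is not spelled out in the abstract; as a rough illustration of block-wise storage for such Jacobians, here is a generic block-CSR (BCSR) SpMV sketch with fixed B x B dense blocks. The names and layout are assumptions for illustration, not the paper's format.

```cpp
#include <cstdio>
#include <vector>

// SpMV for a block-CSR layout with fixed B x B dense blocks, the kind of block
// structure that Jacobians from multi-variable reservoir models tend to have.
// Each stored block is B*B values in row-major order.
template <int B>
void bcsr_spmv(int nblockrows, const std::vector<int>& browptr,
               const std::vector<int>& bcol, const std::vector<double>& bval,
               const std::vector<double>& x, std::vector<double>& y) {
    for (int bi = 0; bi < nblockrows; ++bi) {
        double acc[B] = {0.0};
        for (int k = browptr[bi]; k < browptr[bi + 1]; ++k) {
            const double* blk = &bval[(size_t)k * B * B];
            int xoff = bcol[k] * B;
            for (int r = 0; r < B; ++r)
                for (int c = 0; c < B; ++c)
                    acc[r] += blk[r * B + c] * x[xoff + c];
        }
        for (int r = 0; r < B; ++r) y[bi * B + r] = acc[r];
    }
}

int main() {
    // One block row containing a single 2x2 block [[1,2],[3,4]] at block column 0.
    std::vector<int> browptr = {0, 1}, bcol = {0};
    std::vector<double> bval = {1, 2, 3, 4}, x = {1, 1}, y(2);
    bcsr_spmv<2>(1, browptr, bcol, bval, x, y);
    std::printf("%g %g\n", y[0], y[1]);  // 3 7
}
```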

3.
Sparse matrix-vector multiplication (SpMV) plays an important role in solving linear systems and is one of the core problems in scientific computing and engineering practice; its performance depends heavily on the non-zero distribution of the sparse matrix. Sparse diagonal matrices are a special class of sparse matrices whose non-zeros are densely arranged along diagonals. For such matrices, the storage formats proposed on GPU platforms improve SpMV performance to some extent, but still suffer from zero padding and load imbalance. To address these problems, a DRM storage format is proposed, which uses a fixed-threshold matrix partitioning strategy and an iterative-merging matrix reconstruction strategy to achieve little zero padding and inter-block load balance. Experimental results on an NVIDIA Tesla V100 show that, compared with the DIA, HDC, HDIA and DIA-Adaptive formats, the proposed format achieves speedups of 20.76x, 1.94x, 1.13x and 2.26x in execution time, and improvements of 1.54x, 5.28x, 1.13x and 1.94x in floating-point performance, respectively.
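For reference, the baseline that DRM and its competitors improve upon is the DIA (diagonal) format; a minimal CPU sketch of DIA SpMV is shown below, with an illustrative layout (one padded length-n column per stored diagonal), not the paper's DRM scheme.

```cpp
#include <cstdio>
#include <vector>

// SpMV for the DIA (diagonal) format: 'offsets[d]' is the diagonal offset
// (0 = main diagonal, +1 = first superdiagonal, -1 = first subdiagonal) and
// 'data' stores each diagonal as a length-n column, padded with zeros where
// the diagonal leaves the matrix.
void dia_spmv(int n, const std::vector<int>& offsets, const std::vector<double>& data,
              const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < n; ++i) y[i] = 0.0;
    for (size_t d = 0; d < offsets.size(); ++d)
        for (int i = 0; i < n; ++i) {
            int j = i + offsets[d];
            if (j >= 0 && j < n) y[i] += data[d * n + i] * x[j];
        }
}

int main() {
    // Tridiagonal 4x4 with stencil (-1, 2, -1), i.e. a 1-D Laplacian.
    int n = 4;
    std::vector<int> offsets = {-1, 0, 1};
    std::vector<double> data = {0, -1, -1, -1,   // subdiagonal (first entry padded)
                                2,  2,  2,  2,   // main diagonal
                               -1, -1, -1,  0};  // superdiagonal (last entry padded)
    std::vector<double> x = {1, 1, 1, 1}, y(n);
    dia_spmv(n, offsets, data, x, y);
    for (double v : y) std::printf("%g ", v);    // 1 0 0 1
}
```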

4.
Sparse matrix-vector multiplication (SpMV) is one of the important computational kernels in science and engineering, and a key function of the sparse Basic Linear Algebra Subprograms (BLAS) library. Much existing work improves SpMV performance to varying degrees, but most optimizations target a specific storage format or a class of matrices with particular characteristics and therefore lack generality; as a result, high-performance SpMV implementations have not been widely adopted in real applications and numerical solvers. In addition, sparse matrices have many storage formats, and SpMV performance differs significantly among them. Based on these observations, an SpMV auto-tuner (SMAT) is proposed. For a given sparse matrix, SMAT selects and returns its optimal storage format according to the matrix features; applications obtain a suitable format by calling SMAT and thus gain performance, and as more formats are added to SMAT, more SpMV optimizations can deliver their benefits in real applications. Using 2366 sparse matrices from the University of Florida collection as the test set, SMAT achieves a peak floating-point performance of 9.11 GFLOPS (single precision) and 2.44 GFLOPS (double precision) on an Intel platform, and 3.36 GFLOPS (single precision) and 1.52 GFLOPS (double precision) on an AMD platform. Compared with Intel's Math Kernel Library (MKL), SMAT obtains an average speedup of 1.4x to 1.5x.
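To make the auto-tuning idea concrete, here is a toy feature-based format selector: it extracts a few structural features from a CSR matrix and picks a format with simple rules. The thresholds and rules are illustrative guesses, not SMAT's actual model.

```cpp
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

// Toy version of an SpMV auto-tuner: inspect the CSR structure and choose a
// storage format. Real tuners (like SMAT) use trained models, not these rules.
std::string pick_format(int n, const std::vector<int>& rowptr, const std::vector<int>& col) {
    int nnz = rowptr[n];
    int max_row = 0, near_diag = 0;
    for (int i = 0; i < n; ++i) {
        max_row = std::max(max_row, rowptr[i + 1] - rowptr[i]);
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            if (std::abs(col[k] - i) <= 2) ++near_diag;  // close to the main diagonal?
    }
    double avg_row = double(nnz) / n;
    if (near_diag > 0.95 * nnz) return "DIA";   // nearly banded
    if (max_row < 2.0 * avg_row) return "ELL";  // rows have similar lengths
    return "CSR";                               // irregular fallback
}

int main() {
    // Tridiagonal example: should be classified as DIA-friendly.
    std::vector<int> rowptr = {0, 2, 5, 8, 10};
    std::vector<int> col    = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    std::printf("%s\n", pick_format(4, rowptr, col).c_str());
}
```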

5.
Sparse matrix-vector multiplication (SpMV) is a fundamental operation in scientific computing and engineering applications, and its high-performance implementation and optimization is an active research topic in computational science. Solving differential equations produces large sparse matrices, a large fraction of which are quasi-diagonal. To handle the irregularities of quasi-diagonal matrices, a hybrid of the diagonal (DIA) and compressed sparse row (CSR) formats is proposed for SpMV: scattered non-zeros outside the extracted diagonal band are stored in CSR, which avoids the rapid growth in stored columns that DIA suffers for irregular matrices, while the diagonal band itself is stored in DIA to fully exploit the diagonal structure and reduce the imbalance in the number of non-zeros per CSR row. The bandwidth of the stored band can be adjusted to fit different discretization patterns of quasi-diagonal matrices, yielding a higher compression ratio than either DIA or CSR alone and reducing the data volume of the computation. Experiments on a GPU using the CUDA platform show that this method achieves higher speedups than DIA and CSR.
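A compileable CPU sketch of this DIA+CSR split is shown below: non-zeros inside a diagonal band of half-width `bw` go to a DIA part, everything else to a CSR part, and y = A*x is accumulated from both. The struct layout and names are assumptions for illustration, not the paper's exact data structures.

```cpp
#include <vector>

// Hybrid DIA+CSR storage: a band of 2*bw+1 diagonals plus CSR "outliers".
struct HybridDiaCsr {
    int n, bw;                         // matrix size, band half-width
    std::vector<double> dia;           // (2*bw+1) diagonals, each of length n, zero-padded
    std::vector<int> rowptr, col;      // CSR part for the scattered outliers
    std::vector<double> val;
};

void hybrid_spmv(const HybridDiaCsr& A, const std::vector<double>& x, std::vector<double>& y) {
    int ndiag = 2 * A.bw + 1;
    for (int i = 0; i < A.n; ++i) {
        double sum = 0.0;
        for (int d = 0; d < ndiag; ++d) {          // DIA part: offsets -bw..+bw
            int j = i + d - A.bw;
            if (j >= 0 && j < A.n) sum += A.dia[d * A.n + i] * x[j];
        }
        for (int k = A.rowptr[i]; k < A.rowptr[i + 1]; ++k)  // CSR part: outliers
            sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
    }
}
```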

6.
Sparse matrix-vector multiplication (SpMV) is an important problem in solving sparse linear systems, but because the non-zeros are sparse, the computational density is low and the efficiency is poor. To cope with the irregularities of sparse matrices, a hybrid storage format is used for SpMV, which improves compression efficiency and broadens the range of applicable matrices. HYB is a widely used hybrid format with relatively stable performance. As GPU parallel computing has become common and CPUs have become increasingly multi-core, heterogeneous parallel systems built from GPUs and multi-core CPUs have gained broad acceptance. Based on the characteristics of the ELL and COO parts of the HYB format, the two parts are split onto the CPU and the GPU for cooperative parallel computation, which fully utilizes the computing resources of both devices, exploits their respective computational characteristics, and thereby improves resource utilization. Building on an analysis of the CPU+GPU heterogeneous computing model, the data partitioning and sharing of the hybrid format are optimized, which exploits the advantages of the heterogeneous environment and improves performance.
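A minimal sketch of the HYB decomposition follows: the first `ell_width` non-zeros of each row live in regular ELL arrays, and the overflow lives in COO triples. In the heterogeneous scheme described above the two parts can be dispatched to different devices; here both are plain CPU loops, and the column-major ELL layout with pad index -1 is an assumption for illustration.

```cpp
#include <vector>

// HYB = ELL + COO. The ELL part initializes y, the COO part adds the overflow.
void ell_part(int n, int ell_width, const std::vector<int>& ell_col,
              const std::vector<double>& ell_val, const std::vector<double>& x,
              std::vector<double>& y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = 0; k < ell_width; ++k) {
            int j = ell_col[(size_t)k * n + i];        // column-major ELL layout
            if (j >= 0) sum += ell_val[(size_t)k * n + i] * x[j];  // -1 marks padding
        }
        y[i] = sum;
    }
}

void coo_part(const std::vector<int>& row, const std::vector<int>& col,
              const std::vector<double>& val, const std::vector<double>& x,
              std::vector<double>& y) {
    for (size_t k = 0; k < val.size(); ++k)
        y[row[k]] += val[k] * x[col[k]];               // overflow entries add on top
}
```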

7.
Research on automatic performance optimization of SpMV and its applications   (Cited by 1; self-citations: 0, others: 1)
In scientific computing, sparse matrix-vector multiplication (SpMV) is an extremely important and frequently invoked computational kernel. Because typical SpMV implementations have a very low ratio of floating-point operations to memory accesses and a highly irregular memory access pattern, their actual performance is often poor. By applying a register blocking algorithm and a heuristic block-size selection algorithm, the sparse matrix is split into small dense blocks and the elements of the vector x held in registers are reused, which improves the performance of this kernel. Several key optimization techniques used in the OSKI package are analyzed and summarized, and their performance in real applications is measured. The tests show that an application must call SpMV on the order of a hundred times or more before the extra overhead introduced by these optimizations is amortized and a net speedup is obtained. On Pentium 4 and AMD Athlon platforms, with 10 test matrices, average speedups of 1.69 and 1.48 were achieved, respectively.
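The core of heuristic block-size selection is estimating, for each candidate r x c block size, how much explicit zero fill the blocking would introduce. A small sketch of that fill-ratio estimate (in the spirit of OSKI, but not its actual code) is shown below; a full tuner would combine this with a measured dense-block throughput profile.

```cpp
#include <cstdio>
#include <set>
#include <utility>
#include <vector>

// Fill ratio for an r x c register-blocking of a CSR matrix:
// (stored entries after blocking) / (true non-zeros).
double fill_ratio(int r, int c, const std::vector<int>& rowptr,
                  const std::vector<int>& col, int n) {
    std::set<std::pair<int, int>> blocks;        // which (blockRow, blockCol) are touched
    for (int i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            blocks.insert({i / r, col[k] / c});
    return double(blocks.size()) * r * c / rowptr[n];
}

int main() {
    // Tridiagonal 4x4 matrix in CSR.
    std::vector<int> rowptr = {0, 2, 5, 8, 10};
    std::vector<int> col    = {0, 1, 0, 1, 2, 1, 2, 3, 2, 3};
    for (int r : {1, 2}) for (int c : {1, 2})
        std::printf("%dx%d fill ratio = %.2f\n", r, c, fill_ratio(r, c, rowptr, col, 4));
}
```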

8.
The GPU-accelerated solution of sparse linear systems with the preconditioned conjugate gradient (PCG) method is studied. A program was developed on the Compute Unified Device Architecture (CUDA) platform, and its performance was tested and analyzed on an NVIDIA GT430 GPU. The sparse matrix is stored in the compressed sparse row (CSR) format. According to the characteristics of the PCG algorithm, the optimization of GPU-based sparse matrix-vector multiplication and the acceleration of data transfer from the CPU to the GPU are investigated. The hand-written sparse matrix-vector multiplication kernel is compared with the cusparseDcsrmv function in the cuSPARSE library, achieving a best-case speedup of 2.1x. For the complete PCG solver, the implementation based on hand-written kernels is slightly faster than the one built on the CUBLAS and cuSPARSE libraries, and achieves a best-case speedup of 7.4x over the CPU implementation of PCG.
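For reference, the structure of a preconditioned CG iteration built around a CSR SpMV is sketched below as serial CPU code; the paper runs these kernels on the GPU, and the Jacobi (diagonal) preconditioner here is a placeholder choice, not necessarily the one used in the paper.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using vec = std::vector<double>;

void spmv(int n, const std::vector<int>& rp, const std::vector<int>& ci,
          const vec& v, const vec& x, vec& y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rp[i]; k < rp[i + 1]; ++k) s += v[k] * x[ci[k]];
        y[i] = s;
    }
}
double dot(const vec& a, const vec& b) {
    double s = 0.0;
    for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Jacobi-preconditioned CG for SPD A (CSR), starting from x = 0.
void pcg(int n, const std::vector<int>& rp, const std::vector<int>& ci, const vec& v,
         const vec& diag, const vec& b, vec& x, int maxit = 1000, double tol = 1e-10) {
    vec r = b, z(n), p(n), Ap(n);
    for (int i = 0; i < n; ++i) z[i] = r[i] / diag[i];   // apply M^{-1}
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxit && std::sqrt(dot(r, r)) > tol; ++it) {
        spmv(n, rp, ci, v, p, Ap);
        double alpha = rz / dot(p, Ap);
        for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        for (int i = 0; i < n; ++i) z[i] = r[i] / diag[i];
        double rz_new = dot(r, z);
        for (int i = 0; i < n; ++i) p[i] = z[i] + (rz_new / rz) * p[i];
        rz = rz_new;
    }
}

int main() {
    // SPD example: [[4,1],[1,3]] x = [1,2]  =>  x ~ (0.0909, 0.6364)
    std::vector<int> rp = {0, 2, 4}, ci = {0, 1, 0, 1};
    vec v = {4, 1, 1, 3}, diag = {4, 3}, b = {1, 2}, x(2, 0.0);
    pcg(2, rp, ci, v, diag, b, x);
    std::printf("x = %g %g\n", x[0], x[1]);
}
```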

9.
Optimization methods for computing the principal eigenvectors of large-scale sparse matrices   (Cited by 1; self-citations: 0, others: 1)
Principal eigenvector computation (PEC) is an important problem in scientific and engineering computing. With the rise of general-purpose computing on graphics processing units (GPGPU), using GPUs to accelerate PEC for large sparse matrices has attracted broad attention. The performance bottlenecks of PEC are analyzed from two perspectives, the application characteristics and the GPU architecture; a GPU-oriented sparse matrix storage format, GPU-ELL, and a thread-mapping optimization strategy for the GPU are proposed, and a corresponding optimized PEC algorithm is designed. Experimental results on an ATI Radeon HD 5850 show that the proposed scheme achieves a speedup of up to roughly 200x over a conventional CPU and a 2x speedup over an existing GPU implementation.
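The standard workhorse for PEC is power iteration, whose per-step cost is dominated by one SpMV, which is exactly what the GPU-ELL format accelerates. A minimal sketch follows, with the matrix-vector product passed in as a callback so any storage format fits; the dense 2x2 example and names are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

using vec = std::vector<double>;

// Power iteration for the dominant eigenpair: repeatedly apply A and normalize.
double power_iteration(int n, const std::function<void(const vec&, vec&)>& matvec,
                       vec& v, int iters = 100) {
    vec w(n);
    double lambda = 0.0;
    for (int it = 0; it < iters; ++it) {
        matvec(v, w);                              // w = A * v  (the SpMV kernel)
        double norm = std::sqrt(std::inner_product(w.begin(), w.end(), w.begin(), 0.0));
        for (int i = 0; i < n; ++i) v[i] = w[i] / norm;
        lambda = norm;                             // converges to |dominant eigenvalue|
    }
    return lambda;
}

int main() {
    // Dense 2x2 stand-in for a sparse matrix: [[2,1],[1,2]], dominant eigenvalue 3.
    auto matvec = [](const vec& x, vec& y) { y[0] = 2*x[0] + x[1]; y[1] = x[0] + 2*x[1]; };
    vec v = {1.0, 0.0};
    std::printf("lambda ~ %g\n", power_iteration(2, matvec, v));
}
```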

10.
GPU-based solution of large sparse linear systems is studied, and a blocked sparse matrix storage format, HMEC (hybrid multiple ELL and CSR), is proposed. The storage structure of the coefficient matrix is optimized by reordering, the matrix is partitioned into blocks stored in a chosen ratio, and the ELL and CSR formats are combined to match the characteristics of the different blocks. Large sparse linear systems are then solved with the incomplete-LU-preconditioned BiCGStab method for non-symmetric matrices and the incomplete-Cholesky-preconditioned conjugate gradient method for symmetric positive definite matrices. Experiments show that storing the sparse matrix in the HMEC format and implementing these two methods via GPU kernel calls achieves improvements of up to 31.89% and 17.50%, respectively, compared with implementations based on other storage formats.

11.
Existing formats for Sparse Matrix–Vector Multiplication (SpMV) on the GPU are outperforming their corresponding implementations on multi-core CPUs. In this paper, we present a new format called Sliced COO (SCOO) and an efficient CUDA implementation to perform SpMV on the GPU using atomic operations. We compare SCOO performance to existing formats of the NVIDIA Cusp library using large sparse matrices. Our results for single-precision floating-point matrices show that SCOO outperforms the COO and CSR format for all tested matrices and the HYB format for all tested unstructured matrices on a single GPU. Furthermore, our dual-GPU implementation achieves an efficiency of 94% on average. Due to the lower performance of existing CUDA-enabled GPUs for atomic operations on double-precision floating-point numbers, the SCOO implementation for double-precision does not consistently outperform the other formats for every unstructured matrix. Overall, the average speedup of SCOO for the tested benchmark dataset is 3.33 (1.56) compared to CSR, 5.25 (2.42) compared to COO, 2.39 (1.37) compared to HYB for single (double) precision on a Tesla C2075. Furthermore, comparison to a Sandy-Bridge CPU shows that SCOO on a Fermi GPU outperforms the multi-threaded CSR implementation of the Intel MKL Library on an i7-2700K by a factor between 5.5 (2.3) and 18 (12.7) for single (double) precision.
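The reason COO-style SpMV needs atomics is that when non-zeros are distributed over threads, several threads may hold entries of the same row and must add into the same y[i]. The OpenMP sketch below mirrors what SCOO does with CUDA atomics; it is a generic illustration, not the SCOO kernel.

```cpp
#include <cstdio>
#include <vector>

// Parallel COO SpMV: each iteration handles one non-zero, so updates to y
// must be atomic because two threads may target the same row.
void coo_spmv_atomic(const std::vector<int>& row, const std::vector<int>& col,
                     const std::vector<double>& val, const std::vector<double>& x,
                     std::vector<double>& y) {
    #pragma omp parallel for
    for (long k = 0; k < (long)val.size(); ++k) {
        double contrib = val[k] * x[col[k]];
        #pragma omp atomic
        y[row[k]] += contrib;
    }
}

int main() {
    // [[4,1,0],[0,3,2],[1,0,5]] in COO, x = (1,2,3).
    std::vector<int> row = {0, 0, 1, 1, 2, 2}, col = {0, 1, 1, 2, 0, 2};
    std::vector<double> val = {4, 1, 3, 2, 1, 5}, x = {1, 2, 3}, y(3, 0.0);
    coo_spmv_atomic(row, col, val, x, y);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // 6 12 16
}
```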

12.
The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devise data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared on the scene. This work proposes and evaluates a new implementation of SpMV for NVIDIA GPUs based on a new format, ELLPACK‐R, that allows storage of the sparse matrix in a regular manner. A comparative evaluation against a variety of storage formats previously proposed has been carried out based on a representative set of test matrices. The results show that, although the performance strongly depends on the specific pattern of the matrix, the implementation based on ELLPACK‐R achieves higher overall performance. Moreover, a comparison with standard state‐of‐the‐art superscalar processors reveals that significant speedup factors are achieved with GPUs. Copyright © 2010 John Wiley & Sons, Ltd.
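The key difference between ELLPACK-R and plain ELLPACK is an extra per-row length array, so each row only loops over its real non-zeros instead of the padding. A minimal CPU sketch of that access pattern is given below; the layout details (column-major arrays, zero padding) are the conventional ones and serve only as an illustration.

```cpp
#include <cstdio>
#include <vector>

// ELLPACK-R style SpMV: column-major ELL arrays padded to the longest row,
// plus a row-length array 'rl' so padding entries are never processed.
void ellr_spmv(int n, const std::vector<int>& col, const std::vector<double>& val,
               const std::vector<int>& rl, const std::vector<double>& x,
               std::vector<double>& y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = 0; k < rl[i]; ++k)                   // only the real entries
            sum += val[(size_t)k * n + i] * x[col[(size_t)k * n + i]];
        y[i] = sum;
    }
}

int main() {
    // [[4,1,0],[0,3,0],[1,0,5]]: row lengths 2,1,2, padded to width 2.
    int n = 3;
    std::vector<int>    col = {0, 1, 0,   1, 0, 2};   // column-major; pad col = 0
    std::vector<double> val = {4, 3, 1,   1, 0, 5};   // pad value = 0
    std::vector<int>    rl  = {2, 1, 2};
    std::vector<double> x = {1, 2, 3}, y(n);
    ellr_spmv(n, col, val, rl, x, y);
    std::printf("%g %g %g\n", y[0], y[1], y[2]);      // 6 6 16
}
```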

13.
This paper concerns an Intel Xeon Phi implementation of the explicit fourth-order Runge–Kutta method (RK4) for very sparse matrices with very short rows. Such matrices arise during Markovian modeling of computer and telecommunication networks. In this work an implementation based on Intel Math Kernel Library (Intel MKL) routines and the authors' own implementation, both using the CSR storage scheme and running on Intel Xeon Phi, were investigated. The implementation based on the Intel MKL library uses the high-performance BLAS and Sparse BLAS routines. In our application we focus on OpenMP-style programming. We implement the SpMV operation and vector addition using basic optimization techniques and vectorization. We evaluate our approach in native and offload modes for various numbers of cores and thread-affinity settings. Both implementations (based on Intel MKL and written by the authors) were compared with respect to time, speedup and performance. The numerical experiments on Intel Xeon Phi show that the performance of the authors' implementation is very promising, giving a gain of up to two times over the multithreaded implementation (based on Intel MKL) running on a CPU (Intel Xeon processor), and even three times in comparison with the application that uses Intel MKL on Intel Xeon Phi.
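The kernel being optimized is the combination of SpMV and vector updates inside an RK4 step for y' = A*y. A generic, compileable sketch of one such step over a CSR matrix is shown below; it is not the authors' code, and the uniformization details of the Markov model are omitted.

```cpp
#include <vector>

using vec = std::vector<double>;

// CSR SpMV: y = A * x.
void spmv(int n, const std::vector<int>& rp, const std::vector<int>& ci,
          const vec& v, const vec& x, vec& y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rp[i]; k < rp[i + 1]; ++k) s += v[k] * x[ci[k]];
        y[i] = s;
    }
}

// One classical RK4 step for y' = A*y with step size h: four SpMVs plus axpy updates.
void rk4_step(int n, const std::vector<int>& rp, const std::vector<int>& ci,
              const vec& v, vec& y, double h) {
    vec k1(n), k2(n), k3(n), k4(n), tmp(n);
    spmv(n, rp, ci, v, y, k1);                                   // k1 = A y
    for (int i = 0; i < n; ++i) tmp[i] = y[i] + 0.5 * h * k1[i];
    spmv(n, rp, ci, v, tmp, k2);                                 // k2 = A (y + h/2 k1)
    for (int i = 0; i < n; ++i) tmp[i] = y[i] + 0.5 * h * k2[i];
    spmv(n, rp, ci, v, tmp, k3);                                 // k3 = A (y + h/2 k2)
    for (int i = 0; i < n; ++i) tmp[i] = y[i] + h * k3[i];
    spmv(n, rp, ci, v, tmp, k4);                                 // k4 = A (y + h k3)
    for (int i = 0; i < n; ++i)
        y[i] += h / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
}
```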

14.
The abundant data parallelism available in many-core GPUs has been of key interest for improving accuracy in scientific and engineering simulation. In many cases, most of the simulation time is spent in the linear solver, dominated by the sparse matrix–vector multiply. In forward petroleum oil and gas reservoir simulation, applying a stencil relationship to a structured grid leads to a family of generalized hepta-diagonal solver matrices with some regularity and structural uniqueness. We present a customized storage scheme that takes advantage of the generalized hepta-diagonal sparsity pattern and stencil regularity by optimizing both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and the biconjugate gradient stabilized (BiCG-Stab) solver. In-kernel execution is intended to avoid the multiple kernel invocations associated with the use of numerical library operators. To stay in-kernel, a lock-free inter-block synchronization is used, in which completing thread blocks are assigned independent computations to avoid repeatedly polling global memory. Other optimizations enable combining reductions and collective write operations to memory. The in-kernel optimization is particularly useful for the iterative structure of BiCG-Stab, preserving vector data locality and avoiding saving vector data back to memory and reloading it on each kernel exit and re-entry. The evaluation uses generalized hepta-diagonal matrices derived from a range of structured grids in forward reservoir simulation. Results show the profitability of the proposed generalized hepta-diagonal custom storage scheme over standard library storage such as compressed sparse row, hybrid sparse, and diagonal formats. Using the proposed optimizations, SpMV and BiCG-Stab are noticeably accelerated compared to other implementations that use multiple kernel exit–re-entry when the solver is implemented by invoking numerical library operators.
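A hepta-diagonal matrix from a 7-point stencil can be multiplied without any column indices at all, storing only the seven diagonals. The sketch below assumes the usual lexicographic ordering of an nx x ny x nz grid and zero-padded diagonals where the stencil leaves the grid; it illustrates the idea, not the paper's exact custom scheme.

```cpp
#include <vector>

// SpMV for a hepta-diagonal matrix from a 7-point stencil on an nx*ny*nz grid:
// diag[d] holds the d-th diagonal as a full-length array (zero where the
// stencil leaves the physical grid), so no column indices are stored.
void hepta_spmv(int nx, int ny, int nz,
                const std::vector<std::vector<double>>& diag,
                const std::vector<double>& x, std::vector<double>& y) {
    int n = nx * ny * nz;
    const int offs[7] = {-nx * ny, -nx, -1, 0, 1, nx, nx * ny};
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int d = 0; d < 7; ++d) {
            int j = i + offs[d];
            if (j >= 0 && j < n) sum += diag[d][i] * x[j];
        }
        y[i] = sum;
    }
}
```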

15.
As the core of the Wiedemann algorithm, sparse matrix-vector multiplication is the main step in solving large sparse linear systems over GF(2). An FPGA-based ring-network hardware architecture for large sparse matrix-vector multiplication over GF(2) is proposed, together with a new parallel computing structure to handle the repeated SpMV computations required by the Wiedemann algorithm. Experimental analysis shows that the proposed architecture improves the parallelism of SpMV in the Wiedemann algorithm while fully utilizing the FPGA's on-chip memory and gigabit transceivers, and achieves a 2.65x speedup over the best-performing partially reconfigurable (PR) computing model currently available.
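Over GF(2) the matrix entries are 0/1, so a row-vector product reduces to the parity (XOR) of the selected bits of x. The software sketch below, with the vector packed 64 bits per word, shows what the FPGA architecture accelerates; the layout and names are illustrative, not the paper's hardware design.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Dot product of one GF(2) matrix row (given by the positions of its 1s) with
// a bit-packed vector x: just the parity of the selected bits.
int gf2_dot_row(const std::vector<int>& cols, const std::vector<uint64_t>& x_bits) {
    int parity = 0;
    for (int c : cols)
        parity ^= (x_bits[c >> 6] >> (c & 63)) & 1u;     // pick bit c of x
    return parity;
}

// y = A*x over GF(2), with rows stored as lists of 1-positions and y bit-packed.
void gf2_spmv(const std::vector<std::vector<int>>& rows,
              const std::vector<uint64_t>& x_bits, std::vector<uint64_t>& y_bits) {
    for (auto& w : y_bits) w = 0;
    for (size_t i = 0; i < rows.size(); ++i)
        if (gf2_dot_row(rows[i], x_bits))
            y_bits[i >> 6] |= uint64_t(1) << (i & 63);
}

int main() {
    // 2x3 matrix over GF(2): row0 = {cols 0,2}, row1 = {col 1}; x = (1,1,1).
    std::vector<std::vector<int>> rows = {{0, 2}, {1}};
    std::vector<uint64_t> x = {0x7}, y(1);
    gf2_spmv(rows, x, y);
    std::printf("y = %llu\n", (unsigned long long)y[0]);  // row0: 1^1=0, row1: 1 -> y = 2
}
```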

16.
In this paper, the sparse matrix-vector product (SpMV) is evaluated on the FinisTerrae SMP-NUMA supercomputer. The particularities of its architecture make the tuning of SpMV especially relevant due to its significant impact on performance. First, we have estimated the influence of data and thread allocation. Moreover, because of the indirect and irregular memory access patterns of SpMV, we have also studied the influence of the memory hierarchy on performance. According to the behavior observed in the study, a set of optimizations specially tuned for FinisTerrae were successfully applied to SpMV. Noticeable improvements are obtained in comparison with the naïve SpMV implementation.
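One common way to control data placement on a NUMA machine like this is first-touch allocation: pages end up on the memory node of the thread that first writes them, so the value array is initialized under the same static OpenMP schedule that later runs the SpMV. The sketch below only illustrates this general idea; it is not the paper's tuning, and real tuning would also pin threads (e.g. OMP_PROC_BIND or numactl).

```cpp
#include <memory>
#include <vector>

// CSR SpMV with a static schedule: thread t always handles the same row range.
void spmv_csr(int n, const int* rowptr, const int* col, const double* val,
              const double* x, double* y) {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) s += val[k] * x[col[k]];
        y[i] = s;
    }
}

// First-touch placement: copy the values into freshly allocated memory
// row-by-row under the same static schedule, so the pages each thread will
// later read in spmv_csr are local to its NUMA node.
std::unique_ptr<double[]> place_values_first_touch(int n, const int* rowptr,
                                                   const std::vector<double>& src) {
    std::unique_ptr<double[]> val(new double[src.size()]);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) val[k] = src[k];
    return val;
}
```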

17.
GPU-accelerated preconditioned iterative linear solvers   (Cited by 1; self-citations: 1, others: 0)
This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 processor. The overall performance of the GPU-accelerated incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and the GPU-accelerated incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speedup nearing 4. However, with preconditioning techniques better suited to GPUs, this performance can be further improved.

18.
Unstructured mesh computations on the Chinese heterogeneous many-core platform Sunway TaihuLight are characterized by sparse storage, scattered memory accesses and data dependencies, which severely limit the performance of the many-core processor. To address sparse storage and scattered memory accesses, an N-order diagonal coloring algorithm is proposed, which effectively balances the work between the management processing element (MPE) and the computing processing elements (CPEs) and uses the CPEs to turn global memory accesses into LDM accesses. To handle the write conflicts caused by data dependencies, an adaptive, dependency-free task partitioning method is used to avoid data races during parallel computation. To optimize for the processor architecture and unstructured mesh computation, the MPE and CPEs run asynchronously in parallel and are used differently so as to fully exploit the hardware; at the same time, the register communication mechanism provided by the processor is not used, which reduces the synchronization overhead of the CPE array and eases porting to the next-generation Sunway platform. In addition, asynchronous overlapping of computation and memory access is used to hide memory latency. Experiments with the SpMV, Integration and calcLudsFcc operators show that, compared with the MPE-only implementation, the combined acceleration scheme achieves an average speedup of 10x across problem sizes, with a maximum of 24x, and the N-order diagonal coloring algorithm achieves more than 5.8x speedup over a non-coloring blocked algorithm, effectively improving data locality and computational parallelism. The algorithm also accelerates operators with dependency-induced write conflicts well, validating the effectiveness of the adaptive, dependency-free task partitioning method.
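The reason coloring removes data races on an unstructured mesh is that elements sharing a vertex (or face) must not update it concurrently; elements of the same color have no such conflicts and can be processed in parallel without atomics. The sketch below shows a plain greedy coloring plus a color-by-color sweep as a generic illustration, not the N-order diagonal coloring used on the Sunway processor.

```cpp
#include <algorithm>
#include <vector>

// Greedy coloring: conflicts[e] lists the elements that must not run
// concurrently with element e (e.g. they write to a shared vertex).
std::vector<int> greedy_coloring(const std::vector<std::vector<int>>& conflicts) {
    int n = (int)conflicts.size();
    std::vector<int> color(n, -1);
    for (int e = 0; e < n; ++e) {
        std::vector<bool> used(n, false);
        for (int nb : conflicts[e])                 // colors taken by conflicting elements
            if (color[nb] >= 0) used[color[nb]] = true;
        int c = 0;
        while (used[c]) ++c;
        color[e] = c;                               // smallest free color
    }
    return color;
}

// Process elements color by color; within a color there are no conflicts,
// so the inner loop can be a parallel-for without atomic updates.
template <class Fn>
void sweep_by_color(const std::vector<int>& color, Fn process_element) {
    int ncolors = *std::max_element(color.begin(), color.end()) + 1;
    for (int c = 0; c < ncolors; ++c) {
        #pragma omp parallel for
        for (int e = 0; e < (int)color.size(); ++e)
            if (color[e] == c) process_element(e);
    }
}
```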

19.
Utilizing hardware resources efficiently is vital to building the future generation of high-performance computing systems. The sparse matrix – dense vector multiplication (SpMV) kernel, which is notorious for its poor efficiency on conventional processors, is a key component in many scientific computing applications and increasing SpMV efficiency can contribute significantly to improving overall system efficiency. The major challenge in implementing SpMV efficiently is handling the input-dependent memory access patterns, and reconfigurable logic is a strong candidate for tackling this problem via memory system customization. In this work, we consider three schemes (all off-chip, all on-chip, caching) for servicing the irregular-access component of SpMV and investigate their effects on accelerator efficiency. To combine the strengths of on-chip and off-chip random accesses, we propose a hardware-software caching scheme named NCVCS that combines software preprocessing with a nonblocking cache to enable highly efficient SpMV accelerators with modest on-chip memory requirements. Our results from the comparison of the three schemes implemented as part of an FPGA SpMV accelerator show that our scheme effectively combines the high efficiency from on-chip accesses with the capability of working with large matrices from off-chip accesses.
