首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 757 毫秒
1.
Sparse Cholesky factorization is the most computationally intensive component in solving large sparse linear systems and is the core algorithm of numerous scientific computing applications. A large number of sparse Cholesky factorization algorithms have previously emerged, exploiting architectural features for various computing platforms. The recent use of graphics processing units (GPUs) to accelerate structured parallel applications shows the potential to achieve significant acceleration relative to desktop performance. However, sparse Cholesky factorization has not been explored sufficiently because of the complexity involved in its efficient implementation and the concerns of low GPU utilization. In this paper, we present a new approach for sparse Cholesky factorization on GPUs. We present the organization of the sparse matrix supernode data structure for GPU and propose a queue‐based approach for the generation and scheduling of GPU tasks with dense linear algebraic operations. We also design a subtree‐based parallel method for multi‐GPU system. These approaches increase GPU utilization, thus resulting in substantial computational time reduction. Comparisons are made with the existing parallel solvers by using problems arising from practical applications. The experiment results show that the proposed approaches can substantially improve sparse Cholesky factorization performance on GPUs. Relative to a highly optimized parallel algorithm on a 12‐core node, we were able to obtain speedups in the range 1.59× to 2.31× by using one GPU and 1.80× to 3.21× by using two GPUs. Relative to a state‐of‐the‐art solver based on supernodal method for CPU‐GPU heterogeneous platform, we were able to obtain speedups in the range 1.52× to 2.30× by using one GPU and 2.15× to 2.76× by using two GPUs. Concurrency and Computation: Practice and Experience, 2013. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

2.
A CPU-GPU hybrid approach for the unsymmetric multifrontal method   总被引:1,自引:0,他引:1  
Multifrontal is an efficient direct method for solving large-scale sparse and unsymmetric linear systems. The method transforms a large sparse matrix factorization process into a sequence of factorizations involving smaller dense frontal matrices. Some of these dense operations can be accelerated by using a graphic processing unit (GPU). We analyze the unsymmetric multifrontal method from both an algorithmic and implementational perspective to see how a GPU, in particular the NVIDIA Tesla C2070, can be used to accelerate the computations. Our main accelerating strategies include (i) performing BLAS on both CPU and GPU, (ii) improving the communication efficiency between the CPU and GPU by using page-locked memory, zero-copy memory, and asynchronous memory copy, and (iii) a modified algorithm that reuses the memory between different GPU tasks and sets thresholds to determine whether certain tasks be performed on the GPU. The proposed acceleration strategies are implemented by modifying UMFPACK, which is an unsymmetric multifrontal linear system solver. Numerical results show that the CPU-GPU hybrid approach can accelerate the unsymmetric multifrontal solver, especially for computationally expensive problems.  相似文献   

3.
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. Since a full re-implementation of these complex kernels is typically not feasible, we identify the matrix–matrix multiplication as a first natural entry-point for a minimally invasive integration of GPUs. We investigate the performance on the NVIDIA GeForce 8800 multicore chip initially architectured for intensive gaming applications. We exploit the architectural features of the GeForce 8800 GPU to design an efficient GPU-parallel sparse matrix solver. A prototype approach to leverage the bandwidth and computing power of GPUs for these matrix kernel operation is demonstrated resulting in an overall performance of over 110 GFlops/s on the desktop for large matrices and over 38 GFlops/s for sparse matrices arising in real applications. We use our GPU algorithm for PDE-constrained optimization problems and demonstrate that the commodity GPU is a useful co-processor for scientific applications.  相似文献   

4.
In this paper, we describe scalable parallel algorithms for symmetric sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1,024 processors on a Gray T3D parallel computer. Through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems-both in terms of scalability and overall performance. It is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. In this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. Our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. Although, in this paper, we discuss Cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for Gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. An implementation of one of our sparse Cholesky factorization algorithms delivers up to 20 GFlops on a Gray T3D for medium-size structural engineering and linear programming problems. To the best of our knowledge, this is the highest performance ever obtained for sparse Cholesky factorization on any supercomputer  相似文献   

5.
We present a novel strategy for sparse direct factorizations that is geared towards the matrices that arise from hp-adaptive Finite Element Methods. In that context, a sequence of linear systems derived by successive local refinement of the problem domain needs to be solved. Thus, there is an opportunity for a factorization strategy that proceeds by updating (and possibly downdating) the factorization. Our scheme consists of storing the matrix as unassembled element matrices, hierarchically ordered to mirror the refinement history of the domain. The factorization of such an ‘unassembled hyper-matrix’ proceeds in terms of element matrices, only assembling nodes when they need to be eliminated. The main benefits are efficiency from the fact that only updates to the factorization are made, high scalar efficiency since the factorization process uses dense matrices throughout, and a workflow that integrates naturally with the application.  相似文献   

6.
GPU-accelerated preconditioned iterative linear solvers   总被引:1,自引:1,他引:0  
This work is an overview of our preliminary experience in developing a high-performance iterative linear solver accelerated by GPU coprocessors. Our goal is to illustrate the advantages and difficulties encountered when deploying GPU technology to perform sparse linear algebra computations. Techniques for speeding up sparse matrix-vector product (SpMV) kernels and finding suitable preconditioning methods are discussed. Our experiments with an NVIDIA TESLA M2070 show that for unstructured matrices SpMV kernels can be up to 8 times faster on the GPU than the Intel MKL on the host Intel Xeon X5675 Processor. Overall performance of the GPU-accelerated Incomplete Cholesky (IC) factorization preconditioned CG method can outperform its CPU counterpart by a smaller factor, up to 3, and GPU-accelerated The incomplete LU (ILU) factorization preconditioned GMRES method can achieve a speed-up nearing 4. However, with better suited preconditioning techniques for GPUs, this performance can be further improved.  相似文献   

7.
Two parallel block tridiagonalization algorithms and implementations for dense real symmetric matrices are presented. Block tridiagonalization is a critical pre-processing step for the block tridiagonal divide-and-conquer algorithm for computing eigensystems and is useful for many algorithms desiring the efficiencies of block structure in matrices. For an “effectively” sparse matrix, which frequently results from applications with strong locality properties, a heuristic parallel algorithm is used to transform it into a block tridiagonal matrix such that the eigenvalue errors remain bounded by some prescribed accuracy tolerance. For a dense matrix without any usable structure, orthogonal transformations are used to reduce it to block tridiagonal form using mostly level 3 BLAS operations. Numerical experiments show that block tridiagonal structure obtained from this algorithm directly affects the computational complexity of the parallel block tridiagonal divide-and-conquer eigensolver. Reduction to block tridiagonal form provides significantly lower execution times, as well as memory traffic and communication cost, over the traditional reduction to tridiagonal form for eigensystem computations.  相似文献   

8.
细粒度任务并行GPU通用矩阵乘   总被引:1,自引:0,他引:1       下载免费PDF全文
稠密线性代数运算对模式识别和生物信息等许多实际应用至关重要,而通用矩阵乘(GEMM)处于稠密线性代数运算的基础地位。在cuBLAS与MAGMA中,GEMM被实现为若干kernel函数,对大型GEMM计算能够达到很高的性能。然而,现有实现对批量的小型GEMM计算性能发挥则较为有限。而且,现有实现也不能在多个具有不同性能的GPU之间自动扩展并达到负载均衡。提出任务并行式GEMM(TPGEMM),用细粒度任务并行的方式实现批量矩阵乘和多GPU矩阵乘。一个或多个GEMM的计算能够被拆分为多个任务,动态地调度到一个或多个GPU上。TPGEMM避免了为批量矩阵乘启动多个kernel函数的开销,对批量矩阵乘能够取得显著高于cuBLAS与MAGMA的性能。在低开销细粒度任务调度的基础上,TPGEMM支持单个GEMM计算在多个GPU间的自动并行,在一台具有四个不同性能GPU的工作站上取得了接近100%的扩展效率。  相似文献   

9.
We develop optimized multi-dimensional FFT implementations on CPU–GPU heterogeneous platforms for the case when the input is too large to fit on the GPU global memory, and use the resulting techniques to develop a fast Poisson solver. The solver involves memory bound computations for which the large 3D data may have to be transferred over the PCIe bus several times during the computation. We develop a new strategy to decompose and allocate the computation between the GPU and the CPU such that the 3D data is transferred only once to the device memory, and the executions of the GPU kernels are almost completely overlapped with the PCI data transfer. We were able to achieve significantly better performance than what has been reported in previous related work, including over 145 GFLOPS for the three periodic boundary conditions (single precision version), and over 105 GFLOPS for the two periodic, one Neumann boundary conditions (single precision version). The effective bidirectional PCIe bus bandwidth achieved is 9–10 GB/s, which is close to the best possible on our platform. For all the cases tested, the single 3D data PCIe transfer time, which constitutes a lower bound on what is possible on our platform, takes almost 70% of the total execution time of the Poisson solver.  相似文献   

10.
The effect which the representation of the data (matrices and vectors) has on the communication patterns of preconditionings for exploitation of massively parallel architectures is discussed. Preconditioned iterative methods are used to solve the sparse linear systems generated by discretizations of partial differential equations in many areas of science and engineering. The preconditionings considered are based on nested incomplete factorization with approximate tridiagonal inverses using a two color line ordering of the discretization grid. These preconditionings can be described in terms of vector-vector to vector operations of dimension equal to half the total number of grid points.  相似文献   

11.
《Parallel Computing》2014,40(5-6):70-85
QR factorization is a computational kernel of scientific computing. How can the latest computer be used to accelerate this task? We investigate this topic by proposing a dense QR factorization algorithm with adaptive block sizes on a hybrid system that contains a central processing unit (CPU) and a graphic processing unit (GPU). To maximize the use of CPU and GPU, we develop an adaptive scheme that chooses block size at each iteration. The decision is based on statistical surrogate models of performance and an online monitor, which avoids unexpected occasional performance drops. We modify the highly optimized CPU–GPU based QR factorization in MAGMA to implement the proposed schemes. Numerical results suggest that our approaches are efficient and can lead to near-optimal block sizes. The proposed algorithm can be extended to other one-sided factorizations, such as LU and Cholesky factorizations.  相似文献   

12.
针对基于GPU求解大规模稀疏线性方程组进行了研究,提出一种稀疏矩阵的分块存储格式HMEC(hybrid multiple ELL and CSR)。通过重排序优化系数矩阵的存储结构,将系数矩阵以一定的比例分块存储,采用ELL与CSR存储格式相结合的方式以适应不同的分块特征,分别使用适用于不对称矩阵的不完全LU分解预处理BICGStab法和对称正定矩阵的不完全Cholesky分解预处理共轭梯度法求解大规模稀疏线性系统。实验表明,应用HMEC格式存储稀疏矩阵并以调用GPU kernel的方式实现前述两种方法,与其他存储格式的实现方式作比较,最优可分别获得31.89%和17.50%的加速效果。  相似文献   

13.
Expressing scientific computations in terms of BLAS, and in particular the general dense matrix-matrix multiplication (GEMM), is of fundamental importance for obtaining high performance portability across architectures. However, GEMMs for small matrices of sizes smaller than 32 are not sufficiently optimized in existing libraries. We consider the computation of many small GEMMs and its performance portability for a wide range of computer architectures, including Intel CPUs, ARM, IBM, Intel Xeon Phi, and GPUs. These computations often occur in applications like big data analytics, machine learning, high-order finite element methods (FEM), and others. The GEMMs are grouped together in a single batched routine. For these cases, we present algorithms and their optimization techniques that are specialized for the matrix sizes and architectures of interest. We derive a performance model and show that the new developments can be tuned to obtain performance that is within 90% of the optimal for any of the architectures of interest. For example, on a V100 GPU for square matrices of size 32, we achieve an execution rate of about 1600 gigaFLOP/s in double-precision arithmetic, which is 95% of the theoretically derived peak for this computation on a V100 GPU. We also show that these results outperform currently available state-of-the-art implementations such as vendor-tuned math libraries, including Intel MKL and NVIDIA CUBLAS, as well as open-source libraries like OpenBLAS and Eigen.  相似文献   

14.
广义稠密对称特征问题的求解是许多应用科学和工程的主要任务,并且是计算电磁学、电子结构、有限元模型和量子化学等计算中的重要部分。将广义对称特征问题转化为标准对称特征问题是求解广义稠密对称特征问题的关键计算步骤。针对GPU集群,文中给出了广义稠密对称特征问题标准化块算法在GPU集群上基于MPI+CUDA的实现。为了适应GPU集群的架构,广义对称特征问题标准化算法将正定矩阵的Cholesky分解与传统的广义特征问题标准化块算法相结合,降低了标准化算法中不必要的通信开销,并且增强了算法的并行性。在基于MPI+CUDA的标准化算法中,GPU与CPU之间的数据传输操作被用来掩盖GPU内的数据拷贝操作,这消除了拷贝所花费的时间,进而提高了程序的性能。同时,文中还给出了矩阵在二维通信网格中行通信域和列通信域之间完全并行的点对点的转置算法和基于MPI+CUDA的具有多个右端项的三角矩阵方程BX=A求解的并行块算法。在中科院计算机网络信息中心的超级计算机系统“元”上,每个计算节点配置2块Nvidia Tesla K20 GPGPU卡及2颗Intel E5-2680 V2处理器,使用多达32个GPU对不同规模矩阵的基于MPI+CUDA的广义对称特征问题标准化算法进行测试,取得了较好的加速效果与性能,并且具有良好的可扩展性。当使用32个GPU对50000×50000阶的矩阵进行测试时,峰值性能达到了约9.21 Tflops。  相似文献   

15.
Despite the increasing investment in integrated GPU and next-generation interconnect research,discrete GPU connected by PCIe still account for the dominant position of the market,the management of data communication between CPU and GPU continues to evolve.Initially,the programmer explicitly controls the data transfer between CPU and GPU.To simplify programming and enable systemwide atomic memory operations,GPU vendors have developed a programming model that provides a single,virtual address space for accessing all CPU and GPU memories in the system.The page migration engine in this model automatically migrates pages between CPU and GPU on demand.To meet the needs of high-performance workloads,the page size tends to be larger.Limited by low bandwidth and high latency interconnects compared to GDDR,larger page migration has longer delay,which may reduce the overlap of computation and transmission,waste time to migrate unrequested data,block subsequent requests,and cause serious performance decline.In this paper,we propose partial page migration that only migrates the requested part of a page to reduce the migration unit,shorten the migration latency,and avoid the performance degradation of the full page migration when the page becomes larger.We show that partial page migration is possible to largely hide the performance overheads of full page migration.Compared with programmer controlled data transmission,when the page size is 2MB and the PCIe bandwidth is 16GB/sec,full page migration is 72.72×slower,while our partial page migration achieves 1.29×speedup.When the PCIe bandwidth is changed to 96GB/sec,full page migration is 18.85×slower,while our partial page migration provides 1.37×speedup.Additionally,we examine the performance impact that PCIe bandwidth and migration unit size have on execution time,enabling designers to make informed decisions.  相似文献   

16.
异构计算作为一种特殊的并行计算方式,能根据计算任务的特点发挥不同计算资源的能力,在提高服务器计算性能、能效比和实时性方面有极大优势,但目前异构计算环境存在编程复杂、可信性无法保证的问题。针对以上问题,提出了一个基于状态变迁矩阵(STM)的编程框架,可以集成GPU和FPGA的资源。通过状态迁移矩阵对CUDA和Vivado的应用程序接口(API)进行集成,自动生成异构计算所需要的标准C代码。通过PCIe总线连接GPU和FPGA设备,从而可以在这些异构计算单元之间进行数据传输,中间无需使用系统CPU内存。并且 通过GPUDirect RDMA实现了FPGA作为主控器的PCIe通信,突破了GPU作为主控器的PCIe通信当中读取操作的短板。 实验表明,相比共享内存的通信方式, FPGA作为主控器的PCIe通信方式的通信效率提高了1.4倍, 实现的数据速率接近理论带宽的最大值。  相似文献   

17.
For software to fully exploit the computing power of emerging heterogeneous computers, not only must the required computational kernels be optimized for the specific hardware architectures but also an effective scheduling scheme is needed to utilize the available heterogeneous computational units and to hide the communication between them. As a case study, we develop a static scheduling scheme for the tridiagonalization of a symmetric dense matrix on multicore CPUs with multiple graphics processing units (GPUs) on a single compute node. We then parallelize and optimize the Basic Linear Algebra Subroutines (BLAS)‐2 symmetric matrix‐vector multiplication, and the BLAS‐3 low rank symmetric matrix updates on the GPUs. We demonstrate the good scalability of these multi‐GPU BLAS kernels and the effectiveness of our scheduling scheme on twelve Intel Xeon processors and three NVIDIA GPUs. We then integrate our hybrid CPU‐GPU kernel into computational kernels at higher‐levels of software stacks, that is, a shared‐memory dense eigensolver and a distributed‐memory sparse eigensolver. Our experimental results show that our kernels greatly improve the performance of these higher‐level kernels, not only reducing the solution time but also enabling the solution of larger‐scale problems. Because such symmetric eigenvalue problems arise in many scientific and engineering simulations, our kernels could potentially lead to new scientific discoveries. Furthermore, these dense linear algebra algorithms present algorithmic characteristics that can be found in other algorithms. Hence, they are not only important computational kernels on their own but also useful testbeds to study the performance of the emerging computers and the effects of the various optimization techniques. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

18.
In this paper, we use a two-stage sparse factorization approach for blindly estimating the channel parameters and then estimating source components for electroencephalogram (EEG) signals. EEG signals are assumed to be linear mixtures of source components, artifacts, etc. Therefore, a raw EEG data matrix can be factored into the product of two matrices, one of which represents the mixing matrix and the other the source component matrix. Furthermore, the components are sparse in the time-frequency domain, i.e., the factorization is a sparse factorization in the time frequency domain. It is a challenging task to estimate the mixing matrix. Our extensive analysis and computational results, which were based on many sets of EEG data, not only provide firm evidences supporting the above assumption, but also prompt us to propose a new algorithm for estimating the mixing matrix. After the mixing matrix is estimated, the source components are estimated in the time frequency domain using a linear programming method. In an example of the potential applications of our approach, we analyzed the EEG data that was obtained from a modified Sternberg memory experiment. Two almost uncorrelated components obtained by applying the sparse factorization method were selected for phase synchronization analysis. Several interesting findings were obtained, especially that memory-related synchronization and desynchronization appear in the alpha band, and that the strength of alpha band synchronization is related to memory performance.  相似文献   

19.
Linear least squares problems are commonly solved by QR factorization. When multiple solutions need to be computed with only minor changes in the underlying data, knowledge of the difference between the old data set and the new can be used to update an existing factorization at reduced computational cost. We investigate the viability of implementing QR updating algorithms on GPUs and demonstrate that GPU-based updating for removing columns achieves speed-ups of up to 13.5× compared with full GPU QR factorization. We characterize the conditions under which other types of updates also achieve speed-ups.  相似文献   

20.
蔡勇  李胜 《计算机应用》2016,36(3):628-632
针对传统并行计算方法实现结构拓扑优化快速计算的硬件成本高、程序开发效率低的问题,提出了一种基于Matlab和图形处理器(GPU)的双向渐进结构优化(BESO)方法的全流程并行计算策略。首先,探讨了Matlab编程环境中实现GPU并行计算的三种途径的优缺点和适用范围;其次,分别采用内置函数直接并行的方式实现了拓扑优化算法中向量和稠密矩阵的并行化计算,采用MEX函数调用CUSOLVER库的形式实现了稀疏格式有限元方程组的快速求解,采用并行线程执行(PTX)代码的方式实现了拓扑优化中单元敏度分析等优化决策的并行化计算。数值算例表明,基于Matlab直接开发GPU并行计算程序不仅编程效率高,而且还可以避免不同编程语言间的计算精度差异,最终使GPU并行程序可以在保持计算结果不变的前提下取得可观的加速比。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号