Accelerating sparse Cholesky factorization on GPUs
Affiliations: 1. Sr. HPC Developer Technology Engineer, NVIDIA, Santa Clara, CA, United States; 2. HPC Developer Technology Intern, NVIDIA, Santa Clara, CA, United States; 3. Professor, CSE, Texas A&M University, College Station, TX, United States
Abstract: Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations that involve sufficient dense math, the substantial computational capability of GPUs (Graphics Processing Units) can help alleviate this cost. In many other cases, however, the prevalence of small, irregular dense math and the relatively slow communication between the host and device over the PCIe bus make it challenging to significantly accelerate sparse factorization using the GPU.

In this paper we describe a left-looking supernodal Cholesky factorization algorithm that permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead, and many of these batched kernels are executed concurrently to maximize device utilization.

Performance results for commonly studied matrices are presented, along with suggested actions for further optimization.
Keywords:
This article is indexed in ScienceDirect and other databases.
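The batching-and-concurrency idea described in the abstract can be illustrated with a short CUDA sketch. The code below is not the paper's implementation: it simply factors several batches of small dense SPD blocks with cuSOLVER's batched Cholesky routine (cusolverDnDpotrfBatched), issuing each batch on its own CUDA stream so the batched kernels can overlap on the device. The block order, batch sizes, and matrix contents are illustrative assumptions.

```cpp
// Hedged sketch: batch many small dense Cholesky factorizations and issue the
// batches on separate CUDA streams so they may execute concurrently.
// Sizes, names, and data layout are illustrative assumptions, not the paper's code.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>
#include <cstdio>

#define CHECK_CUDA(x) do { if ((x) != cudaSuccess)             { printf("CUDA error %d\n", (int)(x)); return 1; } } while (0)
#define CHECK_SOLV(x) do { if ((x) != CUSOLVER_STATUS_SUCCESS) { printf("cuSOLVER error\n");           return 1; } } while (0)

int main() {
    const int n      = 16;   // order of each small dense block (assumed)
    const int batch  = 256;  // dense blocks per batched call (assumed)
    const int nBatch = 4;    // batches issued on separate streams (assumed)

    std::vector<cudaStream_t>       streams(nBatch);
    std::vector<cusolverDnHandle_t> handles(nBatch);
    for (int s = 0; s < nBatch; ++s) {
        CHECK_CUDA(cudaStreamCreate(&streams[s]));
        CHECK_SOLV(cusolverDnCreate(&handles[s]));
        CHECK_SOLV(cusolverDnSetStream(handles[s], streams[s]));  // bind each handle to its stream
    }

    std::vector<double*>  dA(nBatch);      // contiguous storage for 'batch' n x n blocks
    std::vector<double**> dAptrs(nBatch);  // device array of per-block pointers
    std::vector<int*>     dInfo(nBatch);   // per-block factorization status

    for (int s = 0; s < nBatch; ++s) {
        CHECK_CUDA(cudaMalloc(&dA[s],     sizeof(double)  * n * n * batch));
        CHECK_CUDA(cudaMalloc(&dAptrs[s], sizeof(double*) * batch));
        CHECK_CUDA(cudaMalloc(&dInfo[s],  sizeof(int)     * batch));

        // Fill each block with a trivially SPD matrix (4 * identity); a real solver
        // would assemble supernodal frontal data here instead.
        std::vector<double> hA(n * n * batch, 0.0);
        for (int b = 0; b < batch; ++b)
            for (int i = 0; i < n; ++i)
                hA[b * n * n + i * n + i] = 4.0;
        CHECK_CUDA(cudaMemcpyAsync(dA[s], hA.data(), sizeof(double) * n * n * batch,
                                   cudaMemcpyHostToDevice, streams[s]));

        std::vector<double*> hPtrs(batch);
        for (int b = 0; b < batch; ++b) hPtrs[b] = dA[s] + b * n * n;
        CHECK_CUDA(cudaMemcpyAsync(dAptrs[s], hPtrs.data(), sizeof(double*) * batch,
                                   cudaMemcpyHostToDevice, streams[s]));
        CHECK_CUDA(cudaStreamSynchronize(streams[s]));  // host staging buffers go out of scope
    }

    // One batched Cholesky per stream; the small factorizations within a batch share a
    // single kernel launch, and the batches on different streams can overlap.
    for (int s = 0; s < nBatch; ++s)
        CHECK_SOLV(cusolverDnDpotrfBatched(handles[s], CUBLAS_FILL_MODE_LOWER,
                                           n, dAptrs[s], n, dInfo[s], batch));

    for (int s = 0; s < nBatch; ++s) CHECK_CUDA(cudaStreamSynchronize(streams[s]));
    printf("factored %d blocks of order %d\n", nBatch * batch, n);

    for (int s = 0; s < nBatch; ++s) {
        cudaFree(dA[s]); cudaFree(dAptrs[s]); cudaFree(dInfo[s]);
        cusolverDnDestroy(handles[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

In the paper's setting the batched dense work would arise from supernodes within a subtree of the elimination tree that has already been streamed to the device; the sketch only shows the launch pattern (batched calls on multiple streams) that keeps launch overhead low and device utilization high.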