Accelerating sparse Cholesky factorization on GPUs
Affiliations: 1. Sr. HPC Developer Technology Engineer, NVIDIA, Santa Clara, CA, United States; 2. HPC Developer Technology Intern, NVIDIA, Santa Clara, CA, United States; 3. Professor, CSE, Texas A&M University, College Station, TX, United States
Abstract: Sparse factorization is a fundamental tool in scientific computing. As the major component of a sparse direct solver, it represents the dominant computational cost for many analyses. For factorizations that involve sufficient dense math, the substantial computational capability of GPUs (Graphics Processing Units) can help alleviate this cost. In many other cases, however, the prevalence of small, irregular dense math and the relatively slow communication between the host and device over the PCIe bus make it challenging to significantly accelerate sparse factorization using the GPU.

In this paper we describe a left-looking supernodal Cholesky factorization algorithm that permits improved utilization of the GPU when factoring sparse matrices. The central idea is to stream subtrees of the elimination tree through the GPU and perform the factorization of each subtree entirely on the GPU. This avoids the majority of the PCIe communication without the need for a complex task scheduler. Importantly, within these subtrees, many independent, small, dense operations are batched to minimize kernel launch overhead, and many of these batched kernels are executed concurrently to maximize device utilization.

Performance results for commonly studied matrices are presented, along with suggested actions for further optimization.
Keywords:
This article is indexed in ScienceDirect and other databases.
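The batching-and-concurrency idea described in the abstract can be illustrated with a short CUDA sketch. The code below is not the paper's implementation: it simply factors several batches of small dense SPD blocks with cuSOLVER's batched Cholesky routine (cusolverDnDpotrfBatched), issuing each batch on its own CUDA stream so the batched kernels can overlap on the device. The block order, batch sizes, and matrix contents are illustrative assumptions.

```cpp
// Hedged sketch: batch many small dense Cholesky factorizations and issue the
// batches on separate CUDA streams so they may execute concurrently.
// Sizes, names, and data layout are illustrative assumptions, not the paper's code.
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include <vector>
#include <cstdio>

#define CHECK_CUDA(x) do { if ((x) != cudaSuccess)             { printf("CUDA error %d\n", (int)(x)); return 1; } } while (0)
#define CHECK_SOLV(x) do { if ((x) != CUSOLVER_STATUS_SUCCESS) { printf("cuSOLVER error\n");           return 1; } } while (0)

int main() {
    const int n      = 16;   // order of each small dense block (assumed)
    const int batch  = 256;  // dense blocks per batched call (assumed)
    const int nBatch = 4;    // batches issued on separate streams (assumed)

    std::vector<cudaStream_t>       streams(nBatch);
    std::vector<cusolverDnHandle_t> handles(nBatch);
    for (int s = 0; s < nBatch; ++s) {
        CHECK_CUDA(cudaStreamCreate(&streams[s]));
        CHECK_SOLV(cusolverDnCreate(&handles[s]));
        CHECK_SOLV(cusolverDnSetStream(handles[s], streams[s]));  // bind each handle to its stream
    }

    std::vector<double*>  dA(nBatch);      // contiguous storage for 'batch' n x n blocks
    std::vector<double**> dAptrs(nBatch);  // device array of per-block pointers
    std::vector<int*>     dInfo(nBatch);   // per-block factorization status

    for (int s = 0; s < nBatch; ++s) {
        CHECK_CUDA(cudaMalloc(&dA[s],     sizeof(double)  * n * n * batch));
        CHECK_CUDA(cudaMalloc(&dAptrs[s], sizeof(double*) * batch));
        CHECK_CUDA(cudaMalloc(&dInfo[s],  sizeof(int)     * batch));

        // Fill each block with a trivially SPD matrix (4 * identity); a real solver
        // would assemble supernodal frontal data here instead.
        std::vector<double> hA(n * n * batch, 0.0);
        for (int b = 0; b < batch; ++b)
            for (int i = 0; i < n; ++i)
                hA[b * n * n + i * n + i] = 4.0;
        CHECK_CUDA(cudaMemcpyAsync(dA[s], hA.data(), sizeof(double) * n * n * batch,
                                   cudaMemcpyHostToDevice, streams[s]));

        std::vector<double*> hPtrs(batch);
        for (int b = 0; b < batch; ++b) hPtrs[b] = dA[s] + b * n * n;
        CHECK_CUDA(cudaMemcpyAsync(dAptrs[s], hPtrs.data(), sizeof(double*) * batch,
                                   cudaMemcpyHostToDevice, streams[s]));
        CHECK_CUDA(cudaStreamSynchronize(streams[s]));  // host staging buffers go out of scope
    }

    // One batched Cholesky per stream; the small factorizations within a batch share a
    // single kernel launch, and the batches on different streams can overlap.
    for (int s = 0; s < nBatch; ++s)
        CHECK_SOLV(cusolverDnDpotrfBatched(handles[s], CUBLAS_FILL_MODE_LOWER,
                                           n, dAptrs[s], n, dInfo[s], batch));

    for (int s = 0; s < nBatch; ++s) CHECK_CUDA(cudaStreamSynchronize(streams[s]));
    printf("factored %d blocks of order %d\n", nBatch * batch, n);

    for (int s = 0; s < nBatch; ++s) {
        cudaFree(dA[s]); cudaFree(dAptrs[s]); cudaFree(dInfo[s]);
        cusolverDnDestroy(handles[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```

In the paper's setting the batched dense work would arise from supernodes within a subtree of the elimination tree that has already been streamed to the device; the sketch only shows the launch pattern (batched calls on multiple streams) that keeps launch overhead low and device utilization high.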