Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination
Affiliation:1. Univ. Grenoble Alpes, Laboratoire Jean Kuntzmann, CNRS, Inria;2. Laboratoire d'Informatique de Grenoble, Univ. Grenoble Alpes, CNRS, Inria;3. Univ. Grenoble Alpes, Laboratoire de l'Informatique du Parallélisme, Université de Lyon, Inria
Abstract:We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared-memory architectures. In contrast to the classical cubic algorithms of parallel numerical linear algebra, we focus here on recursive algorithms and coarse-grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, which makes coarse-grain block algorithms perform more efficiently than fine-grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse-grain block decomposition. We incrementally build efficient kernels, first for matrix multiplication, then for triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants, either iterative or recursive, with different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive–iterative methods for triangular system solving, and tile-recursive versions of the PLUQ decomposition, together with various data-mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms, and that exact dense linear algebra matches the performance of full-rank reference numerical software even in the presence of rank deficiencies.
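The argument that "costly modular reductions advocate for coarse-grain block decomposition" can be illustrated with a small sketch (not the paper's implementation, which is in C++ on top of BLAS): over Z/pZ one may accumulate many integer products before reducing, provided the intermediate values fit in the machine word, so a single reduction per output block replaces one reduction per elementary product.

```python
# Hedged sketch of delayed modular reduction for matmul over Z/pZ.
# Function names (matmul_mod, matmul_mod_delayed) are illustrative, not from the paper.
import numpy as np

def matmul_mod(A, B, p):
    """Fine-grain variant: reduce modulo p after every elementary product."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=np.int64)
    for i in range(n):
        for j in range(m):
            for l in range(k):
                C[i, j] = (C[i, j] + int(A[i, l]) * int(B[l, j])) % p
    return C

def matmul_mod_delayed(A, B, p):
    """Coarse-grain variant: one exact integer matmul, then a single
    reduction of the whole block. Correct as long as k * (p-1)^2
    does not overflow int64."""
    return (A.astype(np.int64) @ B.astype(np.int64)) % p
```

The second variant does the same arithmetic work on the products but amortizes the reduction cost over the whole block, which is the effect that favors coarse-grain recursive blocking once fast (sub-cubic) multiplication is used on the integer step.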
Indexed by ScienceDirect and other databases.