Cholesky factorization on SIMD multi-core architectures
Affiliation:1. CERN on behalf of the LHCb Collaboration, Geneva, Switzerland;2. Sorbonne Universites, UPMC Univ Paris 06, CNRS UMR 7606, LIP6, Paris, France
Abstract:Many linear algebra libraries, such as Intel MKL, Magma, or Eigen, provide fast Cholesky factorization routines. These libraries are tuned for large matrices but perform poorly on small ones. Although state-of-the-art studies have begun to address small matrices, they usually consider matrices with a few hundred rows, whereas fields like computer vision and high-energy physics work with tiny matrices. In this paper we show that the Cholesky factorization of tiny matrices can be sped up by grouping them in batches and using highly specialized code. We provide high-level transformations that accelerate the factorization on current multi-core and many-core SIMD architectures (SSE, AVX2, KNC, AVX512, Neon, Altivec). We observe that on some architectures compilers are unable to vectorize this code, and on others the vectorized code they generate is inefficient, so hand-written SIMDization is mandatory. Combining these transformations with SIMD, we achieve a speedup of ×14 to ×28 for the whole resolution in single precision over the naive code on an AVX2 machine, and of ×6 to ×14 in double precision, both with strong scalability.
Keywords:
This article is indexed in ScienceDirect and other databases.