Algorithm-based fault tolerance applied to high performance computing期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Algorithm-based fault tolerance applied to high performance computing

Authors:	George Bosilca Rémi Delmas Jack Dongarra Julien Langou

Affiliation:	1. Department of Electrical Engineering and Computer Science, University of Tennessee, United States;2. Department of Mathematical and Statistical Sciences, University of Colorado Denver, United States

Abstract:	We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithm-Based Fault Tolerance technique K. Huang, J. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Transactions on Computers (Spec. Issue Reliable & Fault-Tolerant Comp.) 33 (1984) 518–528] to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault-tolerant matrix–matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix–matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster `jacquard.nersc.gov`) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.

Keywords:	Fault tolerance Linear algebra High performance computing
本文献已被 ScienceDirect 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏