Software approaches for resilience of high performance computing systems: a survey |
| |
Authors: | Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN |
| |
Affiliation: | 1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China; 2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China |
| |
Abstract: | With the scaling up of high-performance computing (HPC) systems in recent years, their reliability has declined steadily. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. First, a classification of software resilience approaches is presented; then the major approaches and techniques are introduced, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. In addition, the challenges posed by system scale and heterogeneous architectures are discussed. |
| |
Keywords: | resilience; high-performance computing; fault tolerance; challenge |
|
Source: | Frontiers of Computer Science |
|