Software approaches for resilience of high performance computing systems: a survey
Authors:Jie JIA, Yi LIU, Guozhen ZHANG, Yulin GAO, Depei QIAN
Affiliation:1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China; 2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China
Abstract:As high-performance computing (HPC) systems have scaled up in recent years, their reliability has declined steadily. System resilience is therefore regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure that parallel programs execute correctly and run to completion. This paper provides a comprehensive survey of existing software resilience approaches. We first present a classification of software resilience approaches, then introduce the major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, and fault detection and prediction. Finally, we discuss the challenges posed by system scale and heterogeneous architectures.
Keywords:resilience, high-performance computing, fault tolerance, challenge
Source:Frontiers of Computer Science