User-level failure detection and auto-recovery of parallel programs in HPC systems
Authors: Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN
Affiliation: 1. State Key Laboratory of Software Development Environment, Beijing 100191, China; 2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China; 3. School of Computer Science and Engineering, Beihang University, Beijing 100191, China; 4. Science and Technology on Space System Simulation Laboratory, Beijing Simulation Center, Beijing 100854, China
Abstract: As the mean time between failures (MTBF) continues to decline with the growing number of components in large-scale high performance computing (HPC) systems, program failures are increasingly likely to occur during execution. Ensuring successful execution of HPC programs has therefore become an issue that unprivileged users must be concerned about. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to check program state, both because the job management system controls execution and because their privileges are limited. Automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are currently missing. This paper proposes an innovative method that allows unprivileged users to detect failures of job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on the Tianhe-2 system and that the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.
Keywords: high performance computing, parallel program, failure detection, failure auto-recovery
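
The detect-and-resubmit idea described in the abstract can be illustrated with a minimal sketch. This is not the ARL implementation: the function `watch_job`, the callables `submit` and `query_state`, and the state labels are all hypothetical names introduced here, and the dual-checker mechanism is not modeled. The sketch assumes the user can submit a job and query its state through some scheduler interface, which is passed in as plain Python callables.

```python
from typing import Callable, List

# Assumed state labels; a real job management system may use different ones.
FAILED_STATES = {"FAILED", "NODE_FAIL", "TIMEOUT"}
DONE_STATE = "COMPLETED"


def watch_job(submit: Callable[[], str],
              query_state: Callable[[str], str],
              max_resubmits: int = 3,
              max_polls: int = 1000) -> List[str]:
    """Submit a job, poll its state, and resubmit it when it fails.

    `submit` returns a new job id; `query_state` maps a job id to a state
    string. Both are hypothetical stand-ins for scheduler commands.
    Returns the ids of all jobs submitted, in order.
    """
    job_id = submit()
    history = [job_id]
    resubmits = 0
    for _ in range(max_polls):
        state = query_state(job_id)
        if state == DONE_STATE:
            break  # job finished successfully, nothing left to do
        if state in FAILED_STATES:
            if resubmits >= max_resubmits:
                break  # give up after the resubmission budget is spent
            job_id = submit()  # automatic resubmission of the failed job
            history.append(job_id)
            resubmits += 1
        # A real checker would sleep between polls; omitted in this sketch.
    return history
```

Passing the scheduler interface in as callables keeps the sketch independent of any particular job management system and makes the polling logic easy to exercise with a fake scheduler.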
Source: Frontiers of Computer Science