User-level failure detection and auto-recovery of parallel programs in HPC systems

Authors:	Guozhen ZHANG, Yi LIU, Hailong YANG, Jun XU, Depei QIAN

Affiliation:	1. State Key Laboratory of Software Development Environment, Beijing 100191, China; 2. Sino-German Joint Software Institute, Beihang University, Beijing 100191, China; 3. School of Computer Science and Engineering, Beihang University, Beijing 100191, China; 4. Science and Technology on Space System Simulation Laboratory, Beijing Simulation Center, Beijing 100854, China

Abstract:	As the mean time between failures (MTBF) continues to decline with the growing number of components in large-scale high performance computing (HPC) systems, program failures are increasingly likely to occur during execution. Ensuring the successful execution of HPC programs has therefore become an issue that unprivileged users must be concerned with. From the user's perspective, a program failure that is not detected and handled in time wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to check program state, because execution is controlled by the job management system and their privileges are limited. Currently, automated tools that support user-level failure detection and auto-recovery of parallel programs in HPC systems are missing. This paper proposes an innovative method that allows an unprivileged user to detect failures in job execution and automatically resubmit failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL detects execution failures effectively on the Tianhe-2 system and that the communication and performance overhead it introduces is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.

Keywords:	high performance computing; parallel program; failure detection; failure auto-recovery

Source:	Frontiers of Computer Science
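
The detect-and-resubmit idea summarized in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not ARL's actual implementation: `FakeScheduler`, `JobState`, `poll`, the `max_resubmits` limit, and the job names are all hypothetical, and the scheduler is an in-memory stand-in instead of a real job management system. The abstract does not specify how the dual-checker mechanism works; the sketch assumes that a checker also watches a peer checker job and resubmits it like any other failed job.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class FakeScheduler:
    """Hypothetical in-memory stand-in for a job management system."""
    states: dict = field(default_factory=dict)
    next_id: int = 1

    def submit(self, name):
        # Every submission gets a fresh job id and starts in RUNNING state.
        job_id = self.next_id
        self.next_id += 1
        self.states[job_id] = JobState.RUNNING
        return job_id

    def query(self, job_id):
        return self.states[job_id]


def poll(scheduler, watched, max_resubmits=3):
    """One polling pass: resubmit each watched job found in FAILED state.

    `watched` maps a job name to [job_id, resubmit_count] and is updated
    in place. Returns the names of the jobs resubmitted in this pass.
    """
    resubmitted = []
    for name, entry in watched.items():
        job_id, count = entry
        if scheduler.query(job_id) is JobState.FAILED and count < max_resubmits:
            entry[0] = scheduler.submit(name)  # new job id after resubmission
            entry[1] = count + 1
            resubmitted.append(name)
    return resubmitted


# Assumed setup: the checker watches the user job and a peer checker job.
sched = FakeScheduler()
watched = {
    "user_job": [sched.submit("user_job"), 0],
    "peer_checker": [sched.submit("peer_checker"), 0],
}
sched.states[watched["user_job"][0]] = JobState.FAILED  # simulate a failure
print(poll(sched, watched))  # ['user_job']
```

In a real deployment the `query` and `submit` calls would go through whatever user-level commands the site's job management system exposes, and the polling pass would itself run inside a separately submitted job, as the abstract describes for the state checker.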
