Job migration in HPC clusters by means of checkpoint/restart
Authors: Manuel Rodríguez-Pascual, Jiajun Cao, José A. Moríñigo, Gene Cooperman, Rafael Mayo-García
Affiliation: 1. Department of Technology, CIEMAT, Avda. Complutense 40, 28840, Madrid, Spain; 2. Department of Electrical and Computer Engineering, Northeastern University, 360 Huntington Avenue, Boston, MA, 02115, USA
Abstract:

Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, in a way that is fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities based on the fact that jobs can be migrated, checkpointed, and restarted in a different place or at a different time, while providing fault tolerance for every job running on the cluster. This is of utmost importance for the next generation of exascale HPC clusters, where the growing complexity of efficient scheduling makes it challenging to obtain the degree of parallelism demanded by the applications.

This article is indexed in SpringerLink and other databases.