Unified model for assessing checkpointing protocols at extreme‐scale
Authors:George Bosilca, Aurélien Bouteiller, Elisabeth Brunet, Franck Cappello, Jack Dongarra, Amina Guermouche, Thomas Herault, Yves Robert, Frédéric Vivien, Dounia Zaidouni
Abstract:In this paper, we present a unified model for several well‐known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault‐tolerant protocols for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. Copyright © 2013 John Wiley & Sons, Ltd.
Keywords:checkpoint/restart  coordinated checkpoint  hierarchical checkpoint with message logging  checkpointing waste optimization problem