

Iaso: an autonomous fault-tolerant management system for supercomputers
Authors:Kai Lu  Xiaoping Wang  Gen Li  Ruibo Wang  Wanqing Chi  Yongpeng Liu  Hongwei Tang  Hua Feng  Yinghui Gao
Affiliation:1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China; 2. College of Computer, National University of Defense Technology, Changsha 410073, China; 3. ATR Laboratory, National University of Defense Technology, Changsha 410073, China
Abstract:As system scale increases, the inherent reliability of supercomputers steadily declines. The cost of fault handling and task recovery rises so rapidly that reliability problems will soon harm the usability of supercomputers. This issue is referred to as the “reliability wall” and is regarded as a critical problem for current and future supercomputers. To address it, we propose an autonomous fault-tolerant system, named Iaso, in the MilkyWay-2 system. Iaso introduces the concept of autonomous management to supercomputers. Under autonomous management, the computer itself, rather than human operators, takes charge of fault management. Iaso automatically manages the whole lifecycle of faults, including fault detection, fault diagnosis, fault isolation, and task recovery. Iaso endows the MilkyWay-2 system with autonomous features such as self-awareness, self-diagnosis, self-healing, and self-protection. With the help of Iaso, the cost of fault handling in supercomputers falls from several hours to a few seconds. Iaso greatly improves the usability and reliability of the MilkyWay-2 system.
Keywords:supercomputer  autonomous management  fault tolerant  fault management  MilkyWay-2 system  
This article is indexed in SpringerLink and other databases.
The original abstract and full text are available from Frontiers of Computer Science.