Fault-tolerant scheduling on parallel systems with non-memoryless failure distributions
Authors: Mohamed Slim Bouguerra, Derrick Kondo, Fernando Mendonca, Denis Trystram
Affiliation: 1. INRIA, 655 Avenue de l’Europe, 38334 Saint Ismier cedex, France; 2. Grenoble Institute of Technology, France; 3. Institut Universitaire de France, France
Abstract: As large parallel systems grow in size and complexity, failures become inevitable and exhibit complex dynamics in space and time. In real systems, failure rates most often increase or decrease over time. Considering non-memoryless failure distributions, we study a bi-objective scheduling problem that optimizes application makespan and reliability. In particular, we determine whether both makespan and reliability can be optimized simultaneously, or whether one metric must be degraded to improve the other. We also devise scheduling algorithms that achieve (approximately) optimal makespan or reliability. When failure rates decrease, we prove that makespan and reliability are opposing metrics. In contrast, when failure rates increase, we prove that both can be optimized simultaneously. Moreover, we show that the largest processing time (LPT) list scheduling algorithm achieves good performance when processors are of uniform speed. These findings imply faster completion and improved reliability for parallel jobs executed across large distributed systems. Finally, we conduct simulations based on a real biological sequence comparison application to investigate the impact of failures on performance.
Keywords: Fault tolerance, Reliability, Scheduling, Multi-objective optimization
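
The LPT list scheduling rule named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "uniform speed" means each processor has a fixed speed (identical processors being the case where all speeds are equal), it ignores failures and reliability entirely, and the names `lpt_schedule`, `work`, and `speeds` are hypothetical, chosen only for this example.

```python
from typing import List, Tuple


def lpt_schedule(work: List[float], speeds: List[float]) -> Tuple[List[List[int]], float]:
    """Largest-processing-time list scheduling (illustrative sketch).

    work[i]   -- amount of work in task i (assumed input format)
    speeds[j] -- speed of processor j, at least one processor required
    Returns (assignment, makespan), where assignment[j] lists the task
    indices placed on processor j. Failures are NOT modeled here.
    """
    finish = [0.0] * len(speeds)          # current completion time per processor
    assignment: List[List[int]] = [[] for _ in speeds]

    # Consider the longest tasks first; ties are broken by task index.
    order = sorted(range(len(work)), key=lambda i: (-work[i], i))

    for i in order:
        # Place the task on the processor where it would complete earliest.
        p = min(range(len(speeds)),
                key=lambda j: (finish[j] + work[i] / speeds[j], j))
        finish[p] += work[i] / speeds[p]
        assignment[p].append(i)

    return assignment, max(finish)


if __name__ == "__main__":
    tasks = [7, 5, 4, 3, 3, 2]
    # Two identical processors.
    print(lpt_schedule(tasks, [1.0, 1.0]))
    # One processor twice as fast as the other.
    print(lpt_schedule(tasks, [2.0, 1.0]))
```

For the six tasks above, the total work is 24, so a makespan of 12 on two identical processors (or 8 when the combined speed is 3) means the load is perfectly balanced in this small instance. In general LPT is a heuristic and does not guarantee an optimal makespan.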
Indexed in: ScienceDirect and other databases