Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud
Authors: Ifeanyi P. Egwutuoha, Shiping Chen, David Levy, Bran Selic, Rafael Calvo
Affiliation: 1. School of Electrical and Information Engineering, The University of Sydney, NSW 2006, Australia; 2. CSIRO, Information Engineering Laboratory, CSIRO ICT Centre, Sydney, NSW, Australia
Abstract: Cloud computing offers new computing paradigms, capacity, and flexible solutions for high performance computing (HPC) applications. For example, Hardware as a Service (HaaS) allows users to provision a large number of virtual machines (VMs) for computation-intensive applications. Because an HPC system in the cloud comprises a large number of VMs and electronic components, any fault during execution would require re-running the applications, which costs time, money, and energy. In this paper we present a proactive fault tolerance (FT) approach to HPC systems in the cloud that reduces the wall-clock execution time and dollar cost in the presence of faults. We also develop a generic FT algorithm for HPC systems in the cloud; the algorithm does not rely on a spare node being available before a failure is predicted. In addition, we develop a cost model for executing computation-intensive applications on HPC systems in the cloud, and we analyse the dollar cost of provisioning spare nodes and of checkpointing FT to assess the value of our approach. Experimental results obtained from a real cloud execution environment show that the wall-clock execution time and cost of running computation-intensive applications in the cloud can be reduced by as much as 30%. Compared with current FT approaches, our FT approach for HPC in the cloud can reduce the frequency of checkpointing for computation-intensive applications by up to 50%.
Keywords: HPC; Cloud computing; HaaS; proactive fault tolerance; computation-intensive