首页 | 本学科首页   官方微博 | 高级检索  
     


PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs
Authors:Xin-Hai Xu  Xue-Jun Yang  Jing-Ling Xue  Yu-Fei Lin  Yi-Song Lin
Affiliation:(1) National Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, 410073, China;(2) Programming Languages and Compilers Group, School of Computer Science and Engineering University of New South Wales, Sydney, Australia
Abstract:GPGPUs are increasingly being used to as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world's fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC reduces significantly the fault recovery overheads incurred by FullRC, by 73.5% when errors occur earlier during execution and 74.6% when errors occur later on average. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.
Keywords:GPGPU  partial recomputing  fault tolerance  CUDA  checkpointing
本文献已被 CNKI SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号