
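A Scalable Triple Modular Redundancy Fault Tolerance Mechanism for Large-Scale MPI Parallel Computing
Cite this article: WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun. A scalable triple modular redundancy fault tolerance mechanism for large-scale MPI parallel computing [J]. Journal of Software (软件学报), 2012, 23(4): 1022-1035.
Authors: WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun
Affiliation: College of Computer, National University of Defense Technology, Changsha 410073, Hunan, China
Funding: National Natural Science Foundation of China (61003081, 61003087, 60921062)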
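Abstract (translated from the Chinese): As system scale grows, the performance of parallel computing keeps improving, but reliability keeps declining, so a fault tolerance mechanism is needed to tolerate or recover from hardware failures and data errors. The two commonly used mechanisms, Checkpoint/Restart and N-modular redundancy, both introduce additional overhead, which to some extent limits the scalability of parallel computing. With demand for high-performance computing continuing to grow, the design of scalable fault tolerance mechanisms is therefore urgent and important. Taking triple modular redundancy (TMR) as a representative case, this paper describes how traditional TMR is implemented for large-scale MPI parallel computing, analyzes the practical problems the mechanism faces, and shows that traditional TMR limits the scaling of parallel computing. To address these problems, it designs scalable triple modular redundancy (STMR) and verifies its effectiveness and scalability. STMR handles not only the fail-stop failures targeted by Checkpoint/Restart but also the great majority of data errors that hardware cannot detect directly. Finally, a simulation using the system parameters of BlueGene/L predicts how the scalability of parallel computing changes as system size increases under TMR and under STMR; the results further confirm that STMR is a scalable fault tolerance mechanism.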

Keywords: fault tolerance mechanism; scalability; triple modular redundancy; large-scale parallel computing; MPI
Received: 2010-10-08
Revised: 2011-01-20

Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing
WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun. Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing [J]. Journal of Software, 2012, 23(4): 1022-1035.
Authors: WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun
Affiliation: College of Computer, National University of Defense Technology, Changsha 410073, China
Abstract: Scaling up a system improves performance but degrades reliability, so a fault tolerance mechanism is needed to tolerate or recover from hardware failures and data errors. The popular fault tolerance mechanisms in current use, Checkpoint/Restart and N-modular redundancy, both incur additional overhead, which limits the scalability of parallel computing to some extent. Developing scalable fault tolerance mechanisms is therefore very important as the demand for high-performance computing keeps growing. This paper takes triple modular redundancy (TMR) as an example, describes the implementation of TMR for large-scale MPI parallel computing, analyzes the practical problems it faces, and argues that the traditional TMR fault tolerance mechanism limits the scalability of parallel computing. To solve these problems, the paper proposes scalable triple modular redundancy (STMR) and verifies its validity and scalability. STMR can handle not only the fail-stop failures that are traditionally handled by Checkpoint/Restart, but also most of the data errors that the hardware does not detect directly. Finally, the study runs a simulation using the system parameters of BlueGene/L, which predicts how the scalability of parallel computing changes with TMR and with STMR as the system size increases. The results further confirm that STMR is a scalable fault tolerance mechanism.
Keywords: fault tolerance mechanism; scalability; triple modular redundancy; large scale parallel computing; MPI
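
The abstract relies on the general idea behind triple modular redundancy: run the same computation three times and accept the value that at least two replicas agree on. The following is a minimal sketch of that voting step only, not the paper's TMR or STMR implementation. The function name `tmr_vote` and the use of `None` to mark a replica lost to a fail-stop failure are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def tmr_vote(results: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value reported by at least two of three replicas.

    Assumption for this sketch: a replica that stopped (fail-stop)
    is represented by None, so None can never be a valid result.
    Raises ValueError when no value has a two-replica majority.
    """
    if len(results) != 3:
        raise ValueError("TMR expects exactly three replica results")

    # Ignore replicas that produced nothing because they failed.
    alive = [r for r in results if r is not None]
    if not alive:
        raise ValueError("all replicas failed")

    # Pick the most frequent surviving value and require a majority of 2.
    value, count = Counter(alive).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among replicas")
    return value
```

With this sketch, `tmr_vote([42, 42, 41])` masks a single corrupted result, and `tmr_vote([None, 7, 7])` masks a single stopped replica. Three mutually different results raise an error, since voting alone cannot tell which replica is correct.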
This article is indexed in CNKI, Wanfang Data, and other databases.