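Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework
Cite this article: ZHANG Ji, XIE Zaipeng, MAO Yingchi, XU Yuanyuan, ZHU Xiaorui, LI Bowen. Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework[J]. Computer Engineering, 2021, 47(4): 173-179.
Authors: ZHANG Ji  XIE Zaipeng  MAO Yingchi  XU Yuanyuan  ZHU Xiaorui  LI Bowen
Affiliation: School of Computer and Information, Hohai University, Nanjing 211100, China
Funding: National Key Research and Development Program of China; Key Program of the National Natural Science Foundation of China
Abstract: As distributed systems grow in scale and computational complexity, the mean time to repair (MTTR) of distributed computing and the communication overhead incurred by fault-tolerant computing are steadily increasing. This paper combines distributed coding computing with replica redundancy to propose a new fault-tolerant algorithm. Map nodes apply the idea of distributed coding computing: data is redundantly assigned to multiple computing nodes to create coded intermediate results, which reduces the amount of data the computing nodes transmit in the shuffle phase. Reduce nodes decode the coded intermediate results they receive, thereby verifying the correctness of the intermediate results and obtaining the final computation result. Experimental results show that, in a MapReduce-based distributed computing framework, the proposed algorithm completes fault-tolerant computation while reducing communication overhead and MTTR compared with the triple modular redundancy (TMR) and two-stage TMR fault-tolerant algorithms, and that it improves the availability and reliability of the distributed system.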

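Keywords: distributed system  distributed computing  fault-tolerant algorithm  distributed coding computing  triple modular redundancy
Received: 2020-03-13
Revised: 2020-04-28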

Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework
ZHANG Ji, XIE Zaipeng, MAO Yingchi, XU Yuanyuan, ZHU Xiaorui, LI Bowen. Fault-Tolerant Algorithm Combined with Distributed Coding Computing in MapReduce Framework[J]. Computer Engineering, 2021, 47(4): 173-179.
Authors: ZHANG Ji  XIE Zaipeng  MAO Yingchi  XU Yuanyuan  ZHU Xiaorui  LI Bowen
Affiliation: School of Computer and Information, Hohai University, Nanjing 211100, China
Abstract: The growing size and computational complexity of distributed systems lead to an increase in the Mean Time to Repair (MTTR) of distributed computing systems and in the communication load caused by fault-tolerant computing. To address these problems, this paper integrates distributed coding computing with replica redundancy to propose a novel fault-tolerant algorithm. The map node uses the idea of distributed coding computing to allocate data replicas to multiple computing nodes, creating coded intermediate results and reducing the amount of data transmitted by the computing nodes in the shuffle phase. The reduce node decodes the received coded intermediate results to verify their correctness and obtain the final computing result. Experimental results show that, in the MapReduce framework, the proposed algorithm reduces the communication overhead and MTTR compared with the Triple Modular Redundancy (TMR) and two-stage TMR fault-tolerant algorithms. It also improves the availability and reliability of distributed systems.
Keywords: distributed system  distributed computing  fault-tolerant algorithm  distributed coding computing  Triple Modular Redundancy (TMR)
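
The abstract does not specify the coding scheme, so the following is only a toy sketch of the general idea, not the authors' algorithm. It assumes a simple XOR parity code over two integer intermediate results: two map nodes each send one result, and a third node, which holds replicas of both input blocks, sends their XOR. The reducer can then recover one lost result or detect a mismatch. A TMR-style majority vote is included for contrast. All function names (map_word_count, encode, decode, tmr_vote) are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def map_word_count(block: str) -> int:
    """Hypothetical map function: count the words in one input block."""
    return len(block.split())


def encode(a: int, b: int) -> List[int]:
    """Return three coded intermediate results: a, b, and the parity a XOR b."""
    return [a, b, a ^ b]


def decode(received: List[Optional[int]]) -> Tuple[int, int]:
    """Recover (a, b) from [a, b, parity]; None marks a result lost to a node failure.

    At most one entry may be missing. If all three arrive, the parity is
    used to check the results for consistency.
    """
    if sum(x is None for x in received) > 1:
        raise ValueError("more than one coded result lost")
    a, b, parity = received
    if a is None:
        a = b ^ parity          # rebuild a from b and the parity
    elif b is None:
        b = a ^ parity          # rebuild b from a and the parity
    elif parity is not None and a ^ b != parity:
        raise ValueError("parity check failed: a result is corrupted")
    return a, b


def tmr_vote(copies: List[int]) -> int:
    """TMR baseline (assumed form): majority vote over three copies of one result."""
    value, votes = Counter(copies).most_common(1)[0]
    if votes < 2:
        raise ValueError("no majority among the three copies")
    return value


# Two input blocks; in this sketch the third node holds replicas of both.
a = map_word_count("fault tolerant map")   # 3 words
b = map_word_count("reduce shuffle")       # 2 words
coded = encode(a, b)

# The node that computed a fails; the reducer still recovers both results.
recovered = decode([None, coded[1], coded[2]])
print(sum(recovered))                      # total word count: 5
```

In this toy setting the reducer receives three messages for two intermediate results, whereas voting over three copies of each result would take six. That difference is a property of the sketch only and says nothing about the communication savings measured in the paper. Note also that a single parity can detect one corrupted value but cannot tell which of the three messages is wrong.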
This article is indexed in VIP, Wanfang Data, and other databases.