首页 | 本学科首页   官方微博 | 高级检索  
     


Fault tolerance in distributed systems using fused state machines
Authors:Bharath Balasubramanian  Vijay K Garg
Affiliation:1. EDGE Lab, Department of Electrical Engineering, Princeton University, Engineering Quadrangle, Olden Street, Princeton, NJ, 08544, USA
2. Parallel and Distributed Systems Laboratory, Department of Electrical and Computer Engineering, The University of Texas at Austin, 1 University Station, C0803, Austin, TX, 78712-0240, USA
Abstract:Replication is a standard technique for fault tolerance in distributed systems modeled as deterministic finite state machines (DFSMs or machines). To correct \(f\) crash or \(\lfloor f/2 \rfloor \) Byzantine faults among \(n\) different machines, replication requires \(nf\) backup machines. We present a solution called fusion that requires just \(f\) backup machines. First, we build a framework for fault tolerance in DFSMs based on the notion of Hamming distances. We introduce the concept of an ( \(f\) , \(m\) )-fusion, which is a set of \(m\) backup machines that can correct \(f\) crash faults or \(\lfloor f/2 \rfloor \) Byzantine faults among a given set of machines. Second, we present an algorithm to generate an ( \(f\) , \(f\) )-fusion for a given set of machines. We ensure that our backups are efficient in terms of the size of their state and event sets. Third, we use locality sensitive hashing for the detection and correction of faults that incurs almost the same overhead as that for replication. We detect Byzantine faults with time complexity \(O(n f)\) on average while we correct crash and Byzantine faults with time complexity \(O(n \rho f)\) with high probability, where \(\rho \) is the average state reduction achieved by fusion. Finally, our evaluation of fusion on the widely used MCNC’91 benchmarks for DFSMs shows that the average state space savings in fusion (over replication) is 38 % (range 0–99 %). To demonstrate the practical use of fusion, we describe its potential application to two areas: sensor networks and the MapReduce framework. In the case of sensor networks a fusion-based solution can lead to significantly fewer sensor-nodes than a replication-based solution. For the MapReduce framework, fusion can reduce the number of map-tasks compared to replication. Hence, fusion results in considerable savings in state space and other resources such as the power needed to run the backups.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号