首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Failure detection and consensus in the crash-recovery model   总被引:2,自引:0,他引:2  
Summary. We study the problems of failure detection and consensus in asynchronous systems in which processes may crash and recover, and links may lose messages. We first propose new failure detectors that are particularly suitable to the crash-recovery model. We next determine under what conditions stable storage is necessary to solve consensus in this model. Using the new failure detectors, we give two consensus algorithms that match these conditions: one requires stable storage and the other does not. Both algorithms tolerate link failures and are particularly efficient in the runs that are most likely in practice – those with no failures or failure detector mistakes. In such runs, consensus is achieved within time and with 4 n messages, where is the maximum message delay and n is the number of processes in the system. Received: May 1998 / Accepted: November 1999  相似文献   

2.
Synchronous atomic broadcast for redundant broadcast channels   总被引:4,自引:3,他引:1  
We propose a synchronous atomic broadcast protocol for distributed real-time systems based on redundant broadcast channels. The protocol can tolerate a finite number f of concurrent processor crash failures, channel adapter performance failures and channel omission failures. Its message cost is optimal: when no failures occur only f+1 messages are sent per broadcast. The cost implications of providing tolerance to other failure classes are also investigated.  相似文献   

3.
Summary The binary Byzantine Agreement problem requiresn–1 receivers to agree on the binary value broadcast by a sender even when some of thesen processes may be faulty. We investigate the message complexity of protocols that solve this problem in the case of crash failures. In particular, we derive matching upper and lower bounds on the total, worst and average case number of meassages needed in the failure-free executions of such protocols.More specifically, we prove that any protocol that tolerates up tot faulty processes requires a total of at leastn+t–1 messages in its failure-free executions —and, therefore, at least [(n+t–1)/2] messages in the worst case and min (P 0,P 1)·(n+t–1) meassages in the average case, whereP v is the probability that the value of the bit that the sender wants to broadcast isv. We also give protocols that solve the problem using only the minimum number of meassages for these three complexity measures. These protocols can be implemented by using 1-bit messages. Since a lower bound on the number of messages is also a lower bound on the number of meassage bits, this means that the above tight bounds on the number of messages are also tight bounds on the number of meassage bits. Vassos Hadzilacos received a BSE from Princeton University in 1980 and a PhD from Harvard University in 1984, both in Computer Science. In 1984 he joined the Department of Computer Science at the University of Toronto where he is currently an Associate Professor. In 1990–1991 he was visiting Associate Professor in the Department of Computer Science at Cornell University. His research interests are in the theory of distributed systems. Eugene Amdur obtained a B. Math from the University of Waterloo in 1986 and a M.Sc. from the University of Toronto in 1988. He is currently employed by the Vision and Robotics group at the University of Toronto in both technical and research capacities. His current areas of interest are vision, robotics, and networking. Samuel Weber received his B.Sc. in Mathematics and Computer Science and his M.Sc. in Computer Science from the University of Toronto. Currently, he is at Cornell University as a Ph.D. student in Computer Science with a minor in Psychology. His research interests include distributed systems, and the semantics of programming languages.  相似文献   

4.
Impossibility results and best-case lower bounds are proved for the number of message delays and the number of processes required to reach agreement in an asynchronous consensus algorithm that tolerates non-Byzantine failures. General algorithms exist that achieve these lower bounds in the normal case, when the response time of non-faulty processes and the transmission delay of messages they send to one another are bounded. Our theorems allow algorithms to do better in certain exceptional cases, and such algorithms are presented. Two of these exceptional algorithms may be of practical interest.  相似文献   

5.
We consider the problem of reaching agreement in distributed systems in which some processes may deviate from their prescribed behavior before they eventually crash. We call this failure model “mortal Byzantine”. After discussing some application examples where this model is justified, we provide matching upper and lower bounds on the number of faulty processes, and on the required number of rounds in synchronous systems. We then continue our study by varying different system parameters. On the one hand, we consider the failure model under weaker timing assumptions, namely for partially synchronous systems and asynchronous systems with unreliable failure detectors. On the other hand, we vary the failure model in that we limit the occurrences of faulty steps that actually lead to a crash in synchronous systems.  相似文献   

6.
We define a simple variant of the Byzantine agreement (BA) problem, called the Failure Discovery (FD) problem, that roughly speaking, amounts to reaching BA provided that no failures are discovered. We show how a protocol for FD can be extended to one for BA, with no message overhead in the failure-free runs. We also show that, for so-calledbenign failures, if the FD protocol satisfies an additional property, the message-preserving extension to a BA protocol can be accomplished with minimal time overhead in the failure-free runs. Our results show that FD is a useful building block for BA; indeed, it has been used in this way in a companion paper (Hadzilacos and Halpern, 1993).The work of V. Hadzilacos was supported, in part, by a grant from the Natural Sciences and Engineering Research Council of Canada.  相似文献   

7.
Failure detectors have been shown to be a very useful mechanism to solve the consensus problem in the crash failure model, for which a number of communication-efficient algorithms have been proposed. In this paper we deal with the definition, implementation and use of communication-efficient failure detectors in the general omission failure model, where processes can fail by crashing and by omitting messages when sending and/or receiving. We first define a new failure detector class for this model in terms of completeness and accuracy properties. Then we propose an algorithm that implements a failure detector of the proposed class in a communication-efficient way, in the sense that only a linear number of links are used to send messages forever. We also explain how the well-known consensus algorithm of Chandra and Toueg can be adapted in order to use the proposed failure detector.  相似文献   

8.
Summary.  We consider agreement and leader election on asynchronous complete networks when the processors are reliable, but some of the channels are subject to failure. Fischer, Lynch, and Paterson have already shown that no deterministic algorithm can solve the agreement problem on asynchronous networks if any processor fails during the execution of the algorithm. Therefore, we consider only channel failures. The type of channel failure we consider in this paper is Byzantine failure, that is, channels fail by altering messages, sending false information, forging messages, losing messages at will, and so on. There are no restrictions on the behavior of a faulty channel. Therefore, a faulty channel may act as an adversary who forges messages on purpose to prevent the successful completion of the algorithm. Because we assume an asynchronous network, the channel delays are arbitrary. Thus, the faulty channels may not be detectable unless, for example, the faulty channels cause garbage to be sent. We present the first known agreement and leader election algorithm for asynchronous complete networks in which the processors are reliable but some channels may be Byzantine faulty. The algorithm can tolerate up to [n−22] faulty channels, where n is the number of processors in the network. We show that the bound on the number of faulty channels is optimal. When the processors terminate their corresponding algorithms, all the processors in the network will have the same correct vector, where the vector contains the private values of all the processors. Received: May 1994/Accepted: July 1995  相似文献   

9.
Fail-Awareness: An Approach to Construct Fail-Safe Systems   总被引:1,自引:0,他引:1  
We present a framework for building fail-safe hard real-time applications in timed asynchronous distributed systems subject to communication partitions and performance, omission, and crash failures. Most distributed systems built from commercial-off-the-shelf (COTS) processor and communication services are subject to such partitions because their COTS components do not provide hard real-time guarantees. Also custom designed systems can be subject to partitions due to unmaskable link or router failures. The basic assumption behind our approach is that each processor has a local hardware clock that proceeds within a linear envelope of real-time. This allows one to compute an upper bound on the actual delays incurred by a particular processing sequence or message transmission. Services and applications can use these computed bounds to detect when they cannot guarantee all their standard properties because of excessive delays. This allows an application to be fail-aware, that is, to detect when it cannot guarantee all its safety properties and in particular, to detect when to switch to a fail-safe mode.  相似文献   

10.
In many distributed computing environments, processes are concurrently executed by nodes in a store-and-forward network. Distributed control issues as diverse as name-server, mutual exclusion, and replicated data management, involve making matches between processes. The generic paradigm is a formal problem called “distributed match-making.” We define multidimensional and weighted versions, and the relations between the two, and develop a very general method to prove lower bounds on the complexity as a tradeoff between number of messages and “distributedness.” The resulting lower bounds are tight in all cases we have examined. We present a success-stop version of distributed match-making that is analysed in terms of a weight distribution that in all cases results in approximately halving the (expected) number of messages required in the corresponding strategy that does not use these weights. The second author did part of this work at the Laboratory for Computer Science, M.I.T., Cambridge, MA. He was supported in part by the Office of Naval Research under Contract N00014-85-K-0168, by the Office of Army Research under Contract DAAG29-84-K-0058, by the National Science Foundation under Grant DCR-83-02391, and by the Defence Advanced Research Projects Agency (DARPA) under Contract N00014-83-K-0125. A preliminary version of this paper appeared inProc. VLSI Algorithms and Architectures, 3rd Aegean Workshop on Computing (AWOC 88), Lecture Notes in Computer Science, vol. 319, Springer-Verlag, Berlin, 1988, pp. 361–368.  相似文献   

11.
We consider the gossip problem in a synchronous message-passing system. Participating processors are prone to omission failures, that is, a faulty processor may fail to send or receive a message. The gossip problem in the fault-tolerant setting is defined as follows: every correct processor must learn the initial value of any other processor, unless the other one is faulty; in the latter case either the initial value or the information about the fault must be learned. We develop two efficient algorithms that solve the gossip problem in time O(logn), where n is the number of processors in the system. The first one is an explicit algorithm (i.e., constructed in polynomial time) sending O(nlogn+f2) messages, and the second one reduces the message complexity to O(n+f2), where f is the upper bound on the number of faulty processors.  相似文献   

12.
Reliability is an important research topic in distributed computing systems consisting of a large number of processors. To achieve reliability, the fault-tolerance scheme of the distributed computing system must be revised. This kind of problem is known as a Byzantine agreement (BA) problem. It requires all fault-free processors to agree on a common value, even if some components are corrupt. Consequently, there have been significant studies of this agreement problem in distributed systems. However, the traditional BA protocols focus on running ⌊(n−1)/3⌋+1 rounds of message exchange continuously to make each fault-free processor reach an agreement. In other words, since having a large number of messages results in a large protocol overhead, those protocols are inefficient and unreasonable, especially for some network environments which have large number of processors. In this study, we propose a novel and efficient protocol to reduce the number of messages. Our protocol can collect, compare and replace the received values to find the reliable processors and replace the values sent by the unreliable processors. Subsequently, each processor can agree on a common value through three rounds of message exchange. Furthermore, the proposed protocol can use the minimum number of messages to tolerate the maximum number of faulty components in a distributed system.  相似文献   

13.
The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQUA uses, when an application's dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures simultaneously. In particular, we provide an active replication communication scheme that maintains data consistency among replicas, detects crash failures, collates the messages generated by replicated objects, and delivers the result of each vote. We also present an adaptive majority voting algorithm that enables the correct ongoing vote while both the number of replicas and the majority size dynamically change. Together, these two algorithms form the basis of the mechanism for tolerating and recovering from value faults and crash failures in AQuA  相似文献   

14.
Wireless ad-hoc networks are being increasingly used in diverse contexts, ranging from casual meetings to disaster recovery operations. A promising approach is to model these networks as distributed systems prone to dynamic communication failures. This captures transitory disconnections in communication due to phenomena like interference and collisions, and permits an efficient use of the wireless broadcasting medium. This model, however, is bound by the impossibility result of Santoro and Widmayer, which states that, even with strong synchrony assumptions, there is no deterministic solution to any non-trivial form of agreement if n ? 1 or more messages can be lost per communication round in a system with n processes. In this paper we propose a novel way to circumvent this impossibility result by employing randomization. We present a consensus protocol that ensures safety in the presence of an unrestricted number of omission faults, and guarantees progress in rounds where such faults are bounded by ${f \,{\leq}\,\lceil \frac{n}{2} \rceil (n\,{-}\,k)\,{+}\,k\,{-}\,2}$ , where k is the number of processes required to decide, eventually assuring termination with probability 1.  相似文献   

15.
Abstract. We study the average number of delays suffered by packets routed using greedy (work conserving) scheduling policies. We obtain tight bounds on the worst-case average number of delays in a few cases as follows. First, we show that the worst-case average number of delays is a function of the number of sources of packets, which is interesting in case a node may send many packets. Then, using a new concept we call delay race , we prove a tight bound on the average number of delays in a leveled graph. Finally, using delay races in a more involved way, we prove nearly tight bounds on the average number of delays in directed acyclic graphs (DAGs). The upper bound for DAGs is expressed in terms of the underlying topology, and as a result it holds for any acyclic set of routes, even if they are not shortest paths. The lower bound for DAGs, on the other hand, holds even for shortest paths routes.  相似文献   

16.
Summary Byzantine Agreement is important both in the theory and practice of distributed computing. However, protocols to reach Byzantine Agreement are usually expensive both in the time required as well as in the number of messages exchanged. In this paper, we present a self-adjusting approach to the problem. The Mostly Byzantine Agreement is proposed as a more restrictive agreement problem that requires that in the consecutive attempts to reach agreement, the number of disagreements (i.e., failures to reach Byzantine Agreement) is finite. Fort faulty processes, we give an algorithm that has at mostt disagreements for 4t or more processes. Another algorithm is given forn3t+1 processes with the number of disagreements belowt 2/2. Both algorithms useO(n 3) message bits for binary value agreement. Yi Zhao is currently working on his Ph.D. degree in Computer Science at University of Houston. His research interests include fault tolerance, distributed computing, parallel computation and neural networks. He obtained his M.S. from University of Houston in 1988 and B.S. from Beijing University of Aeronautics and Astronautics in 1984, both in computer science. Farokh B. Bastani received the B. Tech. degree in electrical engineering from the Indian Institute of Technology, Bombay, India, and the M.S. and Ph.D. degrees in electrical engineering and computer science from the University of California, Berkeley. He joined the University of Houston in 1980, where he is currently an Associate Professor of Computer Science. His research interests include software design and validation techniques, distributed systems, and fault-tolerant systems. He is a member of the ACM and the IEEE and is on the editorial board of theIEEE Transactions on Software Engineering.  相似文献   

17.
An indulgent algorithm is a distributed algorithm that tolerates asynchronous periods of the network when process crash detection is unreliable. This paper presents a tight bound on the time complexity of indulgent consensus algorithms. We consider a round-based eventually synchronous model, and we show that any t-resilient consensus algorithm in this model, requires at least t+2 rounds for a global decision even in runs that are synchronous. We contrast our lower bound with the well-known t+1 round tight bound on consensus in the synchronous model. We then prove the bound to be tight by exhibiting a new t-resilient consensus algorithm in the eventually synchronous model that reaches a global decision at round t+2 in every synchronous run. Our new algorithm is in this sense significantly faster than the most efficient indulgent algorithm we know of, which requires 2t+2 rounds in synchronous runs. Our lower bound applies to round-based consensus algorithms with unreliable failure detectors such as ⋄ P and ⋄ S, and our matching algorithm can be adapted to such failure detectors. This work is partially supported by the Swiss National Science Foundation (project number 510-207).  相似文献   

18.
Traditionally, the Byzantine Agreement (BA) problem is studied either in a fully connected network or in a broadcast network. A generalized network model for BA is proposed in this paper. A fully-connected network or a broadcast network is a special case of the new network architecture. Under the new generalized network model, the BA problem is reexamined with the assumption of malicious faults on both processors and transmission medium (TM), as opposed to previous studies which consider malicious faults on processors only. The proposed algorithm uses the minimum number of message exchanges, and can tolerate the maximum number of allowable faulty components to make each healthy processor reach a common agreement for the cases of processor failures, TM failures, or processor/TM failures. The results can also be used to solve the interactive consistency problem and the consensus problem  相似文献   

19.
基于代理的Byzantine一致性协议的研究   总被引:1,自引:0,他引:1  
本文在研究了国内外Byzantine协议的基础上提出了一种新的Byzantine一致性协议,即基于代理的Byzantine一致性协议。该协议按照Byzantine容错机制将所有参与运算的进程分成很多小块,每个块设有一个代理。通过代理,块内的进程向其他块的进程发送运算结果。这样,在进程发生Byzantine错误时可以先在块的内部处理,从而可以有效地减少容错的开销和时延,提高系统的安全性。  相似文献   

20.
We show that deadlocks due to dependencies on consumption channels are a fundamental problem in wormhole multicast routing. This type of resource deadlocks has not been addressed in many previously proposed wormhole multicast algorithms. We also show that deadlocks on consumption channels can be avoided by using multiple classes of consumption channels and restricting the use of consumption channels by multicast messages. We provide upper bounds for the number of consumption channels required to avoid deadlocks. In addition, we present a new multicast routing algorithm, column-path, which is based on the well-known dimension-order routing used in many multicomputers and multiprocessors. Therefore, this algorithm could be implemented in existing multicomputers with simple changes to the hardware. Using simulations, we compare the performance of the proposed column-path algorithm with the previously proposed Hamiltonian-path-based multipath and an e-cube-based multicast routing algorithms. Our results show that for multicast traffic, the column-path routing offers higher throughputs, while the multipath algorithm offers lower message latencies. Another result of our study is that the commonly implemented simplistic scheme of sending one copy of a multicast message to each of its destinations exhibits good performance provided the number of destinations is small  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号