首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We focus on the problem of synthesizing failsafe fault-tolerance where fault-tolerance is added to an existing (fault-intolerant) program. A failsafe fault-tolerant program satisfies its specification (including safety and liveness) in the absence of faults. However, in the presence of faults, it satisfies its safety specification. We present a somewhat unexpected result that, in general, the problem of synthesizing failsafe fault-tolerant distributed programs from their fault-intolerant version is NP-complete in the state space of the program. We also identify a class of specifications, monotonic specifications, and a class of programs, monotonic programs, for which the synthesis of failsafe fault-tolerance can be done in polynomial time (in program state space). As an illustration, we show that the monotonicity restrictions are met for commonly encountered problems, such as Byzantine agreement, distributed consensus, and atomic commitment. Furthermore, we evaluate the role of these restrictions in the complexity of synthesizing failsafe fault-tolerance. Specifically, we prove that if only one of these conditions is satisfied, the synthesis of failsafe fault-tolerance is still NP-complete. Finally, we demonstrate the application of monotonicity property in enhancing the fault-tolerance of (distributed) nonmasking fault-tolerant programs to masking.  相似文献   

2.
In this paper, we present a software framework for adding fault-tolerance to existing finite-state programs. The input to our framework is a fault-intolerant program and a class of faults that perturbs the program. The output of our framework is a fault-tolerant version of the input program. Our framework provides (1) the first automated tool for the synthesis of fault-tolerant distributed programs, and (2) an extensible platform for researchers to develop a repository of heuristics that deal with the complexity of adding fault-tolerance to distributed programs. We also present a set of heuristics for polynomial-time addition of fault-tolerance to distributed programs. We have used this framework for automated synthesis of several fault-tolerant programs including a simplified version of an aircraft altitude switch, token ring, Byzantine agreement, and agreement in the presence of Byzantine and fail-stop faults. These examples illustrate that our framework can be used for synthesizing programs that tolerate different types of faults (process restarts, Byzantine and fail-stop) and programs that are subject to multiple faults (Byzantine and fail-stop) simultaneously. We have found our framework to be highly useful for pedagogical purposes, especially for teaching concepts of fault-tolerance, automatic program transformation, and the effect of heuristics.  相似文献   

3.
Masking fault-tolerance guarantees that programs continually satisfy their specification in the presence of faults. By way of contrast, nonmasking fault-tolerance does not guarantee as much: it merely guarantees that when faults stop occurring, program executions converge to states from where programs continually (re)satisfy their specification. We present in this paper a component based method for the design of masking fault-tolerant programs. In this method, components are added to a fault-intolerant program in a stepwise manner, first, to transform the fault-intolerant program into a nonmasking fault-tolerant one and, then, to enhance the fault-tolerance from nonmasking to masking. We illustrate the method by designing programs for agreement in the presence of Byzantine faults, data transfer in the presence of message loss, triple modular redundancy in the presence of input corruption, and mutual exclusion in the presence of process fail-stops. These examples also serve to demonstrate that the method accommodates a variety of fault-classes. It provides alternative designs for programs usually designed with extant design methods, and it offers the potential for improved masking fault-tolerant programs  相似文献   

4.
We focus on the constraint-based automated addition of nonmasking and stabilizing fault-tolerance to hierarchical programs. We specify legitimate states of the program in terms of constraints that should be satisfied in those states. To deal with faults that may violate these constraints, we add recovery actions while ensuring interference freedom among the recovery actions added for satisfying different constraints. Since the constraint-based manual design of fault-tolerance is well known, we expect our approach to have a significant benefit in automating the addition of fault-tolerance. We illustrate our algorithm with four case studies: stabilizing mutual exclusion, stabilizing diffusing computation, a data dissemination problem in sensor networks, and tree maintenance. With experimental results, we show that the complexity of our algorithm is reasonable and that it can be reduced using the structure of the hierarchical systems.We also reduced the time complexity of the synthesis using parallelism. We consider two approaches to speedup the synthesis algorithm: first, the use of the multiple constraints that have to be satisfied during synthesis; second, the use of the distributed nature of the programs being synthesized. We show that our approaches provide significant reduction in the synthesis time.To our knowledge, this is the first instance where automated synthesis has been successfully used in synthesizing programs that are correct under fairness assumptions. Moreover, in three of the case studies considered in this paper, the structure of the recovery paths is too complex to permit existing heuristic-based approaches for adding recovery.  相似文献   

5.
In this paper, we investigate the effect of the representation of safety specification on the complexity of adding masking fault tolerance to programs - where, in the presence of faults, the program 1) recovers to states from where it satisfies its (safety and liveness) specification and 2) preserves its safety specification during recovery. Specifically, we concentrate on two approaches for modeling the safety specifications: 1) the bad transition (BT) model, where safety is modeled as a set of bad transitions that should not be executed by the program, and 2) the bad pair (BP) model, where safety is modeled as a set of finite sequences consisting of at most two successive transitions. If the safety specification is specified in the BT model, then it is known that the complexity of automatic addition of masking fault tolerance to high atomicity programs - where processes can read/write all program variables in an atomic step) - is polynomial in the state space of the program. However, for the case where one uses the BP model to specify safety specification, we show that the problem of adding masking fault tolerance to high atomicity programs is NP-complete. Therefore, we argue that automated synthesis of fault-tolerant programs is likely to be more successful if one focuses on problems where safety can be represented in the BT model.  相似文献   

6.
Fault-tolerant computing: fundamental concepts   总被引:2,自引:0,他引:2  
Nelson  V.P. 《Computer》1990,23(7):19-25
The basic concepts of fault-tolerant computing are reviewed, focusing on hardware. Failures, faults, and errors in digital systems are examined, and measures of dependability, which dictate and evaluate fault-tolerance strategies for different classes of applications, are defined. The elements of fault-tolerance strategies are identified, and various strategies are reviewed. They are: error detection, masking, and correction; error detection and correction codes; self-checking logic; module replication for error detection and masking; protocol and timing checks; fault containment; reconfiguration and repair; and system recovery  相似文献   

7.
ContextBusiness Process Execution Language (BPEL) is a widely recognized executable service composition language, which is significantly different from typical programming languages in both syntax and semantics, and especially shorter in program scale. How to effectively locate faults in BPEL programs is an open and challenging problem.ObjectiveIn this paper, we propose a fault localization framework for BPEL programs.MethodBased on BPEL program characteristics, we propose two fault localization guidelines to locate the integration and interaction faults in BPEL programs. Our framework formulates the BPEL fault localization problem using the popular fault localization problem settings, and synthesizes BPEL-specific fault localization techniques by reuse of existing fault localization formulas. We use two realistic BPEL programs and three existing fault localization formulas to evaluate the feasibility and effectiveness of the proposed fault localization framework and guidelines.ResultExperiment results show that faults can be located with the fewest code examining efforts. That is, the fault-relevant basic block is assigned the highest suspiciousness score by our fault localization method. The experiment results also show that with the use of the proposed fault localization guidelines, the code examining efforts to locate faults are extraordinarily reduced.ConclusionWe conclude that the proposed framework is feasible in synthesizing effective fault localization techniques, and our fault localization guidelines are very effective to enhance existing fault localization techniques in locating faults in BPEL programs.  相似文献   

8.
The fault-tolerance characteristics of time-continuous, recurrent artificial neural networks (ANNs) that can be used to solve optimization problems are investigated. The performance of these networks is illustrated by using well-known model problems like the traveling salesman problem and the assignment problem. The ANNs are then subjected to up to 13 simultaneous stuck-at-1 or stuck-at-0 faults for network sizes of up to 900 neurons. The effect of these faults on the performance is demonstrated, and the cause for the observed fault-tolerance is discussed. An application is presented in which a network performs a critical task for a real-time distributed processing system by generating new task allocations during the reconfiguration of the system. The performance degradation of the ANN under the presence of faults is investigated by large-scale simulations and the potential benefits of delegating a critical task to a fault-tolerant network are discussed.  相似文献   

9.
A system is fault tolerant if it remains functional after the occurrence of a fault. Given a plant subject to a fault, fault-tolerant control requires the controller to form a fault-tolerant closed-loop system. For the systematic design of a fault-tolerant controller, typical input data consists of the plant dynamics including the effect of the faults under consideration and a formal performance requirement with a possible allowance for degraded performance after the fault. For its obvious practical relevance, the synthesis of fault-tolerant controllers has received extensive attention in the literature, however, with a particular focus on continuous-variable systems. The present paper addresses discrete-event systems and provides an overview on fault-tolerant supervisory control. The discussion is held in terms of formal languages to uniformly present approaches to passive fault-tolerance, active fault-tolerance, post-fault recovery and fault hiding.  相似文献   

10.
可重构电子系统芯片级在线自主容错方法研究   总被引:4,自引:2,他引:2  
可重构电子系统芯片固定型故障的传统容错设计往往采用集中式控制方法,存在测试时间长、硬件资源利用率低、对外部控制器依赖性高等问题。因此,设计了一种具有分布式自主容错能力的可重构细胞阵列,通过将细胞内部查找表输出与参考值进行比较的方式进行循环检测,并利用冗余存储单元对故障查找表进行修复。以四位并行乘法器为例进行仿真验证,实验结果表明,新型可重构阵列的自主容错设计方法,比现有设计的硬件开销小,修复时间短,容错能力强,且设计复杂度不受阵列规模影响。  相似文献   

11.
在分布式存储系统存储数据时,如果一个或几个设备出现故障,不仅该设备中的数据不能使用,而且会导致用户无法完整地访问资源。针对该问题,提出一种基于RS码的错误容忍存储方案,当系统中错误设备的数量不超过m时,就可以对其进行恢复,实现容错。该方案具有较高的安全性与执行效率,能满足存储系统容错的要求,可以利用其构造对可靠性要求较高的存储系统。  相似文献   

12.
针对网络故障以及Web服务节点暂时性失效,导致Web服务请求出错的问题,提出了一种Web服务容错机制.通过建立服务补偿机制和Web服务会话管理机制达到容错的目的.首先对Web服务的容错性问题进行了分析;接着详细论述了服务补偿机制和容错框架的设计与实现方案;最后通过在政务服务集成系统中的应用实例说明了该容错机制能够有效地增强Web服务请求调用的可靠性.  相似文献   

13.
We examine the task of constructing bounded-time self-stabilizing rule-based systems that take their input from an external environment. Bounded response-time and self-stabilization are essential for rule-based programs that must be highly fault-tolerant and perform in a real-time environment. We present an approach for solving this problem using the OPS5 programming language as it is one of the most expressive and widely used rule-based programming languages. Bounded response-time of the program is ensured by constructing the state space graph so that the programmer can visualize the control flow of the program execution. Potential infinite firing sequences, if any, should be detected and the involved rules should be revised to ensure bounded termination. Both the input variables and internal variables are made fault-tolerant from corruption caused by transient faults via the introduction of new self-stabilizing rules in the program. Finally, the timing analysis of the self-stabilizing OPS5 program is shown in terms of the number of rule firings and the comparisons performed in the Rete network.  相似文献   

14.
Runtime verification (RV) is a natural fit for ultra-critical systems that require correct software behavior. Due to the low reliability of commodity hardware and the adversity of operational environments, it is common in ultra-critical systems to replicate processing units (and their hosted software) and incorporate fault-tolerant algorithms to compare the outputs, even if the software is considered to be fault-free. In this paper, we investigate the use of software monitoring in distributed fault-tolerant systems and the implementation of fault-tolerance mechanisms using RV techniques. We describe the Copilot language and compiler that generates monitors for distributed real-time systems, and we discuss two case-studies in which Copilot-generated monitors were used to detect onboard software and hardware faults and monitor air-ground data link messaging protocols.  相似文献   

15.
We consider part of the problem of schema-biased inductive synthesis of recursive logic programs from incomplete specifications, such as clausal evidence (for instance, but not necessarily, ground positive and negative examples). After synthesizing the base clause and introducing recursive call(s) to the recursive clause, it remains to combine the overall result from the partial results obtained through recursion, so as to complete the recursive clause. Evidence for this combination relation can be abduced from the initially given evidence for the top-level relation. A program for this combination relation can be anything, from a single clause performing a unification (such as for lastElem) to multiple guarded clauses performing unifications (such as for filtering programs) to recursive programs (such as for naive reverse). Existing methods cannot induce guarded clause programs for this combination relation from the abduced evidence. Some existing methods cannot even detect that the combination program itself may have to be recursive and thus they then do not recursively invoke themselves the overall recursive program synthesizer. We introduce our Program Completion Method as a suitable extension and generalization of the existing methods. ©1999 John Wiley & Sons, Inc.  相似文献   

16.
当前并发程序容错机制处理方式单一、效率较低。为此,提出一种适用于多种并发程序错误处理的容错机制。通过在编译及运行过程中对程序进行异常处理,并在异常发生时根据设置的检查点对程序进行回滚和防错误处理,以实现并发程序容错。实验结果表明,该容错机制可有效检测并发程序中的错误,在不增加程序总体运行时间的情况下达到比较理想的容错效果。  相似文献   

17.
Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at leastnprocessors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched ethernet.  相似文献   

18.
We pose and study the problem of Byzantine-robust topology discovery in an arbitrary asynchronous network. The problem is an abstraction of fault-tolerant routing. We formally state the weak and strong versions of the problem. The weak version requires that either each node discovers the topology of the network or at least one node detects the presence of a faulty node. The strong version requires that each node discovers the topology regardless of faults. We focus on noncryptographic solutions to these problems. We explore their bounds. We prove that the weak topology discovery problem is solvable only if the connectivity of the network exceeds the number of faults in the system. Similarly, we show that the strong version of the problem is solvable only if the network connectivity is more than twice the number of faults. We present solutions to both versions of the problem. The presented algorithms match the established graph connectivity bounds. The algorithms do not require the individual nodes to know either the diameter or the size of the network. The message complexity of both programs is low polynomial with respect to the network size. We describe how our solutions can be extended to add the property of termination, handle topology changes, and perform neighborhood discovery.  相似文献   

19.
Considera distributed real-time program which is executed on a systemwith a limited set of hardware resources. Assume the programis required to satisfy some timing constraints, despite the occurrenceof anticipated hardware failures. For efficient use of resources,scheduling decisions must be taken at run-time, considering deadlines,the load and hardware failures. The paper demonstrates how toreason about such dynamically scheduled programs in the frameworkof a timed process algebra and modal logic. The algebra providesa uniform process encoding of programs, hardware and schedulers,with an operational semantics of a process depending on the assumptionsabout faults. The logic specifies the timing properties of aprocess and verifies them via this fault-affected semantics,establishing fault-tolerance. The approach lends itself to applicationof existing tools and results supporting reasoning in processalgebras and modal logics.  相似文献   

20.
A Wireless Sensor Network (WSN) is a wireless network consisting of spatially distributed autonomous devices using sensor nodes in a wide range of applications in various domains. In the future, WSNs are expected to be integrated into the “Internet of Things” (IoT), where sensor nodes join the Internet dynamically, and use them to collaborate and accomplish their tasks. Because of the communications of WSN will produce a broadcast storm, the Cluster-based Wireless Sensor Network (CWSN) was proposed to ameliorate the broadcast storm. However, the capability of the fault-tolerance and reliability of CWSNs must be carefully investigated and analyzed. To cope with the influence of faulty components, reaching a common agreement in the presence of faults before performing certain tasks is essential. Byzantine Agreement (BA) problem is a fundamental problem in fault-tolerant distributed systems. To enhance fault-tolerance and reliability of CWSN, the BA problem in CWSN is revisited in this paper. In this paper, a new BA protocol is proposed that adapts to the CWSN and derives its limit of allowable faulty components, while maintaining the minimum number of message exchanges.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号