Similar literature
 Found 20 similar documents (search time: 15 ms)
1.
Fault tolerance in computerized systems involved in production has become an ever more important requirement. Existing fault tolerance approaches, wherever used, deal mainly with hardware faults. Nevertheless, the vast majority of contemporary system failures are software-related. This paper introduces a knowledge-based approach to handling software-related faults occurring in supervisory control systems. These systems are event driven and use data, stored in complex databases, to react to events coming from different kinds of devices by identifying, scheduling, initiating and monitoring operations. Failure of part of the supervisory control system's software to behave rationally when unexpected events occur is called an application fault. The approach introduced in this paper is based on a supervisory control system reference model which reveals the set of all possible application faults together with the major functions of the recovery processes associated with each fault, and leads to a high-level knowledge-based system architecture capable of handling every fault-related condition. This system is called PROFIT (Intelligent PROduction systems Fault Tolerance) and consists of three main components: the fault diagnosis module, the instant fault correction module and the learning module, co-ordinated by a PROFIT meta-level module. The prototype version of PROFIT is analysed, and the development and run-time environments that demonstrate the applicability and effectiveness of the system are presented.

2.
Early error detection and an understanding of the nature and conditions of an error occurrence can be useful for effective and efficient recovery in distributed systems. Various distributed system extensions have been introduced for the implementation of fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we demonstrate its detection capabilities. Our extension is based on the execution of a message validity test prior to the transmission of messages and the piggybacking of contextual information to facilitate the detection and diagnosis of transient faults in the distributed system.

3.
Software diversity is known to improve fault tolerance in N-version software systems by independent development. As the leading cause of software faults, human error is considered an important factor in diversity seeking. However, there is little scientific research focusing on how to seek software fault diversity based on human error mechanisms. A literature review was conducted to extract factors that may differentiate people with respect to human error-proneness. In addition, we constructed a conceptual model of the links between human error diversity and software diversity. An experiment was designed to validate the hypotheses, in the form of a programming contest, accompanied by a survey of cognitive styles and personality traits. One hundred ninety-two programs were submitted for the identical problem, and 70 surveys were collected. Code inspection revealed 23 faults, of which 10 were coincident faults. The results show that personality traits do not seem to be effective predictors of fault diversity as a whole model, whereas cognitive styles and program measurements moderately account for the variation of fault density. The results also show causal relations between performance levels and coincident faults: coincident faults are unlikely to occur at the skill-based performance level; the coincident faults introduced in rule-based performances show a high probability of occurrence; and the coincident faults introduced in knowledge-based performances are shaped by the content and formats of the task itself. Based on these results, we have proposed a model to seek software diversity and prevent coincident faults.

4.
Fault-Tolerant Rate-Monotonic Scheduling   Total citations: 11, self-citations: 0, citations by others: 11
Ghosh, Sunondo; Melhem, Rami; Mossé, Daniel; Sarma, Joydeep Sen. Real-Time Systems, 1998, 15(2): 149-181
Due to the critical nature of the tasks in hard real-time systems, it is essential that faults be tolerated. In this paper, we present a scheme which can be used to tolerate faults during the execution of preemptive real-time tasks. We describe a recovery scheme which can be used to re-execute tasks in the event of single and multiple transient faults and discuss conditions that must be met by any such recovery scheme. We then extend the original Rate Monotonic Scheduling (RMS) scheme and the exact characterization of RMS to provide tolerance for single and multiple transient faults. We derive schedulability bounds for sets of real-time tasks given the desired level of fault tolerance for each task or subset of tasks. Finally, we analyze and compare those bounds with existing bounds for non-fault-tolerant and other variations of RMS.
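
For orientation, the classic (non-fault-tolerant) RMS utilization bound of Liu and Layland, n(2^(1/n) - 1), can be sketched as below. The `reexecutions` parameter is purely an illustrative assumption: it pessimistically inflates every task's worst-case execution time to leave room for re-execution. It is not the bound derived in the paper above, and all function names are hypothetical.

```python
def liu_layland_bound(n: int) -> float:
    """Classic RMS utilization bound n * (2**(1/n) - 1) for n periodic tasks."""
    if n <= 0:
        raise ValueError("n must be positive")
    return n * (2 ** (1.0 / n) - 1)


def utilization(tasks):
    """tasks is a list of (wcet, period) pairs; returns total utilization."""
    return sum(wcet / period for wcet, period in tasks)


def passes_rms_test(tasks, reexecutions=0):
    """Sufficient (not necessary) schedulability test.

    ASSUMPTION for illustration only: each task's WCET is multiplied by
    (1 + reexecutions) to reserve time for re-running it after a fault.
    """
    inflated = [(wcet * (1 + reexecutions), period) for wcet, period in tasks]
    return utilization(inflated) <= liu_layland_bound(len(inflated))


if __name__ == "__main__":
    demo = [(1, 4), (1, 5), (2, 20)]  # hypothetical task set
    print(passes_rms_test(demo))                  # fault-free check
    print(passes_rms_test(demo, reexecutions=1))  # with one re-execution each
```

A task set that fails this test is not necessarily unschedulable, since the utilization bound is only a sufficient condition.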

5.
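Software-intensive equipment often contains many embedded real-time systems responsible for monitoring and control, and these are frequently safety-critical or mission-critical systems. To address software fault detection, diagnosis, and repair in such systems effectively, this paper proposes a multi-agent-based framework for monitoring run-time faults in real-time systems. The framework uses cooperation among multiple agents to build a run-time fault monitoring system that verifies, during operation, whether the system satisfies property specifications expressed in temporal logic, and it applies specific algorithms for fault localization and repair.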

6.
Transient fault tolerance in digital systems   Total citations: 1, self-citations: 0, citations by others: 1
Sosnowski, J. Micro, IEEE, 1994, 14(1): 24-35
It is hard to shield systems effectively from transient faults (fault avoidance techniques), so other means must be employed to assure appropriate levels of transient fault tolerance (insensitivity to transient faults). They are based on fault-masking and fault recovery ideas. Having analyzed this problem, the author identifies critical design points and outlines some practical solutions that refer to efficient on-line detectors (detecting errors during system operation) and error handling procedures. This framework provides a basis for understanding transient fault problems in digital systems. It can be helpful in selecting optimum techniques to mask or eliminate transient fault effects in developed systems.

7.
Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.
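
As a concrete picture of "checks performed on data elements", here is a minimal sketch of the well-known checksum-matrix form of algorithm-based fault tolerance for matrix multiplication. It locates a single erroneous element by intersecting the failing row check and the failing column check. This is an assumed, simplified illustration with integer data and hypothetical function names; it is not the probabilistic model or the check assignments of the paper above.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column."""
    return [list(row) for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row."""
    return [list(row) + [sum(row)] for row in b]


def locate_single_error(c_full):
    """c_full is an (n+1) x (m+1) full-checksum matrix.

    Returns None if every check passes, or the (row, col) of the single
    inconsistent data element. Other error patterns raise ValueError.
    """
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data element")


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    c_full[1][0] += 5  # simulate a faulty processor corrupting one element
    print(locate_single_error(c_full))
```

Integer data keeps the equality checks exact; with floating-point data the comparisons would need a tolerance.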

8.
Atomic actions are an important dynamic structuring technique that aids the construction of fault-tolerant concurrent systems. Although they were developed some years ago, none of the well-known commercially available programming languages directly supports their use. This paper summarizes software fault tolerance techniques for concurrent systems, evaluates the Ada 95 programming language from the perspective of its support for software fault tolerance, and shows how Ada 95 can be used to implement software fault tolerance techniques. In particular, it shows how packages, protected objects, requeue, exceptions, asynchronous transfer of control, tagged types, and controlled types can be used as building blocks from which to construct atomic actions with forward and backward error recovery, which are resilient to deserter tasks and task abortion.

9.
A systematic method of providing software system fault recovery with maximal fault coverage subject to resource constraints of overall recovery cost and additional fault rate is presented. This method is based on a model for software systems which provides a measure of the fault coverage properties of the system in the presence of computer hardware faults. Techniques for system parameter measurements are given. The result is an optimization problem in the form of a doubly-constrained 0-1 knapsack problem. Quantitative results are presented demonstrating the effectiveness of the approach.
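
To make the "doubly-constrained 0-1 knapsack" formulation concrete, the following sketch solves a small instance by dynamic programming over two integer budgets. The mapping of item fields to coverage, recovery cost, and added fault rate, and the integer scaling of those quantities, are assumptions made for illustration; the abstract does not state how the authors solve the problem.

```python
def two_constraint_knapsack(items, cost_budget, rate_budget):
    """Maximize total coverage under two budgets, each item used at most once.

    items: list of (coverage, cost, rate) tuples. cost and rate must be
    non-negative integers (ASSUMPTION: real quantities are pre-scaled).
    """
    # best[c][r] = best coverage achievable with cost <= c and rate <= r
    best = [[0] * (rate_budget + 1) for _ in range(cost_budget + 1)]
    for coverage, cost, rate in items:
        # Iterate budgets downward so each item is counted at most once.
        for c in range(cost_budget, cost - 1, -1):
            for r in range(rate_budget, rate - 1, -1):
                candidate = best[c - cost][r - rate] + coverage
                if candidate > best[c][r]:
                    best[c][r] = candidate
    return best[cost_budget][rate_budget]


if __name__ == "__main__":
    # Hypothetical recovery points: (coverage gained, recovery cost, added fault rate)
    points = [(6, 2, 1), (5, 2, 2), (4, 1, 2)]
    print(two_constraint_knapsack(points, cost_budget=3, rate_budget=3))
```

The table has (cost_budget + 1) x (rate_budget + 1) entries, so this approach is only practical when both budgets can be expressed as reasonably small integers.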

10.
Fault injection techniques and tools   Total citations: 4, self-citations: 0, citations by others: 4
Fault injection is important for evaluating the dependability of computer systems. Researchers and engineers have created many novel methods to inject faults, which can be implemented in both hardware and software. The contrast between the hardware and software methods lies mainly in the fault injection points they can access, the cost and the level of perturbation. Hardware methods can inject faults into chip pins and internal components, such as combinational circuits and registers that are not software-addressable. On the other hand, software methods are convenient for directly producing changes at the software-state level. Thus, we use hardware methods to evaluate low-level error detection and masking mechanisms, and software methods to test higher level mechanisms. Software methods are less expensive, but they also incur a higher perturbation overhead because they execute software on the target system.

11.
We have developed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high-speed wide-area network to achieve scalability and high data throughput. This paper studies different schemes to enhance the reliability and availability of such network-based distributed storage systems. The general approach of this paper employs “erasure” error-correcting codes that can be used to reconstruct missing information caused by hardware, software, or human faults. The paper describes the approach and develops optimized algorithms for the encoding and decoding operations. Moreover, the paper presents techniques for reducing the communication and computation overhead incurred while reconstructing missing data from the redundant information. These techniques include clustering, multidimensional coding, and the full two-dimensional parity schemes. The paper considers trade-offs between redundancy, fault tolerance, and complexity of error recovery.
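
The simplest erasure code, a single XOR parity block per stripe, shows how missing data can be rebuilt from redundant information. The sketch below is an assumed one-dimensional illustration with hypothetical function names; it is not the optimized encoding, clustering, or two-dimensional parity scheme developed in the paper above.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        if len(block) != len(out):
            raise ValueError("all blocks must have the same length")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Return the stripe: the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(stripe):
    """Rebuild a stripe in which exactly one block (data or parity) is None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("single parity tolerates exactly one erasure")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks equals the missing block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired


if __name__ == "__main__":
    stripe = encode([b"abcd", b"efgh", b"ijkl"])
    damaged = list(stripe)
    damaged[1] = None  # simulate a failed data server
    print(reconstruct(damaged) == stripe)
```

Tolerating more than one simultaneous erasure requires additional parity, which is where the redundancy versus recovery-complexity trade-off mentioned in the abstract comes in.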

12.
An approach to fault-tolerant execution of real-time application tasks in hypercubes is proposed. The approach is based on the distributed recovery block (DRB) scheme and does not require special hardware mechanisms in support of fault tolerance. Each task is assigned to a pair of processors forming a DRB computing station for execution in a dual-redundant and self-checking mode. Assignment of all tasks in an application in such a form is called the full DRB mapping. The DRB scheme was developed as an approach to uniform treatment of hardware and software faults with the effect of fast forward recovery. However, if the system developer is concerned with hardware fault possibilities only, then forming DRB stations becomes a mechanical process not burdening the application software designer in any way. A procedure for converting an efficient nonredundant task-to-processor mapping into an efficient full DRB mapping is presented.

13.
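A Fault-Tolerance Configuration Management Method for Middleware Services   Total citations: 1, self-citations: 0, citations by others: 1
Li Junguo, Huang Gang, Zou Jian, Mei Hong. Chinese Journal of Computers (计算机学报), 2007, 30(10): 1696-1704
This paper proposes a fault-tolerance management method based on runtime software architecture, which lets developers and administrators customize suitable fault detection and recovery mechanisms for different middleware service failures. First, the runtime software architecture automatically constructs a component dependency view and an error propagation view, providing a global view for understanding and analyzing the reliability of the whole system. Then, fault-tolerance mechanisms are configured by manipulating the runtime software architecture. Finally, AOP techniques are used to instrument the middleware with these mechanisms so that it acquires the specified fault-tolerance capability. The process is carried out semi-automatically with the help of a visual tool and has been validated on J2EE middleware.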

14.
Failure of a safety-critical system can lead to big losses. Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems. Fault-tolerant software is used to increase the overall reliability of software systems. Fault tolerance is achieved using fault-tolerant schemes such as fault recovery (recovery block scheme), fault masking (N-version programming (NVP)) or a combination of both (hybrid scheme). Such software incorporates the ability of the system to survive even on a failure. Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems. Most of them consider the stable system reliability. Few attempts have been made in reliability modeling to study the reliability growth for an NVP system. Recently, a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency. In this model, a proportion of the number of failures is assumed to be a measure of fault generation, while an appropriate measure of fault generation should be the proportion of faults removed. In this paper, we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation. Using this model, a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system. The proposed model is useful for practical applications and can provide measures of debugging effectiveness and of the additional workload or skilled professionals required. It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost. In this paper, we also formulate the optimal software release time problem for a 3VP system under a fuzzy environment and discuss a fuzzy optimization technique for solving the problem with a numerical illustration.
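
Fault masking in N-version programming is usually pictured as a majority voter over independently developed versions. The sketch below shows that idea for three versions (a 3VP arrangement). The voter, the sentinel, and the sample versions are all hypothetical illustrations; the reliability growth model in the paper above is not reproduced here.

```python
from collections import Counter

_FAILED = object()  # sentinel marking a version that raised an exception


def run_n_versions(versions, x):
    """Run every version on input x and return the strict-majority output.

    Outputs must be hashable. Raises RuntimeError when no output is
    produced by more than half of the versions.
    """
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            outputs.append(_FAILED)  # a crash counts as a lost vote
    votes = Counter(o for o in outputs if o is not _FAILED)
    if votes:
        value, count = votes.most_common(1)[0]
        if 2 * count > len(versions):
            return value
    raise RuntimeError("no majority among versions")


if __name__ == "__main__":
    # Three hypothetical versions of "square x"; the third is faulty.
    versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
    print(run_n_versions(versions, 3))
```

The voter masks the faulty third version only when the other two agree, which is why coincident faults across versions matter so much for NVP reliability.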

15.
This paper discusses the stability of a feasible pre-run-time schedule under a transient overload introduced by process re-execution during an error recovery action. It shows that the stability of a schedule strictly tuned to meet hard deadlines is very small, thus invalidating backward error recovery. However, the stability of the schedule always increases when a real-time process is considered as having a nominal and a hard deadline separated by a non-zero grace time. This is true for sets of processes having arbitrary precedence and exclusion constraints, and executed on a single-processor or multiprocessor architecture. Grace time is not just the key element for the realistic estimation of the timing constraints of real-time error processing techniques. It also allows backward error recovery to be included in very efficient pre-run-time scheduled systems when the conditions stated in this paper are satisfied. This is a very important conclusion, as it shows that fault-tolerant hard real-time systems do not have to be extremely expensive and complex.

16.
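A Hybrid Fault-Tolerant Priority Assignment Search Algorithm   Total citations: 1, self-citations: 0, citations by others: 1
In real-time systems, a task that fails to produce a correct result on time can lead to catastrophic consequences, so fault tolerance is critical to the effectiveness and reliability of such systems. Based on schedulability analysis using worst-case response time computation, this paper proposes a hybrid fault-tolerant priority assignment search algorithm. By allowing alternate tasks to run at either higher or lower priority levels, the algorithm effectively improves the system's fault-tolerance capability. Experimental tests show that it is more effective at improving fault tolerance than comparable algorithms known to date.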

17.
Real-time systems (RTS) are those whose correctness depends on satisfying the required functional as well as the required temporal properties. Due to the criticality of such systems, recovery from faults is an essential part of a RTS. In many systems, such as those supporting space applications, single event upsets (SEUs) are the prevalent type of faults; SEUs are transient faults and affect a single task at a time. We present a scheme to guarantee that the execution of real-time tasks can tolerate SEUs and intermittent faults assuming any queue-based scheduling technique. Three algorithms are presented to solve the problem of adding fault tolerance to a queue of real-time tasks by reserving sufficient slack in a schedule so that recovery can be carried out before the task deadline without compromising guarantees given to other tasks. The first algorithm is a dynamic programming optimal solution, the second is a linear-time heuristic for scheduling dynamic tasks, and the third algorithm comprises extensions to address queues with gaps between tasks (gaps are caused by precedence, resource, or timing constraints). We show through simulations that the heuristics closely approximate the optimal algorithm. Finally, we describe the implementation of the modified admission control algorithm, non-preemptive scheduler, and recovery mechanism in the FT-RT-Mach operating system.

18.
Complex real-time system design needs to address dependability requirements, such as safety, reliability, and security. We introduce a modelling and simulation based approach which allows for the analysis and prediction of dependability constraints. Dependability can be improved by making use of fault tolerance techniques. The de-facto example, in the real-time system literature, of a pump control system in a mining environment is used to demonstrate our model-based approach. In particular, the system is modelled using the Discrete EVent system Specification (DEVS) formalism, and then extended to incorporate fault tolerance mechanisms. The modularity of the DEVS formalism facilitates this extension. The simulation demonstrates that the employed fault tolerance techniques are effective. That is, the system performs satisfactorily despite the presence of faults. This approach also makes it possible to make an informed choice between different fault tolerance techniques. Performance metrics are used to measure the reliability and safety of the system, and to evaluate the dependability achieved by the design. In our model-based development process, modelling, simulation and eventual deployment of the system are seamlessly integrated.

19.
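To address the analysis of faults caused by interactions between software and hardware in computer systems, this paper proposes a Petri-net-based software-hardware fault model that expresses the complex process by which software faults and hardware faults interact. On this basis, formal definitions of software, hardware, and software-hardware fault modes are given. Based on the characteristics of software-hardware fault modes and the fault propagation process, a software-hardware fault identification algorithm is proposed. Results on an example show that the model and algorithm can accurately analyze and identify software-hardware faults, offering a new approach to reliability analysis of computer systems.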

20.
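In hard real-time operating systems, CPU utilization is an important indicator of system performance; if one task occupies all of the CPU, other tasks cannot continue to run, with catastrophic consequences for the system. Analysis of how software runs in a real-time operating system shows that system design needs to adopt fault-tolerance strategies to improve reliability and fault tolerance. Tasks in flight control software are monitored in real time under the μC/OS-II real-time operating system. First, a method for computing CPU utilization under μC/OS-II is given, and a reasonable CPU monitoring period is proposed. Second, a fault detection algorithm for abnormal CPU utilization is given, and detected faults are handled to improve the system's fault tolerance. Finally, four methods for handling abnormal CPU utilization are verified by writing embedded flight control software on an MPC5674 flight control computer. Simulation results show that the software fault-tolerance method for the CPU in a real-time operating system can effectively improve system reliability and fault tolerance.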
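
The abstract does not give its formulas, so the following is only a sketch of one common idle-counter approach to estimating CPU utilization, plus a simple persistence check for flagging abnormal utilization. The 90% threshold, the three-sample window, and all names are assumptions for illustration, written in Python for readability instead of the embedded C such a system would use.

```python
def cpu_usage_percent(idle_count, idle_count_max):
    """Estimate CPU usage from an idle-task counter.

    idle_count: increments of the idle counter during one monitoring period.
    idle_count_max: increments observed when only the idle task runs
    (ASSUMPTION: calibrated once at start-up).
    """
    if idle_count_max <= 0:
        raise ValueError("idle_count_max must be positive")
    idle_count = min(max(idle_count, 0), idle_count_max)  # clamp to valid range
    return 100.0 * (1.0 - idle_count / idle_count_max)


class UsageMonitor:
    """Flags a fault when usage stays above a threshold for several periods."""

    def __init__(self, threshold=90.0, max_consecutive=3):
        self.threshold = threshold            # hypothetical limit, in percent
        self.max_consecutive = max_consecutive
        self.consecutive = 0

    def sample(self, usage):
        """Record one usage sample; return True if a fault should be raised."""
        if usage > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.max_consecutive


if __name__ == "__main__":
    monitor = UsageMonitor()
    for idle in (50, 40, 500, 30, 20, 10):  # hypothetical idle counts per period
        usage = cpu_usage_percent(idle, 1000)
        print(usage, monitor.sample(usage))
```

Requiring several consecutive high samples keeps a single busy period from being treated as a fault; how the fault is then handled is a separate design decision.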
