Similar literature
1.
The authors address the problem of validating the dependability of fault-tolerant computing systems, in particular the validation of the fault-tolerance mechanisms. The proposed approach is based on fault injection at the physical level on a hardware/software prototype of the system considered. The place of this approach in a validation-directed design process, and its relation to other work on fault injection, is clearly identified. The major requirements and problems related to the development and application of a validation methodology based on fault injection are presented and discussed. Emphasis is put on the definition, analysis, and use of the experimental dependability measures that can be obtained. The proposed methodology has been implemented as a general pin-level fault injection tool (MESSALINE), and its usefulness is demonstrated by applying MESSALINE to the experimental validation of two systems: a subsystem of a centralized computerized interlocking system for railway control applications, and a distributed system corresponding to the current implementation of the dependable communication system of the ESPRIT Delta-4 Project.

2.
This paper presents the design and implementation of Jgroup/ARM, a distributed object group platform with autonomous replication management, along with a novel measurement-based assessment technique that is used to validate the fault-handling capability of Jgroup/ARM. Jgroup extends Java RMI through the group communication paradigm and has been designed specifically for application support in partitionable systems. ARM aims at improving the dependability characteristics of systems through a fault-treatment mechanism. Hence, ARM focuses on deployment and operational aspects, where the gain in terms of improved dependability is likely to be the greatest. The main objective of ARM is to localize failures and to reconfigure the system according to application-specific dependability requirements. Combining Jgroup and ARM can significantly reduce the effort necessary for developing, deploying and managing dependable, partition-aware applications. Jgroup/ARM is evaluated experimentally to validate its fault-handling capability: the recovery performance of a system deployed in a wide area network is measured. In this experiment, multiple nearly coincident reachability changes are injected to emulate network partitions separating the service replicas. The results show that Jgroup/ARM is able to recover applications to their initial state in several realistic failure scenarios, including multiple, concurrent network partitionings. Copyright © 2007 John Wiley & Sons, Ltd.

3.
Markov nets: probabilistic models for distributed and concurrent systems   Total citations: 1, self-citations: 0, citations by others: 1
For distributed systems, i.e., large complex networked systems, there is a drastic difference between a local view and knowledge of the system and its global view. Distributed systems have local state and time, but do not possess global state and time in the usual sense. In this paper, motivated by the monitoring of distributed systems and in particular of telecommunications networks, we develop a generalization of Markov chains and hidden Markov models for distributed and concurrent systems. By a concurrent system, we mean a system in which components may evolve independently, with sparse synchronizations. We follow a so-called true concurrency approach, in which neither global state nor global time is available. Instead, we use only local states in combination with a partial order model of time. Our basic mathematical tool is that of Petri net unfoldings.

4.
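Research and analysis of optimized software fault injection schemes   Total citations: 1, self-citations: 0, citations by others: 1
This paper studies problems arising in the experiments and analysis of software-implemented fault injection (SWIFI) based on spatial injection techniques. Two injection modes based on spatial injection are proposed and designed: a waiting mode and an impact mode. An injection algorithm is designed for each mode, and fault injection experiments are carried out with both. The experiments focus on how different spatial distributions of the injection addresses affect the results. Fault injection experiments under different probability distributions of the spatial addresses are discussed and analyzed in detail. Based on the experimental results, the paper derives and proves the conclusion that, for the two injection algorithms implemented with spatial injection, there always exists a relatively better injection scheme when performing software fault injection experiments.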

5.
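The time-triggered protocol is the communication protocol required by the time-triggered architecture (TTA); it interconnects electronic modules in distributed fault-tolerant real-time systems that demand high reliability. At present, time-triggered controllers, an important component of time-triggered communication systems, mainly use a processor to handle the protocol, so the protocol overhead is relatively high. The FPGA-based time-triggered protocol controller design adopts an encoding scheme with good synchronization capability and a reasonable frame format, and optimizes the protocol-processing state machine on the basis of an established global time base. By exploiting the parallel processing capability of the FPGA, it reduces protocol overhead and increases bus efficiency, while also improving clock synchronization precision and fault tolerance. Simulation results show that the FPGA-based time-triggered protocol controller performs well.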

6.
This study investigates observer design schemes for interconnected nonlinear systems with actuator faults, sensor faults, external disturbances, and limited measurement resources. A novel distributed estimation scheme is presented for the interconnected nonlinear system to estimate the states, faults, and lumped disturbances simultaneously. To save communication resources and improve information utilization, an adaptive event condition is designed in the sensor channel, and the triggered values are used to design the observer. In particular, to handle the sensor fault, the output is separated into two parts, and the estimation is realized with the help of the fault-free part. In the first part of this study, a class of interconnected nonlinear systems with a partial-loss-of-effectiveness sensor fault is considered, and an event-based distributed estimation scheme is established. In the second part, a more general class of feedback interconnected nonlinear systems with both partial-loss and bias sensor faults is investigated. An augmented system is formulated using an augmented vector composed of the state and the sensor faults, and the estimation scheme is then realized using the presented event-based distributed observer. Convergence is proved in both cases, and the estimation performance of the presented observer is verified on a numerical simulation system and an inverted pendulum system.

7.
A dependability model for TMR system   Total citations: 1, self-citations: 0, citations by others: 1
Much research has been done on the dependability evaluation of computer systems. However, much of this work has gone no further than studying the fault coverage of such systems, with little focus on the relationship between fault coverage and overall system dependability. In this paper, a Markovian dependability model for a triple-modular-redundancy (TMR) system is presented. The model takes full account of the effects of fault coverage, working time, and the constant failure rate of a single module on the dependability of the target TMR system, and is built on a stepwise degradation strategy. Through the model, the relationship between the fault coverage and the dependability of the system is determined. Moreover, with the fault coverage set, the dependability of the system can be predicted dynamically and precisely at any given time. This benefits dependability evaluation and improvement, and helps with system design and maintenance.
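As a rough illustration of how fault coverage can enter a Markov reliability model, the sketch below is an assumption-laden example and not the model from the paper. It assumes a hypothetical degradation chain 3 modules → 2 modules → 1 module → failed, a constant per-module failure rate `lam`, and a coverage `c` that is the probability that a module failure is handled so the system degrades instead of failing outright. Under those assumptions the state probabilities have a closed form; the function name `tmr_reliability` is made up for this example.

```python
import math

def tmr_reliability(t, lam, c):
    """Reliability at time t of a hypothetical stepwise-degrading TMR chain.

    Assumed states: 3 working modules -> 2 -> 1 -> failed.
    lam -- constant failure rate of a single module (assumption)
    c   -- fault coverage: probability that a module failure is handled,
           so the system degrades by one step instead of failing
    """
    r = math.exp(-lam * t)                # survival probability of one module
    p3 = r ** 3                           # still in the full TMR state
    p2 = 3 * c * (r ** 2 - r ** 3)        # degraded once, two modules left
    p1 = 3 * c ** 2 * r * (1 - r) ** 2    # degraded twice, one module left
    return p3 + p2 + p1                   # probability of not having failed

if __name__ == "__main__":
    for c in (0.90, 0.99, 1.00):
        print(c, tmr_reliability(1000.0, 1e-4, c))
```

With `c = 1` the expression reduces to `1 - (1 - r)**3`, the reliability of three modules in parallel, and with `c = 0` it reduces to `r**3`, which gives two quick sanity checks on the sketch.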

8.
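FTT-1: Design and implementation of a hardware-based fault injector   Total citations: 3, self-citations: 0, citations by others: 3
Fault injection is an important experimental method for evaluating the dependability of computer systems, and building a fault injector is an important part of fault injection research. This paper presents the design and implementation of FTT-1, a hardware-based fault injector. It discusses the key techniques for designing and implementing a hardware fault injector and describes how they are addressed in the implementation of FTT-1. Experimental results demonstrate the effectiveness of FTT-1 for evaluating the dependability of fault-tolerant computer systems.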

9.
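Coverage is an important concept in the design and evaluation of fault-tolerant systems. This paper describes how to build and solve coverage models using behavioral decomposition, and summarizes several issues that must be considered when estimating coverage with fault injection. Using the evaluation of a dual-computer fault-tolerant system as an example, it illustrates the application of these methods to the reliability evaluation of fault-tolerant systems.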
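One common way to estimate coverage from a fault injection campaign is to take the fraction of injected faults that the system handled and attach a confidence interval to it. The sketch below is a generic illustration rather than the method of the paper: it assumes the injected faults are an unbiased sample of the faults of interest, and it uses the Wilson score interval. The function name `coverage_estimate` and its parameters are hypothetical.

```python
import math

def coverage_estimate(handled, injected, z=1.96):
    """Return (point estimate, lower bound, upper bound) for fault coverage.

    handled  -- injected faults that the system detected and recovered from
    injected -- total number of injected faults (must be positive)
    z        -- normal quantile; 1.96 gives roughly a 95% interval
    """
    if injected <= 0 or not 0 <= handled <= injected:
        raise ValueError("need 0 <= handled <= injected and injected > 0")
    p = handled / injected
    denom = 1 + z * z / injected
    # Wilson score interval: centre and half-width
    center = (p + z * z / (2 * injected)) / denom
    half = z * math.sqrt(p * (1 - p) / injected
                         + z * z / (4 * injected ** 2)) / denom
    return p, max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    print(coverage_estimate(950, 1000))
```

The interval narrows as more faults are injected, which makes the cost of a campaign visible: a tighter bound on coverage needs a larger number of experiments.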

10.
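Software DSM (distributed shared memory) systems build a shared-memory programming environment on clusters, combining the ease of programming of shared memory with the scalability of clusters, and have attracted extensive research. Because a software DSM system is a distributed system with a high risk of failure, fault-tolerance techniques are needed to make it practical. Using user-level checkpointing, a recoverable and highly portable software DSM system, JIACKPT (JIAjia with ChecKPoinTing), is designed and implemented on top of JIAJIA, a software DSM system that supports the scope consistency memory model. Because it adopts a strong globally consistent state suited to software DSM systems, together with several optimizations, JIACKPT is easy to implement and achieves good performance. Application tests on an 8-node PC cluster show that, even when a checkpoint is taken every minute, the checkpointing overhead is below 10% for most applications. JIACKPT is also highly portable. These results indicate that JIACKPT has become a fairly practical system.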

11.
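Markov processes are used to analyze the dependability of intrusion-tolerant systems. Drawing on the state transition model of the SITAR intrusion-tolerant architecture, a stochastic-process-based method is given for quantifying the availability aspect of the dependability of an intrusion-tolerant system. Finally, the properties of intrusion-tolerant systems are discussed on this basis.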

12.
The growth in coordinated network attacks such as scans, worms and distributed denial-of-service (DDoS) attacks is a profound threat to the security of the Internet. Collaborative intrusion detection systems (CIDSs) have the potential to detect these attacks by enabling all the participating intrusion detection systems (IDSs) to share suspicious intelligence with each other and form a global view of the current security threats. Current correlation algorithms in CIDSs are either too simple to capture the important characteristics of attacks, or too computationally expensive to detect attacks in a timely manner. We propose a decentralized, multi-dimensional alert correlation algorithm for CIDSs to address these challenges. A multi-dimensional alert clustering algorithm is used to extract the significant intrusion patterns from raw intrusion alerts. A two-stage correlation algorithm is used, which first clusters alerts locally at each IDS before reporting significant alert patterns to a global correlation stage. We introduce a probabilistic approach to decide when a pattern at the local stage is sufficiently significant to warrant correlation at the global stage. We then implement the proposed two-stage correlation algorithm in a fully distributed CIDS. Our experiments on a large real-world intrusion data set show that our approach can achieve a significant reduction in the number of alert messages generated by the local correlation stage, with negligible false negatives compared to a centralized scheme. The proposed probabilistic threshold approach achieves a significant improvement in detection accuracy in a stealthy attack scenario, compared to a naive scheme that uses the same threshold at the local and global stages. A large-scale experiment on PlanetLab shows that our decentralized architecture is significantly more efficient than a centralized approach in terms of the time required to correlate alerts.

13.
Scheduling is a key component for performance guarantees when distributed applications run in large-scale heterogeneous environments. Another function of the scheduler in such systems is to implement resilience mechanisms that cope with possible faults. In this case resilience is best approached using dedicated rescheduling mechanisms. The performance of rescheduling is very important in the context of large-scale distributed systems and dynamic behavior. The paper proposes a generic rescheduling algorithm. The algorithm can use a wide variety of scheduling heuristics that can be selected by users in advance, depending on the system’s structure. The rescheduling component is designed as a middleware service that aims to increase the dependability of large-scale distributed systems. The system was evaluated in a real-world implementation for a Grid system. The proposed approach supports fault tolerance and offers an improved mechanism for resource management. The evaluation of the proposed rescheduling algorithm was performed using modeling and simulation. We present experimental results confirming the performance and capabilities of the proposed rescheduling algorithm.

14.
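Research on the dependability of distributed systems   Total citations: 2, self-citations: 0, citations by others: 2
This paper first introduces techniques and methods for building highly dependable computer systems, and then surveys the current state of research on the dependability of distributed systems and the open problems.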

15.
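In distributed environments, most intrusion detection systems lack security measures for their own components. In this paper, we propose a model of a distributed, attack-resistant intrusion detection system based on mobile agents, and analyze its key techniques. The model offers some guidance for building a secure intrusion detection system.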

16.
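Reliability design of a PROFIBUS-based double-sided shear control system   Total citations: 1, self-citations: 0, citations by others: 1
For a PROFIBUS-based PLC control system of a double-sided shear line that combines strict real-time requirements with the coordinated motion of multiple machines, the factors that make the system unreliable are analyzed, and a complete set of fault prevention and fault tolerance schemes is proposed to improve its reliability. Fault prevention is achieved mainly through hardware structure design, while fault tolerance is achieved mainly in software, drawing on the inherent features of PROFIBUS-DP and the strong diagnostic capability of the CPU. The main methods are anti-interference design, redundancy design, fault-tolerant protocol mechanisms, fail-safe techniques, and program testing. The stable operation of the system demonstrates the effectiveness of the reliability design.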

17.
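An autonomic dependability model for mission-critical systems and its quantitative analysis   Total citations: 1, self-citations: 0, citations by others: 1
Combining existing intrusion tolerance and self-destruction techniques with autonomic computing, an autonomic dependability model for mission-critical systems based on SM-PEPA (semi-Markov performance evaluation process algebra) is proposed to support formal analysis and reasoning. The model has a degree of self-management capability and handles dependability threats of varying severity in a graded manner, meeting the special dependability requirements of mission-critical systems. On this basis, a method for measuring autonomic dependability is proposed from the perspective of steady-state probability. Finally, the influence of the model parameters on autonomic dependability is given a preliminary analysis using a concrete example. The experimental results show that increasing the detection rate of dependability threats and the success rate of self-recovery can improve the autonomic dependability of a mission-critical system over a fairly wide range.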

18.
Fault-tolerant control problems have been extensively studied in all kinds of control systems. However, there is little work on fault-tolerant control for distributed parameter systems. In this paper, a novel adaptive fault-tolerant boundary control scheme is proposed for a nonlinear flexible aircraft wing system subject to actuator faults. The whole system is regarded as a distributed parameter system, and the dynamic model of the flexible wing system is described by a set of partial differential equations (PDEs) and ordinary differential equations (ODEs). The proposed controller is designed using Lyapunov's direct method and adaptive control strategies. Based on the online estimation of actuator faults, the adaptive controller parameters update automatically to compensate for the actuator faults of the system. In addition, a fault-tolerant controller is developed for this system in the presence of external disturbances. Unlike existing work on adaptive fault-tolerant control, the adaptive controller presented in this paper is designed for a distributed parameter system. Finally, numerical simulations are carried out to illustrate the effectiveness of the proposed control scheme.

19.
Advanced automotive control applications such as steer-by-wire are typically implemented as distributed systems comprising many embedded processors, sensors, and actuators interacting via a communication bus. They have severe cost constraints, but demand a high level of safety and performance. Motivated by the need for timely diagnosis of faulty actuators in such systems, we present a method to achieve distributed failure diagnosis under deadline and resource constraints. Actuators are diagnosed in a distributed fashion by processors to provide a global view of their fault status. The integration of software-based tests for actuator diagnosis within the overall control application is studied. These tests are implemented using analytical redundancy and execute concurrently with the control tasks. The test scheduling problem is then formulated and solved to guarantee actuator diagnosis within designer-specified deadlines while meeting control performance goals. As a secondary objective, the scheduling algorithm also reduces the number of processors required for diagnosis. We demonstrate the practicality of the proposed diagnosis approach by applying it to a steer-by-wire example to identify failed actuators in a timely fashion.

20.
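Fault injection is an effective technique in BIT software testing. To address problems encountered in board-level BIT software testing, a method for simulating processor faults based on the open-source emulator QEMU is presented. The method builds simulation models of several kinds of processor faults and extends QEMU with a fault behavior simulation module and a fault injection module, yielding BitVaSim, a system-level simulator with processor fault injection capability. First, the functional fault modes of the processor are analyzed, key-value pairs describing each fault are extracted, and faults are defined with XML Schema for use in fault modeling. Second, the QEMU code is extended to simulate processor fault behavior. Third, a configurable fault injection interface provides fault mode matching and condition-triggered fault activation while the simulator is running. Finally, experimental cases are used to observe the fault behavior of the simulator and to evaluate this simulator-based fault injection technique. The experimental process and results show that the method is effective and feasible.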
