Similar literature
 20 similar documents found
1.
As software and software-intensive systems are becoming increasingly ubiquitous, the impact of failures can be tremendous. In some industries such as aerospace, medical devices, or automotive, such failures can cost lives or endanger mission success. Software faults can arise due to the interaction between the software, the hardware, and the operating environment. Unanticipated environmental changes lead to software anomalies that may have significant impact on the overall success of the mission. Latent coding errors can trigger faults at any time during system operation, even though a significant effort has usually been expended in verification and validation (V&V) of the software system. It is becoming increasingly apparent that pre-deployment V&V is not enough to guarantee that a complex software system meets all safety, security, and reliability requirements. Software Health Management (SWHM) is a new field that is concerned with the development of tools and technologies to enable automated detection, diagnosis, prediction, and mitigation of adverse events due to software anomalies, while the system is in operation. The prognostic capability of the SWHM to detect and diagnose failures before they happen will yield safer and more dependable systems for the future. This paper addresses the motivation, needs, and requirements of software health management as a new discipline and motivates the need for SWHM in safety-critical applications.

2.
Software health management (SWHM) techniques complement the rigorous verification and validation processes that are applied to safety-critical systems prior to their deployment. These techniques are used to monitor deployed software in its execution environment, serving as the last line of defense against the effects of a critical fault. SWHM monitors use information from the specification and implementation of the monitored software to detect violations, predict possible failures, and help the system recover from faults. Changes to the monitored software, such as adding new functionality or fixing defects, therefore, have the potential to impact the correctness of both the monitored software and the SWHM monitor. In this work, we describe how the results of a software change impact analysis technique, Directed Incremental Symbolic Execution (DiSE), can be applied to monitored software to identify the potential impact of the changes on the SWHM monitor software. The results of DiSE can then be used by other analysis techniques, e.g., testing, debugging, to help preserve and improve the integrity of the SWHM monitor as the monitored software evolves.

3.
The electrical wiring interconnection system (EWIS) of civil aircraft has received increasing attention in recent years, and intermittent failure detection of electrical connectors in EWIS is a challenging problem. This paper presents a sliding mode observer (SMO) approach for the intermittent failure detection of an aircraft electrical system with multiple connector failures. The mathematical model of the aircraft electrical system which contains multiple connector failures is established for transforming the intermittent failure detection problem into observer-based multiplicative fault isolation and estimation problems. A set of adaptive sliding mode observers is designed to locate the failed connectors preliminarily; the observers can adapt to the unknown upper bound of the faults. Furthermore, a fault-reconstruction scheme applying the equivalent output error injection principle is proposed for fault estimation, where the characteristic parameters of connectors are reconstructed to identify the failures. Finally, a numerical example is provided to show the effectiveness of the proposed method.

4.
This paper deals with evaluation of the dependability (considered as a generic term, whose main measures are reliability, availability, and maintainability) of software systems during their operational life, in contrast to most of the work performed up to now, devoted mainly to development and validation phases. The failure process due to design faults, and the behavior of a software system up to the first failure and during its life cycle are successively examined. An approximate model is derived which enables one to account for the failures due to the design faults in a simple way when evaluating a system's dependability. This model is then used for evaluating the dependability of 1) a software system tolerating design faults, and 2) a computing system with respect to physical and design faults.

5.
Failure of a safety-critical system can lead to big losses. Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems. Fault-tolerant software is used to increase the overall reliability of software systems. Fault tolerance is achieved using fault-tolerant schemes such as fault recovery (recovery block scheme), fault masking (N-version programming (NVP)) or a combination of both (hybrid scheme). Such software incorporates the ability of the system to survive even on a failure. Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems. Most of them consider the stable system reliability. Few attempts have been made in reliability modeling to study the reliability growth for an NVP system. Recently, a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency. In this model, a proportion of the number of failures is assumed to be a measure of fault generation, while an appropriate measure of fault generation should be the proportion of faults removed. In this paper, we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation. Using this model, a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system. The proposed model is useful for practical applications and can provide measures of debugging effectiveness and of the additional workload or skilled professionals required. It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost. In this paper, we also formulate the optimal software release time problem for a 3VP system under a fuzzy environment and discuss a fuzzy optimization technique for solving the problem with a numerical illustration.

6.
This paper introduces a novel verification framework for Prognostics and Health Management (PHM) systems. Critical aircraft, spacecraft and industrial systems are required to perform robustly, reliably and safely. They must integrate hardware and software tools intended to detect and identify incipient failures and predict the remaining useful life (RUL) of failing components. Furthermore, it is desirable that non-catastrophic faults be accommodated; that is, fault-tolerant or contingency management algorithms should be developed that will safeguard the operational integrity of such assets for the duration of the emergency. It is imperative, therefore, that models and algorithms designed to achieve these objectives be verified before they are validated and implemented on-board an aircraft. This paper develops a verification approach that builds upon concepts from system analysis, specification definition, system modeling, and Monte Carlo simulations. The approach is implemented in a hierarchical structure at various levels from component to system safety. Salient features of the proposed methodology are illustrated through its application to a spacecraft propulsion system.

7.
8.
Based on extensive field failure data for Tandem's GUARDIAN operating system, the paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.

9.
The benefits of the analysis of software faults and failures have been widely recognized. However, detailed studies based on empirical data are rare. In this paper, we analyze the fault and failure data from two large, real-world case studies. Specifically, we explore: 1) the localization of faults that lead to individual software failures and 2) the distribution of different types of software faults. Our results show that individual failures are often caused by multiple faults spread throughout the system. This observation is important since it does not support several heuristics and assumptions used in the past. In addition, it clearly indicates that finding and fixing faults that lead to such software failures in large, complex systems are often difficult and challenging tasks despite the advances in software development. Our results also show that requirement faults, coding faults, and data problems are the three most common types of software faults. Furthermore, these results show that, contrary to popular belief, a significant percentage of failures are linked to late life cycle activities. Another important aspect of our work is that we conduct intra- and interproject comparisons, as well as comparisons with the findings from related studies. The consistency of several main trends across software systems in this paper and several related research efforts suggests that these trends are likely to be intrinsic characteristics of software faults and failures rather than project specific.

10.
Fault diagnosis and prediction for complex control systems rely either on the collection of rich data for training neural networks or on the system models and prior knowledge of faults. These methods are difficult to apply directly in complex integrated systems due to the large uncertainties in practical scenarios. A new fault diagnosis and prediction technique that is based on extended state observer (ESO) and a hidden Markov model (HMM) for control systems is proposed in this paper. Real-time and predictive information that is obtained by ESO of the active disturbance rejection control (ADRC) is utilized to improve the HMM method for the fault prediction of control systems with large uncertain disturbances. The proposed approach realizes a high recognition rate with a small demand for data, and the dependence on the system model is weak without prior knowledge of faults. Fault prediction of the control system output can be realized without additional sensors. The proposed solution is evaluated in simulations of an asynchronous servo motor control system against the traditional control method and the ADRC control. The results indicate that the proposed method performs well in fault prediction and outperforms the traditional method in terms of control when disturbances and failures occur.

11.
Complex engineering systems, such as aircraft, industrial processes, and transportation systems, are experiencing a paradigm shift in the way they are operated and maintained. Instead of traditional scheduled or breakdown maintenance practices, they are maintained on the basis of their current state/condition. Condition-Based Maintenance (CBM) is becoming the preferred practice since it improves significantly the reliability, safety and availability of these critical systems. CBM enabling technologies include sensing and monitoring, information processing, fault diagnosis and failure prognosis algorithms that are capable of detecting accurately and in a timely manner incipient failures and predicting the remaining useful life of failing components. If such technologies are to be implemented on-line and in real-time, it is essential that an integrating system architecture be developed that possesses features of modularity, flexibility and interoperability while exhibiting attributes of computational efficiency for both on-line and off-line applications. This paper presents a .NET framework as the integrating software platform linking all constituent modules of the fault diagnosis and failure prognosis architecture. The inherent characteristics of the .NET framework provide the proposed system with a generic architecture for fault diagnosis and failure prognosis for a variety of applications. Functioning as data processing, feature extraction, fault diagnosis and failure prognosis, the corresponding modules in the system are built as .NET components that are developed separately and independently in any of the .NET languages. With the use of Bayesian estimation theory, a generic particle-filtering-based framework is integrated in the system for fault diagnosis and failure prognosis. The system is tested in two different applications: bearing spalling fault diagnosis and failure prognosis, and brushless DC motor turn-to-turn winding fault diagnosis. The results suggest that the system is capable of meeting performance requirements specified by both the developer and the user for a variety of engineering systems.

12.
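Faults in software systems are unavoidable. How to enable a software system to keep providing continuous, error-free service when it suffers internal errors or external disturbances is a theoretical problem that urgently needs to be solved. The static and dynamic function-call networks of large software exhibit small-world and scale-free properties. Based on a cascading failure model built on coupled map lattices, the formation mechanism and propagation behavior of cascading failures in software systems are analyzed, in order to improve the trustworthiness of software testing based on key nodes.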
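
The abstract above does not state its equations, so the following is only a minimal sketch of a coupled-map-lattice cascading failure model of the kind it refers to. It assumes the commonly used formulation with a logistic local map, a coupling strength `eps`, and the rule that a node fails once its state reaches 1 and stays at 0 afterwards. The function names, the `shock` parameter, and the three-node network are hypothetical and chosen for illustration.

```python
def logistic(x):
    # Chaotic logistic map, assumed here as the local dynamics of each node.
    return 4.0 * x * (1.0 - x)


def cml_step(states, adjacency, eps, failed, shock=None):
    """Advance all nodes by one time step.

    states    : list of current node states
    adjacency : 0/1 matrix, adjacency[i][j] == 1 if i and j are coupled
    eps       : coupling strength between 0 and 1
    failed    : set of nodes that failed in an earlier step (state forced to 0)
    shock     : optional dict {node: external perturbation added this step}
    Returns (new_states, failed_after_step).
    """
    shock = shock or {}
    new_states = []
    for i, x in enumerate(states):
        if i in failed:
            new_states.append(0.0)  # a failed node stays at 0
            continue
        neighbors = [j for j, a in enumerate(adjacency[i]) if a]
        coupling = 0.0
        if neighbors:
            coupling = sum(logistic(states[j]) for j in neighbors) / len(neighbors)
        value = abs((1.0 - eps) * logistic(x) + eps * coupling)
        new_states.append(value + shock.get(i, 0.0))
    # A node whose state reaches 1 fails. Its over-threshold value is kept for
    # one step so that it can disturb its neighbors, then it is forced to 0.
    newly_failed = {i for i, v in enumerate(new_states)
                    if i not in failed and v >= 1.0}
    return new_states, failed | newly_failed


# Hypothetical fully connected three-node "call network".
adjacency = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
states = [0.2, 0.3, 0.4]
states1, failed1 = cml_step(states, adjacency, 0.5, set(), shock={0: 2.0})
states2, failed2 = cml_step(states1, adjacency, 0.5, failed1)
```

In this toy run the perturbed node fails in the first step and its over-threshold state pushes both neighbors past the threshold in the second step, which is the propagation behavior the abstract describes qualitatively.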

13.
With the increasing size and complexity of software in embedded systems, software has now become a primary threat to reliability. Several mature conventional reliability engineering techniques exist in the literature, but traditionally these have primarily addressed failures in hardware components and usually assume the availability of a running system. Software architecture analysis methods aim to analyze the quality of a software-intensive system early, at the software architecture design level and before the system is implemented. We propose a Software Architecture Reliability Analysis Approach (SARAH) that benefits from mature reliability engineering techniques and scenario-based software architecture analysis to provide an early software reliability analysis at the architecture design level. SARAH defines the notion of a failure scenario model that is based on the Failure Modes and Effects Analysis method (FMEA) in the reliability engineering domain. The failure scenario model is applied to represent so-called failure scenarios that are utilized to derive fault tree sets (FTS). Fault tree sets are utilized to provide a severity analysis for the overall software architecture and the individual architectural elements. Unlike conventional reliability analysis techniques, which prioritize failures based on criteria such as safety concerns, SARAH prioritizes failure scenarios based on severity from the end-user perspective. SARAH results in a failure analysis report that can be utilized to identify architectural tactics for improving the reliability of the software architecture. The approach is illustrated using an industrial case for analyzing reliability of the software architecture of the next release of a Digital TV.

14.
Demands on software reliability and availability have increased tremendously due to the nature of present day applications. We focus on the aspect of software for the high availability of application servers since the unavailability of servers more often originates from software faults rather than hardware faults. The software rejuvenation technique has been widely used to avoid the occurrence of unplanned failures, mainly due to the phenomena of software aging or caused by transient failures. In this paper, we first present a new way of using virtual machine based software rejuvenation, named VMSR, to offer high availability for application server systems. Second, we model a single physical server which is used to host multiple virtual machines (VMs) with the VMSR framework using stochastic modeling and evaluate it through both numerical analysis and SHARPE (Symbolic Hierarchical Automated Reliability and Performance Evaluator) tool simulation. This VMSR model is very general and can capture application server characteristics, failure behavior, and performability measures. Our results demonstrate that the VMSR approach is a practical way to ensure uninterrupted availability and to optimize performance for aging applications. This research was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) under Grant No. KRF2007-210-D00006.

15.
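Research on a reconfigurable flight control system based on direct adaptive control
Reconfiguration design of a flight control system usually requires knowledge of the system's fault information. With direct adaptive control, however, the flight control system can be reconfigured after control-surface damage without knowing the fault information, and the faulty aircraft can closely track the output of the reference model. An optimization algorithm is used to design a feedback compensator that guarantees strict positive realness of the faulty system, and a Lyapunov function is used to prove the asymptotic stability of the reconfigured system. The method is applied to the design of the lateral control system of an aircraft. Simulation results show that the aircraft maintains good performance even when the control surfaces are severely damaged.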

16.
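Many business-critical applications require the system to run continuously and stably so that it can provide uninterrupted service to users. By configuring redundant hardware together with fault monitoring and handling software, a high-availability system can greatly improve system availability. This paper designs and implements a high-availability telecom application based on the Motorola CPX8216 and HA-Linux. Experiments show that the software can automatically detect errors and failures of the system control node and service processes and, when they occur, automatically switch system control and services to the standby node. The paper closes by pointing out problems that require further study.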

17.
A direct adaptive approach is developed for control of a class of multi-input multi-output (MIMO) nonlinear systems in the presence of uncertain failures of redundant actuators. An adaptive failure compensation controller is designed which is capable of accommodating uncertainties in actuator failure time instants, values and patterns. A realistic situation is studied with fixed grouping of actuators and proportional actuation within actuator groups. The adaptive control system is analyzed to show its desired stability and asymptotic tracking properties in the presence of actuator failure uncertainties. As an application, such an adaptive controller is used for actuator failure compensation of a Twin Otter aircraft longitudinal model, with design conditions verified and control structure and adaptive laws developed for a nonlinear aircraft dynamic model. The effectiveness of adaptive failure compensation is demonstrated by simulation results.

18.
From P2P to reliable semantic P2P systems
Current research to harness the power of P2P networks involves building reliable Semantic Peer-to-Peer (SP2P) systems. SP2P systems combine two complementary technologies: P2P networking and ontologies. There are several types of SP2P systems with applications to knowledge management systems, databases, the Semantic Web, emergent semantics, web services, and information systems. Correct semantic mapping is fundamental for the success of SP2P systems, where semantic mapping refers to the semantic relationships between concepts from different ontologies. Current research on SP2P systems has emphasized semantics at the cost of dealing with the traditional P2P network issues of reliability and scalability. As a result of their lack of resilience to temporary mapping faults, SP2P systems can suffer from disconnection failures. Disconnection failures arise when SP2P systems that use adaptive query routing methods treat temporary mapping faults as permanent mapping faults. This paper identifies the disconnection failure problem due to temporary semantic mapping faults and proposes an algorithm to resolve it. To identify the problem, we use a simulation model of SP2P systems. The Fault-Tolerant Adaptive Query Routing (FTAQR) algorithm proposed to resolve the problem is an adaptation of the generous tit-for-tat method originally developed in evolutionary game theory. The paper demonstrates that the reliability of an SP2P system increases by using the algorithm.
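
The abstract names generous tit-for-tat as the basis of FTAQR but gives no algorithmic detail. The sketch below therefore illustrates only the general idea, not the paper's algorithm: a peer that receives a faulty answer from a neighbor forgives it with some probability instead of dropping the link at once, so a temporary mapping fault need not cause a permanent disconnection. The names `Neighbor`, `update_link`, and `generosity` are hypothetical.

```python
import random


class Neighbor:
    """A routing neighbor; `connected` says whether queries are still sent to it."""

    def __init__(self, name):
        self.name = name
        self.connected = True


def update_link(neighbor, answer_ok, generosity, rng=random):
    """Generous tit-for-tat style link update (illustrative assumption).

    answer_ok  : True if the neighbor's last answer was semantically correct
    generosity : probability in [0, 1] of forgiving a faulty answer
    rng        : object with a random() method, injectable for reproducibility
    Returns the neighbor's connection status after the update.
    """
    if answer_ok:
        return neighbor.connected  # cooperation is answered with cooperation
    if rng.random() >= generosity:
        neighbor.connected = False  # retaliate: stop routing queries here
    return neighbor.connected


# generosity=0.0 behaves like strict adaptive routing: one fault drops the link.
strict = Neighbor("peer-a")
update_link(strict, answer_ok=False, generosity=0.0)

# generosity=1.0 always forgives, so the link survives repeated faults.
forgiving = Neighbor("peer-b")
for _ in range(5):
    update_link(forgiving, answer_ok=False, generosity=1.0)
```

Intermediate `generosity` values trade tolerance of temporary faults against the cost of continuing to route to a neighbor whose mapping is permanently wrong.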

19.
Fault modelling is essential to anticipate failures in critical systems. Traditionally, Static Fault Trees are employed to this end, but Temporal and Dynamic Fault Trees are gaining prominence due to their enriched power to model and detect intricate propagation of faults that lead to a failure. In previous work, we showed a strategy based on the process algebra CSP and Simulink models to obtain fault traces that lead to a failure. Although that work used Static Fault Trees, it could be used with Temporal or Dynamic Fault Trees. In the present work we define an algebra of temporal faults (with a notion of fault propagation) and prove that it is indeed a Boolean algebra. This allows us to inherit Boolean algebra’s properties, laws and existing reduction techniques, which are very beneficial for fault modelling and analysis. We illustrate our work on a simple but real case study supplied by our industrial partner EMBRAER.
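
As a hedged illustration of the kind of Boolean reduction the abstract says can be inherited, the sketch below computes the minimal cut sets of a static fault tree using the absorption law (A + A·B = A). It covers only static AND/OR gates and does not model the temporal fault algebra defined in the paper. The tuple encoding of the tree and the event names are assumptions made for this example.

```python
from itertools import product


def absorb(sets):
    """Absorption law: drop every cut set that strictly contains another one."""
    return {s for s in sets if not any(t < s for t in sets)}


def cut_sets(node):
    """Return the minimal cut sets of a static fault tree.

    A node is ("event", name), ("or", child, ...) or ("and", child, ...).
    Each cut set is a frozenset of basic-event names whose joint occurrence
    causes the top event.
    """
    kind = node[0]
    if kind == "event":
        return {frozenset([node[1]])}
    children = [cut_sets(child) for child in node[1:]]
    if kind == "or":
        # Any cut set of any child triggers an OR gate.
        result = set().union(*children)
    elif kind == "and":
        # An AND gate needs one cut set from every child at the same time.
        result = {frozenset().union(*combo) for combo in product(*children)}
    else:
        raise ValueError(f"unknown gate: {kind}")
    return absorb(result)


# Top event = A or (A and B) or (B and C); the middle term is absorbed by A.
tree = ("or",
        ("event", "A"),
        ("and", ("event", "A"), ("event", "B")),
        ("and", ("event", "B"), ("event", "C")))
minimal = cut_sets(tree)
```

Here `minimal` contains the two cut sets {A} and {B, C}: either fault A alone or faults B and C together lead to the failure.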

20.
The increased complexity and scale of high performance computing and future extreme-scale systems have made resilience a key issue, since it is expected that future systems will have various faults during critical operations. It is also expected that current solutions for resiliency, mainly counting on checkpointing in hardware and applications, will become infeasible because of unacceptable recovery time for checkpointing and restarting. In this paper, we present innovative concepts for anomaly detection and identification, analyzing the duration of pattern transition sequences of an execution window. We use a three-dimensional array of features to capture spatial and temporal variability to be used by an anomaly analysis system to immediately generate an alert and identify the source of faults when an abnormal behavior pattern is captured, indicating some kind of software or hardware failure. The main contributions of this paper include the innovative analysis methodology and feature selection to detect and identify anomalous behavior. Evaluating the effectiveness of this approach to detect faults injected asynchronously shows a detection rate above 99.9% with no occurrences of false alarms for a wide range of scenarios, and an accuracy rate of 100% with short root cause analysis time.
