Similar Literature
 19 similar documents found (search time: 537 ms)
1.
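Traditional self-healing systems can autonomously detect, diagnose, and eliminate errors, but they affect users and lose state consistency between components, so they cannot meet the requirements of high reliability and high availability. Microreboot is an important software recovery mechanism in recovery-oriented computing (ROC): it recursively restarts subsets of the failed components and relies on crash-only components to maintain state consistency. For the self-healing Minix3 operating system, this paper proposes a microreboot-based self-healing architecture. Experiments with a crash-only driver component show that after a restart the component can continue executing its unfinished task, at the cost of a slight increase in restart time.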

2.
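High availability is the main characteristic and fundamental requirement of mission-critical network systems. Based on an analysis of system restart levels and microreboot technology, this paper presents a probability-based method for determining the optimal recovery strategy for mission-critical network systems. Using the failure probabilities of objects at different system levels, the method recursively computes the restart recovery time required by each strategy and selects the strategy with the shortest restart time as the optimal recovery strategy. A case study shows that the method reduces the system's restart recovery time as far as possible and improves system availability.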
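The recursive computation described above can be sketched as follows. This is only an illustrative model, not the paper's formulation: it assumes restart levels are ordered from finest to coarsest, that a restart at level i takes `times[i]` and cures the failure with probability `cure_probs[i]`, and that a failed attempt escalates to the next level, with the coarsest level always succeeding. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def expected_recovery_time(times: List[float], cure_probs: List[float], start: int) -> float:
    """Expected recovery time when restarting at level `start` and escalating on failure.

    Assumption (not from the source): the coarsest level (last index) always cures.
    """
    if start == len(times) - 1:
        return times[start]
    # Pay for this restart, then with probability (1 - p) pay for the escalation.
    escalation = expected_recovery_time(times, cure_probs, start + 1)
    return times[start] + (1.0 - cure_probs[start]) * escalation


def best_start_level(times: List[float], cure_probs: List[float]) -> Tuple[int, float]:
    """Return (level, expected time) of the strategy with the shortest expected recovery."""
    costs = [expected_recovery_time(times, cure_probs, i) for i in range(len(times))]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]
```

With restart times of 1, 5, and 60 seconds and cure probabilities 0.5, 0.75, and 1.0, starting at the finest level gives an expected 11 seconds, against 20 and 60 seconds for the coarser starting points.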

3.
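Mission-critical systems demand high availability, so they must recover quickly when a failure occurs. Because most failures in mission-critical systems are software-related, a staged software self-recovery method is proposed to meet their recovery requirements, such as short recovery time and few side effects on the system. In the preprocessing stage, the proposed microreboot technique automatically modifies binary files so that restarting on an exception becomes autonomic behavior. At run time, the proposed software self-repair method, based on an MD5 monitoring algorithm and hot-swapping, monitors and automatically repairs the system. The method recovers from internal failures such as response timeouts and resource leaks. It also recovers dynamically and effectively from external attacks, such as a virus modifying binary files or unauthorized users tampering with system files, and it is compatible with dynamic system upgrades.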
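The abstract does not give the details of the MD5 monitoring algorithm. A minimal sketch of the general idea, under that caveat, is to record an MD5 digest for each monitored file and flag any file whose current digest differs from its baseline. The helper names `md5_of_file` and `find_modified` are hypothetical; only the `hashlib` and `pathlib` calls are standard library.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_modified(baseline: Dict[str, str]) -> List[str]:
    """Return the monitored paths that are missing or whose digest changed."""
    modified = []
    for path, expected in baseline.items():
        p = Path(path)
        if not p.is_file() or md5_of_file(p) != expected:
            modified.append(path)
    return modified
```

A monitor would call `find_modified` periodically and hand each reported path to the repair step, for example by hot-swapping in a known-good copy. MD5 detects accidental or naive modification, but it is not collision-resistant against a deliberate attacker.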

4.
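Satellites are increasingly used in important fields such as military reconnaissance, resource exploration, weather forecasting, television broadcasting, and communications, so satellite ground station systems place higher demands than ordinary software systems on security, reliability, survivability, and error detection and recovery. The strict real-time requirements of satellite systems mean that ground station software must not fail. However, software systems suffer from software aging, which leads to failures. To counter software aging, a microreboot strategy for rejuvenating satellite ground station software is adopted. It shortens the application's mean time to recovery (MTTR), improves system reliability, and can supply accurate raw satellite observation data for orbit measurement and orbit determination, which has practical significance and application value for improving their accuracy.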

5.
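To plan guidance for a tour-guide robot's service goals, a task planning method is proposed that uses the Robot Operating System as the experimental platform and combines Markov decision processes with microreboot technology. After fully considering the identity and requirement information of the people being served and the total cost of the service process, the method uses a Markov decision model to establish the optimal execution plan. A microreboot self-repair mechanism built on the distributed robot operating system handles functional failures. Simulation results verify the effectiveness of the plan in tour-guide tasks and show that microreboot outperforms traditional methods of handling functional failures, achieving a planning success rate of 91.03% with randomly added obstacles.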

6.
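Component-Based Nested Software Rejuvenation Strategy and Modeling   Total citations: 1, self-citations: 0, citations by others: 1
Refining the granularity of software rejuvenation to the component level and executing a nested rejuvenation restart strategy can reduce rejuvenation cost and improve software reliability. Based on the control, invocation, and data-access relationships among components in a software system, this paper determines how to find directly coupled components and how to generate the system's component restart tree, constructs a nested component-level rejuvenation restart strategy, and provides support for fine-grained software rejuvenation.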
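A nested restart over a component restart tree can be sketched as below. The sketch assumes the tree is already given as a mapping from each node to its children. How the paper derives that tree from coupling relationships is not reproduced here. All names, including the `try_restart` callback, are hypothetical.

```python
from typing import Callable, Dict, List, Optional


def build_parent(children: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert a node -> children mapping into a child -> parent mapping."""
    return {child: node for node, kids in children.items() for child in kids}


def restart_scope(children: Dict[str, List[str]], node: str) -> List[str]:
    """All components restarted together when `node` is restarted (its subtree)."""
    scope = [node]
    for child in children.get(node, []):
        scope.extend(restart_scope(children, child))
    return scope


def nested_restart(children: Dict[str, List[str]],
                   failed: str,
                   try_restart: Callable[[List[str]], bool]) -> Optional[List[str]]:
    """Restart the smallest subtree first and escalate to the parent on failure.

    Returns the scope that cured the failure, or None if even the root did not.
    """
    parent = build_parent(children)
    node: Optional[str] = failed
    while node is not None:
        scope = restart_scope(children, node)
        if try_restart(scope):
            return scope
        node = parent.get(node)
    return None
```

Restarting the smallest enclosing subtree first keeps the cost low in the common case, and the escalation loop bounds the worst case at a full restart of the root.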

7.
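Recovery-Oriented Cluster Computing Technology   Total citations: 1, self-citations: 0, citations by others: 1
Recovery-oriented computing (ROC) aims to restore a system as quickly as possible after a failure, thereby improving availability, rather than to prevent failures from occurring in the first place. With this in mind, the paper studies recovery-oriented techniques, presents the application of ROC to cluster systems, and proposes a node-group-based recursive restart method and a checkpoint-based Undo recovery model to improve cluster availability. The improvement achieved by each of the two methods is evaluated.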

8.
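Based on the control, invocation, and data-access relationships among processes in a software system, this paper analyzes the degree of coupling between processes and gives a method for determining inter-process restart correlation together with rules for constructing the system restart tree. Drawing on the principles and characteristics of DNA computing, it presents a DNA computing model for determining inter-process restart correlation and drafts a preliminary restart implementation strategy, providing support for intelligent fine-grained software rejuvenation.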

9.
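Service-oriented architecture (SOA) separates business logic from the concrete implementation technology, which facilitates reuse and integration. Moving to SOA services has become a major trend in information systems. Based on the characteristics of SOA services and a comparison of existing service recovery methods, a new SOA service recovery method based on microreboot is proposed and validated on an experimental platform supported by a project. Experiments demonstrate the practicality and high efficiency of the method...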

10.
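In embedded development, program failure is an important and thorny problem. The usual approach is to insert instructions that reset the WatchDog timer at points the program must pass through during normal operation. When the program fails, for example by deadlocking or running away, the WatchDog issues a system restart signal so that the program returns to normal.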
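The watchdog pattern above can be illustrated with a small software simulation. A real embedded watchdog is a hardware timer that resets the processor. This sketch only mimics the logic: the `Watchdog` class, its method names, and the injectable `clock` parameter are all hypothetical.

```python
import time
from typing import Callable


class Watchdog:
    """Software stand-in for a hardware watchdog timer (illustrative only)."""

    def __init__(self, timeout_s: float, on_expire: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._clock = clock
        self._deadline = clock() + timeout_s

    def feed(self) -> None:
        """Called from points the healthy program must pass through."""
        self._deadline = self._clock() + self._timeout_s

    def poll(self) -> bool:
        """Fire the restart action if the program has not fed the timer in time."""
        if self._clock() < self._deadline:
            return False
        self._on_expire()
        self.feed()  # re-arm after the (simulated) restart
        return True
```

The main loop calls `feed()` once per iteration. If the loop deadlocks, `feed()` stops being called and the next `poll()` past the deadline triggers `on_expire`, which stands in for the system restart signal.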

11.
A planning process formulates action assignments for various agents to accomplish a goal statement. In a real situation, unexpected environmental changes (called failures) may invalidate the preformulated plan. When a failure occurs, effective and efficient handling procedures must be taken to prevent irreversible damage. A failure-handling mechanism is a key component of a fault-tolerant system and makes autonomous operation possible. There are two basic approaches to failure handling: replanning and recovery. In the replanning approach, the state in which the failure was encountered is treated as a new initial state, and a brand-new plan is derived from scratch. The recovery approach, by contrast, preserves the applicable components of the original plan and adjusts them as necessary to fit the new state. This article presents a method of achieving recovery and compares its performance with replanning. In general, the recovery approach provides a better response time, and the replanning approach sometimes provides a better plan quality.

12.
张建华, 张文博, 徐继伟, 魏峻, 钟华, 黄涛. 《软件学报》 (Journal of Software), 2014, 25(11): 2702-2714
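With the development and spread of virtualization, more and more enterprises deploy critical business systems on virtualized platforms. Virtualization reduces hardware and management costs, but it also poses serious challenges to system reliability. Traditional methods improve failure-recovery capability by backing up system state at run time, which introduces substantial overhead. This paper proposes a method based on hidden Markov models for optimizing failure-recovery performance. By predicting and analyzing the system's run-time state, the method computes the probabilistic trend of future system states and dynamically adjusts the allocation of resources between failure-recovery functions and normal business functions during operation, thereby reducing run-time performance overhead and improving the service capacity of the business system. Experimental analysis shows that the method effectively reduces performance overhead while preserving system reliability. When the system's running state is stable, it reduces system response time by up to two thirds.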
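The state prediction step can be illustrated with standard hidden Markov model forward filtering followed by a one-step prediction. This is a generic textbook sketch, not the paper's model. The two-state example in the usage note (healthy versus failure-prone, observed through normal or slow responses) and all function names are assumptions.

```python
from typing import List, Sequence


def forward_filter(init: Sequence[float],
                   trans: Sequence[Sequence[float]],
                   emit: Sequence[Sequence[float]],
                   observations: Sequence[int]) -> List[float]:
    """Return P(state_t | obs_1..t) after the last observation.

    init[s]      : prior probability of hidden state s
    trans[i][j]  : probability of moving from state i to state j
    emit[s][o]   : probability of observing symbol o in state s
    """
    n = len(init)
    belief = [init[s] * emit[s][observations[0]] for s in range(n)]
    norm = sum(belief)
    belief = [b / norm for b in belief]
    for obs in observations[1:]:
        predicted = predict_next(belief, trans)
        belief = [predicted[j] * emit[j][obs] for j in range(n)]
        norm = sum(belief)
        belief = [b / norm for b in belief]
    return belief


def predict_next(belief: Sequence[float],
                 trans: Sequence[Sequence[float]]) -> List[float]:
    """One-step-ahead state distribution given the current belief."""
    n = len(belief)
    return [sum(belief[i] * trans[i][j] for i in range(n)) for j in range(n)]
```

If state 1 is taken to mean "failure-prone", a scheduler could give more resources to state backup when `predict_next(...)[1]` is high and release them to normal business functions when it is low. That allocation rule is a hypothetical reading of the abstract, not its stated policy.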

13.
A Web service-based system never fulfills a user's goal unless a failure recovery approach exists. It is inevitable that several Web services may either perish or fail before or during transactions. The completion of a composite process relies on the smooth execution of all constituent Web services. A mediator acts as an intermediary between providers and consumers to monitor the execution of these services. If a service fails, the mediator has to recover the whole composite process or else jeopardize achieving the intended goals. The atomic replacement of a perished Web service usually does not apply because the process of locating a matched Web service is unreliable. The system cannot even depend on replacing the dead service with a composite service. In this paper, we propose an automatic renovation plan for failure recovery of composite semantic services based on an approach of subdigraph replacement. A replacement subdigraph is posed in lieu of an original subdigraph, which includes the failed service. The replacement is done in two separate phases, offline and online, to make the recovery faster. The offline phase foresees all possible subdigraphs, pre-calculates them, and ranks several possible replacements. The online phase compensates for the unwanted effects and executes the replacement subdigraph in lieu of the original subdigraph. We have evaluated our approach in an experiment and have found that we could recover more than half of the simulated failures. These achievements show a significant improvement compared to current approaches.

14.
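In view of the current state of troubleshooting for military communication equipment, an electronic troubleshooting manual for communication equipment based on .NET is designed. The system uses the .NET environment, an Access database, and the ADO.NET core components to implement user management, fault information management, and data maintenance. The paper first analyzes the overall architecture and functions of the system, then describes its key technologies in detail, and finally demonstrates the feasibility of the system through everyday use.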

15.
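This paper uses parity vectors to study the reliability of strapdown inertial navigation. Parity equations are established, and hypothesis testing is then used to derive the decision function for fault detection and the isolation function. Kalman filtering is used to estimate errors such as dynamic measurement errors, and the error estimates are used to compensate the parity vector, so that fault detection and fault isolation can be performed with a constant threshold. Digital simulation of a redundant strapdown system composed of six single-degree-of-freedom gyros shows that the method is effective.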
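The abstract does not reproduce its equations. For orientation, the parity-vector formulation commonly used for redundant sensor sets can be written as below. The symbols are assumptions chosen for this sketch: m is the vector of six gyro measurements, H the 6×3 measurement geometry matrix, x the three-axis angular rate, b_f a fault vector, ε Gaussian noise with variance σ², and ê the Kalman filter's estimate of the measurement errors.

```latex
\begin{aligned}
m &= Hx + b_f + \varepsilon, \qquad H \in \mathbb{R}^{6\times 3},\quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_6) \\
VH &= 0, \qquad VV^{\mathsf T} = I_3, \qquad V = [\,v_1, \dots, v_6\,] \in \mathbb{R}^{3\times 6} \\
p &= Vm = V(b_f + \varepsilon), \qquad p_c = p - V\hat{e} \\
DF_D &= \frac{p_c^{\mathsf T} p_c}{\sigma^2} \;\gtrless\; T_D \\
DF_I(j) &= \frac{(p_c^{\mathsf T} v_j)^2}{v_j^{\mathsf T} v_j}, \qquad \hat{j} = \arg\max_{j} DF_I(j)
\end{aligned}
```

Because V H = 0, the parity vector does not depend on the true rate x. If the compensated vector p_c contains only white noise when there is no fault, DF_D follows a chi-square distribution with three degrees of freedom, which is why a constant threshold T_D can be fixed from a chosen false-alarm rate.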

16.
As the mean time between failures (MTBF) continues to decline with the increasing number of components in large-scale high performance computing (HPC) systems, program failures are likely to occur during execution. Ensuring successful execution of HPC programs has become an issue that unprivileged users should be concerned with. From the user's perspective, if a program failure cannot be detected and handled in time, it wastes resources and delays the progress of program execution. Unfortunately, unprivileged users are unable to perform program state checking because execution is controlled by the job management system and their privileges are limited. Currently, automated tools for supporting user-level failure detection and autorecovery of parallel programs in HPC systems are missing. This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs. The state checker in our method is encapsulated as an independent job to reduce interference with the user jobs. In addition, we propose a dual-checker mechanism to improve the robustness of our approach. We implement the proposed method as a tool named automatic re-launcher (ARL) and evaluate it on the Tianhe-2 system. Experimental results show that ARL can detect execution failures effectively on the Tianhe-2 system. In addition, the communication and performance overhead caused by ARL is negligible. The good scalability of ARL makes it applicable to large-scale HPC systems.

17.
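Embedded real-time systems are increasingly used in safety-critical environments such as transportation, aviation, and nuclear energy. Even if the system design is free of defects, random faults caused by wear of physical components or sudden environmental changes can still put the system in a hazardous state at run time. Current hazard analysis methods based on failure propagation models consider either failure propagation time alone or failure probability alone, and lack a combined analysis of how the two affect hazard analysis. The timed failure propagation graph (TFPG) model is used to model failure propagation in the design phase of safety-critical systems and includes modeling of propagation delays. Considering how uncertainty in failure propagation paths affects the probability of a hazard, a hazard analysis method is proposed that models failure propagation with probabilistic timed failure propagation graphs (P-TFPGs) and, based on this model, analyzes the relationship between the time at which a hazard occurs and its probability. Finally, a case study illustrates the feasibility of the method.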

18.
Optimal operation and maintenance of engineering systems rely heavily on the accurate prediction of their failures. Most engineering systems, especially mechanical systems, are susceptible to failure interactions. These failure interactions can be estimated for repairable engineering systems when determining optimal maintenance strategies for these systems. An extended Split System Approach is developed in this paper. The technique is based on the Split System Approach and a model for interactive failures. The approach was applied to simulated data. The results indicate that failure interactions increase the hazard of newly repaired components. The intervals between preventive maintenance actions for a system with failure interactions become shorter than in scenarios where failure interactions do not exist.

19.
This paper studies maintenance policies for multi-component systems which have failure interaction among their components. Component failure might accelerate deterioration processes or induce instantaneous failures of the remaining components. We formulate this maintenance problem as a Markov decision process (MDP) with the objective of minimising the total discounted maintenance cost. However, the action set and state space of the MDP grow exponentially as the number of components increases. This makes traditional approaches computationally intractable. To deal with this curse of dimensionality, a modified iterative aggregation procedure (MIAP) is proposed. We mathematically prove that the iterations in MIAP guarantee convergence and that the policy obtained is optimal. Numerical case studies find that failure interaction should not be ignored in maintenance policy decision making and that the proposed MIAP is faster and requires less memory than linear programming.
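To make the MDP formulation concrete, the sketch below solves a tiny single-component maintenance problem with plain value iteration. It is not the paper's MIAP, and it ignores failure interaction. It only shows the kind of exhaustive sweep over states and actions that becomes intractable as components are added. The states, costs, transition probabilities, and names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical model: 0 = good, 1 = degraded, 2 = failed.
STATES: List[int] = [0, 1, 2]
ACTIONS: List[str] = ["wait", "repair"]
# Immediate cost of taking an action in a state (downtime or repair cost).
COST: Dict[Tuple[int, str], float] = {
    (0, "wait"): 0.0, (1, "wait"): 1.0, (2, "wait"): 10.0,
    (0, "repair"): 5.0, (1, "repair"): 5.0, (2, "repair"): 5.0,
}
# Transition lists of (next_state, probability); repair returns to "good".
TRANS: Dict[Tuple[int, str], List[Tuple[int, float]]] = {
    (0, "wait"): [(0, 0.8), (1, 0.2)],
    (1, "wait"): [(1, 0.5), (2, 0.5)],
    (2, "wait"): [(2, 1.0)],
    (0, "repair"): [(0, 1.0)], (1, "repair"): [(0, 1.0)], (2, "repair"): [(0, 1.0)],
}


def q_value(state, action, values, cost, trans, gamma):
    """Immediate cost plus discounted expected future cost."""
    return cost[(state, action)] + gamma * sum(p * values[nxt] for nxt, p in trans[(state, action)])


def value_iteration(states, actions, cost, trans, gamma=0.9, tol=1e-9):
    """Return (values, policy) minimising total discounted cost."""
    values = {s: 0.0 for s in states}
    while True:
        new_values = {s: min(q_value(s, a, values, cost, trans, gamma) for a in actions)
                      for s in states}
        delta = max(abs(new_values[s] - values[s]) for s in states)
        values = new_values
        if delta < tol:
            break
    policy = {s: min(actions, key=lambda a: q_value(s, a, values, cost, trans, gamma))
              for s in states}
    return values, policy
```

Calling `value_iteration(STATES, ACTIONS, COST, TRANS)` returns a policy that waits in the good state and repairs in the failed state. With n components of three states each, the joint state space has 3^n entries, which is the exponential growth the abstract refers to.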
