首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 203 毫秒
1.
近来的研究表明,长时间运行的通信软件往往存在老化现象。为防止软件老化引起的突发性系统停机,提高系统的可靠性,人们提出了一种预防性的软件容错策略,称为rejuvenation。由于它的过程复杂,总的停机成本仍然是可观的。检查点是一种轻量级的软件容错策略,其成本远小于rejuvenation的成本。该文通过合理结合rejuvenation和检查点技术,实现了降低总的系统停机成本的目的。文中给出了系统的Petri网模型,并结合实例进行了分析。  相似文献   

2.
陈新  黄永忠  鲍天明  郑霄 《计算机应用》2010,30(10):2741-2744
Java服务已成为支撑关键业务的业务服务,其可用性成为关键业务系统是否能持续提供服务的关键。采用主动容错技术可提高Java服务的可用性,建立Java服务的主动容错模型,便于分析和评估主动容错技术的有效性。通过模型分析与仿真实验的方法比较了采用Rejuvenation策略与不采用Rejuvenation策略的容错效果,通过分析评估得出,采用主动容错技术将有效改善Java服务的可用性,如果合理选择实施软件Rejuvenation策略的时间点,则可以取得更好的容错效果。  相似文献   

3.
一种新的优化的检查点间隔的求解模型   总被引:1,自引:0,他引:1  
在具有容错功能的高性能计算环境中,由于加入检查点机制会给系统引入额外负载,检查点间隔的适当选定能使系统性能优化,Vaidya的贡献是用他的模型得出的检查点间隔的求解等式独立于检查点潜伏时间(L)及检查点恢复时间(R),本文介绍了一种新的基于时间分段的模型NSBM,引入了系统平均利用率这一容错领域更易理解的概念代替Vaidya模型中的平均负载率并推导出了也是独立于LR的求解等等式,实验结果表明NSBM的求解模型比Vaidya的求解模型更优化。  相似文献   

4.
一个适合大规模集群并行计算的检查点系统   总被引:4,自引:1,他引:4  
分布式检查点系统是大规模并行计算系统容错的重要手段.协议开销和检查点映像存储成为困扰并行检查点系统可伸缩性的两大瓶颈.针对并行应用程序的执行特征和高性能集群的体系结构特点,C系统分别采用动态虚连接技术和分布存储检查点映像的方法来有效降低协同式检查点的开销,增强检查点系统的可伸缩性.初步测试结果表明,C系统的设计策略适合大规模并行计算的容错.  相似文献   

5.
应用级checkpointing技术是同构系统上最为常用和成熟的容错技术,但在异构系统下的应用还处于起步阶段,还没有一套严谨合理的针对异构系统架构和故障模型特点的实现方案和配置方法。针对这一现况,本文基于CUDA异构系统的体系结构和编程模型,对CUDA程序在CPU和GPU上的执行模式进行分析,提出了一种面向异构系统应用级checkpointing技术的异步执行机制,并基于这一机制对异构系统的检查点优化设置问题进行讨论,设计了一套优化方案。最后在CUDA平台下通过三个实例验证了这一技术的可行性和实用性,并进行了性能评估。结果表明,这种面向CPU-GPU的异构系统的应用级checkpointing异步执行机制是行之有效的,相比CPU-GPU同步执行的checkpointing机制在设置上更为灵活,优化空间更大。而本文基于这一机制所提出的检查点优化设置方法也有效地减少了check-pointing的开销,从而获得了更高的容错性能。  相似文献   

6.
节点崩溃或者仿真资源不足导致的分布式仿真系统故障,降低了仿真系统可靠性。为保证系统容错效果,降低容错开销,提出了一种基于虚拟化技术的仿真系统容错方法,按照系统故障发生的位置,对不同类型故障动态采用不同类型的容错策略。分析了检查点容错策略的优化方法,给出了最优设置间隔;结合虚拟化技术的优势,解决了副本容错策略的节点选择、副本数量以及位置分布问题;同时,引入基于虚拟机迁移的容错策略,并将其作为检查点容错策略和副本容错策略的补充,以降低容错开销。通过仿真实验数据对比,分析了动态容错策略与普通容错策略的性能,可知动态容错策略保证了系统容错性能,容错开销也保持在较低水平。  相似文献   

7.
在大规模机群环境下,检查点和恢复机制是一种必不可少的容错技术。该文提出一种基于机群通信系统的可靠性机制,在不作全局同步的情况下获取通信系统全局状态的方法,并利用该方法实现了一个对应用程序透明的并行检查点系统。该系统通过底层通信系统的支持降低了并行检查点的实现复杂度和执行开销,适用于大规模机群应用。  相似文献   

8.
复杂系统的形式化描述对新系统的设计以及现有系统的改进与评价都具有十分重要的作用;针对处理机系统容错实时混合任务调度,提出采用确定与随机Petri网进行建模与性能分析;首先,根据任务执行的优先级、周期性、容错性和实时性,将任务分为四类;然后,采用DSPN对任务调度执行过程,不同优先级任务抢占式调度,处理机故障及故障恢复过程进行建模,由此构成处理机系统容错实时任务调度过程的DSPN模型;最后,仿真实验结果表明,在负载相同情况下,处理机利用率基本相同,且具有容错的实时任务调度算法可以有效地降低任务错失率;容错实时任务调度DSPN模型可以为复杂任务调度系统的Petri网建模与分析奠定了基础,并为实际工程应用提供了理论指导。  相似文献   

9.
王一拙  陈旭  计卫星  苏岩  王小军  石峰 《软件学报》2016,27(7):1789-1804
任务并行程序设计模型已成为并行程序设计的主流,其通过发掘任务并行性来提高并行计算机的系统性能.提出一种支持容错的任务并行程序设计模型,将容错技术融入到任务并行程序设计模型中,在保证性能的同时提高系统可靠性.该模型以任务为调度、执行、错误检测与恢复的基本单位,在应用级实现容错支持.采用一种Buffer-Commit计算模型支持瞬时错误的检测与恢复;采用应用级无盘检查点实现节点故障类型永久错误的恢复;采用一种支持容错的工作窃取任务调度策略获得动态负载均衡.实验结果表明,该模型以较低的性能开销提供了对硬件错误的容错支持.  相似文献   

10.
双机容错系统中最佳检查点间隔的分析   总被引:2,自引:0,他引:2       下载免费PDF全文
设置检查点是容错计算机系统进行故障恢复的重要手段。因为检查点间隔选择过大或过小都将使系统性能受到影响,所以检查点间隔的适当选定是系统性能优化的一个重要指标。该文针对双机容错系统,采用检查点设置与回卷恢复的方法提出了一种系统模型,利用马尔科夫链得到了最佳检查点间隔的求解等式,通过实验证实了求解等式的正确性。  相似文献   

11.
Failure of a safety critical system can lead to big losses.Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems.Fault-tolerant softwares are used to increase the overall reliability of software systems.Fault tolerance is achieved using the fault-tolerant schemes such as fault recovery (recovery block scheme),fault masking (N-version programming (NVP)) or a combination of both (Hybrid scheme).These softwares incorporate the ability of system survival even on a failure.Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems.Most of them consider the stable system reliability.Few attempts have been made in reliability modeling to study the reliability growth for an NVP system.Recently,a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency.In this model,a proportion of the number of failures is assumed to be a measure of fault generation while an appropriate measure of fault generation should be the proportion of faults removed.In this paper,we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation.Using this model,a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system.The proposed model is useful for practical applications and can provide the measures of debugging effectiveness and additional workload or skilled professional required.It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost.In this paper,we also formulate the optimal software release time problem for a 3VP system under fuzzy environment and discuss a the fuzzy optimization technique for solving the problem with a numerical illustration.  相似文献   

12.
提出了一种双冗余千兆以太网实时容错通信方案,描述了双网失效检测、故障隔离、系统重构等容错机制的实现,探讨了通过达到低延迟、可预测的通信从而实现实时通信的机制,并对该方案的可靠性和实时性进行了定性与定量的分析。  相似文献   

13.
在空间环境下运行的计算机系统,高空辐射可能引发各种各样的异常或错误而导致故障。为了提高系统的可靠性,同时尽可能减少对系统实时性能的影响,需要对其进行有效的容错。针对节点和应用软件的故障检测和故障恢复进行研究与分析,提出了多种灵活有效的软件容错策略与设计方案,并基于四节点的多机硬件体系结构和RTEMS软件操作系统,设计并实现了一个系统原型。运行结果显示,该方案有效地提高了嵌入式实时系统的可靠性。  相似文献   

14.
Today's technology evolution provides users inexpensive and powerful computer systems. However, there are argues that system reliability and fault tolerance is necessary in the systems as well. A proper design for the reliable and fault-tolerant computer system requires a trade-off among cost, reliability, and availability. In this paper, we propose a low-cost recovery scheme for reliable system performance. With this approach, it completely eliminates the roll-back overhead on branch misprediction. Thus, the instruction fetcher does not stop and it fetches instructions from the correct path immediately after the misprediction detected. So, this approach prevents a processor from flushing the pipeline, even under branch misprediction by allowing the instruction fetcher to work continuously. Our approach reduces the branch misprediction penalty for achieving reliable system performance. It instantly reconstructs the map table to any mispredicted branch and it outperforms the conventional RMT by an average of 10.93%.  相似文献   

15.
在分布式存储系统存储数据时,如果一个或几个设备出现故障,不仅该设备中的数据不能使用,而且会导致用户无法完整地访问资源。针对该问题,提出一种基于RS码的错误容忍存储方案,当系统中错误设备的数量不超过m时,就可以对其进行恢复,实现容错。该方案具有较高的安全性与执行效率,能满足存储系统容错的要求,可以利用其构造对可靠性要求较高的存储系统。  相似文献   

16.
This paper is concerned with an observer-based optimal fault-tolerant control for an offshore steel jacket platform. The dynamic characteristics of the actuator faults under consideration are formulated by an exogenous system. Based on a dynamic fault observer designed, a feedforward and feedback optimal fault-tolerant controller is developed to improve the reliability of the offshore platform. The controller can be obtained by solving an algebraic Riccati equation and Sylvester equations, respectively. It is shown through simulation results that the proposed control scheme is effective to guarantee the reliability of the offshore platform with the actuator faults. The vibration amplitudes of the displacement, the velocity of the offshore platform and the required control force under the proposed fault-tolerant controller can be reduced significantly.  相似文献   

17.
Stimulated by the growing demand for improving system performance and reliability, fault-tolerant system design has been receiving significant attention. This paper proposes a new fault-tolerant control methodology using adaptive estimation and control approaches based on the learning capabilities of neural networks or fuzzy systems. On-line approximation-based stable adaptive neural/fuzzy control is studied for a class of input–output feedback linearizable time-varying nonlinear systems. This class of systems is large enough so that it is not only of theoretical interest but also of practical applicability. Moreover, the fault-tolerance ability of the adaptive controller has been further improved by exploiting information estimated from a fault-diagnosis unit designed by interfacing multiple models with an expert supervisory scheme. Simulation examples for a fault-tolerant jet engine control problem are given to demonstrate the effectiveness of the proposed scheme.  相似文献   

18.
Many workstation-based distributed systems allow programs to be executed on remote idling machines for effective utilization of system resources. Usually, the control policies in these systems force a remote job be discontinued by the arrival of local jobs to guarantee the autonomy of individual workstations. Therefore, one special concern in the design of such systems is the fault-tolerant aspects for the execution of remote jobs. In the paper we discuss two control policies of workstation-based distributed systems, checkpointing and non-checkpointing policy, which support fault-tolerant execution of remote jobs on idling workstations. An analytical analysis on the reliability and mean turnaround time of the execution of remote jobs are conducted for both control policies. The optimal time interval between checkpoints in the checkpointing policy is formulated based on the given reliability and overhead of the system. In addition, several sample results derived from these analyses are compared with the outcome of corresponding simulation programs. Some observations of fault-tolerant features of each control policy are thereupon presented as guidelines for the future development of such workstation-based distributed systems.  相似文献   

19.
在很多实时应用领域,如遥感遥测、雷达数据采集中都需要高速连续数据记录系统。文章针对基于盘阵列结构的记录系统,提出了一种提高其可靠性的方法浮动校验组方法。与传统容错方法相比,浮动校验组方法通过阵列结构的动态改变,可以显著提高盘阵列记录系统的可靠性。  相似文献   

20.
提出了冗余组合导航应用软件的实用设计方法,当组合导航应用软件中多个任务模块采用N文本法和恢复块法进行冗余设计时,给出了含有容错模块的软件系统的可靠性的评价模型及计算方法。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号