Found 20 similar documents.
1.
2.
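Java services have become the business services that support critical operations, and their availability determines whether a critical business system can keep providing service. Proactive fault-tolerance techniques can improve the availability of Java services, and building a proactive fault-tolerance model for Java services makes it easier to analyze and evaluate how effective these techniques are. Using model analysis and simulation experiments, the fault-tolerance effect with a Rejuvenation policy is compared against that without one. The evaluation shows that proactive fault tolerance effectively improves the availability of Java services, and that choosing the time points for software Rejuvenation appropriately yields even better fault-tolerance results.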
3.
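A new optimized model for solving the checkpoint interval (total citations: 1, self-citations: 0, citations by others: 1)
In high-performance computing environments with fault tolerance, the checkpoint mechanism adds extra load to the system, so choosing the checkpoint interval appropriately can optimize system performance. Vaidya's contribution is that the equation his model yields for the checkpoint interval is independent of the checkpoint latency (L) and the checkpoint recovery time (R). This paper presents NSBM, a new model based on time segmentation. It replaces the average load rate used in Vaidya's model with the average system utilization, a concept that is easier to understand in the fault-tolerance field, and derives a solving equation that is likewise independent of L and R. Experimental results show that the NSBM model gives a better-optimized solution than Vaidya's model.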
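As a rough illustration of what "solving for the checkpoint interval" means, the sketch below uses Young's classic first-order approximation, not the NSBM or Vaidya model described in the abstract, whose equations are not given here. The function names and the sample numbers are assumptions made for illustration only. Like the models in the abstract, this approximation depends only on the checkpoint cost and the mean time between failures, not on latency or recovery time.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost: time to write one checkpoint (C)
    mtbf: mean time between failures (M), in the same unit
    Returns sqrt(2 * C * M).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time lost.

    C / T accounts for writing checkpoints; T / (2 * M) accounts for
    the work redone after a failure (half an interval on average).
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    # Hypothetical values: 2-minute checkpoints, 100-minute MTBF.
    t = young_interval(2.0, 100.0)
    print(t, overhead_fraction(t, 2.0, 100.0))
```

Under this approximation, intervals shorter or longer than the computed value both increase the estimated overhead, which is the trade-off the checkpoint-interval models above are optimizing.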
4.
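A checkpoint system suited to large-scale cluster parallel computing (total citations: 4, self-citations: 1, citations by others: 4)
Distributed checkpoint systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two major bottlenecks that limit the scalability of parallel checkpoint systems. Based on the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images to reduce the overhead of coordinated checkpointing and improve the scalability of the checkpoint system. Preliminary test results show that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.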
5.
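Jia Jia (贾佳). Computer Engineering & Science (《计算机工程与科学》), 2011, 33(11)
Application-level checkpointing is the most commonly used and mature fault-tolerance technique on homogeneous systems, but its use on heterogeneous systems is still at an early stage, and there is not yet a rigorous, well-founded implementation scheme and configuration method tailored to the architecture and fault model of heterogeneous systems. To address this, based on the architecture and programming model of CUDA heterogeneous systems, this paper analyzes how CUDA programs execute on the CPU and GPU, proposes an asynchronous execution mechanism for application-level checkpointing on heterogeneous systems, discusses the problem of optimal checkpoint placement under this mechanism, and designs an optimization scheme. Finally, three examples on the CUDA platform verify the feasibility and practicality of the technique, and its performance is evaluated. The results show that the asynchronous application-level checkpointing mechanism for CPU-GPU heterogeneous systems is effective: compared with a checkpointing mechanism in which the CPU and GPU execute synchronously, it is more flexible to configure and leaves more room for optimization. The checkpoint optimization method proposed on top of this mechanism also effectively reduces checkpointing overhead and thus achieves better fault-tolerance performance.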
6.
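Failures of distributed simulation systems caused by node crashes or insufficient simulation resources reduce the reliability of the simulation system. To ensure effective fault tolerance while reducing its overhead, a virtualization-based fault-tolerance method for simulation systems is proposed, which dynamically applies different fault-tolerance strategies to different types of faults according to where the fault occurs. The optimization of the checkpoint strategy is analyzed and the optimal checkpoint interval is given. Exploiting the advantages of virtualization, the method resolves node selection, replica count, and replica placement for the replication strategy. In addition, a strategy based on virtual machine migration is introduced as a complement to the checkpoint and replication strategies to reduce overhead. A comparison of simulation experiment data on the performance of the dynamic and conventional strategies shows that the dynamic strategy preserves the system's fault-tolerance performance while keeping overhead low.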
7.
8.
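Formal description of complex systems plays an important role in designing new systems and in improving and evaluating existing ones. For fault-tolerant real-time scheduling of mixed tasks in processor systems, modeling and performance analysis with deterministic and stochastic Petri nets (DSPN) is proposed. First, tasks are divided into four classes according to their execution priority, periodicity, fault tolerance, and real-time requirements. Then DSPN is used to model the task scheduling and execution process, preemptive scheduling among tasks of different priorities, and processor failure and recovery, which together form the DSPN model of fault-tolerant real-time task scheduling in the processor system. Finally, simulation results show that under the same load, processor utilization is essentially the same, and the fault-tolerant real-time scheduling algorithm effectively lowers the task miss rate. The DSPN model lays a foundation for Petri-net modeling and analysis of complex task scheduling systems and offers theoretical guidance for practical engineering applications.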
9.
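The task-parallel programming model has become the mainstream of parallel programming; it improves the performance of parallel computers by exploiting task parallelism. A fault-tolerant task-parallel programming model is proposed that integrates fault-tolerance techniques into the task-parallel programming model, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault-tolerance support at the application level. A Buffer-Commit computation model supports detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing scheduling policy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at low performance overhead.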
10.
11.
P. K. Kapur, Anshu Gupta, P. C. Jha. International Journal of Automation and Computing (《国际自动化与计算杂志》), 2007, 4(4): 369-379
Failure of a safety-critical system can lead to big losses. Very high software reliability is required for automating the working of systems such as aircraft controller and nuclear reactor controller software systems. Fault-tolerant software is used to increase the overall reliability of software systems. Fault tolerance is achieved using fault-tolerant schemes such as fault recovery (the recovery block scheme), fault masking (N-version programming (NVP)), or a combination of both (the hybrid scheme). Such software is able to survive even when a failure occurs. Many researchers in the field of software engineering have done excellent work to study the reliability of fault-tolerant systems. Most of them consider the stable system reliability. Few attempts have been made in reliability modeling to study the reliability growth of an NVP system. Recently, a model was proposed to analyze the reliability growth of an NVP system incorporating the effect of fault removal efficiency. In this model, a proportion of the number of failures is assumed to be a measure of fault generation, while an appropriate measure of fault generation should be the proportion of faults removed. In this paper, we first propose a testing efficiency model incorporating the effect of imperfect fault debugging and error generation. Using this model, a software reliability growth model (SRGM) is developed to model the reliability growth of an NVP system. The proposed model is useful for practical applications and can provide measures of debugging effectiveness and of the additional workload or skilled professionals required. It is very important for a developer to determine the optimal release time of the software to improve its performance in terms of competition and cost. In this paper, we also formulate the optimal software release time problem for a 3VP system under a fuzzy environment and discuss a fuzzy optimization technique for solving the problem, with a numerical illustration.
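To make the fault-masking idea behind N-version programming concrete, here is a minimal sketch of a majority voter over independently written versions of the same function. It illustrates only the voting step, not the reliability growth model of the paper; `nvp_vote` and the three toy versions are hypothetical names introduced for this example, and outputs are assumed to be hashable and comparable for exact equality.

```python
from collections import Counter
from typing import Any, Callable, Sequence


def nvp_vote(versions: Sequence[Callable[[Any], Any]], x: Any) -> Any:
    """Run every version on x and return the strict-majority output.

    A version that raises casts no vote. If no output is produced by
    more than half of all versions, a RuntimeError is raised.
    """
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            continue  # a crashed version is treated as a missing vote
    if not outputs:
        raise RuntimeError("all versions failed")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among version outputs")
    return value


if __name__ == "__main__":
    # Hypothetical 3VP setup: two correct versions and one faulty one.
    square_a = lambda n: n * n
    square_b = lambda n: n ** 2
    square_faulty = lambda n: n * n + 1
    print(nvp_vote([square_a, square_b, square_faulty], 3))
```

With three versions, a single faulty or crashed version is masked as long as the other two agree; two disagreeing faults leave no majority and are reported as an error rather than hidden.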
12.
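Design and implementation of a real-time fault-tolerant communication scheme based on dual-redundant Gigabit Ethernet (total citations: 1, self-citations: 0, citations by others: 1)
A real-time fault-tolerant communication scheme over dual-redundant Gigabit Ethernet is proposed. The implementation of fault-tolerance mechanisms such as dual-network failure detection, fault isolation, and system reconfiguration is described; mechanisms for achieving real-time communication through low-latency, predictable communication are discussed; and the reliability and real-time performance of the scheme are analyzed both qualitatively and quantitatively.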
13.
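For computer systems operating in the space environment, high-altitude radiation can trigger a variety of anomalies or errors that lead to failures. To improve system reliability while minimizing the impact on real-time performance, effective fault tolerance is needed. Fault detection and fault recovery for nodes and application software are studied and analyzed, several flexible and effective software fault-tolerance strategies and designs are proposed, and a system prototype is designed and implemented on a four-node multi-computer hardware architecture running the RTEMS operating system. Results from running the prototype show that the scheme effectively improves the reliability of embedded real-time systems.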
14.
Journal of Network and Computer Applications, 2012, 35(3): 982-991
Today's technology evolution provides users with inexpensive and powerful computer systems. However, it is argued that system reliability and fault tolerance are necessary in these systems as well. A proper design for a reliable and fault-tolerant computer system requires a trade-off among cost, reliability, and availability. In this paper, we propose a low-cost recovery scheme for reliable system performance. This approach completely eliminates the roll-back overhead on branch misprediction. The instruction fetcher does not stop; it fetches instructions from the correct path immediately after the misprediction is detected. The approach thus prevents a processor from flushing the pipeline, even under branch misprediction, by allowing the instruction fetcher to work continuously. Our approach reduces the branch misprediction penalty to achieve reliable system performance. It instantly reconstructs the map table for any mispredicted branch and outperforms the conventional RMT by an average of 10.93%.
15.
16.
Computers & Electrical Engineering, 2014, 40(7): 2204-2215
This paper is concerned with observer-based optimal fault-tolerant control for an offshore steel jacket platform. The dynamic characteristics of the actuator faults under consideration are formulated by an exogenous system. Based on the designed dynamic fault observer, a feedforward and feedback optimal fault-tolerant controller is developed to improve the reliability of the offshore platform. The controller can be obtained by solving an algebraic Riccati equation and Sylvester equations. Simulation results show that the proposed control scheme effectively guarantees the reliability of the offshore platform under actuator faults. The vibration amplitudes of the displacement and velocity of the offshore platform, as well as the required control force, can be reduced significantly under the proposed fault-tolerant controller.
17.
Control Engineering Practice, 2002, 10(8): 801-817
Stimulated by the growing demand for improving system performance and reliability, fault-tolerant system design has been receiving significant attention. This paper proposes a new fault-tolerant control methodology using adaptive estimation and control approaches based on the learning capabilities of neural networks or fuzzy systems. On-line approximation-based stable adaptive neural/fuzzy control is studied for a class of input–output feedback linearizable time-varying nonlinear systems. This class of systems is large enough that it is of practical applicability as well as theoretical interest. Moreover, the fault-tolerance ability of the adaptive controller has been further improved by exploiting information estimated from a fault-diagnosis unit designed by interfacing multiple models with an expert supervisory scheme. Simulation examples for a fault-tolerant jet engine control problem are given to demonstrate the effectiveness of the proposed scheme.
18.
Many workstation-based distributed systems allow programs to be executed on remote idling machines for effective utilization of system resources. Usually, the control policies in these systems force a remote job to be discontinued on the arrival of local jobs, to guarantee the autonomy of individual workstations. Therefore, one special concern in the design of such systems is fault tolerance for the execution of remote jobs. In this paper we discuss two control policies for workstation-based distributed systems, a checkpointing policy and a non-checkpointing policy, which support fault-tolerant execution of remote jobs on idling workstations. An analysis of the reliability and mean turnaround time of remote job execution is conducted for both control policies. The optimal time interval between checkpoints in the checkpointing policy is formulated based on the given reliability and overhead of the system. In addition, several sample results derived from these analyses are compared with the outcome of corresponding simulation programs. Some observations on the fault-tolerant features of each control policy are then presented as guidelines for the future development of such workstation-based distributed systems.
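The difference between the two policies can be illustrated with a textbook restart model; this is a sketch under stated assumptions, not the analysis from the paper. It assumes that interruptions arrive as a Poisson process with a constant rate, that an interrupted segment restarts immediately from its beginning, and that each checkpointed segment pays a fixed checkpoint cost. Under those assumptions the expected time to finish a segment of uninterrupted length t is (e^(rate·t) − 1) / rate. All function and parameter names are hypothetical.

```python
import math


def expected_segment_time(length: float, rate: float) -> float:
    """Expected time to complete `length` units of uninterrupted work
    when interruptions arrive at Poisson rate `rate` and each
    interruption restarts the segment from its beginning."""
    return math.expm1(rate * length) / rate


def expected_turnaround(work: float, rate: float,
                        segments: int = 1,
                        checkpoint_cost: float = 0.0) -> float:
    """Expected turnaround of a job split into equal segments.

    segments=1 with checkpoint_cost=0 models the non-checkpointing
    policy; otherwise every segment ends with one checkpoint write.
    """
    per_segment = work / segments + checkpoint_cost
    return segments * expected_segment_time(per_segment, rate)


if __name__ == "__main__":
    # Hypothetical job: 10 time units of work, 0.5 interruptions per unit.
    print(expected_turnaround(10.0, 0.5))
    print(expected_turnaround(10.0, 0.5, segments=10, checkpoint_cost=0.1))
```

In this model the non-checkpointing turnaround grows exponentially with job length, while checkpointing trades a fixed per-segment cost for a much shorter amount of lost work per interruption.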
19.