共查询到20条相似文献,搜索用时 109 毫秒
1.
以异构系统的过程间相关性分析为基础,研究分析异构系统硬件故障在软件之中的传播行为,指导优化基于异构系统的应用级checkpointing检查点保存问题,并通过实验验证其可行性及性能,对异构系统的容错优化研究具有重大意义. 相似文献
2.
任务并行程序设计模型已成为并行程序设计的主流,其通过发掘任务并行性来提高并行计算机的系统性能.提出一种支持容错的任务并行程序设计模型,将容错技术融入到任务并行程序设计模型中,在保证性能的同时提高系统可靠性.该模型以任务为调度、执行、错误检测与恢复的基本单位,在应用级实现容错支持.采用一种Buffer-Commit计算模型支持瞬时错误的检测与恢复;采用应用级无盘检查点实现节点故障类型永久错误的恢复;采用一种支持容错的工作窃取任务调度策略获得动态负载均衡.实验结果表明,该模型以较低的性能开销提供了对硬件错误的容错支持. 相似文献
3.
《IEEE transactions on pattern analysis and machine intelligence》1983,(3):355-364
Real-time systems often have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be expensive and the cost is exacerbated for systems which utilize concurrent processes. The concurrency present in most real-time systems and the further difficulties introduced by timing constraints suggest that providing tolerance for software faults may be inordinately expensive or complex. We believe that this need not be the case, and propose a straightforward pragmatic approach to software fault tolerance'which is believed to be applicable to many real-time systems. The approach takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced. Responses to each type of error are proposed which allow service to be maintained. 相似文献
4.
《IEEE transactions on pattern analysis and machine intelligence》1985,(12):1502-1510
In order to assess the effectiveness of software fault tolerance techniques for enhancing the reliability of practical systems, a major experimental project has been conducted at the University of Newcastle upon Tyne. Techniques were developed for, and applied to, a realistic implementation of a real-time system (a naval command and control system). Reliability data were collected by operating this system in a simulated tactical environment for a variety of action scenarios. This paper provides an overview of the project and presents the results of three phases of experimentation. An analysis of these results shows that use of the software fault tolerance approach yielded a substantial improvement in the reliability of the command and control system. 相似文献
5.
本文主要给出现有主流软件容错技术的一个综述。首先从传统软件容错技术开始,介绍设计多样性和数据多样性;然后介绍主流的软件容错新技术,如重配置与重恢复、指令复制错误探测、SWIFT等,同时,站在软件容错用于处理嵌入式系统硬件暂态故障的角度对这些技术进行了分析;最后在对它们比较的基础上探讨软件容错技术的可能发展方
向。 相似文献
向。 相似文献
6.
为了提高嵌入式系统在恶劣环境下的可靠性,除了在硬件上采用诸如双机冷备份之类的容错方案外,在实时操作系统级提供软件容错处理功能既可以减小硬件资源开销,又可以在不影响系统工作效率的前提下明显提高系统的容错纠错能力.本文针对RTEMS实时操作系统缺乏软件容错支持功能的不足,在操作系统级设计了一套两级软件容错的方案,提高了嵌入式系统的可靠性. 相似文献
7.
8.
9.
基于软件故障注入模型的容错软件可靠性评测 总被引:2,自引:0,他引:2
为了灵活准确地用故障注入技术对容错软件进行可靠性评测,通过对故障注入及容错软件可靠性评测的分析,采用分布式结构,提出了一个动态生成一静态存储一动态触发的故障注入模型,它将故障生成和故障触发分开在不同的机子上实现,从而在保证评测准确性的前提下,解决了故障需求复杂、故障生成困难及目标系统额外负载过大等问题,实现了一个较为理想的故障注入模型;最后,通过在航天某型号容错软件上的试验,证明了该模型的可行性。 相似文献
10.
Grid computing emerges as effective technologies to couple geographically dis-tributed resources and solve large-scale computational problems in wide area networks. The fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems. Unreliable fault detection is one of the most effective techniques. Globus as a grid middleware manages resources in a wide area network. The Globns fault detection service uses the well-known techniques basedon unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in agrid system, and a convenient toolkit is also needed to maintain the consistency in the grid. Afault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus faultdetection service is presented in this paper. The platform offers effective strategies in such threeaspects as grid key components, user tasks, and high-level applications. 相似文献
11.
Web服务的一个优点就是可以通过基本服务组合形成更为复杂的服务。为了确保Web服务组合的可靠性,可以利用软件容错技术来提高服务组合的可靠性。针对BPEL流程形式描述的组合服务,本文提出了一种利用软件容错模式增强组合服务可靠性的方法,并利用随机回报网模型度量组合服务的可靠性。 相似文献
12.
13.
软件双冗余容错系统的容错能力和性能分析 总被引:1,自引:0,他引:1
双冗余是比较常用的冗余容错设计方法.软件双冗余容错系统通过冗余执行完成相同功能的两个软件副本,并检查它们的结果,根据两者结果是否一致来判断是否出现了错误.建立了软件双冗余容错系统的运行时模型,并引入了软件双冗余容错系统的容错能力的概念.根据该模型分析了单个软件副本的容错能力对软件双冗余容错系统的容错能力和性能的影响.分析结果显示,提高单个软件副本的容错能力不仅能够提高软件双冗余容错系统的容错能力,还能够提高系统的性能.但在极端情况下,双冗余容错系统的容错能力也可能会小于单个软件副本的容错能力. 相似文献
14.
Shye Alex Blomstedt Joseph Moseley Tipp Reddi Vijay Janapa Connors Daniel A. 《Dependable and Secure Computing, IEEE Transactions on》2009,6(2):135-148
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries. 相似文献
15.
Increasing the reliability of continuous process control systems means choosing a fault tolerance technique that matches computer hardware capabilities, as well as applications. 相似文献
16.
Barker W. Halliday D.M. Thoma Y. Sanchez E. Tempesti G. Tyrrell A.M. 《Evolutionary Computation, IEEE Transactions on》2007,11(5):666-684
Fault tolerance is a crucial operational aspect of biological systems and the self-repair capabilities of complex organisms far exceeds that of even the most advanced electronic devices. While many of the processes used by nature to achieve fault tolerance cannot easily be applied to silicon-based systems, in this paper we show that mechanisms loosely inspired by the operation of multicellular organisms can be transported to electronic systems to provide self-repair capabilities. Features such as dynamic routing, reconfiguration, and on-chip reprogramming can be invaluable for the realization of adaptive hardware systems and for the design of highly complex systems based on the kind of unreliable components that are likely to be introduced in the not-too-distant future. In this paper, we describe the implementation of fault tolerant features that address error detection and recovery through dynamic routing, reconfiguration, and on-chip reprogramming in a novel application specific integrated circuit. We take inspiration from three biological models: phylogenesis, ontogenesis, and epigenesis (hence the POE in POEtic). As in nature, our approach is based on a set of separate and complementary techniques that exploit the novel mechanisms provided by our device in the particular context of fault tolerance. 相似文献
17.
计算机控制系统的容错技术 总被引:1,自引:0,他引:1
计算机控制系统的可靠性设计是实现柔性智能控制所面临的一个重要课题,而容错技术是系统可靠性设计的关键技术。本文在分析计算机系统可靠性设计的基础上,综述了容错技术的发展、研究的主要内容及实现的主要方法,对常用的几种容错结构进行了比较和评价。指出了对计算机系统进行容错设计必须解决的主要问题。 相似文献
18.
拜占庭容错算法是一类能够容忍各种形式的软件错误和安全漏洞的容错算法,对云计算的可靠性保障有着重要意义与其他容错算法相比,拜占庭容错算法稳定性更高,但是其性能表现低下,不能满足当前系统对高吞吐、低延时的需求在网计算是一种以数据为中心的体系结构,它用网络承担部分计算功能,使数据在流动过程中获得处理,从而提高系统性能为解决拜... 相似文献
19.
电源中的容错技术 总被引:2,自引:1,他引:1
张谷勋 《计算机与数字工程》1999,27(1):57-61,68
为了满足用户高可靠,不停电供电的需求,作者提出了可以将大型电子系统的供电电源视作一个实时控制系统的概念,在电源设计中采用容错技术。 相似文献