Similar literature
 20 similar articles found.
1.
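贾佳, 杨学军. 《软件学报》 (Journal of Software), 2011, 22(12): 2853-2865
Building on interprocedural dependence analysis of heterogeneous systems, this paper studies how hardware faults in such systems propagate through software, uses the results to guide the optimization of application-level checkpointing on heterogeneous systems, and verifies the feasibility and performance of the approach experimentally. The work is significant for research on fault tolerance optimization in heterogeneous systems.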

2.
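王一拙, 陈旭, 计卫星, 苏岩, 王小军, 石峰. 《软件学报》 (Journal of Software), 2016, 27(7): 1789-1804
The task-parallel programming model has become the mainstream of parallel programming; it improves the performance of parallel computers by exploiting task parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault tolerance techniques into task-parallel programming, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and implements fault tolerance support at the application level. A Buffer-Commit computation model supports detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance for hardware errors at low performance overhead.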

3.
Real-time systems often have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be expensive and the cost is exacerbated for systems which utilize concurrent processes. The concurrency present in most real-time systems and the further difficulties introduced by timing constraints suggest that providing tolerance for software faults may be inordinately expensive or complex. We believe that this need not be the case, and propose a straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time systems. The approach takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced. Responses to each type of error are proposed which allow service to be maintained.

4.
In order to assess the effectiveness of software fault tolerance techniques for enhancing the reliability of practical systems, a major experimental project has been conducted at the University of Newcastle upon Tyne. Techniques were developed for, and applied to, a realistic implementation of a real-time system (a naval command and control system). Reliability data were collected by operating this system in a simulated tactical environment for a variety of action scenarios. This paper provides an overview of the project and presents the results of three phases of experimentation. An analysis of these results shows that use of the software fault tolerance approach yielded a substantial improvement in the reliability of the command and control system.

5.
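This paper surveys current mainstream software fault tolerance techniques. It starts with traditional techniques, introducing design diversity and data diversity. It then presents newer mainstream techniques, such as reconfiguration and recovery, error detection by instruction duplication, and SWIFT, and analyses them from the perspective of using software fault tolerance to handle transient hardware faults in embedded systems. Finally, on the basis of a comparison of these techniques, it discusses possible directions for the development of software fault tolerance.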

6.
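Software fault tolerance design in RTEMS embedded systems   Cited by: 1 (self-citations: 0, by others: 1)
To improve the reliability of embedded systems in harsh environments, hardware schemes such as dual-machine cold standby can be complemented by software fault tolerance at the real-time operating system level. Doing so reduces hardware resource overhead and markedly improves the system's ability to tolerate and correct errors without affecting its operating efficiency. To address the lack of software fault tolerance support in the RTEMS real-time operating system, this paper designs a two-level software fault tolerance scheme at the operating system level, which improves the reliability of the embedded system.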

7.
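A fault tolerance configuration management method for middleware services   Cited by: 1 (self-citations: 0, by others: 1)
李军国, 黄罡, 邹键, 梅宏. 《计算机学报》 (Chinese Journal of Computers), 2007, 30(10): 1696-1704
This paper proposes a fault tolerance management method based on runtime software architecture, which lets developers and administrators customize suitable fault detection and repair mechanisms for different middleware service failures. First, the runtime software architecture automatically constructs a component dependency view and an error propagation view, providing a global view for understanding and analysing the reliability of the whole system. Next, fault tolerance mechanisms are configured by manipulating the runtime software architecture. Finally, AOP techniques are used to weave the fault tolerance mechanisms into the middleware, giving it the specified fault tolerance capability. The process is carried out semi-automatically with the aid of a visual tool and has been validated on J2EE middleware.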

8.
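Software fault avoidance is one of the main approaches to improving software reliability; it includes techniques such as program verification, testing, and correctness proofs. However, as…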

9.
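Reliability evaluation of fault-tolerant software based on a software fault injection model   Cited by: 2 (self-citations: 0, by others: 2)
To evaluate the reliability of fault-tolerant software flexibly and accurately using fault injection, this paper analyses fault injection and the reliability evaluation of fault-tolerant software and, adopting a distributed structure, proposes a fault injection model built on dynamic generation, static storage, and dynamic triggering. The model performs fault generation and fault triggering on separate machines. While preserving evaluation accuracy, this addresses the problems of complex fault requirements, difficult fault generation, and excessive extra load on the target system, yielding a fairly satisfactory fault injection model. Finally, experiments on the fault-tolerant software of an aerospace system demonstrate the feasibility of the model.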

10.
Fault-tolerant grid architecture and practice   Cited by: 10 (self-citations: 0, by others: 10)
Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems in wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems. Unreliable fault detection is one of the most effective techniques. Globus, as a grid middleware, manages resources in a wide area network. The Globus fault detection service uses the well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. A fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service is presented in this paper. The platform offers effective strategies in three aspects: grid key components, user tasks, and high-level applications.

11.
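Using software fault tolerance techniques to improve the reliability of Web service composition   Cited by: 1 (self-citations: 0, by others: 1)
One advantage of Web services is that basic services can be composed into more complex services. Software fault tolerance techniques can be used to ensure the reliability of such compositions. For composite services described as BPEL processes, this paper proposes a method that uses software fault tolerance patterns to enhance the reliability of composite services, and uses a stochastic reward net model to measure that reliability.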

12.
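Modeling and analysis of a time-based software rejuvenation policy   Cited by: 3 (self-citations: 0, by others: 3)
To address the aging that commonly occurs in software during continuous operation, this paper proposes a nested, time-based software rejuvenation policy. The Petri net model of the rejuvenation process is analysed and solved, yielding the optimal sequence of rejuvenation intervals and the optimal number of application-level rejuvenations. The policy considers both application-level and system-level rejuvenation, which further reduces rejuvenation time, lowers both the rejuvenation cost and the risk of prediction failure in periodic application-level rejuvenation policies, and improves system reliability. For more complex systems, the policy can additionally nest process-level rejuvenation, which makes it extensible to some degree.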

13.
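Analysis of the fault tolerance capability and performance of software dual-redundancy fault-tolerant systems   Cited by: 1 (self-citations: 0, by others: 1)
Dual redundancy is a commonly used redundancy-based fault tolerance design method. A software dual-redundancy fault-tolerant system redundantly executes two software copies that perform the same function, checks their results, and decides whether an error has occurred according to whether the two results agree. This paper builds a runtime model of software dual-redundancy fault-tolerant systems and introduces the concept of the fault tolerance capability of such a system. Using the model, it analyses how the fault tolerance capability of a single software copy affects the fault tolerance capability and performance of the dual-redundancy system. The analysis shows that improving the fault tolerance capability of a single copy improves not only the fault tolerance capability of the dual-redundancy system but also its performance. In extreme cases, however, the fault tolerance capability of the dual-redundancy system may be lower than that of a single copy.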
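
The compare-two-copies check described in this abstract can be sketched roughly as follows. This is only an illustrative sketch, not the runtime model from the paper: the names `run_dual_redundant`, `ReplicaMismatch`, `square_by_multiply`, and `square_by_power` are hypothetical, and the two copies are assumed to be deterministic functions of the same input.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class ReplicaMismatch(Exception):
    """Raised when the two redundant copies return different results."""


def run_dual_redundant(
    copy_a: Callable[[T], R], copy_b: Callable[[T], R], value: T
) -> R:
    # Execute both copies on the same input (hypothetical helper).
    result_a = copy_a(value)
    result_b = copy_b(value)
    # Disagreement signals that at least one copy produced an error.
    if result_a != result_b:
        raise ReplicaMismatch(f"copies disagree: {result_a!r} != {result_b!r}")
    return result_a


def square_by_multiply(x: int) -> int:
    return x * x


def square_by_power(x: int) -> int:
    return x ** 2


if __name__ == "__main__":
    print(run_dual_redundant(square_by_multiply, square_by_power, 7))
```

A check of this kind only detects errors that make the two results differ; if both copies produce the same wrong result, the comparison passes.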

14.
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.

15.
Mcgill W.F., Smith S.E. IEEE Micro, 1984, 4(6): 22-33
Increasing the reliability of continuous process control systems means choosing a fault tolerance technique that matches computer hardware capabilities, as well as applications.

16.
Fault Tolerance Using Dynamic Reconfiguration on the POEtic Tissue   Cited by: 1 (self-citations: 0, by others: 1)
Fault tolerance is a crucial operational aspect of biological systems and the self-repair capabilities of complex organisms far exceed those of even the most advanced electronic devices. While many of the processes used by nature to achieve fault tolerance cannot easily be applied to silicon-based systems, in this paper we show that mechanisms loosely inspired by the operation of multicellular organisms can be transported to electronic systems to provide self-repair capabilities. Features such as dynamic routing, reconfiguration, and on-chip reprogramming can be invaluable for the realization of adaptive hardware systems and for the design of highly complex systems based on the kind of unreliable components that are likely to be introduced in the not-too-distant future. In this paper, we describe the implementation of fault-tolerant features that address error detection and recovery through dynamic routing, reconfiguration, and on-chip reprogramming in a novel application-specific integrated circuit. We take inspiration from three biological models: phylogenesis, ontogenesis, and epigenesis (hence the POE in POEtic). As in nature, our approach is based on a set of separate and complementary techniques that exploit the novel mechanisms provided by our device in the particular context of fault tolerance.

17.
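Fault tolerance techniques for computer control systems   Cited by: 1 (self-citations: 0, by others: 1)
Reliability design of computer control systems is an important issue in realizing flexible intelligent control, and fault tolerance is the key technology for system reliability design. Based on an analysis of reliability design for computer systems, this paper surveys the development of fault tolerance technology, its main research topics, and the main implementation methods, and compares and evaluates several commonly used fault-tolerant structures. It also identifies the main problems that must be solved in the fault-tolerant design of computer systems.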

18.
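Byzantine fault-tolerant algorithms are a class of fault-tolerant algorithms that can tolerate all forms of software errors and security vulnerabilities, and they are important for ensuring the reliability of cloud computing. Compared with other fault-tolerant algorithms, Byzantine fault-tolerant algorithms are more stable, but their performance is poor and cannot meet the demands of current systems for high throughput and low latency. In-network computing is a data-centric architecture in which the network takes on part of the computation, so that data is processed as it flows, which improves system performance. To address Byzantine...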

19.
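Fault tolerance techniques in power supplies   Cited by: 2 (self-citations: 1, by others: 1)
To meet users' demand for highly reliable, uninterrupted power, the authors propose treating the power supply of a large electronic system as a real-time control system and applying fault tolerance techniques in power supply design.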

20.
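姚兰, 桂勋, 巨军让. 《计算机工程》 (Computer Engineering), 2007, 33(6): 83-85,1
As hardware fault tolerance matures, software fault tolerance has become a hot topic in improving system reliability. Developing fault-tolerant applications directly is very difficult. Because middleware provides a good development environment for application systems, this paper studies and designs a middleware-based fault-tolerant system model and proposes a new method for constructing a node fault-tolerant structure, forming a fairly complete solution to key fault tolerance problems such as redundancy, failure detection, and recovery. The system reliability is computed using a Markov process, which verifies the soundness and reliability of the system design.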
