Similar literature
 Found 20 similar documents; search took 0 ms
1.
In order to assess the effectiveness of software fault tolerance techniques for enhancing the reliability of practical systems, a major experimental project has been conducted at the University of Newcastle upon Tyne. Techniques were developed for, and applied to, a realistic implementation of a real-time system (a naval command and control system). Reliability data were collected by operating this system in a simulated tactical environment for a variety of action scenarios. This paper provides an overview of the project and presents the results of three phases of experimentation. An analysis of these results shows that use of the software fault tolerance approach yielded a substantial improvement in the reliability of the command and control system.

2.
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software, whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

3.
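This paper surveys the current mainstream software fault tolerance techniques. It starts with traditional software fault tolerance techniques, introducing design diversity and data diversity; it then introduces newer mainstream techniques, such as reconfiguration and recovery, error detection by duplicated instructions, and SWIFT, and analyzes them from the standpoint of using software fault tolerance to handle transient hardware faults in embedded systems; finally, on the basis of a comparison of these techniques, it discusses possible directions for the development of software fault tolerance.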

4.
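Analysis of the fault tolerance capability and performance of software dual-redundancy fault-tolerant systems   Cited by: 1 (self-citations: 0, citations by others: 1)
Dual redundancy is a commonly used redundant fault-tolerant design method. A software dual-redundancy fault-tolerant system redundantly executes two software replicas that perform the same function, checks their results, and decides whether an error has occurred according to whether the two results agree. A runtime model of such systems is established, and the concept of the fault tolerance capability of a software dual-redundancy fault-tolerant system is introduced. Based on this model, the influence of the fault tolerance capability of a single replica on the fault tolerance capability and performance of the dual-redundancy system is analyzed. The results show that improving the fault tolerance capability of a single replica improves not only the fault tolerance capability of the dual-redundancy system but also its performance. In extreme cases, however, the fault tolerance capability of the dual-redundancy system may be lower than that of a single replica.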
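The duplicate-and-compare idea described in this abstract can be sketched as follows. This is only a minimal illustration, not the runtime model from the paper: the names `run_duplex`, `square`, and `faulty_square` are hypothetical, and the replicas are plain functions rather than separately executing software copies.

```python
from typing import Callable, Tuple


def run_duplex(replica_a: Callable[[int], int],
               replica_b: Callable[[int], int],
               x: int) -> Tuple[bool, int]:
    """Run two replicas of the same function on the same input.

    Returns (agreed, result_of_a). A disagreement signals that at least
    one replica produced an erroneous result; agreement does not prove
    correctness, because both replicas could fail in the same way.
    """
    result_a = replica_a(x)
    result_b = replica_b(x)
    return (result_a == result_b, result_a)


def square(x: int) -> int:
    """Reference replica (hypothetical example function)."""
    return x * x


def faulty_square(x: int) -> int:
    """Replica with an injected fault for the input 3 (illustration only)."""
    return x * x + (1 if x == 3 else 0)


if __name__ == "__main__":
    print(run_duplex(square, square, 4))
    print(run_duplex(square, faulty_square, 3))
```

Comparison alone only detects a mismatch; it cannot tell which replica is wrong, and identical wrong results from both replicas go unnoticed.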

5.
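This paper discusses system hierarchy models built from two different perspectives, system partitioning and function, and on that basis analyzes how these two hierarchy models guide the construction of fault models and fault tolerance mechanism models.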

6.
林星  沈奇威  王纯 《计算机系统应用》2012,21(4):111-114,104
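An adaptive fault tolerance model is designed for a workflow subsystem that automatically selects a fault tolerance strategy according to the type of workflow exception: unrecoverable exceptions are handled with a strategy based on a transaction compensation mechanism, while recoverable exceptions are handled with an automatic recovery strategy. The message queue, transaction compensation mechanism, and automatic recovery mechanism used by the model are described in detail.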
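A minimal sketch of selecting a strategy by exception type, under stated assumptions: the exception classes, `run_step`, `make_flaky`, and the retry limit are all hypothetical, automatic recovery is modeled as a simple bounded retry, and the message queue mentioned in the abstract is left out.

```python
class RecoverableError(Exception):
    """Hypothetical exception type for faults that a retry may fix."""


class UnrecoverableError(Exception):
    """Hypothetical exception type for faults that need compensation."""


def run_step(action, compensate, max_retries=3):
    """Run one workflow step and pick a strategy by exception type.

    Recoverable errors are retried up to max_retries times (standing in
    for automatic recovery); unrecoverable errors, or exhausted retries,
    trigger the compensating action that undoes the step's effects.
    """
    attempts = 0
    while True:
        try:
            return ("ok", action())
        except RecoverableError:
            attempts += 1
            if attempts > max_retries:
                compensate()
                return ("compensated", None)
        except UnrecoverableError:
            compensate()
            return ("compensated", None)


def make_flaky(failures):
    """Build an action that fails `failures` times and then succeeds."""
    state = {"left": failures}

    def action():
        if state["left"] > 0:
            state["left"] -= 1
            raise RecoverableError("transient fault")
        return "done"

    return action


if __name__ == "__main__":
    undo_log = []
    print(run_step(make_flaky(2), lambda: undo_log.append("undo")))
```

Keeping the compensating action as a separate callable means the same dispatch logic can serve steps with very different undo procedures.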

7.
8.
Emmerson  R. Mcgowan  M.J. 《Micro, IEEE》1984,4(6):34-43
This quad-modular redundant system offers a cost-effective alternative for supporting fault tolerance by incorporating hardware/software independence and five redundancy mechanisms to correct both transient and permanent errors.

9.
10.
The increasing cost of computer system failure has stimulated interest in improving software reliability. One way to do this is by adding redundant structural data to data structures. Such redundancy can be used to detect and correct (structural) errors in instances of a data structure. The intuitive approach of this paper, which makes heavy use of examples, is complemented by the more formal development of the companion paper, "Redundancy in Data Structures: Some Theoretical Results."
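One way to picture redundant structural data is a doubly linked list that also stores an element count: the back pointers and the count are redundant with the forward chain, so they can be cross-checked and, in simple cases, rebuilt. The sketch below is an illustration under that assumption, not the paper's own example; `CheckedList` and its methods are hypothetical names.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None  # redundant: derivable from the forward chain


class CheckedList:
    def __init__(self):
        self.head = None
        self.tail = None
        self.count = 0  # redundant: derivable by traversal

    def append(self, value):
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
            node.prev = self.tail
        self.tail = node
        self.count += 1
        return node

    def check(self):
        """Return True if forward chain, back pointers, tail and count agree."""
        seen = 0
        prev = None
        node = self.head
        # The bound on `seen` stops the walk if a corrupted pointer forms a cycle.
        while node is not None and seen <= self.count:
            if node.prev is not prev:
                return False
            prev = node
            node = node.next
            seen += 1
        return node is None and seen == self.count and prev is self.tail

    def repair_back_pointers(self):
        """Rebuild prev pointers and tail, assuming next pointers and count are intact."""
        prev, node = None, self.head
        for _ in range(self.count):
            node.prev = prev
            prev, node = node, node.next
        self.tail = prev


if __name__ == "__main__":
    lst = CheckedList()
    nodes = [lst.append(v) for v in (1, 2, 3)]
    nodes[1].prev = None          # inject a structural error
    print(lst.check())
    lst.repair_back_pointers()
    print(lst.check())
```

The repair step works only because a single kind of damage is assumed; with damage to both forward and backward pointers, more redundancy would be needed.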

11.
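Workstation clusters have become one of the mainstream directions in the development of distributed parallel processing. As the application domains of cluster systems gradually expand and their scale keeps growing, the demands on their reliability rise steadily. Designing highly reliable cluster systems requires a focus on system fault tolerance techniques. This paper mainly discusses fault tolerance and recovery of processes in Linux cluster distributed systems, focusing on the key user-level techniques of checkpointing, rollback, and process migration.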

12.
Mcgill  W.F. Smith  S.E. 《Micro, IEEE》1984,4(6):22-33
Increasing the reliability of continuous process control systems means choosing a fault tolerance technique that matches computer hardware capabilities, as well as applications.

13.
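This paper studies the calculation of failure rates in redundant designs, analyzes the influence of common causes among multiple redundant channels on the failure rate, and proves theoretically that the system failure rate can be reduced effectively only by sufficiently reducing the common failure factors of the redundant channels.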
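The effect of common causes can be illustrated numerically with the widely used beta-factor model. This is an assumption made for illustration; the abstract does not say which model the paper uses. Here `p` is the failure probability of one channel, `n` the number of parallel channels, and `beta` the fraction of channel failures attributed to a common cause; the function name is hypothetical.

```python
def parallel_failure_probability(p: float, n: int, beta: float) -> float:
    """Failure probability of n parallel channels under a beta-factor model.

    Each channel fails with probability p. A fraction beta of that is a
    common-cause failure that takes out all channels at once; the rest,
    (1 - beta) * p, is independent per channel. The system fails if the
    common cause strikes or if all n channels fail independently.
    """
    common = beta * p
    independent = ((1.0 - beta) * p) ** n
    return 1.0 - (1.0 - common) * (1.0 - independent)


if __name__ == "__main__":
    for channels in (2, 3, 5):
        print(channels, parallel_failure_probability(0.1, channels, 0.1))
```

In this model the result can never drop below `beta * p` however many channels are added, which matches the abstract's point that only reducing the common failure factors lowers the system failure rate effectively.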

14.
15.
Raghavendra  C.S. 《Micro, IEEE》1984,4(6):44-53
Computer systems with large numbers of processors pose serious reliability problems. One solution is to build redundant communication paths and dynamic reconfiguration into network designs.

16.
As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.

17.
Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance work that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, however, for the outer product version algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.
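The checksum relationship for the outer product algorithm can be checked on a small example. The sketch below is a single-process illustration in plain Python with integer matrices, not the ScaLAPACK implementation from the paper; all function names are hypothetical. The left input gets an extra row of column sums, the right input an extra column of row sums, and the partial product is tested after every rank-1 update.

```python
def with_column_checksum(a):
    """Append a row holding the column sums of a."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b its own sum."""
    return [row[:] + [sum(row)] for row in b]


def is_full_checksum(c):
    """True if the last row holds the column sums and the last column the row sums."""
    body = c[:-1]
    cols_ok = all(sum(r[j] for r in body) == c[-1][j] for j in range(len(c[0])))
    rows_ok = all(sum(r[:-1]) == r[-1] for r in c)
    return cols_ok and rows_ok


def outer_product_multiply(ac, br):
    """Multiply checksum matrices by accumulating rank-1 (outer product) updates.

    Returns the product and, for each step, whether the partial result
    still satisfied the checksum relationship.
    """
    rows, cols, inner = len(ac), len(br[0]), len(br)
    c = [[0] * cols for _ in range(rows)]
    history = []
    for s in range(inner):
        for i in range(rows):
            for j in range(cols):
                c[i][j] += ac[i][s] * br[s][j]
        history.append(is_full_checksum(c))
    return c, history


def recover_element(c, i, j):
    """Rebuild a lost body element c[i][j] from the column checksum."""
    return c[-1][j] - sum(c[r][j] for r in range(len(c) - 1) if r != i)


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    c, history = outer_product_multiply(with_column_checksum(a), with_row_checksum(b))
    print(history)
    print(recover_element(c, 0, 1))
```

Integers keep the comparisons exact; with floating-point data the equality tests would need a tolerance. Because the relationship holds after every update, a lost element can be rebuilt from the surviving ones at any step rather than only at the end.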

18.
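Fault tolerance techniques in power supplies   Cited by: 2 (self-citations: 1, citations by others: 1)
To meet users' requirements for highly reliable, uninterrupted power supply, the author proposes the concept that the power supply of a large electronic system can be regarded as a real-time control system, and adopts fault tolerance techniques in power supply design.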

19.
A fault tolerance configuration management method for middleware services   Cited by: 1 (self-citations: 0, citations by others: 1)
李军国  黄罡  邹键  梅宏 《计算机学报》2007,30(10):1696-1704
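A fault tolerance management method based on runtime software architecture is proposed, which supports developers and administrators in customizing suitable fault detection and repair mechanisms for different middleware service failures. First, the runtime software architecture automatically constructs a component dependency view and an error propagation view, providing a global view for understanding and analyzing the reliability of the whole system; then, fault tolerance mechanisms are configured by operating on the runtime software architecture; finally, AOP techniques are used to instrument the fault tolerance mechanisms into the middleware so that it acquires the specified fault tolerance capability. The process is carried out semi-automatically with the aid of a visual tool and has been validated on J2EE middleware.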

20.
We propose a distributed dictionary that allows insert and search operations and that tolerates arbitrary single server crashes. The distinctive feature of our model is that the crash of a server cannot be detected. This is in contrast to all other proposals of distributed fault-tolerant search structures presented thus far. It reflects the real situation in the internet more accurately, and is in general more suitable to complex overall conditions. This makes our solution fundamentally different from all previous ones, but also more complicated. We present in detail the algorithms for searching, insertion, and graceful recovery of crashed servers.
