
1.

《IEEE transactions on pattern analysis and machine intelligence》1985,(12):1502-1510

In order to assess the effectiveness of software fault tolerance techniques for enhancing the reliability of practical systems, a major experimental project has been conducted at the University of Newcastle upon Tyne. Techniques were developed for, and applied to, a realistic implementation of a real-time system (a naval command and control system). Reliability data were collected by operating this system in a simulated tactical environment for a variety of action scenarios. This paper provides an overview of the project and presents the results of three phases of experimentation. An analysis of these results shows that use of the software fault tolerance approach yielded a substantial improvement in the reliability of the command and control system.

2.

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance

Haines Joshua, Lakamraju Vijay, Koren Israel, Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

3.

Software Fault Tolerance: Techniques and Prospects


孙鹏, 赵军锁, 张文君 《计算机工程与科学》2007,29(8):88-93

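This paper surveys current mainstream software fault tolerance techniques. It begins with traditional techniques, introducing design diversity and data diversity. It then covers newer mainstream techniques such as reconfiguration and recovery, error detection by instruction duplication, and SWIFT, and analyzes them from the perspective of using software fault tolerance to handle transient hardware faults in embedded systems. Finally, based on a comparison of these techniques, it discusses possible directions for the development of software fault tolerance.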

4.

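Analysis of the Fault Tolerance Capability and Performance of Software Dual-Redundancy Fault-Tolerant Systems (total citations: 1, self-citations: 0, citations by others: 1)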

吴斌, 高珑 《计算机研究与发展》2009,46(Z2)

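Dual redundancy is a commonly used redundancy-based fault-tolerant design method. A software dual-redundancy fault-tolerant system redundantly executes two software replicas that perform the same function, checks their results, and decides whether an error has occurred according to whether the two results agree. The paper builds a runtime model of software dual-redundancy fault-tolerant systems and introduces the concept of the fault tolerance capability of such a system. Using this model, it analyzes how the fault tolerance capability of a single software replica affects the fault tolerance capability and performance of the dual-redundancy system. The analysis shows that improving the fault tolerance capability of a single replica improves not only the fault tolerance capability of the dual-redundancy system but also its performance. In extreme cases, however, the fault tolerance capability of the dual-redundancy system may be lower than that of a single replica.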
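
The compare-two-replicas check described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's runtime model: the names `run_dual_redundant`, `MismatchDetected`, and the two squaring replicas are hypothetical and chosen only for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class MismatchDetected(Exception):
    """Hypothetical error type: the two replicas disagreed."""


def run_dual_redundant(replica_a: Callable[[int], T],
                       replica_b: Callable[[int], T],
                       x: int) -> T:
    """Execute both replicas on the same input and compare their results."""
    result_a = replica_a(x)
    result_b = replica_b(x)
    if result_a != result_b:
        # Disagreement: at least one replica is wrong, but the comparison
        # alone cannot tell which one.
        raise MismatchDetected(f"{result_a!r} != {result_b!r}")
    # Agreement is accepted as correct; identical wrong answers from both
    # replicas would pass undetected.
    return result_a


def square_by_multiplication(x: int) -> int:
    return x * x


def square_by_summation(x: int) -> int:
    # The sum of the first |x| odd numbers equals x squared.
    return sum(2 * i + 1 for i in range(abs(x)))
```

Calling `run_dual_redundant(square_by_multiplication, square_by_summation, 7)` returns the agreed result, while a replica that returns a different value triggers `MismatchDetected`. The sketch detects errors only; it does not decide which replica failed.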

5.

A Hierarchical Model of Faults and Fault Tolerance Mechanisms

孙峻朝, 王建莹 《计算机工程与应用》1999,35(10):5-7

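This paper discusses system hierarchy models built from two different perspectives, system partitioning and function, and on that basis analyzes how these two hierarchy models guide the construction of fault models and fault tolerance mechanism models.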

6.

An Adaptive Fault Tolerance Model for Workflows

林星, 沈奇威, 王纯 《计算机系统应用》2012,21(4):111-114,104

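The paper designs an adaptive fault tolerance model for a workflow subsystem that automatically selects a fault tolerance strategy according to the type of workflow exception: unrecoverable exceptions are handled with a strategy based on a transaction compensation mechanism, while recoverable exceptions are handled with an automatic recovery strategy. The message queue, transaction compensation mechanism, and automatic recovery mechanism used by the model are described in detail.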
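
The strategy selection described in this abstract might look roughly like the sketch below. It is an assumption-laden illustration, not the paper's model: `RecoverableError`, `UnrecoverableError`, `Step`, `run_workflow`, and the retry limit are all hypothetical, automatic recovery is reduced to a simple retry, and the message queue is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


class RecoverableError(Exception):
    """Hypothetical class for exceptions that a retry may clear."""


class UnrecoverableError(Exception):
    """Hypothetical class for exceptions that require compensation."""


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    compensate: Callable[[], None]  # undoes the effect of `action`


def _compensate(completed: List[Step]) -> None:
    # Transaction compensation: undo completed steps in reverse order.
    for step in reversed(completed):
        step.compensate()


def run_workflow(steps: List[Step], max_retries: int = 2) -> bool:
    """Run steps in order; return True on success, False after compensation."""
    completed: List[Step] = []
    for step in steps:
        attempts = 0
        while True:
            try:
                step.action()
                completed.append(step)
                break
            except RecoverableError:
                # Automatic recovery, modelled here as a bounded retry.
                attempts += 1
                if attempts > max_retries:
                    # Assumption: exhausted retries fall back to compensation.
                    _compensate(completed)
                    return False
            except UnrecoverableError:
                _compensate(completed)
                return False
    return True
```

Treating a recoverable exception that outlives its retries as unrecoverable is a design choice of this sketch; the abstract does not say how the model handles that case.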

7.

Understanding Fault Tolerance And Reliability

Somani A.K., Vaidya N.H. 《Computer》1997,30(4):45-50


8.

Fault Tolerance Achieved in VLSI

Emmerson R., Mcgowan M.J. 《Micro, IEEE》1984,4(6):34-43

This quad-modular redundant system offers a cost-effective alternative for supporting fault tolerance by incorporating hardware/software independence and five redundancy mechanisms to correct both transient and permanent errors.

9.

Fault Tolerance by Design Diversity: Concepts and Experiments

Avizienis A., Kelly J.P.J. 《Computer》1984,17(8):67-80


10.

Redundancy in Data Structures: Improving Software Fault Tolerance

《IEEE transactions on pattern analysis and machine intelligence》1980,(6):585-594

The increasing cost of computer system failure has stimulated interest in improving software reliability. One way to do this is by adding redundant structural data to data structures. Such redundancy can be used to detect and correct (structural) errors in instances of a data structure. The intuitive approach of this paper, which makes heavy use of examples, is complemented by the more formal development of the companion paper, "Redundancy in Data Structures: Some Theoretical Results."
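
As one possible illustration of redundant structural data, the sketch below uses a doubly linked list with a stored element count. The choice of structure and the names `CheckedList`, `is_consistent`, and `repair_forward_links` are assumptions for this example and are not taken from the paper.

```python
from typing import Optional


class Node:
    def __init__(self, value: int) -> None:
        self.value = value
        self.next: Optional["Node"] = None
        self.prev: Optional["Node"] = None  # redundant: derivable from next


class CheckedList:
    """Doubly linked list with a stored element count (both redundant)."""

    def __init__(self) -> None:
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None
        self.count = 0  # redundant: derivable by walking the list

    def append(self, value: int) -> Node:
        node = Node(value)
        if self.tail is None:
            self.head = node
        else:
            self.tail.next = node
            node.prev = self.tail
        self.tail = node
        self.count += 1
        return node

    def is_consistent(self) -> bool:
        """Detect structural errors by cross-checking the redundant data."""
        seen = 0
        prev: Optional[Node] = None
        node = self.head
        # The bound on `seen` stops the walk if corruption created a cycle.
        while node is not None and seen <= self.count:
            if node.prev is not prev:
                return False
            prev, node = node, node.next
            seen += 1
        return seen == self.count and prev is self.tail

    def repair_forward_links(self) -> None:
        """Rebuild every `next` pointer by walking `prev` links from the tail.

        Assumes the `prev` links and `tail` are intact, that is, only the
        forward links were damaged.
        """
        node = self.tail
        following: Optional[Node] = None
        while node is not None:
            node.next = following
            following, node = node, node.prev
        self.head = following
```

If a single `next` pointer is overwritten, `is_consistent()` reports the damage and `repair_forward_links()` restores the forward chain from the back pointers, which is the detect-then-correct pattern the abstract refers to.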

11.

Fault Tolerance and Recovery in Cluster Systems

丁俊, 童维勤 《计算机应用》2001,21(6):90-92

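Workstation clusters have become one of the mainstream directions in the development of distributed parallel processing. As the application areas of cluster systems gradually broaden and their scale keeps growing, the demands on their reliability rise accordingly. Designing highly reliable cluster systems requires a focus on system fault tolerance techniques. This paper discusses process fault tolerance and recovery in distributed Linux cluster systems, concentrating on the key user-level techniques of checkpointing, rollback, and process migration.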

12.

Fault Tolerance in Continuous Process Control

Mcgill W.F., Smith S.E. 《Micro, IEEE》1984,4(6):22-33

Increasing the reliability of continuous process control systems means choosing a fault tolerance technique that matches computer hardware capabilities, as well as applications.

13.

Analysis of Redundancy Design for Fault Tolerance

宋晓秋, 薛小青 《计算机工程与设计》1998,19(4):11-13

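This paper studies the calculation of failure rates in redundancy design, analyzes the effect of common causes on the failure rate of multi-channel redundancy, and proves theoretically that the system failure rate can be effectively reduced only by sufficiently reducing the common failure factors of the redundant channels.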
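
A worked illustration of why common causes dominate, under a simplified model that is assumed here and not taken from the paper: each of n redundant channels fails independently with probability p, and a common-cause event that disables all channels at once occurs with probability q, independently of the individual channel failures. The system fails if the common cause occurs or if every channel fails on its own:

```latex
P_{\text{sys}} = q + (1 - q)\,p^{n},
\qquad
\lim_{n \to \infty} P_{\text{sys}} = q \quad (0 \le p < 1).
```

Under these assumptions, adding channels shrinks only the second term, so q is a floor on the system failure probability that no amount of extra redundancy removes.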

14.

Fault Tolerance Techniques for Systolic Arrays

Abraham J.A., Banerjee P., Chien-Yi Chen, Fuchs W.K., Sy-Yen Kuo, Narasimha Reddy A.L. 《Computer》1987,20(7):65-75


15.

Fault Tolerance in Regular Network Architectures

Raghavendra C.S. 《Micro, IEEE》1984,4(6):44-53

Computer systems with large numbers of processors pose serious reliability problems. One solution is to build redundant communication paths and dynamic reconfiguration into network designs.

16.

Replication-Based Fault Tolerance for MPI Applications

Walters John Paul, Chaudhary Vipin 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(7):997-1010

As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.

17.

Algorithm-Based Fault Tolerance for Fail-Stop Failures

Chen Zizhong, Dongarra Jack 《Parallel and Distributed Systems, IEEE Transactions on》2008,19(12):1628-1641

Fail-stop failures in distributed environments are often tolerated by checkpointing or message logging. In this paper, we show that fail-stop process failures in the ScaLAPACK matrix-matrix multiplication kernel can be tolerated without checkpointing or message logging. It has been proved in previous algorithm-based fault tolerance that, for matrix-matrix multiplication, the checksum relationship in the input checksum matrices is preserved at the end of the computation no matter which algorithm is chosen. From this checksum relationship in the final computation results, processor miscalculations can be detected, located, and corrected at the end of the computation. However, whether this checksum relationship can be maintained in the middle of the computation or not remains open. In this paper, we first demonstrate that, for many matrix-matrix multiplication algorithms, the checksum relationship in the input checksum matrices is not maintained in the middle of the computation. We then prove that, however, for the outer product version algorithm, the checksum relationship in the input checksum matrices can be maintained in the middle of the computation. Based on this checksum relationship maintained in the middle of the computation, we demonstrate that fail-stop process failures (which are often tolerated by checkpointing or message logging) in ScaLAPACK matrix-matrix multiplication can be tolerated without checkpointing or message logging.
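
The checksum invariant for the outer-product version can be illustrated with a small single-process sketch. It shows only that every partial sum remains a full checksum matrix; it does not model ScaLAPACK, distributed processes, or recovery after a failure. All function names are hypothetical, and integer matrices are assumed so that the checks are exact.

```python
from typing import List

Matrix = List[List[int]]


def column_checksum(a: Matrix) -> Matrix:
    """Append a row holding the sum of each column of `a`."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b: Matrix) -> Matrix:
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def is_full_checksum(c: Matrix) -> bool:
    """True if the last row and last column hold the sums of the others."""
    body = c[:-1]
    rows_ok = all(row[-1] == sum(row[:-1]) for row in c)
    cols_ok = all(c[-1][j] == sum(row[j] for row in body)
                  for j in range(len(c[0])))
    return rows_ok and cols_ok


def outer_product_multiply(ac: Matrix, br: Matrix) -> Matrix:
    """Multiply by accumulating rank-1 updates, checking after each step."""
    n_rows, n_cols, inner = len(ac), len(br[0]), len(br)
    c = [[0] * n_cols for _ in range(n_rows)]
    for k in range(inner):
        # Add the outer product of column k of `ac` and row k of `br`.
        for i in range(n_rows):
            for j in range(n_cols):
                c[i][j] += ac[i][k] * br[k][j]
        # Each rank-1 term is a full checksum matrix, so the partial sum
        # after every step is one too: the relationship holds mid-computation.
        assert is_full_checksum(c)
    return c
```

For example, `outer_product_multiply(column_checksum(a), row_checksum(b))` returns the product of `a` and `b` bordered by its checksum row and column, and a corrupted entry makes `is_full_checksum` return `False`.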

18.

Fault Tolerance Techniques in Power Supplies (total citations: 2, self-citations: 1, citations by others: 1)

张谷勋《计算机与数字工程》1999,27(1):57-61,68

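To meet users' demand for a highly reliable, uninterrupted power supply, the author proposes treating the power supply of a large electronic system as a real-time control system and applying fault tolerance techniques in power supply design.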

19.

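A Fault Tolerance Configuration Management Method for Middleware Services (total citations: 1, self-citations: 0, citations by others: 1)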

李军国, 黄罡, 邹键, 梅宏 《计算机学报》2007,30(10):1696-1704

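The paper proposes a fault tolerance management method based on runtime software architecture that lets developers and administrators customize suitable fault detection and repair mechanisms for different middleware service failures. First, the runtime software architecture automatically constructs a component dependency view and an error propagation view, providing a global view for understanding and analyzing the reliability of the whole system. Next, fault tolerance mechanisms are configured by operating on the runtime software architecture. Finally, AOP techniques are used to instrument the middleware with these mechanisms so that it gains the specified fault tolerance capability. The process is carried out semi-automatically with the help of a visual tool and has been validated on J2EE middleware.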

20.

Distributed Search Trees: Fault Tolerance in an Asynchronous Environment

Konrad Schlude, Eljas Soisalon-Soininen, Peter Widmayer 《Theory of Computing Systems》2003,36(6):611-629

We propose a distributed dictionary that allows insert and search operations and that tolerates arbitrary single server crashes. The distinctive feature of our model is that the crash of a server cannot be detected. This is in contrast to all other proposals of distributed fault-tolerant search structures presented thus far. It reflects the real situation in the Internet more accurately, and is in general more suitable to complex overall conditions. This makes our solution fundamentally different from all previous ones, but also more complicated. We present in detail the algorithms for searching, insertion, and graceful recovery of crashed servers.