1.

姚兰桂勋巨军让《计算机工程》2007,33(6):83-85,1

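As hardware fault-tolerance techniques have matured, software fault tolerance has become a focus of efforts to improve system reliability. Developing fault-tolerant applications directly is very difficult. Because middleware provides a good development environment for application systems, this paper studies and designs a middleware-based fault-tolerant system model and proposes a new method for constructing a fault-tolerant node structure, forming a fairly complete solution to the key fault-tolerance problems of redundancy, failure detection, and recovery. System reliability is computed using a Markov process, which verifies that the system design is sound and reliable.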
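To illustrate how a Markov process yields a reliability figure, here is a minimal sketch. It is not the model from the paper above: it assumes a hypothetical two-node redundant system with no repair, in which each node fails independently at a constant rate `lam`. The states are "both nodes up", "one node up", and "failed"; the function names and parameters are illustrative only.

```python
import math


def duplex_reliability(lam: float, t: float) -> float:
    """Closed-form reliability of the assumed two-node system.

    Solving the Markov chain (2 up -> 1 up at rate 2*lam, 1 up -> failed
    at rate lam) gives R(t) = 2*exp(-lam*t) - exp(-2*lam*t).
    """
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def duplex_reliability_numeric(lam: float, t: float, steps: int = 10_000) -> float:
    """Same quantity, obtained by Euler integration of the state equations."""
    p2, p1 = 1.0, 0.0          # start with both nodes up
    dt = t / steps
    for _ in range(steps):
        dp2 = -2.0 * lam * p2              # leave the "both up" state
        dp1 = 2.0 * lam * p2 - lam * p1    # enter from "both up", leave to "failed"
        p2 += dp2 * dt
        p1 += dp1 * dt
    return p2 + p1             # system works while at least one node is up


if __name__ == "__main__":
    lam, t = 0.001, 1000.0     # hypothetical failure rate (per hour) and mission time
    print(duplex_reliability(lam, t), duplex_reliability_numeric(lam, t))
```

The closed form and the numerical integration should agree closely; the numerical version is the one that extends to chains with repair or imperfect failure detection, where a closed form is less convenient.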

2.

A novel concurrent error detection scheme for FFT networks

Tao D.L. Hartmann C.R.P. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(2):198-221

Algorithm-based fault tolerance techniques have been proposed to obtain reliable results at very low hardware overhead. Even though 100% fault coverage can be theoretically obtained by using these techniques, the system performance, i.e., fault coverage and throughput, can be drastically reduced due to many practical problems, e.g., round-off errors. A novel algorithm-based fault tolerance scheme is proposed for fast Fourier transform (FFT) networks. It is shown that the proposed scheme achieves 100% fault coverage theoretically. An accurate measure of the fault coverage for FFT networks is provided by taking the round-off error into account. The proposed scheme is shown to provide concurrent error detection capability to FFT networks with low hardware overhead, high throughput, and high fault coverage.
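The general idea of an algorithm-based check on an FFT, and the reason round-off matters, can be sketched as follows. This is not the scheme proposed in the paper above; it uses a simple, well-known DFT identity (the sum of all outputs equals N times the first input sample) as a checksum, with a tolerance to absorb floating-point round-off. The names `checked_fft`, `FaultDetected`, `rel_tol`, and `inject_fault` are hypothetical.

```python
import cmath


class FaultDetected(Exception):
    """Raised when the output checksum disagrees with the input."""


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def checked_fft(x, rel_tol=1e-9, inject_fault=None):
    """Compute the FFT and verify it with a checksum.

    inject_fault is an optional (index, delta) pair that corrupts one
    output, standing in for a hardware fault (an assumption of this sketch).
    """
    X = fft(x)
    if inject_fault is not None:
        idx, delta = inject_fault
        X[idx] += delta
    # DFT identity: sum_k X[k] == N * x[0]
    expected = len(x) * complex(x[0])
    observed = sum(X)
    # Scale the threshold by the output magnitude so round-off is not flagged
    threshold = rel_tol * max(1.0, sum(abs(v) for v in X))
    if abs(observed - expected) > threshold:
        raise FaultDetected(f"checksum mismatch: {abs(observed - expected):.3e}")
    return X


if __name__ == "__main__":
    data = [1, 2, 3, 4, 0, -1, -2, -3]
    checked_fft(data)                               # passes
    try:
        checked_fft(data, inject_fault=(3, 0.5))    # corrupted output
    except FaultDetected as exc:
        print("detected:", exc)
```

The threshold is the design trade-off the abstract alludes to: set it too tight and round-off causes false alarms, too loose and small faults go undetected.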

3.

Fault Tolerance Using Dynamic Reconfiguration on the POEtic Tissue (cited by: 1 total, 0 self-citations, 1 by others)

Barker W. Halliday D.M. Thoma Y. Sanchez E. Tempesti G. Tyrrell A.M. 《Evolutionary Computation, IEEE Transactions on》2007,11(5):666-684

Fault tolerance is a crucial operational aspect of biological systems, and the self-repair capabilities of complex organisms far exceed those of even the most advanced electronic devices. While many of the processes used by nature to achieve fault tolerance cannot easily be applied to silicon-based systems, in this paper we show that mechanisms loosely inspired by the operation of multicellular organisms can be transported to electronic systems to provide self-repair capabilities. Features such as dynamic routing, reconfiguration, and on-chip reprogramming can be invaluable for the realization of adaptive hardware systems and for the design of highly complex systems based on the kind of unreliable components that are likely to be introduced in the not-too-distant future. In this paper, we describe the implementation of fault tolerant features that address error detection and recovery through dynamic routing, reconfiguration, and on-chip reprogramming in a novel application specific integrated circuit. We take inspiration from three biological models: phylogenesis, ontogenesis, and epigenesis (hence the POE in POEtic). As in nature, our approach is based on a set of separate and complementary techniques that exploit the novel mechanisms provided by our device in the particular context of fault tolerance.

4.

A Task-Parallel Programming Model with Fault-Tolerance Support

王一拙陈旭计卫星苏岩王小军石峰《软件学报》2016,27(7):1789-1804

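Task-parallel programming models have become the mainstream of parallel programming; they improve the system performance of parallel computers by exploiting task parallelism. This paper proposes a task-parallel programming model with fault-tolerance support that integrates fault-tolerance techniques into the task-parallel programming model, improving system reliability while preserving performance. The model treats the task as the basic unit of scheduling, execution, error detection, and recovery, and implements fault tolerance at the application level. A Buffer-Commit computation model supports detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at low performance overhead.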

5.

A comparative analysis of hardware and software fault tolerance: Impact on software reliability engineering

Hany H. Ammar Bojan Cukic Ali Mili Cris Fuhrman 《Annals of Software Engineering》2000,10(1-4):103-150

Today's digital systems are growing increasingly complex, and are being used in increasingly critical functions. The first premise makes them more prone to contain faults, and the second premise makes their failure less tolerable. This widening gap highlights the need for fault tolerant techniques, which make provisions for reliable operation of digital systems despite the presence and occasional manifestation of faults. In this paper we present a brief comparative survey of fault tolerance as it arises in hardware systems and software systems. We discuss logical models as well as statistical models of fault tolerance, and use these models to analyze design tradeoffs of fault tolerant systems. This revised version was published online in June 2006 with corrections to the Cover Date.

6.

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance (cited by: 1 total, 1 self-citation, 0 by others)

Haines Joshua Lakamraju Vijay Koren Israel Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

7.

Software Fault Tolerance for CPU Utilization Monitoring in Real-Time Operating Systems

王余伟曹东施书成《计算机工程与科学》2018,40(8):1337-1343

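In a hard real-time operating system, CPU utilization is an important indicator of system performance: if one task occupies the entire CPU, other tasks cannot continue to run, with catastrophic consequences for the system. An analysis of how software runs in a real-time operating system shows that the system design needs a fault-tolerance strategy to improve reliability and fault tolerance. Tasks in flight control software are monitored in real time under the μC/OS-II real-time operating system. First, a method for computing CPU utilization under μC/OS-II is given and a reasonable CPU monitoring period is proposed. Second, a fault detection algorithm for abnormal CPU utilization is given and detected faults are handled, improving the system's fault tolerance. Finally, four methods for handling abnormal CPU utilization are verified by writing embedded flight control software on an MPC5674 flight control computer. Simulation results show that the software fault-tolerance method for the CPU in a real-time operating system can effectively improve system reliability and fault tolerance.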

8.

A Self-Feedback Failure Detection Algorithm for P2P Storage Systems (cited by: 2 total, 1 self-citation, 1 by others)

万亚平冯丹欧阳利军刘立杨天明《计算机科学》2010,37(2):48-52

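To build a highly available P2P storage system in which nodes are highly dynamic, a self-feedback heartbeat failure detection algorithm is designed. It combines a heartbeat strategy with an unbiased grey prediction model and dynamically adjusts the quality of detection according to changes in application requirements and the network environment, improving detection accuracy while maintaining a given detection time. Experiments show that a failure detector implemented with this algorithm performs well and improves the availability of the P2P storage system.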
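The shape of an adaptive heartbeat failure detector can be sketched as below. This is a deliberately simplified stand-in: it predicts the next heartbeat from a moving average of recent inter-arrival times plus a fixed safety margin, in place of the unbiased grey prediction model and feedback loop the paper describes. The class name, parameters, and default values are hypothetical.

```python
from collections import deque


class AdaptiveHeartbeatDetector:
    """Suspects a node when its heartbeat is later than an adaptive timeout."""

    def __init__(self, window=10, safety_margin=0.5, initial_timeout=2.0):
        self.intervals = deque(maxlen=window)   # recent inter-arrival times
        self.safety_margin = safety_margin      # slack added to the prediction
        self.initial_timeout = initial_timeout  # used before any interval is known
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat that arrived at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def timeout(self):
        """Predicted inter-arrival time plus the safety margin."""
        if not self.intervals:
            return self.initial_timeout
        mean = sum(self.intervals) / len(self.intervals)
        return mean + self.safety_margin

    def suspect(self, now):
        """True if the node has been silent for longer than the timeout."""
        if self.last_arrival is None:
            return False
        return now - self.last_arrival > self.timeout()


if __name__ == "__main__":
    detector = AdaptiveHeartbeatDetector()
    for t in (0.0, 1.0, 2.0, 3.0):
        detector.heartbeat(t)
    print(detector.suspect(4.0), detector.suspect(5.0))
```

Because the timeout tracks observed arrival times, a slower network lengthens it automatically, trading detection time for fewer false suspicions. A larger `safety_margin` shifts that trade-off further toward accuracy.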

9.

Research on a Maintenance Fault Diagnosis System for Exascale High-Performance Computers

建澜涛任秀江张祯石嵩黄益明张春林《计算机工程》2022,48(12):24-37

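The enormous scale of exascale computer systems increases the total number of faults and anomalies and makes them harder to diagnose and discover. A more accurate and efficient real-time maintenance fault diagnosis system is therefore urgently needed to perform comprehensive real-time detection of anomaly and fault information, fault diagnosis, and fault prediction for the hardware system. When diagnosing systems with tens of thousands of nodes, traditional fault diagnosis systems suffer from low execution efficiency and a high false-alarm rate in anomaly detection, and their coverage of anomaly detection and fault diagnosis is insufficient. This paper studies techniques for anomaly and fault detection, fault diagnosis, and fault prediction, analyzes their principles and applicability, and, based on the practical engineering requirements of exascale high-performance computers, designs a maintenance fault diagnosis system that meets the needs of multi-exascale high-performance computers. A scalable edge diagnosis architecture is designed around the structure of the maintenance system; fault detection, diagnosis, and prediction algorithms are derived by combining knowledge of high-performance computer systems and expert knowledge with mathematical statistics and machine learning; and prediction models are built for dedicated scenarios. Experimental results show that the system scales well and can complete fault diagnosis of a system with 100,000 nodes within 10 s. Compared with traditional fault diagnosis systems, the false-alarm rate for one specific anomaly detection metric falls from 3.3% to almost 0, hardware fault detection coverage rises from 90.2% to over 96%, and hardware fault diagnosis coverage rises from 71% to about 94%; the system also predicts faults fairly accurately in several important application scenarios.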

10.

Composite Transaction Blocks and Their Implementation in C and FOXPRO

江建慧赵晓东《计算机研究与发展》1998,35(9):859-864

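The nested use of a high-level programming language (such as C) with a database system is an effective structure for data manipulation and management, and it is now widely used in traditional offline and online transaction processing systems as well as in intelligent control systems. One characteristic of these applications is that they demand very high system reliability. This paper proposes a novel fault-tolerance primitive called the composite transaction block and describes in detail how it is implemented in C and FOXPRO. In essence, the composite transaction block is a hybrid fault-tolerance mechanism that combines data fault tolerance, program fault tolerance, and algorithm fault tolerance. The paper also analyzes the execution time of composite transaction blocks on redundant processors and verifies their fault-tolerance properties through software experiments.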

11.

PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures

Shye Alex Blomstedt Joseph Moseley Tipp Reddi Vijay Janapa Connors Daniel A. 《Dependable and Secure Computing, IEEE Transactions on》2009,6(2):135-148

Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.

12.

Designing fault-tolerant techniques for SRAM-based FPGAs (cited by: 2 total, 0 self-citations, 2 by others)

de Lima Kastensmidt F.G. Neuberger G. Hentschke R.F. Carro L. Reis R. 《Design & Test of Computers, IEEE》2004,21(6):552-562

FPGAs have become prevalent in critical applications in which transient faults can seriously affect the circuit's operation. We present a fault tolerance technique for transient and permanent faults in SRAM-based FPGAs. This technique combines duplication with comparison (DWC) and concurrent error detection (CED) to provide a highly reliable circuit while maintaining hardware, pin, and power overheads far lower than with classic triple-modular-redundancy techniques.

13.

Design of a Weighted Failure Detector for Storage Systems

杨光周敬利《计算机工程与应用》2010,46(25):12-15

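Disaster-tolerant storage systems are mainly concerned with data backup: they ensure that unexpected events do not cause major losses to an application, but they cannot guarantee that the application will not be interrupted when a fault occurs, nor can they quickly reflect changes in the system view. General-purpose failure detectors, moreover, do not meet flexibility requirements. This paper proposes a weighted failure detector that reflects changes in network state through changes in weight values and adjusts automatically to the network state and the needs of the application. Experimental results confirm that this detector meets the flexibility requirements.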

14.

Artificial Immune Systems and Their Applications (cited by: 2 total, 0 self-citations, 2 by others)

孙勇智韦巍《计算机工程》2003,29(15):1-2,62

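An artificial immune system is a new kind of information processing system based on the principles of the immune systems of humans and other higher animals. This paper briefly introduces the characteristics of biological immune systems, surveys several of the main current artificial immune systems and their engineering applications in computer security, optimization, fault detection and handling, and control, and discusses their application prospects.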

15.

An operating system for a fault-tolerant multiprocessor controller

Williams R.D. Johnson B.W. Roberts T.E. 《Micro, IEEE》1988,8(4):18-29

The development of an operating system that is a central component of a fault-tolerant multiprocessor is described. The operating system, while relatively simple and small, supports multitasking and multiprocessing, as well as both self-diagnostics and cross-diagnostics for fault detection. In the event of a fault, the system permits rapid reconfiguration in a manner that retains processing for the highest-priority tasks. Since the hardware needed to provide fault tolerance is available when there are no faults, the operating system can utilize this excess capacity to accomplish lower-priority tasks during normal operation. This approach yields graceful degradation in response to faults in the system components.

16.

Fault tolerance in highly parallel hardware systems

Grosspietsch K.E. 《Micro, IEEE》1994,14(1):60-68

As the demand for highly parallel systems grows, the vast amount of concurrently operating hardware involved can make it difficult to guarantee proper system behavior. Problems arise both from permanent and transient hardware faults and from errors caused by improper programming. A number of fault tolerance solutions have emerged. Following a survey of fault tolerance in arrays, a discussion of solutions for more specialized architectures is presented.

17.

An Optimized Provisioning Method for Fault Tolerance as a Service in Cloud Application Systems

杨娜刘靖《软件学报》2019,30(4):1191-1202

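Providing efficient and continuously available fault-tolerance services is essential to the reliable operation of cloud application systems. Adopting the fault-tolerance-as-a-service model, this paper proposes an optimized method for dynamically provisioning cloud fault-tolerance services. It describes the fault-tolerance requirements of a cloud application in terms of the reliability and response time of its components; builds on common fault-tolerance techniques such as replication, checkpointing, and NVP (N-version programming); takes full account of the overhead of dynamically switching fault-tolerance services; and, for the separate cases in which the underlying cloud resources supporting the services are or are not sufficient, gives an optimization method for finding a feasible fault-tolerance-as-a-service provisioning plan. Experimental results show that the method reduces both the fault-tolerance service fees paid by cloud application systems and the overhead of the underlying cloud resources that support the services, and improves the provider's ability to deliver efficient, reliable fault tolerance as a service to multiple cloud applications.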

18.

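Fault-Tolerance Capability and Performance Analysis of Software Dual-Redundancy Fault-Tolerant Systems (cited by: 1 total, 0 self-citations, 1 by others)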

吴斌高珑《计算机研究与发展》2009,46(Z2)

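Dual redundancy is a commonly used redundant fault-tolerant design method. A software dual-redundancy fault-tolerant system redundantly executes two software replicas that perform the same function, checks their results, and decides whether an error has occurred according to whether the two results agree. This paper builds a runtime model of the software dual-redundancy fault-tolerant system and introduces the concept of its fault-tolerance capability. Using the model, it analyzes how the fault-tolerance capability of a single software replica affects the fault-tolerance capability and performance of the dual-redundancy system. The analysis shows that improving the fault-tolerance capability of a single replica improves not only the fault-tolerance capability of the dual-redundancy system but also its performance. In extreme cases, however, the fault-tolerance capability of the dual-redundancy system may be lower than that of a single replica.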
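The compare-two-replicas mechanism described in this abstract can be sketched in a few lines. This is only an illustration of the basic check, not the runtime model from the paper; `run_duplex`, `MismatchError`, and the example replicas are hypothetical names.

```python
class MismatchError(Exception):
    """Raised when the two replicas disagree, signalling a detected error."""


def run_duplex(replica_a, replica_b, *args):
    """Run two replicas of the same function and compare their results."""
    result_a = replica_a(*args)
    result_b = replica_b(*args)
    if result_a != result_b:
        raise MismatchError(f"replicas disagree: {result_a!r} != {result_b!r}")
    return result_a


def sum_builtin(values):
    """First replica: uses the built-in sum."""
    return sum(values)


def sum_loop(values):
    """Second replica: independently written accumulation loop."""
    total = 0
    for v in values:
        total += v
    return total


def sum_faulty(values):
    """A replica with an injected error, used to exercise detection."""
    return sum(values) + 1


if __name__ == "__main__":
    print(run_duplex(sum_builtin, sum_loop, [1, 2, 3]))
    try:
        run_duplex(sum_builtin, sum_faulty, [1, 2, 3])
    except MismatchError as exc:
        print("detected:", exc)
```

Comparison detects an error only when the replicas disagree, so two replicas that fail in the same way pass the check. It also cannot tell which replica is wrong; recovery needs a further step such as re-execution or a third replica.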

19.

An efficient modular spare allocation scheme and its application to fault tolerant binary hypercubes

Alam M.S. Melhem R.G. 《Parallel and Distributed Systems, IEEE Transactions on》1991,2(1):117-126

Consideration is given to fault tolerant systems that are built from modules called fault tolerant basic blocks (FTBBs), where each module contains some primary nodes and some spare nodes. Full spare utilization is achieved when each spare within an FTBB can replace any other primary or spare node in that FTBB. This, however, may be prohibitively expensive for larger FTBBs. Therefore, it is shown that for a given hardware overhead more reliable systems can be designed using bigger FTBBs without full spare utilization than using smaller FTBBs with full spare utilization. Sufficient conditions for maximizing the reliability of a spare allocation strategy in an FTBB for a given hardware overhead are presented. The proposed spare allocation strategy is applied to two fault tolerant reconfiguration schemes for binary hypercubes. One scheme uses hardware switches to replace a faulty node, and the other scheme uses fault tolerant routing to bypass faulty nodes in the system and deliver messages to the destination node.

20.

Automating the addition of fault tolerance with discrete controller synthesis

Alain Girault Éric Rutten 《Formal Methods in System Design》2009,35(2):190-225

Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking. Its interest lies in the ability to obtain automatically systems satisfying by construction formal properties specified a priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we specify formally a fault hypothesis; we state some fault tolerance requirements; and we use DCS to obtain automatically a program, having the same behavior as the initial fault intolerant one in the absence of faults, and satisfying the fault tolerance requirements under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design fault tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level, the satisfaction of quantitative constraints, and so on. We show with numerous examples taken from case studies that our method can address different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication links, actuators, or sensors). Besides, we show that our method also offers an optimality criterion very useful to synthesize fault tolerant systems compliant to the constraints of embedded systems, like power consumption.