Found 20 similar articles (search time: 296 ms)
1.
2.
Algorithm-based fault tolerance techniques have been proposed to obtain reliable results at very low hardware overhead. Even though 100% fault coverage can theoretically be obtained with these techniques, system performance, i.e., fault coverage and throughput, can be drastically reduced by many practical problems, e.g., round-off errors. A novel algorithm-based fault tolerance scheme is proposed for fast Fourier transform (FFT) networks. It is shown that the proposed scheme achieves 100% fault coverage theoretically. An accurate measure of the fault coverage for FFT networks is provided by taking the round-off error into account. The proposed scheme is shown to provide concurrent error detection capability to FFT networks with low hardware overhead, high throughput, and high fault coverage.
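The interaction between checksum-based checking and round-off error mentioned in this abstract can be illustrated with a minimal sketch. It is not the scheme proposed in the paper: it assumes a simple, well-known DFT identity (the sum of all output points equals N times the first input sample) as the check, uses a naive DFT as a stand-in for an FFT network, and the names `dft`, `checksum_ok`, and `rel_tol` are hypothetical.

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform (stand-in for an FFT network)."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def checksum_ok(x, X, rel_tol=1e-9):
    """Check the identity sum_k X[k] == N * x[0] up to a round-off tolerance.

    rel_tol is an assumed threshold: too small and round-off causes false
    alarms, too large and small faults escape detection.
    """
    expected = len(x) * x[0]
    observed = sum(X)
    scale = max(1.0, sum(abs(v) for v in X))  # scale tolerance with output magnitude
    return abs(observed - expected) <= rel_tol * scale

x = [1.0, 2.0, -0.5, 3.25, 0.0, -1.0, 4.0, 0.75]
X = dft(x)
fault_free = checksum_ok(x, X)            # passes despite floating-point round-off

X_faulty = list(X)
X_faulty[3] += 0.01                       # inject an error into one output point
fault_detected = not checksum_ok(x, X_faulty)
```

Comparing the two sides of the identity for exact equality would make the check sensitive to floating-point round-off, which is why a tolerance is used here; the choice of tolerance is what trades false alarms against missed faults.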
3.
Barker W. Halliday D.M. Thoma Y. Sanchez E. Tempesti G. Tyrrell A.M. 《Evolutionary Computation, IEEE Transactions on》2007,11(5):666-684
Fault tolerance is a crucial operational aspect of biological systems and the self-repair capabilities of complex organisms far exceed those of even the most advanced electronic devices. While many of the processes used by nature to achieve fault tolerance cannot easily be applied to silicon-based systems, in this paper we show that mechanisms loosely inspired by the operation of multicellular organisms can be transported to electronic systems to provide self-repair capabilities. Features such as dynamic routing, reconfiguration, and on-chip reprogramming can be invaluable for the realization of adaptive hardware systems and for the design of highly complex systems based on the kind of unreliable components that are likely to be introduced in the not-too-distant future. In this paper, we describe the implementation of fault tolerant features that address error detection and recovery through dynamic routing, reconfiguration, and on-chip reprogramming in a novel application specific integrated circuit. We take inspiration from three biological models: phylogenesis, ontogenesis, and epigenesis (hence the POE in POEtic). As in nature, our approach is based on a set of separate and complementary techniques that exploit the novel mechanisms provided by our device in the particular context of fault tolerance.
4.
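The task-parallel programming model has become the mainstream of parallel programming; it improves the system performance of parallel computers by exploiting task parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault tolerance techniques into the task-parallel programming model, improving system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection, and recovery, and provides fault tolerance support at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at low performance overhead.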
5.
Hany H. Ammar Bojan Cukic Ali Mili Cris Fuhrman 《Annals of Software Engineering》2000,10(1-4):103-150
Today's digital systems are growing increasingly complex, and are being used in increasingly critical functions. The first
premise makes them more prone to contain faults, and the second premise makes their failure less tolerable. This widening
gap highlights the need for fault tolerant techniques, which make provisions for reliable operation of digital systems despite
the presence and occasional manifestation of faults. In this paper we present a brief comparative survey of fault tolerance
as it arises in hardware systems and software systems. We discuss logical models as well as statistical models of fault tolerance,
and use these models to analyze design tradeoffs of fault tolerant systems.
This revised version was published online in June 2006 with corrections to the Cover Date.
6.
Haines Joshua Lakamraju Vijay Koren Israel Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.
7.
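In a hardware real-time operating system, CPU utilization is an important indicator of system performance: if one task occupies all of the system's CPU, other tasks cannot continue to run, with catastrophic consequences for the system.
Analysis of how software runs in a real-time operating system shows that system design needs to adopt fault tolerance strategies to improve system reliability and fault tolerance capability. Tasks in flight control software are monitored in real time under the μC/OS-II real-time operating system. First, a method for computing CPU utilization under μC/OS-II is given, and a reasonable CPU monitoring period is proposed. Second, a fault detection algorithm for abnormal CPU utilization is given, and detected faults are handled to improve the system's fault tolerance capability. Finally, embedded flight control software written for the MPC5674 flight control computer is used to verify four methods of handling abnormal CPU utilization. Simulation results show that the software fault tolerance method for the CPU in a real-time operating system can effectively improve system reliability and fault tolerance capability.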
8.
9.
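Exascale computer systems are enormous in scale, so the total number of faults and anomalies grows accordingly and diagnosis becomes harder. There is therefore an urgent need for a more accurate and efficient real-time maintenance and fault diagnosis system that performs comprehensive real-time detection of anomaly and fault information, fault diagnosis, and fault prediction for the hardware system. When diagnosing at the scale of tens of thousands of nodes, traditional fault diagnosis systems suffer from low execution efficiency and a high false-alarm rate in anomaly detection, and their coverage of anomaly detection and fault diagnosis is insufficient. This paper studies techniques for anomaly and fault detection, fault diagnosis, and fault prediction, analyzes their principles and applicability, and, based on the practical engineering requirements of exascale high-performance computers, designs a maintenance and fault diagnosis system that meets those requirements. A scalable edge diagnosis architecture is designed on the basis of the structure of the maintenance system; fault detection, diagnosis, and prediction algorithms are given by combining knowledge of the high-performance computer system and expert knowledge with mathematical statistics and machine learning; and prediction models are built for dedicated scenarios. Experimental results show that the system scales well and can complete fault diagnosis of a system of 100,000 nodes within 10 s. Compared with traditional fault diagnosis systems, the false-alarm rate of anomaly detection for one specific metric drops from 3.3% to almost 0, hardware fault detection coverage rises from 90.2% to above 96%, and hardware fault diagnosis coverage rises from 71% to about 94%; the system can also predict faults fairly accurately in several important application scenarios.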
10.
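Nesting a high-level programming language (such as C) with a database system is an effective structure for data manipulation and management, and is now widely used in traditional offline and online transaction processing systems as well as in intelligent control systems. One characteristic of these applications is that they require very high system reliability. This paper proposes a novel fault tolerance primitive called the composite transaction block and describes in detail its implementation in C and FOXPRO. In essence, the composite transaction block is a hybrid fault tolerance mechanism that combines data fault tolerance, program fault tolerance, and algorithm fault tolerance. The paper also analyzes the execution time of composite transaction blocks on redundant processors and verifies their fault tolerance properties through software experiments.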
11.
Shye Alex Blomstedt Joseph Moseley Tipp Reddi Vijay Janapa Connors Daniel A. 《Dependable and Secure Computing, IEEE Transactions on》2009,6(2):135-148
Transient faults are emerging as a critical concern in the reliability of general-purpose microprocessors. As architectural trends point toward multicore designs, there is substantial interest in adapting such parallel hardware resources for transient fault tolerance. This paper presents process-level redundancy (PLR), a software technique for transient fault tolerance, which leverages multiple cores for low overhead. PLR creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources. PLR uses a software-centric approach to transient fault tolerance, which shifts the focus from ensuring correct hardware execution to ensuring correct software execution. As a result, many benign faults that do not propagate to affect program correctness can be safely ignored. A real prototype is presented that is designed to be transparent to the application and can run on general-purpose single-threaded programs without modifications to the program, operating system, or underlying hardware. The system is evaluated for fault coverage and performance on a four-way SMP machine and provides improved performance over existing software transient fault tolerance techniques with a 16.9 percent overhead for fault detection on a set of optimized SPEC2000 binaries.
12.
Designing fault-tolerant techniques for SRAM-based FPGAs (total citations: 2; self-citations: 0; citations by others: 2)
de Lima Kastensmidt F.G. Neuberger G. Hentschke R.F. Carro L. Reis R. 《Design & Test of Computers, IEEE》2004,21(6):552-562
FPGAs have become prevalent in critical applications in which transient faults can seriously affect the circuit's operation. We present a fault tolerance technique for transient and permanent faults in SRAM-based FPGAs. This technique combines duplication with comparison (DWC) and concurrent error detection (CED) to provide a highly reliable circuit while maintaining hardware, pin, and power overheads far lower than with classic triple-modular-redundancy techniques.
13.
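Disaster-tolerant storage systems are mainly concerned with data backup: they ensure that an application does not suffer major losses from unexpected events, but they cannot guarantee that the application is not interrupted when a fault occurs, nor can they quickly reflect changes in the system view. General failure detectors, in turn, do not meet the need for flexibility. This paper proposes a weighted failure detector that reflects changes in network state through changes in weights and adjusts automatically according to the network state and the needs of the application. Experimental results confirm that this detector meets the flexibility requirement.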
14.
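Artificial immune systems and their applications (total citations: 2; self-citations: 0; citations by others: 2)
An artificial immune system is a new kind of information processing system based on the principles of the immune systems of humans and other higher animals. This paper briefly introduces the characteristics of the biological immune system, surveys several of the main current artificial immune systems and their engineering applications in computer security, optimization, fault detection and handling, and control, and discusses their application prospects.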
15.
The development of an operating system that is a central component of a fault-tolerant multiprocessor is described. The operating system, while relatively simple and small, supports multitasking and multiprocessing, as well as both self-diagnostics and cross-diagnostics for fault detection. In the event of a fault, the system permits rapid reconfiguration in a manner that retains processing for the highest-priority tasks. Since the hardware needed to provide fault tolerance is available when there are no faults, the operating system can utilize this excess capacity to accomplish lower-priority tasks during normal operation. This approach yields graceful degradation in response to faults in the system components.
16.
As the demand for highly parallel systems grows, the vast amount of concurrently operating hardware involved can make it difficult to guarantee proper system behavior. Problems arise both from permanent and transient hardware faults and from errors caused by improper programming. A number of fault tolerance solutions have emerged. Following a survey of fault tolerance in arrays, a discussion of solutions for more specialized architectures is presented.
17.
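Providing efficient and continuously available fault tolerance services is essential to the reliable operation of cloud application systems. Adopting the fault-tolerance-as-a-service model, this paper proposes an optimized method for dynamically provisioning cloud fault tolerance services. Fault tolerance requirements of cloud applications are described in terms of the reliability and response time of cloud application components; the method builds on common fault tolerance techniques such as replication, checkpointing, and NVP (N-version programming), and fully accounts for the overhead of dynamically switching between fault tolerance services. For the two scenarios in which the underlying cloud resources supporting the fault tolerance services are or are not sufficient, optimal solution methods for the available fault-tolerance-as-a-service provisioning schemes are given. Experimental results show that the proposed method reduces both the fault tolerance service fees paid by cloud application systems and the overhead of the underlying cloud resources supporting the services, and improves the provider's ability to deliver efficient and reliable fault tolerance as a service to multiple cloud applications.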
18.
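Analysis of the fault tolerance capability and performance of software dual-redundancy fault-tolerant systems (total citations: 1; self-citations: 0; citations by others: 1)
Dual redundancy is a commonly used redundancy-based fault-tolerant design method. A software dual-redundancy fault-tolerant system redundantly executes two software replicas that perform the same function, checks their results, and decides whether an error has occurred according to whether the two results agree. A runtime model of the software dual-redundancy fault-tolerant system is established, and the concept of the fault tolerance capability of such a system is introduced. Based on the model, the effect of the fault tolerance capability of a single software replica on the fault tolerance capability and performance of the dual-redundancy system is analyzed. The analysis shows that improving the fault tolerance capability of a single replica improves not only the fault tolerance capability of the dual-redundancy system but also its performance. In extreme cases, however, the fault tolerance capability of the dual-redundancy system may be lower than that of a single software replica.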
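The compare-two-replicas mechanism described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the runtime model from the paper: the helper `run_duplicated` and the replicas `square` and `square_faulty` are hypothetical names, and faults are simulated by a deliberately wrong function.

```python
def run_duplicated(replica_a, replica_b, arg):
    """Run two replicas on the same input and compare their results.

    Returns (result, error_detected). On a mismatch no result is trusted,
    because duplication alone cannot tell which replica is wrong.
    """
    out_a = replica_a(arg)
    out_b = replica_b(arg)
    if out_a == out_b:
        return out_a, False
    return None, True

def square(v):
    """Correct replica."""
    return v * v

def square_faulty(v):
    """Simulated faulty replica: wrong only for the input 3."""
    return v * v + (1 if v == 3 else 0)

agree = run_duplicated(square, square, 4)                      # results match
mismatch = run_duplicated(square, square_faulty, 3)            # disagreement is flagged
same_wrong = run_duplicated(square_faulty, square_faulty, 3)   # identical errors slip through
```

The last call shows a limitation of plain comparison: when both replicas produce the same wrong value, the check passes and the error goes undetected.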
19.
Consideration is given to fault tolerant systems that are built from modules called fault tolerant basic blocks (FTBBs), where each module contains some primary nodes and some spare nodes. Full spare utilization is achieved when each spare within an FTBB can replace any other primary or spare node in that FTBB. This, however, may be prohibitively expensive for larger FTBBs. Therefore, it is shown that for a given hardware overhead more reliable systems can be designed using bigger FTBBs without full spare utilization than using smaller FTBBs with full spare utilization. Sufficient conditions for maximizing the reliability of a spare allocation strategy in an FTBB for a given hardware overhead are presented. The proposed spare allocation strategy is applied to two fault tolerant reconfiguration schemes for binary hypercubes. One scheme uses hardware switches to replace a faulty node, and the other scheme uses fault tolerant routing to bypass faulty nodes in the system and deliver messages to the destination node.
20.
Discrete controller synthesis (DCS) is a formal approach, based on the same state-space exploration algorithms as model-checking.
Its interest lies in the ability to obtain automatically systems satisfying by construction formal properties specified a
priori. In this paper, our aim is to demonstrate the feasibility of this approach for fault tolerance. We start with a fault
intolerant program, modeled as the synchronous parallel composition of finite labeled transition systems; we specify formally
a fault hypothesis; we state some fault tolerance requirements; and we use DCS to obtain automatically a program, having the
same behavior as the initial fault intolerant one in the absence of faults, and satisfying the fault tolerance requirements
under the fault hypothesis. Our original contribution resides in the demonstration that DCS can be elegantly used to design
fault tolerant systems, with guarantees on key properties of the obtained system, such as the fault tolerance level, the satisfaction
of quantitative constraints, and so on. We show with numerous examples taken from case studies that our method can address
different kinds of failures (crash, value, or Byzantine) affecting different kinds of hardware components (processors, communication
links, actuators, or sensors). Besides, we show that our method also offers an optimality criterion very useful to synthesize
fault tolerant systems compliant with the constraints of embedded systems, like power consumption.