
1.

Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):137-153

Algorithm-based fault tolerance (ABFT) is a technique which improves the reliability of a multiprocessor system by providing it with concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work to date in this area addresses only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. There exists no previous deterministic ABFT method for the design of general t-fault locating systems, even for the case of t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system.
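As an illustration of what "encoding data at the system level" can look like, the following is a minimal sketch of the classic checksum-matrix scheme for matrix multiplication. It is not the design method proposed in the paper above. The function names are hypothetical, and integer arithmetic is assumed so that checksums can be compared exactly; floating-point data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row."""
    return [row + [sum(row)] for row in b]


def locate_single_error(c_full):
    """Return (row, col) of a single corrupted data element, or None.

    c_full is a product matrix whose last row and last column are checksums.
    Raises ValueError for patterns this simple check cannot attribute
    to exactly one data element.
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("error pattern is not a single data-element fault")


def correct_single_error(c_full, i, j):
    """Rebuild element (i, j) from its row checksum and the other elements."""
    m = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(m) if k != j)
    c_full[i][j] = c_full[i][m] - others


# Usage: the product of the encoded operands carries its own checksums.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[1][0] += 5                      # simulate a faulty processor's result
row, col = locate_single_error(c_full)
correct_single_error(c_full, row, col)
```

The intersection of the failing row check and the failing column check identifies the faulty element, which is the sense in which checks both detect and locate a fault.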

2.

Design of algorithm-based fault-tolerant multiprocessor systems for concurrent error detection and fault diagnosis

Vinnakota B. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1099-1106

Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. We present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.

3.

Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

Ragini Narasimhan Daniel J. Rosenkrantz S. S. Ravi 《International journal of parallel programming》1999,27(4):289-323

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, in some cases dramatically fewer. We address both the analysis and design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability (maximum number of detectable and locatable faulty processors) of the system. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks which can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.

4.

Fault-tolerant computing: fundamental concepts (total citations: 2; self-citations: 0; citations by others: 2)

Nelson V.P. 《Computer》1990,23(7):19-25

The basic concepts of fault-tolerant computing are reviewed, focusing on hardware. Failures, faults, and errors in digital systems are examined, and measures of dependability, which dictate and evaluate fault-tolerance strategies for different classes of applications, are defined. The elements of fault-tolerance strategies are identified, and various strategies are reviewed. They are: error detection, masking, and correction; error detection and correction codes; self-checking logic; module replication for error detection and masking; protocol and timing checks; fault containment; reconfiguration and repair; and system recovery.

5.

Transient fault tolerance in digital systems (total citations: 1; self-citations: 0; citations by others: 1)

Sosnowski J. 《Micro, IEEE》1994,14(1):24-35

It is hard to shield systems effectively from transient faults (fault avoidance techniques), so other means must be employed to assure appropriate levels of transient fault tolerance (insensitivity to transient faults). These are based on fault-masking and fault-recovery ideas. Having analyzed this problem, the author identifies critical design points and outlines some practical solutions that refer to efficient on-line detectors (detecting errors during the system operation) and error handling procedures. This framework provides a basis for understanding transient fault problems in digital systems. It can be helpful in selecting optimum techniques to mask or eliminate transient fault effects in developed systems.

6.

A Framework for Software Fault Tolerance in Real-Time Systems

《IEEE transactions on pattern analysis and machine intelligence》1983,(3):355-364

Real-time systems often have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be expensive and the cost is exacerbated for systems which utilize concurrent processes. The concurrency present in most real-time systems and the further difficulties introduced by timing constraints suggest that providing tolerance for software faults may be inordinately expensive or complex. We believe that this need not be the case, and propose a straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time systems. The approach takes advantage of the structure of real-time systems to simplify error recovery, and a classification scheme for errors is introduced. Responses to each type of error are proposed which allow service to be maintained.

7.

Loop transformations for fault detection in regular loops on massively parallel systems

Chun Gong Melhem R. Gupta R. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(12):1238-1249

Distributed-memory systems can incorporate thousands of processors at a reasonable cost. However, with an increasing number of processors in a system, fault detection and fault tolerance become critical issues. By replicating the computation on more than one processor and comparing the results produced by these processors, errors can be detected. During the execution of a program, due to data dependencies, typically not all of the processors in a multiprocessor system are busy at all times. Therefore processor schedules contain idle time slots and it is the goal of this work to exploit these idle time slots to schedule duplicated computation for the purpose of fault detection. We propose a compiler-assisted approach to fault detection in regular loops on distributed-memory systems. This approach achieves fault detection by duplicating the execution of statement instances. After carefully analyzing the data dependencies of a regular loop, selected instances of loop statements are duplicated in a way that ensures the desired fault coverage. We first present duplication strategies for fault detection and show that these strategies use idle processor times for executing replicated statements, whenever possible. Next, we present loop transformations to implement these fault-detection strategies. Also, a general framework for selecting appropriate loop transformations is developed. Experimental results performed on the CRAY-T3D show that the overhead of adding the fault detection capability is usually less than 25%, and is less than 10% when communication overhead is reduced by grouping messages.

8.

Design of multi-type fault detection for wind energy conversion systems based on set-valued observers

赵睿楠 沈艳霞 《控制理论与应用》2021,38(2):187-196

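To address the fault detection problem for wind energy conversion systems affected by delayed inputs, this paper designs a fault detection strategy based on set-valued observers. Common faults in wind energy conversion systems can be divided into additive faults and multiplicative faults. For an augmented unified model of the wind energy conversion system, the paper designs a global set-valued observer: by changing the selected vertices of the state set and the reference error interval, the value range of the gain matrix elements is changed, and a suitable gain matrix is selected to achieve accurate state tracking. On this basis, the paper designs...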

9.

Connective fault tolerance in multiple bus systems

Hung-Kuei Ku Hayes J.P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):574-586

We present an efficient approach to characterizing the fault tolerance of multiprocessor systems that employ multiple shared buses for interprocessor communication. Of concern is connective fault tolerance, which is defined as the ability to maintain communication between any two fault-free processors in the presence of faulty processors, buses, or processor-bus links. We introduce a model called processor-bus-link (PBL) graphs to represent a multiple-bus system's interconnection structure. The model is more general than previously proposed models, and has the advantages of simple representation, broad application, and the ability to model partial bus failures. The PBL graph implies a set of component adjacency graphs that highlights various connectivity features of the system. Using these graphs, we propose a method for analyzing the maximum number of faults a multiple-bus system can tolerate, and for identifying every minimum set of faulty components that disconnects the processors of the system. We also analyze the connective fault tolerance of several proposed multiple-bus systems to illustrate the application of our method.
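The definition of connective fault tolerance above can be checked directly on a small example. The sketch below is only an illustration of that definition, not the paper's PBL-graph analysis method: it assumes a system described as a list of (processor, bus) links, removes faulty processors, buses, and links, and then tests by breadth-first search whether all fault-free processors can still reach one another. The function name and the "P0"/"B0" naming are hypothetical, and processor and bus names are assumed to be distinct.

```python
from collections import deque


def processors_connected(links, faulty_nodes=frozenset(), faulty_links=frozenset()):
    """True if every pair of fault-free processors can still communicate.

    links        -- iterable of (processor, bus) pairs
    faulty_nodes -- names of faulty processors and/or buses
    faulty_links -- faulty (processor, bus) pairs
    """
    adjacency = {}
    processors = set()
    for proc, bus in links:
        processors.add(proc)
        # A link is usable only if both endpoints and the link itself work.
        if proc in faulty_nodes or bus in faulty_nodes or (proc, bus) in faulty_links:
            continue
        adjacency.setdefault(proc, set()).add(bus)
        adjacency.setdefault(bus, set()).add(proc)

    alive = processors - set(faulty_nodes)
    if len(alive) <= 1:
        return True  # nothing left that could be disconnected

    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return alive <= seen


# Usage: three processors, each attached to both of two buses.
links = [(p, b) for p in ("P0", "P1", "P2") for b in ("B0", "B1")]
one_bus_down = processors_connected(links, faulty_nodes={"B0"})
both_buses_down = processors_connected(links, faulty_nodes={"B0", "B1"})
```

Enumerating fault sets of increasing size with a check like this is a brute-force way to find the smallest disconnecting sets on toy systems; the paper's contribution is a more efficient graph-based analysis.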

10.

Tolerating Radiation-Induced Transient Faults in Modern Processors

Xiaobin Li Jean-Luc Gaudiot 《International journal of parallel programming》2010,38(2):85-116

As MOS device sizes continue shrinking, lower charges, for example those carried by single ionizing particles of naturally occurring radiation, are sufficient to upset the functioning of complex modern microprocessors. In order to handle these inevitable errors, designs should include fault-tolerant features so that the processors can continue to perform correctly despite the occurrence of errors. The main goal of this work is to develop architecture mechanisms to protect processors against the effect of such radiation-induced transient faults. It should first be noted that, from a program execution perspective, many faults manifest themselves as control flow errors that cause processors to violate the correct sequencing of instructions. First, we present a basic compile-time signature assignment algorithm and describe a novel approach to improve the fault detection coverage of the basic algorithm. Moreover, to allow the processor to efficiently check the run-time sequence and detect control flow errors, we introduce an on-chip assigned-signature checker which is capable of executing three additional instructions (SIC, SIJ, SIJC). Second, since the very concept of simultaneous multi-threading (SMT) provides the necessary redundancy, some proposals have been made to run two copies of the same thread on top of SMT platforms in order to detect and correct soft errors. This allows, upon detection of an error, the rolling back of the processor state to a known safe point, and then a retry of the instructions, thereby effecting a completely error-free execution. This paper has focused on two crucial implementation issues introduced by this scheme: (1) the design trade-off between fault detection coverage and design costs; (2) the possible occurrence of deadlock situations.

11.

Research on the application of artificial immune systems in network fault diagnosis

李辉《计算机与数字工程》2012,40(5):84-86

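Artificial immune systems (AIS) have been widely applied in many fields, such as data analysis, multimodal function optimization, and fault detection. This article introduces the artificial immune method into network fault diagnosis under the PMC model, and mainly studies how to apply AIS to system-level fault diagnosis. Theoretical analysis and experimental results show that the AIS-based network fault diagnosis method outperforms traditional methods in both the average case and the worst case.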
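For readers unfamiliar with the PMC model mentioned above, the following minimal sketch shows what a system-level diagnosis problem looks like. It is a brute-force baseline for illustration only, not the AIS method of the article. Under the PMC model, units test one another; a fault-free tester reports the true state of the unit it tests (0 = passes, 1 = fails), while the report of a faulty tester is arbitrary. The function names, the five-unit test assignment, and the fault bound `t` are assumptions made for this example.

```python
from itertools import combinations


def consistent(faulty, syndrome):
    """Check a candidate fault set against a PMC syndrome.

    syndrome maps (tester, tested) -> 0 or 1. Only reports from testers
    assumed fault-free constrain the candidate; faulty testers may say anything.
    """
    for (tester, tested), outcome in syndrome.items():
        if tester not in faulty and outcome != (1 if tested in faulty else 0):
            return False
    return True


def diagnose(nodes, syndrome, t):
    """Return every fault set of size at most t that explains the syndrome."""
    nodes = list(nodes)
    return [set(candidate)
            for size in range(t + 1)
            for candidate in combinations(nodes, size)
            if consistent(set(candidate), syndrome)]


# Usage: five units; unit i tests units i+1 and i+2 (mod 5). Unit 2 is faulty,
# so its own two reports are unreliable (here: one wrong, one right).
syndrome = {
    (0, 1): 0, (0, 2): 1,
    (1, 2): 1, (1, 3): 0,
    (2, 3): 1, (2, 4): 0,   # reports issued by the faulty unit
    (3, 4): 0, (3, 0): 0,
    (4, 0): 0, (4, 1): 0,
}
candidates = diagnose(range(5), syndrome, t=2)
```

The exhaustive search grows combinatorially with the number of units and the fault bound, which is one motivation for heuristic approaches such as the one studied in the article.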

12.

The incremental agreement

M.L. Chiang L.Y. Tseng 《Information Processing Letters》2008,107(5):165-170

To achieve reliable distributed systems, fault tolerance must be studied. One of the most important fault-tolerance problems is the Byzantine Agreement (BA) problem. The primary issue surrounding BA is that fault-free processors must reach common agreement even in cases where faults persist. In this field, the fault diagnosis protocol has been proposed so that each fault-free processor detects/locates a common set of faulty processors. In this study, however, incremental agreement is invoked so that each processor is able to reach agreement upon executing the fault diagnosis protocol using minimal rounds of message exchange in the presence of dual failure characteristics of processors.

13.

Fault tolerance in partitioned manufacturing networks

Anders Adlemo Sven-Arne Andréasson 《Journal of Systems Integration》1993,3(1):63-84

Fault tolerance is especially important for computer systems that require a high degree of confidence. Computer Integrated Manufacturing (CIM) is an area where computer systems must not be disturbed by uncontrolled failures. This article deals with two problems that are related to fault tolerance and network partitions in automated manufacturing systems. The first problem relates to the distribution of information in partitioned data networks in CIM systems. We indicate how to overcome this problem by using the material network as a redundant data network. The second problem relates to fault detection and diagnosis in manufacturing systems. The problem is whether the indication of a fault means that a production unit itself has actually broken down, or that the indication is instead due to disturbances in the transmission of material. That is, the production unit continues to operate properly despite indications to the contrary. We describe how the material network can be used for detection and diagnosis.

14.

The links between human error diversity and software diversity: Implications for fault diversity seeking

《Science of Computer Programming》2014

Software diversity is known to improve fault tolerance in N-version software systems by independent development. As the leading cause of software faults, human error is considered an important factor in diversity seeking. However, there is little scientific research focusing on how to seek software fault diversity based on human error mechanisms. A literature review was conducted to extract factors that may differentiate people with respect to human error-proneness. In addition, we constructed a conceptual model of the links between human error diversity and software diversity. An experiment was designed to validate the hypotheses, in the form of a programming contest, accompanied by a survey of cognitive styles and personality traits. One hundred ninety-two programs were submitted for the identical problem, and 70 surveys were collected. Code inspection revealed 23 faults, of which 10 were coincident faults. The results show that personality traits do not seem to be effective predictors of fault diversity as a whole model, whereas cognitive styles and program measurements moderately account for the variation of fault density. The results also show causal relations between performance levels and coincident faults: coincident faults are unlikely to occur at the skill-based performance level; the coincident faults introduced in rule-based performances show a high probability of occurrence; and the coincident faults introduced in knowledge-based performances are shaped by the content and formats of the task itself. Based on these results, we have proposed a model to seek software diversity and prevent coincident faults.

15.

Diagnosability of repairable faults

Eric Fabre Loïc Hélouët Engel Lefaucheux Hervé Marchand 《Discrete Event Dynamic Systems》2018,28(2):183-213

The diagnosis problem for discrete event systems consists in deciding whether some fault event occurred or not in the system, given partial observations on the run of that system. Diagnosability checks whether a correct diagnosis can be issued in bounded time after a fault, for all faulty runs of that system. This problem appeared two decades ago and numerous facets of it have been explored, mostly for permanent faults. It is known for example that diagnosability of a system can be checked in polynomial time, while the construction of a diagnoser is exponential. The present paper examines the case of transient faults, which can appear and be repaired. Diagnosability in this setting means that the occurrence of a fault should always be detected in bounded time, but also before the fault is repaired, in order to prepare for the detection of the next fault or to take corrective measures while they are needed. Checking this notion of diagnosability is proved to be PSPACE-complete. It is also shown that faults can be reliably counted provided the system is diagnosable for faults and for repairs.

16.

Multi-Fault Tolerance for Cartesian Data Distributions

Nawab Ali Sriram Krishnamoorthy Mahantesh Halappanavar Jeff Daily 《International journal of parallel programming》2013,41(3):469-493

Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
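The idea of storing parities along a dimension of a blocked matrix can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the parity-mapping algorithms of the paper: blocks are flat lists of integers, the "parity" is an element-wise sum over each block row, and exactly one block per row is lost. All function names are hypothetical.

```python
def add_blocks(x, y):
    """Element-wise sum of two equally sized blocks."""
    return [a + b for a, b in zip(x, y)]


def row_parities(grid):
    """For each block row, compute the element-wise sum of its blocks."""
    parities = []
    for row in grid:
        parity = [0] * len(row[0])
        for block in row:
            parity = add_blocks(parity, block)
        parities.append(parity)
    return parities


def recover_block(grid, parities, i, j):
    """Rebuild the lost block (i, j) from the row parity and surviving blocks."""
    recovered = list(parities[i])
    for k, block in enumerate(grid[i]):
        if k != j:  # the lost block itself is never read
            recovered = [r - b for r, b in zip(recovered, block)]
    return recovered


# Usage: a 2 x 2 grid of blocks; the processor holding block (1, 0) fails.
grid = [[[1, 2], [3, 4]],
        [[5, 6], [7, 8]]]
parities = row_parities(grid)
grid[1][0] = None                       # data lost with the failed processor
grid[1][0] = recover_block(grid, parities, 1, 0)
```

A single parity per row tolerates one lost block in that row; tolerating several simultaneous or correlated failures requires additional parities and a careful choice of where they are stored, which is the mapping problem the abstract describes.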

17.

Fault-tolerance through scheduling of aperiodic tasks in hard real-time multiprocessor systems

Ghosh S. Melhem R. Mosse D. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(3):272-284

Real time systems are being increasingly used in several applications which are time critical in nature. Fault tolerance is an important requirement of such systems, due to the catastrophic consequences of not tolerating faults. We study a scheme that provides fault tolerance through scheduling in real time multiprocessor systems. We schedule multiple copies of dynamic, aperiodic, nonpreemptive tasks in the system, and use two techniques that we call deallocation and overloading to achieve high acceptance ratio (percentage of arriving tasks scheduled by the system). The paper compares the performance of our scheme with that of other fault tolerant scheduling schemes, and determines how much each of deallocation and overloading affects the acceptance ratio of tasks. The paper also provides a technique that can help real time system designers determine the number of processors required to provide fault tolerance in dynamic systems. Lastly, a formal model is developed for the analysis of systems with uniform tasks.

18.

Analysis of Fault Tolerance in Artificial Neural Networks

《Journal of Parallel and Distributed Computing》2001,61(1):18-48

Wide attention was recently given to the problem of fault tolerance in neural networks; while most authors dealt with aspects related to specific VLSI implementations, attention was also given to the intrinsic capacity to survive faults that characterizes the neural modes. The present paper tackles this second theme, considering in particular multilayered feed-forward nets. One of the main goals is to identify the real influence of faults on the neural computation in order to show that neural paradigms cannot be considered intrinsically fault tolerant (i.e., able to survive faults, even several of the most common and simple ones). A high abstraction level (corresponding to the neural graphs) is taken as the basis of the study and a corresponding error model is introduced. The effects of such errors induced by faults are analytically derived to verify the probability of intrinsic masking in the final neural outputs. Then, conditions allowing for complete compensation of the errors induced by faults through weight adjustment are evaluated to test the masking abilities of the network. The designer of a neural architecture should perform such a mathematical analysis to check the actual fault-tolerance features of his or her system. Unfortunately, this involves a very high computational overhead. As a cost-effective alternative for the designer, the use of a behavioral simulation is proposed for a quantitative evaluation of the error effect on the neural computation. Repeated learning (i.e., a new application of the learning procedure on the faulty network) is then tried experimentally to induce error masking. Experimental results prove that even single errors affect the computation in a relevant way and that weight redistribution is not able to induce complete masking after a fault has occurred, i.e., the network cannot be considered per se intrinsically fault tolerant and it is not possible to rely on learning alone to achieve complete masking abilities.
Mapping criteria of physical faults onto the abstract errors are finally examined to show the usability of the proposed analysis in evaluating the actual robustness of a neural network's implementation and in identifying the critical areas where architectural redundancy should be introduced to achieve fault tolerance.

19.

Cooperative control for consensus of multi-agent systems with actuator faults

《Computers & Electrical Engineering》2014,40(7):2154-2166

The consensus tracking problem in multi-agent systems with both outage and partial loss of effectiveness types of actuator faults is investigated in this paper. By adopting the virtual actuator technique based on the obtained fault estimates, the effects of possible actuator faults can be effectively compensated in each individual agent. Based on this, a distributed control strategy is developed by using locally available information. It is shown that the tracking error of each fault-free agent can converge to zero and the tracking errors of the faulty agents can remain bounded if the fault estimates are accurate. Moreover, a sufficient condition for achieving bounded tracking errors for all the followers is derived in terms of the fault estimation error. The sliding mode observer is utilized to estimate the faults. Simulation results are given to show the effectiveness of the proposed approach.

20.

Reducing the soft-error rate of a high-performance microprocessor

《Micro, IEEE》2004,24(6):30-37

Single-bit upsets from transient faults have emerged as a key challenge in microprocessor design. Soft errors will be an increasing burden for microprocessor designers as the number of on-chip transistors continues to grow exponentially. Unlike traditional approaches, which focus on detecting and recovering from faults, this article introduces techniques to reduce the probability that a fault will cause a declared error. The first approach reduces the time instructions sit in vulnerable storage structures. The second avoids declaring errors on benign faults. Applying these techniques to a microprocessor instruction queue significantly reduces its error rate with only minor performance degradation.