1.

Using Data Flow Information to Obtain Efficient Check Sets for Algorithm-Based Fault Tolerance

Ragini Narasimhan Daniel J. Rosenkrantz S. S. Ravi 《International journal of parallel programming》1999,27(4):289-323

Algorithm-Based Fault Tolerance (ABFT) is a well-known technique for achieving fault and error detection in multiprocessor systems. We examine several issues concerning ABFT systems when the data flow information for the underlying multiprocessor computation is available. Our results show that this finer-grained information can be exploited to obtain test schemes involving fewer checks, in some cases dramatically fewer. We address both the analysis and design of ABFT systems when the data flow information is available. The analysis problem for a given ABFT system is to determine the fault detectability and the fault locatability (maximum number of detectable and locatable faulty processors) of the system. We show that the analysis problem can be solved efficiently when the number of faults is fixed. We also address the computational difficulty of this problem when the number of faults is not fixed. The design problem is concerned with the construction of a minimal collection of checks which can detect or locate a specified number of faults for a given multiprocessor computation. We examine some special classes of data flow graphs and establish upper and lower bounds on the number of checks needed to detect or locate a given number of faults. We also address the computational difficulty of this design problem for several cases.

2.

Graceful degradation in algorithm-based fault tolerant multiprocessor systems

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(2):137-153

Algorithm-based fault tolerance (ABFT) is a technique which improves the reliability of a multiprocessor system by providing it with concurrent error detection and fault location capability. It encodes data at the system level and modifies the algorithm to operate on the encoded data in order to expose both transient and permanent faults in any processor. Work to date in this area addresses only the fault detection and location part of the problem. However, if spare processors are not available, then after a faulty processor has been located, the work initially assigned to it has to be mapped to some nonfaulty processors in the system in such a way that the fault tolerance capability of the system is still maintained with as small a degradation in performance as possible. In this paper, we propose an integrated deterministic solution to the above problem which combines concurrent error detection and fault location with graceful degradation. There exists no previous deterministic ABFT method for the design of general t-fault locating systems, even for the case of t=1. We propose a general method for designing one-fault locating/s-fault detecting systems. We use an extended model for representing ABFT systems. This model considers the processors computing the checks to be a part of the ABFT system, so that faults in the check computing processors can also be detected and located using a simple diagnosis algorithm, and the checks can be mapped to other nonfaulty processors in the system.

3.

Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(7):757-768

Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to the system through system level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The model considers failures in the processors performing the checking operations.

4.

Synthesis of algorithm-based fault-tolerant systems from dependence graphs

Vinnakota B. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):864-874

Algorithm-based fault tolerance (ABFT) is a method for improving the reliability of parallel architectures used for computation-intensive tasks. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage, a system-level code is chosen to encode the data used in the algorithm. In the second stage, the optimal architecture to implement the scheme is chosen using dependence graphs. Dependence graphs are a graph-theoretic form of algorithm representation. The authors demonstrate that not all architectures are ideal for the implementation of a particular ABFT scheme. They propose new measures to characterize the fault tolerance capability of a system to better exploit the proposed synthesis method. Dependence graphs can also be used for the synthesis of ABFT schemes for non-linear problems. An example of a fault-tolerant median filter is provided to illustrate their utility for such problems.

5.

New encoding/decoding methods for designing fault-tolerant matrix operations

Tao D.L. Hartmann C.R.P. Han Y.S. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(9):931-938

Algorithm-based fault tolerance (ABFT) can provide low-cost error protection for array processors and multiprocessor systems. Several ABFT techniques (weighted checksum) have been proposed to design fault-tolerant matrix operations. In these schemes, encoding/decoding uses either multiplications or divisions, so the overhead is high. In this paper, new encoding/decoding methods are proposed for designing fault-tolerant matrix operations. The unique feature of these new methods is that only additions and subtractions are used in encoding/decoding. New algorithms are also proposed to construct error detecting/correcting codes with minimum Hamming distance 3 and 4. We will show that the overhead introduced by the incorporation of fault tolerance is drastically reduced by using these new coding schemes.
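
As background for the abstract above, the following is a minimal sketch of the classic unweighted checksum scheme for matrix multiplication, in which encoding needs only additions. It is not the coding scheme proposed in the paper, whose details the abstract does not give. All function names (column_checksum, row_checksum, matmul, locate_error) are hypothetical, and integer entries are assumed so that checks can use exact equality; floating-point data would need a tolerance.

```python
def column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of b one extra element holding the row sum."""
    return [row + [sum(row)] for row in b]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def locate_error(c_full):
    """Check a full-checksum product matrix.

    Returns None if every row and column check passes, or (row, col)
    of a single erroneous data element. Raises ValueError for any other
    failure pattern (several errors, or an error in a checksum entry).
    """
    n = len(c_full) - 1       # number of data rows
    m = len(c_full[0]) - 1    # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("failure pattern is not a single data-element error")


# Illustrative data (assumed, not taken from the paper).
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(column_checksum(a), row_checksum(b))
print(locate_error(c_full))   # all checks pass
c_full[1][0] += 5             # inject a single error
print(locate_error(c_full))   # row and column of the corrupted element
```

Because the product of a column-checksum matrix and a row-checksum matrix is itself a full-checksum matrix, one failing row check and one failing column check intersect at the corrupted element, which can then be corrected by subtraction from the row checksum.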

6.

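Concurrent Error Detection Structure for FCT Networks

陈禾 毛志刚 叶以正 《计算机研究与发展》1999,36(10):1246-1252

This paper proposes a concurrent error detection structure for fast discrete cosine transform (FCT) circuits. To achieve 100% fault coverage, the FCT is implemented with the butterfly structure of the B. G. Lee algorithm, which is based on the type-III discrete cosine transform.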

7.

Almost certain fault diagnosis through algorithm-based fault tolerance

Blough D.M. Pelc A. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(5):532-539

Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.

8.

A new algorithm based on Givens rotations for solving linear equations on fault-tolerant mesh-connected processors

Murthy K.N.B. Bhuvaneswari K. Ram Murthy C.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(8):825-832

In this paper, we propose a new I/O overhead free Givens rotations based parallel algorithm for solving a system of linear equations. The algorithm uses a new technique called two-sided elimination and requires an N×(N+1) mesh-connected processor array to solve N linear equations in (5N-log N-4) time steps. The array is well suited for VLSI implementation, as it requires only identical processors with a simple and regular interconnection pattern. We also describe a fault-tolerant scheme based on an algorithm based fault tolerance (ABFT) approach. This scheme has small hardware and time overhead and can tolerate up to N processor failures.
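
For readers unfamiliar with the underlying numerical method, here is a minimal sequential sketch of solving a linear system with Givens rotations on an N×(N+1) augmented matrix. It is the textbook one-sided elimination followed by back substitution, not the two-sided elimination or the mesh mapping described in the abstract. The function name givens_solve is hypothetical.

```python
import math


def givens_solve(a, b):
    """Solve a x = b for a square matrix a using Givens rotations.

    Each rotation mixes rows j and i so that the entry below the
    diagonal in column j becomes zero; the right-hand side is carried
    along as the last column of the augmented matrix.
    """
    n = len(a)
    # Build the n x (n + 1) augmented matrix [a | b] in floating point.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for j in range(n):
        for i in range(j + 1, n):
            r = math.hypot(m[j][j], m[i][j])
            if r == 0.0:
                continue  # both entries already zero, nothing to rotate
            c, s = m[j][j] / r, m[i][j] / r
            for k in range(j, n + 1):
                top, low = m[j][k], m[i][k]
                m[j][k] = c * top + s * low
                m[i][k] = -s * top + c * low

    # Back substitution on the resulting upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        if m[i][i] == 0.0:
            raise ValueError("matrix is singular")
        acc = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = acc / m[i][i]
    return x


# Illustrative system (assumed): 2x + y = 3, x + 3y = 5.
print(givens_solve([[2, 1], [1, 3]], [3, 5]))
```

Givens rotations are orthogonal, so they avoid the pivoting that Gaussian elimination needs for stability, and each rotation touches only two rows, which is one reason the method maps naturally onto regular processor arrays.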

9.

Diagnosability of repairable faults

Eric Fabre Loïc Hélouët Engel Lefaucheux Hervé Marchand 《Discrete Event Dynamic Systems》2018,28(2):183-213

The diagnosis problem for discrete event systems consists in deciding whether some fault event occurred or not in the system, given partial observations on the run of that system. Diagnosability checks whether a correct diagnosis can be issued in bounded time after a fault, for all faulty runs of that system. This problem appeared two decades ago and numerous facets of it have been explored, mostly for permanent faults. It is known for example that diagnosability of a system can be checked in polynomial time, while the construction of a diagnoser is exponential. The present paper examines the case of transient faults, that can appear and be repaired. Diagnosability in this setting means that the occurrence of a fault should always be detected in bounded time, but also before the fault is repaired, in order to prepare for the detection of the next fault or to take corrective measures while they are needed. Checking this notion of diagnosability is proved to be PSPACE-complete. It is also shown that faults can be reliably counted provided the system is diagnosable for faults and for repairs.

10.

Timely robust fault detection for multirate linear systems

M. S. Fadali H. E. Emara-Shabaik 《International journal of control》2013,86(5):305-313

This paper presents a fault detection and isolation scheme for multirate systems with a fast input sampling rate and slower output sampling rates. We design a separate observer for each set of simultaneous measurements with the observer operating at their sampling rate. We use an unknown input observer design to allow state estimation in the presence of disturbances and modelling errors. The observer allows us to estimate the system state and obtain a residual vector to be used in fault detection. Furthermore, we are able to use single-rate methodologies for fault diagnosis. We provide necessary and sufficient conditions for the existence of the observer and the detection of the fault vector. An example is given to illustrate the new fault detection approach and another to demonstrate fault isolation.

11.

Component based design of multitolerant systems

Arora A. Kulkarni S.S. 《IEEE Transactions on Software Engineering》1998,24(1):63-78

The concept of multitolerance abstracts problems in system dependability and provides a basis for improved design of dependable systems. In the abstraction, each source of undependability in the system is represented as a class of faults, and the corresponding ability of the system to deal with that undependability source is represented as a type of tolerance. Multitolerance thus refers to the ability of the system to tolerate multiple fault classes, each in a possibly different way. We present a component based method for designing multitolerance. Two types of components are employed by the method, namely detectors and correctors. A theory of detectors, correctors, and their interference free composition with intolerant programs is developed, which enables stepwise addition of components to provide tolerance to a new fault class while preserving the tolerances to the previously added fault classes. We illustrate the method by designing a fully distributed multitolerant program for a token ring.

12.

An Approach to Post Mortem Diagnosability Analysis for Interacting Finite State Systems

Dan Lawesson Ulf Nilsson Inger Klein 《Electronic Notes in Theoretical Computer Science》2006,149(2):139

We present a model based approach to diagnosability analysis for interacting finite state systems where fault isolation is deferred until the system comes to a standstill. Local abstractions of the system model are used to alleviate the state space explosion. Pairs of closely coupled automata are merged and replaced by a single automaton with equivalent behavior as seen from the rest of the system; interaction between the merged automata is internalized and the new equivalent automaton is subsequently abstracted from internal behavior irrelevant to fault isolation. In moderately concurrent systems these steps can often be iterated until the system consists of a single automaton providing a compact encoding of all possible fault scenarios of the original model. We illustrate how the resulting abstraction can be used as a basis for post mortem diagnosability analysis.

13.

Constantine: configurable static analysis tool in Eclipse

Makarand Gawade K. Ravikanth Sanjeev Aggarwal 《Software》2014,44(5):537-563

Static code analysers help in exposing internal code quality problems. For higher effectiveness, they must be pressed into use early during the development of code. They must support the formulation of new coding constraints with relative ease to better cope with variations in coding standards. We present the design of a static analyser that addresses these twin objectives. Our system provides interactive feedback to programmers on the non‐conformances that occur in response to the changes made to the code. Its rule construction framework empowers programmers to define new conformance rules, which can come into effect immediately after creation. The tool has been realized as an Eclipse plug‐in for the analysis of C, C++ and Java sources. Central to its design is the concept of reusing a set of primitive checks by composing them to form new rules. This renders rule construction accessible to programmers, lowers dependence on tool smiths and accelerates the enforcement of custom checks. We also present our experience in defining rules drawn from an industry standard rule set based on this approach. Copyright © 2012 John Wiley & Sons, Ltd.

14.

Improving reliability of cooperative concurrent systems with exception flow analysis

Fernando Castor Filho Alexander Romanovsky 《Journal of Systems and Software》2009,82(5):874-890

Developers of fault-tolerant distributed systems need to guarantee that fault tolerance mechanisms they build are in themselves reliable. Otherwise, these mechanisms might in the end negatively affect overall system dependability, thus defeating the purpose of introducing fault tolerance into the system. To achieve the desired levels of reliability, mechanisms for detecting and handling errors should be developed rigorously or formally. We present an approach to modeling and verifying fault-tolerant distributed systems that use exception handling as the main fault tolerance mechanism. In the proposed approach, a formal model is employed to specify the structure of a system in terms of cooperating participants that handle exceptions in a coordinated manner, and coordinated atomic actions serve as representatives of mechanisms for exception handling in concurrent systems. We validate the approach through two case studies: (i) a system responsible for managing a production cell, and (ii) a medical control system. In both systems, the proposed approach has helped us to uncover design faults in the form of implicit assumptions and omissions in the original specifications.

15.

Robust fault detection filter design for a class of time-delay systems via equivalent transformation

Jilie ZHANG Huaguang ZHANG Feisheng YANG Shenquan WANG 《控制理论与应用(英文版)》2013,11(1):54-60

This paper is concerned with the robust fault detection filter (RFDF) design for a class of linear time-invariant systems (LTISs) with output state time delays. Although existing results in the literature study the RFDF for time-delay systems, few are concerned with output state time-delay systems. The basic idea of our study is to eliminate the time delays of the system and transform it into a delay-free system (i.e., a linear time-invariant system without time delays) by the bicausal change of coordinates approach. Then, we design the RFDF for the delay-free LTIS, which is equivalent to the original system with time delays. We first introduce a class of systems with output state time delays, whose fault can be detected by using the RFDF design approach for delay-free systems. Then, since the RFDF design problem can be formulated as a standard H-infinity model matching problem, it is solved by using H-infinity optimization LMI techniques. Finally, the adaptive threshold of fault detection is chosen and an illustrative design example is used to demonstrate the validity of the design approach.

16.

Reliable Observer-Based Control Against Sensor Failures for Systems With Time Delays in Both State and Input

《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2008,38(5):1018-1029

17.

Observer‐based fault diagnosis and self‐restore control for systems with measurement delays

Juan Li Gong‐You Tang Peng Zhang Jian Zou 《Asian journal of control》2012,14(6):1717-1723

The problems of fault diagnosis and fault‐tolerant control are considered for systems with measurement delays. In contrast to the present fault diagnosis and fault‐tolerant control approaches, which consider only the input delay and/or state delay, the main contribution of this paper consists of proposing a new observer‐based reduced‐order fault diagnoser construction approach and a design approach to dynamic self‐restore fault‐tolerant control law for systems with measurement delays. First, the time‐delay system is transformed into a delay‐free system in form by a special functional‐based delay‐free transformation approach for measurement delays. Then, the fault diagnosis is realized online via the proposed reduced‐order fault diagnoser. Using the results of fault diagnosis, two dynamic self‐restore control laws are designed to make the system isolated from faults. A numerical example demonstrates the feasibility and validity of the proposed scheme. © 2012 John Wiley and Sons Asia Pte Ltd and Chinese Automatic Control Society

18.

Operating system support for the management of hard real-time disk traffic

《Journal of Systems Architecture》2000,46(4):379-395

Emerging applications like C³I systems, real-time databases, data acquisition systems and multimedia servers require access to secondary storage devices under timing constraints. In this paper, we focus on operating system support needed for managing real-time disk traffic with hard deadlines. We present the design and implementation of a preemptive deadline-driven disk I/O subsystem suitable for real-time disk traffic management. Preemptibility is achieved with a granularity that is automatically controlled by the I/O subsystem according to current workload and filesystem data layout. An admission control test checks the current resource availability for a given workload. We show that contiguous layout is necessary to maintain hard real-time guarantees and a reasonable level of disk throughput. Finally, we show how buffering can be used to obtain utilization factors close to the maximum disk bandwidth possible.

19.

A New and Faster Gaussian Elimination Based Fault Tolerant Systolic Linear System Solver

K. Bhuvaneswari K.N. Balasubramanya Murthy C. Siva Ram Murthy 《Journal of Parallel and Distributed Computing》1997,44(2):107

This paper presents a new systolic algorithm for the complete solution of a system of N linear equations in (N²/2 + O(N)) time steps using 2N processing elements (PEs). It is based on a variant of the Gaussian elimination (GE) algorithm called the successive GE and is faster than any existing GE based algorithm using O(N) PEs. We also suggest two fault tolerant schemes that tolerate up to N PE failures. The first scheme is a time redundancy based approach with no hardware overhead and 100% time overhead. This scheme can tolerate up to N PE failures. The second scheme is based on algorithm based fault tolerance (ABFT) and uses N extra PEs to tolerate up to N - 1 PE failures with very little time overhead. The number of errors that can be detected/corrected in both schemes is more than that in any existing fault tolerant systolic array.

20.

Design of a fault tolerant control system incorporating reliability analysis and dynamic behaviour constraints

F. Guenab P. Weber Y.M. Zhang 《International journal of systems science》2013,44(1):219-233

In highly automated aerospace and industrial systems where maintenance and repair cannot be carried out immediately, it is crucial to design control systems capable of ensuring desired performance when taking into account the occurrence of faults/failures on a plant/process; such a control technique is referred to as fault tolerant control (FTC). The control system possessing such fault tolerance capability is referred to as a fault tolerant control system (FTCS). The objective of FTC is to maintain system stability and keep the current performance of the system close to the desired performance in the presence of system component and/or instrument faults; in certain circumstances a reduced performance may be acceptable. Various control design methods have been developed in the literature with the aim of modifying or accommodating baseline controllers which were originally designed for systems operating under fault-free conditions. The main objective of this article is to develop a novel FTCS design method, which incorporates both reliability and dynamic performance of the faulty system in the design of an FTCS. Once a fault has been detected and isolated, the reconfiguration strategy proposed in this article will find possible structures of the faulty system that best preserve pre-specified performances based on on-line calculated system reliability and associated costs. The new reconfigured controller gains will also be synthesised and finally the optimal structure that has the ‘best’ control performance with the highest reliability will be chosen for control reconfiguration. The effectiveness of this work is illustrated by a heating system benchmark used in a European project entitled intelligent Fault Tolerant Control in Integrated Systems (IFATIS EU-IST-2001-32122).