
1.

Analysis and randomized design of algorithm-based fault tolerant multiprocessor systems under an extended model

Yajnik S. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(7):757-768

Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to the system through system level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The model considers failures in the processors performing the checking operations.

2.

Partitioned encoding schemes for algorithm-based fault tolerance in massively parallel systems

Rexford J. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(6):649-653

Considers the applicability of algorithm based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This short note proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.
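The idea of partitioning a checksum code can be sketched as follows. This is a minimal illustration, not the scheme from the paper: it assumes a plain column-checksum encoding for matrix multiplication, and the function names and the block size are hypothetical. Each group of `block` rows gets its own checksum row, so every checksum sums only a few terms and a mismatch points to one partition.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def encode_partitioned(a, block):
    """Append one column-checksum row after every `block` rows of a."""
    out = []
    for start in range(0, len(a), block):
        rows = a[start:start + block]
        out.extend(rows)
        out.append([sum(col) for col in zip(*rows)])
    return out


def faulty_partitions(c_enc, block, n_rows, tol=1e-9):
    """Return indices of partitions whose checksum row disagrees with its data rows.

    c_enc is the product of an encoded matrix with an unencoded one;
    n_rows is the number of data rows before encoding.
    """
    bad, pos, part, remaining = [], 0, 0, n_rows
    while remaining > 0:
        size = min(block, remaining)
        rows, chk = c_enc[pos:pos + size], c_enc[pos + size]
        if any(abs(sum(col) - chk[j]) > tol for j, col in enumerate(zip(*rows))):
            bad.append(part)
        pos += size + 1
        remaining -= size
        part += 1
    return bad


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [2, 1]]
c = matmul(encode_partitioned(a, 2), b)   # checksums carry through the product
clean = faulty_partitions(c, 2, 4)
c[3][0] += 1                              # simulated error in the second partition
flagged = faulty_partitions(c, 2, 4)
```

Because the product is linear in the rows of the left operand, each checksum row of the encoded input becomes the checksum row of the corresponding partition of the result.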

3.

Graceful degradation for reducing jitter of battery life in fault-tolerant embedded systems

Salim Kalla Riadh Hocine Abderrezak Chouki 《International journal of systems science》2018,49(11):2353-2361

Tolerating faults and minimising energy consumption in embedded systems is a difficult task due to the fact that the two objectives are antagonistic. In this paper, we propose a new approach based on graceful degradation to reduce jitter of battery life and thereby energy consumption in fault-tolerant embedded systems. In case of faults, the affected task is re-executed. In our solution, the energy level of the battery is periodically verified, and if we detect that continuing in the current operating mode leads to jitter, the system gracefully degrades to the adequate operating mode. In such degraded mode, the dynamic voltage scaling technique is used to save energy. The effectiveness of graceful degradation depends on the application criticality level. Simulation results show that the use of graceful degradation can reduce jitter of battery life, and thereby can minimise energy consumption.

4.

Design of algorithm-based fault-tolerant multiprocessor systems for concurrent error detection and fault diagnosis

Vinnakota B. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1099-1106

Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. We present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.

5.

An analysis of algorithm-based fault tolerance techniques

《Journal of Parallel and Distributed Computing》1988,5(2):172-184

We introduce a unified checksum scheme for the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition. The purpose is to detect and locate a transient error during a systolic array computation. We show how to represent the error as a rank-one perturbation to the original data, so that we need not worry about when the error occurred. Finally, we perform a floating point error analysis to determine the effects of rounding errors on the checksum scheme.
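A small sketch can show why a checksum survives elimination. It assumes a row-checksum column appended to the matrix and Gaussian elimination without pivoting, and it uses exact rational arithmetic so that rounding plays no role. The function names and the fault-injection argument are hypothetical, and the sketch covers only detection, not the paper's error location or rank-one analysis.

```python
from fractions import Fraction


def augment(a):
    """Append to each row a checksum entry equal to the sum of that row."""
    return [[Fraction(x) for x in row] + [Fraction(sum(row))] for row in a]


def eliminate(m, fault=None):
    """Gaussian elimination without pivoting on the augmented matrix m, in place.

    fault = (step, i, j, delta) adds delta to m[i][j] after elimination
    step `step`, modelling a transient error (illustrative only).
    """
    n = len(m)
    for k in range(n - 1):
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            # The row operation is applied to the checksum entry as well.
            m[i] = [x - f * y for x, y in zip(m[i], m[k])]
        if fault is not None and fault[0] == k:
            _, i, j, delta = fault
            m[i][j] += delta
    return m


def inconsistent_rows(m):
    """Rows whose data entries no longer sum to their checksum entry."""
    return [i for i, row in enumerate(m) if sum(row[:-1]) != row[-1]]


a = [[2, 1, 1], [4, 3, 3], [8, 7, 9]]
clean = inconsistent_rows(eliminate(augment(a)))
faulty = inconsistent_rows(eliminate(augment(a), fault=(0, 2, 1, 1)))
```

Each elimination step replaces a row by a linear combination of two rows that both satisfy the checksum relation, so the relation holds at every step of a fault-free run and a single corrupted entry breaks it.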

6.

Almost certain fault diagnosis through algorithm-based fault tolerance

Blough D.M. Pelc A. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(5):532-539

Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.

7.

Synthesis of algorithm-based fault-tolerant systems from dependence graphs

Vinnakota B. Jha N.K. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):864-874

Algorithm-based fault tolerance (ABFT) is a method for improving the reliability of parallel architectures used for computation-intensive tasks. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage, a system-level code is chosen to encode the data used in the algorithm. In the second stage, the optimal architecture to implement the scheme is chosen using dependence graphs. Dependence graphs are a graph-theoretic form of algorithm representation. The authors demonstrate that not all architectures are ideal for the implementation of a particular ABFT scheme. They propose new measures to characterize the fault tolerance capability of a system to better exploit the proposed synthesis method. Dependence graphs can also be used for the synthesis of ABFT schemes for non-linear problems. An example of a fault-tolerant median filter is provided to illustrate their utility for such problems.

8.

Improved shuffled frog leaping algorithm-based BP neural network and its application in bearing early fault diagnosis

Zhuanzhe Zhao Qingsong Xu Minping Jia 《Neural computing & applications》2016,27(2):375-385

This paper reports on a new back propagation (BP) neural network based on an improved shuffled frog leaping algorithm (ISFLA) and its application in bearing fault diagnosis. The ISFLA is developed on the basis of a chaotic operator and the convergence factor of particle swarm optimization to overcome the shortcomings of the conventional shuffled frog leaping algorithm (SFLA). Testing results show that the proposed algorithm can effectively improve the solution accuracy and convergence properties and exhibits an excellent ability of global optimization in high-dimensional space. The presented ISFLA is then employed to optimize the weights and threshold values of the BP neural network. An ISFLA-BP network model is established for the early fault diagnosis of rolling bearings. The proposed ISFLA-BP scheme has been compared with BP and SFLA-BP networks through experimental studies. Results indicate that the new model demonstrates better generalization capability and stronger robustness. It is able to effectively improve the efficiency of network training and the accuracy of early fault pattern recognition in bearing fault diagnosis tasks.

9.

Binary quantum-inspired gravitational search algorithm-based multi-criteria scheduling for multi-processor computing systems

Thakur Abhijeet Singh Biswas Tarun Kuila Pratyay 《The Journal of supercomputing》2021,77(1):796-817

A quantum-inspired hybrid scheduling technique is proposed for multi-processor computing systems. The proposed algorithm is a hybridization of principles of quantum mechanics (QM) and a nature-inspired intelligence technique, the gravitational search algorithm (GSA). QM principles such as the quantum bit, superposition and the rotation gate help to design an efficient agent representation, while the intense exploration capability of GSA improves the convergence rate. The fitness function used during the evaluation of agents is designed to minimize makespan, balance loads adequately and properly utilize the deployed resources. Several standard benchmarks as well as synthetic data sets are used to analyze and validate the work. The performance of the proposed algorithm is compared with recently designed algorithms such as the quantum genetic algorithm, particle swarm optimization-based multi-criteria scheduling, Improved-GA, GSA and Cloudy-GSA. The significance of the algorithm is tested using a hypothesis test based on analysis of variance.

10.

Transient fault tolerance in digital systems (total citations: 1; self-citations: 0; citations by others: 1)

Sosnowski J. 《Micro, IEEE》1994,14(1):24-35

It is hard to shield systems effectively from transient faults (fault avoidance techniques). So some other means must be employed to assure appropriate levels of transient fault tolerance (insensitivity to transient faults). They are based on fault-masking and fault recovery ideas. Having analyzed this problem, the author identifies critical design points and outlines some practical solutions that refer to efficient on-line detectors (detecting errors during the system operation) and error handling procedures. This framework provides a basis for understanding transient fault problems in digital systems. It can be helpful in selecting optimum techniques to mask or eliminate transient fault effects in developed systems.

11.

Models for fault tolerance in manufacturing systems (total citations: 1; self-citations: 0; citations by others: 1)

Anders Adlemo Sven-Arne Andréasson 《Journal of Intelligent Manufacturing》1992,3(1):1-10

The field of fault tolerance in computer science and engineering has been thoroughly investigated over a long period of time. A great number of different approaches have been presented on means for improving fault tolerance under certain error conditions in computerized systems. One important area that has introduced computers in order to enhance productivity, flexibility and economy is manufacturing, giving rise to computer-integrated manufacturing (CIM). Using computers in a manufacturing system introduces new sources of difficulties, as well as providing new possibilities for overcoming erroneous situations that might disturb production. The aim of this paper is to describe how the use of different configurations for a manufacturing system can improve fault tolerance. One specific erroneous situation which may occur in CIM is the partitioning of a network. This situation can be handled satisfactorily by using the suggested manufacturing system configurations. Additional improvements to fault tolerance can be achieved through the introduction of data buffers and material buffers. This approach is described in this paper.

12.

Connective fault tolerance in multiple bus systems

Hung-Kuei Ku Hayes J.P. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):574-586

We present an efficient approach to characterizing the fault tolerance of multiprocessor systems that employ multiple shared buses for interprocessor communication. Of concern is connective fault tolerance, which is defined as the ability to maintain communication between any two fault-free processors in the presence of faulty processors, buses, or processor-bus links. We introduce a model called processor-bus-link (PBL) graphs to represent a multiple-bus system's interconnection structure. The model is more general than previously proposed models, and has the advantages of simple representation, broad application, and the ability to model partial bus failures. The PBL graph implies a set of component adjacency graphs that highlights various connectivity features of the system. Using these graphs, we propose a method for analyzing the maximum number of faults a multiple-bus system can tolerate, and for identifying every minimum set of faulty components that disconnects the processors of the system. We also analyze the connective fault tolerance of several proposed multiple-bus systems to illustrate the application of our method.
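The definition of connective fault tolerance given above can be checked directly on a small example. The sketch below is not the PBL-graph analysis from the paper. It assumes a bipartite processor/bus graph whose edges are the processor-bus links, and it assumes that fault-free processors may relay messages between buses; the function and argument names are hypothetical.

```python
from collections import deque


def fault_free_processors_connected(procs, links, bad_procs=(), bad_buses=(), bad_links=()):
    """Return True if every pair of fault-free processors can still communicate.

    links is an iterable of (processor, bus) pairs. Faulty processors, buses
    and links are removed before a breadth-first search over what remains.
    """
    bad_procs, bad_buses, bad_links = set(bad_procs), set(bad_buses), set(bad_links)
    good = [p for p in procs if p not in bad_procs]
    if len(good) < 2:
        return True
    adj = {}
    for p, b in links:
        if p in bad_procs or b in bad_buses or (p, b) in bad_links:
            continue
        adj.setdefault(("proc", p), []).append(("bus", b))
        adj.setdefault(("bus", b), []).append(("proc", p))
    start = ("proc", good[0])
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return all(("proc", p) in seen for p in good)


# Four processors, each linked to both of two buses.
procs = [0, 1, 2, 3]
links = [(p, b) for p in procs for b in ("A", "B")]
one_bus_down = fault_free_processors_connected(procs, links, bad_buses=["A"])
both_buses_down = fault_free_processors_connected(procs, links, bad_buses=["A", "B"])
isolated = fault_free_processors_connected(procs, links, bad_links=[(0, "A"), (0, "B")])
```

In this example the system tolerates the loss of either bus, but two link faults that cut one processor off from both buses disconnect it.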

13.

Evaluating fault tolerance approaches in multi-agent systems

Rade Stanković Maja Štula Josip Maras 《Autonomous Agents and Multi-Agent Systems》2017,31(1):151-177

A multi-agent system (MAS) is a distributed system that consists of multiple agents working together to solve mutual problems. Even though MASs are well suited for the development of complex distributed systems, the number of real-world usages is still small. One of the main reasons for this is that MASs are very fragile. In a typical, large-scale MAS, the rate of failure grows with the number of hosts, the number of deployed agents, and the duration of the agent’s task execution. For this reason, numerous approaches have been introduced to deal with aspects of failure handling. However, the absence of centralized control and a large number of individual intelligent components makes it difficult to detect and treat errors. The risk of uncontrollable fault propagation is high and can seriously impact on system performance. There are two important factors that limit the usage of MASs: (1) existing fault tolerance (FT) approaches are not generic, as they focus on and improve specific issues of FT; and (2) despite the plethora of available FT approaches and theories, there is a remarkable lack of general metrics, tools, benchmarks, and experimental methods for formal validation and comparison of existing or newly developed FT approaches. As FT approaches in MASs become a well-established field, the need for generalized, standardized evaluation of FT approaches emerges as imperative. In this paper, we first present a detailed overview of existing FT solutions, approaches, and techniques in agent platform hosted MASs. From that overview, we derive the commonalities in existing research. Next, we present the main contribution of our paper: an evaluation methodology, with a set of metrics, for comparing FT approaches in MASs. We adopt an engineering perspective on the problem, defining a methodology and metrics that are both implementation- and domain-independent. The metrics are formalized with an acyclic directed graph. 
By using our methodology, evaluators can select an appropriate FT approach for targeted MAS application, thus improving MAS usability, stability, and development speed. In order to show the viability of our approach, a case study that compares two FT approaches for a targeted MAS is presented. The case study results show that our methodology can be used for selecting an appropriate FT approach for the targeted MAS.

14.

Methods of fault accommodation in technical systems

E. Yu. Bobko A. N. Zhirabok A. E. Shumskii 《Journal of Computer and Systems Sciences International》2016,55(5):735-749

We consider the design problem of methods of full decoupling from fault-generated actions in systems described by a nonlinear dynamical model, where by a fault we mean inadmissible deviations of the parameters of the system from their nominal values. The solution of the problem depends on the formation of a new control law taking into account the behavior of a system with faults. To solve this problem we propose using the logic-dynamic approach, in which the original model of the system is subject to certain transformations, which enables solving nonlinear problems by linear methods. Another feature of this approach is the possibility of dealing with nondifferentiable nonlinearities, such as saturation, backlash, and hysteresis. The solution proposed in this paper is based on the full decoupling from fault-generated actions, which requires the construction of an auxiliary system for the synthesis of a feedback which provides for the required decoupling. Implementation of this method will make fault accommodation more efficient by reducing the computational and time costs of implementing the new control law.

15.

Optimizing fault tolerance in embedded distributed systems (total citations: 1; self-citations: 0; citations by others: 1)

Draber S. 《Micro, IEEE》2000,20(4):76-84

This article considers a gas-insulated switchgear (GIS) station. Here, all switches are located in SF₆ gas volumes, offering a much better isolation and spark-extinguishing property than air, and thus leading to a far more compact size. All volumes have one or two sensors for monitoring gas density. The station's distributed computing system is arranged in several levels. PISAs, embedded computing systems for preprocessing the data from digital/analog converters and for writing them on the process bus (PB), link to the sensor and actuator hardware. PISAs in the switches also take the actuator commands from the process bus and carry out the switching actions.

16.

Neural-network-based robust fault diagnosis in robotic systems (total citations: 7; self-citations: 0; citations by others: 7)

A T Vemuri M M Polycarpou 《Neural Networks, IEEE Transactions on》1997,8(6):1410-1420

Fault diagnosis plays an important role in the operation of modern robotic systems. A number of researchers have proposed fault diagnosis architectures for robotic manipulators using the model-based analytical redundancy approach. One of the key issues in the design of such fault diagnosis schemes is the effect of modeling uncertainties on their performance. This paper investigates the problem of fault diagnosis in rigid-link robotic manipulators with modeling uncertainties. A learning architecture with sigmoidal neural networks is used to monitor the robotic system for any off-nominal behavior due to faults. The robustness and stability properties of the fault diagnosis scheme are rigorously established. Simulation examples are presented to illustrate the ability of the neural-network-based robust fault diagnosis scheme to detect and accommodate faults in a two-link robotic manipulator.

17.

Particle filtering-based fault detection in non-linear stochastic systems (total citations: 2; self-citations: 0; citations by others: 2)

V. Kadirkamanathan P. Li M. H. Jaward S. G. Fabri 《International journal of systems science》2013,44(4):259-265

Much of the development in model-based fault detection techniques for dynamic stochastic systems has relied on the system model being linear and the noise and disturbances being Gaussian. Linearized approximations have been used in the non-linear systems case. However, linearization techniques, being approximate, tend to suffer from poor detection or high false alarm rates. A novel particle filtering based approach to fault detection in non-linear stochastic systems is developed here. One of the appealing advantages of the new approach is that the complete probability distribution information of the state estimates from the particle filter is utilized for fault detection, whereas only the mean and covariance of an approximate Gaussian distribution are used in a conventional extended Kalman filter-based approach. Another advantage of the new approach is its applicability to general non-linear systems with non-Gaussian noise and disturbances. The effectiveness of this new method is demonstrated through Monte Carlo simulations and the detection performance is compared with that using the extended Kalman filter on a non-linear system.
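The following sketch shows one way a particle filter can feed a fault detector. It is an illustration under stated assumptions, not the authors' algorithm: the scalar model `f`, the noise levels, the sensor-bias fault, and the use of the one-step predictive log-likelihood as the test statistic are all hypothetical choices. The statistic is computed from the whole particle set and not from a mean and covariance alone.

```python
import math
import random


def log_gauss(e, std):
    """Log-density of a zero-mean Gaussian with standard deviation std at e."""
    return -0.5 * (e / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))


def f(x):
    """Hypothetical non-linear state transition."""
    return 0.8 * x + math.sin(x)


def run(steps=40, fault_at=30, bias=10.0, n=300, q=0.3, r=0.5, seed=1):
    """Simulate the system with a sensor bias from step fault_at onward and
    return the predictive log-likelihood of each measurement under a
    bootstrap particle filter that knows only the fault-free model."""
    rng = random.Random(seed)
    x = 1.0
    parts = [rng.gauss(0.0, 1.0) for _ in range(n)]
    stats = []
    for k in range(steps):
        # True system and (possibly faulty) measurement.
        x = f(x) + rng.gauss(0.0, q)
        y = x + rng.gauss(0.0, r) + (bias if k >= fault_at else 0.0)
        # Propagate particles through the fault-free model.
        parts = [f(p) + rng.gauss(0.0, q) for p in parts]
        logw = [log_gauss(y - p, r) for p in parts]
        m = max(logw)
        w = [math.exp(lw - m) for lw in logw]     # shifted to avoid underflow
        stats.append(m + math.log(sum(w) / n))    # log of the mean likelihood
        parts = rng.choices(parts, weights=w, k=n)  # multinomial resampling
    return stats


stats = run()
```

A detector would compare each value in `stats` with a threshold chosen from fault-free runs; a sharp drop at the fault onset indicates that the measurement is no longer explained by the nominal model.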

18.

Integrated design of fault detection systems in time-frequency domain

Ye H. Ding S.X. Wang G. 《Automatic Control, IEEE Transactions on》2002,47(2):384-390

Problems related to the integrated design of robust fault detection (FD) systems are studied. First, it is revealed that due to the time window introduced to realize the 2-norm based evaluation function, an optimal design of a FD system with the 2-norm based evaluation function may not ensure the expected optimal performance when the system is realized in real applications. To solve this problem, an integrated design method of FD systems using the absolute value of residual signal as evaluation function is then proposed. It leads to a residual generator which is much easier to realize. Different from the usual 2-norm based approaches whose mathematical basis is the relationship between the energy of the output and input signals of a dynamic system, a relationship between the instant power of the output signal and the energy of the past input signal of a dynamic system is established and further used for FD system design. Another new kind of evaluation function based on the absolute value of wavelet transform of residual signal and the corresponding integrated design approach for FD systems are further proposed.
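The two kinds of residual evaluation function mentioned above can be compared on a toy residual sequence. This is only a sketch: the window length, the thresholds and the residual values are hypothetical, and the code does not reproduce the design method of the paper.

```python
import math


def eval_two_norm(residual, window):
    """2-norm of the residual over a sliding window of the last `window` samples."""
    out = []
    for k in range(len(residual)):
        seg = residual[max(0, k - window + 1):k + 1]
        out.append(math.sqrt(sum(r * r for r in seg)))
    return out


def eval_abs(residual):
    """Absolute value of the current residual sample."""
    return [abs(r) for r in residual]


def first_alarm(values, threshold):
    """Index of the first sample exceeding the threshold, or None."""
    for k, v in enumerate(values):
        if v > threshold:
            return k
    return None


# Small residual before the fault, large residual from sample 10 onward.
residual = [0.1] * 10 + [1.0] * 5
alarm_abs = first_alarm(eval_abs(residual), 0.5)
alarm_two_norm = first_alarm(eval_two_norm(residual, 4), 1.5)
```

With these hypothetical settings the absolute-value function raises an alarm at the fault onset, while the windowed 2-norm needs several faulty samples inside the window before it crosses its threshold.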

19.

Actuator fault detection and isolation in linear systems

MAKARAND S. PHATAK N. VISWANADHAM 《International journal of systems science》2013,44(12):2593-2603

A scheme for the detection and isolation of actuator faults in linear systems is proposed. A bank of unknown input observers is constructed to generate residual signals which will deviate in characteristic ways in the presence of actuator faults. Residual signals are unaffected by the unknown inputs acting on the system and this decreases the false alarm and miss probabilities. The results are illustrated through a simulation study of actuator fault detection and isolation in a pilot plant double-effect evaporator.

20.

Algebraic approaches for fault identification in discrete-event systems

Yingquan Wu Hadjicostis C.N. 《Automatic Control, IEEE Transactions on》2005,50(12):2048-2055

In this note, we develop algebraic approaches for fault identification in discrete-event systems that are described by Petri nets. We consider faults in both Petri net transitions and places, and assume that system events are not directly observable but that the system state is periodically observable. The particular methodology we explore incorporates redundancy into a given Petri net in a way that enables fault detection and identification to be performed efficiently using algebraic decoding techniques. The guiding principle in adding redundancy is to keep the number of additional Petri net places small while retaining enough information to be able to systematically detect and identify faults when the system state becomes available. The end result is a redundant Petri net embedding that uses 2k additional places and enables the simultaneous identification of 2k-1 transition faults and k place faults (that may occur at various instants during the operation of the Petri net). The proposed identification scheme has worst-case complexity of O(k(m+n)) operations where m and n are respectively the number of transitions and places in the given Petri net.
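A much simplified sketch of the redundancy idea follows. It is not the 2k-place construction of the note: it assumes a single place fault, two redundant places whose markings are fixed linear combinations (matrix `C`, hypothetical) of the original marking, and it ignores transition enabling rules. The incidence matrix `B` and all names are illustrative.

```python
def matvec(m, v):
    """Matrix-vector product for lists of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Incidence matrix of a 3-place, 3-transition cycle (rows: places, columns: transitions).
B = [[-1, 0, 1],
     [1, -1, 0],
     [0, 1, -1]]

# Redundant places hold C times the original marking; the columns of C are
# pairwise non-proportional so that a single place fault can be located.
C = [[1, 1, 1],
     [1, 2, 3]]


def embed(marking):
    """Marking of the redundant net: original places followed by redundant places."""
    return list(marking) + matvec(C, marking)


def fire(state, t):
    """Fire transition t in the redundant net (enabling is not checked here)."""
    col = [row[t] for row in B]
    delta = col + matvec(C, col)
    return [s + d for s, d in zip(state, delta)]


def syndrome(state):
    """Redundant markings minus C times the original markings; zero if fault-free."""
    n = len(B)
    return [r - c for r, c in zip(state[n:], matvec(C, state[:n]))]


def locate_place_fault(s):
    """For one faulty original place: return (place index, token error) or None."""
    if s == [0, 0]:
        return None
    error = -s[0]
    return (s[1] // s[0] - 1, error)


state = fire(embed([1, 0, 0]), 0)     # fault-free firing keeps the syndrome at zero
ok = syndrome(state)
state[2] += 1                         # one spurious token appears in place 2
diagnosis = locate_place_fault(syndrome(state))
```

The syndrome of a corrupted original place is a multiple of the corresponding column of `C`, which is why choosing distinguishable columns allows the faulty place and the size of the error to be read off algebraically.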