Similar literature
 Found 20 similar documents; search took 15 ms
1.
Reliability of compute-intensive applications can be improved by introducing fault tolerance into the system. Algorithm based fault tolerance (ABFT) is a low-cost scheme which provides the required fault tolerance to the system through system level encoding. In this paper, we propose randomized construction techniques, under an extended model, for the design of ABFT systems with the required fault tolerance capability. The model considers failures in the processors performing the checking operations

2.
Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at a low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. This short note proposes the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes. It is shown that the partitioned scheme provides scalable linear codes with improved numerical properties with only a small increase in hardware and time overhead.

3.
Tolerating faults and minimising energy consumption in embedded systems is a difficult task because the two objectives are antagonistic. In this paper, we propose a new approach based on graceful degradation to reduce jitter of battery life and thereby energy consumption in fault-tolerant embedded systems. In case of faults, the affected task is re-executed. In our solution, the energy level of the battery is periodically verified, and if we detect that continuing in the current operating mode would lead to jitter, the system gracefully degrades to an appropriate operating mode. In such a degraded mode, the dynamic voltage scaling technique is used to save energy. The effectiveness of graceful degradation depends on the application criticality level. Simulation results show that the use of graceful degradation can reduce jitter of battery life, and thereby can minimise energy consumption.

4.
Algorithm-based fault tolerance (ABFT) is a low-overhead system-level concurrent error detection and fault location scheme for multiprocessor systems. We present new methods for the design of ABFT systems. Our design procedure is applicable to a wide range of systems in which processors share data elements. A feature of our design approach is that the type of checks to be used in the final system can be controlled by the system designer. We also present some new bounds on the number of checks needed in ABFT system design.

5.
We introduce a unified checksum scheme for the LU decomposition, Gaussian elimination with pairwise pivoting, and the QR decomposition. The purpose is to detect and locate a transient error during a systolic array computation. We show how to represent the error as a rank-one perturbation to the original data, so that we need not worry when the error occurred. Finally, we perform a floating point error analysis to determine the effects of rounding errors on the checksum scheme.
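As a rough illustration of how a checksum can both detect and locate a single transient error, the sketch below uses the classic row/column checksum encoding for matrix multiplication. It is not the unified LU/QR scheme of the abstract above; the function names are hypothetical, and integer arithmetic is assumed so that the checks are exact. A single corrupted element is a rank-one perturbation of the product, which is why one failing row check and one failing column check are enough to pin it down.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_single_error(c_full):
    """Return (row, col, delta) for one corrupted data element, or None.

    c_full is the full checksum product: the last row and last column
    hold the checksums, the rest is the data part.
    """
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        delta = sum(c_full[i][:m]) - c_full[i][m]  # size of the error
        return i, j, delta
    raise ValueError("error pattern is not a single data element")


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[1][0] += 7  # inject a transient error into one data element
found = locate_single_error(c_full)
```

Subtracting `found[2]` from the located element restores the product. With floating-point data the equality tests would need a tolerance, which is the role of the rounding-error analysis mentioned in the abstract.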

6.
Algorithm-based fault tolerance has been proposed as a technique to detect incorrect computations in multiprocessor systems. In algorithm-based fault tolerance, processors produce data elements that are checked by concurrent error detection mechanisms. We investigate the efficacy of this approach for diagnosis of processor faults. Because checks are performed on data elements, the problem of location of data errors must first be solved. We propose a probabilistic model for the faults and errors in a multiprocessor system and use it to evaluate the probabilities of correct error location and fault diagnosis. We investigate the number of checks that are necessary to guarantee error location with high probability. We also give specific check assignments that accomplish this goal. We then consider the problem of fault diagnosis when the locations of erroneous data elements are known. Previous work on fault diagnosis required that the data sets produced by different processors be disjoint. We show, for the first time, that fault diagnosis is possible with high probability, even in systems where processors combine to produce individual data elements.

7.
Algorithm-based fault tolerance (ABFT) is a method for improving the reliability of parallel architectures used for computation-intensive tasks. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage, a system-level code is chosen to encode the data used in the algorithm. In the second stage, the optimal architecture to implement the scheme is chosen using dependence graphs. Dependence graphs are a graph-theoretic form of algorithm representation. The authors demonstrate that not all architectures are ideal for the implementation of a particular ABFT scheme. They propose new measures to characterize the fault tolerance capability of a system to better exploit the proposed synthesis method. Dependence graphs can also be used for the synthesis of ABFT schemes for non-linear problems. An example of a fault-tolerant median filter is provided to illustrate their utility for such problems.

8.
This paper investigates simultaneous estimations of states, system fault, and sensor fault for a class of Markovian jump systems, resorting to the design of a robust observer, in which both the external disturbance and actuator degradation are taken into consideration. The loss‐of‐effectiveness, stuck, and outage fault cases are involved, and the considered Markovian jump system is assumed to possess generally uncertain transition rates, which can generalize the traditional bounded uncertain transition rates and partially unknown transition rates. In particular, an augmented system whose states cover the original states, system fault, and sensor fault is constructed. Then, a novel adaptive observer is presented with time‐varying gain matrices and parameter accommodation being injected to deal with the disturbance and actuator degradation. Finally, a practical example is shown to demonstrate the high efficiency of the proposed method as well as its advantages over some existing results.

9.
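Parameter estimation of chaotic systems based on hybrid quantum evolutionary computation   Total citations: 1, self-citations: 0, citations by others: 1
Ren Ziwu, Xiong Rong. Control Theory & Applications, 2010, 27(11): 1448-1454
Parameter estimation for chaotic systems is essentially a multi-dimensional parameter optimization problem. To estimate the unknown parameters of a chaotic system accurately, this paper proposes a hybrid quantum evolutionary algorithm (HQEA) for solving this optimization problem. The method represents chromosomes in real-valued quantum-angle form and uses the probabilities of the quantum bits as the current position information of an individual. A quantum differential evolution algorithm (QDE), in which differential evolution updates the quantum position states, is proposed and combined with a real-coded quantum evolutionary algorithm (RQEA) so that the algorithm balances global exploration and local exploitation of the solution space. The algorithm also introduces a quantum NOT-gate operator that transforms quantum bits of the current best individual selected with a certain probability, strengthening the algorithm's ability to escape local optima. Benchmark function tests show that the hybrid algorithm's global search ability and reliability are greatly improved. Numerical simulation on the Lorenz chaotic system demonstrates the effectiveness of the hybrid algorithm.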

10.
In its simplest form, cloud computing is a massive collection of connected servers residing in a datacenter and continuously changing to provide services to users on demand through a front-end interface. Task failure during execution is no longer an accident but a frequent attribute of scheduling systems in a large-scale distributed environment. Recently, computational intelligence techniques have mostly been used to solve scheduling problems in the cloud environment, but only a few emphasize fault tolerance. This paper puts forward a Checkpointed League Championship Algorithm (CPLCA) scheduling scheme for cloud computing systems. It is a fault-tolerance-aware task scheduling mechanism that uses a checkpointing strategy in addition to task migration against unexpected independent task execution failures. The simulation results show that the proposed CPLCA scheme improves the total average makespan by 41%, 33% and 23% compared with Ant Colony Optimization (ACO), Genetic Algorithm (GA) and the basic league championship algorithm (LCA), respectively. For the total average response time, the CPLCA scheme produces an improvement of 54%, 57% and 30% compared with ACO, GA and LCA, respectively. It also yields a significant decrease in job execution failures as measured by failure metrics and performance improvement rate. From the results obtained, CPLCA provides an improvement in both task scheduling performance and failure awareness that is more appropriate for scheduling in the cloud computing model.
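To make the checkpointing idea concrete, here is a minimal sketch of checkpoint-and-rollback for a single task. It is a generic illustration, not the CPLCA scheduler: the function name, the running-sum workload, and the failure model (a failure hits at a given attempt number and wastes that attempt) are all assumptions made for the example.

```python
def run_with_checkpoints(steps, interval, fail_at=()):
    """Sum the values in `steps`, saving a checkpoint every `interval` steps.

    fail_at holds attempt numbers (1-based) at which a simulated failure
    occurs; this failure model is hypothetical. On failure the task rolls
    back to the most recent checkpoint instead of restarting from scratch.
    Returns (final state, number of attempts spent).
    """
    pending_failures = set(fail_at)
    checkpoint = (0, 0)          # (index of next step, accumulated state)
    i, state = checkpoint
    attempts = 0
    while i < len(steps):
        attempts += 1
        if attempts in pending_failures:
            pending_failures.discard(attempts)
            i, state = checkpoint  # roll back; work since checkpoint is lost
            continue
        state += steps[i]
        i += 1
        if i % interval == 0:
            checkpoint = (i, state)  # persist progress
    return state, attempts


work = list(range(1, 11))
total, attempts = run_with_checkpoints(work, interval=3, fail_at={5})
```

With a checkpoint every 3 steps, the failure at attempt 5 costs only the work done since step 3. Setting `interval` larger than the task length disables checkpointing, so the same failure forces a restart from the beginning and more attempts are spent.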

11.
This paper reports on a new back propagation (BP) neural network based on an improved shuffled frog leaping algorithm (ISFLA) and its application in bearing fault diagnosis. The ISFLA is developed on the basis of a chaotic operator and the convergence factor of particle swarm optimization to overcome the shortcomings of conventional shuffled frog leaping algorithm (SFLA). Testing results show that the proposed algorithm can effectively improve the solution accuracy and convergence properties and exhibits an excellent ability of global optimization in high-dimensional space. The presented ISFLA is then employed to optimize the weights and threshold values of BP neural network. An ISFLA-BP network model is established for the early fault diagnosis of rolling bearings. The proposed ISFLA-BP scheme has been compared with BP and SFLA-BP networks through experimental studies. Results indicate that the developed new model demonstrates better generalization capability and stronger robustness. It is able to effectively improve the efficiency of network training and the accuracy of early fault pattern recognition in bearing fault diagnosis tasks.

12.

A quantum-inspired hybrid scheduling technique is proposed for multi-processor computing systems. The proposed algorithm is a hybridization of principles of quantum mechanics (QM) and a nature-inspired intelligence technique, the gravitational search algorithm (GSA). QM principles such as the quantum bit, superposition and the rotation gate help to design an efficient agent representation, while the intense exploration capability of GSA improves the convergence rate. The fitness function used to evaluate agents is designed to minimize makespan, balance loads adequately and properly utilize the deployed resources. Several standard benchmarks as well as synthetic data sets are used to analyze and validate the work. The performance of the proposed algorithm is compared with recently designed algorithms such as the quantum genetic algorithm, particle swarm optimization-based multi-criteria scheduling, Improved-GA, GSA and Cloudy-GSA. The significance of the algorithm is tested using a hypothesis analysis of variance.


13.
Transient fault tolerance in digital systems   Total citations: 1, self-citations: 0, citations by others: 1
Sosnowski J. Micro, IEEE, 1994, 14(1): 24-35
It is hard to shield systems effectively from transient faults (fault avoidance techniques). So some other means must be employed to assure appropriate levels of transient fault tolerance (insensitivity to transient faults). They are based on fault-masking and fault recovery ideas. Having analyzed this problem, the author identifies critical design points and outlines some practical solutions that refer to efficient on-line detectors (detecting errors during the system operation) and error handling procedures. This framework provides a basis for understanding transient fault problems in digital systems. It can be helpful in selecting optimum techniques to mask or eliminate transient fault effects in developed systems.

14.
Information Sciences, 1987, 42(3): 255-282
The paper proposes a technique for providing software fault tolerance in real-time applications demanding fast response and a high degree of reliability. It is shown that a reasonably flexible interprocess communication can be supported with only a small increase in complexity and overhead. The two most prominent features of the proposed scheme are (1) it attempts to exploit fault-avoidance techniques as much as possible to reduce the overhead of fault tolerance and (2) it controls the propagation of errors so as to enable efficient recovery. Formal proofs of the system operation are developed. Besides showing that the scheme works as expected, the arguments serve to highlight the assumptions needed for provably correct operation. Some issues relating to hardware fault tolerance are also considered.

15.
This paper proposes a modelling approach suitable for formalizing fault tolerant systems, taking into account different fault scenarios. Verification of the properties of such systems is then performed using model checking. A general framework for the formal specification and verification of fault tolerant systems is defined starting from these principles, and experience with its application to two case studies is then presented. Copyright © 2002 John Wiley & Sons, Ltd.

16.
A multi-agent system (MAS) is a distributed system that consists of multiple agents working together to solve mutual problems. Even though MASs are well suited for the development of complex distributed systems, the number of real-world usages is still small. One of the main reasons for this is that MASs are very fragile. In a typical, large-scale MAS, the rate of failure grows with the number of hosts, the number of deployed agents, and the duration of the agent’s task execution. For this reason, numerous approaches have been introduced to deal with aspects of failure handling. However, the absence of centralized control and a large number of individual intelligent components makes it difficult to detect and treat errors. The risk of uncontrollable fault propagation is high and can seriously impact on system performance. There are two important factors that limit the usage of MASs: (1) existing fault tolerance (FT) approaches are not generic, as they focus on and improve specific issues of FT; and (2) despite the plethora of available FT approaches and theories, there is a remarkable lack of general metrics, tools, benchmarks, and experimental methods for formal validation and comparison of existing or newly developed FT approaches. As FT approaches in MASs become a well-established field, the need for generalized, standardized evaluation of FT approaches emerges as imperative. In this paper, we first present a detailed overview of existing FT solutions, approaches, and techniques in agent platform hosted MASs. From that overview, we derive the commonalities in existing research. Next, we present the main contribution of our paper: an evaluation methodology, with a set of metrics, for comparing FT approaches in MASs. We adopt an engineering perspective on the problem, defining a methodology and metrics that are both implementation- and domain-independent. The metrics are formalized with an acyclic directed graph. 
By using our methodology, evaluators can select an appropriate FT approach for targeted MAS application, thus improving MAS usability, stability, and development speed. In order to show the viability of our approach, a case study that compares two FT approaches for a targeted MAS is presented. The case study results show that our methodology can be used for selecting an appropriate FT approach for the targeted MAS.

17.
Optimizing fault tolerance in embedded distributed systems   Total citations: 1, self-citations: 0, citations by others: 1
Draber S. Micro, IEEE, 2000, 20(4): 76-84
This article considers a gas-insulated switchgear (GIS) station. Here, all switches are located in SF6 gas volumes, offering a much better isolation and spark-extinguishing property than air, and thus leading to a far more compact size. All volumes have one or two sensors for monitoring gas density. The station's distributed computing system is arranged in several levels. PISAs, embedded computing systems for preprocessing the data from digital/analog converters and for writing them on the process bus (PB), link to the sensor and actuator hardware. PISAs in the switches also take the actuator commands from the process bus and carry out the switching actions.

18.
We consider the design problem of methods of full decoupling from fault-generated actions in systems described by a nonlinear dynamical model, where by a fault we mean inadmissible deviations of the parameters of the system from their nominal values. The solution of the problem depends on the formation of a new control law taking into account the behavior of a system with faults. To solve this problem we propose using the logic-dynamic approach, in which the original model of the system is subject to certain transformations, which enables solving nonlinear problems by linear methods. Another feature of this approach is the possibility of dealing with nondifferentiable nonlinearities, such as saturation, backlash, and hysteresis. The solution proposed in this paper is based on the full decoupling from fault-generated actions, which requires the construction of an auxiliary system for the synthesis of a feedback which provides for the required decoupling. Implementation of this method will be instrumental in enhancing the efficiency of fault accommodation in the part of the reduction of computational and time costs when implementing the new control law.

19.
Neural-network-based robust fault diagnosis in robotic systems   Total citations: 7, self-citations: 0, citations by others: 7
Fault diagnosis plays an important role in the operation of modern robotic systems. A number of researchers have proposed fault diagnosis architectures for robotic manipulators using the model-based analytical redundancy approach. One of the key issues in the design of such fault diagnosis schemes is the effect of modeling uncertainties on their performance. This paper investigates the problem of fault diagnosis in rigid-link robotic manipulators with modeling uncertainties. A learning architecture with sigmoidal neural networks is used to monitor the robotic system for any off-nominal behavior due to faults. The robustness and stability properties of the fault diagnosis scheme are rigorously established. Simulation examples are presented to illustrate the ability of the neural-network-based robust fault diagnosis scheme to detect and accommodate faults in a two-link robotic manipulator.

20.
Models for fault tolerance in manufacturing systems   Total citations: 1, self-citations: 0, citations by others: 1
The field of fault tolerance in computer science and engineering has been thoroughly investigated over a long period of time. A great number of different approaches have been presented on means for improving fault tolerance under certain error conditions in computerized systems. One important area that has introduced computers in order to enhance productivity, flexibility and economy is manufacturing systems, leading to computer-integrated manufacturing (CIM). Using computers in a manufacturing system introduces new sources of difficulties, as well as providing new possibilities for overcoming erroneous situations that might disturb production. The aim of this paper is to describe how the use of different configurations for a manufacturing system can improve fault tolerance. One specific erroneous situation which may occur in CIM is the partitioning of a network. This situation can be handled satisfactorily by using the suggested manufacturing system configurations. Additional improvements to fault tolerance can be achieved through the introduction of data buffers and material buffers. This approach is described in this paper.
