Similar Literature
1.
The AQuA architecture provides adaptive fault tolerance to CORBA applications by replicating objects and providing a high-level method that an application can use to specify its desired level of dependability. This paper presents the algorithms that AQuA uses, when an application's dependability requirements can change at runtime, to tolerate both value faults in applications and crash failures simultaneously. In particular, we provide an active replication communication scheme that maintains data consistency among replicas, detects crash failures, collates the messages generated by replicated objects, and delivers the result of each vote. We also present an adaptive majority voting algorithm that enables correct ongoing voting while both the number of replicas and the majority size change dynamically. Together, these two algorithms form the basis of the mechanism for tolerating and recovering from value faults and crash failures in AQuA.
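The adaptive voting idea in this abstract can be illustrated with a minimal sketch (an assumption-laden simplification, not AQuA's actual implementation): the majority threshold is recomputed from the number of replies currently received, so the vote stays correct as replicas crash or are added.

```python
from collections import Counter

def majority_vote(replies):
    """Vote over the replies received from the currently live replicas.

    The majority threshold is derived from the number of replies, so the
    vote adapts as the replica group shrinks (crashes) or grows.
    """
    if not replies:
        return None
    threshold = len(replies) // 2 + 1                 # simple majority of live replicas
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= threshold else None      # no majority -> no result

# Value fault: one of five replicas returns a corrupted result.
assert majority_vote([42, 42, 42, 99, 42]) == 42
# Crash failures: only three replicas reply; the majority size adapts to 2.
assert majority_vote([42, 99, 42]) == 42
```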

2.
Theoretical Computer Science, 2003, 290(2):1223–1251
Dependability is a qualitative term referring to a system's ability to meet its service requirements in the presence of faults. The types and number of faults covered by a system play a primary role in determining the level of dependability that the system can potentially provide. Given the variety and multiplicity of fault types, system algorithm design often focuses on specific fault types to simplify the design process, resulting in either over-optimistic (all faults permanent) or over-pessimistic (all faults malicious) dependable system designs. A more practical and realistic approach is to recognize that faults of varied severity levels and differing occurrence probabilities may appear in combinations rather than as the assumed single fault type. The ability to let the user select or customize a particular combination of fault types of varied severity characterizes the proposed customizable fault/error model (CFEM). The CFEM organizes diverse fault categories into a cohesive framework by classifying faults according to the effect they have on the required system services rather than by targeting the source of the fault condition. In this paper, we develop (a) the complete framework for CFEM fault classification, (b) the voting functions applicable under the CFEM, and (c) the fundamental distributed services of consensus and convergence under the CFEM, on which dependable distributed functionality can be supported.

3.
Fault-tolerant scheduling is an imperative step for large-scale computational Grid systems, as geographically distributed nodes often cooperate to execute a task. By and large, the primary-backup approach is a common methodology for fault tolerance, wherein each task has a primary and a backup on two different processors. In this paper, we address the problem of scheduling DAGs in Grids with communication delays so that service failures can be avoided in the presence of processor faults. The challenge is that, as tasks in a DAG depend on each other, a task must be scheduled so that it will succeed when any of its predecessors fails due to a processor failure. We first propose a communication model and determine when communications between a backup and the backups of its successors are necessary. Then we determine when a backup can start, and its eligible processors, so as to guarantee that every DAG can complete upon any processor failure. We develop two algorithms to schedule backups, which minimize response time and replication cost, respectively. We also develop a suboptimal algorithm that targets minimizing replication cost without affecting response time. We conduct extensive simulation experiments to quantify the performance of the proposed algorithms.
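The core placement constraint of the primary-backup approach can be sketched in a few lines (the `Processor` class and the earliest-ready cost model below are illustrative assumptions, not the paper's scheduling algorithms):

```python
from dataclasses import dataclass

@dataclass
class Processor:
    name: str
    ready_time: float  # earliest time this processor is free (hypothetical cost model)

def place_backup(primary: Processor, processors: list) -> Processor:
    """Place a task's backup on a processor other than its primary's,
    so that a single processor failure cannot lose both copies."""
    candidates = [p for p in processors if p is not primary]
    return min(candidates, key=lambda p: p.ready_time)

procs = [Processor("P0", 0.0), Processor("P1", 2.0), Processor("P2", 1.0)]
backup = place_backup(procs[0], procs)
assert backup.name == "P2"   # earliest-ready processor that is not the primary
```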

4.
The dependability of current and future nanoscale technologies depends highly on the ability of the testing process to detect emerging defects that cannot be modeled traditionally. Generating test sets that detect each fault more than once has been shown to increase the effectiveness of a test set in detecting non-modeled faults, whether static or dynamic. Traditional n-detect test sets guarantee to detect a modeled fault with at least n different tests. Recent techniques examine how to quantify and maximize the difference between the various tests for a fault. The proposed methodology introduces a new systematic test generation algorithm for multiple-detect (including n-detect) test sets that increases the diversity of the fault propagation paths excited by the various tests per fault. A novel algorithm tries to identify a different propagating path (if such a path exists) for each of the multiple (n) detections of the same fault. The proposed method can be applied, for multiple fault detections, to any static or dynamic fault model that is linear in the circuit size, such as the stuck-at or transition delay fault models, and it avoids any path or path-segment enumeration. Experimental results show the effectiveness of the approach in increasing the number of fault-propagating paths when compared to traditional n-detect test sets.

5.
Mobile nodes in some challenging network scenarios, e.g. battlefield and disaster recovery scenarios, suffer from intermittent connectivity and frequent partitions. Disruption Tolerant Network (DTN) technologies are designed to enable communications in such environments. Several DTN routing schemes have been proposed. However, not much work has been done on designing schemes that provide efficient information access in such challenging network scenarios. In this paper, we explore how a content-based information retrieval system can be designed for DTNs. There are three important design issues, namely (a) how data should be replicated and stored at multiple nodes, (b) how a query is disseminated in sparsely connected networks, and (c) how a query response is routed back to the issuing node. We first describe how to select nodes for storing the replicated copies of data items. We consider the random and the intelligent caching schemes. In the random caching scheme, nodes that are encountered first by a data-generating node are selected to cache the extra copies while in the intelligent caching scheme, nodes that can potentially meet more nodes, e.g. faster nodes, are selected to cache the extra data copies. The number of replicated data copies K can be the same for all data items or varied depending on the access frequencies of the data items. In this work, we consider fixed, proportional and square-root replication schemes. Then, we describe two query dissemination schemes: (a) W-copy Selective Query Spraying (WSS) scheme and (b) L-hop Neighborhood Spraying (LNS) scheme. In the WSS scheme, nodes that can move faster are selected to cache the queries while in the LNS scheme, nodes that are within L-hops of a querying node will cache the queries. For message routing, we use an enhanced Prophet scheme where a next-hop node is selected only if its predicted delivery probability to the destination is higher than a certain threshold. 
We conduct extensive simulation studies to evaluate different combinations of the replication and query dissemination algorithms. Our results reveal that the best-performing scheme is the one that uses the WSS scheme combined with binary spread of replicated data copies. The WSS scheme achieves a higher query success ratio than a scheme that does not use any data or query replication. Furthermore, the square-root and proportional replication schemes provide a higher query success ratio than the fixed-copy approach with varying node density. In addition, the intelligent caching approach can further improve the query success ratio by 5.3–15.8% with varying node density. Our results using different mobility models reveal that the query success ratio degrades by at most 7.3% when the Community-Based model is used compared to the Random Waypoint (RWP) model [J. Broch et al., A Performance Comparison of Multihop Wireless Ad Hoc Network Routing Protocols, ACM Mobicom, 1998, pp. 85–97]. Compared to the RWP and Community-Based mobility models, the UmassBusNet model from the DieselNet project [X. Zhang et al., Modeling of a Bus-based Disruption Tolerant Network Trace, Proceedings of ACM Mobihoc, 2007] achieves a much lower query success ratio because of the longer inter-node encounter time.
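The fixed, proportional, and square-root replication families compared above can be sketched as follows (the budget model and rounding below are illustrative assumptions, not the paper's exact formulation):

```python
import math

def allocate(freqs, budget, scheme="square_root"):
    """Split a replication budget across data items by access frequency.

    fixed:        every item gets the same weight
    proportional: copies ~ access frequency
    square_root:  copies ~ sqrt(access frequency)
    """
    if scheme == "fixed":
        weights = [1.0] * len(freqs)
    elif scheme == "proportional":
        weights = [float(f) for f in freqs]
    elif scheme == "square_root":
        weights = [math.sqrt(f) for f in freqs]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(weights)
    # Round to whole copies; every item keeps at least one copy.
    return [max(1, round(budget * w / total)) for w in weights]

# A hot item (freq 81) vs. a cold one (freq 1), with a 20-copy budget:
prop = allocate([81, 1], 20, "proportional")   # hot item dominates the budget
sqrt_ = allocate([81, 1], 20, "square_root")   # cold item keeps more copies
assert prop[1] < sqrt_[1]
```

Square-root allocation is a well-known compromise between fixed and proportional replication: it keeps unpopular items findable without starving them of copies.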

6.
It is well known that star graphs are strongly resilient, like the n-cubes, in the sense that they are optimally fault tolerant and the fault diameter increases only by one in the presence of the maximum number of allowable faults. We investigate star graphs under the condition of forbidden faulty sets, where all the neighbors of any node cannot be faulty simultaneously; we show that under this condition star graphs can tolerate up to (2n − 5) faulty nodes, and that the fault diameter increases only by 2 in the worst case in the presence of the maximum number of faults. Thus, star graphs enjoy a property of strong resilience under forbidden faulty sets similar to that of the n-cubes. We have developed algorithms to trace the vertex-disjoint paths under different conditions.
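The figures quoted above can be tabulated directly (node count and degree are standard properties of the n-star graph; the plain tolerance n − 2 follows from its (n − 1)-connectivity, and 2n − 5 is the forbidden-faulty-set bound stated in the abstract):

```python
import math

def star_graph_stats(n):
    """Basic fault-tolerance figures for the n-star graph S_n."""
    return {
        "nodes": math.factorial(n),          # S_n has n! vertices
        "degree": n - 1,                     # each vertex has n-1 neighbors
        "plain_tolerance": n - 2,            # from (n-1)-connectivity
        "forbidden_set_tolerance": 2 * n - 5,  # bound under forbidden faulty sets
    }

s5 = star_graph_stats(5)
assert s5["nodes"] == 120 and s5["forbidden_set_tolerance"] == 5
```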

7.
Validation of the dependability of distributed systems via fault injection is gaining importance because distributed systems are increasingly used in environments with high dependability requirements. The fact that distributed systems can fail in subtle ways that depend on the state of multiple parts of the system suggests that a global-state-based fault injection mechanism should be used to validate them. However, global-state-based fault injection is challenging, since it is very difficult in practice to maintain the global state of a distributed system at runtime with minimal intrusion into the system's execution. We present Loki, a global-state-based fault injector designed with the goals of low intrusion, high precision, and high flexibility. Loki achieves these goals by utilizing the ideas of a partial view of global state, optimistic synchronization, and offline analysis. In Loki, faults are injected based on a partial view of the global state of the system, and a post-runtime analysis is performed to place events and injections on a single global timeline and to discard experiments with incorrect fault injections. Finally, the experiments with correct fault injections are used to estimate user-specified performance and dependability measures. A flexible measure language has been designed that facilitates the specification of a wide range of measures.

8.
The objective of fault-tolerant control (FTC) is to minimise the effect of faults on system performance (stability, trajectory tracking, etc.). However, the majority of existing FTC methods continue to force the system to follow the pre-fault trajectories without considering the reduction in available control resources caused by actuator faults. Forcing the system to follow the same trajectories as before fault occurrence may result in actuator saturation and system instability. Thus, pre-fault objectives should be redefined as a function of the remaining resources to avoid potential saturation. The main contribution of this paper is a flatness-based trajectory planning/re-planning method that can be combined with any active FTC approach. The work considers the case of over-actuated systems, where a new idea is proposed to evaluate the severity of the faults that have occurred. In addition, the trajectory planning/re-planning approach is formulated as an optimisation problem based on the analysis of the attainable-efforts domain in the fault-free and faulty cases. The proposed approach is applied to two satellite systems in a rendezvous mission.

9.
This paper proposes a novel scheme, named ER-TCP, which transparently masks failures occurring on the server nodes of a cluster from clients, at TCP-connection granularity. In this scheme, TCP connections at the server side are actively and fully replicated to maintain consistency, so that they can be migrated to healthy nodes during a failure. A log mechanism is designed to cooperate with the replication; it keeps the communication-performance penalty small and lets the scheme scale beyond a few nodes, even when the nodes have different processing capacities. We built a prototype system on a four-node cluster with ER-TCP and conducted a series of experiments on it. The experimental results show that ER-TCP imposes a relatively small penalty on communication performance, especially when it is used to synchronize multiple replicas. Results from real applications show that ER-TCP incurs only a small performance penalty on a web server under light load, and that it can be used to distribute files efficiently and reliably.

10.
Fault Diagnosis of Nonlinear Dynamic Systems Based on an Adaptive Unknown-Input Observer
Previous fault-diagnosis research has the shortcomings of requiring known upper bounds on the faults, their derivatives, or the system disturbances, and of being unable to diagnose actuator faults and sensor faults at the same time. To address these problems, an adaptive unknown-input fault-diagnosis observer is proposed that can simultaneously reconstruct the actuator faults and sensor faults of a nonlinear dynamic system. First, an H∞ performance index is used to suppress the influence of unknown inputs on fault reconstruction, and a Lyapunov functional is employed to establish the stability of the observation-error dynamics. Then, the observer gain matrix is obtained by solving linear matrix inequalities, and fault reconstruction is carried out. Finally, simulation of a DC motor system verifies the effectiveness of the proposed method.

11.
High-performance computing can be well supported by Grid or cloud computing systems. However, these systems must cope with failure risks: data is stored on "unreliable" storage nodes that can leave the system at any moment, and the nodes' network bandwidth is limited. In this setting, the basic way to assure data reliability is to add redundancy using either replication or erasure codes. Compared to replication, erasure codes are more space-efficient. Erasure codes break data into blocks, encode these blocks, and distribute them to different storage nodes. When storage nodes permanently or temporarily abandon the system, new redundant blocks must be created to guarantee data reliability, which is referred to as repair. Later, when the churned nodes rejoin the system, the blocks stored on them can reintegrate the data group to enhance data reliability. For "classical" erasure codes, generating a new block requires transmitting k blocks over the network, which causes heavy repair traffic, high computational complexity, and a high failure probability for the repair process. A near-optimal erasure code named Hierarchical Codes has therefore been proposed; it significantly reduces the repair traffic by reducing the number of nodes participating in the repair process, referred to as the repair degree d. To overcome the complexity of reintegration and provide adaptive reliability for Hierarchical Codes, we refine two concepts, called location and relocation, and then propose an integrated maintenance scheme for the repair process. Our experiments show that Hierarchical Codes are the most robust redundancy scheme for the repair process compared to other well-known coding schemes.
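The repair-traffic saving from a smaller repair degree reduces to a simple calculation (the block size and the degrees below are made-up example values, not figures from the paper):

```python
def repair_traffic(block_size, repair_degree):
    """Bytes transferred over the network to rebuild one lost block.

    A classical (n, k) MDS erasure code must fetch k full blocks
    (repair_degree = k); Hierarchical Codes repair within a smaller
    group, i.e. with a repair degree d < k.
    """
    return block_size * repair_degree

block = 64 * 2**20                                      # 64 MiB per block
classical = repair_traffic(block, repair_degree=8)      # k = 8 (example)
hierarchical = repair_traffic(block, repair_degree=4)   # d = 4 (example group size)
assert hierarchical == classical // 2                   # half the repair traffic
```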

12.
User-perceived dependability and performance metrics are very different from conventional ones in that the dependability and performance properties must be assessed from the perspective of users accessing the system. In this paper, we develop techniques based on stochastic Petri nets (SPNs) to analyze user-perceived dependability and performance properties of quorum-based algorithms for managing replicated data. A feature of the techniques developed in this paper is that no assumption is made regarding the interconnection topology, the number of replicas, or the quorum definition used by the replicated system, making them applicable to a wide class of quorum-based algorithms. We illustrate the technique by comparing conventional and user-perceived metrics in majority voting algorithms. Our analysis shows that when user perception is taken into consideration, the effect of increasing the network connectivity and the number of replicas on the availability and dependability properties perceived by users is very different from that under conventional metrics. Thus, unlike conventional metrics, user-perceived metrics allow a tradeoff to be exploited between the hardware invested, i.e., higher network connectivity and more replicas, and the performance and dependability properties perceived by users.
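The majority-quorum rule underlying the voting algorithms analyzed above can be stated in one line: an operation may proceed only if a strict majority of replicas is reachable, which guarantees that any two quorums intersect and replicated data stays consistent.

```python
def has_quorum(up_replicas, total_replicas):
    """Majority quorum: proceed only if more than half of all replicas
    are reachable.  Any two majorities of the same replica set share at
    least one replica, so conflicting updates cannot both commit."""
    return up_replicas > total_replicas // 2

assert has_quorum(3, 5)        # 3 of 5 replicas reachable: quorum holds
assert not has_quorum(2, 5)    # 2 of 5: no quorum, the operation blocks
```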

13.
Supervisory control that ensures deadlock-free and nonblocking operation has been an active research area in manufacturing engineering. So far, most deadlock control policies in the existing literature assume that allocated resources are reliable. Additionally, a large number of methods are for systems of simple sequential processes with resources (S3PRs), where a part uses only one copy of one resource at each processing step. In contrast, we investigate automated manufacturing systems (AMSs) that can be modeled by a class of Petri nets, namely S*PUR. S*PUR is a generalization of the S*PR Petri net model, while S*PR is a superclass of S3PR. This work addresses robust supervision for deadlock avoidance in S*PUR. Specifically, we take into account unreliable resources that may break down while working or while idle, and the considered AMSs allow the use of multiple copies of different resources per operation stage. Our objective is to control the system so that: 1) when there are breakdowns, the system can continue producing parts of those types whose production does not need any failed resource; and 2) once all faults are corrected, it is possible to complete all the ongoing part instances remaining in the system. We illustrate the characteristics of a desired supervisor through several examples, define the corresponding properties of robustness, and develop a control policy that satisfies these properties.

14.
A key issue that needs to be addressed when performing fault diagnosis using black-box models is robustness against abrupt changes in unknown inputs. A fundamental difficulty with the robust FDI design approaches available in the literature is that they require some a priori knowledge of the model for unmeasured disturbances or modeling uncertainty. In this work, we propose a novel approach for modeling abrupt changes in unmeasured disturbances when the innovation form of the state-space model (i.e., a black-box observer) is used for fault diagnosis. A disturbance coupling matrix is developed using singular value decomposition of the extended observability matrix and is further used to formulate a robust fault diagnosis scheme based on the generalized likelihood ratio test. The proposed modeling approach does not require any a priori knowledge of how these faults affect the system dynamics. To isolate sensor and actuator biases from step jumps in unmeasured disturbances, a statistically rigorous method is developed for distinguishing between faults modeled using different numbers of parameters. Simulation studies on a heavy-oil fractionator example show that the proposed FDI methodology based on identified models can effectively distinguish between sensor biases, actuator biases, and other soft faults caused by changes in unmeasured disturbance variables. The fault-tolerant control scheme, which makes use of the proposed robust FDI methodology, gives significantly better control performance than conventional controllers when soft faults occur. Experimental evaluation of the proposed FDI methodology on a laboratory-scale stirred-tank temperature control setup corroborates these conclusions.

15.
Synchronous clocks are an essential requirement for a variety of distributed system applications. Many of these applications are safety-critical and require fault tolerance. In this paper, a new "Sliding Window" clock synchronization algorithm is presented. It offers two significant advantages. First, it can tolerate considerably higher percentages of faults than any known algorithm. In addition, it achieves clock synchronization tightness that is tighter than or as tight as that of other algorithms. A comprehensive simulation environment is used to evaluate and compare the Sliding Window Algorithm with other clock synchronization algorithms. A quantitative evaluation using this environment outlines the achievable tightness under different conditions and shows that the Sliding Window Algorithm is capable of tolerating more than 50% of the nodes being faulty at any time, as well as short fault bursts that affect all nodes. The evaluation also shows that our algorithm synchronizes up to 38% tighter than other algorithms. Finally, it is proven that the algorithm guarantees synchronization in an n-node system even if the number of Byzantine faults is as high as n/4.
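A heavily simplified sketch of a sliding-window style convergence step (an illustration of the general idea, not the paper's exact algorithm): slide a fixed-width window over the sorted clock readings, keep the placement covering the most readings, and average those readings, so outliers from faulty clocks fall outside the densest window.

```python
def sliding_window_estimate(readings, width):
    """Estimate the reference time from one round of clock readings by
    averaging the densest fixed-width window of the sorted readings.
    Readings from faulty clocks land outside the densest window and are
    therefore excluded from the average."""
    readings = sorted(readings)
    best = []
    for lo in readings:
        window = [r for r in readings if lo <= r <= lo + width]
        if len(window) > len(best):
            best = window
    return sum(best) / len(best)

# Four good clocks near 100, two Byzantine outliers:
est = sliding_window_estimate([99.8, 100.0, 100.1, 100.3, 250.0, -50.0], width=1.0)
assert 99.8 <= est <= 100.3   # the outliers are masked
```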

16.
The occurrence of faults in multicomputers with hundreds or thousands of nodes is a likely event that can be dealt with by hardware or software fault-tolerant approaches. This paper presents a unifying model that describes software reconfiguration strategies for parallel applications with a regular computational pattern. We show that most existing strategies can be obtained as instances of the proposed threshold-based reconfiguration meta-algorithm. Moreover, this approach is useful for discovering several as-yet-unexplored strategies, among which we consider the class of adaptive threshold-based strategies. The performance optimization analysis demonstrates that these strategies, applied to data-parallel regular computations, give optimal results for worst-case fault patterns. A wide spectrum of simulations, where the system parameters are set to those of actual multicomputers, confirms that adaptive threshold-based strategies yield the most stable performance for a variety of workloads, independently of the number and pattern of faults.

17.
Fault-tolerant computing: fundamental concepts
Nelson, V.P. Computer, 1990, 23(7):19–25
The basic concepts of fault-tolerant computing are reviewed, focusing on hardware. Failures, faults, and errors in digital systems are examined, and measures of dependability, which dictate and evaluate fault-tolerance strategies for different classes of applications, are defined. The elements of fault-tolerance strategies are identified, and various strategies are reviewed: error detection, masking, and correction; error detection and correction codes; self-checking logic; module replication for error detection and masking; protocol and timing checks; fault containment; reconfiguration and repair; and system recovery.
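Module replication with masking, one of the strategies listed above, is classically realized as triple modular redundancy (TMR), where three module outputs pass through a bit-wise majority voter:

```python
def tmr(a, b, c):
    """Triple modular redundancy voter: the bit-wise majority of three
    module outputs.  Any single module's error is masked, because each
    output bit is the value produced by at least two of the modules."""
    return (a & b) | (a & c) | (b & c)

# One of the three modules produces a corrupted word; the voter masks it.
assert tmr(0b1010, 0b1010, 0b0110) == 0b1010
```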

18.
19.
This paper introduces a novel diagnosis approach, using game theory, to solve the comparison-based system-level fault identification problem in distributed and parallel systems under the asymmetric comparison model. Under this diagnosis model, tasks are assigned to pairs of nodes and the results of executing these tasks are compared. Using the agreements and disagreements among the nodes' outputs, i.e. the input syndrome, the fault diagnosis algorithm identifies the fault status of the system's nodes, under the assumption that at most t of these nodes can permanently fail simultaneously. Since the introduction of the comparison model, significant progress has been made in both the theory and practice associated with the original model and its offshoots. Nevertheless, efficiently identifying the set of faulty nodes when not all comparison outcomes are available to the fault identification algorithm prior to initiating the diagnosis phase, i.e. with partial syndromes, remains an outstanding research issue. In this paper, we first show how game theory can be adapted to solve the fault diagnosis problem by maximising the payoffs of all players (nodes). We then demonstrate, using results from a thorough simulation, the effectiveness of this approach in solving the fault identification problem using partial syndromes from randomly generated diagnosable systems of different sizes and under various fault scenarios. We have considered large diagnosable systems and have experimented with extreme fault situations by simulating all possible fault sets, even those that are unlikely to occur in practice. Across all the extensive simulations we conducted, the new game-theory-based diagnosis algorithm performed very well and provided good diagnosis results in terms of correctness, latency, and scalability, making it a viable addition or alternative to existing diagnosis algorithms.
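The input syndrome described above can be sketched in a few lines (node names and outputs below are hypothetical; the diagnosis algorithm itself is beyond this sketch):

```python
def comparison_syndrome(outputs, pairs):
    """Comparison model: each test assigns the same task to a pair of
    nodes and records 0 if their outputs agree, 1 if they disagree.
    The resulting dictionary is the syndrome fed to the diagnosis
    algorithm; with partial syndromes, some pairs are simply absent."""
    return {(u, v): int(outputs[u] != outputs[v]) for (u, v) in pairs}

outputs = {"A": 7, "B": 7, "C": 9}   # node C computes a wrong result
syndrome = comparison_syndrome(outputs, [("A", "B"), ("B", "C")])
assert syndrome == {("A", "B"): 0, ("B", "C"): 1}
```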

20.
Control Engineering Practice, 2006, 14(11):1337–1345
This paper presents the results of a study on fault-tolerant control of a ship propulsion benchmark [Izadi-Zamanabadi, R., & Blanke, M. (1999). A ship propulsion system as a benchmark for fault tolerant control. Control Engineering Practice, 7(2), 227–239], which uses estimated or virtual measurements as feedback variables. The estimator operates on a self-adjustable design model so that its outputs can be made immune to the effects of a specific set of component and sensor faults. The adequacy of sensor redundancy is measured using the control reconfigurability [Wu, N. E., Zhou, K., & Salomon, G. (2000). Reconfigurability in linear time-invariant systems. Automatica, 36(11), 1767–1771], and the number of sensor-based measurements is increased when this level is found inadequate. As a result, sensor faults that are captured in the estimator's design model can be tolerated without the need for any reconfiguration actions. Simulations on the ship propulsion benchmark show that, with the additional sensors added as described and the estimator in the loop, satisfactory fault tolerance is achieved under two additive sensor faults, an incipient fault, and a parametric fault, without having to alter the original controller in the benchmark.
