
1.

Error detection and diagnosis for fault tolerance in distributed systems

Kassem Saleh, Khaled Al-Saqabi 《Information and Software Technology》1998,39(14-15)

Detecting errors early and understanding the nature and conditions of their occurrence can make recovery in distributed systems more effective and efficient. Various distributed system extensions have been introduced to implement fault tolerance in distributed software systems. These extensions rely mainly on exchanging contextual information appended to every transmitted application-specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis, and recovery should a transient failure occur later during execution of the distributed program. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems, and we show its detection capabilities. Our extension is based on executing a message validity test before messages are transmitted and on piggybacking contextual information to facilitate the detection and diagnosis of transient faults in the distributed system.
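The two ingredients named in this abstract, a validity test before each send and contextual information piggybacked on each message, can be illustrated with a minimal sketch. It is not the authors' extension: the `Message` and `Node` classes, the `non_empty` validity test, and the use of a per-destination sequence number as the piggybacked context are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Message:
    payload: Any
    sender: str
    seq: int  # piggybacked context (assumed): index in the sender's stream to this receiver


def non_empty(payload: Any) -> bool:
    """Hypothetical validity test: reject missing or empty payloads."""
    return payload is not None and payload != ""


class Node:
    def __init__(self, name: str, is_valid: Callable[[Any], bool] = non_empty) -> None:
        self.name = name
        self.is_valid = is_valid
        self.next_seq: Dict[str, int] = {}   # next sequence number per destination
        self.expected: Dict[str, int] = {}   # next expected sequence number per sender
        self.diagnostics: List[str] = []     # human-readable record of detected errors

    def send(self, dest: str, payload: Any) -> Optional[Message]:
        """Run the validity test first; only valid payloads get context attached."""
        if not self.is_valid(payload):
            self.diagnostics.append(f"send to {dest} blocked: invalid payload")
            return None
        seq = self.next_seq.get(dest, 0)
        self.next_seq[dest] = seq + 1
        return Message(payload, self.name, seq)

    def receive(self, msg: Message) -> bool:
        """Compare the piggybacked context with the local expectation."""
        want = self.expected.get(msg.sender, 0)
        self.expected[msg.sender] = msg.seq + 1  # resynchronize on the received value
        if msg.seq != want:
            self.diagnostics.append(
                f"from {msg.sender}: expected seq {want}, got {msg.seq}"
            )
            return False
        return True
```

If node A sends three messages to B and the second is lost in transit, B accepts the first, rejects the third, and records which sequence number it expected, which is the kind of diagnostic detail that a bare timeout would not provide.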

2.

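A Disaster-Recovery-Oriented Failure Detection Algorithm Based on Ring Topology

Wang Qiang, Zhou Enqiang, Chen Haitao, Chen Weining 《Computer Engineering & Science》2010,32(7):30-34

As information systems become widespread in critical applications, their disaster recovery capability is attracting growing attention. Failure detection is one of the key technologies for building disaster recovery systems, and fast, efficient, and accurate failure detection is a prerequisite and guarantee for effective disaster recovery. This paper studies FDA-DR-BR, a failure detection algorithm based on a ring topology. The algorithm remedies the shortcomings of ring topologies with respect to disaster recovery requirements, overcomes the single point of failure of tree topologies, and incurs lower network overhead than a tree topology. Experiments show that the algorithm has low diagnosis latency and good scalability, can effectively improve the scalability of failure detection in disaster recovery systems, and can be applied to monitoring and management systems for disaster recovery.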
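As a rough illustration of failure detection on a ring, the sketch below has each live node watch its successors in ring order and skip those whose heartbeat is overdue. It is a generic heartbeat scheme written under stated assumptions, not the FDA-DR-BR algorithm from the paper: the function name `ring_round`, the timestamp table, and the fixed timeout are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def ring_round(
    ring: List[str],
    last_heartbeat: Dict[str, float],
    now: float,
    timeout: float,
) -> Dict[str, Tuple[List[str], Optional[str]]]:
    """One detection round on a ring.

    Returns, for every node that is itself still alive, a pair
    (suspected successors it skipped, live node it now monitors).
    """

    def is_late(node: str) -> bool:
        return now - last_heartbeat[node] > timeout

    reports: Dict[str, Tuple[List[str], Optional[str]]] = {}
    for i, node in enumerate(ring):
        if is_late(node):
            continue  # a failed node produces no report
        skipped: List[str] = []
        target: Optional[str] = None
        # Walk clockwise until a successor with a fresh heartbeat is found.
        for step in range(1, len(ring)):
            candidate = ring[(i + step) % len(ring)]
            if is_late(candidate):
                skipped.append(candidate)
            else:
                target = candidate
                break
        reports[node] = (skipped, target)
    return reports
```

Because every node only watches its nearest live successor, no single node is a point of failure for detection, and each round costs one heartbeat check per live node plus one per skipped node.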

3.

FDR: fault detection and recovery scheme for wireless sensor networks using virtual grid

Kulwardhan Singh, T. P. Sharma 《International Journal of Parallel, Emergent and Distributed Systems》2017,32(6):617-631

Because they operate autonomously with constrained resources, the nodes of a wireless sensor network are susceptible to failures. Multiple node failures may change the network topology or partition it into many disconnected segments, breaking data/query paths. Early detection and recovery of faults is desirable in most scenarios. Also, because a node's battery life is limited, the solution must consume as little energy as possible to prolong network operation. In this paper, we therefore propose a Fault Detection and Recovery (FDR) scheme, an energy-efficient strategy that minimizes data loss by efficiently replacing a faulty node on the present route or by finding full or partial alternate paths. Topology is managed by constructing a virtual grid over the entire network, which makes it easier to manage a dynamically changing topology and makes failure detection and recovery effective. The grid also helps create an energy-efficient path between a source and a sink by finding the shortest possible path. Further, cooperation among nodes is exploited to create zones on the data/query path that provide alternate nodes or alternate paths when nodes fail. Thus, the scheme achieves fault tolerance and, at the same time, energy efficiency by always selecting the shortest path for data delivery between source and sink. Analytical and simulation studies reveal a significant improvement in fault detection and energy conservation over existing similar schemes.
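The virtual grid idea can be sketched as a mapping from node coordinates to square cells, so that a failed node on a route can be replaced by another node from the same cell. The sketch below only illustrates that idea and is not the FDR scheme itself: the helper names, the square cell of side `cell_size`, and the rule of picking the peer with the most residual energy are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
Position = Tuple[float, float]


def cell_of(x: float, y: float, cell_size: float) -> Cell:
    """Map a coordinate to the index of the virtual grid cell that contains it."""
    return (int(x // cell_size), int(y // cell_size))


def build_grid(positions: Dict[str, Position], cell_size: float) -> Dict[Cell, List[str]]:
    """Group node names by the grid cell they fall into."""
    grid: Dict[Cell, List[str]] = {}
    for node, (x, y) in positions.items():
        grid.setdefault(cell_of(x, y, cell_size), []).append(node)
    return grid


def replacement_for(
    failed: str,
    positions: Dict[str, Position],
    energy: Dict[str, float],
    cell_size: float,
) -> Optional[str]:
    """Pick a substitute from the failed node's cell (assumed rule: most residual energy).

    Returns None when the cell has no other node, in which case an
    alternate path would have to be found instead.
    """
    grid = build_grid(positions, cell_size)
    fx, fy = positions[failed]
    peers = [n for n in grid[cell_of(fx, fy, cell_size)] if n != failed]
    return max(peers, key=lambda n: energy[n], default=None)
```

With a 10 m cell, nodes at (1, 1) and (3, 4) share cell (0, 0), so either can stand in for the other, while a node at (12, 2) sits alone in cell (1, 0) and has no in-cell replacement.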

4.

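Research on a Multi-Agent-Based Failure Handling System for Fault-Tolerant Middleware

Huang Ximin, Guo Chaozhen 《Microcomputer & Its Applications》2013,32(17):72-76

To address the fault tolerance requirements of distributed applications in sensitive industries, this paper analyzes and introduces agents, multi-agent systems, and fault-tolerant middleware technology. Exploiting the structural similarity between the characteristics of agents and middleware, it attempts to build fault-tolerant middleware with multi-agent technology, focusing on the failure detection and recovery system. It establishes a two-layer failure detection model that combines local and global detection, and proposes an improved REDO failure recovery strategy that incorporates fixed-point recovery and recovery on a different machine. Finally, a system implementation based on JADE is presented. Experimental results show that the two-layer detection model and the improved REDO recovery strategy are feasible and efficient.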

5.

Calibrating embedded protocols on asynchronous systems

Yukiko Yamauchi, Doina Bein, Toshimitsu Masuzawa, Linda Morales, I. Hal Sudborough 《Information Sciences》2010,180(10):1793-1801

Embedding is a method of projecting one topology into another. In one-to-one node embedding, paths in the target topology correspond to links in the original topology. A protocol running on the original topology can be modified to run on a target topology by means of embedding. However, if the protocol tolerates a number of faults (faults that affect the data but not the code of a distributed protocol executed by the nodes of a distributed system), the adapted protocol will not preserve this fault tolerance, because links in the original topology can be embedded into paths of length greater than one, and faults at the intermediate nodes on such paths are not accounted for in the protocol. We propose a communication protocol in the target topology that preserves the fault tolerance characteristics of any protocol designed for the original topology.

6.

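A Fault Classification Model for Distributed Parallel Systems

Lu Guangyuan, Wu Yue, Yang Hongbin 《Computer Applications and Software》2006,23(7):35-36

Fault diagnosis is a key part of fault-tolerant systems in distributed parallel environments, and the fault classification model is one of the important factors affecting diagnosis performance. Because different distributed systems have different characteristics, fault diagnosis schemes generally take the requirements of the program and the properties of the system into account and choose the most suitable fault classification model in order to reduce the system's diagnosis burden. This paper proposes a new fault classification model for distributed parallel environments that confines fault diagnosis to a reasonable fault set. Combining this classification model with specific program requirements and system properties yields an effective fault detection scheme.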

7.

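An Efficient Fault-Tolerance Mechanism for Super-Node Networks

Tan Yihong, Luan Xidao, Li Bin 《Computer Science》2011,38(11):75-78,95

A super-node network uses super-nodes as servers for ordinary nodes, making them responsible for management and query processing, which improves search efficiency. However, if a super-node fails, the stability and query efficiency of the network are severely affected. This paper proposes an efficient fault-tolerance mechanism. First, it improves the undirected double-loop structure, proposes a k-undirected double-loop topology, and uses it to build the super-node layer topology so that the network is highly fault tolerant. On this basis, it gives methods for super-node selection and super-node load balancing, reducing the likelihood that a super-node fails from overload. It also gives a super-node failure recovery algorithm and a fault-tolerant routing algorithm, which handle recovery and routing after a super-node fails. Experimental results show that the network is easy to maintain and highly fault tolerant.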

8.

A general method for maximizing the error-detecting ability of distributed algorithms

Schollmeyer M., McMillin B. 《IEEE Transactions on Parallel and Distributed Systems》1997,8(2):164-172

The bound on component failures and their spatial distribution govern the fault tolerance of any candidate error-detecting algorithm. For distributed memory multiprocessors, the specific algorithm and the topology of the processor interconnection network define these bounds. This paper introduces the maximal fault index, derived from the system topology and local communication patterns, to demonstrate how a maximal number of simultaneous component failures can be tolerated for a particular interconnection network and error-detecting algorithm. The index is used to design a mapping of processes to processor groups such that the error-detecting ability of the algorithm is preserved for certain multiple simultaneous processor failures.

9.

The Effects of an ARMOR-based SIFT environment on the performance and dependability of user applications

Whisnant K., Iyer R.K., Kalbarczyk Z.T., Jones P.H. III, Rennels D.A., Some R. 《IEEE Transactions on Software Engineering》2004,30(4):257-277

Few distributed software-implemented fault tolerance (SIFT) environments have been experimentally evaluated using substantial applications to show that they protect both themselves and the applications from errors. We present an experimental evaluation of a SIFT environment used to oversee spaceborne applications as part of the Remote Exploration and Experimentation (REE) program at the Jet Propulsion Laboratory. The SIFT environment is built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to the REE applications. An evaluation methodology is presented in which over 28,000 errors were injected into both the SIFT processes and two representative REE applications. The experiments were split into three groups of error injections, with each group successively stressing the SIFT error detection and recovery more than the previous group. The results show that the SIFT environment added negligible overhead to the application's execution time during failure-free runs. Correlated failures affecting a SIFT process and application process are possible, but the division of detection and recovery responsibilities in the SIFT environment allows it to recover from these multiple failure scenarios. Only 28 cases were observed in which either the application failed to start or the SIFT environment failed to recognize that the application had completed. Further investigations showed that assertions within the SIFT processes, coupled with object-based incremental checkpointing, were effective in preventing system failures by protecting dynamic data within the SIFT processes.

10.

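Research and Design of a Middleware-Based Fault-Tolerant System

Yao Lan, Gui Xun, Ju Junrang 《Computer Engineering》2007,33(6):83-85,1

As hardware fault-tolerance technology matures, software fault tolerance has become a focus for improving system reliability. Developing fault-tolerant applications directly is very difficult. Because middleware provides a good development environment for application systems, this paper studies and designs a middleware-based fault-tolerant system model and proposes a new method for constructing the node fault-tolerance structure, forming a fairly complete solution to key fault-tolerance problems such as redundancy, failure detection, and recovery. System reliability is computed using a Markov process, which verifies the soundness and reliability of the system design.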

11.

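Fault Detection and Optimization for Networked Control Systems with Unknown Delays and Markov Packet Dropouts

Wang Zhaolei, Wang Qing, Dong Chaoyang, Niu Erzhuo 《Control and Decision》2014,29(9):1537-1544

The unknown short delays of networked control systems (NCSs) are treated as norm-bounded uncertainty and, together with the effect of Markov packet dropouts, the NCS is modeled as an uncertain Markov jump system, for which a mode-dependent robust fault detection filter is designed. To improve detection performance, a post-filter is used to optimize the residual signal in the time domain, and its optimal solution is given in Moore-Penrose inverse form. In addition, an adaptive detection threshold is designed and an iterative method for the time-varying parameter matrix is given, which reduces the computational load. Numerical simulations show that the proposed method effectively suppresses the effects of delays and packet dropouts and improves the detection capability and speed of the fault detection system.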


12.

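Fault Detection for a Class of Nonlinear Networked Control Systems Based on the T-S Fuzzy Model

Huang He, Xie Dexiao, Han Xiaodong, Zhang Dengfeng, Wang Zhiquan 《Information and Control》2009,38(6):1-1

For network environments with both network-induced delay and packet loss, the robust fault detection problem for a class of nonlinear networked control systems is studied. Based on a nonlinear networked control system model described by an uncertain T-S fuzzy model, a robust fault detection observer is designed for the networked environment so that the residual signal is sensitive to faults and robust to external disturbances. A Lyapunov-Krasovskii functional is constructed and an integral inequality is introduced to give sufficient conditions for asymptotic stability of the observer error dynamics. Using linear matrix inequality techniques, the robust fault detection problem is converted into a convex optimization problem with linear matrix inequality constraints. A simulation example verifies the effectiveness of the method for fault detection in this class of systems.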

13.

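MPLS Fault Recovery Mechanisms and Their Simulation

Chen Xuefei, Li Peng, Huang He 《Computer Engineering and Design》2008,29(16)

MPLS fault recovery mechanisms are studied, and the performance of the various mechanisms is analyzed in terms of recovery timing, recovery topology, recovery efficiency, and backup path resource consumption. NS2 is extended with a simulation component, designed and implemented in this work, that supports MPLS fault recovery mechanisms, including fault detection, fault notification, and fault switchover. The component provides basic MPLS recovery capability and supports multiple fault recovery mechanisms, offering an experimental platform for further study of MPLS fault recovery methods and for optimizing MPLS recovery algorithms.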

14.

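Fault-Tolerance Analysis of the Connectivity Problem in Wireless Sensor Networks

Wu Diwen, XIE Dong-qing, Wang Lupeng 《Journal of Chinese Computer Systems》2008,29(8)

In deterministically deployed wireless sensor networks, node failure should be taken into account when deploying and analyzing the network, because nodes are fragile and application environments are harsh. This paper examines two questions: given the network connectivity probability and network size, within what range the node failure probability must lie; and given the network size and node failure probability, what coverage and connectivity the network achieves. The paper first defines a fairly regular triangular model and studies, under deterministic deployment, the relationship between the node failure probability and the network coverage probability. It then uses the concept of a "k-order subnetwork" to analyze the connectivity fault tolerance of the Triangular network. Finally, simulation experiments verify the credibility of the theoretically derived lower bound on the connectivity probability and upper bound on the node failure probability, and compare the Triangular topology with a grid network.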

15.

Fault Tolerance and Recovery for Group Communication Services in Distributed Networks

Wang Yuehua, Zhou Zhong, Wu Wei 《Journal of Computer Science and Technology》2012,27(2):298-312

Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault tolerant and scalable enough to meet the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing this challenge and accommodating changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders their practicality. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion so that the services on nodes gain immunity to failures with different patterns (e.g., network partition and single point of failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.

16.

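Research on Fault Detection and Recovery Technology for Distributed Networks

Lai Xiao, Feng Dongqin, Chu Jian 《Computer Engineering and Applications》2010,46(24):73-76

The IEC 62439 series of protocols is designed specifically for high-availability industrial automation networks, and each protocol has its own characteristics. This paper focuses on the communication mechanism and the fault detection and recovery mechanism of IEC 62439-6 DRP (Distributed Redundancy Protocol). Based on a set of parameters including the DRP ring network cycle period, the number of network switching devices, and the message processing time, an algorithm is proposed for the time required from fault detection to recovery. Tests on a test platform verify that a DRP network quickly detects node and link faults and restores network communication within a short time, meeting the high-availability requirements of modern industrial networks.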

17.

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Dawei Sun, Guiran Chang, Changsheng Miao, Xingwei Wang 《The Journal of Supercomputing》2013,66(1):193-228

Failures are normal rather than exceptional in cloud computing environments. Because fault tolerance plays a key role in ensuring cloud serviceability, achieving high fault tolerance is one of the major obstacles to a new era of high-serviceability cloud computing. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet demanding cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed by referring to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely checkpointing and data replication; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and high SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.

18.

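A Self-Maintaining Topology Control Algorithm for Wireless Sensor Networks

Wang Yanli, Hou Xianchun, Wang Zhilin, Wang Dongfang, Song Kefan 《Microcomputer & Its Applications》2012,31(7):58-60

In environmental monitoring scenarios such as greenhouses and disaster relief, wireless sensor networks face survivability problems caused by frequent natural faults and malicious attacks. To address this, a self-maintaining, destruction-resistant topology control algorithm is proposed. Simulation results show that the algorithm builds and maintains a fault-tolerant topology simply and effectively, keeps the network topology fault tolerant and destruction resistant when nodes fail, and thus gives the wireless sensor network the ability to survive.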

19.

Topology optimization of fail-safe structures using a simplified local damage model

Jansen Miche, Lombaert Geert, Schevenels Mattias, Sigmund Ole 《Structural and Multidisciplinary Optimization》2014,49(4):657-666

Topology optimization of mechanical structures often leads to efficient designs which resemble statically determinate structures. These economical structures are especially vulnerable to local loss of stiffness due to material failure. This paper therefore addresses local failure of continuum structures in topology optimization in order to design fail-safe structures which remain operable in a damaged state.

A simplified model for local failure in continuum structures is adopted in the robust approach. The complex phenomenon of local failure is modeled by removal of material stiffness in patches with a fixed shape. The damage scenarios are taken into account by means of a minimax formulation of the optimization problem which minimizes the worst case performance.

The detrimental influence of local failure on the nominal design is demonstrated in two representative examples: a cantilever beam optimized for minimum compliance and a compliant mechanism. The robust approach is applied successfully in the design of fail-safe alternatives for the structures in these examples.
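The minimax formulation mentioned in this abstract can be written schematically as follows. The notation is assumed for illustration and is not taken from the paper: $\rho$ denotes the design variables, $D$ the set of damage scenarios (positions of the removed patch), $f$ the performance measure such as compliance, and the volume bound $V_{\max}$ is a typical constraint added here as an assumption.

```latex
\min_{\rho}\; \max_{d \in D}\; f(\rho, d)
\quad \text{subject to} \quad V(\rho) \le V_{\max}, \qquad 0 \le \rho_e \le 1 .
```

The inner maximum picks the damage scenario with the worst performance for a given design, and the outer minimum chooses the design whose worst case is best.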


20.

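Design of a Robust Fault Estimation Observer for Multi-Agent Systems

Fan Qian, Yang Minsong, Yan Yuanyong 《Computer Measurement & Control》2018,26(5):153-157

A finite-frequency-domain robust fault estimation observer is designed for linear parameter-varying multi-agent systems. First, the dynamic equation of each node is established from each agent's absolute and relative measurable outputs; combined with the undirected communication topology graph and the Laplacian matrix, the dynamic equation of the multi-agent system is obtained, and the multi-agent system model is decoupled through a suitable transformation. Then, a fault estimation observer is designed from the decoupled system dynamics, and the observer gain matrix and a good robust performance index are obtained through optimization. Finally, an example of the longitudinal flight motion of a micro air vehicle verifies the effectiveness of the designed observer and shows that it still accurately estimates system faults when the system parameters vary within a certain range.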