共查询到20条相似文献,搜索用时 15 毫秒
1.
Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations 总被引:1,自引:0,他引:1
In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques 8and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy to use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner. 相似文献
2.
容错CORBA系统的设计与实现 总被引:3,自引:0,他引:3
CORBA是基于对象技术的中间件平台的最流行的标准之一 .CORBA对应用屏蔽了分布式系统的异构性 .然而目前 CORBA还没有考虑容错问题 ,而容错是运行在异构环境中的分布式应用的核心问题之一 .因此 ,在最近几年许多为 CORBA应用添加可靠性和可用性的建议出现在各种文献上 .本文分析了这些建议的优、缺点 ,并提出了一个新颖的与 CORBA兼容的方法 ,这种方法不同于异步环境中可靠结构的分发方法 . 相似文献
3.
4.
5.
基于数据流分析的软件容错策略 总被引:4,自引:1,他引:4
该文就软件容错中备查点与卷回机制展开深入讨论,提出一种基于数据流分析技术的软件容错新方法.首先对软件容错进行简介,指出数据错是一切控制系统软件失效的根源与最终表现以及对数据采取强有力的容错措施的必要性.然后将数据流分析技术应用于软件容错,通过求解程序变量的到达-定值数据流方程来静态地确定任何数据在任何引用点出错时的最小充分卷回,通过求解活跃变量的数据流方程来静态地确定程序在执行各个基本块时需动态保存的变量集合,得出最小充分卷回定理与备查点数据范围定理,从而解决了时间冗余容错途径中必须回答的两个基本问题.此外,还给出了恢复块定义有效的充分条件.最后,以电信系统为应用实例,介绍了该方法的一种具体实施.该方法在简单地扩展后可被广泛应用于各种容错软件的设计中. 相似文献
6.
为了支持面向能耗优化的容错实时任务调度算法研究,提出一种频率相关的时间Petri网—FRTPN.FRTPN引入用于动态电压调整的变迁频率设置空间以及和频率相关的静态引发时域,以支持调度算法的能耗评估及优化;同时它增加一类抑制弧刻画容错故障恢复过程.通过对基于检查点的容错实时能耗优化任务调度进行建模证明了FRTPN的有效性. 相似文献
7.
通过提供高效且持续可用的容错服务以保障云应用系统的可靠运行是至关重要的.采用容错即服务的模式,提出了一种优化的云容错服务动态提供方法,从云应用组件的可靠性及响应时间等方面描述云应用容错需求,以常用的复制、检查点和NVP(N-version programming)等容错技术为基础,充分考虑容错服务动态切换开销,分别针对支撑容错服务的底层云资源是否足够的场景,给出可用容错即服务提供方案的最优化求解方法.实验结果表明,所提方法降低了云应用系统支付的容错服务费用及支撑容错服务的底层云资源的开销,提高了容错服务提供商为多个云应用实施高效、可靠容错即服务的能力. 相似文献
8.
Recent technology advances have made multimedia on-demand services feasible. One of the challenges is to provide fault-tolerant capability at system level for a practical video-on-demand system. The main concern on providing fault recovery is to minimize the consumption of system resources on the surviving servers in the event of server failure. In order to reduce the overhead on recovery, we present three schemes for recovering faulty playbacks through channel merging and sharing techniques on the surviving servers. Furthermore, to evenly distribute the recovery load among the surviving servers, we propose a balanced dispatch policy that ensures load balancing in both the normal server conditions and the presence of a server failure. 相似文献
9.
10.
为了解决分布式计算系统回卷恢复容错的验证评估问题,设计一种分布式计算系统的回卷恢复容错算法的仿真机制,依据分布式计算系统回卷恢复容错的总体架构,将分布式计算系统中的节点任务过程使用离散事件模拟,在网络系统仿真工具的应用层增加支持多任务回卷恢复容错仿真的模块,并设计用于回卷恢复容错仿真的结构、功能模块和系统参数设定。结果表明本文提出的仿真机制能够实现分布式计算系统的回卷恢复容错算法的模拟验证,为不同容错算法间对比、改进与优化提供参照。
相似文献
11.
黎珊珊 《计算机与数字工程》2002,30(6):61-64,31
本文提出了一种具有容错功能的实时分布式计算机系统的体系结构,同时对实时分布式计算机系统中的容错技术进行了研究,特别对实时分布式计算机系统中的节点机容错技术及实时性的实现方面做了较深入的讨论,并提出了实现方案。 相似文献
12.
一种面向图的分布式软件动态配置和容错方法 总被引:1,自引:0,他引:1
提出一种新的方法,通过动态配置对基于组件的分布式软件的容错提供支持。此方法采用面向图的GOP编程模型,将整个分布式软件的体系结构用一张逻辑图来描述,系统的动态配置可以通过执行图上预定义的一组操作来完成。检测到故障或异常的时候实施这种动态配置能够支持系统的容错。文中描述了此方法的基本模型、系统结构和基于CORBA的原型实现。 相似文献
13.
14.
Panagiotis Katsaros Lefteris Angelis Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63
Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献
15.
16.
介绍了一种新的仿生容错系统——胚胎型仿生硬件;它将FPGA设计成由电子细胞构成的二维胚胎阵列,使用电子细胞阵列模拟生物体多细胞结构,使硬件电路具有与生物细胞组织类似的自诊断和自修复特性;详细阐述了胚胎型仿生硬件的硬件结构、错误检测与自修复机制等关键技术,并以四位可控移位寄存器的设计为例说明了其系统设计方法;展望了仿生硬件的应用前景,指出了目前存在问题和进一步研究的重点. 相似文献
17.
18.
针对因特网应用的三层客户/服务器体系结构,本文设计并实现了一个面向冗余服务的管理系统OM。它采用分布对象技术,基于CORBA平台,为冗余服务提供了容错和负载平衡管理服务,同时它还为系统管理员提供了管理接口,为用户提供了对服务的透明访问机制。 相似文献
19.
Building reliable real-time applications on top of commercial off-the-shelf (COTS) components is not a straightforward task. Thus, it is essential to provide a simple and transparent programming model, in order to abstract programmers from the low-level implementation details of distribution and replication. However, the recent trend for incorporating pre-emptive multitasking applications in reliable real-time systems inherently increases its complexity. It is therefore important to provide a transparent programming model, enabling pre-emptive multitasking applications to be implemented without resorting to simultaneously dealing with both system requirements and distribution and replication issues. The distributed embedded architecture using COTS components (DEAR-COTS) architecture has been previously proposed as an architecture to support real-time and reliable distributed computer-controlled systems (DCCS) using COTS components. Within the DEAR-COTS architecture, the hard real-time subsystem provides a framework for the development of reliable real-time applications, which are the core of DCCS applications. This paper presents the proposed framework, and demonstrates how it can be used to support the transparent replication of software components. 相似文献
20.
分布式系统技术为采用低成本购建高性能系统提供了有效的途径,但是由于资源的分配与需求可能产生冲突,造成系统中发生死锁,导致系统运行陷入停滞.在不可靠的分布式系统中,故障会干扰正常的死锁检测,但现有的死锁检测算法不具有容错功能.对失效形式进行了归类,提出一个容错的死锁检测解除算法.算法建立在通用的AND-OR 模型基础上,采用扩散计算和集中规约方式,不仅能够检测到死锁,而且能给出死锁环的全部成员.若死锁拓扑处于静态且为环状,算法的消息复杂度的上限为e+n-1,时间复杂度为d,其中e为死锁等待图中边的个数,n和d为构成死锁环的节点的个数,分析表明算法性能等于或优于同类算法. 相似文献