1.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations Total citations: 1, self-citations: 0, citations by others: 1

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

2.

Design and Implementation of a Fault-Tolerant CORBA System Total citations: 3, self-citations: 0, citations by others: 3

薛文革李增智王宇陆建平《小型微型计算机系统》2002,23(10):1205-1208

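CORBA is one of the most popular standards for object-based middleware platforms. CORBA shields applications from the heterogeneity of distributed systems. However, CORBA does not yet address fault tolerance, which is one of the core problems for distributed applications running in heterogeneous environments. Consequently, many proposals for adding reliability and availability to CORBA applications have appeared in the literature in recent years. This paper analyzes the strengths and weaknesses of these proposals and presents a novel CORBA-compatible approach, which differs from the methods for distributing reliable structures in asynchronous environments.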

3.

Research on an Integrated Design Method for Distribution, Real-Time Operation, and Fault Tolerance

李新明李艺王鹏刘东《计算机工程》2007,33(18):262-264

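Considering the operating environment and characteristics of distributed spacecraft systems, this paper analyzes the requirements that the space environment, real-time operation, fault tolerance, and distribution place on embedded systems, and proposes an integrated design method for distributed, real-time, fault-tolerant embedded systems. The method is described from four aspects: real-time fault tolerance under real-time constraints, comprehensive fault tolerance that combines immunity with self-healing, distributed fault tolerance that combines single-node and inter-node fault tolerance, and the integration of multiple fault-tolerance methods.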

4.

Research on a Fault-Tolerant Load-Balancing Algorithm for Heterogeneous Distributed Systems

邓建波张立臣《计算机工程》2011,37(5):62-64

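Based on the primary/backup copy technique, a heterogeneous distributed fault-tolerant scheduling model is proposed, and the HDL algorithm is built on that model. The algorithm overcomes the unstable load balance that earlier algorithms show before and after a fault occurs, and achieves a certain degree of controllable balance. The simulation experiments also introduce a covariance-based method for measuring load balance. Experimental results show that the algorithm's load balance is stable before and after a fault.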

5.

A Software Fault-Tolerance Strategy Based on Data-Flow Analysis Total citations: 4, self-citations: 1, citations by others: 4

刘云龙陈俊亮《软件学报》1998,9(7):537-541

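This paper discusses in depth the checkpoint and rollback mechanisms used in software fault tolerance and proposes a new software fault-tolerance method based on data-flow analysis. It first gives a brief introduction to software fault tolerance, pointing out that data errors are the root cause and ultimate manifestation of all control-system software failures and that strong fault-tolerance measures for data are therefore necessary. Data-flow analysis is then applied to software fault tolerance: solving the reaching-definitions data-flow equations for program variables statically determines the minimal sufficient rollback when any datum is erroneous at any use point, and solving the live-variable data-flow equations statically determines the set of variables that must be saved dynamically as the program executes each basic block. This yields the minimal sufficient rollback theorem and the checkpoint data range theorem, which answer the two basic questions that the time-redundancy approach to fault tolerance must address. A sufficient condition for a recovery-block definition to be valid is also given. Finally, a concrete implementation of the method is described, using a telecommunication system as the application example. With simple extensions, the method can be widely applied to the design of various kinds of fault-tolerant software.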

6.

Research on a Task Scheduling Model for Energy-Optimized Fault-Tolerant Real-Time Systems

陈艾周学海王峰李曦《小型微型计算机系统》2007,28(11):2056-2061

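To support research on energy-optimizing fault-tolerant real-time task scheduling algorithms, a frequency-related time Petri net, FRTPN, is proposed. FRTPN introduces a transition frequency-setting space for dynamic voltage scaling together with frequency-dependent static firing intervals, which supports energy evaluation and optimization of scheduling algorithms; it also adds a class of inhibitor arcs to describe the fault recovery process. The effectiveness of FRTPN is demonstrated by modeling checkpoint-based, fault-tolerant, energy-optimizing real-time task scheduling.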

7.

An Optimized Method for Providing Fault Tolerance as a Service for Cloud Application Systems

杨娜刘靖《软件学报》2019,30(4):1191-1202

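Ensuring the reliable operation of cloud application systems by providing efficient and continuously available fault-tolerance services is critical. Adopting the fault-tolerance-as-a-service model, an optimized method for dynamically providing cloud fault-tolerance services is proposed. The fault-tolerance requirements of a cloud application are described in terms of the reliability and response time of its components, among other properties. Building on common fault-tolerance techniques such as replication, checkpointing, and NVP (N-version programming), and fully accounting for the overhead of dynamically switching between fault-tolerance services, the method gives an optimization-based solution for selecting among the available fault-tolerance-as-a-service offerings, covering both the case where the underlying cloud resources supporting the services are sufficient and the case where they are not. Experimental results show that the method reduces the fault-tolerance service fees paid by cloud application systems and the overhead of the underlying cloud resources that support those services, and improves the provider's ability to deliver efficient, reliable fault tolerance as a service to multiple cloud applications.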

8.

Playback Dispatch and Fault Recovery for a Clustered Video System with Multiple Servers

Shyu Ing-Jye Shieh Shiuh-Pyng 《Multimedia Tools and Applications》1999,9(3):277-294

Recent technology advances have made multimedia on-demand services feasible. One of the challenges is to provide fault-tolerant capability at the system level for a practical video-on-demand system. The main concern in providing fault recovery is to minimize the consumption of system resources on the surviving servers in the event of a server failure. In order to reduce the overhead of recovery, we present three schemes for recovering faulty playbacks through channel merging and sharing techniques on the surviving servers. Furthermore, to evenly distribute the recovery load among the surviving servers, we propose a balanced dispatch policy that ensures load balancing both under normal server conditions and in the presence of a server failure.

9.

An Improved Fault-Tolerance Optimization Method for Real-Time Embedded Systems

刘浩波李军义李仁发《计算机工程》2015,41(1)

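Hardware redundancy in fault-tolerance techniques incurs high design and production costs. To address this problem, an improved fault-tolerance optimization method for real-time embedded systems is proposed. Based on checkpoint fault tolerance, it jointly analyzes system fault performance, the time constraints of hard real-time tasks, and the utility function values of soft real-time tasks. Building on the designed fault-tolerance model, it computes the system failure probability to ensure that it stays within the maximum allowed failure probability, determines schedulability from the deadlines of hard tasks, and applies an improved tabu search algorithm to obtain the best utility function value for soft tasks. The algorithm uses two simple neighborhood structures, and its tabu criterion forbids neighborhood moves, which markedly improves optimization efficiency. Experimental results show that the method supports combined analyses such as fault analysis and quickly obtains the maximum utility function value.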

10.

Simulation Design of Rollback-Recovery Fault Tolerance for Distributed Computing Systems

董奇 黄斌 颜耀 李韦韦 曾玮妮 张恒 《计算机与现代化》2017,(4):48

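To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault-tolerance algorithms is designed. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes on each node are modeled as discrete events; a module supporting multi-task rollback-recovery simulation is added at the application layer of a network system simulation tool; and the structure, functional modules, and system parameter settings used for the simulation are designed. The results show that the proposed mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault-tolerance algorithms.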

11.

Research on Fault-Tolerance Techniques for Real-Time Distributed Computer Systems

黎珊珊《计算机与数字工程》2002,30(6):61-64,31

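This paper proposes an architecture for a real-time distributed computer system with fault-tolerance capability and studies the fault-tolerance techniques used in such systems. In particular, it discusses in some depth node-level fault tolerance and the realization of real-time behavior, and proposes an implementation scheme.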

12.

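A Graph-Oriented Method for Dynamic Configuration and Fault Tolerance of Distributed Software Total citations: 1, self-citations: 0, citations by others: 1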

宋毅刘云超《计算机应用》2003,23(12):37-41

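A new method is proposed that supports fault tolerance in component-based distributed software through dynamic configuration. The method adopts the graph-oriented GOP programming model and describes the architecture of the entire distributed software as a single logical graph; the system is dynamically configured by executing a set of predefined operations on that graph. Performing this dynamic configuration when a fault or exception is detected supports the system's fault tolerance. The paper describes the method's basic model, its system structure, and a CORBA-based prototype implementation.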

13.

A Survey of Fault-Tolerance Techniques for Distributed Graph Processing Jobs

张程博李影贾统《软件学报》2021,32(7):2078-2102

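As graph data grows ever larger and graph processing jobs grow ever more complex, distributing graph computation has become inevitable. However, running graph processing jobs face serious reliability problems caused by nondeterminism from various sources inside and outside the distributed graph processing system. The paper first analyzes the sources of uncertainty in distributed graph processing frameworks and the robustness of different types of graph processing jobs, and proposes an evaluation framework for fault-tolerance techniques for distributed graph processing jobs along three dimensions: cost, efficiency, and quality, ...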

14.

Performance and effectiveness trade‐off for checkpointing in fault‐tolerant distributed systems

Panagiotis Katsaros Lefteris Angelis Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63

Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.

15.

An Approach to Implementing a Dual-Machine Fault-Tolerant System Total citations: 8, self-citations: 0, citations by others: 8

徐立云邵惠鹤《计算机工程》2000,26(9):95-96

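This paper introduces an approach to building a dual-machine fault-tolerant system designed around Windows multithreading, focusing on the ideas behind its implementation.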

16.

Research on Fault-Tolerance Mechanisms and Design Methods for Embryonic Bio-Inspired Hardware Structures

姚睿王友仁于盛林《计算机测量与控制》2005,13(9):973-975

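This paper introduces a new bio-inspired fault-tolerant system, embryonic bio-inspired hardware. It designs an FPGA as a two-dimensional embryonic array of electronic cells, using the electronic cell array to mimic the multicellular structure of living organisms so that the hardware circuit has self-diagnosis and self-repair properties similar to those of biological cell tissue. Key techniques, including the hardware structure and the error detection and self-repair mechanisms, are described in detail, and the system design method is illustrated with the design of a 4-bit controllable shift register. The application prospects of bio-inspired hardware are discussed, and current problems and priorities for further research are identified.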

17.

Implementing Software Fault Tolerance with Defensive Programming

万剑怡薛锦云《计算机科学》1996,23(1):66-68

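Software fault avoidance is one of the main ways to improve software reliability; it includes techniques such as program verification, testing, and correctness proofs. However, with ...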

18.

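Fault-Tolerance and Load-Balancing Management for Internet Applications Total citations: 1, self-citations: 0, citations by others: 1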

钱方贾焰黄杰顾晓波邹鹏《计算机工程与科学》2000,22(4):23-26

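For the three-tier client/server architecture of Internet applications, this paper designs and implements OM, a management system for redundant services. Using distributed object technology on a CORBA platform, it provides fault-tolerance and load-balancing management for redundant services, offers a management interface for system administrators, and gives users a transparent mechanism for accessing services.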

19.

Replication Management in Reliable Real-Time Systems

Pinho Luís Miguel Vasques Francisco Wellings Andy 《Real-Time Systems》2004,26(3):261-296

Building reliable real-time applications on top of commercial off-the-shelf (COTS) components is not a straightforward task. Thus, it is essential to provide a simple and transparent programming model, in order to abstract programmers from the low-level implementation details of distribution and replication. However, the recent trend for incorporating pre-emptive multitasking applications in reliable real-time systems inherently increases its complexity. It is therefore important to provide a transparent programming model, enabling pre-emptive multitasking applications to be implemented without resorting to simultaneously dealing with both system requirements and distribution and replication issues. The distributed embedded architecture using COTS components (DEAR-COTS) architecture has been previously proposed as an architecture to support real-time and reliable distributed computer-controlled systems (DCCS) using COTS components. Within the DEAR-COTS architecture, the hard real-time subsystem provides a framework for the development of reliable real-time applications, which are the core of DCCS applications. This paper presents the proposed framework, and demonstrates how it can be used to support the transparent replication of software components.

20.

A Fault-Tolerant Detection and Resolution Algorithm for the Generalized Deadlock Model in Distributed Systems

程欣刘宏伟董剑杨孝宗《计算机研究与发展》2007,44(5):798-805

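Distributed system technology offers an effective way to build high-performance systems at low cost. However, because resource allocation and demand may conflict, deadlocks can occur and bring the system to a halt. In unreliable distributed systems, faults interfere with normal deadlock detection, yet existing deadlock detection algorithms are not fault tolerant. Failure modes are classified, and a fault-tolerant deadlock detection and resolution algorithm is proposed. Built on the generalized AND-OR model, the algorithm uses diffusing computation and centralized reduction; it not only detects a deadlock but also reports all members of the deadlock cycle. If the deadlock topology is static and ring-shaped, the algorithm's message complexity is bounded above by e+n-1 and its time complexity is d, where e is the number of edges in the deadlock wait-for graph and n and d are the number of nodes forming the deadlock cycle. Analysis shows that the algorithm performs as well as or better than comparable algorithms.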