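To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault-tolerance algorithms is designed. Based on the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes on each node are modeled with discrete-event simulation, a module supporting multi-task rollback-recovery simulation is added at the application layer of a network simulation tool, and the structure, functional modules, and system parameter settings for the simulation are designed. The results show that the proposed mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault-tolerance algorithms.

Similar documents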

Fault tolerance is an important design criterion for reliable and robust video-on-demand systems. Conventional fault-tolerant designs use either a primary backup or an active replication method to provide system fault tolerance. However, these approaches suffer from low utilization of the backup or replication system. In this paper we propose two playback-recovery schemes for distributed video-on-demand systems called the forward playback-recovery scheme and the backward playback-recovery scheme. Unlike conventional fault-tolerant designs, our schemes use existing playback resources to recover faulty playbacks without allocating new resources, significantly reducing recovery overhead. To use the schemes effectively, we developed a distributed algorithm for determining the order and gap information between the playbacks on the distributed video-on-demand servers so that overhead for recovering from a server failure can be minimized. This algorithm achieves N – 1 fault-tolerant resiliency for N-server video-on-demand systems. In addition, three server-recovery policies are also presented to guide surviving servers in applying the proper scheme to recover faulty playbacks, thus reducing overall recovery costs. Simulation results show that the proposed recovery schemes are effective and useful in designing fault-tolerant multiple-server video-on-demand systems.

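Based on an analysis of the fault diagnosis and recovery mechanisms of the DRP distributed ring network redundancy protocol, a DRP fault recovery time model is established that divides recovery time into fault-location waiting time, fault alarm time, and fault handling time. For management-module faults in switching devices and for communication-link faults, and for the way DRP detects each kind of fault, the factors affecting recovery time are analyzed. The main factors that limit further reduction of recovery time are derived from the algorithm, and the recovery times of the different faults in an EPA field network are verified experimentally.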

Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault tolerant and scalable enough to meet the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing this challenge so as to accommodate changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders their practicability. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion so that the services on nodes gain immunity to failures with different patterns (e.g., network partition and single point of failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.

Checkpoint and rollback recovery is a well-known technique for providing fault tolerance to long-running distributed applications. Performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performances. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd.
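
As a rough illustration of the kind of experiment a checkpoint-and-recovery simulator supports, the following minimal Python sketch runs one task under periodic checkpointing with injected failures, driven as a discrete-event loop. It is not dPSIM and not any of the five evaluated protocols; the function name `simulate_task`, its parameters, and the simplification that failures arriving during a restart are ignored are all assumptions made for this example.

```python
import heapq


def simulate_task(work, interval, ckpt_cost, restart_cost, failure_times):
    """Return the simulated time at which a task needing `work` units finishes.

    Hypothetical model: the task computes `interval` units, then spends
    `ckpt_cost` saving a checkpoint. A failure before a segment is committed
    discards that segment, costs `restart_cost`, and resumes from the last
    checkpoint. Failures that arrive during a restart are ignored.
    """
    events = list(failure_times)   # pending failure instants (min-heap)
    heapq.heapify(events)
    now = 0.0                      # simulated clock
    saved = 0.0                    # work protected by the last checkpoint
    while saved < work:
        remaining = work - saved
        last = remaining <= interval          # final segment needs no checkpoint
        seg = remaining if last else interval
        seg_end = now + seg + (0.0 if last else ckpt_cost)
        # Drop failures that fell inside an earlier restart window.
        while events and events[0] < now:
            heapq.heappop(events)
        if events and events[0] < seg_end:
            # Failure before the segment committed: roll back and restart.
            now = heapq.heappop(events) + restart_cost
        else:
            # Segment (and its checkpoint, if any) completed.
            now = seg_end
            saved = work if last else saved + seg
    return now


if __name__ == "__main__":
    print(simulate_task(10, 4, 1, 2, []))    # failure-free run
    print(simulate_task(10, 4, 1, 2, [7]))   # one failure at time 7
```

With 10 units of work, a checkpoint every 4 units costing 1 unit, and a restart cost of 2, the failure-free run finishes at time 12, and a single failure at time 7 pushes completion to time 16. Sweeping `interval` or the failure list is the sort of comparison the abstract describes, here in a far simpler setting.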

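Design and Implementation of a Fault-Tolerant CORBA System   Total citations: 3, self-citations: 0, citations by others: 3
CORBA is one of the most popular standards for object-based middleware platforms. CORBA shields applications from the heterogeneity of distributed systems. However, CORBA does not yet address fault tolerance, which is one of the core problems for distributed applications running in heterogeneous environments. Consequently, many proposals for adding reliability and availability to CORBA applications have appeared in the literature in recent years. This paper analyzes the strengths and weaknesses of these proposals and presents a novel CORBA-compatible approach, which differs from approaches that distribute reliable structures in an asynchronous environment.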

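Song Wei, Yang Xuejun. Journal of Software (《软件学报》), 2011, 22(9): 2248-2262
To address fault tolerance in transactional memory systems, a fault recovery method based on transaction rollback is proposed. The method exploits the transactional memory system's own version management mechanism and so avoids the overhead of saving additional checkpoint data, achieving efficient fault recovery. Its correctness is proved through the isolation property of the fault-tolerant transactional memory system. Finally, the method's performance is evaluated with five test programs, including four typical SPLASH-2 benchmarks. The experimental results show that, compared with classical checkpointing, the method avoids the overhead of saving additional checkpoint data and also has lower fault recovery overhead.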
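
The idea that version management can double as the recovery mechanism can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the paper's design: it assumes lazy version management (writes buffered until commit), and the names `Transaction`, `TransientFault`, and `run_with_recovery` are hypothetical.

```python
class TransientFault(Exception):
    """Stands in for a detected transient fault (hypothetical)."""


class Transaction:
    """Toy transaction with lazy versioning: writes stay in a private buffer."""

    def __init__(self, store):
        self.store = store
        self.writes = {}

    def read(self, key):
        # A transaction sees its own buffered writes first.
        return self.writes[key] if key in self.writes else self.store[key]

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        self.store.update(self.writes)

    def abort(self):
        # Dropping the buffer restores the pre-transaction state;
        # no separate checkpoint data is needed.
        self.writes.clear()


def run_with_recovery(store, body, max_retries=3):
    """Run `body(tx)`; on a fault, roll back and re-execute.

    Returns the number of rollbacks that were needed.
    """
    for attempt in range(max_retries + 1):
        tx = Transaction(store)
        try:
            body(tx)
        except TransientFault:
            tx.abort()
            continue
        tx.commit()
        return attempt
    raise RuntimeError("fault persisted after retries")
```

Because uncommitted writes never reach `store`, a fault inside a transaction is recovered simply by discarding the buffer and re-running the body, which is the sense in which rollback replaces an explicit checkpoint in this sketch.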

The granularity of scheduling video streams can be categorized as cycle-scheduling and slot-scheduling, where a time cycle is further divided into time slots. To avoid resource conflicts and thereby increase the throughput of clustered video servers, the paper presents slot-scheduling using conflict-free scheduling and, in particular, cycle-scheduling using full-duplex scheduling and ordered scheduling. The pros and cons of applying slot-scheduling and cycle-scheduling to clustered video servers are also discussed.

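An automatic compile-time fault-tolerant recovery method for transient faults is proposed. It uses variable information from the source code to prune redundant error flows at the instruction level. The method is implemented on LCC and achieves good fault-tolerance performance. Experimental results show that it adds only 0.043 times the original running time and 0.69 times the original space, outperforming other existing methods in both time and space overhead.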

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

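Based on the primary/backup copy technique, a heterogeneous distributed fault-tolerant scheduling model is proposed, together with an HDL algorithm built on that model. The algorithm overcomes the unstable load balance that earlier algorithms exhibit before and after a fault and makes the balance controllable to some extent. The simulation experiments also introduce a covariance-based method for measuring load balance. The experimental results show that the algorithm's load balance is stable before and after a fault.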

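Analysis of the Optimal Checkpoint Interval in Dual-Machine Fault-Tolerant Systems   Total citations: 2, self-citations: 0, citations by others: 2
Checkpointing is an important means of fault recovery in fault-tolerant computer systems. Because a checkpoint interval that is either too long or too short degrades system performance, choosing the interval appropriately is an important part of performance optimization. For a dual-machine fault-tolerant system, this paper proposes a system model that uses checkpointing and rollback recovery, derives an equation for the optimal checkpoint interval from a Markov chain, and confirms the equation's correctness experimentally.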
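
The abstract does not reproduce its Markov-chain equation, so it is not restated here. As a stand-in that shows the same trade-off, the Python sketch below uses Young's classic first-order approximation for a single system, in which the interval that minimizes overhead is sqrt(2 × checkpoint cost × mean time between failures). The function names and the simple overhead model are assumptions for illustration and are not the dual-machine model of the paper.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def overhead_fraction(interval, ckpt_cost, mtbf):
    """Approximate fraction of time lost (assumed first-order model).

    ckpt_cost / interval   : time spent writing checkpoints
    interval / (2 * mtbf)  : expected rework, half an interval per failure
    """
    return ckpt_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    c, m = 2.0, 100.0                      # hypothetical cost and MTBF
    t = young_interval(c, m)
    for candidate in (t / 2, t, 2 * t):
        print(candidate, overhead_fraction(candidate, c, m))
```

With a checkpoint cost of 2 and a mean time between failures of 100, the approximation gives an interval of 20 with an overhead of 0.2, while halving or doubling the interval raises the overhead to 0.25. This is the "too long or too short" effect the abstract describes.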

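Research on Fault Detection and Fault Recovery Techniques in High-Availability Real-Time Systems   Total citations: 3, self-citations: 2, citations by others: 3
As fields such as national defense and aerospace place ever higher demands on system availability and real-time performance, ensuring that these application systems are highly available and meet hard real-time requirements has become an urgent problem. This paper discusses fault detection and fault recovery techniques in high-availability real-time systems.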

Checkpointing has a crucial impact on systems' performance and fault-tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response-time and fault-tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision-making procedure. We adopt a simulation-based approach for obtaining performance metrics that are afterwards used for determining a trade-off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault-tolerance mechanisms (replication-based or message-logging-based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT-CORBA), but it could also be applied in other (e.g. process-based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.

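To reduce the computational complexity of video compressed sensing in multimedia sensor networks, a joint video reconstruction algorithm based on frame classification is proposed. Video frames are divided into key frames and non-key frames according to the joint sparsity model of the video data. For the underdetermined linear system that arises in compressed-sensing reconstruction, the correlated side information between key frames and non-key frames is used to initialize the reconstruction, and the system is solved with bound-constrained quadratic programming. Simulation results show that, compared with conventional video compressed-sensing algorithms, the proposed method reduces the complexity of the reconstruction algorithm and improves the real-time performance of video reconstruction while maintaining reconstruction quality.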

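This paper proposes an architecture for a fault-tolerant real-time distributed computer system and studies fault-tolerance techniques for such systems. In particular, it discusses in some depth node-level fault tolerance and the realization of real-time behavior, and proposes an implementation scheme.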

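To meet the command and dispatch communication requirements of rail transit, a command and dispatch system based on Terrestrial Trunked Radio (TETRA) digital trunked wireless communication is proposed. The hardware and software architecture of the TETRA command and dispatch system is described, and information sharing is achieved by interconnecting with other related systems. The paper focuses on design measures such as redundancy, fault tolerance, and data transmission control, which strengthen system reliability and availability, improve the transmission performance of critical data, and meet the requirements of long-duration, uninterrupted operation.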

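A new bio-inspired fault-tolerant system, embryonic bio-inspired hardware, is introduced. It designs an FPGA as a two-dimensional embryonic array of electronic cells and uses this array to mimic the multicellular structure of living organisms, giving the hardware circuit self-diagnosis and self-repair properties similar to those of biological cell tissue. Key techniques such as the hardware structure and the error detection and self-repair mechanisms are described in detail, and the system design method is illustrated with the design of a 4-bit controllable shift register. The application prospects of bio-inspired hardware are discussed, and current problems and priorities for further research are pointed out.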

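This article briefly introduces the real-time multimedia synchronization problem in distributed multimedia communication systems, together with the characteristics of an adaptive synchronization algorithm and its good adaptability: it can accommodate a variety of network changes and delay characteristics. The algorithm is used to achieve intra-stream and inter-stream synchronization of audio and video.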

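By combining COTS technology with traditional fault-tolerance techniques, a high-performance, highly reliable fault-tolerant server is designed and implemented. The server is built on mature hardware and software, is highly open, provides fault tolerance transparently to users, is low in cost, and is highly extensible. This paper describes in detail the server's architecture, working model, fault-tolerance mechanisms, and forward fault recovery technique.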
