1.
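To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, this paper designs a simulation mechanism for rollback-recovery fault-tolerance algorithms. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes on each node are modeled as discrete events; a module supporting multi-task rollback-recovery simulation is added at the application layer of a network simulation tool; and the structure, functional modules, and system parameter settings used for the simulation are designed. The results show that the proposed mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, and that it provides a reference for comparing, improving, and optimizing different fault-tolerance algorithms.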
2.
Fault tolerance is an important design criterion for reliable and robust video-on-demand systems. Conventional fault-tolerant designs use either a primary backup or an active replication method to provide system fault tolerance. However, these approaches suffer from low utilization of the backup or replication system. In this paper we propose two playback-recovery schemes for distributed video-on-demand systems called the forward playback-recovery scheme and the backward playback-recovery scheme. Unlike conventional fault-tolerant designs, our schemes use existing playback resources to recover faulty playbacks without allocating new resources, significantly reducing recovery overhead. To use the schemes effectively, we developed a distributed algorithm for determining the order and gap information between the playbacks on the distributed video-on-demand servers so that overhead for recovering from a server failure can be minimized. This algorithm achieves N - 1 fault-tolerant resiliency for N-server video-on-demand systems. In addition, three server-recovery policies are also presented to guide surviving servers in applying the proper scheme to recover faulty playbacks, thus reducing overall recovery costs. Simulation results show that the proposed recovery schemes are effective and useful in designing fault-tolerant multiple-server video-on-demand systems.
3.
4.
Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault-tolerant and scalable enough to meet the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing this challenge so as to accommodate changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders the practicability of the service. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion so that the services on nodes gain immunity to failures with different patterns (e.g., network partition and single point of failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.
5.
Checkpoint and rollback recovery is a well-known technique for providing fault tolerance to long-running distributed applications. Performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performances. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd.
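As a rough illustration of what simulating checkpoint and rollback recovery involves, the following minimal Python sketch models a single process that checkpoints at a fixed interval and rolls back to its last checkpoint on failure. It is not dPSIM or any of the five evaluated protocols; the function name `simulate_run`, all of its parameters, and the exponential failure model are assumptions made for illustration only.

```python
import random

def simulate_run(work, interval, ckpt_cost, restart_cost, mtbf, seed=0):
    """Return the simulated wall-clock time needed to finish `work` units.

    Hypothetical model: failures arrive with exponentially distributed
    gaps (mean `mtbf`); a failure discards all progress made since the
    last checkpoint. Pass mtbf=None for a failure-free run.
    """
    rng = random.Random(seed)
    clock = 0.0   # simulated wall-clock time
    saved = 0.0   # progress captured by the last checkpoint
    next_failure = rng.expovariate(1.0 / mtbf) if mtbf else float("inf")

    while saved < work:
        segment = min(interval, work - saved)
        # A checkpoint is written after every segment except the final one.
        cost = segment + (ckpt_cost if saved + segment < work else 0.0)
        if clock + cost <= next_failure:
            clock += cost
            saved += segment
        else:
            # Failure mid-segment: roll back to `saved`, pay the restart cost.
            clock = next_failure + restart_cost
            next_failure = clock + rng.expovariate(1.0 / mtbf)
    return clock
```

Calling `simulate_run` with different `interval` values under the same seed gives a crude way to compare checkpoint frequencies; a real simulator would also model message passing and coordination between processes.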
6.
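Design and Implementation of a Fault-Tolerant CORBA System (total citations: 3; self-citations: 0; citations by others: 3)
CORBA is one of the most popular standards for object-based middleware platforms. CORBA shields applications from the heterogeneity of distributed systems. However, CORBA does not yet address fault tolerance, which is one of the core issues for distributed applications running in heterogeneous environments. Consequently, many proposals for adding reliability and availability to CORBA applications have appeared in the literature in recent years. This paper analyzes the strengths and weaknesses of these proposals and presents a novel CORBA-compatible approach, which differs from the approach of distributing reliable structures in an asynchronous environment.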
7.
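To address fault tolerance under transactional memory mechanisms, this paper proposes a fault-recovery method for transactional memory systems based on transaction rollback. The method exploits the transactional memory system's own version management mechanism and avoids the overhead of saving additional checkpoint data, thereby achieving efficient fault recovery. The correctness of the method is proved via the isolation property of the fault-tolerant transactional memory system. Finally, the method's performance was measured with five test programs, including four typical SPLASH-2 workloads. The experimental results show that, compared with the classic checkpointing mechanism, the method avoids the extra checkpoint-saving overhead and also has a lower fault-recovery overhead.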
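The idea of reusing version management for recovery can be sketched as follows. This is a toy model under the assumption of lazy version management (writes buffered until commit); the abstract does not say which versioning scheme the authors use, and the class `VersionedMemory` and its methods are hypothetical names, not the paper's design.

```python
class VersionedMemory:
    """Toy transactional memory: writes are buffered as a tentative
    version and become visible only on commit (assumed lazy versioning)."""

    def __init__(self):
        self._committed = {}  # last committed value per address
        self._pending = {}    # txn_id -> {address: tentative value}

    def begin(self, txn_id):
        self._pending[txn_id] = {}

    def write(self, txn_id, addr, value):
        self._pending[txn_id][addr] = value

    def read(self, txn_id, addr):
        # A transaction sees its own tentative writes first.
        return self._pending[txn_id].get(addr, self._committed.get(addr))

    def commit(self, txn_id):
        self._committed.update(self._pending.pop(txn_id))

    def recover_from_fault(self):
        """Abort every in-flight transaction and return their ids.

        Committed state is untouched, so no separate checkpoint is needed:
        the tentative versions already isolate the uncommitted work.
        """
        aborted = sorted(self._pending)
        self._pending.clear()
        return aborted
```

In this model, recovery amounts to discarding tentative versions and re-executing the aborted transactions, which is why no extra checkpoint data has to be saved.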
8.
The granularity of scheduling video streams can be categorized as cycle-scheduling and slot-scheduling, where a time cycle is further divided into time slots. To avoid resource conflicts and thereby increase the throughput of clustered video servers, this paper presents slot-scheduling using conflict-free scheduling and, in particular, cycle-scheduling using full-duplex scheduling and ordered scheduling. The pros and cons of applying slot-scheduling and cycle-scheduling to clustered video servers are also analyzed.
9.
10.
Haines Joshua, Lakamraju Vijay, Koren Israel, Krishna C. Mani 《The Journal of Supercomputing》2000,16(1-2):53-68
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.
11.
12.
13.
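As fields such as national defense and aerospace place ever higher demands on system availability and real-time performance, how to guarantee the high availability and hard real-time behavior of these application systems has become an urgent problem. This paper discusses fault detection and fault recovery techniques for high-availability real-time systems.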
14.
Panagiotis Katsaros, Lefteris Angelis, Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63
Checkpointing has a crucial impact on systems' performance and fault-tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response-time and fault-tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision-making procedure. We adopt a simulation-based approach for obtaining performance metrics that are afterwards used for determining a trade-off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault-tolerance mechanisms (replication-based or message-logging-based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT-CORBA), but it could also be applied in other (e.g. process-based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.
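The trade-off between excessive and deficient checkpointing can be made concrete with the classical first-order model for a single process. This is not the statistical procedure the paper proposes; it is a simplified sketch in which the function names and the assumption that a failure lands uniformly within a checkpoint interval are illustrative only.

```python
import math

def overhead_rate(interval, ckpt_cost, mtbf):
    """First-order expected overhead per unit of useful work.

    ckpt_cost / interval  : time spent writing checkpoints
    interval / (2 * mtbf) : expected rework, assuming a failure lands
                            uniformly within a checkpoint interval
    """
    return ckpt_cost / interval + interval / (2.0 * mtbf)

def first_order_optimum(ckpt_cost, mtbf):
    """Interval that minimizes overhead_rate (derivative set to zero)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)
```

With a checkpoint cost of 2 time units and a mean time between failures of 100, the model puts the best interval at 20; halving or doubling that interval raises the overhead rate from 0.20 to 0.25. In systems with independent checkpoint activities and several interacting design factors this closed form no longer applies directly, which is the setting the paper's simulation-based approach targets.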
15.
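To reduce the computational complexity of video compressed sensing in multimedia sensor networks, a frame-classification-based joint video reconstruction algorithm for multimedia sensor networks is proposed. Video frames are divided into key frames and non-key frames according to the joint sparsity model of the video data. For the underdetermined linear system that arises in compressed-sensing reconstruction, the correlated side information between key frames and non-key frames is used to initialize the reconstruction, and the system is solved with bound-constrained quadratic programming. Simulation results show that, compared with conventional video compressed-sensing algorithms, the proposed method effectively reduces reconstruction complexity and improves the real-time performance of video reconstruction while preserving reconstruction quality.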
16.
黎珊珊 《计算机与数字工程》2002,30(6):61-64,31
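This paper proposes an architecture for a real-time distributed computer system with fault-tolerance capability and studies fault-tolerance techniques for real-time distributed computer systems. In particular, it discusses in some depth node-level fault tolerance and the achievement of real-time performance in such systems, and proposes an implementation scheme.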
17.
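Based on the command-and-dispatch communication requirements of rail transit, a command-and-dispatch system solution built on Terrestrial Trunked Radio (TETRA) digital trunked wireless communication technology is proposed. The hardware and software architecture of the TETRA command-and-dispatch system is described, and information sharing is achieved through interconnection with other related systems. The paper focuses on design measures such as redundancy, fault tolerance, and data transmission control, which enhance system reliability and availability, improve the transmission performance of critical data, and meet the requirements of long-duration, uninterrupted operation.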
18.
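A new bio-inspired fault-tolerant system, embryonic bio-inspired hardware, is introduced. It designs an FPGA as a two-dimensional embryonic array of electronic cells, using the electronic cell array to mimic the multicellular structure of organisms, so that the hardware circuit has self-diagnosis and self-repair properties similar to those of biological cell tissue. Key techniques such as the hardware structure, error detection, and self-repair mechanisms of embryonic bio-inspired hardware are described in detail, and the system design method is illustrated with the design of a 4-bit controllable shift register. The application prospects of bio-inspired hardware are discussed, and current problems and priorities for further research are identified.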
19.
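This article briefly introduces the problem of real-time multimedia synchronization in distributed multimedia communication systems, as well as the characteristics and good adaptability of an adaptive synchronization algorithm, which can adapt to various network changes and delay characteristics. The algorithm is used to achieve intra-stream and inter-stream synchronization of audio and video.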