Found 20 similar documents (search time: 15 ms)
1.
Fault tolerance is an important design criterion for reliable and robust video-on-demand systems. Conventional fault-tolerant designs use either a primary-backup or an active-replication method to provide system fault tolerance. However, these approaches suffer from low utilization of the backup or replication system. In this paper we propose two playback-recovery schemes for distributed video-on-demand systems: the forward playback-recovery scheme and the backward playback-recovery scheme. Unlike conventional fault-tolerant designs, our schemes use existing playback resources to recover faulty playbacks without allocating new resources, significantly reducing recovery overhead. To use the schemes effectively, we developed a distributed algorithm that determines the order of, and gaps between, the playbacks on the distributed video-on-demand servers so that the overhead of recovering from a server failure is minimized. This algorithm achieves N – 1 fault-tolerant resiliency for N-server video-on-demand systems. In addition, three server-recovery policies are presented to guide surviving servers in applying the proper scheme to recover faulty playbacks, thus reducing overall recovery costs. Simulation results show that the proposed recovery schemes are effective and useful in designing fault-tolerant multiple-server video-on-demand systems.
2.
3.
Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault-tolerant and scalable enough to meet the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been devoted to this challenge so as to accommodate changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders the practicality of the service. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion so that the services on nodes become immune to failures with different patterns (e.g., network partition and single point of failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.
4.
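Design and Implementation of a Fault-Tolerant CORBA System (total citations: 3, self-citations: 0, citations by others: 3)
CORBA is one of the most popular standards for middleware platforms based on object technology. CORBA shields applications from the heterogeneity of distributed systems. However, CORBA does not yet address fault tolerance, which is one of the core issues for distributed applications running in heterogeneous environments. As a result, many proposals for adding reliability and availability to CORBA applications have appeared in the literature in recent years. This paper analyzes the strengths and weaknesses of these proposals and presents a novel CORBA-compatible approach, which differs from methods for distributing reliable structures in asynchronous environments.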
5.
The granularity of scheduling video streams can be categorized as cycle-scheduling and slot-scheduling, where a time cycle is further divided into time slots. To avoid resource conflicts and thereby increase the throughput of clustered video servers, this paper presents slot-scheduling using conflict-free scheduling and, in particular, cycle-scheduling using full-duplex scheduling and ordered scheduling. The pros and cons of applying slot-scheduling and cycle-scheduling to clustered video servers are also analyzed.
6.
Checkpoint and rollback recovery is a well-known technique for providing fault tolerance to long-running distributed applications. The performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performance. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents a performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd.
7.
Haines Joshua, Lakamraju Vijay, Koren Israel, Krishna C. Mani 《The Journal of Supercomputing》2000, 16(1-2): 53-68
As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.
8.
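To meet the command and dispatch communication requirements of rail transit, this paper proposes a command and dispatch system solution based on Terrestrial Trunked Radio (TETRA) digital trunked wireless communication technology. It describes the hardware and software architecture of the TETRA command and dispatch system, which shares information by interconnecting with other related systems. The paper focuses on design measures such as redundancy, fault tolerance, and data transmission control, which enhance system reliability and availability, improve the transmission performance of critical data, and meet the requirements of long-duration, uninterrupted operation.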
9.
黎珊珊 《计算机与数字工程》2002,30(6):61-64,31
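This paper proposes an architecture for a fault-tolerant real-time distributed computer system and studies fault-tolerance techniques for such systems. In particular, it discusses in some depth node-level fault tolerance and the realization of real-time behavior in real-time distributed computer systems, and proposes an implementation scheme.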
10.
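This paper introduces a new bio-inspired fault-tolerant system, embryonic bio-inspired hardware. The FPGA is designed as a two-dimensional embryonic array of electronic cells; the electronic cell array mimics the multicellular structure of living organisms, giving the hardware circuit self-diagnosis and self-repair properties similar to those of biological cell tissue. The paper describes in detail the key techniques of embryonic bio-inspired hardware, including its hardware structure and its error detection and self-repair mechanisms, and uses the design of a 4-bit controllable shift register as an example to illustrate the system design method. It also looks ahead to applications of bio-inspired hardware and points out current problems and priorities for further research.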
11.
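This article briefly introduces the problem of real-time multimedia synchronization in distributed multimedia communication systems, together with the characteristics and good adaptability of an adaptive synchronization algorithm: it can adapt to a variety of network changes and delay characteristics. The algorithm is used to achieve internal and external synchronization of audio and video.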
12.
Fault Prediction and Compensation Functions in a Diagnostic Knowledge-Based System for Hydraulic Systems (total citations: 3, self-citations: 0, citations by others: 3)
Fault prediction and fault compensation are beneficial for production technology and give a new dimension to fault diagnosis in technical systems. The overall goal of this paper is to present fault prediction and fault compensation procedures as they are studied, implemented and embedded in a real-time expert system. This expert system detects and diagnoses faults in hydraulic systems. For this purpose, dynamic modelling information, on-line sensor information, special features of the domain of hydraulic systems and expert systems technology are used co-operatively.
13.
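Traditional disaster recovery systems store back-end data directly on disks at the disaster recovery center. This centralized storage brings a series of problems: stored data is easily damaged, storage capacity cannot be expanded online, and storage performance keeps degrading as disk capacity grows. To address these problems, a disaster recovery system that supports cluster storage was designed and implemented. By deploying the GlusterFS distributed file system in the back end of the disaster recovery system, data is stored in a distributed and replicated manner, which greatly improves data safety, scalability, and storage performance and overcomes the problems above.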
14.
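A Graph-Oriented Approach to Dynamic Configuration and Fault Tolerance of Distributed Software (total citations: 1, self-citations: 0, citations by others: 1)
A new method is proposed that supports fault tolerance of component-based distributed software through dynamic configuration. The method adopts the graph-oriented GOP programming model: the architecture of the whole distributed software system is described by a single logical graph, and dynamic configuration of the system is carried out by executing a set of predefined operations on the graph. Applying this dynamic configuration when a fault or exception is detected supports system fault tolerance. The paper describes the basic model of the method, the system structure, and a CORBA-based prototype implementation.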
15.
A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focus on eliminating shared memory checkpointing during failure-free execution. Each node independently takes checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to reconstruct pages efficiently, we also introduce an XOR-diffing technique. The diff logs, which are created by XOR operations during failure-free execution, can be applied to any version of the remote copies, either backward or forward, for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing.
This research is supported by KISTEP under the National Research Laboratory program.
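The XOR-diffing idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `xor_diff` and `apply_diff`, the 8-byte page size, and the page contents are all hypothetical. The sketch only shows why a diff built with XOR can move a page copy either forward or backward: XOR is its own inverse.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-length page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page copy.

    Because (old ^ new) ^ old == new and (old ^ new) ^ new == old,
    the same call rolls an older copy forward or a newer copy backward.
    """
    return xor_diff(page, diff)


# Three successive versions of a hypothetical 8-byte page.
v0 = bytes([0, 1, 2, 3, 4, 5, 6, 7])
v1 = bytes([0, 1, 9, 3, 4, 5, 6, 7])
v2 = bytes([8, 1, 9, 3, 4, 5, 6, 0])

# Diff logs created during failure-free execution.
d01 = xor_diff(v0, v1)
d12 = xor_diff(v1, v2)

# Forward recovery: an old remote copy (v0) is brought up to v2.
forward = apply_diff(apply_diff(v0, d01), d12)

# Backward recovery: a newer remote copy (v2) is rolled back to v1.
backward = apply_diff(v2, d12)
```

In this sketch a recovering node needs only whichever remote copy is available plus the logged diffs between that copy and the target version, regardless of whether the copy is older or newer than the target.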
16.
Dimitris Th. Askounis, Vassilis Assimakopoulos, John Psarras 《Journal of Intelligent Manufacturing》1994, 5(5): 323-331
Fault tolerance in computerized systems involved in production has become an ever more important requirement. Existing fault tolerance approaches, wherever used, deal mainly with hardware faults. Nevertheless, the vast majority of contemporary system failures are software related. This paper introduces a knowledge-based approach to handling software related faults occurring in supervisory control systems. These systems are event driven and use data, stored in complex databases, to react to events coming from different kinds of devices by identifying, scheduling, initiating and monitoring operations. Failure of part of the supervisory control system's software to behave rationally when unexpected events occur is called an application fault. The approach introduced in this paper is based on a supervisory control system reference model which reveals the set of all possible application faults together with the major functions of the recovery processes associated with each fault, and leads to a high-level knowledge-based system architecture capable of handling every fault-related condition. This system is called PROFIT (Intelligent PROduction systems Fault Tolerance) and consists of three main components: the fault diagnosis module, the instant fault correction module and the learning module, co-ordinated by a PROFIT meta-level module. The prototype version of PROFIT is analysed and the development as well as the run-time environment that prove the applicability and effectiveness of the system are presented.
17.
刘灵辉 《数字社区&智能家居》2009,(24)
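Video on demand is a new kind of information service. As information service operators, telecom carriers need to build carrier-grade video-on-demand systems in order to deliver carrier-grade video-on-demand services effectively. To meet the design requirements of such services, this paper proposes a hierarchical distributed system architecture for carrier-grade video on demand. Under this architecture, seven functional systems are integrated: the video service subsystem, the fault-tolerance subsystem, the program management subsystem, the program distribution subsystem, the program acquisition and editing subsystem, the user authentication and billing system, and the network management subsystem. Together they form a complete carrier-grade video-on-demand system that delivers video-on-demand service across a metropolitan area.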
18.
19.
Fault tolerance is especially important for computer systems that require a high degree of confidence. Computer Integrated Manufacturing (CIM) is an area where computer systems must not be disturbed by uncontrolled failures. This article deals with two problems that are related to fault tolerance and network partitions in automated manufacturing systems. The first problem relates to the distribution of information in partitioned data networks in CIM systems. We indicate how to overcome this problem by using the material network as a redundant data network. The second problem relates to fault detection and diagnosis in manufacturing systems. The question is whether the indication of a fault means that a production unit itself has actually broken down, or whether the indication is instead due to disturbances in the transmission of material, in which case the production unit continues to operate properly despite indications to the contrary. We describe how the material network can be used for detection and diagnosis.
20.
GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer the reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method, which recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, on average by 73.5% when errors occur earlier during execution and by 74.6% when errors occur later. In addition, PartialRC also reduces the error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.