1.

A Distributed Fault-Tolerant Design for Multiple-Server VOD Systems

Shyu Ing-Jye, Shieh Shiuh-Pyng 《Multimedia Tools and Applications》1999,8(2):219-247

Fault tolerance is an important design criterion for reliable and robust video-on-demand systems. Conventional fault-tolerant designs use either a primary backup or an active replication method to provide system fault tolerance. However, these approaches suffer from low utilization of the backup or replication system. In this paper we propose two playback-recovery schemes for distributed video-on-demand systems called the forward playback-recovery scheme and the backward playback-recovery scheme. Unlike conventional fault-tolerant designs, our schemes use existing playback resources to recover faulty playbacks without allocating new resources, significantly reducing recovery overhead. To use the schemes effectively, we developed a distributed algorithm for determining the order and gap information between the playbacks on the distributed video-on-demand servers so that overhead for recovering from a server failure can be minimized. This algorithm achieves N – 1 fault-tolerant resiliency for N-server video-on-demand systems. In addition, three server-recovery policies are also presented to guide surviving servers in applying the proper scheme to recover faulty playbacks, thus reducing overall recovery costs. Simulation results show that the proposed recovery schemes are effective and useful in designing fault-tolerant multiple-server video-on-demand systems.

2.

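Performance Analysis of Fault Detection and Recovery Techniques for Distributed Ring Networks

Ruan Wei, Zhang Shuaiyong, Liu Guoan 《Industrial Control Computer》2013,(10):64-68

Based on an analysis of the fault diagnosis and recovery mechanisms of the DRP distributed ring network redundancy protocol, a DRP fault recovery time model is established that divides fault recovery time into fault-location waiting time, fault alarm time, and fault handling time. For switching-device management-module faults and communication-link faults, and for the different ways in which DRP detects each kind of fault, the factors that affect recovery time are analyzed. The main factors limiting improvement of fault recovery time are derived from the algorithm, and experiments verify the recovery times of the different faults in an EPA field network.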
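The three-part decomposition of fault recovery time named in this abstract can be written as a simple sum. The symbols below are illustrative assumptions and do not come from the paper, whose actual model may include further terms or dependencies between the components.

```latex
% Assumed notation (not from the paper):
%   T_rec    total fault recovery time
%   T_wait   fault-location waiting time
%   T_alarm  fault alarm time
%   T_handle fault handling time
T_{\mathrm{rec}} = T_{\mathrm{wait}} + T_{\mathrm{alarm}} + T_{\mathrm{handle}}
```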

3.

Fault Tolerance and Recovery for Group Communication Services in Distributed Networks

Wang Yuehua, Zhou Zhong, Wu Wei 《Journal of Computer Science and Technology》2012,27(2):298-312

Group communication services (GCSs) are becoming increasingly important as a wide field of promising applications has emerged to serve millions of users distributed across the world. However, it is challenging to make the service fault-tolerant and scalable enough to meet the voluminous demand of users in a distributed network (DN). While many reliable group communication protocols have been dedicated to addressing this challenge so as to accommodate changes in the network, they are often costly or require complicated strategies to handle the service interruptions caused by node departures or link failures, which hinders the practicability of the service. In this paper, we present two schemes to address these challenges. The first is a location-aware replication scheme called NS, which places replicas in a dispersed fashion that enables the services on nodes to gain immunity to failures with different patterns (e.g., network partition and single point failure) while keeping replication overhead low. The second is a novel failure recovery scheme that exploits the independence between service recovery and structure recovery in the time domain to achieve quick failure recovery. Our simulation results indicate that the two proposed schemes outperform the existing schemes and simple alternative schemes in service success rate, recovery latency, and communication cost.

4.

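Design and Implementation of a Fault-Tolerant CORBA System (total citations: 3; self-citations: 0; citations by others: 3)

Xue Wenge, Li Zengzhi, Wang Yu, Lu Jianping 《Mini-Micro Systems》2002,23(10):1205-1208

CORBA is one of the most popular standards for object-based middleware platforms. CORBA shields applications from the heterogeneity of distributed systems. However, CORBA does not yet address fault tolerance, which is one of the core problems for distributed applications running in heterogeneous environments. As a result, many proposals for adding reliability and availability to CORBA applications have appeared in the literature in recent years. This paper analyzes the strengths and weaknesses of these proposals and presents a novel CORBA-compatible approach, which differs from the approach of distributing reliable structures in an asynchronous environment.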

5.

Efficient Algorithms for Slot-Scheduling and Cycle-Scheduling of Video Streams on Clustered Video Servers

Chow-Sing Lin, Min-You Wu, Wei Shu 《Multimedia Tools and Applications》2001,13(2):213-227

The granularity of scheduling video streams can be categorized as cycle-scheduling and slot-scheduling, where a time cycle is further divided into time slots. To avoid resource conflict and thereby increase the throughput of clustered video servers, slot-scheduling using conflict-free scheduling and especially cycle-scheduling using full-duplex scheduling and ordered scheduling are presented in the paper. The pros and cons of applying slot-scheduling and cycle-scheduling on clustered video servers are also analyzed.

6.

Himadri Sekhar Paul, Arobinda Gupta, R. Badrinath 《Concurrency and Computation》2003,15(15):1363-1386

Checkpoint and rollback recovery is a well‐known technique for providing fault tolerance to long‐running distributed applications. Performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performances. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd.

7.

Application-Level Fault Tolerance as a Complement to System-Level Fault Tolerance

Haines Joshua, Lakamraju Vijay, Koren Israel, Krishna C. Mani 《The Journal of supercomputing》2000,16(1-2):53-68

As multiprocessor systems become more complex, their reliability will need to increase as well. In this paper we propose a novel technique which is applicable to a wide variety of distributed real-time systems, especially those exhibiting data parallelism. System-level fault tolerance involves reliability techniques incorporated within the system hardware and software whereas application-level fault tolerance involves reliability techniques incorporated within the application software. We assert that, for high reliability, a combination of system-level fault tolerance and application-level fault tolerance works best. In many systems, application-level fault tolerance can be used to bridge the gap when system-level fault tolerance alone does not provide the required reliability. We exemplify this with the RTHT target tracking benchmark and the ABF beamforming benchmark.

8.

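Design of a TETRA Command and Dispatch System for Rail Transit

Jiang Guohua 《Computer & Network》2011,(13):40-43

To meet the command and dispatch communication requirements of rail transit, a command and dispatch system solution based on Terrestrial Trunked Radio (TETRA) digital trunked wireless communication technology is proposed. The hardware and software architecture of the TETRA command and dispatch system is described, and information sharing is achieved through interconnection with other related systems. The paper focuses on design measures such as redundancy, fault tolerance, and data transmission control, which strengthen system reliability and availability, improve the transmission performance of critical data, and meet the requirement for long-duration, uninterrupted operation.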

9.

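Research on Fault-Tolerance Techniques for Real-Time Distributed Computer Systems

Li Shanshan 《Computer & Digital Engineering》2002,30(6):61-64,31

This paper proposes an architecture for a real-time distributed computer system with fault-tolerance capability and studies fault-tolerance techniques for such systems. In particular, it discusses in depth fault tolerance for node machines and the realization of real-time behavior in real-time distributed computer systems, and proposes an implementation scheme.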

10.

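Research on Fault-Tolerance Mechanisms and Design Methods for Embryonic Bio-Inspired Hardware Architectures

Yao Rui, Wang Youren, Yu Shenglin 《Computer Measurement & Control》2005,13(9):973-975

A new bio-inspired fault-tolerant system, embryonic bio-inspired hardware, is introduced. It designs an FPGA as a two-dimensional embryonic array of electronic cells and uses the electronic cell array to emulate the multicellular structure of living organisms, giving hardware circuits self-diagnosis and self-repair properties similar to those of biological cell tissue. Key techniques of embryonic bio-inspired hardware, including its hardware structure, error detection, and self-repair mechanisms, are described in detail, and its system design method is illustrated with the design of a four-bit controllable shift register. The application prospects of bio-inspired hardware are discussed, and current problems and priorities for further research are identified.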

11.

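An Audio and Video Synchronization Algorithm for Distributed Multimedia Communication Systems

Hu Yi, Hu Yongmei 《Computer Engineering and Applications》2001,37(17):135-137

This article briefly introduces the real-time multimedia synchronization problem in distributed multimedia communication systems, together with the characteristics and good adaptability of an adaptive synchronization algorithm, which can adapt to a variety of network changes and delay characteristics. The algorithm is used to achieve internal and external synchronization of audio and video.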

12.

Fault Prediction and Compensation Functions in a Diagnostic Knowledge-Based System for Hydraulic Systems (total citations: 3; self-citations: 0; citations by others: 3)

Chr. Angeli, A. Chatzinikolaou 《Journal of Intelligent and Robotic Systems》1999,25(2):153-165

Fault prediction and fault compensation are beneficial for the production technology and give a new dimension to fault diagnosis in technical systems. The overall goal of this paper is the presentation of fault prediction and fault compensation procedures as they are studied, implemented and embedded in a real time expert system. This expert system detects and diagnoses faults in hydraulic systems. For this purpose dynamic modelling information, on-line sensor information, special features of the domain of hydraulic systems and expert systems technology are used co-operatively.

13.

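Design and Implementation of a Disaster Recovery System Supporting Cluster Storage

Liu Xinguo, Le Hongchao 《Computer Security》2013,(10):12-14

Traditional disaster recovery systems use direct disk storage at the disaster recovery center as their back-end storage. This approach suffers from a series of problems caused by centralized storage: stored data is easily damaged, storage capacity cannot be expanded online, and storage performance steadily degrades as disk capacity grows. To address these problems, a disaster recovery system supporting cluster storage is designed and implemented. By deploying the GlusterFS distributed file system in the back end of the disaster recovery system, data is stored in a distributed and replicated manner. The safety, scalability, and performance of data storage are all greatly improved, and the problems above are overcome.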

14.

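A Graph-Oriented Approach to Dynamic Configuration and Fault Tolerance of Distributed Software (total citations: 1; self-citations: 0; citations by others: 1)

Song Yi, Liu Yunchao 《Computer Applications》2003,23(12):37-41

A new approach is proposed that supports fault tolerance in component-based distributed software through dynamic configuration. The approach adopts the graph-oriented programming (GOP) model, in which the architecture of the whole distributed software system is described by a logical graph, and dynamic configuration of the system is carried out by executing a set of predefined operations on the graph. Applying such dynamic configuration when a fault or exception is detected supports fault tolerance in the system. The paper describes the basic model of the approach, the system structure, and a CORBA-based prototype implementation.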

15.

Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

Soyeon Park, Seung Ryoul Maeng 《The Journal of supercomputing》2006,35(2):141-154

A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduced an XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applied to any version of the remote copies, either backward or forward, for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing. This research is supported by KISTEP under the National Research Laboratory program.
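The reason an XOR diff can move a page either forward or backward is that XOR is its own inverse. The following minimal Python sketch illustrates that property only; the function names `xor_diff` and `apply_diff` and the sample page contents are hypothetical and are not taken from the paper, whose logging and page-reconstruction protocol is considerably more involved.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equal-length page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff to a page.

    Because (a ^ b) ^ a == b and (a ^ b) ^ b == a, the same call
    rolls an old version forward or a new version backward.
    """
    return xor_diff(page, diff)


# Hypothetical page contents, used only for illustration.
v1 = b"page contents v1"
v2 = b"page contents v2"

diff = xor_diff(v1, v2)          # logged during failure-free execution
forward = apply_diff(v1, diff)   # old copy + diff -> new version
backward = apply_diff(v2, diff)  # new copy + diff -> old version
```

A single stored diff therefore serves whichever version of the remote copy happens to survive, which is the property the abstract relies on.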

16.

Fault tolerance in supervisory control systems: a knowledge-based approach

Dimitris Th. Askounis, Vassilis Assimakopoulos, John Psarras 《Journal of Intelligent Manufacturing》1994,5(5):323-331

Fault tolerance in computerized systems involved in production has become an ever more important requirement. Existing fault tolerance approaches, wherever used, deal mainly with hardware faults. Nevertheless, the vast majority of contemporary system failures are software related. This paper introduces a knowledge-based approach to handling software related faults occurring in supervisory control systems. These systems are event driven and use data, stored in complex databases, to react to events coming from different kinds of devices by identifying, scheduling, initiating and monitoring operations. Failure of part of the supervisory control system's software to behave rationally when unexpected events occur is called an application fault. The approach introduced in this paper is based on a supervisory control system reference model which reveals the set of all possible application faults together with the major functions of the recovery processes associated with each fault, and leads to a high-level knowledge-based system architecture capable of handling every fault-related condition. This system is called PROFIT (Intelligent PROduction systems Fault Tolerance) and consists of three main components: the fault diagnosis module, the instant fault correction module and the learning module, co-ordinated by a PROFIT meta-level module. The prototype version of PROFIT is analysed and the development as well as the run-time environment that prove the applicability and effectiveness of the system are presented.

17.

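Design and Implementation of a Carrier-Grade Video-on-Demand System

Liu Linghui 《Digital Community & Smart Home》2009,(24)

Video on demand is a new kind of information service. As information service operators, telecom carriers need to build carrier-grade video-on-demand systems in order to deliver carrier-grade video-on-demand service effectively. To meet the design requirements of such a service, this paper proposes a hierarchical distributed architecture for carrier-grade video on demand. Within this architecture it integrates seven functional systems: the video service subsystem, the fault-tolerance subsystem, the program management subsystem, the program distribution subsystem, the program acquisition and editing subsystem, the user authentication and billing system, and the network management subsystem. The result is a complete carrier-grade video-on-demand system that delivers video-on-demand service across a metropolitan area.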

18.

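MMR: An Efficient Fault Recovery Method for Flash Memory Databases

Wang Manli, Xing Yugang, Wang Hanhu, Ma Dan, Chen Mei 《Microcomputer Development》2012,(1):40-44

Fast and effective recovery after a failure is crucial for flash memory databases. Researchers have proposed several fault recovery methods for flash memory databases, but these methods have shortcomings such as high transaction commit cost and high runtime overhead. Taking the characteristics of flash memory into account, and building on in-page logging, a log-based update method from storage management, this paper discusses recovery processing for flash memory databases and its implementation. Transaction failure recovery is achieved by recording in-memory logs, and system failure recovery by building a mirror directory. Finally, experiments verify that MMR has a lower recovery time and fewer write operations than traditional methods.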

19.

Fault tolerance in partitioned manufacturing networks

Anders Adlemo, Sven-Arne Andréasson 《Journal of Systems Integration》1993,3(1):63-84

Fault tolerance is especially important for computer systems that require a high degree of confidence. Computer Integrated Manufacturing (CIM) is an area where computer systems must not be disturbed by uncontrolled failures. This article deals with two problems that are related to fault tolerance and network partitions in automated manufacturing systems. The first problem relates to the distribution of information in partitioned data networks in CIM systems. We indicate how to overcome this problem by using the material network as a redundant data network. The second problem relates to fault detection and diagnosis in manufacturing systems. The problem is whether the indication of a fault means that a production unit itself has actually broken down, or that the indication is instead due to disturbances in the transmission of material. That is, the production unit continues to operate properly despite indications to the contrary. We describe how the material network can be used for detection and diagnosis.

20.

PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

Xu Xinhai, Yang Xuejun, Xue Jingling, Lin Yufei, Lin Yisong 《Journal of Computer Science and Technology》2012,27(2):240-255

GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability guarantees required by many HPC applications. By analyzing the SIMT (single-instruction, multiple-thread) characteristics of programs running on GPGPUs, we have developed PartialRC, a new checkpoint-based compiler-directed partial recomputing method, for achieving efficient fault recovery by leveraging the phenomenal computing power of GPGPUs. In this paper, we introduce our PartialRC method that recovers from errors detected in a code region by partially re-computing the region, describe a checkpoint-based fault-tolerance framework developed on PartialRC, and discuss an implementation on the CUDA platform. Validation using a range of representative CUDA programs on NVIDIA GPGPUs against FullRC (a traditional full-recomputing Checkpoint-Rollback-Restart fault recovery method for CPUs) shows that PartialRC significantly reduces the fault recovery overheads incurred by FullRC, by an average of 73.5% when errors occur earlier during execution and 74.6% when errors occur later. In addition, PartialRC also reduces error detection overheads incurred by FullRC during fault recovery while incurring negligible performance overheads when no fault happens.