In this paper we consider the queueing analysis of a fault-tolerant computer system. The failure/repair behavior of the server is modeled by an irreducible continuous-time Markov chain. Jobs arrive at the system according to a Poisson process and are served under the FCFS discipline. A failure may cause the loss of the work already done on the job in service, if any; in this case the interrupted job is repeated as soon as the server is ready to deliver service. In addition to the delays due to failures and repairs, jobs suffer delays due to queueing. We present an exact queueing analysis of the system and study the steady-state behavior of the number of jobs in the system. As a numerical example, we consider a system with two processors subject to failures and repairs.

7.

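Design of a Totally Self-Checking Quadruple-Redundant Fault-Tolerant System (total citations: 1; self-citations: 0; citations by others: 1)

李洪波 车明 《微处理机》2008,29(3)

A totally self-checking quadruple-redundant fault-tolerant system consists of a conventional quadruple-redundant fault-tolerant system managed by a totally self-checking circuit. The self-checking circuit detects error information from the redundant modules as well as errors in the checking circuit itself. Error indication relies mainly on the error-information output, which can be used to generate a stop signal that blocks error propagation. The indication code produced by an internal error in the checking circuit is independent of the error information from the redundant modules, yet it can mask errors in both the redundant modules and the totally self-checking circuit. The system offers high availability and maintainability.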

8.

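Design of Robust Fault-Tolerant Control Systems (total citations: 32; self-citations: 3; citations by others: 32)

孙金生 李军 冯缵刚 胡寿松 《控制理论与应用》1994,11(3):376-380

This paper considers fault-tolerant control of observer-based state feedback control systems. It proposes a controller design method that possesses integrity against sensor failures, then discusses the case with parameter perturbations, gives the design steps for a robust fault-tolerant controller, and verifies the effectiveness of the method with a design example and simulation results.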

9.

Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

Jing Mei Kenli Li Xu Zhou Keqin Li 《Journal of Grid Computing》2015,13(4):507-525

As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and have an adverse effect on solving large-scale applications. Hence, fault-tolerant scheduling is an imperative step for large-scale computing systems. The existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures affect the execution of tasks. Such active replication strategies not only waste resources but also sacrifice the makespan. What is more, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which can overcome the above drawbacks. FTDR keeps listening for processor failures and reschedules the suspended tasks once failures occur. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. Randomly generated DAGs are tested in our experiments. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.

10.

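Robust Fault-Tolerant Control of Uncertain Time-Delay Systems (total citations: 45; self-citations: 1; citations by others: 45)

孙金生 李军 王执铨 《控制理论与应用》1998,15(2):267-271

This paper considers fault-tolerant control of linear time-delay systems. It gives a sufficient condition for a time-delay system to possess integrity against sensor failures and extends it to actuator failures. It then considers robust fault-tolerant control of systems with parameter uncertainty, gives a design method and design steps for robust fault-tolerant control of time-delay systems, and verifies the effectiveness of the method with a design example and simulation results.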

11.

Availability Analysis of Fault-Tolerant Flight Control Systems

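王少萍 孔德良 《计算机工程与科学》2001,23(5):84-86

Fault-tolerance techniques are widely used in flight control system design to improve reliability and availability. To address the difficulty of predicting the availability of complex fault-tolerant flight control systems, this paper combines dynamic fault trees with Markov processes: each independent subtree is converted into an event with an equivalent failure rate and an equivalent repair rate, and recursing over these events yields the availability of the complex system, thereby enabling availability analysis of fault-tolerant flight control systems.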
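To make the idea of equivalent failure and repair rates concrete, here is a minimal Python sketch. It is not the paper's dynamic-fault-tree procedure: it assumes the simplest textbook case of independent repairable components in series, each a two-state (up/down) Markov model with constant rates. The function names `availability` and `series_equivalent` are hypothetical.

```python
from typing import Iterable, Tuple


def availability(failure_rate: float, repair_rate: float) -> float:
    """Steady-state availability of a two-state (up/down) Markov model."""
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("need failure_rate >= 0 and repair_rate > 0")
    return repair_rate / (failure_rate + repair_rate)


def series_equivalent(components: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Collapse independent series components into one (failure_rate, repair_rate).

    Assumption (ours, not the paper's): the system is up only when every
    component is up, and components fail and are repaired independently.
    """
    total_failure = 0.0
    system_availability = 1.0
    for failure_rate, repair_rate in components:
        total_failure += failure_rate  # any single failure takes the system down
        system_availability *= availability(failure_rate, repair_rate)
    if total_failure == 0:
        raise ValueError("at least one component must have a nonzero failure rate")
    # Pick the repair rate so the two-state model reproduces the system availability.
    equivalent_repair = total_failure * system_availability / (1.0 - system_availability)
    return total_failure, equivalent_repair


if __name__ == "__main__":
    lam, mu = series_equivalent([(0.001, 0.1), (0.002, 0.1)])
    print(lam, mu, availability(lam, mu))
```

Because the result is again a single (failure rate, repair rate) pair, it can be fed back in as one component of a larger structure, which is the recursive flavor the abstract describes.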

12.

Units of Computation in Fault-Tolerant Distributed Systems

Mohan Ahuja Shivakant Mishra 《Journal of Parallel and Distributed Computing》1997,40(2):194

We develop a framework that helps in understanding a fault-tolerant distributed system and so aids in designing such systems. We illustrate the uses of the developed work in application areas such as checkpointing and recovery, phase termination detection, stable property detection, implementing membership protocols, debugging, and design of programming languages. We define a unit of computation, and refer to it as a molecule. A molecule has a well-defined interface with other molecules. The smallest such unit, an indivisible molecule, is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order. Such a view provides insights into understanding the computation, particularly for a fault-tolerant system, where it is important to guarantee that a unit of computation is either completely executed or not executed at all, and where system designers need to reason about the states after execution of such units. Molecules are essentially a generalization of atomic actions.

13.

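D-Stable Robust Fault-Tolerant Control of Uncertain Discrete-Time Systems (total citations: 10; self-citations: 2; citations by others: 10)

孙金生 李军 王执铨 《控制理论与应用》1998,15(4):636-641

Based on Lyapunov stability theory and linear transformation techniques, this paper studies D-stable fault-tolerant control of discrete-time systems. It gives a sufficient condition for a D-stable system to possess integrity against sensor failures, then discusses D-stable robust fault-tolerant control of uncertain discrete-time systems, extends the results to actuator failures, and gives a design method for D-stable robust fault-tolerant control systems. Finally, a design example and simulation results verify the effectiveness of the method.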

14.

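Robust Fault-Tolerant Control of Uncertain Time-Delay Systems (total citations: 1; self-citations: 0; citations by others: 1)

邵克勇 宋金波 王建智 张会珍 宋衍茹 《自动化技术与应用》2006,25(9):1-3

This paper studies robust fault-tolerant control of linear uncertain time-delay systems. Based on Lyapunov theory, it proves that, in the presence of uncertainty, the system possesses integrity against actuator faults when a state feedback control law with memory is used. The controller parameters are solved with the LMI toolbox of MATLAB. Simulation demonstrates the effectiveness of the method.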

15.

Fault-Tolerant File-I/O for Portable Checkpointing Systems

Lyubashevskiy Igor Strumpen Volker 《The Journal of supercomputing》2000,16(1-2):69-92

The ftIO-system provides portable and fault-tolerant file-I/O by enhancing the functionality of the ANSI C file system without changing its application programmer interface and without depending on system-specific implementations of the standard file operations. The ftIO-system is an extension of the porch compiler and its runtime system. The porch compiler automatically generates code to save bookkeeping information about ftIO's transactional file operations in portable checkpoints. These portable checkpoints can be recovered on a binary incompatible architecture. We developed a new algorithm for supporting transactional file operations in ftIO. Rather than using the well-known two-phase commit protocol, this algorithm uses only a single bit of information and an atomic rename file operation to guarantee fault tolerance. In this paper, we describe our new ftIO algorithm, discuss design choices for ftIO, and provide experimental data of our ftIO prototype.
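The atomic rename mentioned in this abstract is a general idiom that can be shown in isolation. The sketch below is not the ftIO algorithm (ftIO targets ANSI C and also keeps a single bit of bookkeeping in the checkpoint); it only illustrates, in Python, how writing to a temporary file and then renaming it leaves readers with either the old or the new contents, never a partial write. The helper name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` via a temporary file and a rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the target directory so the rename
    # stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before committing
        os.replace(tmp_path, path)  # the commit point: old file or new file
    except BaseException:
        # On any failure, drop the temporary file; the original is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "state.bin")
        atomic_write(target, b"checkpoint 1")
        atomic_write(target, b"checkpoint 2")
        with open(target, "rb") as f:
            print(f.read())
```

A crash before `os.replace` leaves the previous file intact, and a crash after it leaves the new one, which is why a rename can serve as a commit step without a two-phase protocol.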

16.

A Distributed Fault-Tolerant Design for Multiple-Server VOD Systems

Shyu Ing-Jye Shieh Shiuh-Pyng 《Multimedia Tools and Applications》1999,8(2):219-247

Fault tolerance is an important design criterion for reliable and robust video-on-demand systems. Conventional fault-tolerant designs use either a primary backup or an active replication method to provide system fault tolerance. However, these approaches suffer from low utilization of the backup or replication system. In this paper we propose two playback-recovery schemes for distributed video-on-demand systems called the forward playback-recovery scheme and the backward playback-recovery scheme. Unlike conventional fault-tolerant designs, our schemes use existing playback resources to recover faulty playbacks without allocating new resources, significantly reducing recovery overhead. To use the schemes effectively, we developed a distributed algorithm for determining the order and gap information between the playbacks on the distributed video-on-demand servers so that overhead for recovering from a server failure can be minimized. This algorithm achieves N − 1 fault-tolerant resiliency for N-server video-on-demand systems. In addition, three server-recovery policies are presented to guide surviving servers in applying the proper scheme to recover faulty playbacks, thus reducing overall recovery costs. Simulation results show that the proposed recovery schemes are effective and useful in designing fault-tolerant multiple-server video-on-demand systems.

17.

Cooperative Diagnosis and Routing in Fault-Tolerant Multiprocessor Systems

《Journal of Parallel and Distributed Computing》1995,27(2):205-211

In this note, we consider the problem of fault-tolerant routing in multiprocessor systems when incomplete, or partial, diagnostic information is available. We first define a new type of partial diagnosis, known as k-reachability diagnosis. The overhead for k-reachability diagnosis increases with k, which specifies the radius of diagnostic information maintained by each node. We then present a routing algorithm, known as Algorithm Partial Route, that makes use of k-reachability diagnostic information and allows a trade-off between the amount of diagnostic information and the quality of routing. Partial Route is the first algorithm capable of handling systems of arbitrary topology containing an arbitrary number of faults. The worst-case performance of the algorithm in an n-node system is shown to be optimal when k = n − 1 and within a factor of 2 of optimal when k = 1. Simulation results on meshes and hypercubes are also presented that show that, in the average case, Algorithm Partial Route is nearly optimal for relatively small values of k.

18.

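Fault-Tolerant Design of Parallel Computer Systems (total citations: 1; self-citations: 1; citations by others: 1)

屈婉霞 蒋句平 杨晓东 徐炜遐 《计算机工程与科学》2005,27(9):69-70

Fault-tolerant design is an effective means of improving the reliability of computer systems. This paper proposes a fault-tolerant architecture for parallel computer systems with distributed shared main memory, focuses on the fault-diagnosis mechanism adopted by the architecture, and proposes an optimization strategy for configuring backup nodes in the system.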

19.

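A New Fault-Tolerant Scheduling Algorithm for Heterogeneous Real-Time Distributed Systems

刘怀 郑世友 费树岷 《小型微型计算机系统》2005,26(12):2154-2159

In general, tasks in a heterogeneous distributed real-time system do not all have the same period, and task deadlines are not equal to their periods; the system also contains some tasks with no fault-tolerance requirement. Existing task scheduling algorithms therefore generally cannot meet these requirements. For such systems, a new fault-tolerant scheduling algorithm is given that combines the primary/backup version technique with the EDF algorithm. The algorithm consists of two parts: a task allocation algorithm and a uniprocessor scheduling algorithm. For uniprocessor scheduling, the paper adopts the EDF algorithm; on this basis, a heuristic static task allocation algorithm is given. The schedulability of the system is analyzed, and the task schedulability condition and a method for setting the deadlines of the primary and backup versions are given. Simulation results show that the algorithm is effective.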
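For readers unfamiliar with EDF (earliest deadline first), the uniprocessor policy this abstract builds on, the following Python sketch simulates preemptive EDF in unit time slots. It is only an illustration of plain EDF under our own assumptions: it does not model the paper's primary/backup versions or its task allocation heuristic, and the job tuple layout and the name `edf_schedule` are hypothetical.

```python
import heapq
from typing import List, Optional, Tuple

# (release_time, execution_time, absolute_deadline, name); names must be unique.
Job = Tuple[int, int, int, str]


def edf_schedule(jobs: List[Job]) -> Tuple[List[Optional[str]], List[str]]:
    """Simulate preemptive EDF; return (timeline per time slot, jobs that missed)."""
    if any(exec_time < 1 for _, exec_time, _, _ in jobs):
        raise ValueError("execution times must be at least 1")
    pending = sorted(jobs, key=lambda job: job[0])  # ordered by release time
    ready: list = []  # min-heap of [deadline, name, remaining_time]
    timeline: List[Optional[str]] = []
    missed: List[str] = []
    now = 0
    next_job = 0
    while next_job < len(pending) or ready:
        # Admit every job released by the current time.
        while next_job < len(pending) and pending[next_job][0] <= now:
            _, exec_time, deadline, name = pending[next_job]
            heapq.heappush(ready, [deadline, name, exec_time])
            next_job += 1
        if not ready:
            timeline.append(None)  # processor idles for this slot
            now += 1
            continue
        current = ready[0]  # the ready job with the earliest deadline
        timeline.append(current[1])
        current[2] -= 1  # run it for one slot
        now += 1
        if current[2] == 0:
            heapq.heappop(ready)
            if now > current[0]:
                missed.append(current[1])
    return timeline, missed


if __name__ == "__main__":
    print(edf_schedule([(0, 3, 10, "A"), (1, 2, 4, "B")]))
```

In the usage example, job B arrives at time 1 with a tighter deadline and preempts A, so both jobs finish before their deadlines. Note that the deadline here is specified separately from any period, matching the abstract's point that the two need not coincide.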

20.

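Robust Fault-Tolerant Control of Systems with State Observers (total citations: 14; self-citations: 0; citations by others: 14)

黄书鹏 王诗宓 《控制理论与应用》2001,18(2):249-252

For state-observer feedback systems with uncertain parameters, a robust integrity control method against sensor failures is proposed. The method is also optimal with respect to a given quadratic performance index, and the design procedure is simple and practical.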