Similar literature
 20 similar documents found (search time: 46 ms)
1.
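Analysis of the optimal checkpoint interval in dual-machine fault-tolerant systems   Total citations: 2, self-citations: 0, citations by others: 2
Checkpointing is an important means of fault recovery in fault-tolerant computer systems. Because a checkpoint interval that is too large or too small degrades system performance, choosing an appropriate interval is an important factor in performance optimization. For dual-machine fault-tolerant systems, this paper proposes a system model based on checkpointing and rollback recovery, uses a Markov chain to derive an equation for the optimal checkpoint interval, and confirms the correctness of that equation experimentally.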
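The interval trade-off in this abstract can be illustrated with Young's classic first-order approximation. This is a stand-in, not the paper's Markov-chain equation for dual-machine systems, which the abstract does not reproduce. The sketch assumes a single system with exponentially distributed failures and a checkpoint cost much smaller than the mean time between failures; the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the checkpoint interval: sqrt(2 * C * MTBF).

    Assumption: this is NOT the equation derived in the paper; it is a
    well-known approximation used here only to show the trade-off.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost for a given interval.

    First term: time spent writing checkpoints (grows as the interval shrinks).
    Second term: expected rework after a failure, about half an interval per
    failure (grows as the interval lengthens).
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    c, m = 2.0, 100.0  # hypothetical values, in the same time unit
    best = young_interval(c, m)
    for t in (best / 2, best, best * 2):
        print(f"interval={t:6.1f}  overhead={overhead_fraction(t, c, m):.3f}")
```

Under this simplified cost model, intervals both shorter and longer than the estimate give a higher overhead fraction, which is the behaviour the abstract describes qualitatively.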

2.
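To reduce checkpointing overhead, an efficient non-blocking coordinated checkpointing algorithm with asynchronous storage is proposed. The algorithm lets multiple processes take checkpoints concurrently when their state information is small, writes to stable storage asynchronously only when that storage is idle, and allows checkpointing to proceed in parallel with process execution. Experimental results show that the algorithm reduces checkpointing overhead and improves system performance.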

3.
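Research on maintenance strategies for fault-tolerant distributed systems   Total citations: 1, self-citations: 0, citations by others: 1
1. Introduction. In many practical systems, insufficient attention to maintainability during use increases the system's maintenance costs; on the other hand, excessive maintenance not only fails to improve the system's reliability and availability but actually degrades its performance. Therefore, the system…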

4.
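Zhao Yi, Cao Zongyan, Zhu Peng, Chi Xuebin. Journal of Software (软件学报), 2013, 24(S2): 89-98
The Chinese Academy of Sciences supercomputing environment is a three-tier architecture that integrates the computing resources of the head center, the branch centers, and the institute-level centers. To improve the environment's reliability and provide stable, dependable computing services, research on its fault-tolerance mechanisms has become a priority. Building on a thorough survey of the basic ideas of fault tolerance and of the various computer fault-tolerance techniques, a fault-tolerance framework suited to the supercomputing environment is proposed. Following this framework, fault-tolerance schemes at different levels are given, their overheads are analyzed and compared, and the impact of each level's scheme on applications is verified.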

5.
6.
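Checkpointing algorithms in distributed systems   Total citations: 12, self-citations: 0, citations by others: 12
Checkpoints save and restore a program's running state and have important applications in process migration, fault tolerance, rollback debugging, and other areas. This paper gives a detailed classification and review of checkpointing algorithms in distributed systems. Checkpointing algorithms can be divided into single-process and distributed-program algorithms, and distributed-program algorithms can be further divided into asynchronous and consistent checkpointing algorithms. The paper also systematically introduces typical methods for improving checkpointing performance. These improved algorithms rely on two main strategies to reduce overhead and latency: first, reducing the amount of information that must be stored in the checkpoint file, as in incremental algorithms; second, increasing the parallelism between the checkpoint operation and the execution of the target program, as in main-memory algorithms. Finally, the paper discusses the limitations of current checkpointing algorithms and directions for further work.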
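The first strategy named in this abstract, storing less per checkpoint, can be sketched as an incremental checkpointer that records only the entries changed since the previous checkpoint. This is a minimal illustration under stated assumptions, not an algorithm taken from the surveyed paper: program state is modelled as a dictionary of byte strings, and the class and method names are hypothetical.

```python
from typing import Dict, List, Optional


class IncrementalCheckpointer:
    """Keeps one delta per checkpoint instead of a full copy of the state."""

    def __init__(self) -> None:
        self._last: Dict[str, bytes] = {}  # state as of the latest checkpoint
        self._deltas: List[Dict[str, Optional[bytes]]] = []

    def checkpoint(self, state: Dict[str, bytes]) -> Dict[str, Optional[bytes]]:
        """Record only what changed; a value of None marks a deleted key."""
        delta: Dict[str, Optional[bytes]] = {
            key: value for key, value in state.items()
            if self._last.get(key) != value
        }
        for key in self._last.keys() - state.keys():
            delta[key] = None
        self._deltas.append(delta)
        self._last = dict(state)
        return delta

    def restore(self) -> Dict[str, bytes]:
        """Rebuild the latest checkpointed state by replaying deltas in order."""
        state: Dict[str, bytes] = {}
        for delta in self._deltas:
            for key, value in delta.items():
                if value is None:
                    state.pop(key, None)
                else:
                    state[key] = value
        return state
```

The design trades smaller checkpoint files for a longer restore, since recovery has to replay every delta; real systems typically bound that cost by taking an occasional full checkpoint.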

7.
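Restoring file-system state on program rollback in distributed checkpointing algorithms   Total citations: 3, self-citations: 0, citations by others: 3
Checkpointing, also known as "backward recovery," is an important means of software fault tolerance, used mainly to save and restore a program's running state. It plays a very important role in distributed and parallel computing systems. From the perspective of reducing checkpoint overhead, this paper analyzes, discusses, and further studies the problem of restoring file-system state when a program rolls back under distributed checkpointing algorithms.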

8.
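A checkpoint-based rollback recovery and process migration system*   Total citations: 12, self-citations: 2, citations by others: 12
ChaRM is a backward fault recovery and process migration system for parallel programs. It not only recovers from transient faults in workstation clusters but also, through Mirror storage at checkpoint time and process migration, recovers from permanent node failures in the cluster. It further supports online maintenance of system software and hardware, exclusive or time-limited use of processor resources, and dynamic load balancing. The paper mainly describes ChaRM's implementation techniques, including checkpointing, rollback recovery, and process migration, and presents some performance evaluation results.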

9.
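Real-time fault-tolerant scheduling algorithms for heterogeneous distributed systems   Total citations: 26, self-citations: 1, citations by others: 26
The real-time fault-tolerant scheduling algorithms studied in the literature to date all assume a homogeneous distributed system in which every processor is identical. This paper first builds a real-time fault-tolerant scheduling model for heterogeneous distributed systems, in which the processors all differ from one another. Based on this model, it introduces the concept of reliability cost and proposes two static real-time fault-tolerant scheduling algorithms (RTFTNO and RTFTRC) for scheduling periodic real-time fault-tolerant tasks. RTFTRC tries to minimize the system's reliability cost when scheduling tasks, whereas RTFTNO does not take reliability cost into account. The paper discusses the performance of both algorithms in detail. Simulation experiments compare the two algorithms' reliability cost, timeout ratio, and schedulability, and study the relationship between task computation time and reliability cost as well as that between the schedule-length threshold and the minimum number of processors. The results show that RTFTRC outperforms RTFTNO.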

10.
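Geng Ji, Chen Fei, Nie Peng, Chen Wei, Qin Zhiguang. Journal of Computer Applications (计算机应用), 2012, 32(10): 2748-2751
Checkpoint-based coordinated rollback recovery is an effective mechanism for ensuring the survivability of distributed systems. Existing checkpoint-based rollback recovery mechanisms assume that the distributed channels are reliable, an assumption that does not always hold in practice. Targeting real deployment environments, a coordinated system-survivability model is proposed for distributed computing environments with unreliable channels. While retaining the advantages of checkpoint-based rollback recovery, the model ensures the survivability of a distributed system over unreliable communication channels by establishing redundant communication links and migrating processes.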

11.
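A real-time scheduling model for heterogeneous distributed systems is established, with a formal description of the tasks and the different processor resources in such systems. Combined with the primary/backup version technique, a round-robin fault-tolerant scheduling algorithm for real-time tasks in heterogeneous distributed systems is given. A worked example shows that the algorithm effectively improves resource utilization and overall computing performance in heterogeneous processor environments.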

12.
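Since SIP came into use in the 1990s, it has fundamentally changed the way people communicate with one another through converged services. The Session Initiation Protocol provides a framework for delivering voice, video, data, and wireless services seamlessly and transparently over a network. Research on the reliability of SIP applications, however, is still at an early stage. This article describes an implementation of a distributed fault-tolerant SIP protocol stack, intended to ease the design and construction of reliable SIP networks and thereby give users of SIP services a better experience.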

13.
Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols.

14.
Distributed Problem Solving Networks (DPSN) provide a means for interconnecting intelligent problem solver nodes that can solve only a part of a problem depending on their ability in the problem domain. The decomposition of a problem into subproblems, and the selection of nodes to solve them can be regarded as the generation of an AND/OR tree, and the solution of the problem as a search for a solution tree. Introducing measurements for the cost of a solution tree, we present an algorithm to find one having minimal cost under certain conditions. A Flexible Manufacturing System consisting of a network of flexible workcells is used as an example.
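The idea of searching an AND/OR tree for a cheapest solution tree can be sketched as a bottom-up recursion. The abstract does not state the paper's cost measure or the conditions under which its algorithm applies, so this sketch assumes a simple additive cost: an AND node needs all of its children solved, an OR node needs only its cheapest child, and every node may carry a cost of its own. The `Node` type and `min_solution_cost` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # "LEAF", "AND", or "OR"
    cost: float = 0.0              # cost incurred at this node itself
    children: List["Node"] = field(default_factory=list)


def min_solution_cost(node: Node) -> float:
    """Cost of the cheapest solution tree rooted at `node` (additive cost assumed)."""
    if node.kind == "LEAF":
        return node.cost
    if not node.children:
        raise ValueError("AND/OR nodes need at least one child")
    child_costs = [min_solution_cost(child) for child in node.children]
    if node.kind == "AND":
        # Every subproblem must be solved, so the costs add up.
        return node.cost + sum(child_costs)
    if node.kind == "OR":
        # Any one alternative suffices, so take the cheapest.
        return node.cost + min(child_costs)
    raise ValueError(f"unknown node kind: {node.kind!r}")


if __name__ == "__main__":
    # Either split the problem into two subproblems (AND) or hand it to one node.
    split = Node("AND", cost=1.0, children=[Node("LEAF", 3.0), Node("LEAF", 4.0)])
    root = Node("OR", children=[split, Node("LEAF", 9.0)])
    print(min_solution_cost(root))
```

With other cost measures, for example the maximum over children instead of the sum, the AND case changes accordingly.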

15.
Pierre Sens, Bertil Folliot. Software, 1998, 28(10): 1079-1099
This paper presents the design, implementation and performance evaluation of a software fault manager for distributed applications. Dubbed Star, it uses the natural redundancy existing in networks of workstations to offer a high level of fault tolerance. Fault management is transparent to the supported parallel applications. To improve the response time of fault-tolerant applications, Star implements non-blocking and incremental checkpointing to perform an efficient backup of process state. Moreover, Star is application independent and highly configurable. Star currently runs on top of SunOS and is easily portable to UNIX™-like operating systems. The current implementation is based on independent checkpointing and message logging. Measurements show the efficiency and the limits of this implementation. The challenge is to show that a software approach to fault tolerance can efficiently be implemented in a standard networked environment. © 1998 John Wiley & Sons, Ltd.

16.
The concepts of semistability and exponential semistability are well-developed for finite-dimensional systems with nonisolated equilibrium points, where asymptotic or exponential stability is not possible. Definitions of semistability and exponential semistability have recently been formulated for networks with time-delays. This paper further extends the semistability theory to continuous and discrete spatially distributed systems. This requires the definition of the notions of exact and approximate semicontrollability and semiobservability, and discrete approximate semicontrollability and semiobservability. Also introduced is the property of weak semistability. Necessary and sufficient conditions are given for exponential semistability and weak semistability, and sufficient conditions are given for semistability.

17.
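The ideas behind component design are discussed in detail, and a component-based method for developing distributed information systems is explored, with a development model and concrete development steps. A fire-service document management subsystem is designed and implemented, showing that component technology makes the development of distributed information systems both high-quality and efficient.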

18.
Often hard real-time systems require results that are produced on time despite the occurrence of processor failures. This paper considers a distributed system where tasks are periodic and each task occurs in multiple copies which are periodically synchronized in order to handle failures. The problem of preemptively scheduling a set of such tasks is discussed where every occurrence of a task has to be completely executed before the next occurrence of the same task. First, a static scheduling algorithm is proposed which uses periodic checkpoints to tolerate processor failures. Then, the performance of the algorithm is substantially improved by employing a mixed strategy which constructs a schedule where high frequency tasks are duplicated, and low frequency tasks are periodically checkpointed. The performance of the solution proposed is evaluated in terms of the minimum achievable processor utilization due to the useful computation of the tasks. Moreover, analytical and simulation studies are used to reveal interesting trade-offs associated with the scheduling algorithm. In particular, if high frequency tasks are less than 70 percent of the total number of tasks then the mixed strategy yields a higher processor utilization than the task duplication scheme.

19.
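Traditional computer application systems follow the Client/Server architecture. This many-to-one mode of information sharing makes the server system extremely complex and hard to maintain. As demand for network services has surged, the traditional sharing mode can no longer meet the needs of new services, so distributed application system technology (the B/S architecture) has gradually become the architecture of choice for network service systems.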

20.
A network partition can break a distributed computing system into groups of isolated nodes. When this occurs, a mutual exclusion mechanism may be required to ensure that isolated groups do not concurrently perform conflicting operations. We study and formalize these mechanisms in three basic scenarios: where there is a single conflicting type of action; where there are two conflicting types, but operations of the same type do not conflict; and where there are two conflicting types, but operations of one type do not conflict among themselves. For each scenario, we present applications that require mutual exclusion (e.g., name servers, termination protocols, concurrency control). In each case, we also present mutual exclusion mechanisms that are more general and that may provide higher reliability than the voting mechanisms that have been proposed as solutions to this problem. Daniel Barbara is a graduate student in the Computer Science Department at Princeton University and expects to receive his Ph.D. degree by July 1985. He obtained his BS in Electrical Engineering at the Universidad Metropolitana, Caracas, Venezuela in 1975. His research interests are Distributed Systems, Databases and Computer Networks. He is a member of IEEE and ACM. Hector Garcia-Molina is associate professor in the Department of Computer Science at Princeton University, Princeton, New Jersey. His research interests include distributed computing systems and database systems. He received a BS in electrical engineering from the Instituto Tecnologico de Monterrey, Mexico, in 1974. From Stanford University, Stanford, California, he received in 1975 a MS in electrical engineering and a PhD in computer science in 1979. Garcia-Molina is a member of the ACM and IEEE.
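For reference, the voting baseline that this abstract compares against can be sketched as a weighted-majority check: a partitioned group may perform the conflicting operation only if it holds more than half of all votes, so two disjoint groups can never both qualify. This is the commonly described voting scheme for the single-conflicting-type scenario, not the more general mechanisms the paper proposes; the function name and the vote assignment are hypothetical.

```python
from typing import Dict, Iterable


def may_proceed(group: Iterable[str], votes: Dict[str, int]) -> bool:
    """Return True if `group` holds a strict majority of all votes.

    Assumption: every node knows the full, static vote assignment `votes`.
    """
    total = sum(votes.values())
    held = sum(votes[node] for node in set(group))
    return 2 * held > total


if __name__ == "__main__":
    votes = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}  # hypothetical nodes
    print(may_proceed({"a", "b", "c"}, votes))  # majority side of the partition
    print(may_proceed({"d", "e"}, votes))       # minority side must wait
```

Because the majority must be strict, an even split leaves every group blocked; that loss of availability is one reason more general mechanisms are of interest.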
