首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到18条相似文献,搜索用时 171 毫秒
1.
一种基于索引的准同步检查点协议   总被引:3,自引:0,他引:3  
在基于索引的分布式检查点算法中,尽量减少全局一致性检查点和强制检查点的数目对提高计算效率具有重要意义.该文在已有的基于索引的检查点算法的基础上,提出了一种新的检查点协议,既减少检查点的数目,又使各个进程的检查点之间实时同步,以免程序出错后回卷执行的开销太大,丢失过多有效计算.模拟实验表明,按该文所提协议,平均每条消息导致的强制检查点数比传统方法平均减少23.2%.  相似文献   

2.
考虑到移动Ad Hoc网络无固定中心节点、多跳路由和资源有限等特点,基于分簇移动Ad Hoc网络结构,提出了一种结合同步和异步检查点技术的混合检查点策略,即同簇终端检查点必须保持同步,而异簇终端检查点保持独立.首先讨论了混合检查点模型及其正确性准则.然后,基于簇内及簇间检查点依赖图,讨论了不同类型检查点清除规则.最后,给出了相应的检查点及回滚恢复算法,并证明了回滚恢复的正确性.所提出的混合检查点策略既能避免同簇进程级联回滚所引起的资源浪费、又能避免异簇终端之间过多跨簇消息传递及减少无线通信延迟.实验结果表明,与单纯的同步及异步检查点策略相比,所提出的检查点策略是一种综合考虑移动Ad Hoc网络各种资源约束的较好折中方案,且具有恢复时间短、对簇头依赖小、灵活性好等优点.  相似文献   

3.
检查点算法作为一种有效的故障技术及容错手段,已广泛地运用在网格、分布式和云计算系统中。该文提出了一种非阻塞协调检查点算法,该算法增加了系统的可靠性,并允许检查点灵活设置,充分缩减了同步信息数量,加速了检查点形成时间。和典型的相关算法比较,该文提出的算法使用更少的同步控制消息,具有更低的费用,引入同步控制消息的时间复杂度由一般的O(n2)降到O(n),且同步消息数仅仅为n-1。  相似文献   

4.
检查点算法作为一种有效的故障技术及容错手段,已广泛地运用在网格、分布式和云计算系统中。该文提出了一种非阻塞协调检查点算法,该算法增加了系统的可靠性,并允许检查点灵活设置,充分缩减了同步信息数量,加速了检查点形成时间。和典型的相关算法比较,该文提出的算法使用更少的同步控制消息,具有更低的费用,引入同步控制消息的时间复杂度由一般的O(n2)降到O(n),且同步消息数仅仅为n-1。  相似文献   

5.
一种改进的同步检查点设置算法   总被引:1,自引:0,他引:1  
检查点设置与卷回恢复是集群系统中容错计算的重要手段.同步检查点方法在集群系统中得到了广泛应用.为了提高集群计算系统的工作效率,降低系统的容错开销,根据基于消息驱赶的同步检查点设置算法的性质和在实际应用中并行应用程序的通信特征,通过减小协同过程中的阻塞时间,降低系统中控制消息的数量,对基于消息驱赶的Syncand-Stop算法进行优化.改进的算法有效降低检查点设置的时间和空间开销,减小在系统应用中检查点设置的代价,进一步提高系统可扩展性和应用可靠性.  相似文献   

6.
分布式系统中的检查点算法   总被引:12,自引:0,他引:12  
检查点能够保存和恢复程序的运行状态.它在进程迁移、容错、卷回调试等领域都有重要的应用.本文对分布式系统中的检查点算法进行了详细的分类评述.检查点算法可分为单进程和分布式程序检查点算法,分布式程序检查点算法又可分为异步检查点算法和一致检查点算法.同时本文系统介绍了改进检查点算法性能的典型方法.这些改进算法主要采用两个策略来减少算法的开销与延迟:一是减少检查点文件中需要存储的信息量,如增量算法等;二是提高检查点操作与目标程序运行的并行性,如主存算法等.最后,文章讨论了目前检查点算法的局限性和进一步的工作.  相似文献   

7.
检查点是并行系统中实现容错的重要手段,同步检查点方法已广泛应用在工作站机群系统中。PVM所提供的消息传递机制支持高效的异构网络计算,但不支持客错功能。为了降低同步检查点设置的时间开销,提出了一种基于PVM的准同步检查点设置方法,它吸取了同步检查点方法的优点,又通过消息记录方式实现各节点间独立进行状态保存,大大降低了检查点的同步开销,提高了检查点操作效率,该方法在PVM环境下得以实现,实验结果表明所提出的方法具有较好的客错性能。  相似文献   

8.
具有O(n)消息复杂度的协调检查点设置算法   总被引:3,自引:0,他引:3  
协调检查点设置及回卷恢复技术作为一种有效的容错手段,已广泛地运用在集群等并行/分布计算机系统中.为了进一步降低协调检查点设置的时间和空间开销,提出了一种基于消息计数的协调检查点设置算法.该算法无须对底层消息通道的FIFO特性进行假设,并使同步阶段引入的控制消息复杂度由通常的O(n2)降低到O(n),有效地提高了系统的效率和扩展性.  相似文献   

9.
针对典型的云平台下虚拟化系统的特点,提出了一种结合选择性日志的准同步检查点算法VM_QSC:保持不同虚拟机节点固有的优化检查点周期,通过物理节点Hypervisor选择性地进行虚拟机的消息日志的稳定存储,在全局监控节点维护虚拟机一致线信息,保持全局的一致性。与传统的准同步检查点和同步检查点相比,该算法维持了虚拟机检查点设置的自主性,并显著降低了虚拟化系统的容错开销,可以有效应用于云计算环境下的虚拟资源管理和动态迁移。  相似文献   

10.
在分布式计算环境中经常使用检查点/恢复策略来进行容错。文中主要研究在信道不可靠的环境中通过协调使相互通信的各进程所做的检查点保持全局一致性的方法。通过分析中途消息与信道可靠性之闯的关系以及已有检查点协议对于中途消息处理方法,提出了一种应用于信道不可靠环境下的协调式检查点方法,其消息复杂度为O(N)且不引入其他的计算负担,只通过一次同步即可达到全局一致性状态,相比于以往的协调式检查点协议大大减小了时间开销,提高了在不可靠信道环境中做全局一致检查点的效率。  相似文献   

11.
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class.  相似文献   

12.
This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to the ones of previous solutions  相似文献   

13.
Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.  相似文献   

14.
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: first is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems  相似文献   

15.
Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper; we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage  相似文献   

16.
Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. Synchronizing checkpointing activity involves message overhead and process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation. In asynchronous checkpointing, processes take checkpoints without any coordination with others. Asynchronous checkpointing provides maximum autonomy for processes to take checkpoints; however, some of the checkpoints taken may not lie on any consistent global checkpoint, thus making the checkpointing efforts useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication induced checkpoints besides asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize the quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms belonging to each class. It also provides guidelines for designing and evaluating such algorithms  相似文献   

17.
A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.  相似文献   

18.
A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. The traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce checkpoint overhead. This paper presents a simple approach to arbitrarily stagger the checkpoints. Our approach requires that the processes take consistent logical checkpoints, as compared to consistent physical checkpoints enforced by existing algorithms. Experimental results on nCube-2 are presented  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号