Similar literature
1.
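Research on a Shared-Memory-Based Checkpoint Mechanism for Cluster Services   Total citations: 1, self-citations: 0, citations by others: 1
To address the high system cost and long recovery time of existing cluster-service checkpoints based on stable storage, a shared-memory-based checkpoint mechanism for cluster services is proposed. A checkpoint information management protocol is designed for the primary-backup storage mode of shared-memory checkpoint information, ensuring the consistency of cluster-service checkpoint information. A checkpoint group management protocol based on a unidirectional logical ring is designed, ensuring a consistent membership view among the checkpoint processes in the logical backup ring. Performance experiments show that the mechanism offers good read and write performance for checkpoint information, that the group management protocol incurs low system overhead, and that the mechanism meets the checkpointing needs of cluster services well.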

2.
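Spark checkpoint cache data can only be cleaned up by the programmer after the job has finished, which can let stale data accumulate and occupy memory. This paper analyzes the checkpoint execution mechanism and shows by modeling that the checkpoint cache cleanup method does not scale as the number of checkpoints grows. It proposes a checkpoint-cache utility entropy model to sense how well checkpoint caches match memory slots, and uses the principle of best utility matching to derive the best time for checkpoint cache cleanup. The utility-entropy-based...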

3.
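To improve checkpoint efficiency in large-scale high-performance computing (HPC) systems, a model for computing a near-optimal period for hierarchical checkpointing is proposed. First, by analyzing the execution of an application on an HPC system, checkpoint period optimization is abstracted as a nonlinear checkpoint cost model. Second, a hierarchical checkpoint cost formula is derived by analyzing the possible failure locations, and two slowdown factors and one speedup factor are introduced to model the effect of message logging on hierarchical checkpointing. Simulation results show that the average error between the checkpoint cost of the proposed model and that of the theoretical near-optimal period is below 5%, and that the average error is 20% lower than that of traditional checkpoint period optimization models. The model therefore improves checkpoint efficiency and HPC system availability.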

4.
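李春江 (Li Chunjiang), 肖侬 (Xiao Nong), 杨学军 (Yang Xuejun). 《计算机工程》 (Computer Engineering), 2005, 31(10): 57-59, 102
The particular requirements of implementing checkpointing in a computational grid environment are analyzed, and a new application-level checkpointing method is proposed: checkpointing based on job progress description. The basic idea of the method is introduced. The job progress state object and the job progress description object that make up a job progress description are defined, and the methods of these objects form the checkpoint API. The construction of checkpointable jobs is also discussed.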

5.
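Checkpointing Algorithms in Distributed Systems   Total citations: 12, self-citations: 0, citations by others: 12
Checkpointing saves and restores the running state of a program and has important applications in process migration, fault tolerance, rollback debugging, and other areas. This paper gives a detailed classification and review of checkpointing algorithms in distributed systems. Checkpointing algorithms can be divided into single-process and distributed-program algorithms, and the latter can be further divided into asynchronous and consistent checkpointing algorithms. The paper also systematically introduces typical methods for improving checkpointing performance. These improved algorithms mainly use two strategies to reduce overhead and latency: first, reducing the amount of information that must be stored in the checkpoint file, as in incremental algorithms; second, increasing the parallelism between the checkpoint operation and the execution of the target program, as in main-memory algorithms. Finally, the paper discusses the limitations of current checkpointing algorithms and directions for further work.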
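
The first strategy in the abstract above (storing less per checkpoint, as incremental algorithms do) can be illustrated with a minimal sketch. It is not the algorithm of any surveyed paper. The page size, the use of SHA-256 to detect changed pages, and all function names are assumptions made for illustration, and the state is assumed to keep a fixed size between checkpoints.

```python
import hashlib

PAGE_SIZE = 4  # hypothetical, tiny page size chosen only for illustration


def split_pages(state: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Cut the process state into fixed-size pages."""
    return [state[i:i + page_size] for i in range(0, len(state), page_size)]


def incremental_checkpoint(prev_hashes: dict[int, str], state: bytes):
    """Return (delta, new_hashes); delta holds only pages whose hash changed."""
    delta: dict[int, bytes] = {}
    new_hashes: dict[int, str] = {}
    for idx, page in enumerate(split_pages(state)):
        digest = hashlib.sha256(page).hexdigest()
        new_hashes[idx] = digest
        if prev_hashes.get(idx) != digest:
            delta[idx] = page  # page is new or modified since the last checkpoint
    return delta, new_hashes


def restore(full_pages: dict[int, bytes], deltas: list[dict[int, bytes]]) -> bytes:
    """Rebuild the state from a full checkpoint plus later deltas, oldest first."""
    pages = dict(full_pages)
    for delta in deltas:
        pages.update(delta)
    return b"".join(pages[i] for i in sorted(pages))
```

With an empty hash table the first call produces a full checkpoint; later calls store only the modified pages, which is where the saving in checkpoint file size comes from.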

6.
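刘勇燕 (Liu Yongyan), 刘勇鹏 (Liu Yongpeng), 冯华 (Feng Hua), 迟万庆 (Chi Wanqing). 《计算机科学》 (Computer Science), 2011, 38(5): 287-289, 305
Checkpointing is an important fault-tolerance technique in high-performance parallel computing systems. As system scale grows, the scalability of parallel checkpointing is constrained by file access. For the multi-level file system structure of large-scale parallel computing systems, a cache-style parallel checkpointing technique is proposed. It turns globally synchronized parallel checkpointing into local file operations and uses the multiprocessor structure for out-of-order, pipelined write-back scheduling, distributing checkpoint write-back times sensibly. This hides the write-back overhead of checkpoints and ensures high performance and high scalability of parallel checkpoint file access.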

7.
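Combining buffering with incremental disk-based checkpointing, this paper proposes a fast and reliable improved N+1 parity checkpointing scheme. While N application processes run, a dedicated checkpoint process is set up to implement N+1 parity, and the checkpoint machine uses the idle time between checkpoints to save the incremental parity checkpoint information to stable storage. The improved algorithm combines the speed of diskless checkpointing with the high reliability of disk-based checkpointing, eliminates one backup processor, and tolerates two concurrent failures: one application process and one checkpoint process.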
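
The N+1 parity idea in the abstract above can be sketched generically. This is not the paper's scheme: it only shows the standard XOR parity property that such schemes build on, under the assumption that all N checkpoints have equal length. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity(checkpoints: list[bytes]) -> bytes:
    """Parity block kept by the dedicated checkpoint process (the '+1')."""
    return reduce(xor_bytes, checkpoints)


def recover(surviving: list[bytes], parity_block: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the N-1 survivors and the parity."""
    return reduce(xor_bytes, surviving, parity_block)


def update_parity(parity_block: bytes, old: bytes, new: bytes) -> bytes:
    """Incremental update: remove the old checkpoint's contribution, add the new one."""
    return xor_bytes(xor_bytes(parity_block, old), new)
```

Because XOR is its own inverse, the parity can be updated from one process's old and new checkpoint alone, without rereading the other N-1 checkpoints. That property is what makes an incremental parity update cheap.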

8.
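This paper describes in detail the checkpoint mechanism and working principle of condor and configures condor's checkpoint mechanism. Using a concrete job scheduling program, it successfully tests the correctness of condor checkpoints, the usability of the checkpoint feature, and the usability of some of the programming interfaces (APIs) provided by the checkpoint library.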

9.
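Considering that mobile Ad Hoc networks have no fixed central node, use multi-hop routing, and have limited resources, a hybrid checkpointing strategy combining synchronous and asynchronous checkpointing is proposed for a clustered mobile Ad Hoc network structure: checkpoints of terminals in the same cluster must stay synchronized, while checkpoints of terminals in different clusters remain independent. First, the hybrid checkpoint model and its correctness criteria are discussed. Then, based on intra-cluster and inter-cluster checkpoint dependency graphs, rules for discarding different types of checkpoints are discussed. Finally, the corresponding checkpointing and rollback recovery algorithms are given, and the correctness of rollback recovery is proved. The proposed strategy avoids the resource waste caused by cascading rollback of processes in the same cluster, avoids excessive cross-cluster message passing between terminals in different clusters, and reduces wireless communication delay. Experimental results show that, compared with purely synchronous or purely asynchronous checkpointing, the proposed strategy is a good compromise that takes the various resource constraints of mobile Ad Hoc networks into account, with short recovery time, low dependence on cluster heads, and good flexibility.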

10.
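Existing checkpointing techniques do not support restoring socket connections, nor do they combine process recovery with data recovery, so they cannot support applications that access a database. This paper proposes a process checkpointing technique that supports such applications. For an application with database access, before a process checkpoint is taken, a database checkpoint is set first and the database's current system change number (SCN) is obtained; the process checkpoint is then generated. When the program resumes from the process checkpoint...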

11.
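A New Model for Solving the Optimal Checkpoint Interval   Total citations: 1, self-citations: 0, citations by others: 1
In high-performance computing environments with fault tolerance, adding a checkpoint mechanism introduces extra load on the system, so choosing the checkpoint interval appropriately can optimize system performance. Vaidya's contribution is that the equation for the checkpoint interval derived from his model is independent of the checkpoint latency (L) and the checkpoint recovery time (R). This paper introduces NSBM, a new model based on time segmentation. It replaces the average load ratio of Vaidya's model with average system utilization, a concept that is easier to understand in the fault-tolerance field, and derives an equation for the interval that is likewise independent of L and R. Experimental results show that the NSBM model yields a better-optimized interval than Vaidya's model.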
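
Neither Vaidya's equation nor the NSBM equation appears in this abstract, so the sketch below does not reproduce them. To show what solving for a checkpoint interval looks like, it uses Young's classic first-order approximation instead: with checkpoint cost C and mean time between failures M, the overhead per unit time is roughly C/T + T/(2M), which is minimized at T = sqrt(2CM). The function names are hypothetical.

```python
import math


def expected_overhead(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order overhead per unit time: checkpointing cost plus expected rework."""
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximation of the overhead-minimizing interval, sqrt(2*C*M)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

Like the equations described in the abstract, this approximation involves neither L nor R. It is reasonable only when the checkpoint cost is small relative to the mean time between failures.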

12.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.

13.
Checkpoint and recovery protocols are commonly used in distributed applications for providing fault tolerance. The performance of a checkpoint and recovery protocol is judged by the amount of computation it can save against the amount of overhead it incurs. This performance depends on different system and application characteristics, as well as protocol specific parameters. Hence, no single checkpoint and recovery protocol works equally well for all applications, and given a distributed application and a system it will run on, it is important to choose a protocol that will give the best performance for that system and application. In this paper, we present a scheme to automatically identify a suitable checkpoint and recovery protocol for a given distributed application running on a given system. The scheme involves a novel technique for finding the similarity between the communication pattern of two distributed applications that is of independent interest also. The similarity measure is based on a graph similarity problem. We present a heuristic for the graph similarity problem. Extensive experimental results are shown both for the graph similarity heuristic and the automatic identification scheme to show that an appropriate checkpoint and recovery protocol can be chosen automatically for a given application.

14.
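Parallel discrete event simulation facilitates the study of complex, large-scale dynamic systems and the exploration of their long-term applications, and has increasingly become a research focus in recent years. Time synchronization management, however, is one of the key factors affecting the efficient operation of parallel discrete event simulation systems. Optimistic synchronization uses a detect-and-rollback mechanism that lets logical processes aggressively process local events. When a synchronization error occurs, the rollback mechanism recovers from the error to an earlier state, and execution then resumes. All of this is achieved through a checkpoint-based state saving and reconstruction mechanism, so state saving and state reconstruction inevitably cost time and space. This paper studies in depth how simulation execution time and memory cost relate to the checkpoint interval under optimistic synchronization, and derives the optimal range for the checkpoint interval.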

15.
A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent if it does not include messages received and not sent. The paper investigates other consistency criteria, transitlessness, and strong consistency. A global checkpoint is transitless if it does not exhibit messages sent and not received. Transitlessness can be seen as a dual of traditional consistency. Strong consistency is the addition of transitlessness to traditional consistency. The main result of the paper is a statement of the necessary and sufficient condition answering the following question: “given an arbitrary set of local checkpoints, can this set be extended to a global checkpoint that satisfies P” (where P is traditional consistency, transitlessness, or strong consistency). From a practical point of view, this condition, when applied to transitlessness, is particularly interesting as it helps characterize which messages do not need to be recorded by checkpointing protocols.
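
The three criteria defined in this abstract can be written as direct checks on a given global checkpoint. The sketch below is only a restatement of those definitions, not the paper's necessary and sufficient condition for extending a set of local checkpoints. The event model is an assumption: each process numbers its local events, a global checkpoint maps each process to the index of the last event it includes, and the `Message` type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # local event index of the send at the sender
    recv_time: int  # local event index of the receive at the receiver


def is_consistent(cut: dict[str, int], messages: list[Message]) -> bool:
    """Traditional consistency: no message recorded as received but not sent."""
    return not any(
        m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]
        for m in messages
    )


def is_transitless(cut: dict[str, int], messages: list[Message]) -> bool:
    """Transitlessness: no message recorded as sent but not received."""
    return not any(
        m.send_time <= cut[m.sender] and m.recv_time > cut[m.receiver]
        for m in messages
    )


def is_strongly_consistent(cut: dict[str, int], messages: list[Message]) -> bool:
    """Strong consistency: traditional consistency plus transitlessness."""
    return is_consistent(cut, messages) and is_transitless(cut, messages)
```

For a single message sent by p at event 2 and received by q at event 3, the checkpoint {p: 1, q: 4} violates consistency and the checkpoint {p: 2, q: 2} violates transitlessness, which shows the duality the abstract mentions.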

16.
Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint?’ (such a local checkpoint is not useless). The resulting algorithm has the nice property of reducing the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Furthermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.

17.
Checkpoint and rollback recovery is a well-known technique for providing fault tolerance to long-running distributed applications. Performance of a checkpoint and recovery protocol depends on the characteristics of the application and the system on which it runs. However, given an application and system environment, there is no easy way to identify which checkpoint and recovery protocol will be most suitable for it. Conventional approaches require implementing the application with all the protocols under consideration, running them on the desired system, and comparing their performances. This process can be very tedious and time consuming. This paper first presents the design and implementation of a simulation environment, distributed process simulation or dPSIM, which enables easy implementation and evaluation of checkpoint and recovery protocols. The tool enables the protocols to be simulated under a wide variety of application, system, and network characteristics. The paper then presents performance evaluation of five checkpoint and recovery protocols. These protocols are implemented and executed in dPSIM under different simulated application, system, and network characteristics. Copyright © 2003 John Wiley & Sons, Ltd.

18.
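As an effective technique for handling faults and providing fault tolerance, checkpointing algorithms are widely used in grid, distributed, and cloud computing systems. This paper proposes a non-blocking coordinated checkpointing algorithm that increases system reliability, allows checkpoints to be set flexibly, substantially reduces the number of synchronization messages, and shortens checkpoint formation time. Compared with typical related algorithms, the proposed algorithm uses fewer synchronization control messages and has lower cost: the time complexity introduced by synchronization control messages drops from the usual O(n²) to O(n), and the number of synchronization messages is only n-1.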

19.
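SFT: A Consistent Checkpointing Algorithm with Short Freeze Time   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents SFT, a consistent checkpointing algorithm based on message logging. SFT provides fault tolerance for distributed systems and has the advantages of no domino effect, short freeze time, low overhead, and a simple restart algorithm. SFT's IPC mechanism is based on PVM, guarantees in-order message delivery, and makes message send and receive operations atomic. In addition, process id values in the IPC mechanism are encoded independently of the host machine, so a process can continue communicating with other processes even after migrating from a failed machine to another machine. To increase the parallelism of checkpoint operations, SFT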

20.
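Based on the blcr software, session checkpoint save-and-restore software is designed at the Linux kernel level. The software can save and restore checkpoints synchronously across the processes of a single session without changing the dependencies among those processes. Application results show that, once integrated into the Torque/Maui cluster management and scheduling system, the software can transparently checkpoint and restore programs run by users.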
