首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到18条相似文献,搜索用时 125 毫秒
1.
一种改进的同步检查点设置算法   总被引:1,自引:0,他引:1  
检查点设置与卷回恢复是集群系统中容错计算的重要手段.同步检查点方法在集群系统中得到了广泛应用.为了提高集群计算系统的工作效率,降低系统的容错开销,根据基于消息驱赶的同步检查点设置算法的性质和在实际应用中并行应用程序的通信特征,通过减小协同过程中的阻塞时间,降低系统中控制消息的数量,对基于消息驱赶的Syncand-Stop算法进行优化.改进的算法有效降低检查点设置的时间和空间开销,减小在系统应用中检查点设置的代价,进一步提高系统可扩展性和应用可靠性.  相似文献   

2.
检查点设置和卷回恢复是提高系统可靠性和实现容错计算的有效途径,其性能通常用开销率来评价,而检查点开销是影响开销率的主要因素。针对目前并行程序运行时存在较多通信阻塞时间的现状,该文在写时复制检查点缓存的基础上提出了一种进一步降低检查点开销的方法。通过控制状态保存线程的调度和选择合适的状态保存粒度,该方法能很好地利用通信阻塞时间隐藏状态保存线程运行时带来的开销,从而能进一步降低开销率。  相似文献   

3.
一种基于移动计算环境的因果日志卷回恢复算法   总被引:2,自引:0,他引:2  
由于移动节点的不可靠和无线网络连接的脆弱性,研究移动计算系统容错机制具有重要意义.对可以跨区移动、随时可以与网络断开的自治性很强的移动节点来说,异步的卷回恢复是一种重要的容错手段.现有的移动计算环境下的卷回恢复算法都无法完全实现一致的异步卷回恢复.基于因果消息日志,提出一种新的移动计算环境的卷回恢复算法:通过先行图来记录节点间的消息依赖关系,将异步检查点、基于发送方的暂存消息日志和先行图全部在移动支持站上存储和处理,为移动节点提供一种透明的容错服务,完全消除依赖关系在移动节点之间造成的影响.用形式化的方法证明了系统的一致性.仿真结果表明,在卷回开销达到最低的同时,也显著降低了无错运行时的通信和存储开销.  相似文献   

4.
一个适合大规模集群并行计算的检查点系统   总被引:5,自引:1,他引:4  
分布式检查点系统是大规模并行计算系统容错的重要手段.协议开销和检查点映像存储成为困扰并行检查点系统可伸缩性的两大瓶颈.针对并行应用程序的执行特征和高性能集群的体系结构特点,C系统分别采用动态虚连接技术和分布存储检查点映像的方法来有效降低协同式检查点的开销,增强检查点系统的可伸缩性.初步测试结果表明,C系统的设计策略适合大规模并行计算的容错.  相似文献   

5.
基于虚拟文件操作的文件检查点设置   总被引:1,自引:0,他引:1  
刘少锋  汪东升  朱晶 《软件学报》2002,13(8):1528-1533
实现分布/并行系统容错的基础是单进程检查点设置和卷回恢复技术,而对活动文件信息进行保存和恢复则是这种技术的重要方面.提出一种虚拟文件操作策略,实现了对用户文件的检查点设置,有效地解决了发生故障时用户文件内容与进程全局状态的不一致的问题.该方法通过文件块式管理、检查点分布操作等技术,使得在空间开销、正常运行时间、恢复时间等性能指标上优于其他方法,并且具有对用户透明、可最大限度地保留已完成工作的特点.  相似文献   

6.
为降低设置检查点的开销,提出一种高效的异步存储非阻塞协调式检查点算法。该算法允许多个进程并发地在进程状态信息量较小时设置检查点,只在稳固存储器空闲时进行异步存储,并可同时进行检查点设置及进程执行。实验结果表明,该算法能降低设置检查点的开销,提高系统性能。  相似文献   

7.
MPI是广泛应用于集群系统的并行程序开发环境,MPI的容错是集群系统可靠性的关键问题。该文讨论了MPI标准中的容错,结合协调设置检查点和同步卷回等机制设计了基于检查点的卷回恢复系统MPIChaRR、该系统应用于Linux集群机,MPICH应用程序运行中的节点故障恢复是对用户透明的。  相似文献   

8.
机群系统中检查点卷回恢复协议分析   总被引:2,自引:0,他引:2       下载免费PDF全文
检查点机制作为一种软件容错机制,可以很好地满足机群系统的容错要求,本文详细分析了各类检查点卷回恢复协议,并比较它们的性能和特点。  相似文献   

9.
具有O(n)消息复杂度的协调检查点设置算法   总被引:3,自引:0,他引:3  
协调检查点设置及回卷恢复技术作为一种有效的容错手段,已广泛地运用在集群等并行/分布计算机系统中.为了进一步降低协调检查点设置的时间和空间开销,提出了一种基于消息计数的协调检查点设置算法.该算法无须对底层消息通道的FIFO特性进行假设,并使同步阶段引入的控制消息复杂度由通常的O(n2)降低到O(n),有效地提高了系统的效率和扩展性.  相似文献   

10.
针对计算流体力学应用开发框架容错支持能力的不足,提出了一种新的容错周期优化方法。该方法基于系统故障的概率建模,计算得到理想最优容错周期;并结合计算流体力学应用场数据输出的特点,在线确定实际检查点备份时机。三个典型应用的实验结果表明,在不同平均无故障时间的系统上,与固定时间步进行容错的方法相比,该方法总能够得到最优的容错开销。用户可以基于该方法通过框架接口便捷地设置容错周期,并有效降低容错所引起的开销。  相似文献   

11.
Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).  相似文献   

12.
This paper presents an optimal checkpoint strategy for fault-tolerance in real-time systems where transient faults occur in Poisson distribution. In our environment, multiple real-time tasks with different deadlines and harmonic periods are scheduled in the system by rate-monotonic algorithm, and checkpoints are inserted at a constant interval in each task. When a fault is detected, the system carries out rollback to the latest checkpoint and re-executes tasks. The maximum number of re-executable checkpoints and an equation to check schedulability are derived, and the optimal number of checkpoints is selected to maximize the probability of completing all the tasks within their deadlines.  相似文献   

13.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.  相似文献   

14.
基于检测点设置依赖图和属性表的卷回恢复算法   总被引:2,自引:0,他引:2  
为了解决检测点设置过程中的Domino效应问题及卷回恢复过程中的活锁问题,并最大限度地减小时间开销,提出了基于检测点设置依赖图和属性表的卷回恢复算法。同以前的算法相比较,该算法一方面节省了用于进程之间同步的时间开销,另一方面检测点设置及卷回过程中涉及少量的相关进程。对该算法的正确性进行了证明。  相似文献   

15.
计算机系统容错技术研究   总被引:2,自引:1,他引:1       下载免费PDF全文
针对计算机系统中软、硬件可靠性问题的不同特点,讨论容错技术的最新发展现状,分析计算机系统中的各种容错方法,包括传统的冗余设计、错误回卷恢复机制以及当前研究较多的一般化容错设计方法等,研究目前已有的一些容错方法在反应延迟、容错成本、精确量化、异构同步、可靠性建模等方面存在的缺陷以及待解决关键问题,并对如何进一步更好地完善和使用这些容错方法进行总结。  相似文献   

16.
A rollback recovery scheme for distributed systems is proposed. The state-save synchronization among processes is implemented by bounding clock drifts such that no state-save synchronization messages are required. Since the clocks are only loosely synchronized, the synchronization overhead can be negligible in many applications. An interprocess communication protocol which encodes state-save progress information within message frames is introduced to checkpoint consistent system states. A rollback recovery algorithm that will force a minimum number of nodes to roll back after failures is developed  相似文献   

17.
Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.  相似文献   

18.
A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号