期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

On properties of RDT communication-induced checkpointing protocols 总被引：1，自引：0，他引：1

Jichiang Tsai 《Parallel and Distributed Systems, IEEE Transactions on》2003,14(8):755-764

Rollback-dependency trackability (RDT) is a property stating that all rollback dependencies between local checkpoints are online trackable by using a transitive dependency vector. The most crucial RDT characterizations introduced in the literature can be represented as certain types of RDT-PXCM-paths. Here, let the U-path and V-path be any two types of RDT-PXCM-paths. We investigate several properties of communication-induced checkpointing protocols that ensure the RDT property. First, we prove that if an online RDT protocol encounters a U-path at a point of a checkpoint and communication pattern associated with a distributed computation, it also encounters a V-path there. Moreover, if this encountered U-path is invisibly doubled, the corresponding encountered V-path is invisibly doubled as well. Therefore, we can conclude that breaking all invisibly doubled U-paths is equivalent to breaking all invisibly doubled V-paths for an online RDT protocol. Next, we continue to demonstrate that a visibly doubled U-path must contain a doubled U-cycle in the causal past. These results can further deduce that some different checkpointing protocols actually have the same behavior for all possible patterns. Finally, we present a commendatory systematic technique for comparing the performance of online RDT protocols. 相似文献

2.

Communication-based prevention of useless checkpoints in distributed computations

J.-M. Hélary A. Mostefaoui R.H.B. Netzer M. Raynal 《Distributed Computing》2000,13(1):29-43

Summary. A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints. Received: July 1997 / Accepted: August 1999 相似文献

3.

Theoretical analysis for communication-induced checkpointingprotocols with rollback-dependency trackability

Jichiang Tsai Sy-Yen Kuo Yi-Min Wang 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):963-971

Rollback-Dependency Trackability (RDT) is a property that states that all rollback dependencies between local checkpoints are on-line trackable by using a transitive dependency vector. In this paper, we address three fundamental issues in the design of communication-induced checkpointing protocols that ensure RDT. First, we prove that the following intuition commonly assumed in the literature is in fact false: If a protocol forces a checkpoint only at a stronger condition, then it must take, at most, as many forced checkpoints as a protocol based on a weaker condition. This result implies that the common approach of sharpening the checkpoint-inducing condition by piggybacking more control information on each message may not always yield a more efficient protocol. Next, we prove that there is no optimal on-line RDT protocol that takes fewer forced checkpoints than any other RDT protocol for all possible communication patterns. Finally, since comparing checkpoint-inducing conditions is not sufficient for comparing protocol performance, we present some formal techniques for comparing the performance of several existing RDT protocols 相似文献

4.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class. 相似文献

5.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm. 相似文献

6.

Adaptive checkpointing in message passing distributed systems

ROBERTO BALDON JEAN-MICHEL HELARYI ACHOUR MOSTEFAOUI MICHEL RAYNAL 《International journal of systems science》2013,44(11):1145-1161

Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks. 相似文献

7.

基于检测点设置依赖图和属性表的卷回恢复算法 总被引：2，自引：0，他引：2

张宇洪炳熔《计算机研究与发展》2001,38(2):246-251

为了解决检测点设置过程中的Domino效应问题及卷回恢复过程中的活锁问题,并最大限度地减小时间开销,提出了基于检测点设置依赖图和属性表的卷回恢复算法。同以前的算法相比较,该算法一方面节省了用于进程之间同步的时间开销,另一方面检测点设置及卷回过程中涉及少量的相关进程。对该算法的正确性进行了证明。相似文献

8.

Communication-induced determination of consistent snapshots

Helary J. Mostefaoui A. Raynal M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(9):865-877

A classical way to determine consistent snapshots consists in using Chandy-Lamport's algorithm. This algorithm relies on specific control messages that allow processes to synchronize local checkpoint determination and message recording in order for the resulting snapshot to be consistent. This paper investigates a communication-induced approach to determine consistent snapshots. In such an approach, control information is carried out by application messages. Two abstract necessary and sufficient conditions are stated: one associated with global checkpoint consistency, the other associated with message recording. A general protocol is derived from these abstract conditions. Actually, this general protocol can be instantiated in distinct ways, giving rise to a family of communication-induced snapshot protocols. This general protocol shows there is an intrinsic trade-off between the number of forced checkpoints and the number of recorded messages. Finally, a particular instantiation of the general protocol is provided 相似文献

9.

An efficient protocol for checkpointing recovery in distributedsystems

Kim J.L. Park T. 《Parallel and Distributed Systems, IEEE Transactions on》1993,4(8):955-960

The authors present an efficient synchronized checkpointing protocol that exploits the dependency relation between processes in distributed systems. In this protocol, a process takes a checkpoint when it knows that all processes on which it computationally depends took their checkpoints, hence the process need not always wait for the decision made by the checkpointing coordinator as in the conventional synchronized protocols. As a result, the checkpointing coordination time is substantially reduced and the possibility of total abort of the checkpointing coordination is reduced 相似文献

10.

Consistency issues in distributed checkpoints

Helary J.-M. Netzer R.H.B. Raynal M. 《IEEE transactions on pattern analysis and machine intelligence》1999,25(2):274-281

A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent if it does not include messages received and not sent. The paper investigates other consistency criteria, transitlessness, and strong consistency. A global checkpoint is transitless if it does not exhibit messages sent and not received. Transitlessness can be seen as a dual of traditional consistency. Strong consistency is the addition of transitlessness to traditional consistency. The main result of the paper is a statement of the necessary and sufficient condition answering the following question: “given an arbitrary set of local checkpoints, can this set be extended to a global checkpoint that satisfies P” (where P is traditional consistency, transitlessness, or strong consistency). From a practical point of view, this condition, when applied to transitlessness, is particularly interesting as it helps characterize which messages do not need to be recorded by checkpointing protocols 相似文献

11.

Mutable checkpoints: a new checkpointing approach for mobilecomputing systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):157-172

Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper; we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage 相似文献

12.

Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

《Journal of Parallel and Distributed Computing》2004,64(5):649-661

Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n). 相似文献

13.

Asynchronous recovery without using vector timestamps

《Journal of Parallel and Distributed Computing》2002,62(12):1695-1728

A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery. 相似文献

14.

可扩展的多周期检查点设置 总被引：1，自引：0，他引：1

慈轶为张展左德承吴智博杨孝宗《软件学报》2010,21(2):218-230

提出了一种多周期检查点设置方法.它允许各个进程采用不同周期进行检查点设置.为了保证一致全局检查点的向前推进,检查点周期可以根据一个P模式进行调整.在所提出的方法中,进程可以进行组划分处理,从而用于检查点周期调整的依赖跟踪可被限定在组内,同时也将使基于时间的多周期检查点设置具有较好的可扩展性. 相似文献

15.

支持移动合作实时事务的一种新的协同检验点算法

李国徽陈基雄王洪亚刘云生《小型微型计算机系统》2004,25(11):1943-1947

现有的协同检验点方法在移动环境中会带来较大的检验点过程延时 ,不能很好地支持实时事务处理 .提出了一种新的协同并行检验点方法 ,在正常的消息传输过程中 ,通过一点额外的带宽传送事务间检验点依赖关系 ;在某一事务记检验点时 ,尽可能地同时通知相关的事务记检验点 .实验表明 ,该算法对网络带宽没有明显的增加 ,而能大大降低事务记检验点的延时 ,使系统中超截止期的事务比例大大降低相似文献

16.

支持分布式合作实时事务处理的协同检验点方法 总被引：1，自引：0，他引：1

李国徽王洪亚陈基雄刘云生《计算机学报》2004,27(9):1207-1212

在实时事务执行时，事务故障或数据竞争会导致事务重启，为减少事务重启损失的工作量，可以采用检验点技术保证事务的时间正确性．在一类分布式实时数据库应用中，不同结点的事务通过消息交换形成合作关系，为保证合作事务间的全局一致性，当某一事务记检验点时，相关事务也要记检验点．传统协同检验点方法没有考虑应用的定时约束，不能很好地支持分布式合作实时事务处理．该文提出了一种基于图论的协同检验点方法，利用在每个计算结点上为每个合作事务集维护的局部有向图，使用一个基于图论的计算过程标识出应记检验点的事务，该方法既具有最小协同检验点特性，又使全局检验点的时延最小．实验表明该算法减少了全局检验点时延，有利于实时事务截止期的满足．相似文献

17.

Checkpointing with mutable checkpoints

《Theoretical computer science》2003,290(2):1127-1148

There are two approaches to reduce the overhead associated with coordinated checkpointing: first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a non-blocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm which relaxes the non-blocking condition while tries to minimize the blocking time, and a non-blocking algorithm which relaxes the min-process condition while minimizing the number of checkpoints saved on the stable storage. The proposed non-blocking algorithm is based on the concept of “mutable checkpoint”, which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our non-blocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage. 相似文献

18.

A multi-cycle checkpointing protocol that ensures strict 1-rollback

Yi-Wei Ci Zhan Zhang De-Cheng Zuo Zhi-Bo Wu Xiao-Zong Yang 《Information Processing Letters》2012,112(20):788-793

In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery. 相似文献

19.

On coordinated checkpointing in distributed systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(12):1213-1225

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: first is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems 相似文献

20.

Quasi-synchronous checkpointing: Models, characterization, andclassification

Manivannan D. Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):703-713

Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. Synchronizing checkpointing activity involves message overhead and process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation. In asynchronous checkpointing, processes take checkpoints without any coordination with others. Asynchronous checkpointing provides maximum autonomy for processes to take checkpoints; however, some of the checkpoints taken may not lie on any consistent global checkpoint, thus making the checkpointing efforts useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication induced checkpoints besides asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize the quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms belonging to each class. It also provides guidelines for designing and evaluating such algorithms 相似文献