期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Mutable checkpoints: a new checkpointing approach for mobilecomputing systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):157-172

Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper; we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage 相似文献

2.

Checkpointing with mutable checkpoints

《Theoretical computer science》2003,290(2):1127-1148

There are two approaches to reduce the overhead associated with coordinated checkpointing: first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process non-blocking. In our previous work (IEEE Parallel Distributed Systems 9 (12) (1998) 1213), we proved that there does not exist a non-blocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we present a min-process algorithm which relaxes the non-blocking condition while tries to minimize the blocking time, and a non-blocking algorithm which relaxes the min-process condition while minimizing the number of checkpoints saved on the stable storage. The proposed non-blocking algorithm is based on the concept of “mutable checkpoint”, which is neither a tentative checkpoint nor a permanent checkpoint. Based on mutable checkpoints, our non-blocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage. 相似文献

3.

Asynchronous recovery without using vector timestamps

《Journal of Parallel and Distributed Computing》2002,62(12):1695-1728

A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery. 相似文献

4.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class. 相似文献

5.

Checkpoint space reclamation for uncoordinated checkpointing inmessage-passing systems

Yi-Min Wang Pi-Yu Chung In-Jen Lin Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):546-554

Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints where N is the number of processes 相似文献

6.

An index-based checkpointing algorithm for autonomous distributedsystems

Baldoni R. Quaglia F. Fornara P. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):181-192

This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to the ones of previous solutions 相似文献

7.

Quasi-synchronous checkpointing: Models, characterization, andclassification

Manivannan D. Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):703-713

Checkpointing algorithms are classified as synchronous and asynchronous in the literature. In synchronous checkpointing, processes synchronize their checkpointing activities so that a globally consistent set of checkpoints is always maintained in the system. Synchronizing checkpointing activity involves message overhead and process execution may have to be suspended during the checkpointing coordination, resulting in performance degradation. In asynchronous checkpointing, processes take checkpoints without any coordination with others. Asynchronous checkpointing provides maximum autonomy for processes to take checkpoints; however, some of the checkpoints taken may not lie on any consistent global checkpoint, thus making the checkpointing efforts useless. Asynchronous checkpointing algorithms in the literature can reduce the number of useless checkpoints by making processes take communication induced checkpoints besides asynchronous checkpoints. We call such algorithms quasi-synchronous. In this paper, we present a theoretical framework for characterizing and classifying such algorithms. The theory not only helps to classify and characterize the quasi-synchronous checkpointing algorithms, but also helps to analyze the properties and limitations of the algorithms belonging to each class. It also provides guidelines for designing and evaluating such algorithms 相似文献

8.

Adaptive checkpointing in message passing distributed systems

ROBERTO BALDON JEAN-MICHEL HELARYI ACHOUR MOSTEFAOUI MICHEL RAYNAL 《International journal of systems science》2013,44(11):1145-1161

Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks. 相似文献

9.

基于检测点设置依赖图和属性表的卷回恢复算法 总被引：2，自引：0，他引：2

张宇洪炳熔《计算机研究与发展》2001,38(2):246-251

为了解决检测点设置过程中的Domino效应问题及卷回恢复过程中的活锁问题,并最大限度地减小时间开销,提出了基于检测点设置依赖图和属性表的卷回恢复算法。同以前的算法相比较,该算法一方面节省了用于进程之间同步的时间开销,另一方面检测点设置及卷回过程中涉及少量的相关进程。对该算法的正确性进行了证明。相似文献

10.

Staggered consistent checkpointing

Vaidya N.H. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):694-702

A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. The traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce checkpoint overhead. This paper presents a simple approach to arbitrarily stagger the checkpoints. Our approach requires that the processes take consistent logical checkpoints, as compared to consistent physical checkpoints enforced by existing algorithms. Experimental results on nCube-2 are presented 相似文献

11.

一种基于索引的准同步检查点协议 总被引：3，自引：0，他引：3

罗元盛闵应骅张大方《计算机学报》2005,28(10):1620-1625

在基于索引的分布式检查点算法中,尽量减少全局一致性检查点和强制检查点的数目对提高计算效率具有重要意义．该文在已有的基于索引的检查点算法的基础上,提出了一种新的检查点协议,既减少检查点的数目,又使各个进程的检查点之间实时同步,以免程序出错后回卷执行的开销太大,丢失过多有效计算．模拟实验表明,按该文所提协议,平均每条消息导致的强制检查点数比传统方法平均减少23．2％．相似文献

12.

一种优化的分布式系统的失效恢复策略 总被引：1，自引：0，他引：1

刘云龙陈俊亮《计算机学报》1999,22(3):249-257

本文对确定性进程组的分布式系统的失效恢复策略做了深入的研究,独到地提出了应用数据流分析来静态地计算进程的最小备查点数据集的方法。相似文献

13.

支持分布式合作实时事务处理的协同检验点方法 总被引：1，自引：0，他引：1

李国徽王洪亚陈基雄刘云生《计算机学报》2004,27(9):1207-1212

在实时事务执行时，事务故障或数据竞争会导致事务重启，为减少事务重启损失的工作量，可以采用检验点技术保证事务的时间正确性．在一类分布式实时数据库应用中，不同结点的事务通过消息交换形成合作关系，为保证合作事务间的全局一致性，当某一事务记检验点时，相关事务也要记检验点．传统协同检验点方法没有考虑应用的定时约束，不能很好地支持分布式合作实时事务处理．该文提出了一种基于图论的协同检验点方法，利用在每个计算结点上为每个合作事务集维护的局部有向图，使用一个基于图论的计算过程标识出应记检验点的事务，该方法既具有最小协同检验点特性，又使全局检验点的时延最小．实验表明该算法减少了全局检验点时延，有利于实时事务截止期的满足．相似文献

14.

Low-cost checkpointing and failure recovery in mobile computingsystems

Prakash R. Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1996,7(10):1035-1048

A mobile computing system consists of mobile and stationary nodes, connected to each other by a communication network. The presence of mobile nodes in the system places constraints on the permissible energy consumption and available communication bandwidth. To minimize the lost computation during recovery from node failures, periodic collection of a consistent snapshot of the system (checkpoint) is required. Locating mobile nodes contributes to the checkpointing and recovery costs. Synchronous snapshot collection algorithms, designed for static networks, either force every node in the system to take a new local snapshot, or block the underlying computation during snapshot collection. Hence, they are not suitable for mobile computing systems. If nodes take their local checkpoints independently in an uncoordinated manner, each node may have to store multiple local checkpoints in stable storage. This is not suitable for mobile nodes as they have small memory. This paper presents a synchronous snapshot collection algorithm for mobile systems that neither forces every node to take a local snapshot, nor blocks the underlying computation during snapshot collection. If a node initiates snapshot collection, local snapshots of only those nodes that have directly or transitively affected the initiator since their last snapshots need to be taken. We prove that the global snapshot collection terminates within a finite time of its invocation and the collected global snapshot is consistent. We also propose a minimal rollback/recovery algorithm in which the computation at a node is rolled back only if it depends on operations that have been undone due to the failure of node(s). Both the algorithms have low communication and storage overheads and meet the low energy consumption and low bandwidth constraints of mobile computing systems 相似文献

15.

Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

Soyeon Park Seung Ryoul Maeng 《The Journal of supercomputing》2006,35(2):141-154

A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduced a XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applicable to any version of remote copies either backward or forward for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing. This research is supported by KISTEP under the National Research Laboratory program. 相似文献

16.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan Q. Jiang Jianchang Yang M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance as processes may have to wait for long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead. 相似文献

17.

A multi-cycle checkpointing protocol that ensures strict 1-rollback

Yi-Wei Ci Zhan Zhang De-Cheng Zuo Zhi-Bo Wu Xiao-Zong Yang 《Information Processing Letters》2012,112(20):788-793

In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery. 相似文献

18.

An efficient checkpointing method for multicomputers with wormhole routing

Kai Li Jeffrey F. Naughton James S. Plank 《International journal of parallel programming》1991,20(3):159-180

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processorp to continue running the application being checkpointed except during the time thatp is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements. 相似文献

19.

Communication-based prevention of useless checkpoints in distributed computations

J.-M. Hélary A. Mostefaoui R.H.B. Netzer M. Raynal 《Distributed Computing》2000,13(1):29-43

Summary. A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints. Received: July 1997 / Accepted: August 1999 相似文献

20.

网格计算服务系统检查点算法研究

张至柔《计算机工程与设计》2008,29(14)

在由机构内部空闲计算机组成的为计算移动Agent提供服务的网格计算服务系统中减少容错开销,提高计算效率是一个重要的问题.一个具有非封闭、非阻塞、低开销等优势的新检查点算法被提出,且该算法的同步垃圾收集过程可以避免不同进程间在确立新检查点、抛弃旧检查点时的不同步造成的不一致状态.实验结果表明,该算法的开销与系统节点数量呈线性关系. 相似文献