期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class. 相似文献

2.

Checkpointing for distributed databases: starting from the basics

Pilarski S. Kameda T. 《Parallel and Distributed Systems, IEEE Transactions on》1992,3(5):602-610

Checkpointing in a distributed database system is analyzed by establishing a correspondence between consistent snapshots in a general distributed system and transaction-consistent checkpoints in a distributed database system. The analysis culminates in a useful condition for transaction-consistent checkpoints. Based on this condition, a general checkpointing scheme, which records a transaction-consistent set of values of all or some selected data items is presented. These rules are implemented in some representative concurrency control protocols, i.e., those based on two-phase locking and timestamping. These implementations cause little interference with other activities in the database system 相似文献

3.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm. 相似文献

4.

An efficient checkpointing method for multicomputers with wormhole routing

Kai Li Jeffrey F. Naughton James S. Plank 《International journal of parallel programming》1991,20(3):159-180

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processorp to continue running the application being checkpointed except during the time thatp is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements. 相似文献

5.

Adaptive checkpointing in message passing distributed systems

ROBERTO BALDON JEAN-MICHEL HELARYI ACHOUR MOSTEFAOUI MICHEL RAYNAL 《International journal of systems science》2013,44(11):1145-1161

Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint’' (such a local checkpoint is not useless). The resulting algorithm has the nice property to reduce the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Futhermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks. 相似文献

6.

Communication-based prevention of useless checkpoints in distributed computations

J.-M. Hélary A. Mostefaoui R.H.B. Netzer M. Raynal 《Distributed Computing》2000,13(1):29-43

Summary. A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints. Received: July 1997 / Accepted: August 1999 相似文献

7.

支持分布式合作实时事务处理的协同检验点方法 总被引：1，自引：0，他引：1

李国徽王洪亚陈基雄刘云生《计算机学报》2004,27(9):1207-1212

在实时事务执行时，事务故障或数据竞争会导致事务重启，为减少事务重启损失的工作量，可以采用检验点技术保证事务的时间正确性．在一类分布式实时数据库应用中，不同结点的事务通过消息交换形成合作关系，为保证合作事务间的全局一致性，当某一事务记检验点时，相关事务也要记检验点．传统协同检验点方法没有考虑应用的定时约束，不能很好地支持分布式合作实时事务处理．该文提出了一种基于图论的协同检验点方法，利用在每个计算结点上为每个合作事务集维护的局部有向图，使用一个基于图论的计算过程标识出应记检验点的事务，该方法既具有最小协同检验点特性，又使全局检验点的时延最小．实验表明该算法减少了全局检验点时延，有利于实时事务截止期的满足．相似文献

8.

Asynchronous recovery without using vector timestamps

《Journal of Parallel and Distributed Computing》2002,62(12):1695-1728

A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery. 相似文献

9.

一种高效的合作实时事务并行检验点算法

李国徽王洪亚刘云生《计算机科学》2005,32(7):69-71

许多数据和活动上都有很强时间性的应用在地理上同时具有分布性,这种应用需求使得分布式实时数据库的研完成为数据库研究领域的热点。在实时事务执行时,事务故障或数据竞争会导致事务重启,为了减少因重启而损失的工作量,可以采用检验点技术以利于事务时间正确性的满足。在一些分布式实时数据库应用中,不同结点的事务通过消息交换形成合作关系,当某一事务记检验点时,为保证合作事务间的全局一致性,相关事务也要相应地记检验点。传统的协同检验点方法没有考虑应用的定时约束,不能很好地支持分布式实时事务处理。本文提出了一种高效的并行协同检验点方法,该算法既具有最小协同检验点特性又使全局检验点过程延时最小。实验表明该算法减少了全局检验点阻塞时间,有利于分布式实时事务截止期的满足。相似文献

10.

An index-based checkpointing algorithm for autonomous distributedsystems

Baldoni R. Quaglia F. Fornara P. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(2):181-192

This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to the ones of previous solutions 相似文献

11.

面向更新密集型应用的内存数据库高效检查点技术

覃雄派肖艳芹曹巍王珊《计算机学报》2009,32(11)

面向更新密集型应用的内存数据库系统,其检查点技术应符合几个关键的要求,包括检查点操作对正常事务处理的干扰尽可能小、能够处理存取倾斜状况、支持数据库系统的快速恢复、提供恢复过程中的系统可用性等.该文提出一种事务一致的分区检查点技术,采用基于元组的动态多版本并发控制机制,避免了读写事务的加锁冲突,提高系统吞吐能力;检查点操作以只读事务形式实现,存多版本并发控制下,避免检查点操作对正常事务处理的堵塞;由于检查点文件是事务一致的,只需要记录事务的Redo 日志信息,在系统恢复过程中,只需要对日志文件进行一遍扫描处理,加快恢复过程;基于优先级的数据分区装载和恢复,使得恢复过程中新事务的数据存取请求迅速得到满足,保证了恢复过程中的系统可用性.由于采用两级版本管理机制以及动态版本共享技术,多版本管理的空间开销降低到可以接受的水平.实验结果表明,文中提出的检查点技术方案获得比模糊检查点技术高27%的系统吞吐量,同时版本管理的空间开销在可接受的范围之内,满足高性能应用的要求. 相似文献

12.

A Survey of Distributed Database Checkpointing

Jun-Lin Lin Margaret H. Dunham 《Distributed and Parallel Databases》1997,5(3):289-319

Checkpointing a database is a vital technique to reduce the recovery time in the presence of a failure. For distributed databases, checkpointing also provides an efficient way to perform global reconstruction. In this paper, we survey and classify previous approaches for checkpointing a distributed database. Since the need for global reconstruction is infrequent in most distributed databases, a less restrictive and less resource-consuming approach to checkpoint distributed databases in an integrated distributed database system is recommended over a transaction consistent checkpoint approach. For a federated or multidatabase system, any type of global consistent checkpoint is difficult to achieve without violating local autonomy. 相似文献

13.

On coordinated checkpointing in distributed systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(12):1213-1225

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: first is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems 相似文献

14.

Application Level Fault Tolerance in Heterogeneous Networks of Workstations 总被引：2，自引：0，他引：2

Adam Beguelin Erik Seligman Peter Stephan 《Journal of Parallel and Distributed Computing》1997,43(2):2078

We have explored methods for checkpointing and restarting processes within the distributed object migration environment (Dome), a C++ library of data parallel objects that are automatically distributed over heterogeneous networks of workstations (NOWs). System level checkpointing methods, although transparent to the user, were rejected because they lack support for heterogeneity. We have implemented application level checkpointing which places the checkpoint and restart mechanisms within Dome's C++ objects. Application level checkpointing has been implemented with a library-based technique for the programmer and a more transparent preprocessor-based technique. Dome's implementation of checkpointing successfully checkpoints and restarts processes on different numbers of machines and different architectures. Results from executing Dome programs across a NOW with realistic failure rates have been experimentally determined and are compared with theoretical results. The overhead of checkpointing is found to be low, while providing substantial decreases in expected runtime on realistic systems. 相似文献

15.

Finding consistent global checkpoints in a distributed computation

Manivannan D. Netzer R.H.B. Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(6):623-627

Consistent global checkpoints have many uses in distributed computations. A central question in applications that use consistent global checkpoints is to determine whether a consistent global checkpoint that includes a given set of local checkpoints can exist. Netzer and Xu (1995) presented the necessary and sufficient conditions under which such a consistent global checkpoint can exist, but they did not explore what checkpoints could be constructed. In this paper, we prove exactly which local checkpoints can be used for constructing such consistent global checkpoints. We illustrate the use of our results with a simple and elegant algorithm to enumerate all such consistent global checkpoints 相似文献

16.

Consistency issues in distributed checkpoints

Helary J.-M. Netzer R.H.B. Raynal M. 《IEEE transactions on pattern analysis and machine intelligence》1999,25(2):274-281

A global checkpoint is a set of local checkpoints, one per process. The traditional consistency criterion for global checkpoints states that a global checkpoint is consistent if it does not include messages received and not sent. The paper investigates other consistency criteria, transitlessness, and strong consistency. A global checkpoint is transitless if it does not exhibit messages sent and not received. Transitlessness can be seen as a dual of traditional consistency. Strong consistency is the addition of transitlessness to traditional consistency. The main result of the paper is a statement of the necessary and sufficient condition answering the following question: “given an arbitrary set of local checkpoints, can this set be extended to a global checkpoint that satisfies P” (where P is traditional consistency, transitlessness, or strong consistency). From a practical point of view, this condition, when applied to transitlessness, is particularly interesting as it helps characterize which messages do not need to be recorded by checkpointing protocols 相似文献

17.

Checkpoint space reclamation for uncoordinated checkpointing inmessage-passing systems

Yi-Min Wang Pi-Yu Chung In-Jen Lin Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1995,6(5):546-554

Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints where N is the number of processes 相似文献

18.

Mutable checkpoints: a new checkpointing approach for mobilecomputing systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):157-172

Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channel, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: First is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal previously until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper; we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage 相似文献

19.

Error recovery in shared memory multiprocessors using privatecaches

Wu K.-L. Fuchs W.K. Patel J.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):231-240

The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented 相似文献

20.

Low-latency, concurrent checkpointing for parallel programs 总被引：2，自引：0，他引：2

Kai Li Naughton J.F. Plank J.S. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(8):874-879

Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed 相似文献