共查询到20条相似文献,搜索用时 15 毫秒
1.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm. 相似文献
2.
Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures 相似文献
3.
This paper describes a compiler-based approach to checkpointing for process recovery. The implementation is transparent to both the programmer and the hardware. The compiler-generated sparse potential checkpoint code maintains the desired checkpoint interval. Adaptive checkpointing reduces the size of the checkpoints. Training is used to select low-cost, high-coverage potential checkpoints. The problem of selecting potential checkpoints is shown to be NP-complete, and a heuristic algorithm is introduced that determines a quick suboptimal solution. These compiler-assisted checkpointing techniques have been implemented in a modified version of the GNU C (GCC) compiler. Experiments involving the modified version of the GCC compiler on a Sun SPARC workstation are summarized. 相似文献
5.
检查点是并行系统中实现容错的重要手段,同步检查点方法已广泛应用在工作站机群系统中。PVM所提供的消息传递机制支持高效的异构网络计算,但不支持客错功能。为了降低同步检查点设置的时间开销,提出了一种基于PVM的准同步检查点设置方法,它吸取了同步检查点方法的优点,又通过消息记录方式实现各节点间独立进行状态保存,大大降低了检查点的同步开销,提高了检查点操作效率,该方法在PVM环境下得以实现,实验结果表明所提出的方法具有较好的客错性能。 相似文献
6.
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: first is to minimize the number of synchronization messages and the number of checkpoints, the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems 相似文献
7.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and increased efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems. 相似文献
8.
One of the key elements required for writing self-healing applications for distributed and dynamic computing environments
is checkpointing. Checkpointing is a mechanism by which an application is made resilient to failures by storing its state
periodically to the disk. The main goal of this research is to enable non-invasive reengineering of existing applications
to insert Application-Level Checkpointing (ALC) mechanism. The Domain-Specific Language (DSL) developed in this research serves
as a perfect means towards this end and is used for obtaining the ALC-specifications from the end-users. These specifications
are used for generating and inserting the actual checkpointing code into the existing application. The performance of the
application having the generated checkpointing code is comparable to the performance of the application in which the checkpointing
code was inserted manually. With slight modifications, the DSL developed in this research can be used for specifying the ALC
mechanism in several base languages (e.g., C/C++, Java, and FORTRAN). 相似文献
9.
Reverse-mode derivative calculations have favorable time cost for many problems. Unfortunately “real world” reverse-mode computations frequently experience prohibitive space costs. To mitigate this space cost, users resort to checkpointing techniques to recompute, rather than save, the necessary values. Injudicious checkpointing, however, can destroy the favorable time performance that made reverse mode attractive in the first place. Consequently, reverse-mode users must spend significant amounts of development time analyzing and developing checkpointing schemes that complement their reverse-mode computation code. In this paper, we describe a particular instance of this checkpointing problem: we were using reverse-mode code generated by Adifor 3.0 to compute derivatives of a large computational fluid dynamics code. Our effort labored under the additional constraint that development time was minimal (as always, it was needed yesterday). Our solution was to use profiling to narrowly focus our checkpoint analysis. This profiling approach worked well for our problem. Furthermore, the profiling idea is sufficiently general that it should work well for other problems. This paper details both our results on our specific problem and guidelines for applying the profiling technique to other checkpoint-based reverse-mode development problems. 相似文献
10.
时间冗余作为容错的重要手段被广泛应用于安全关键实时系统中。传统容错调度算法为失败任务的重运行(Re-execute)预留了大量的空闲时间,但是重运行的使用会降低系统的资源利用率。提出了一种基于检查点机制的容错调度算法CP-PRA,通过降低错误恢复需要的时间,可以有效地提高系统的资源利用率。给出了该算法的可调度奈件,并证明了其算法的正确性。 相似文献
11.
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package.To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability. 相似文献
12.
Rollback-dependency trackability (RDT) is a property stating that all rollback dependencies between local checkpoints are online trackable by using a transitive dependency vector. The most crucial RDT characterizations introduced in the literature can be represented as certain types of RDT-PXCM-paths. Here, let the U-path and V-path be any two types of RDT-PXCM-paths. We investigate several properties of communication-induced checkpointing protocols that ensure the RDT property. First, we prove that if an online RDT protocol encounters a U-path at a point of a checkpoint and communication pattern associated with a distributed computation, it also encounters a V-path there. Moreover, if this encountered U-path is invisibly doubled, the corresponding encountered V-path is invisibly doubled as well. Therefore, we can conclude that breaking all invisibly doubled U-paths is equivalent to breaking all invisibly doubled V-paths for an online RDT protocol. Next, we continue to demonstrate that a visibly doubled U-path must contain a doubled U-cycle in the causal past. These results can further deduce that some different checkpointing protocols actually have the same behavior for all possible patterns. Finally, we present a commendatory systematic technique for comparing the performance of online RDT protocols. 相似文献
13.
This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to the ones of previous solutions 相似文献
14.
Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed 相似文献
15.
It is important to design computer systems to tolerate some failures. This paper proposes two-level recovery schemes, soft checkpoint (SC) and hard checkpoint (HC), which are useful to recover from failures. Soft checkpoint is less reliable and less overhead than those of HC, and is set up between HCs to reduce the overhead of the process. The total expected overhead of one cycle from HC to HC is obtained, using Markov renewal processes, and an optimal interval which minimizes it is computed. It is shown in a numerical example that a two-level recovery scheme can achieve a good performance. 相似文献
16.
In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery. 相似文献
17.
The probability distribution of the overhead caused by the use of the checkpointing rollback recovery technique is evaluated in both cases of a single critical task and of an overall transaction-oriented system. This distribution is obtained in Laplace-Stieltjes transform form, from which all the moments can be easily calculated. Alternatively, inversion methods can be used to evaluate the distribution. The authors propose checkpointing strategies based on the above distribution in order to optimize performance criteria motivated, in the case of critical tasks, by real time constraints, and in the case of transaction-oriented systems, by the need of guaranteeing the users about the maximum system unavailability 相似文献
18.
Uncoordinated checkpointing allows process autonomy and general nondeterministic execution, but suffers from potential domino effects and the associated space overhead. Previous to this research, checkpoint space reclamation had been based on the notion of obsolete checkpoints; as a result, a potentially unbounded number of nonobsolete checkpoints may have to be retained on stable storage. In this paper, we derive a necessary and sufficient condition for identifying all garbage checkpoints. By using the approach of recovery line transformation and decomposition, we develop an optimal checkpoint space reclamation algorithm and show that the space overhead for uncoordinated checkpointing is in fact bounded by N(N+1)/2 checkpoints where N is the number of processes 相似文献
19.
Summary This paper is an application of the theory of Markov renewal and semi regenerative processes into checkpointing problems. Its main practical contribution consists in the analytic expression of mean response time of systems under checkpointing and in the presence of intermittent failures (data bases, file systems ...). 相似文献
20.
Recent advance of virtualization technology provides a new approach to check-point/restart at the virtual machine(VM) level.In contrast to traditional process-level checkpointing,checkpointing at the virtualization layer brings up several advantages,such as compatibility,transparence,flexibility and simplicity.However,because the virtualization layer has little semantic knowledge about the operation system and the applications running atop,VM-layer checkpointing requires saving the entire operating system state rather than a single process.The overhead may render the approach impractical.To reduce the size of VM checkpoint,in this paper we propose a page eviction scheme and an incremental checkpointing mechanism to avoid saving unnecessary VM pages in the checkpoint.To keep the system online transparently,we propose a live checkpointing mechanism by saving the memory image in a copy-on-write(COW) manner.We implement the performance optimization mechanisms in a prototype system,called VMckpt.Experimental results with a group of representative applications show that our page eviction scheme and incremental checkpointing can significantly reduce the checkpoint file size by up to 87% and shorten the total checkpointing/restart time by a factor of up to 71%,in comparison with the Xens default checkpointing mechanism.The observed application downtimes due to checkpointing can be reduced to as small as 300 ms. 相似文献
|