Similar literature
1.
A consistent checkpointing algorithm with short freezing time (SFT) is presented in this paper. It supports fault tolerance in distributed systems. The algorithm has a shorter freezing time, lower overhead, and simple recovery. To shorten checkpoint time, a special control message (Munblock) is used to ensure that a process can respond to the checkpoint event quickly at any given time. Moreover, a main-memory algorithm is used to improve the concurrency of checkpointing. With SFT, the freezing time caused by checkpointing is less than 0.03 s. Furthermore, the number of control messages in SFT is only O(n).

2.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to the current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.
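
The sequence-number mechanism in this abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `Process` class, its fields, and the integer "state" are hypothetical, and the final step (flushing to stable storage once every process is known to have checkpointed) is left out. It only shows a process learning of a new initiation from a piggybacked sequence number and taking its tentative checkpoint after processing the message.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model; all names here are illustrative."""
    pid: int
    state: int = 0                                    # stand-in for application state
    seq: int = 0                                      # sequence number of the latest checkpoint
    checkpoints: dict = field(default_factory=dict)   # seq -> tentative checkpoint
    log: list = field(default_factory=list)           # in-memory message log

    def _take_tentative(self, seq):
        """Save the current state under `seq` and start a fresh message log."""
        self.seq = seq
        self.checkpoints[seq] = self.state
        self.log = []

    def initiate(self):
        """Independently initiate a global checkpoint."""
        self._take_tentative(self.seq + 1)

    def send(self, payload):
        """Piggyback the current sequence number on an application message."""
        if self.seq > 0:                  # logging starts after a tentative checkpoint
            self.log.append(("sent", payload))
        return (self.seq, payload)

    def receive(self, msg):
        sender_seq, payload = msg
        self.state += payload             # process the message first
        if sender_seq > self.seq:
            # A new initiation was piggybacked: checkpoint AFTER processing.
            self._take_tentative(sender_seq)
        elif self.seq > 0:
            self.log.append(("received", payload))
```

In this sketch the receiver's checkpoint already reflects the triggering message, while the sender's checkpoint does not record sending it. That gap is covered by the sender's message log, which is why the abstract treats the tentative checkpoints together with the message logs as the consistent global checkpoint.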

3.
Diskless checkpointing   Total citations: 4, self-citations: 0, citations by others: 4
Diskless checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
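
The abstract does not say how the checkpoints are protected once stable storage is removed. One way to do it, assumed here purely for illustration, is to keep an XOR parity of the processes' in-memory checkpoints on a spare node, so that a single lost checkpoint can be rebuilt from the survivors. The helper names below are hypothetical, and all state buffers are assumed to have the same length.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(states: list[bytes]) -> bytes:
    """Parity that a spare node would hold in memory (assumed scheme)."""
    return reduce(xor_bytes, states)


def recover(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost state from the surviving states and the parity."""
    return reduce(xor_bytes, surviving, parity)
```

A sketch like this tolerates only one failure per parity group. Surviving several simultaneous failures would need a stronger encoding.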

4.
Checkpointing is a basic mechanism for backward error recovery in fault-tolerant systems. A checkpointed process periodically stops execution and saves its state to files. To reduce the file sizes, only data modified between two consecutive checkpoint times is saved. However, existing approaches do not consider operating system paging activity, which, if ignored, may double the number of disk accesses required to checkpoint non-resident dirty pages. In this paper, we propose continuous checkpointing, which combines the checkpoint facility with virtual memory paging operations. Thus, checkpointing is continuous during the lifetime of a process without extra overhead. Checkpoint size is no longer proportional to application size, but rather is bounded by resident dirty pages. Experimental results show that disk accesses can be reduced by about 80% when checkpointing large applications. © 1997 John Wiley & Sons, Ltd.

5.
This paper describes a compiler-based approach to checkpointing for process recovery. The implementation is transparent to both the programmer and the hardware. The compiler-generated sparse potential checkpoint code maintains the desired checkpoint interval. Adaptive checkpointing reduces the size of the checkpoints. Training is used to select low-cost, high-coverage potential checkpoints. The problem of selecting potential checkpoints is shown to be NP-complete, and a heuristic algorithm is introduced that determines a quick suboptimal solution. These compiler-assisted checkpointing techniques have been implemented in a modified version of the GNU C (GCC) compiler. Experiments involving the modified version of the GCC compiler on a Sun SPARC workstation are summarized.

6.
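Non-volatile processors (NVPs) can recover quickly in self-powered environments, which makes them well suited to applications such as the Internet of Things. Checkpointing (backup) is the core technique underpinning NVPs. However, existing checkpointing strategies assume that the NVP operates in an ideal environment: they account only for factors such as unstable energy input and ignore the impact of malicious external attacks on NVP security. For example, an attacker may tamper with register contents during checkpointing and crash the system, or tamper with the data written to non-volatile memory during checkpointing and make it untrustworthy. These threats hinder the use of NVPs in safety-critical fields such as wearable medical devices. This paper surveys the security threats that arise during checkpointing in the latest NVPs with a retention state and proposes corresponding countermeasures.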

7.
8.
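User-directed multi-level hybrid checkpointing and its performance optimization   Total citations: 2, self-citations: 0, citations by others: 2
Checkpointing is a typical and effective software fault-tolerance technique. Building on a comprehensive study of existing checkpoint implementation techniques, this paper designs a user-directed multi-level hybrid checkpointing model, uHybcr, and implements it on IA64 Linux. Finally, comparative tests verify the performance gains brought by the user-directed mechanism.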

9.
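Checkpointing is an important means of achieving fault tolerance in parallel systems, and synchronous checkpointing has been widely used in workstation clusters. The message-passing mechanism provided by PVM supports efficient heterogeneous network computing but offers no fault tolerance. To reduce the time overhead of synchronous checkpointing, this paper proposes a PVM-based quasi-synchronous checkpointing method. It retains the advantages of synchronous checkpointing while using message logging to let each node save its state independently, which greatly reduces synchronization overhead and improves checkpointing efficiency. The method has been implemented in a PVM environment, and experimental results show that it provides good fault-tolerance performance.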

10.
Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal in previous years until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.

11.
In this paper, the hard problem of thorough garbage collection in uncoordinated checkpointing algorithms is studied. After introducing the traditional garbage collection scheme, with which only obsolete checkpoints can be discarded, we show that this traditional method may fail to discard any checkpoint in some special cases, and that it is necessary and urgent to find a thorough garbage collection method, with which all checkpoints that are useless for any future rollback recovery, including the obsolete ones, can be discarded. Then the Thorough Garbage Collection Theorem is proposed and proved. It ensures the feasibility of thorough garbage collection and also gives a method to calculate the set of useful checkpoints.

12.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
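
The core idea of hash-based incremental checkpointing can be sketched in a few lines of Python. This is not libhashckpt's interface or implementation: it hashes on the CPU with `hashlib`, omits the page-protection and GPU parts entirely, and the function names and the 4096-byte page size are assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Hash each fixed-size page of the application data."""
    return [hashlib.sha256(data[i:i + page_size]).digest()
            for i in range(0, len(data), page_size)]


def incremental_checkpoint(data: bytes, previous_hashes: list[bytes],
                           page_size: int = PAGE_SIZE):
    """Return (pages that changed since the last checkpoint, new hash list)."""
    hashes = page_hashes(data, page_size)
    changed = {}
    for idx, digest in enumerate(hashes):
        # A page is saved if it is new or its hash differs from last time.
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            changed[idx] = data[idx * page_size:(idx + 1) * page_size]
    return changed, hashes
```

The first call, with an empty `previous_hashes`, saves every page. Later calls save only the pages whose contents changed, which is where the reduction in checkpoint size comes from.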

13.
A technique for non-invasive application-level checkpointing   Total citations: 1, self-citations: 1, citations by others: 0
One of the key elements required for writing self-healing applications for distributed and dynamic computing environments is checkpointing. Checkpointing is a mechanism by which an application is made resilient to failures by storing its state periodically to the disk. The main goal of this research is to enable non-invasive reengineering of existing applications to insert an Application-Level Checkpointing (ALC) mechanism. The Domain-Specific Language (DSL) developed in this research serves this end and is used for obtaining the ALC specifications from the end users. These specifications are used for generating and inserting the actual checkpointing code into the existing application. The performance of the application having the generated checkpointing code is comparable to the performance of the application in which the checkpointing code was inserted manually. With slight modifications, the DSL developed in this research can be used for specifying the ALC mechanism in several base languages (e.g., C/C++, Java, and FORTRAN).

14.
This paper presents an index-based checkpointing algorithm for distributed systems with the aim of reducing the total number of checkpoints while ensuring that each checkpoint belongs to at least one consistent global checkpoint (or recovery line). The algorithm is based on an equivalence relation defined between pairs of successive checkpoints of a process which allows us, in some cases, to advance the recovery line of the computation without forcing checkpoints in other processes. The algorithm is well-suited for autonomous and heterogeneous environments, where each process does not know any private information about other processes and private information of the same type of distinct processes is not related (e.g., clock granularity, local checkpointing strategy, etc.). We also present a simulation study which compares the checkpointing-recovery overhead of this algorithm to that of previous solutions.

15.
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package. To evaluate the prototype, we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

16.
On properties of RDT communication-induced checkpointing protocols   Total citations: 1, self-citations: 0, citations by others: 1
Rollback-dependency trackability (RDT) is a property stating that all rollback dependencies between local checkpoints are online trackable by using a transitive dependency vector. The most crucial RDT characterizations introduced in the literature can be represented as certain types of RDT-PXCM-paths. Here, let the U-path and V-path be any two types of RDT-PXCM-paths. We investigate several properties of communication-induced checkpointing protocols that ensure the RDT property. First, we prove that if an online RDT protocol encounters a U-path at a point of a checkpoint and communication pattern associated with a distributed computation, it also encounters a V-path there. Moreover, if this encountered U-path is invisibly doubled, the corresponding encountered V-path is invisibly doubled as well. Therefore, we can conclude that breaking all invisibly doubled U-paths is equivalent to breaking all invisibly doubled V-paths for an online RDT protocol. Next, we demonstrate that a visibly doubled U-path must contain a doubled U-cycle in the causal past. From these results, it can further be deduced that some different checkpointing protocols actually have the same behavior for all possible patterns. Finally, we present a commendatory systematic technique for comparing the performance of online RDT protocols.

17.
Low-latency, concurrent checkpointing for parallel programs   Total citations: 2, self-citations: 0, citations by others: 2
Presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.

18.
Determining consistent global checkpoints is common to many distributed problems such as fault-tolerance, distributed debugging, properties detection, etc. Uncoordinated and coordinated checkpointing algorithms have been traditionally used for such determinations. This paper addresses a third technique, namely adaptive checkpointing, that has recently emerged. This technique assumes processes take local checkpoints independently and requires them to take additional local checkpoints in order that all local checkpoints be members of some consistent global checkpoint. We first study the characteristics of such adaptive algorithms. Then, a general adaptive checkpointing algorithm is designed from a condition, first stated by Netzer and Xu, that answers the following question: ‘does a given local checkpoint belong to a consistent global checkpoint?’ (such a local checkpoint is not useless). The resulting algorithm has the nice property of reducing the number of additional local checkpoints taken to ensure the property ‘no local checkpoint is useless’. Furthermore, it provides each local checkpoint with a consistent global checkpoint to which it belongs. Compared to uncoordinated and coordinated checkpointing algorithms, this algorithm combines the advantages of both without inheriting their drawbacks.

19.
It is important to design computer systems to tolerate some failures. This paper proposes a two-level recovery scheme with soft checkpoints (SC) and hard checkpoints (HC), which are useful for recovering from failures. A soft checkpoint is less reliable and incurs less overhead than a hard checkpoint, and is set up between HCs to reduce the overhead of the process. The total expected overhead of one cycle from HC to HC is obtained using Markov renewal processes, and an optimal interval which minimizes it is computed. A numerical example shows that a two-level recovery scheme can achieve good performance.

20.
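Time redundancy is widely used as an important fault-tolerance technique in safety-critical real-time systems. Traditional fault-tolerant scheduling algorithms reserve a large amount of slack time for re-executing failed tasks, but re-execution lowers the system's resource utilization. This paper proposes CP-PRA, a fault-tolerant scheduling algorithm based on checkpointing, which effectively improves resource utilization by reducing the time needed for error recovery. The schedulability condition of the algorithm is given, and the algorithm is proved correct.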
