共查询到20条相似文献,搜索用时 0 毫秒
1.
Checkpointing systems are a convenient way for users to make their programs fault‐tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well‐known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt , which works on Unix‐based workstations, and CLIP , which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long‐running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead. Copyright © 1999 John Wiley & Sons, Ltd. 相似文献
2.
Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols. 相似文献
3.
A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long‐running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost to performance ratio; however, due to being created using commodity components they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long‐running applications. Several fault tolerance approaches were investigated; however, it was found that rollback‐recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback‐recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include; providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd. 相似文献
4.
In this paper,the hard problem of the thorough garbage collection in uncoordinated checkpointing algorithms is studied.After introduction of the traditional garbage collecting scheme,with which only obsolete checkpoints can be discarded,it is shown that this kind of traditional method may fail to discard any checkpoint in some special cases,and it is necessary and urgent to find a thorough garbage collecting method,with which all the checkpoints useless for any future rollback-recovery including the obsolete ones can be discarded.Then,th Thorough Garbage Collection Theorem is proposed and proved,which ensures th feasibility of the thorough garbage collection,and gives the method to calculate the set of the useful checkpoints as well. 相似文献
5.
Roll-forward recovery schemes were proposed to enhance the performance of fault tolerant systems employing checkpointing approach.
In the roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper
proposes thesample comparison approach along with the checkpointing, which further improves the performance by reducing the overhead imposed by the checkpointing.
We also develop general analytical models for estimating the availability, which are applicable for any checkpointing scheme.
Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than
those of the schemes without it, while the required checkpoint interval is larger.
This research was supported in part by the MIC (Ministry of Information and Communication), Korea, under the ITRC support
program supervised by the UTA and CUCN 21st Century Frontier R&D Program. 相似文献
6.
基于PVM的协调检查点设置关键技术 总被引:1,自引:0,他引:1
本文论述了基于PVM的并行程序运行回卷恢复系统设计和实现过程中的退出再加入PVM机制、任务号隐式映射机制、任务结束前同步机制、防止PVM库重入机制,信号与消息协同触发机制、应用任务初始化机制以及作为前述各机制实现基础的函数包裹与换名机制等关键技术。这些技术已经成功地应用于我们自主开发的ChaRM系统中,证明了技术的正确性和有效性。 相似文献
7.
Panagiotis Katsaros Lefteris Angelis Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63
Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献
8.
多米诺效应的解决策略研究 总被引:1,自引:0,他引:1
定义了备查点间隔之间的先于关系,并对分布式系统执行的语义正确性进行了约束,证明了逆时先于现象是产生多米诺效应的本质,提出了多米诺避免、多米诺检测与消除、多米诺容忍三大解决策略. 相似文献
9.
基于数据流分析的软件容错策略 总被引:4,自引:1,他引:4
该文就软件容错中备查点与卷回机制展开深入讨论,提出一种基于数据流分析技术的软件容错新方法.首先对软件容错进行简介,指出数据错是一切控制系统软件失效的根源与最终表现以及对数据采取强有力的容错措施的必要性.然后将数据流分析技术应用于软件容错,通过求解程序变量的到达-定值数据流方程来静态地确定任何数据在任何引用点出错时的最小充分卷回,通过求解活跃变量的数据流方程来静态地确定程序在执行各个基本块时需动态保存的变量集合,得出最小充分卷回定理与备查点数据范围定理,从而解决了时间冗余容错途径中必须回答的两个基本问题.此外,还给出了恢复块定义有效的充分条件.最后,以电信系统为应用实例,介绍了该方法的一种具体实施.该方法在简单地扩展后可被广泛应用于各种容错软件的设计中. 相似文献
10.
The existing user‐level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single‐threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter‐thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd. 相似文献
11.
Eugen FellerAuthor Vitae John Mehnert-SpahnAuthor Vitae Michael SchoettnerAuthor Vitae Christine MorinAuthor Vitae 《Future Generation Computer Systems》2012,28(1):163-170
The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package.To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability. 相似文献
12.
A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead
because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network.
For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during
failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a
node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also
introduced a XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution,
can be applicable to any version of remote copies either backward or forward for recovery. Our scheme reduces the checkpointing
overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing.
This research is supported by KISTEP under the National Research Laboratory program. 相似文献
13.
14.
15.
16.
基于检测点设置依赖图和属性表的卷回恢复算法 总被引:2,自引:0,他引:2
为了解决检测点设置过程中的Domino效应问题及卷回恢复过程中的活锁问题,并最大限度地减小时间开销,提出了基于检测点设置依赖图和属性表的卷回恢复算法。同以前的算法相比较,该算法一方面节省了用于进程之间同步的时间开销,另一方面检测点设置及卷回过程中涉及少量的相关进程。对该算法的正确性进行了证明。 相似文献
17.
一种优化的分布式系统的失效恢复策略 总被引:1,自引:0,他引:1
本文对确定性进程组的分布式系统的失效恢复策略做了深入的研究,独到地提出了应用数据流分析来静态地计算进程的最小备查点数据集的方法。 相似文献
18.
为了解决分布式计算系统回卷恢复容错的验证评估问题,设计一种分布式计算系统的回卷恢复容错算法的仿真机制,依据分布式计算系统回卷恢复容错的总体架构,将分布式计算系统中的节点任务过程使用离散事件模拟,在网络系统仿真工具的应用层增加支持多任务回卷恢复容错仿真的模块,并设计用于回卷恢复容错仿真的结构、功能模块和系统参数设定。结果表明本文提出的仿真机制能够实现分布式计算系统的回卷恢复容错算法的模拟验证,为不同容错算法间对比、改进与优化提供参照。
相似文献
19.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm. 相似文献
20.
If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms are proposed in a ring topology. The proposed data fault detection and correction algorithms can handle data faults; at most one per process, but in any number of processes. The proposed checkpointing algorithm can deal with concurrent multiple initiations of checkpointing and data faults. A process can recover from a fault, using the proposed recovery algorithm in spite of multiple data faults present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes. The algorithm can be extended to work for general topologies too. 相似文献