1.

Memory exclusion: optimizing the performance of checkpointing systems

James S. Plank Yuqun Chen Kai Li Micah Beck Gerry Kingsley 《Software》1999,29(2):125-142

Checkpointing systems are a convenient way for users to make their programs fault‐tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well‐known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix‐based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long‐running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead. Copyright © 1999 John Wiley & Sons, Ltd.
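As a rough illustration of one form of memory exclusion (leaving out pages that have not changed since the previous checkpoint), the sketch below models memory as fixed-size pages and saves only the pages whose contents differ. It is not taken from libckpt or CLIP: the names `PAGE_SIZE`, `split_pages`, and `incremental_checkpoint`, and the use of hashing to detect changes, are assumptions made for this example. Practical checkpointers may detect modified pages by other means, such as page protection.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page size for this sketch


def split_pages(memory: bytes) -> list[bytes]:
    """Split a flat memory image into fixed-size pages."""
    return [memory[i:i + PAGE_SIZE] for i in range(0, len(memory), PAGE_SIZE)]


def incremental_checkpoint(memory: bytes, prev_digests: dict[int, bytes]):
    """Return (saved_pages, new_digests), saving only pages that changed.

    Pages whose digest matches the previous checkpoint are excluded.
    """
    saved = {}
    digests = {}
    for index, page in enumerate(split_pages(memory)):
        digest = hashlib.sha256(page).digest()
        digests[index] = digest
        if prev_digests.get(index) != digest:
            saved[index] = page
    return saved, digests


# Usage: the first checkpoint saves every page, the second only the dirty one.
image = bytearray(4 * PAGE_SIZE)
first, digests = incremental_checkpoint(bytes(image), {})
image[PAGE_SIZE + 10] = 0xFF  # modify one byte in page 1
second, digests = incremental_checkpoint(bytes(image), digests)
```

In this toy run the second checkpoint contains a single page instead of four, which is the kind of space saving the abstract refers to.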

2.

FINE: A Fully Informed aNd Efficient communication-induced checkpointing protocol for distributed systems

Yi Luo D. Manivannan 《Journal of Parallel and Distributed Computing》2009

Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols.

3.

A survey and review of the current state of rollback‐recovery for cluster systems

Andrew Maloney Andrzej Goscinski 《Concurrency and Computation》2009,21(12):1632-1666

A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long‐running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost-to-performance ratio; however, because they are built from commodity components, they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long‐running applications. Several fault tolerance approaches were investigated; however, it was found that rollback‐recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback‐recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.

4.

Garbage collection in uncoordinated checkpointing algorithms

LIU Yunlong CHEN Junliang 《计算机科学技术学报》1999,14(3):242-249

In this paper, the hard problem of thorough garbage collection in uncoordinated checkpointing algorithms is studied. After introducing the traditional garbage collection scheme, with which only obsolete checkpoints can be discarded, it is shown that this traditional method may fail to discard any checkpoint in some special cases, and that it is necessary and urgent to find a thorough garbage collection method that discards all checkpoints useless for any future rollback-recovery, including the obsolete ones. Then, the Thorough Garbage Collection Theorem is proposed and proved, which ensures the feasibility of thorough garbage collection and also gives a method for calculating the set of useful checkpoints.

5.

A new approach for high performance computing systems with various checkpointing schemes

Gyung-Leen Park Hee Yong Youn 《The Journal of supercomputing》2005,33(1-2):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems employing a checkpointing approach. In the roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach along with checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger. This research was supported in part by the MIC (Ministry of Information and Communication), Korea, under the ITRC support program supervised by the UTA and CUCN 21st Century Frontier R&D Program.

6.

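Key techniques for PVM-based coordinated checkpointing (Cited by: 1; self-citations: 0; citations by others: 1)

王春露 汪东升 《小型微型计算机系统》2002,23(5):524-528

This paper discusses the key techniques used in designing and implementing a rollback-recovery system for PVM-based parallel programs: a mechanism for exiting and rejoining PVM, implicit mapping of task IDs, synchronization before task termination, prevention of re-entry into the PVM library, cooperative triggering by signals and messages, application task initialization, and the function wrapping and renaming mechanism on which all of the preceding mechanisms are built. These techniques have been applied successfully in ChaRM, a system developed by the authors, which demonstrates their correctness and effectiveness.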

7.

Performance and effectiveness trade‐off for checkpointing in fault‐tolerant distributed systems

Panagiotis Katsaros Lefteris Angelis Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63

Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.

8.

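A study of strategies for resolving the domino effect (Cited by: 1; self-citations: 0; citations by others: 1)

刘云龙 陈俊亮 《软件学报》1998,9(12):942-945

A precedence relation between checkpoint intervals is defined, and constraints are placed on the semantic correctness of distributed system execution. It is proved that reverse-time precedence is the essential cause of the domino effect, and three resolution strategies are proposed: domino avoidance, domino detection and elimination, and domino tolerance.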

9.

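A software fault-tolerance strategy based on data-flow analysis (Cited by: 4; self-citations: 1; citations by others: 4)

刘云龙 陈俊亮 《软件学报》1998,9(7):537-541

This paper discusses in depth the checkpoint and rollback mechanisms used in software fault tolerance and proposes a new software fault-tolerance method based on data-flow analysis. It first gives a brief introduction to software fault tolerance, pointing out that data errors are the root cause and the ultimate manifestation of all control-system software failures, and that strong fault-tolerance measures for data are therefore necessary. Data-flow analysis is then applied to software fault tolerance: solving the reaching-definitions data-flow equations for program variables statically determines the minimal sufficient rollback when any datum is found to be erroneous at any point of reference, and solving the live-variable data-flow equations statically determines the set of variables that must be saved dynamically as the program executes each basic block. This yields the Minimal Sufficient Rollback Theorem and the Checkpoint Data Range Theorem, which answer the two basic questions that any time-redundancy approach to fault tolerance must address. In addition, a sufficient condition for a recovery block definition to be valid is given. Finally, a concrete implementation of the method is described, using a telecommunication system as the application example. With simple extensions, the method can be widely applied to the design of various kinds of fault-tolerant software.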
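The abstract above refers to solving live-variable data-flow equations for each basic block. The following minimal sketch shows the standard textbook iteration for those equations, IN[B] = use[B] ∪ (OUT[B] − def[B]) and OUT[B] = ∪ IN[S] over the successors S of B. It is not the authors' algorithm: the function `live_variables`, the three-block program, and the reading that a variable live at a block boundary is a candidate for saving are all assumptions made for illustration.

```python
def live_variables(blocks, successors):
    """Iteratively solve the live-variable data-flow equations.

    blocks:     {name: (use_set, def_set)} for each basic block
    successors: {name: [successor names]}
    Returns {name: (live_in, live_out)}.
    """
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:  # repeat until a fixed point is reached
        changed = False
        for b, (use, define) in blocks.items():
            out = set()
            for s in successors.get(b, []):
                out |= live_in[s]
            new_in = use | (out - define)
            if out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = out, new_in
                changed = True
    return {b: (live_in[b], live_out[b]) for b in blocks}


# Hypothetical three-block program:
#   B1: x = 1; y = 2      B2: z = x + 1      B3: print(z)
blocks = {
    "B1": (set(), {"x", "y"}),
    "B2": ({"x"}, {"z"}),
    "B3": ({"z"}, set()),
}
successors = {"B1": ["B2"], "B2": ["B3"], "B3": []}
result = live_variables(blocks, successors)
```

In this example `y` is never live after B1, so under the assumed reading a checkpoint taken at the end of B1 would only need to keep `x`.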

10.

Reliable user‐level rollback recovery implementation for multithreaded processes on windows

Jin‐Min Yang Da‐Fang Zhang Xue‐Dong Yang Wen‐Wei Li 《Software》2007,37(3):331-346

The existing user‐level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single‐threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter‐thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd.

11.

Independent checkpointing in a heterogeneous grid environment

Eugen Feller John Mehnert-Spahn Michael Schoettner Christine Morin 《Future Generation Computer Systems》2012,28(1):163-170

The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package. To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

12.

Log-Based Rollback Recovery without Checkpoints of Shared Memory in Software DSM

Soyeon Park Seung Ryoul Maeng 《The Journal of supercomputing》2006,35(2):141-154

A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focused on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduced an XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applied to any version of remote copies, either backward or forward, for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing. This research is supported by KISTEP under the National Research Laboratory program.
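The reason an XOR diff can be applied either backward or forward is that XOR is its own inverse: if d = old XOR new, then old XOR d = new and new XOR d = old. The sketch below demonstrates only that property on two byte strings. The function names `xor_diff` and `apply_diff` and the sample page contents are assumptions for illustration, not the paper's interface.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Byte-wise XOR of two equal-length page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same length")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff; the same call moves a page forward or backward."""
    return xor_diff(page, diff)


# Hypothetical page contents before and after an update.
v1 = b"shared page v1"
v2 = b"shared page v2"
diff = xor_diff(v1, v2)

forward = apply_diff(v1, diff)   # reconstructs v2 from v1
backward = apply_diff(v2, diff)  # reconstructs v1 from v2
```

A single logged diff therefore suffices whichever of the two versions the surviving remote copy happens to hold.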

13.

Design and Evaluation of a Low-Latency Checkpointing Scheme for Mobile Computing Systems

Li Guohui; Shu LihChyun 《Computer Journal》2006,49(5):527-540

14.

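A coordinated checkpointing algorithm with O(n) message complexity (Cited by: 9; self-citations: 0; citations by others: 9)

汪东升 邵明珑 《软件学报》2003,14(1):43-48

As an effective means of fault tolerance, coordinated checkpointing and rollback recovery have been widely used in clusters and other parallel/distributed computer systems. To further reduce the time and space overhead of coordinated checkpointing, a coordinated checkpointing algorithm based on message counting is proposed. The algorithm makes no assumption about FIFO behavior of the underlying message channels, and it reduces the complexity of the control messages introduced in the synchronization phase from the usual O(n²) to O(n), effectively improving system efficiency and scalability.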

15.

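File checkpointing based on virtual file operations (Cited by: 1; self-citations: 0; citations by others: 1)

刘少锋 汪东升 朱晶 《软件学报》2002,13(8):1528-1533

Single-process checkpointing and rollback recovery form the basis of fault tolerance in distributed/parallel systems, and saving and restoring information about active files is an important aspect of this technique. A virtual file operation strategy is proposed that implements checkpointing of user files and effectively resolves the inconsistency between user file contents and the global process state when a failure occurs. Through techniques such as block-based file management and distributed checkpoint operations, the method outperforms other methods in space overhead, normal running time, and recovery time. It is also transparent to users and preserves as much completed work as possible.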

16.

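A rollback-recovery algorithm based on checkpoint dependency graphs and attribute tables (Cited by: 2; self-citations: 0; citations by others: 2)

张宇 洪炳熔 《计算机研究与发展》2001,38(2):246-251

To solve the domino effect problem in checkpointing and the livelock problem in rollback recovery while minimizing time overhead, a rollback-recovery algorithm based on a checkpoint dependency graph and attribute tables is proposed. Compared with previous algorithms, it saves the time overhead of synchronization among processes, and checkpointing and rollback involve only a small number of related processes. The correctness of the algorithm is proved.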

17.

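An optimized failure recovery strategy for distributed systems (Cited by: 1; self-citations: 0; citations by others: 1)

刘云龙 陈俊亮 《计算机学报》1999,22(3):249-257

This paper studies in depth failure recovery strategies for distributed systems composed of deterministic process groups, and proposes an original method that applies data-flow analysis to compute statically the minimal checkpoint data set of a process.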

18.

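Simulation design for rollback-recovery fault tolerance in distributed computing systems

董奇 黄斌 颜耀 李韦韦 曾玮妮 张恒 《计算机与现代化》2017,(4):48

To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault-tolerance algorithms is designed. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes of nodes are modeled as discrete events, a module supporting multi-task rollback-recovery simulation is added to the application layer of a network system simulation tool, and the structure, functional modules, and system parameter settings for the simulation are designed. The results show that the proposed mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault-tolerance algorithms.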

19.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.

20.

Self-stabilizing algorithm for checkpointing in a distributed system

Partha Sarathi Mandal Krishnendu Mukhopadhyaya 《Journal of Parallel and Distributed Computing》2007

If the variables used for a checkpointing algorithm have data faults, the existing checkpointing and recovery algorithms may fail. In this paper, self-stabilizing data fault detecting and correcting, checkpointing, and recovery algorithms are proposed in a ring topology. The proposed data fault detection and correction algorithms can handle data faults; at most one per process, but in any number of processes. The proposed checkpointing algorithm can deal with concurrent multiple initiations of checkpointing and data faults. A process can recover from a fault, using the proposed recovery algorithm in spite of multiple data faults present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes. The algorithm can be extended to work for general topologies too.