首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery.  相似文献   

2.
工作站机群系统已成为分布式并行处理发展的主流方向之一 .随着机群系统应用领域的逐渐拓展和规模的不断扩大 ,人们对其可靠性的要求日益提高 .设计高可靠的群机系统 ,需要着重研究其系统容错技术 .本文叙述了并行异构环境回卷恢复和检查点派生 .实现透明的可移植容错和负载均衡能力 .避免调整检查点就可以构成全局一致性状态 .不仅使 BSP应用程序自治容错能力 ,而且能够在机群 (Clusters)间迁移 ,保持系统负载均衡 .重点介绍检查点设置、检查点派生、卷回、进程迁移技术  相似文献   

3.
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes less number of forced checkpoints when compared to some of the existing checkpointing algorithms in its class.  相似文献   

4.
Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.  相似文献   

5.
基于虚拟文件操作的文件检查点设置   总被引:1,自引:0,他引:1  
刘少锋  汪东升  朱晶 《软件学报》2002,13(8):1528-1533
实现分布/并行系统容错的基础是单进程检查点设置和卷回恢复技术,而对活动文件信息进行保存和恢复则是这种技术的重要方面.提出一种虚拟文件操作策略,实现了对用户文件的检查点设置,有效地解决了发生故障时用户文件内容与进程全局状态的不一致的问题.该方法通过文件块式管理、检查点分布操作等技术,使得在空间开销、正常运行时间、恢复时间等性能指标上优于其他方法,并且具有对用户透明、可最大限度地保留已完成工作的特点.  相似文献   

6.
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance as processes may have to wait for long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.  相似文献   

7.
Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval  相似文献   

8.
Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).  相似文献   

9.
A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback and hence results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage—garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.  相似文献   

10.
检查点设置和卷回恢复是提高系统可靠性和实现容错计算的有效途径,其性能通常用开销率来评价,而检查点开销是影响开销率的主要因素。针对目前并行程序运行时存在较多通信阻塞时间的现状,该文在写时复制检查点缓存的基础上提出了一种进一步降低检查点开销的方法。通过控制状态保存线程的调度和选择合适的状态保存粒度,该方法能很好地利用通信阻塞时间隐藏状态保存线程运行时带来的开销,从而能进一步降低开销率。  相似文献   

11.
The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented  相似文献   

12.
Existing cache-level checkpointing schemes do not continuously support a large rollback window. Immediately after a checkpoint, the number of instructions that the processor can undo falls to zero. To address this problem, we introduce Swich, an FPGA-based prototype of a new cache-level scheme that keeps two live checkpoints at all times, forming a sliding rollback window that maintains a large minimum and average length  相似文献   

13.
Soundness-preserving reduction rules for reset workflow nets   总被引:2,自引:0,他引:2  
The application of reduction rules to any Petri net may assist in its analysis as its reduced version may be significantly smaller while still retaining the original net’s essential properties. Reset nets extend Petri nets with the concept of a reset arc, allowing one to remove all tokens from a certain place. Such nets have a natural application in business process modelling where possible cancellation of activities need to be modelled explicitly and in workflow management where such process models with cancellation behaviours should be enacted correctly. As cancelling the entire workflow or even cancelling certain activities in a workflow has serious implications during execution (for instance, a workflow can deadlock because of cancellation), such workflows should be thoroughly tested before deployment. However, verification of large workflows with cancellation behaviour is time consuming and can become intractable due to the state space explosion problem. One way of speeding up verification of workflows based on reset nets is to apply reduction rules. Even though reduction rules exist for Petri nets and some of its subclasses and extensions, there are no documented reduction rules for reset nets. This paper systematically presents such reduction rules. Because we want to apply the results to the workflow domain, this paper focusses on reset workflow nets (RWF-nets), i.e. a subclass tailored to the modelling of workflows. The approach has been implemented in the context of the workflow system YAWL.  相似文献   

14.
Checkpointing and rollback recovery are well-known techniques for handling failures in distributed systems. The issues related to the design and implementation of efficient checkpointing and recovery techniques for distributed systems have been thoroughly understood. For example, the necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint has been established for distributed computations. In this paper, we address the analogous question for distributed database systems. In distributed database systems, transaction-consistent global checkpoints are useful not only for recovery from failure but also for audit purposes. If each data item of a distributed database is checkpointed independently by a separate transaction, none of the checkpoints taken may be part of any transaction-consistent global checkpoint. However, allowing individual data items to be checkpointed independently results in non-intrusive checkpointing. In this paper, we establish the necessary and sufficient conditions for the checkpoints of a set of data items to be part of a transaction-consistent global checkpoint of the distributed database. Such conditions can also help in the design and implementation of non-intrusive checkpointing algorithms for distributed database systems.  相似文献   

15.
WindowsNT环境下的进程检查点设置与回卷恢复   总被引:6,自引:0,他引:6  
阐述了WindowsNT环境下应用程序的检查点设置与回卷恢复机制,并介绍了设计和实现的检查点设置与恢复工具WinNTCkp.WinNTCkpt采用标准WindowsAPI函数,通过代码动态注入和对系统调用进行包裹的方法进行检查点设置与回卷恢复。与同类工具相比,WinNTCkpt具有不需修改应用程序源代码,不需对应用程序进行重新编译或连接,支持对用户文件内容的检查设置与回卷恢复的特点。WinNTCkpt是正在研制开发的高可用性机群计算环境的核心,也是在机群环境下实现进程迁移和负载平衡的技术基础。  相似文献   

16.
基于检测点设置依赖图和属性表的卷回恢复算法   总被引:2,自引:0,他引:2  
为了解决检测点设置过程中的Domino效应问题及卷回恢复过程中的活锁问题,并最大限度地减小时间开销,提出了基于检测点设置依赖图和属性表的卷回恢复算法。同以前的算法相比较,该算法一方面节省了用于进程之间同步的时间开销,另一方面检测点设置及卷回过程中涉及少量的相关进程。对该算法的正确性进行了证明。  相似文献   

17.
The Time Warp distributed simulation algorithm uses checkpointing to save process states after certain event executions for later recovery at the time of a rollback. Two main techniques have been used for checkpointing: periodic state saving and incremental state saving. The former technique introduces large overheads in reconstructing a desired state by coasting forward from an earlier checkpointed state if the computational granularity is large. The latter technique also has large overheads in applications with large rollback distances. A hybrid checkpointing technique is proposed which uses both periodic and incremental state saving simultaneously in such a way that it reduces checkpointing time overheads. A detailed analytical model is developed for the hybrid technique, and comparisons are made using similar analytical models with periodic and incremental state saving techniques. Results show that when the system parameters are chosen to represent large and complex simulated systems, the hybrid approach has less checkpointing time overhead than the other two techniques  相似文献   

18.
The existing user‐level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single‐threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter‐thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd.  相似文献   

19.
Contracts are complex to understand, represent and process electronically. Usually, contracts involve various entities such as parties, activities and clauses. An e-contract is a contract modeled, specified, executed and enacted (controlled and monitored) by a software system (such as a workflow system). Workflows are used to automate business processes that govern adherence to the e-contracts. E-contracts can be mapped to inter-related workflows, which have to be specified carefully to satisfy the contract requirements. Most workflow models do not have the capabilities to handle complex inter/intra relationships among entities in e-contracts. An e-contract does not adhere to activity/task oriented workflow processes, thus generating a gap between a conceptual model of e-contract and workflow. In this paper, we describe conceptual modeling of e-contracts and present a business process model for e-contract enactment. The enactment of e-contracts necessitates dynamic generation and initiation of workflows during the e-contract execution, besides the static workflows. EREC business process model facilitates an integrated approach to e-contracts enactment. Our methodology is illustrated by means of a case study conducted using Financial Messaging Solution contract for banking transactions.  相似文献   

20.
Workflows are used to formally describe processes of various types such as business and manufacturing processes. One of the critical tasks of workflow management is automated discovery of possible flaws in the workflow – workflow verification. In this paper, we formalize the problem of workflow verification as the problem of verifying that there exists a feasible process for each task in the workflow. This problem is tractable for nested workflows that are the workflows with a hierarchical structure similar to hierarchical task networks in planning. However, we show that if extra synchronization, precedence, or causal constraints are added to the nested structure, the workflow verification problem becomes NP-complete. We present a workflow verification algorithm for nested workflows with extra constraints that is based on constraint satisfaction techniques and exploits an incremental temporal reasoning algorithm. We then experimentally demonstrate efficiency of the proposed techniques on randomly generated workflows with various structures and sizes. The paper is concluded by notes on exploiting the presented techniques in the application FlowOpt for modeling, optimizing, visualizing, and analyzing production workflows.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号