Similar literature
 20 similar documents found.
1.
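Existing rollback-recovery fault-tolerance techniques suffer from synchronization constraints and blocking, and their time overhead grows sharply as the number of system nodes increases. To address this, a low-overhead rollback-recovery method based on exploiting concurrency is proposed. Message dependencies are tracked by piggybacking on message passing, which removes the synchronization constraint in message logging; process workloads are analyzed to expose their concurrency; an architecture for executing process workloads concurrently is constructed; and data caching and multithreading are used to execute the workloads inside a process concurrently, reducing failure-recovery overhead. Experiments with three NAS NPB 2.3 benchmark programs show that the method reduces checkpoint overhead from 0.63 s, 3.19 s, and 1.21 s to 0.18 s, 0.67 s, and 0.19 s respectively, and the logging overhead ratio from 13.4%, 3.5%, and 18.3% to 0.7%, 0.1%, and 1.0% respectively.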

2.
A rollback recovery scheme for distributed systems is proposed. The state-save synchronization among processes is implemented by bounding clock drifts such that no state-save synchronization messages are required. Since the clocks are only loosely synchronized, the synchronization overhead can be negligible in many applications. An interprocess communication protocol which encodes state-save progress information within message frames is introduced to checkpoint consistent system states. A rollback recovery algorithm that will force a minimum number of nodes to roll back after failures is developed.

3.
The cost of recovery in message logging protocols   (Total citations: 1; self-citations: 0; citations by others: 1)
Past research in message logging has focused on studying the relative overhead imposed by pessimistic, optimistic and causal protocols during failure-free executions. In this paper, we give the first experimental evaluation of the performance of these protocols during recovery. Our results suggest that applications face a complex tradeoff when choosing a message logging protocol for fault tolerance. On the one hand, optimistic protocols can provide fast failure-free execution and good performance during recovery, but are complex to implement and can create orphan processes. On the other hand, orphan-free protocols either risk being slow during recovery (e.g. sender-based pessimistic and causal protocols) or incur a substantial overhead during failure-free execution (e.g. receiver-based pessimistic protocols). To address this tradeoff, we propose hybrid logging protocols, which are a new class of orphan-free protocols. We show that hybrid protocols perform within 2% of causal logging during failure-free execution and within 2% of receiver-based logging during recovery.

4.
The main issues when supporting fault tolerance based on checkpointing and rollback recovery for High-Performance applications are related to the scalability of the introduced support, the possibility of analyzing the induced overhead and, in more general terms, the optimization of the trade-off between failure-free and recovery performances. In this paper we describe our contribution in fault tolerance for high-level structured parallelism models. We take a different viewpoint w.r.t. existing contributions, by introducing a methodology to derive interesting properties to support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving useful properties to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the described issues. We exemplify two checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models statically describing the failure-free performance, which can be used for performance tuning or to target some Quality of Service parameter. To assess the innovation of the results we analytically and experimentally compare the introduced protocols with two literature protocols. Results show that while the protocols introduced in this paper permit the definition of cost models and have a good scalability, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.

5.
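A new model for solving the optimal checkpoint interval   (Total citations: 1; self-citations: 0; citations by others: 1)
In high-performance computing environments with fault tolerance, a checkpointing mechanism adds extra overhead to the system, so choosing the checkpoint interval appropriately can optimize system performance. Vaidya's contribution is that the equation for the checkpoint interval derived from his model is independent of the checkpoint latency (L) and the checkpoint recovery time (R). This paper introduces NSBM, a new model based on time segmentation. It replaces the average overhead ratio of Vaidya's model with the average system utilization, a concept more readily understood in the fault-tolerance field, and derives an equation that is likewise independent of L and R. Experimental results show that the NSBM model yields a better-optimized solution than Vaidya's model.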

6.
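A causal-logging rollback-recovery algorithm for mobile computing environments   (Total citations: 2; self-citations: 0; citations by others: 2)
Because mobile nodes are unreliable and wireless network connections are fragile, fault-tolerance mechanisms for mobile computing systems are an important research topic. For highly autonomous mobile nodes that can move across cells and disconnect from the network at any time, asynchronous rollback recovery is an important means of fault tolerance. None of the existing rollback-recovery algorithms for mobile computing environments fully achieves consistent asynchronous rollback recovery. Based on causal message logging, a new rollback-recovery algorithm for mobile computing environments is proposed: an antecedence graph records the message dependencies between nodes, and the asynchronous checkpoints, the sender-based volatile message logs, and the antecedence graphs are all stored and processed on the mobile support stations. This gives mobile nodes a transparent fault-tolerance service and completely eliminates the effects of dependencies among mobile nodes. The consistency of the system is proved formally. Simulation results show that the algorithm minimizes rollback overhead while also significantly reducing communication and storage overhead during failure-free operation.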

7.
Fault-tolerance techniques based on checkpointing and message logging have been increasingly used in real-world applications to reduce service down-time. Most industrial applications have chosen pessimistic logging because it allows fast and localized recovery. The price that they must pay, however, is the high failure-free overhead. In this paper, we introduce the concept of K-optimistic logging where K is the degree of optimism that can be used to fine-tune the trade-off between failure-free overhead and recovery efficiency. Traditional pessimistic logging and optimistic logging then become the two extremes in the entire spectrum spanned by K-optimistic logging. Our results generalize several previously known protocols. Our approach is to prove that only dependencies on those states that may be lost upon a failure need to be tracked on-line, and so transitive dependency tracking can be performed with a variable-size vector. The size of the vector piggy-backed on a message then indicates the number of processes whose failures may revoke the message, and K corresponds to the upper bound on the vector size. Furthermore, the parameter K is dynamically tunable in response to changing system characteristics.

8.
A common approach to fault-tolerant software DSM is to take checkpoints with message logging. Our remote logging has low overhead because each node saves the coherence-related data into the memory of a remote node through a high-speed system area network. For more lightweight fault-tolerant DSM, in this paper, we mainly focus on eliminating shared memory checkpointing during failure-free execution. Each node independently takes the checkpoints of execution states and non-shared data only. When a node fails, it regenerates its pages from the remote copies in live nodes. In order to efficiently reconstruct pages, we also introduce an XOR-diffing technique. The diff logs, which have been created by XOR operations during failure-free execution, can be applied to any version of remote copies either backward or forward for recovery. Our scheme reduces the checkpointing overhead and also alleviates the imbalance in execution times among nodes due to independent checkpointing. This research is supported by KISTEP under the National Research Laboratory program.
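
The reason an XOR diff can be applied "either backward or forward" is that XOR is its own inverse. The following is a minimal Python sketch of that property only, not the paper's implementation; the function names `xor_diff` and `apply_diff` and the sample page contents are assumptions made for illustration.

```python
def xor_diff(old: bytes, new: bytes) -> bytes:
    """Return the byte-wise XOR of two equally sized page versions."""
    if len(old) != len(new):
        raise ValueError("page versions must have the same size")
    return bytes(a ^ b for a, b in zip(old, new))


def apply_diff(page: bytes, diff: bytes) -> bytes:
    """Apply an XOR diff; the same call moves a page forward or backward."""
    return xor_diff(page, diff)


# Hypothetical page contents, chosen only for the demonstration.
v1 = b"page-v1!"
v2 = b"page-v2?"
diff = xor_diff(v1, v2)

rolled_forward = apply_diff(v1, diff)  # older copy + diff -> newer version
rolled_back = apply_diff(v2, diff)     # newer copy + diff -> older version
```

Because one diff serves both directions, a recovering node can start from whichever version of a remote copy is available.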

9.
Management of replicated data has received considerable attention in the last few years. Several replica control schemes have been proposed which work in the presence of both node and communication link failures. However, this resiliency to failure inflicts a performance penalty in terms of the communication overhead incurred. Though the issue of performance of these schemes from the standpoint of availability of the system has been well addressed, the issue of message overhead has been limited to the analysis of worst case and best case message bounds. In this paper we derive expressions for computing the average message overhead of several well known replica control protocols and provide a comparative study of the different protocols with respect to both average message overhead and system availabilities.

10.
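Checkpoint/recovery strategies are often used for fault tolerance in distributed computing environments. This paper studies how, over unreliable channels, coordination can keep the checkpoints taken by communicating processes globally consistent. By analyzing the relationship between in-transit messages and channel reliability, and how existing checkpoint protocols handle in-transit messages, a coordinated checkpointing method for unreliable channels is proposed. Its message complexity is O(N), it introduces no additional computational burden, and it reaches a globally consistent state with a single round of synchronization. Compared with earlier coordinated checkpointing protocols, it greatly reduces time overhead and improves the efficiency of taking globally consistent checkpoints over unreliable channels.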

11.
A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback, which results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints, which results in low message overhead during failure-free operation. Moreover, a process can asynchronously decide garbage checkpoints and delete them from the stable storage; garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.
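
The notion of a consistent global checkpoint used above can be made concrete with the standard condition that no message is recorded as received unless it is also recorded as sent. The sketch below checks only that condition and is not the paper's checkpointing or recovery algorithm; the `Message` record, the per-process event indices, and `is_consistent` are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Message(NamedTuple):
    sender: str
    send_index: int   # sender's local event index at send time
    receiver: str
    recv_index: int   # receiver's local event index at receipt


def is_consistent(cut: Dict[str, int], messages: List[Message]) -> bool:
    """Check a global checkpoint for orphan messages.

    `cut` maps each process to the local event index of its checkpoint;
    events with an index <= that value are part of the saved state.
    """
    for m in messages:
        received_in_cut = m.recv_index <= cut[m.receiver]
        sent_in_cut = m.send_index <= cut[m.sender]
        if received_in_cut and not sent_in_cut:
            return False  # orphan: receipt is saved, the send is not
    return True


# Hypothetical run: P1 sends at its event 3, P2 receives at its event 2.
msgs = [Message("P1", 3, "P2", 2)]
orphaned = is_consistent({"P1": 2, "P2": 2}, msgs)
both_saved = is_consistent({"P1": 3, "P2": 2}, msgs)
neither_saved = is_consistent({"P1": 2, "P2": 1}, msgs)
```

In this sketch a checkpoint is useful when some choice of checkpoints for the other processes makes `is_consistent` return `True`.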

12.
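Checkpointing and rollback recovery are effective ways to improve system reliability and achieve fault-tolerant computing. Their performance is usually evaluated by the overhead ratio, and checkpoint overhead is the main factor affecting it. Observing that parallel programs currently spend considerable time blocked on communication, this paper proposes a method that further reduces checkpoint overhead on top of copy-on-write checkpoint buffering. By controlling the scheduling of the state-saving thread and choosing a suitable state-saving granularity, the method uses communication blocking time to hide the overhead of running the state-saving thread, further reducing the overhead ratio.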

13.
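As storage systems grow in scale and inexpensive disks are used in large numbers, the probability of a single-disk failure in a storage system keeps rising. In array storage systems based on RDP coding, recovering a single failed disk requires reading all of the remaining data disks, so read overhead is high and recovery time is long. A long recovery time in turn increases the probability of further errors during recovery and affects overall system stability. To further reduce the read overhead of single-disk failure recovery, shorten recovery time, and improve storage-system reliability, a locally repairable RDP code is proposed, which adds one local redundancy column to reduce the amount of data that must be read during recovery. Experimental results show that the improved method clearly outperforms the traditional RDP single-disk recovery method in read overhead and recovery time, and that it can recover from 75% of triple-disk failure cases.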

14.
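To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault-tolerance algorithms is designed. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes of nodes are modeled with discrete events; a module supporting multi-task rollback-recovery simulation is added at the application layer of a network system simulation tool; and the structure, functional modules, and system parameter settings used for the simulation are designed. Results show that the proposed simulation mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault-tolerance algorithms.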

15.
Parallel discrete event simulation is a useful technique to improve performance of sequential discrete event simulation. We consider the time warp algorithm for asynchronous distributed discrete event simulation. Time warp is an optimistic synchronization mechanism for asynchronous distributed systems that allows a system to violate the synchronization constraint and, in this case, make the system rollback to a correct state. We focus on the kernel of the time warp algorithm, that is the rollback operation, and we propose some techniques to reduce the overhead due to this operation. In particular, we propose a method to reduce the overhead involved in state saving operation, two methods to reduce the overhead of a single rollback operation and a method to reduce the overall number of rollbacks. These methods have been implemented in a distributed simulation environment on a distributed memory system. Some experimental results show the effectiveness of the proposed techniques.

16.
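Building on an existing electro-hydraulic control system for cable-trailing monorail cranes, an electro-hydraulic remote control system for cable-trailing monorail cranes is designed by adding a remote controller and a wireless transceiver. The paper describes the system's composition, the hardware design of the remote controller and wireless transceiver, the system software design, and how the remote controller is operated. The system retains the functions of the original electro-hydraulic control system and allows the monorail crane to be controlled remotely within a safe distance through the remote controller. Field application verified the system's stability and reliability.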

17.
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.

18.
In a distributed computing system, message logging is widely used for providing nodes with recoverability. To reduce the piggyback overhead of traditional causal message logging, we present a zoning causal message logging approach in this paper. The crux of the approach is to control the propagation of dependency information: the nodes in the system are divided into zones, and by a message fragment mechanism, the dependency information of a node is only visible in the zone scope. Simulation results show that the piggyback overhead of the proposed approach is lower than that of traditional causal message logging.
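
One way to picture zone-scoped dependency information is a filter on what a sender piggybacks. The sketch below is a loose illustration under stated assumptions, not the paper's protocol: the `Determinant` record, the `zone_of` map, and `piggyback_for` are hypothetical, and the abstract's message fragment mechanism for cross-zone traffic is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Determinant:
    origin: str  # node whose delivery event this record describes
    seq: int     # hypothetical per-node sequence number


def piggyback_for(receiver: str,
                  determinants: List[Determinant],
                  zone_of: Dict[str, str]) -> List[Determinant]:
    """Keep only determinants that originate in the receiver's zone."""
    target_zone = zone_of[receiver]
    return [d for d in determinants if zone_of[d.origin] == target_zone]


# Hypothetical layout: A and B share zone z1, C sits alone in z2.
zone_of = {"A": "z1", "B": "z1", "C": "z2"}
pending = [Determinant("A", 1), Determinant("C", 7), Determinant("B", 4)]

to_b = piggyback_for("B", pending, zone_of)  # only z1 determinants travel
to_c = piggyback_for("C", pending, zone_of)  # only z2 determinants travel
```

Under these assumptions the piggybacked list shrinks from every pending determinant to those of one zone, which is the kind of saving the abstract reports.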

19.
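Wang Zhun (王准), Chen Junliang (陈俊亮). Chinese Journal of Computers (《计算机学报》), 1998, 21(8): 730-737
Message logging is a method for state recovery in multi-process and distributed systems. Traditional message logging methods apply only to deterministic processes. To overcome this limitation, this paper proposes a new approach to message logging that takes full account of the positive role nondeterminism can play in fault tolerance, and argues for tolerating nondeterminism to some degree provided the consistency semantics of the application processes are satisfied. This offers a new perspective on the problem that, because of nondeterminism in single processes and in distributed concurrent systems, a reconstructed state may not exactly reproduce the original. In this way, message logging can also be applied to some processes that do not satisfy the determinism condition.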

20.
An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.
