Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
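Existing rollback-recovery fault-tolerance techniques suffer from synchronization constraints and blocking, and their time overhead grows sharply as the number of system nodes increases. To address this, a low-overhead rollback-recovery implementation based on exploiting concurrency is proposed. Message dependencies are tracked by piggybacking on message passing, which removes the synchronization constraint in message logging. Process workloads are analyzed to expose their concurrency, an architecture for executing them concurrently is built, and data buffering and multithreading are used to run the workloads inside each process concurrently, which reduces failure-recovery overhead. Experiments on three NAS NPB 2.3 benchmark programs show that the method reduces checkpoint overhead from 0.63 s, 3.19 s, and 1.21 s to 0.18 s, 0.67 s, and 0.19 s, respectively, and the logging overhead ratio from 13.4%, 3.5%, and 18.3% to 0.7%, 0.1%, and 1.0%, respectively.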

2.
In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery.

3.
An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: reduced wait for commitment for establishing recovery lines, fewer messages to be exchanged, and less memory requirement. These advantages are assessed quantitatively by developing a probabilistic model.

4.
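A rollback recovery algorithm based on checkpointing dependency graphs and attribute tables   Total citations: 2, self-citations: 0, citations by others: 2
To solve the domino effect during checkpointing and the livelock problem during rollback recovery while minimizing time overhead, a rollback recovery algorithm based on a checkpointing dependency graph and attribute tables is proposed. Compared with earlier algorithms, it saves the time otherwise spent synchronizing processes, and both checkpointing and rollback involve only a small number of related processes. The correctness of the algorithm is proved.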

5.
The existing user-level checkpointing schemes support only a limited portion of multithreaded programs because they are derived from the schemes for single-threaded applications. This paper addresses the impact of thread suspension point on rollback recovery, and presents a checkpointing scheme for multithreaded processes. Unlike the existing schemes in which the checkpointer suspends every working thread, our scheme employs a distinctive strategy that every working thread suspends itself. This technique manages to avoid the suspension point in the API code or kernel code, ensuring correct rollback recovery. Our scheme supports inter-thread synchronization and thread lifetime. Copyright © 2006 John Wiley & Sons, Ltd.

6.
Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively without explicit synchronization between the concurrently executing simulation objects. In place of an explicit synchronization mechanism, the concurrent simulators implement an independent but common virtual clock model and a rollback/recovery mechanism to restore causal order when out-of-order events are detected. When the critical path of execution of the simulation is balanced across these parallel threads of execution, this can result in a highly effective, lightweight synchronization mechanism to implement parallel simulation. However, imbalances in the workload across the threads can result in excessive rollback in some threads and slowed progress of the critical path. On small shared memory multi-core systems, a lowest time-stamp scheduling policy can effectively balance the workload. However, on larger many-core chips, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. In particular, the recently developed Intel Single-chip Cloud Computer (SCC) provides mechanisms for the runtime control of the frequency and voltage settings of the chip. Furthermore, the frequency and voltage settings are independently set within different regions (called islands) of the chip. Thus, in a Time Warp simulation, one could increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback).
This paper investigates the run-time control and adjustment of core frequency in some contemporary x86 multi-core processors to identify the platforms that can support the exploration of dynamic run-time control of core frequency settings. The results show that while all multi-core processors have software controllable core frequency modulation capabilities, they are generally not fully independent as the system comes under load and are therefore unsuitable for these studies. Fortunately, one processor, the AMD X6 line, provides software control for core frequencies that can be fixed (by software) even as the system operates under load.

7.
Checkpointing with rollback recovery is a well-known method for achieving fault-tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes do not have to take into consideration any application message dependency. The synchronization is achieved by passing control messages among the processes. Application messages are acknowledged. Each process maintains a list of unacknowledged messages. Here we use a logical checkpoint, which is a standard checkpoint (i.e., snapshot of the process) plus a list of messages that have been sent by this process but are unacknowledged at the time of taking the checkpoint. The worst case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently. The time complexity is O(n). For the recovery algorithm, time and message complexities are both O(n).
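The "logical checkpoint" described in this abstract (a process snapshot plus the messages sent but not yet acknowledged) can be illustrated with a minimal sketch. It covers only the data structure, not the ring-based control-message protocol, which the abstract does not specify. All names (`Process`, `LogicalCheckpoint`, `send`, `on_ack`) are hypothetical and chosen for illustration.

```python
import copy
from dataclasses import dataclass


@dataclass
class LogicalCheckpoint:
    """Snapshot of process state plus messages still unacknowledged."""
    state: dict
    unacked: list  # list of (msg_id, (dest, payload)) pairs


class Process:
    """Hypothetical process that tracks its unacknowledged messages."""

    def __init__(self, pid):
        self.pid = pid
        self.state = {}
        self.unacked = {}  # msg_id -> (dest, payload)
        self.next_id = 0

    def send(self, dest, payload):
        # Record the message as unacknowledged until an ack arrives.
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = (dest, payload)
        return msg_id

    def on_ack(self, msg_id):
        # An acknowledged message no longer belongs in future checkpoints.
        self.unacked.pop(msg_id, None)

    def take_logical_checkpoint(self):
        # Deep-copy the state so later updates do not alter the checkpoint.
        return LogicalCheckpoint(
            state=copy.deepcopy(self.state),
            unacked=sorted(self.unacked.items()),
        )


p = Process(0)
first = p.send(1, "x")
second = p.send(2, "y")
p.on_ack(first)
cp = p.take_logical_checkpoint()
p.state["k"] = 1  # does not affect the checkpoint already taken
```

On recovery, a process restored from such a checkpoint could resend the messages in `cp.unacked`; how the original algorithm does this is not stated in the abstract.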

8.
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

9.
Parallel discrete event simulation is a useful technique to improve performance of sequential discrete event simulation. We consider the time warp algorithm for asynchronous distributed discrete event simulation. Time warp is an optimistic synchronization mechanism for asynchronous distributed systems that allows a system to violate the synchronization constraint and, in this case, makes the system roll back to a correct state. We focus on the kernel of the time warp algorithm, that is, the rollback operation, and we propose some techniques to reduce the overhead due to this operation. In particular, we propose a method to reduce the overhead involved in the state saving operation, two methods to reduce the overhead of a single rollback operation, and a method to reduce the overall number of rollbacks. These methods have been implemented in a distributed simulation environment on a distributed memory system. Some experimental results show the effectiveness of the proposed techniques.
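The rollback operation at the core of Time Warp can be sketched as follows. This is a toy illustration under stated assumptions, not the method of the paper above: it saves the state before every event (the simplest form of state saving) and omits anti-messages and global virtual time. The class `LogicalProcess` and the order-dependent update rule `state * 2 + value` are hypothetical.

```python
import bisect


class LogicalProcess:
    """Toy Time Warp logical process with per-event state saving."""

    def __init__(self):
        self.lvt = 0         # local virtual time
        self.state = 0       # toy state; the update below is order-dependent
        self.processed = []  # executed events as (timestamp, value), sorted
        self.snapshots = []  # state saved *before* each executed event
        self.rollbacks = 0

    def _execute(self, ts, value):
        self.snapshots.append(self.state)
        self.processed.append((ts, value))
        self.state = self.state * 2 + value
        self.lvt = ts

    def receive(self, ts, value):
        if ts >= self.lvt:
            # Optimistic case: the event arrives in timestamp order.
            self._execute(ts, value)
            return
        # Straggler: undo every event with a later timestamp.
        self.rollbacks += 1
        timestamps = [t for t, _ in self.processed]
        i = bisect.bisect_right(timestamps, ts)
        undone = self.processed[i:]
        self.state = self.snapshots[i]  # state before the first undone event
        del self.processed[i:]
        del self.snapshots[i:]
        self.lvt = self.processed[-1][0] if self.processed else 0
        # Execute the straggler, then re-execute the undone events in order.
        self._execute(ts, value)
        for old_ts, old_value in undone:
            self._execute(old_ts, old_value)


lp = LogicalProcess()
lp.receive(10, 1)
lp.receive(30, 3)
lp.receive(20, 2)  # straggler: triggers one rollback past the event at t=30
```

After the straggler is handled, `lp.state` matches what in-order execution of the three events would have produced, which is the causal-order guarantee that rollback provides.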

10.
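To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, a simulation mechanism for rollback-recovery fault-tolerance algorithms is designed. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, the task processes of the nodes are modeled as discrete events, a module supporting multi-task rollback-recovery fault-tolerance simulation is added at the application layer of a network system simulation tool, and the structure, functional modules, and system parameter settings for the simulation are designed. The results show that the proposed simulation mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, providing a reference for comparing, improving, and optimizing different fault-tolerance algorithms.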

11.
Three alternatives for implementing recovery blocks (RB's) are conceivable for backward error recovery in concurrent processing. These are the asynchronous, synchronous, and the pseudorecovery point implementations. Asynchronous RB's are based on the concept of maximum autonomy in each of the concurrent processes. Consequently, establishment of RB's in a process is made independently of others and unbounded rollback propagations become a serious problem. In order to completely avoid unbounded rollback propagations, it is necessary to synchronize the establishment of recovery blocks in all cooperating processes. Process autonomy is sacrificed and processes are forced to wait for commitments from others to establish a recovery line, leading to inefficiency in time utilization. As a compromise between asynchronous and synchronous RB's, we propose to insert pseudorecovery points (PRP's) so that unbounded rollback propagations may be avoided while maintaining process autonomy. We developed probabilistic models for analyzing these three methods under standard assumptions in computer performance analysis, i.e., exponential distributions for related random variables. With these models we have estimated 1) the interval between two successive recovery lines for asynchronous RB's, 2) mean loss in computation power for the synchronized method, and 3) additional overhead and rollback distance in case PRP's are used.

12.
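Coordinated checkpointing and rollback recovery is a simple and effective fault-tolerance technique that is widely used in parallel and distributed systems. To further reduce the overhead of coordinated checkpointing algorithms, this paper presents a non-blocking coordinated checkpointing algorithm based on reconstructible checkpoints. The probability that an error in a parallel program triggers rollback recovery is far lower than the probability of taking a checkpoint. The algorithm exploits this property by shifting part of the checkpointing overhead to the rollback recovery phase, which lowers the cost of fault tolerance and improves system scalability.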

13.
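A causal-logging rollback recovery algorithm for mobile computing environments   Total citations: 2, self-citations: 0, citations by others: 2
Because mobile nodes are unreliable and wireless network connections are fragile, fault-tolerance mechanisms for mobile computing systems are an important research topic. For highly autonomous mobile nodes that can move across cells and disconnect from the network at any time, asynchronous rollback recovery is an important means of fault tolerance. None of the existing rollback recovery algorithms for mobile computing environments fully achieves consistent asynchronous rollback recovery. Based on causal message logging, a new rollback recovery algorithm for mobile computing environments is proposed: an antecedence graph records the message dependencies among nodes, and the asynchronous checkpoints, the sender-based temporarily stored message logs, and the antecedence graph are all stored and processed on the mobile support stations. This provides mobile nodes with a transparent fault-tolerance service and completely eliminates the effects of dependencies among mobile nodes. The consistency of the system is proved formally. Simulation results show that the algorithm minimizes rollback overhead while also significantly reducing communication and storage overhead during failure-free operation.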

14.
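Checkpoint/recovery strategies are often used for fault tolerance in distributed computing environments. This paper studies how communicating processes can coordinate over unreliable channels so that their checkpoints remain globally consistent. By analyzing the relationship between in-transit messages and channel reliability, and how existing checkpointing protocols handle in-transit messages, a coordinated checkpointing method for environments with unreliable channels is proposed. Its message complexity is O(N), it introduces no additional computational burden, and it reaches a globally consistent state with a single synchronization. Compared with earlier coordinated checkpointing protocols, it greatly reduces time overhead and improves the efficiency of taking globally consistent checkpoints over unreliable channels.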

15.
Bowen N.S., Pradhan D.K. Computer, 1993, 26(2): 22-31
Several hardware-based techniques that support checkpoint and rollback recovery are presented. The focus is on hardware schemes for uniprocessors, shared-memory multiprocessors, and distributed virtual-memory systems. A taxonomy for processor and memory techniques based on the memory hierarchy is presented. This provides a basis for understanding subtle differences among the various schemes. Processor-based schemes that handle transient faults by using processor-based transparent rollback techniques are discussed, as are memory-based schemes that roll back data instead of instructions and can be integrated with the processor techniques or exploited by higher levels of software.

16.
An approach to coordination of cooperating concurrent processes, each capable of error detection and recovery, is presented. Error detection, rollback, and retry in a process are specified by a well-structured language construct called recovery block. Recovery points of processes must be properly coordinated to prevent a disastrous avalanche of process rollbacks. The approach relies on an intelligent processor system (that runs processes) capable of establishing and discarding the recovery points of interacting processes in a well coordinated manner such that a process never makes two consecutive rollbacks without making a retry between the two, and every process rollback becomes a minimum-distance rollback. Following a discussion of the underlying philosophy of the author's approach, basic rules of reducing storage and time overhead in such a processor system are discussed. Examples are drawn from the systems in which processes communicate through monitors.

17.
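A checkpoint-based rollback recovery and process migration system   Total citations: 14, self-citations: 2, citations by others: 12
ChaRM is a backward fault recovery and process migration system for parallel programs. It recovers from transient faults in workstation cluster systems and, through Mirror storage during checkpointing and process migration, also recovers from permanent node failures in cluster systems. In addition, it supports online maintenance of system software and hardware, exclusive or time-limited use of processor resources, and dynamic load balancing. This paper mainly describes the implementation techniques of ChaRM, including checkpointing, rollback recovery, and process migration, and gives some performance evaluation results.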

18.
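Geng Ji, Chen Fei, Nie Peng, Chen Wei, Qin Zhiguang. Journal of Computer Applications (《计算机应用》), 2012, 32(10): 2748-2751
Checkpoint-based coordinated rollback recovery is an effective mechanism for ensuring the survivability of distributed systems. Existing checkpoint-based rollback recovery mechanisms in distributed systems assume that the distributed channels are reliable, but in real application scenarios this assumption does not always hold. Targeting the actual application environments of distributed systems, a coordinated system survivability assurance model is proposed for distributed computing environments with unreliable channels. While retaining the advantages of checkpoint-based rollback recovery, the model ensures the survivability of distributed systems over unreliable communication channels by establishing redundant communication links and migrating processes.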

19.
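File checkpointing based on virtual file operations   Total citations: 1, self-citations: 0, citations by others: 1
Liu Shaofeng, Wang Dongsheng, Zhu Jing. Journal of Software (《软件学报》), 2002, 13(8): 1528-1533
Single-process checkpointing and rollback recovery are the foundation of fault tolerance in distributed and parallel systems, and saving and restoring the information of active files is an important aspect of this technique. A virtual file operation strategy is proposed that implements checkpointing of user files and effectively solves the inconsistency between user file contents and the global process state when a failure occurs. Through block-based file management, distributed checkpoint operations, and related techniques, the method outperforms other methods in space overhead, normal running time, and recovery time. It is also transparent to users and preserves completed work to the greatest possible extent.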

20.
A variety of research problems exist that require considerable time and computational resources to solve. Attempting to solve these problems produces long-running applications that require a reliable and trustworthy system upon which they can be executed. Cluster systems provide an excellent environment upon which to run these applications because of their low cost to performance ratio; however, due to being created using commodity components they are prone to failures. This report surveyed and reviewed the issues currently relating to providing fault tolerance for long-running applications. Several fault tolerance approaches were investigated; however, it was found that rollback-recovery provides a favourable approach for user applications in cluster systems. Two facilities are required to provide fault tolerance using rollback-recovery: checkpointing and recovery. It was shown here that a multitude of work has been done for enhancing checkpointing; however, the intricacies of providing recovery have been neglected. The problems associated with providing recovery include providing transparent and autonomic recovery, selecting appropriate recovery computers, and maintaining a consistent observable behaviour when an application fails. Copyright © 2009 John Wiley & Sons, Ltd.
