Similar literature
 19 similar documents found
1.
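Key Techniques for Checkpointing Unix Processes   Total citations: 4, self-citations: 0, citations by others: 4
Checkpointing of Unix processes is the basis for fault tolerance in distributed and parallel systems, replay debugging, process migration, system simulation, and job switching. This paper discusses the key techniques: saving and restoring the basic checkpoint information of a UNIX process, file checkpointing, and optimization of checkpoint information. It closes with an introduction to checkpointing tools such as Libckpt, Condor, and the authors' own Libcsm.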

2.
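《软件》 (Software), 2017, (7): 137-142
The checkpoint mechanism is an important feature of high-performance computing platforms. It saves a program's running state at some point during execution and restores that state after a system failure so that execution can continue. Because file operations are common in applications, support for file rollback is essential to checkpointing. Backing up file data lets files return to a correct state after rollback, but the overhead is too high. This paper proposes a behavior-based file checkpoint optimization strategy (BBFC) that restores file data correctly and ensures that, when a program rolls back to the previous checkpoint, the file state is consistent with the rest of the process state. BBFC classifies file behavior characteristics and applies a matching save-and-restore strategy to each class, which in most cases reduces the file content that must be saved in a checkpoint interval and lowers the time and space overhead of file checkpointing. It is transparent to users and easy to use.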
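As a rough illustration of why classifying file behavior can avoid copying file data, consider one behavior class that such a scheme might plausibly recognize: a file that is only ever appended to. The abstract does not list BBFC's actual classes or strategies, so the class name `AppendOnlyFileCheckpoint` and the helper `demo` below are hypothetical. The sketch records only the file length at checkpoint time and truncates back to that length on rollback.

```python
import os
import tempfile


class AppendOnlyFileCheckpoint:
    """Checkpoint for a file that the program only appends to.

    Hypothetical helper, not taken from the paper: instead of copying
    the file, remember its length and truncate back to it on rollback.
    """

    def __init__(self, path):
        self.path = path
        self.saved_size = None

    def checkpoint(self):
        # An append-only writer never rewrites existing bytes, so the
        # length alone is enough to describe the file at this point.
        self.saved_size = os.path.getsize(self.path)

    def rollback(self):
        if self.saved_size is None:
            raise RuntimeError("no checkpoint has been taken")
        # Discard everything appended after the checkpoint.
        os.truncate(self.path, self.saved_size)


def demo():
    """Append, checkpoint, append again, roll back, return the content."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "ab") as f:
            f.write(b"before\n")
        ckpt = AppendOnlyFileCheckpoint(path)
        ckpt.checkpoint()
        with open(path, "ab") as f:
            f.write(b"after\n")
        ckpt.rollback()
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)
```

Files that are overwritten in place would need a different strategy, such as saving the overwritten regions, which is where a per-class policy pays off.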

3.
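Multithreaded Checkpointing and Rollback Recovery on Solaris   Total citations: 1, self-citations: 0, citations by others: 1
Drawing on the ideas of UNIX process checkpointing and on the way multithreading is implemented in Solaris, this article proposes a multithreaded checkpointing and recovery technique suited to the Solaris operating system. The technique is implemented at user level, is transparent to users, and is simple and efficient. The article mainly covers key techniques such as saving and restoring checkpoint information, function renaming, wrapping, and thread ID mapping.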

4.
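To meet the reliability requirements of clustered flight data processing in air traffic control (ATC) systems, a Linux-based user-level process checkpointing and recovery scheme is proposed. The main modules of the flight data cluster processing built on this user-level process checkpointing are described, and a system design framework is given on that basis. A detailed implementation is presented, covering the saving and restoring of the process's initialized data segment, heap, stack, and open files. The scheme can restore a process to its pre-restart running state after a host crashes and reboots. More importantly, in a distributed system it can move a saved process checkpoint to another host through process migration, which effectively improves system reliability and reduces lost computation.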

5.
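Process Checkpointing and Rollback Recovery under Windows NT   Total citations: 6, self-citations: 0, citations by others: 6
This paper describes the checkpointing and rollback recovery mechanism for applications under Windows NT and introduces WinNTCkpt, a checkpointing and recovery tool designed and implemented by the authors. WinNTCkpt uses standard Windows API functions and performs checkpointing and rollback recovery through dynamic code injection and wrapping of system calls. Compared with similar tools, WinNTCkpt requires no changes to application source code and no recompiling or relinking of the application, and it supports checkpointing and rollback recovery of user file contents. WinNTCkpt is the core of a high-availability cluster computing environment under development and is also the technical basis for process migration and load balancing in a cluster environment.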

6.
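毛文涛, 金文标, 牟俊, 邓通. 《计算机工程与设计》 (Computer Engineering and Design), 2006, 27(15): 2759-2762, 2827
Consistent restoration of open-file state is achieved during the migration of Apache server processes based on a Linux checkpoint mechanism. The architecture of the Apache server and the techniques for migrating its processes within a cluster system are briefly described. The causes of inconsistent open-file state after the migrated process resumes on the destination node are analyzed, together with a corresponding theoretical study. Finally, a concrete implementation of consistent file-state recovery during Apache server process migration is given.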

7.
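Implementation of a Checkpoint Mechanism for Minix Processes   Total citations: 1, self-citations: 0, citations by others: 1
李毅, 周明天. 《计算机应用》 (Journal of Computer Applications), 2003, 23(1): 13-14, 17
By storing a process's user stack and kernel context data in the data segment, the process context relevant to checkpointing can be reduced to the user-level register context and the user data segment. The state-check operation of the checkpoint mechanism saves the process's user-level register context and user data segment at that moment to a checkpoint file; the state-restore operation is the inverse of the state check. The article presents an out-of-kernel implementation of a checkpoint mechanism for Minix processes and applies some optimizations to it.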

8.
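Implementation of a Linux Checkpoint Mechanism Supporting File Migration   Total citations: 2, self-citations: 2, citations by others: 0
杨晖, 陈闳中. 《计算机工程》 (Computer Engineering), 2010, 36(3): 266-268
Building on the BLCR system, a checkpoint mechanism that supports migration of a process's open files is implemented. The overall framework, the key techniques, the saving and restoring of open files, and the state save and restore flows are presented. Experimental results show that the mechanism supports saving and restoring threads, signals, open files, and pipes, requires no kernel recompilation, and is transparent to users.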

9.
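This paper outlines the characteristics of online transaction processing and its requirements on computer systems, and describes in detail how log files are used in this environment to recover database files from failures, including hardware failure recovery and database file recovery. It analyzes database protection measures and the challenges and problems they face and, for the main failures that affect database files during system operation, proposes a strategy for database file recovery. Database file protection relies mainly on four methods: concurrency control, transaction checkpoints, automatic rollback recovery, and roll-forward recovery. With regard to creating backup copies, updating logs, ...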

10.
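A Checkpoint-Based Rollback Recovery and Process Migration System*   Total citations: 14, self-citations: 2, citations by others: 12
ChaRM is a backward fault recovery and process migration system for parallel programs. It recovers from transient faults in workstation cluster systems and, through mirror storage at checkpoint time combined with process migration, also recovers from permanent node failures. It further supports online maintenance of system software and hardware, exclusive or time-limited use of processor resources, and dynamic load balancing. The article mainly describes ChaRM's implementation techniques, including checkpointing, rollback recovery, and process migration, and presents some performance evaluation results.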

11.
The Time Warp distributed simulation algorithm uses checkpointing to save process states after certain event executions for later recovery at the time of a rollback. Two main techniques have been used for checkpointing: periodic state saving and incremental state saving. The former technique introduces large overheads in reconstructing a desired state by coasting forward from an earlier checkpointed state if the computational granularity is large. The latter technique also has large overheads in applications with large rollback distances. A hybrid checkpointing technique is proposed which uses both periodic and incremental state saving simultaneously in such a way that it reduces checkpointing time overheads. A detailed analytical model is developed for the hybrid technique, and comparisons are made using similar analytical models with periodic and incremental state saving techniques. Results show that when the system parameters are chosen to represent large and complex simulated systems, the hybrid approach has less checkpointing time overhead than the other two techniques.
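The abstract does not say how the two techniques are combined, so the following is only one possible combination, sketched under stated assumptions: a full copy of the state is taken every `period` events (periodic saving), and every write also logs the old value (incremental saving). A rollback restores the nearest full copy at or after the target and then undoes at most `period` events from the log. The class `HybridStateSaver` and all of its names are hypothetical.

```python
import copy


class HybridStateSaver:
    """Periodic full snapshots plus an incremental undo log (a sketch)."""

    def __init__(self, state, period):
        self.state = state
        self.period = period
        self.count = 0  # number of events executed so far
        self.snapshots = {0: copy.deepcopy(state)}  # event count -> full copy
        self.undo_log = []  # (event count, key, key existed?, old value)

    def execute(self, updates):
        """Apply one event, given as a dict of key -> new value."""
        self.count += 1
        for key, value in updates.items():
            # Incremental saving: remember what is about to be overwritten.
            self.undo_log.append(
                (self.count, key, key in self.state, self.state.get(key))
            )
            self.state[key] = value
        if self.count % self.period == 0:
            # Periodic saving: full copy every `period` events.
            self.snapshots[self.count] = copy.deepcopy(self.state)

    def rollback(self, target):
        """Restore the state as it was after `target` events."""
        if not 0 <= target <= self.count:
            raise ValueError("target outside executed history")
        later = [c for c in self.snapshots if target <= c <= self.count]
        if later:
            # Jump to the closest full copy at or after the target.
            start = min(later)
            self.state = copy.deepcopy(self.snapshots[start])
        else:
            start = self.count
        # Undo events in (target, start], newest first.
        for count, key, existed, old in reversed(self.undo_log):
            if count > start:
                continue
            if count <= target:
                break
            if existed:
                self.state[key] = old
            else:
                self.state.pop(key, None)
        # Drop history that lies in the rolled-back future.
        self.undo_log = [r for r in self.undo_log if r[0] <= target]
        self.snapshots = {c: s for c, s in self.snapshots.items() if c <= target}
        self.count = target
```

With this arrangement the undo work per rollback is bounded by `period` events regardless of rollback distance, and no coasting forward is needed; the price is paying both saving costs during normal execution.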

12.
Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint (a snapshot of the whole workflow enactment state at normal execution, without failures) has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy accomplishes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri nets which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability.

13.
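耿技, 陈非, 聂鹏, 陈伟, 秦志光. 《计算机应用》 (Journal of Computer Applications), 2012, 32(10): 2748-2751
Checkpoint-based coordinated rollback recovery is an effective mechanism for ensuring the survivability of distributed systems. Existing checkpoint-based rollback recovery mechanisms for distributed systems assume that communication channels are reliable, but in real deployments this assumption does not always hold. Targeting the real application environment of distributed systems, a coordinated survivability model suited to distributed computing environments with unreliable channels is proposed. While retaining the advantages of checkpoint-based rollback recovery, the model ensures the survivability of distributed systems over unreliable channels by establishing redundant communication links and by process migration.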

14.
In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery.

15.
We propose an approach for implementing rollback recovery in a distributed computing system. A concept of logical ring is introduced for the maintenance of information required for consistent recovery from a system crash. The message processing order of a process is kept by all other processes on its logical ring. Transmission of data messages is accompanied by the circulation of the associated order messages on the ring. The sizes of the order messages are small. In addition, redundant transmission of order information is avoided, thereby reducing the communication overhead incurred during failure-free operation. Furthermore, updating of the order information and the garbage collection task are simplified in the proposed mechanism. Our approach does not require information about message processing order to be written to stable storage; in fact, the time-consuming operations of saving information in stable storage are confined to the checkpointing activities. When failures occur, a surviving process need roll back only if some preceding order information is totally lost, which is relatively unlikely considering the ever-growing speed of communication networks. It is shown that a system can recover correctly as long as there exists at least one surviving process.

16.
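Workstation clusters have become one of the mainstream directions in distributed parallel processing. As their application areas expand and their scale grows, demands on their reliability keep rising. Designing highly reliable cluster systems requires focused study of system fault-tolerance techniques. This paper describes rollback recovery and checkpoint derivation in a parallel heterogeneous environment, which provide transparent, portable fault tolerance and load balancing. A globally consistent state can be formed without adjusting checkpoints. This gives BSP applications autonomous fault tolerance and also lets them migrate between clusters, keeping the system load balanced. The focus is on the techniques of checkpointing, checkpoint derivation, rollback, and process migration.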

17.
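Fault-Tolerant PVM with Asynchronous Checkpointing   Total citations: 1, self-citations: 0, citations by others: 1
Computing environments typified by workstation clusters are a current research focus in distributed systems and parallel computing. The message-passing mechanism provided by PVM supports efficient heterogeneous network computing, but standard PVM lacks support for system fault tolerance, a gap that checkpoint-based rollback recovery can fill. This paper analyzes the design ideas and implementation techniques for global fault tolerance in PVM at user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled by the PVM daemon and a global scheduler process; all operations are transparent to the application. The system can further be used to implement transparent process migration and load balancing for PVM.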
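The core idea of asynchronous checkpointing with message logging can be sketched in a few lines. This is a minimal, single-process illustration and not the paper's PVM implementation: the class `LoggedProcess` is hypothetical, the handler is assumed to be deterministic, and the log and checkpoint are kept in memory here, whereas a real system would write them to stable storage. Each process checkpoints on its own schedule, logs every message it receives afterwards, and recovers by restoring the checkpoint and replaying the log.

```python
class LoggedProcess:
    """A process that checkpoints independently and logs received messages."""

    def __init__(self):
        self.total = 0            # the application state
        self._checkpoint = 0      # state saved at the last checkpoint
        self._log = []            # messages received since that checkpoint

    def _handle(self, msg):
        # Deterministic handler: same messages in the same order
        # always produce the same state.
        self.total += msg

    def receive(self, msg):
        # Log first, then process, so no delivered message is lost.
        self._log.append(msg)
        self._handle(msg)

    def checkpoint(self):
        # No coordination with other processes is needed.
        self._checkpoint = self.total
        self._log.clear()

    def recover(self):
        # Restore the saved state, then replay the logged messages in order.
        self.total = self._checkpoint
        for msg in self._log:
            self._handle(msg)
```

Because replay rebuilds the pre-failure state from the process's own log, other processes do not have to roll back, which is what lets checkpoints be taken without global synchronization.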

18.
Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

19.
An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: reduced wait for commitment for establishing recovery lines, fewer messages to be exchanged, and less memory requirement. These advantages are assessed quantitatively by developing a probabilistic model.
