Similar documents
 Found 20 similar documents; search took 218 ms
1.
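Key Techniques for Checkpointing Unix Processes   Total citations: 4, self-citations: 0, citations by others: 4
Checkpointing Unix processes underpins fault tolerance in distributed and parallel systems, replay debugging, process migration, system simulation, and job switching. This paper discusses the key techniques of UNIX process checkpointing: saving and restoring basic checkpoint information, file checkpointing, and optimizing checkpoint information. It closes by introducing the checkpointing tools Libckpt and Condor, along with the authors' own Libcsm.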

2.
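Multithreaded Checkpointing and Rollback Recovery on Solaris   Total citations: 1, self-citations: 0, citations by others: 1
Building on the ideas behind UNIX process checkpointing and on the way multithreading is implemented in Solaris, this article proposes a multithreaded checkpointing and recovery technique suited to the Solaris operating system. The technique is implemented at user level, is transparent to the user, and is simple and efficient. The article focuses on the key techniques involved: saving and restoring checkpoint information, function renaming, wrapping, and thread ID mapping.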

3.
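Design and Implementation of a Process Checkpoint System Based on the Linux Kernel   Total citations: 1, self-citations: 0, citations by others: 1
Checkpoint and recovery, a popular software fault-tolerance mechanism, can be implemented in two modes: user level and system level. The paper first explains the differences between the two, then proposes a process checkpoint and recovery method based on the Linux kernel that relies on the Linux loadable kernel module mechanism. A checkpoint and recovery kernel module is implemented using Linux kernel threads, and a checkpoint function library is built on top of this module in user space to expose the corresponding interfaces. By combining these interfaces, users can efficiently implement specific checkpoint and recovery algorithms.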

4.
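Li Chunjiang  Xiao Nong  Yang Xuejun 《计算机工程》 (Computer Engineering) 2005,31(10):57-59,102
The paper analyzes what makes implementing a checkpoint mechanism in a computational grid environment distinctive and proposes a new application-level checkpointing method based on job progress description. It introduces the basic idea of the method and defines the job progress state object and the job progress description object that together make up a job progress description; the methods of these objects form the checkpoint API. It also discusses how checkpointed jobs are constructed.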
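
As an illustration only, application-level checkpointing through a progress description can be sketched as follows. The class `JobProgress`, its fields, and `run_job` are hypothetical names invented for this sketch; they are not the API defined in the paper, which the abstract does not spell out. The sketch assumes the job's resumable state fits in a small JSON-serializable object.

```python
import json
import os
import tempfile
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class JobProgress:
    """Hypothetical progress description: just enough state to resume the job."""
    next_step: int = 0
    partial_sum: float = 0.0

    def save(self, path: str) -> None:
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump(asdict(self), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)

    @classmethod
    def load(cls, path: str) -> "JobProgress":
        # No checkpoint yet means the job starts from the beginning.
        if not os.path.exists(path):
            return cls()
        with open(path) as f:
            return cls(**json.load(f))


def run_job(path: str, total_steps: int,
            stop_after: Optional[int] = None) -> JobProgress:
    """Run (or resume) the job; stop_after simulates an interruption."""
    progress = JobProgress.load(path)
    executed = 0
    while progress.next_step < total_steps:
        if stop_after is not None and executed == stop_after:
            return progress
        progress.partial_sum += progress.next_step  # the "work" of one step
        progress.next_step += 1
        progress.save(path)  # checkpoint after every completed step
        executed += 1
    return progress
```

Calling `run_job(path, 10, stop_after=4)` and then `run_job(path, 10)` resumes from step 4 rather than recomputing from step 0. Checkpointing after every step keeps the sketch simple; a real job would likely checkpoint less often.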

5.
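Large-scale heterogeneous many-core computer systems offer strong computing power and a high performance-to-power ratio, and have become the direction of supercomputer development. Their complex heterogeneous architecture and enormous scale, however, pose major challenges to system availability, so lightweight fault-tolerance techniques for such systems are important to study. To address the excessive overhead of traditional checkpoint-based system-level fault tolerance, this work designs and implements language-supported fault-tolerance mechanisms in the Parallel C language: lightweight degradation with local fault awareness, and checkpointing driven by compiler directives and automatic analysis, balancing usability and efficiency. The lightweight degradation is implemented together with a dynamic task-scheduling framework, supports many-core systems, and scales to more than a million-way parallelism. For directive-guided checkpointing, the programmer inserts simple compiler directives and the compiler analyzes the program to identify data that need not be saved, which effectively reduces the volume of data saved and restored. Tests on the Sunway TaihuLight supercomputer show that both measures perform well relative to traditional fault-tolerance methods: lightweight degradation incurs less than 1% fault-tolerance overhead and cuts execution time per failure by more than 3.5% compared with traditional rollback, while directive-guided checkpointing reduces the saved data volume to as little as 1/10 in typical applications. Both are therefore highly practical.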

6.
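This work designs and implements a checkpoint-based task recovery mechanism for the VxWorks system and analyzes the contents of checkpoint files under VxWorks. The mechanism is realized through active memory management based on pre-allocated memory, inter-task communication based on a pool of system kernel objects, and a checkpoint-based user-level middleware. Validation experiments show that the prototype provides basic task recovery and effectively improves the system's software fault tolerance.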

7.
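In large-scale cluster environments, checkpointing and recovery is an indispensable fault-tolerance technique. This paper proposes a method that uses the reliability mechanism of the cluster communication system to capture the global state of the communication system without global synchronization, and uses this method to implement a parallel checkpoint system that is transparent to applications. With support from the underlying communication system, the system reduces the implementation complexity and runtime overhead of parallel checkpointing and is suitable for large-scale cluster applications.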

8.
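Parallel discrete event simulation facilitates the study of complex, large-scale dynamic systems and the exploration of their long-term applications, and has become an increasingly active research topic in recent years. Time synchronization management, however, is one of the major factors limiting the efficiency of parallel discrete event simulation systems. Optimistic synchronization uses a detect-and-roll-back mechanism that lets logical processes eagerly process local events. When a synchronization error occurs, the rollback mechanism restores an earlier state and execution resumes from there. All of this relies on checkpoint-based state saving and reconstruction, so state saving and reconstruction inevitably cost both time and space. This paper studies in depth how simulation execution time and memory consumption relate to the checkpoint interval under optimistic synchronization, and derives the optimal range for the checkpoint interval.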

9.
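To speed up recovery of in-memory databases from various failures, this paper proposes a recovery method based on shadow paging, a hybrid logging strategy, and fuzzy checkpointing. A system model of the in-memory database is built from an analysis of where most time is spent during operation. By analyzing the transaction process and the checkpoint process, the paper discusses how the recovery strategy executes and what its advantages are. It then describes recovery from transaction failures and system failures under this system model and recovery strategy, and analyzes system performance.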

10.
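To meet the reliability requirements of clustered flight data processing in air traffic control (ATC) systems, this paper proposes a Linux-based user-level process checkpointing and recovery scheme. It introduces the main modules of clustered flight data processing built on these Linux user-level process checkpoints and, on that basis, presents the system design framework. The detailed implementation is described in terms of saving and restoring the process's initialized data segment, heap, stack, and open files. The scheme can restore a process to its pre-restart running state after a host crash and reboot. More importantly, in a distributed system the saved checkpoint can be moved to another host through process migration and run there, which effectively improves system reliability and reduces lost computation.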

11.
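Large-scale scientific and engineering computing involves numerical simulations of unprecedented complexity and data of unprecedented volume, so a fault-tolerant environment that automatically reschedules and reloads failed programs is needed. Based on parallel jobs and the checkpoint/restart facility provided by the system, this paper designs a user-level automatic fault-tolerant scheduling environment for parallel jobs, including algorithms for automatic detection, automatic loading, and data integrity assurance in the fault-tolerant scheduling of parallel programs. Test results show that the environment guarantees the integrity of checkpoint data. After an application exits with an error, the scheduling environment detects it automatically and resubmits the job, achieving fault-tolerant automatic scheduling of parallel jobs without user intervention and avoiding wasted system resources and compute time.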

12.
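For HPC users, compute cost is one of the key considerations when migrating to the cloud. The preemptible instances offered on Alibaba Cloud are a type of on-demand instance intended to lower the cost of public cloud computing resources. Their market price fluctuates and is usually far below that of regular on-demand instances, sometimes as low as one tenth. A preemptible instance is generally reserved for the user for a minimum period after creation, after which it may be released, so such instances are generally suited to stateless workloads. This paper proposes an auto-scaling strategy on the public cloud. It targets general-purpose HPC cluster schedulers and automatically deploys and scales out compute resources in the cloud while controlling cost, based on the user's application software type, job submission patterns, and performance and cost requirements. Users can thus "only pay for what you want and what you use". Drawing on the wide range of instance types and purchasing options on the public cloud, a low-cost HPC auto-scaling solution can be configured from the auto-scaling service, preemptible instances, and checkpoint-based resumption: when submitting a job the user can specify a cost ceiling, the auto-scaling service automatically finds and scales out preemptible compute resources below that cost, and checkpoint-based resumption ensures that the job can continue computing when compute resources are switched. Finally, two high-performance applications, LAMMPS and GROMACS, are used to verify the feasibility and effectiveness of the strategy.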

13.
In this paper, we present a unified model for several well-known checkpoint/restart protocols. The proposed model is generic enough to encompass both extremes of the checkpoint/restart space, from coordinated approaches to a variety of uncoordinated checkpoint strategies (with message logging). We identify a set of crucial parameters, instantiate them, and compare the expected efficiency of the fault tolerant protocols, for a given application/platform pair. We then propose a detailed analysis of several scenarios, including some of the most powerful currently available high performance computing platforms, as well as anticipated Exascale designs. The results of this analytical comparison are corroborated by a comprehensive set of simulations. Altogether, they outline comparative behaviors of checkpoint strategies at very large scale, thereby providing insight that is hardly accessible to direct experimentation. Copyright © 2013 John Wiley & Sons, Ltd.
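
To give a feel for the kind of parameters such an efficiency comparison depends on, the sketch below uses Young's classic first-order approximation for coordinated checkpointing. This is a swapped-in textbook model, not the unified model of the paper: it assumes the checkpoint cost `checkpoint_cost` is much smaller than the platform's mean time between failures `mtbf`, and both function names are chosen here for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the checkpoint period (same unit as inputs)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost for a given checkpoint period.

    First term: time spent writing checkpoints.
    Second term: expected rework after a failure (half a period on average).
    Only meaningful when checkpoint_cost << period << mtbf.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # illustrative values: 60 s checkpoint, 1 day MTBF
    period = young_interval(cost, mtbf)
    print(f"period = {period:.0f} s, waste = {first_order_waste(period, cost, mtbf):.3%}")
```

Under this approximation, checkpointing either more or less often than `young_interval` suggests increases the wasted fraction, which is the trade-off that more detailed protocol models refine.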

14.
Checkpoint/restart is the ability to save the state of a running application so that it can later resume its execution from the time of the checkpoint. These techniques have many potential applications, including establishing a fault-tolerant environment, improving system resource utilization, and true migration of a process. With increasing hardware speed and cluster size, the average time between failures has been reduced. Therefore, fault tolerance and the ability to checkpoint a process have become indispensable. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, one of the popular operating systems, does not provide a general-purpose implementation. Some implementations are limited to a specific type of parallel programming library, confined to some unique well-behaved type of application, or reliant on specific kernel features that are missing on many occasions. Most implementations demand the elaborate practice of recompiling a whole kernel to apply the required patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux that provides dynamic extension capability to increase compatibility and reduce system overhead. It does not impose any requirement on the existence of a special facility in the operating system and can checkpoint and restart an application fully transparently, independent of its behavior. The entire system is implemented in multiple loadable kernel modules, which makes it easy to use and eliminates the burden of complex system administration.

15.
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
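
The core idea of hash-based incremental checkpointing, saving only the pages whose hash has changed since the previous checkpoint, can be sketched on the CPU as follows. This is not libhashckpt's implementation: it omits page protection and GPU hashing entirely, assumes a fixed-size state buffer, and uses an assumed 4096-byte page size with SHA-256 from the Python standard library. All function names are invented for the sketch.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(buf, page_size=PAGE_SIZE):
    """Hash every page of the buffer."""
    return [hashlib.sha256(buf[i:i + page_size]).digest()
            for i in range(0, len(buf), page_size)]


def incremental_checkpoint(buf, prev_hashes, page_size=PAGE_SIZE):
    """Return (delta, new_hashes).

    delta maps page index -> page bytes, and contains only pages whose
    hash differs from the previous checkpoint (or that are new).
    """
    new_hashes = page_hashes(buf, page_size)
    delta = {}
    for idx, digest in enumerate(new_hashes):
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            start = idx * page_size
            delta[idx] = bytes(buf[start:start + page_size])
    return delta, new_hashes


def apply_delta(base, delta, page_size=PAGE_SIZE):
    """Rebuild a state by applying a delta on top of an earlier full state."""
    restored = bytearray(base)
    for idx, page in delta.items():
        start = idx * page_size
        restored[start:start + len(page)] = page
    return restored
```

With an empty `prev_hashes` the first call produces a full checkpoint; later calls return only modified pages, so a restart applies the saved deltas in order on top of the first full checkpoint. The sketch still hashes the whole buffer at every checkpoint, which is the cost the paper's hybrid approach is designed to reduce.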

16.
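An Algorithm for Saving and Restoring File State in Fault-Tolerant Systems   Total citations: 1, self-citations: 0, citations by others: 1
Fault tolerance in cluster computing environments is a topic of growing interest. Many well-known cluster computing environments use checkpointing to provide fault tolerance. Current checkpoint algorithms, however, cannot restore the state of the file system to match when they roll a program back, and therefore place many restrictions on how applications access the file system. Building on atomic operations and concurrency control, this paper proposes the SCR algorithm, which can restore file system state, and further develops the concept of file system recoverability. Used together with a checkpoint mechanism, the SCR algorithm can support distributed applications, while running fault-tolerantly, in their … on the file system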

17.
Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on the checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.

18.

Until now, jobs running on HPC clusters were tied to the node where their execution started. We have removed that limitation by integrating a user-level checkpoint/restart library into a resource manager, fully transparent to both the user and the running application. This opens the door to a whole new set of tools and scheduling possibilities based on the fact that jobs can be migrated, checkpointed, and restarted in a different place or at a different moment, while providing fault tolerance for every job running on the cluster. This is of utmost importance in the future generation of exascale HPC clusters, where an increasing degree and complexities of efficient scheduling make it challenging to obtain the required degree of parallelism demanded by the applications.


19.
Chen  Genlang  Zhang  Jiajian  Zhu  Zufang  Jiang  Qiangqiang  Jiang  Hai  Pang  Chaoyi 《The Journal of supercomputing》2021,77(6):5426-5467
The Journal of Supercomputing - The checkpoint/restart mechanism is critical in a preemptive system because clusters with this mechanism will be improved in terms of fault tolerance, load balance,...

20.
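Checkpointing is an important technique for saving and restoring the running state of a process, and an important means of implementing fault tolerance, rollback debugging, and process migration. This paper examines how the fully transparent checkpoint system Epckpt handles System V shared memory and where it falls short, and presents the authors' improvements, which better support saving and restoring System V shared memory.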
