1.
Concern is growing in the high-performance computing (HPC) community about the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
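The hashing half of the idea in this abstract can be illustrated with a minimal CPU-only sketch. It is not libhashckpt's implementation: the page size, the choice of SHA-256, and the function names `page_hashes`, `incremental_checkpoint`, and `restore` are all assumptions made for illustration, and the page-protection and GPU parts are left out. The sketch hashes fixed-size pages of a buffer and saves only the pages whose hash changed since the previous checkpoint.

```python
import hashlib

PAGE_SIZE = 4096  # assumed block size, chosen only for this illustration


def page_hashes(buf, page_size=PAGE_SIZE):
    """Return one SHA-256 digest per fixed-size page of buf."""
    return [
        hashlib.sha256(buf[off:off + page_size]).digest()
        for off in range(0, len(buf), page_size)
    ]


def incremental_checkpoint(buf, prev_hashes, page_size=PAGE_SIZE):
    """Compare current page hashes with the previous ones.

    Returns (delta, new_hashes), where delta maps page index -> page bytes
    for every page that is new or whose hash differs.
    """
    new_hashes = page_hashes(buf, page_size)
    delta = {}
    for i, digest in enumerate(new_hashes):
        if i >= len(prev_hashes) or prev_hashes[i] != digest:
            delta[i] = bytes(buf[i * page_size:(i + 1) * page_size])
    return delta, new_hashes


def restore(base, delta, page_size=PAGE_SIZE):
    """Rebuild the state by applying saved pages on top of a full base image.

    Assumes the buffer has the same length as the base image.
    """
    out = bytearray(base)
    for i, page in delta.items():
        out[i * page_size:i * page_size + len(page)] = page
    return bytes(out)
```

For example, after taking a full image of a four-page buffer and then modifying one byte in the second page, `incremental_checkpoint` returns a delta containing only page index 1, and `restore` applied to the full image and that delta reproduces the modified buffer.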
2.
For distributed databases, checkpointing provides an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent, and most current checkpointing approaches for distributed databases are too expensive at run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn lengthens response times for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading system performance. This paper presents a low-cost solution to these problems, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC). LSLFC supports global reconstruction, and our performance study shows that it adds little run-time overhead.
3.
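Core support for checkpointing in Linux: design of the ckpt file system  Total citations: 1, self-citations: 0, citations by others: 1
A checkpoint is a software fault-tolerance mechanism whose purpose is to improve system reliability and reduce lost computation; the checkpoint mechanism is also the basis for process migration and load balancing in parallel systems. Some checkpoint systems provide only user-level support for checkpointing and restoring process state, which imposes many limitations, for example the inability to checkpoint a process's external state. The authors' design and implementation has kernel-level support and can therefore fully overcome these limitations.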
4.
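SFT: a consistent checkpointing algorithm with a short freeze time  Total citations: 1, self-citations: 0, citations by others: 1
This paper introduces SFT, a consistent checkpointing algorithm based on message logging. SFT provides fault tolerance for distributed systems and has the advantages of being free of the domino effect, having a short freeze time and low overhead, and using a simple restart algorithm. SFT's IPC mechanism is based on PVM; it guarantees in-order message delivery, and its send and receive operations are atomic. In addition, process id values in the IPC mechanism are encoded independently of the host machine, so a process can continue communicating with other processes even after it migrates from a failed machine to another machine. To improve the parallelism of checkpoint operations, SFT …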
5.
Jichiang Tsai, Journal of the Chinese Institute of Engineers, 2013, 36(6): 1113-1118
Communication-induced checkpointing (CIC) protocols can be used to prevent the domino effect. Protocols in the index-based category have been shown to perform more efficiently. In this paper, we present some results on comparing index-based CIC protocols. First, we prove that several protocols based on the lazy indexing strategy can be compared simply on the basis of their checkpoint-inducing conditions. Next, we show that improved indexing strategies may not always yield better performance than the classical strategy. Finally, we present a simulation study to verify these theoretical results. The simulation is conducted in a typical point-to-point computational environment. The influence of enhancements to indexing strategies and checkpoint-inducing conditions for index-based CIC protocols is discussed.
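To make "index-based" and "checkpoint-inducing condition" concrete, here is a minimal sketch of the classical index-based rule commonly attributed to Briatico, Ciuffoletti and Simoncini (BCS). The abstract does not say which protocols the paper compares, so treat this as background rather than the paper's method. The class and method names are hypothetical. Each process keeps a checkpoint index, piggybacks it on every message, and takes a forced checkpoint before delivering any message whose index is larger than its own.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """One process running a BCS-style index-based CIC rule (illustrative)."""
    name: str
    index: int = 0
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def __post_init__(self):
        # Every process starts with an initial checkpoint of index 0.
        self.checkpoints.append((0, "initial"))

    def basic_checkpoint(self):
        """Locally scheduled checkpoint: advance the index and record it."""
        self.index += 1
        self.checkpoints.append((self.index, "basic"))

    def send(self, payload):
        """Piggyback the current index on the outgoing message."""
        return (payload, self.index)

    def receive(self, message):
        """Checkpoint-inducing condition: sender index > local index.

        The forced checkpoint is taken before the message is delivered.
        """
        payload, sender_index = message
        if sender_index > self.index:
            self.index = sender_index
            self.checkpoints.append((self.index, "forced"))
        return payload
```

If process `p` takes a basic checkpoint and then sends a message to `q`, `q` takes a forced checkpoint with index 1 before delivery; a reply from `q` carries index 1 and forces nothing at `p`.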
6.
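A partitioned fuzzy checkpointing strategy for real-time main-memory databases  Total citations: 5, self-citations: 0, citations by others: 5
Checkpointing is one of the key techniques for recovery in real-time main-memory databases. Based on an analysis of the data characteristics of real-time main-memory databases, this paper gives a method for computing the checkpoint priority of data that takes the timing constraints of both data and transactions into account. Then, building on the segmented storage structure of main-memory databases, it discusses PFCS-SCP, a partitioned fuzzy checkpointing strategy based on the checkpoint priorities of data segments. Performance tests show that the proposed strategy reduces the ratio of transactions that miss their deadlines.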
7.
8.
This paper assesses the use of Chandy and Lamport's distributed snapshots algorithm (DSA) for stabilizing a communication protocol, a special type of distributed system. We show that when a loss of coordination occurs during the distributed execution of the protocol, DSA is not guaranteed to terminate, and therefore it sometimes fails to obtain a global state or snapshot. We propose some modifications to DSA to solve this problem. Finally, we discuss how, in the case of a loss of coordination, the modified algorithm can be used to stabilize a communication protocol, and we assess the suitability of the global state obtained by DSA as a recovery point to be used later in a backward recovery procedure.
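For readers unfamiliar with DSA, the following is a minimal single-threaded sketch of the unmodified Chandy and Lamport marker rule on reliable FIFO channels. It does not include the modifications proposed in the paper, and it does not model loss of coordination. The `Node` class, the token-transfer workload, and the `run_demo` scenario are assumptions made for illustration. A node records its state when it starts the snapshot or sees its first marker, sends markers on all outgoing channels, and records messages arriving on each incoming channel until that channel's marker arrives.

```python
from collections import deque

MARKER = "MARKER"


class Node:
    def __init__(self, name, balance, peers):
        self.name = name
        self.balance = balance
        self.inbox = {p: deque() for p in peers}  # FIFO channel from each peer
        self.recorded_state = None
        self.channel_state = {}   # peer -> messages recorded as in flight
        self.recording = set()    # incoming channels still being recorded

    def _record(self, network):
        """Record local state, then send a marker on every outgoing channel."""
        self.recorded_state = self.balance
        self.recording = set(self.inbox)
        self.channel_state = {p: [] for p in self.inbox}
        for peer in self.inbox:
            network[peer].inbox[self.name].append(MARKER)

    def start_snapshot(self, network):
        self._record(network)

    def send(self, network, peer, amount):
        self.balance -= amount
        network[peer].inbox[self.name].append(amount)

    def deliver(self, network, peer):
        """Deliver the oldest message on the channel from peer."""
        msg = self.inbox[peer].popleft()
        if msg == MARKER:
            if self.recorded_state is None:
                self._record(network)
            self.recording.discard(peer)  # this channel's state is complete
        else:
            self.balance += msg
            if peer in self.recording:
                self.channel_state[peer].append(msg)


def run_demo():
    """Two nodes exchange tokens while A initiates a snapshot."""
    network = {"A": Node("A", 100, ["B"]), "B": Node("B", 50, ["A"])}
    a, b = network["A"], network["B"]
    a.send(network, "B", 10)     # in flight A -> B
    b.send(network, "A", 20)     # in flight B -> A
    a.start_snapshot(network)    # A records 90, marker follows the 10
    a.deliver(network, "B")      # A receives 20, recorded as channel state
    b.deliver(network, "A")      # B receives 10
    b.deliver(network, "A")      # B receives marker, records its state
    a.deliver(network, "B")      # A receives marker, snapshot complete
    states = {n.name: n.recorded_state for n in network.values()}
    channels = {n.name: n.channel_state for n in network.values()}
    return states, channels
```

In this scenario the snapshot records 90 tokens at A, 40 at B, and 20 in flight from B to A, which sums to the 150 tokens in the system. The recorded global state is consistent even though no node ever held exactly those balances at the same instant.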
9.
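A rollback recovery algorithm based on a checkpointing dependency graph and attribute tables  Total citations: 2, self-citations: 0, citations by others: 2
To solve the domino effect problem in checkpointing and the livelock problem in rollback recovery while minimizing time overhead, this paper proposes a rollback recovery algorithm based on a checkpointing dependency graph and attribute tables. Compared with earlier algorithms, it saves the time spent on inter-process synchronization, and both checkpointing and rollback involve only a small number of related processes. The correctness of the algorithm is proved.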
10.
Marcelo Pereira da Silva, Rafael Rodrigues Obelheiro, International Journal of Parallel, Emergent and Distributed Systems, 2017, 32(4): 348-367
With the ever increasing dependence on computers and networks, many systems are required to be continuously available in order to fulfil their mission. Virtualization technology enables high availability to be offered in a convenient, cost-effective manner: with the encapsulation provided by virtual machines (VMs), entire systems can be replicated transparently in software, obviating the need for expensive fault-tolerant hardware. Remus is a VM replication mechanism for the Xen hypervisor that provides high availability despite crash failures. Replication is performed by checkpointing the VM at fixed intervals. However, there is an antagonism between processing and communication regarding the optimal checkpoint interval: while longer intervals benefit processor-intensive applications, shorter intervals favour network-intensive applications. Thus, any chosen interval may not always be suitable for the hosted applications, limiting Remus usage in many scenarios. This work introduces Adaptive Remus, a proposal for adaptive checkpointing in Remus that dynamically adjusts the replication frequency according to the characteristics of running applications. Experimental results indicate that our proposal improves performance for applications that require both processing and communication, without harming applications that use only one type of resource.