共查询到19条相似文献,搜索用时 78 毫秒
1.
一种基于扩展数据流分析的OpenMP程序应用级检查点机制 总被引:1,自引:0,他引:1
随着多核处理器体系结构在高性能计算领域日益广泛的应用,面向共享存储并行程序的容错问题成为研究的热点.近年来,检查点技术已经成为该领域占主导地位的容错机制.目前已有一些针对OpenMP程序检查点技术的研究工作,但其中绝大多数解决方案都依赖于特殊的运行时库或硬件平台.该文提出一种编译辅助的OpenMP应用级检查点,它是一种平台无关的方案,通过面向OpenMP的扩展数据流分析选择那些"必需"的变量保存到检查点映像,从而降低容错的开销,同时通过运行一种非阻塞式的协议维护检查点的全局一致性.文章讨论了该机制的各个关键问题,并通过实验评测以及与同类工作的比较,表明了该文所提出的检查点机制在容错性能方面的优势. 相似文献
2.
3.
4.
5.
减少检查点开销的一种方法 总被引:1,自引:0,他引:1
设置检查点(checkpointing)是容错计算机系统进行故障恢复的重要手段。设置检查点的开销则是影响其性能的一个主要因素。文章提出了一种预先保存部分检查点数据的新方法。该方法不仅能够有效地减少检查点开销,而且具有比较短的检查点延迟。 相似文献
6.
检查点是并行系统中实现容错的重要手段,同步检查点方法已广泛应用在工作站机群系统中。PVM所提供的消息传递机制支持高效的异构网络计算,但不支持客错功能。为了降低同步检查点设置的时间开销,提出了一种基于PVM的准同步检查点设置方法,它吸取了同步检查点方法的优点,又通过消息记录方式实现各节点间独立进行状态保存,大大降低了检查点的同步开销,提高了检查点操作效率,该方法在PVM环境下得以实现,实验结果表明所提出的方法具有较好的客错性能。 相似文献
7.
8.
应用级checkpointing是一种在大规模科学计算领域中备受关注的容错技术,该技术由用户程序员选择在适当的地方保存关键数据,从而降低了容错开销。选择合适的checkpointing位置、减小全局checkpoint保存数据量是优化应用级 checkpointing 技术的关键问题。对于近年来推出的带有通用 GPU 的异构系统上的应用级checkpointing 技术,也同样面临上述问题。针对异构系统体系结构和程序特征,对面向异构系统的应用级checkpointing 技术的检查点设置进行了静态分析,提出两套不同机制的检查点设置方法:同步及异步检查点设置方法,并分别就checkpointing优化设置问题对其进行数学建模和求解。最后,通过实验验证并评估了所提出的两种方法的性能。 相似文献
9.
10.
检查点机制作为一种软件容错机制,可以很好地满足机群系统的容错要求,本文详细分析了各类检查点卷回恢复协议,并比较它们的性能和特点。 相似文献
11.
Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415
Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed and at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint – a snapshot of the whole workflow enactment state at normal execution (without failures) – has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy accomplishes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri nets which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability. 相似文献
12.
As fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. Therefore, different fault tolerance techniques (FTTs) are critical for improving the efficient utilization of expensive resources in high performance grid computing systems, and an important component of grid workflow management system.This paper presents a performance evaluation of most commonly used FTTs in grid computing system. In this study, we considered different system centric parameters, such as throughput, turnaround time, waiting time and network delay for the evaluation of these FTTs. For comprehensive evaluation we setup various conditions in which we vary the average percentage of faults in a system, along with different workloads in order to find out the behavior of FTTs under these conditions. The empirical evaluation shows that the workflow level alternative task techniques have performance priority on task level checkpointing techniques. This comparative study will help to grid computing researchers in order to understand the behavior and performance of different FTTs in detail. 相似文献
13.
与超级计算机的快速的开发,规模和复杂性曾经正在增加,并且可靠性和跳回面临更大的挑战。在容错有许多重要技术,例如基于差错预言的积极失败回避技术,反应容错基于检查点,和安排技术到改进可靠性。系统差错的特征上的质、量的描述为这些技术是很批评的。这研究在超级计算机把 Sunway BlueLight 称为的二典型 petascale 上分析失败的来源(基于多核心中央处理器) 并且 Sunway TaihuLight (基于异构的 manycore 中央处理器) 。它揭开一些有趣的差错特征并且在主要部件差错之中发现未知关联关系。最后,纸在资源和不同时间跨度的各种各样的谷物分析二台超级计算机的失败时间,并且为 petascale 超级计算机造一个一致多维的失败时间模型。 相似文献
14.
针对现有拜占庭容错中的恢复算法不适用于主动复制品的这一问题,提出支持有状态复制品的前摄恢复算法。每个复制品维护一个恢复队列。当到达一个检查点后,使用该前摄恢复算法复制品检查恢复队列,在服务复制品发生错误前,提前将复制品恢复成正确的状态。如果复制品已经出错,该算法也适用。实验分析结果显示算法的有效性。 相似文献
15.
16.
Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651
Given that the reliability of a very large-scaled system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing including the most recent deployments with graphic processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs and these may cause a significant performance drop if it is not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between the GPGPU application performance and the reliability of a large GPU system. This work focuses on the checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on the application performance is thoroughly studied and discussed. 相似文献
17.
The concept of object can be employed to achieve tolerance to hardware faults in distributed systems.Replication by introducing several copies for each object allows a continuous service even in case of failure.In particular,the paper describes an object model,PROM,which exploits replication by defining several passive back-up copies for any object.The system automatically reovers any failure of a copy in execution by activating a spare copy and restarting it from a previous checkpoint.The aim of the paper is the analysis of the effective support for PROM.This support is organized in structured levels on a distributed architecture.The services that the support should include to guarantee the desired replication model are described. 相似文献
18.
Checkpoint Management with Double Modular Redundancy Based on the Probability of Task Completion 下载免费PDF全文
This paper proposes a checkpoint rollback strategy for real-time systems with double modular redundancy.Without built-in fault-detection and spare processors,our scheme is able to recover from both transient and permanent faults.Two comparisons are conducted at each checkpoint.First,the states stored in two consecutive checkpoints of one processor are compared for checking integrity of the processor.The states of two processors are also compared for detecting faults and the system rolls back to the previous checkpoint whenever required by logic of the proposed scheme.A Markov model is induced by the fault recovery scheme and analyzed to provide the probability of task completion within its deadline.The optimal number of checkpoints is selected so as to maximize the probability of task completion. 相似文献