
1.

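An Application-Level Checkpointing Mechanism for OpenMP Programs Based on Extended Data-Flow Analysis Total citations: 1, self-citations: 0, citations by others: 1

富弘毅 丁滟 宋伟 杨学军 《计算机学报》2010,33(10)

As multi-core processor architectures are used ever more widely in high-performance computing, fault tolerance for shared-memory parallel programs has become an active research topic. In recent years, checkpointing has become the dominant fault-tolerance mechanism in this field. Some work on checkpointing for OpenMP programs already exists, but the vast majority of these solutions depend on a special runtime library or hardware platform. This paper proposes a compiler-assisted application-level checkpointing scheme for OpenMP. The scheme is platform-independent: an OpenMP-oriented extended data-flow analysis selects only the "necessary" variables to save in the checkpoint image, which reduces fault-tolerance overhead, while a non-blocking protocol maintains the global consistency of checkpoints. The paper discusses the key issues of the mechanism and, through experimental evaluation and comparison with similar work, shows its advantages in fault-tolerance performance.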

2.

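An Exception-Handling-Based Fault-Tolerance Mechanism for Concurrent Programs

牛如美 陈雨亭 《计算机工程》2012,38(13):44-47

Current fault-tolerance mechanisms for concurrent programs handle errors in only one way and are inefficient. To address this, a fault-tolerance mechanism applicable to many kinds of concurrent program errors is proposed. It applies exception handling to the program during compilation and at run time, and when an exception occurs it rolls the program back to a previously set checkpoint and performs error-prevention handling, thereby achieving fault tolerance for concurrent programs. Experimental results show that the mechanism can effectively detect errors in concurrent programs and achieves fairly good fault tolerance without increasing the program's overall running time.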

3.

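A Low-Overhead Non-Blocking Coordinated Checkpointing Algorithm

万国伟 卢宇彤 谢旻 沈志宇 《计算机工程》2007,33(24):66-68

Coordinated checkpointing with rollback recovery is a simple and effective fault-tolerance technique that is widely used in parallel and distributed systems. To further reduce the overhead of coordinated checkpointing, this paper presents a non-blocking coordinated checkpointing algorithm based on reconstructible checkpoints. Rollback recovery caused by a parallel program failure occurs far less often than checkpoints are taken; the algorithm exploits this property by shifting part of the checkpointing overhead to the rollback-recovery phase, which reduces fault-tolerance overhead and improves system scalability.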

4.

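Research and Implementation of Checkpointing in Grid Environments

梁鸿 曾科宏 《计算机系统应用》2007,16(4):46-49

Checkpointing, a software fault-tolerance mechanism, is combined with the grid environment to improve the quality of service of grid computing and better meet the requirements of grid systems. This paper studies how to implement checkpointing for grid applications so that, after a compute node fails, the grid environment can restore the affected processes to the checkpointed state saved before the failure and resume execution from that checkpoint. This avoids re-executing the whole task, saves a large amount of repeated computation time, and provides a fault-tolerance service.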

5.

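A Method for Reducing Checkpoint Overhead Total citations: 1, self-citations: 0, citations by others: 1

李凯原 杨孝宗 《计算机工程与应用》2000,36(2):4-5,14

Checkpointing is an important means of failure recovery in fault-tolerant computer systems, and checkpoint overhead is a major factor affecting its performance. This paper proposes a new method that saves part of the checkpoint data in advance. The method not only effectively reduces checkpoint overhead but also has a relatively short checkpoint latency.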

6.

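A Quasi-Synchronous Checkpointing Method Based on PVM

张宇 张玉芳 《计算机工程与设计》2006,27(3):494-496

Checkpointing is an important means of achieving fault tolerance in parallel systems, and synchronous checkpointing is widely used in workstation clusters. The message-passing mechanism provided by PVM supports efficient heterogeneous network computing but does not support fault tolerance. To reduce the time overhead of synchronous checkpointing, a PVM-based quasi-synchronous checkpointing method is proposed. It retains the advantages of synchronous checkpointing while using message logging so that each node saves its state independently, which greatly reduces checkpoint synchronization overhead and improves the efficiency of checkpoint operations. The method has been implemented in a PVM environment, and experimental results show that it has good fault-tolerance performance.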

7.

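Analysis of the Optimal Checkpoint Interval in Dual-Machine Fault-Tolerant Systems Total citations: 2, self-citations: 0, citations by others: 2

鄢喜爱 杨金民 田华 《计算机工程》2007,33(5):283-285

Checkpointing is an important means of failure recovery in fault-tolerant computer systems. Because a checkpoint interval that is too long or too short degrades system performance, choosing an appropriate interval is an important part of performance optimization. For dual-machine fault-tolerant systems, this paper proposes a system model based on checkpointing and rollback recovery, uses a Markov chain to derive an equation for the optimal checkpoint interval, and verifies the equation experimentally.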

8.

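Static Analysis of Application-Level Checkpoint Placement for Heterogeneous Systems

贾佳 杨学军 马亚青 《软件学报》2013,24(6):1361-1375

Application-level checkpointing is a fault-tolerance technique that has attracted much attention in large-scale scientific computing. The programmer chooses appropriate places to save critical data, which reduces fault-tolerance overhead. Choosing suitable checkpoint locations and reducing the amount of data saved in a global checkpoint are the key problems in optimizing application-level checkpointing. Application-level checkpointing on the heterogeneous systems with general-purpose GPUs introduced in recent years faces the same problems. Based on the architecture and program characteristics of heterogeneous systems, checkpoint placement for application-level checkpointing on such systems is analyzed statically, and two placement methods with different mechanisms are proposed: synchronous and asynchronous checkpoint placement. The optimal placement problem is modeled mathematically and solved for each. Finally, experiments validate and evaluate the performance of the two methods.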

9.

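An Adaptive Fault-Tolerance Model for Workflows

林星 沈奇威 王纯 《计算机系统应用》2012,21(4):111-114,104

An adaptive fault-tolerance model for a workflow subsystem is designed that automatically selects a fault-tolerance strategy according to the type of workflow exception. Unrecoverable exceptions are handled with a strategy based on transaction compensation, while recoverable exceptions are handled with an automatic-recovery strategy. The message queue, transaction compensation mechanism, and automatic recovery mechanism used by the model are described in detail.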

10.

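Analysis of Checkpoint Rollback-Recovery Protocols in Cluster Systems Total citations: 2, self-citations: 0, citations by others: 2

张怡 胡建平 《计算机工程与科学》2001,23(5):66-69

Checkpointing, a software fault-tolerance mechanism, meets the fault-tolerance requirements of cluster systems well. This paper analyzes the various classes of checkpoint rollback-recovery protocols in detail and compares their performance and characteristics.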

11.

An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415

Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint (a snapshot of the whole workflow enactment state at normal execution, without failures) has to be taken. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy takes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri net which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability.

12.

Performance evaluation of fault tolerance techniques in grid computing system

Fiaz Gul Khan Kalim Qureshi Babar Nazir 《Computers & Electrical Engineering》2010,36(6):1110-1122

Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. Different fault tolerance techniques (FTTs) are therefore critical for improving the efficient utilization of expensive resources in high performance grid computing systems, and are an important component of grid workflow management systems. This paper presents a performance evaluation of the most commonly used FTTs in grid computing systems. In this study, we considered different system-centric parameters, such as throughput, turnaround time, waiting time and network delay, for the evaluation of these FTTs. For a comprehensive evaluation, we set up various conditions in which we vary the average percentage of faults in a system, along with different workloads, in order to find out the behavior of FTTs under these conditions. The empirical evaluation shows that workflow-level alternative task techniques perform better than task-level checkpointing techniques. This comparative study will help grid computing researchers understand the behavior and performance of different FTTs in detail.

13.

A Large-Scale Study of Failures on Petascale Supercomputers


Rui-Tao Liu Zuo-Ning Chen 《计算机科学技术学报》2018,33(1):24-41

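With the rapid development of supercomputers, their scale and complexity keep increasing, and reliability and resilience face greater challenges. There are many important fault-tolerance techniques, such as proactive failure avoidance based on fault prediction, reactive fault tolerance based on checkpointing, and scheduling techniques to improve reliability. Qualitative and quantitative descriptions of system fault characteristics are critical for these techniques. This study analyzes the sources of failures on two typical petascale supercomputers, Sunway BlueLight (based on multi-core CPUs) and Sunway TaihuLight (based on heterogeneous manycore CPUs). It uncovers some interesting fault characteristics and finds previously unknown correlations among faults of the main components. Finally, the paper analyzes the failure times of the two supercomputers at various resource granularities and over different time spans, and builds a uniform multi-dimensional failure time model for petascale supercomputers.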

14.

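A Proactive Recovery Algorithm Based on Byzantine Fault Tolerance

陈柳 周伟 《计算机与现代化》2013,(12):38-40

Existing recovery algorithms for Byzantine fault tolerance are not suitable for active replicas. To address this, a proactive recovery algorithm that supports stateful replicas is proposed. Each replica maintains a recovery queue. When a checkpoint is reached, the replica checks its recovery queue using the proactive recovery algorithm and is restored to a correct state before the service replica fails. The algorithm also applies if the replica has already failed. Experimental analysis shows that the algorithm is effective.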

15.

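Implementation of a Linux Checkpointing Mechanism Supporting File Migration Total citations: 2, self-citations: 2, citations by others: 0

杨晖 陈闳中 《计算机工程》2010,36(3):266-268

Building on the BLCR system, a checkpointing mechanism that supports migration of a process's open files is implemented. The overall framework, key techniques, saving and restoring of open files, and the state save and restore workflows are presented. Experimental results show that the mechanism supports saving and restoring multithreading, signals, open files, and pipes, requires no kernel recompilation, and is highly transparent to users.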

16.

Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit Raja Nassar Chokchai Leangsuksun Mihaela Paun 《The Journal of supercomputing》2014,68(3):1630-1651

Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.

17.

PROM: A Support for Robust Replication in a Distributed Object Environment

A. Corradi L. Leonardi 《计算机科学技术学报》1990,5(2):139-155

The concept of object can be employed to achieve tolerance to hardware faults in distributed systems. Replication, by introducing several copies of each object, allows continuous service even in case of failure. In particular, the paper describes an object model, PROM, which exploits replication by defining several passive back-up copies for any object. The system automatically recovers from any failure of a copy in execution by activating a spare copy and restarting it from a previous checkpoint. The aim of the paper is the analysis of the effective support for PROM. This support is organized in structured levels on a distributed architecture. The services that the support should include to guarantee the desired replication model are described.

18.

Checkpoint Management with Double Modular Redundancy Based on the Probability of Task Completion


Seong Woo Kwak Kwan-Ho You Jung-Min Yang 《计算机科学技术学报》2012,27(2):273-280

This paper proposes a checkpoint rollback strategy for real-time systems with double modular redundancy. Without built-in fault-detection and spare processors, our scheme is able to recover from both transient and permanent faults. Two comparisons are conducted at each checkpoint. First, the states stored in two consecutive checkpoints of one processor are compared for checking integrity of the processor. The states of two processors are also compared for detecting faults and the system rolls back to the previous checkpoint whenever required by logic of the proposed scheme. A Markov model is induced by the fault recovery scheme and analyzed to provide the probability of task completion within its deadline. The optimal number of checkpoints is selected so as to maximize the probability of task completion.

19.

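A Message-Queue-Based Workflow Engine and Its Fault-Tolerant Design Total citations: 1, self-citations: 0, citations by others: 1

汤丹 胡志刚 匡晓红 《计算机工程》2008,34(19):49-52

On a distributed workflow platform composed of a process definition tool, Web middleware, and a workflow engine, a new trusted-component design is proposed with the goal of improving the dependability of the workflow engine. The design addresses software fault-tolerant design, the reliability of the hardware and network platform, and reliable message transmission modes, and draws on an analysis of actual production use. Operational results show that the scheme works well in practice.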