Similar literature
 Found 20 similar articles; search time: 15 ms
1.
    
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart model has been proposed to decrease the time of failure recovery in Bulk‐Synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global‐restart model that addresses these problems. Our key idea and optimization is the co‐design of basic fault‐tolerance mechanisms such as failure detection, notification, and recovery between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

2.
    
The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm‐based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault‐tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high‐order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a “limited essentially nonoscillatory” (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault‐tolerant MPI with user‐level failure mitigation (ULFM) to recover from runtime failure and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
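The boundedness concern raised in this abstract can be illustrated with a small sketch. It is not the LENO scheme and not Uintah code: it is a plain 1D linear reconstruction from coarse cell values followed by a simple clamp to the neighbouring coarse values, and the function name `prolong_limited` and its parameters are assumptions made for illustration only.

```python
from typing import List


def prolong_limited(coarse: List[float], ratio: int) -> List[float]:
    """Rebuild fine-cell values from 1D cell-centred coarse values.

    Illustrative stand-in only (NOT LENO): linear reconstruction with a
    central-difference slope, then clamping to the range of the neighbouring
    coarse cells so that no new overshoot or negative value is created.
    """
    fine: List[float] = []
    n = len(coarse)
    for i, centre in enumerate(coarse):
        # At the domain ends, reuse the cell's own value as the missing neighbour.
        left = coarse[i - 1] if i > 0 else centre
        right = coarse[i + 1] if i < n - 1 else centre
        slope = (right - left) / 2.0  # change per coarse cell width
        lo, hi = min(left, centre, right), max(left, centre, right)
        for k in range(ratio):
            # Fine-cell centre position inside the coarse cell, in [-0.5, 0.5].
            offset = (k + 0.5) / ratio - 0.5
            value = centre + slope * offset
            fine.append(min(max(value, lo), hi))  # the clamp acts as the limiter
    return fine
```

For non-negative coarse data the clamp keeps every rebuilt fine value non-negative, which is the kind of physical consistency the abstract says unlimited interpolation can violate.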

3.
A universal quantum computer can be constructed using abelian anyons. Two qubit quantum logic gates such as controlled-NOT operations are performed using topological effects. Single-anyon operations such as hopping from site to site on a lattice suffice to perform all quantum logic operations. Anyonic quantum computation might be realized using quasiparticles of the fractional quantum Hall effect. PACS: 03.65-Lx

4.
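杨娜 (Yang Na), 刘靖 (Liu Jing). 《软件学报》 (Journal of Software), 2019, 30(4): 1191-1202
Providing efficient and continuously available fault-tolerance services is essential to the reliable operation of cloud application systems. Adopting the fault-tolerance-as-a-service model, this paper proposes an optimized method for dynamically provisioning cloud fault-tolerance services. It describes the fault-tolerance requirements of cloud applications in terms of the reliability and response time of their components; builds on common fault-tolerance techniques such as replication, checkpointing, and NVP (N-version programming); and takes full account of the overhead of dynamically switching between fault-tolerance services. For the two scenarios in which the underlying cloud resources supporting the service are or are not sufficient, it gives optimization-based methods for finding the best available fault-tolerance-as-a-service provisioning plan. Experimental results show that the proposed method reduces both the fault-tolerance service fees paid by cloud application systems and the overhead of the underlying cloud resources, and improves the provider's ability to deliver efficient and reliable fault tolerance as a service to multiple cloud applications.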

5.
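Software fault-tolerance techniques for transient faults   Total citations: 1 (self-citations: 0, citations by others: 1)
Transient faults caused by cosmic-ray radiation have long been one of the main challenges facing aerospace computing. As integrated-circuit manufacturing processes continue to advance, the performance of modern processors has increased greatly, but their dependability is increasingly threatened by transient faults. Current fault-tolerance techniques for transient faults fall roughly into two categories: hardware-implemented and software-implemented. Compared with the former, the latter has attracted much attention because of its advantages in implementation cost and flexibility. This paper first outlines...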

6.
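杨娜 (Yang Na), 刘靖 (Liu Jing). 《计算机科学》 (Computer Science), 2017, 44(7): 61-67, 97
In cloud computing environments, failure has become the norm. Insufficient reliability assurance is not only a major obstacle to the adoption of cloud computing applications but also makes fault-tolerance services for cloud computing a problem in urgent need of a solution. Current research on cloud fault-tolerance services has two shortcomings: the way users' fault-tolerance requirements are defined does not directly reflect the reliability issues users care about, and the resources of cloud fault-tolerance service providers cannot be used flexibly. To address these, this paper proposes a cloud fault-tolerance service adaptation method that combines fault-tolerance requirements with resource constraints. From the user's perspective, fault-tolerance requirements are defined per component on the basis of reliability. From the provider's perspective, the best adaptation method is studied for the cases in which the provider's resources are sufficient and insufficient, and optimization theory is used to solve for the fault-tolerance service under that adaptation method. Experimental results show that the fault-tolerance services generated by the proposed adaptation method better meet the needs of both users and cloud fault-tolerance service providers.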

7.
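Exascale computer systems are so large that the total number of faults and anomalies grows accordingly, which makes diagnosis and discovery harder. A more accurate and efficient real-time maintenance and fault-diagnosis system is therefore urgently needed to perform comprehensive real-time detection of anomaly and fault information, fault diagnosis, and fault prediction for the hardware system. When diagnosing systems at the scale of tens of thousands of nodes, traditional fault-diagnosis systems suffer from low execution efficiency and a high false-alarm rate in anomaly detection, and their coverage of anomaly detection and fault diagnosis is insufficient. This paper studies techniques for anomaly and fault detection, fault diagnosis, and fault prediction, analyzes their principles and applicability, and, based on the practical engineering requirements of exascale high-performance computers, designs a maintenance and fault-diagnosis system that meets those requirements. A scalable edge-diagnosis architecture is designed around the structure of the maintenance system; fault detection, diagnosis, and prediction algorithms are obtained by combining knowledge of high-performance computer systems and expert knowledge with mathematical statistics and machine learning; and prediction models are built for specific scenarios. Experimental results show that the system scales well and can complete fault diagnosis for a system of one hundred thousand nodes within 10 s. Compared with traditional fault-diagnosis systems, the false-alarm rate of anomaly detection for one specific metric falls from 3.3% to almost 0, hardware fault-detection coverage rises from 90.2% to over 96%, and hardware fault-diagnosis coverage rises from 71% to about 94%. The system can also predict faults fairly accurately in several important application scenarios.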

8.
  Total citations: 10 (self-citations: 0, citations by others: 10)
Grid computing has emerged as an effective technology for coupling geographically distributed resources and solving large-scale computational problems in wide area networks. Fault tolerance is a significant and complex issue in grid computing systems. Various techniques have been investigated to detect and correct faults in distributed computing systems. Unreliable fault detection is one of the most effective techniques. Globus, a grid middleware, manages resources in a wide area network. The Globus fault detection service uses well-known techniques based on unreliable fault detectors to detect and report component failures. However, more powerful techniques are required to detect and correct both system-level and application-level faults in a grid system, and a convenient toolkit is also needed to maintain consistency in the grid. A fault-tolerant grid platform (FTGP) based on an unreliable fault detector and the Globus fault detection service is presented in this paper. The platform offers effective strategies in three aspects: grid key components, user tasks, and high-level applications.
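As a rough illustration of what an unreliable fault detector does, the sketch below suspects any node whose last heartbeat is older than a timeout. It is a minimal, generic example and not the Globus fault detection service or the FTGP implementation; the class name `HeartbeatDetector` and the parameter `timeout_s` are hypothetical.

```python
import time
from typing import Dict, List, Optional


class HeartbeatDetector:
    """Timeout-based failure detector (hypothetical sketch).

    It is "unreliable" in the usual sense: a slow but healthy node can be
    suspected wrongly, and the suspicion is withdrawn when a later
    heartbeat arrives.
    """

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Dict[str, float] = {}

    def heartbeat(self, node: str, now: Optional[float] = None) -> None:
        # Record the arrival time of the latest heartbeat from this node.
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspects(self, now: Optional[float] = None) -> List[str]:
        # A node is suspected when its last heartbeat is older than the timeout.
        current = time.monotonic() if now is None else now
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if current - seen > self.timeout_s
        )
```

The optional `now` argument exists only so the behaviour can be checked deterministically; in normal use the monotonic clock is read on each call.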

9.
    
This paper describes an approach to providing software fault tolerance for future deep‐space robotic National Aeronautics and Space Administration missions, which will require a high degree of autonomy supported by an enhanced on‐board computational capability. We focus on introspection‐based adaptive fault tolerance guided by the specific requirements of applications. Introspection supports monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, domain‐specific knowledge, or via the results of static or dynamic program analysis. This work is part of an on‐going project at the Jet Propulsion Laboratory in Pasadena, California. Copyright © 2011 John Wiley & Sons, Ltd.

10.
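Implementing software fault-tolerance techniques for hardware faults on COTS microprocessors offers great advantages over hardware fault tolerance in performance, cost, power consumption, and flexibility. Among these techniques, fault-tolerant compilation achieves fault tolerance by automatically inserting instructions at compile time. It is simple and efficient to implement, requires no rewriting of source code, reduces the burden on programmers, and makes it easy to reuse the large body of existing programs, which makes it one of the more active branches of software fault-tolerance research. Using the open-source GNU compiler GCC as a platform and drawing on existing fault-tolerant compilation algorithms, this paper discusses the design and implementation of a compiler with preliminary fault-tolerant compilation capability.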

11.
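Distributed computing technology offers an effective way to make full use of existing network resources. This paper describes the design and implementation of a distributed parallel computing system with open interfaces, aimed at solving hard problems in biological computing. The system is open, heterogeneous, fault-tolerant, and easy to use. Its fault-tolerance mechanism, checkpointing strategy, and task-scheduling algorithm are discussed. Validation on the Motif Finding problem shows that the distributed parallel computing mechanism can greatly shorten solution time and provides an effective approach to hard problems in computing.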

12.
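While cloud computing simplifies the way users access resources, it makes the development and deployment of the supporting systems more complex. Byzantine faults caused by software errors and by deployment and management mistakes have become a major factor affecting system reliability. For systems that satisfy the benign fault model for most of their operating life, the overhead of Byzantine fault-tolerant protocols in communication complexity, security, and other respects, together with their weak performance robustness under attack, limits their use in real systems. How to meet real systems' need for multiple fault models has become an important design problem. To address this, the paper designs Nova-BFT, a replicated state machine protocol that effectively supports multiple fault models: it sacrifices some peak throughput to meet the performance-robustness requirement of Byzantine fault tolerance, and uses configuration parameters to adapt to the performance requirements of benign faults. Experiments show that Nova-BFT achieves a throughput of 4~5 kop/s under the Byzantine fault model, while its support for the benign fault model effectively meets the needs of most practical applications.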
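To make the cost difference between the benign (crash) and Byzantine fault models concrete, the sketch below computes the textbook replica and quorum counts for a replicated state machine that tolerates f faulty replicas. These are the classical bounds, not figures taken from Nova-BFT, whose actual configuration parameters are not described in the abstract; the function names are hypothetical.

```python
def min_replicas(f: int, byzantine: bool) -> int:
    """Classical minimum replica count for tolerating f faulty replicas.

    Crash (benign) faults need 2f + 1 replicas; Byzantine faults need 3f + 1.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1 if byzantine else 2 * f + 1


def quorum_size(f: int, byzantine: bool) -> int:
    """Quorum size at the minimum replica count.

    A majority (f + 1 of 2f + 1) suffices for crash faults; Byzantine
    protocols use 2f + 1 of 3f + 1 so that any two quorums share at least
    one correct replica.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    return 2 * f + 1 if byzantine else f + 1
```

Tolerating a single fault therefore takes 3 replicas under the benign model but 4 under the Byzantine model, which is one source of the extra overhead the abstract attributes to Byzantine fault-tolerant protocols.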

13.
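High-performance computing clusters, as the "infrastructure" that underpins national scientific research, have been elevated to a national strategy. High-performance computing is widely used and is an indispensable tool, particularly in materials science research. Providing scientific computing users with a high-quality remote, visual, graphical high-performance computing platform has become a breakthrough point for current HPC service research. This paper proposes a new high-performance computing platform for materials science research. The system is developed in Java and serves users through a B/S (browser/server) architecture. It integrates mainstream materials science computing software with the platform, offers a user-friendly interface and convenient access, and combines OpenPBS with an optimized job-scheduling method to serve platform users' higher-priority computing demands and to provide computing resources more efficiently for materials research applications.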

14.
A Flexible Framework for Fault Tolerance in the Grid   Total citations: 2 (self-citations: 0, citations by others: 2)
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome the challenge by using a notification mechanism which is based on the interpretation of notification messages being delivered from the underlying Grid resources. The Grid-WFS built on top of FDS allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve the flexibility by the use of workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures.

15.
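A component application server framework is a particular form of distributed object platform and is required to be a highly reliable system. Reliability here refers mainly to two properties: fault tolerance and fault recovery. The main goal of this paper is to build a software fault-tolerance service framework for component application servers based on distributed objects. We address object fault tolerance with an approach called the object fault-tolerance service (OFS). The problems we handle include object failure, node errors, network partitioning, and unpredictable communication delays. This paper presents the OFS service specification and gives a system architecture for an OFS implementation.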

16.
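A survey of hardware fault-tolerance techniques for large-scale parallel computer systems   Total citations: 2 (self-citations: 0, citations by others: 2)
Fault tolerance in computer systems is a problem that cannot be ignored. In recent years, as system architectures have grown more complex, semiconductor manufacturing processes have advanced, line widths have shrunk, and integration density has increased, power consumption and reliability have become prominent problems everywhere from user desktop systems to distributed computing environments and large-scale parallel computer systems. This paper first introduces the basic concepts, methods, and ideas of computer system reliability and fault tolerance, and then reviews some representative hardware fault-detection and hardware fault-recovery techniques from recent years, with an emphasis on fault-tolerance methods proposed for large-scale parallel computer systems. It also describes an optimized fault-recovery technique proposed in our earlier research, called the fault-tolerant parallel algorithm. Finally, it summarizes some possible research directions.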

17.
    
Given the wide variety of cloud computing resources for creating high-performance computer clusters and their complex performance relationship with applications, finding the optimal, or near-optimal, cluster is a complex problem. As a result, several approaches have been proposed to find the optimal, or near-optimal, cluster for a given high-performance computing workload, while reducing the search cost. Among the approaches found in the literature, Bayesian optimization is one of the best known and most widely applied. However, it is still possible to increase its performance by integrating it with historical data related to workload behavior. In this context, we suggest the $\mathrm{PB}^3\mathrm{Opt}$ approach, which introduces a bias in the Bayesian optimization expected improvement acquisition function. The new acquisition function uses the ranking of computer clusters of previously explored workloads that have the same behavior as the workload being optimized. Our experimental results show that $\mathrm{PB}^3\mathrm{Opt}$ classifies the behavior of workloads in groups so that the average ranking has 88.7% similarity with the ranking of the workload. With this, $\mathrm{PB}^3\mathrm{Opt}$ finds, for almost 95% of workloads, a solution that is less than or equal to 1.2$\times$ worse than the optimal computer cluster. In addition, $\mathrm{PB}^3\mathrm{Opt}$ works well when combined with the paramount iterations technique and is capable of reducing the search cost significantly.

18.
One of the major goals in the design of parallel processing machines and algorithms is to achieve robustness and reduce the effects of the overhead introduced when a given problem is parallelized or a fault occurs. A key contributor to overhead is communication time, in particular when a node is faulty and another node is substituting for its operation. Many architectures try to reduce this overhead by minimizing the actual time for a communication to occur, including latency and bandwidth figures. Another approach is to hide communication by overlapping it with computation, assuming that the computation is the most prominent factor. This paper presents the mechanisms provided in the Proteus parallel computer and its effective use of communication hiding through overlapping communication/computation techniques with and without the presence of a fault. These techniques are easily extended for use in compiler support of parallel programming. We also address the complexity (or rather simplicity) in achieving complete exchange on the Proteus Machine.

19.
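Cloud computing is a computing technology that is centered on the Internet, built on open standards and services, and provides secure and reliable computing services. Focusing on this particular cloud environment, this paper examines how data is aggregated in it and, in view of the unexpected situations that may arise during data aggregation, analyzes the approaches the cloud computing environment takes to data fault tolerance.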

20.
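To address the problem of verifying and evaluating rollback-recovery fault tolerance in distributed computing systems, this paper designs a simulation mechanism for rollback-recovery fault-tolerance algorithms. Following the overall architecture of rollback-recovery fault tolerance in distributed computing systems, it models the task processes of nodes as discrete events, adds a module supporting multi-task rollback-recovery simulation to the application layer of a network system simulation tool, and designs the structure, functional modules, and system parameter settings used for the simulation. The results show that the proposed simulation mechanism can simulate and verify rollback-recovery fault-tolerance algorithms for distributed computing systems, and provides a reference for comparing, improving, and optimizing different fault-tolerance algorithms.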
