1.
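Jia Jia, 《计算机工程与科学》 (Computer Engineering & Science), 2011, 33(11)
Application-level checkpointing is the most widely used and mature fault-tolerance technique on homogeneous systems, but its application to heterogeneous systems is still in its infancy, and there is not yet a rigorous, well-founded implementation scheme and configuration method tailored to the architecture and fault model of heterogeneous systems. To address this, based on the architecture and programming model of CUDA heterogeneous systems, this paper analyzes how CUDA programs execute on the CPU and GPU, proposes an asynchronous execution mechanism for application-level checkpointing on heterogeneous systems, and, building on this mechanism, discusses the problem of optimal checkpoint placement on heterogeneous systems and designs an optimization scheme. Finally, three case studies on the CUDA platform verify the feasibility and practicality of the technique and evaluate its performance. The results show that the asynchronous application-level checkpointing mechanism for CPU-GPU heterogeneous systems is effective and, compared with a checkpointing mechanism in which the CPU and GPU execute synchronously, is more flexible to configure and leaves more room for optimization. The checkpoint placement optimization method proposed on top of this mechanism also effectively reduces checkpointing overhead, yielding better fault-tolerance performance.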
2.
The ftIO-system provides portable and fault-tolerant file-I/O by enhancing the functionality of the ANSI C file system without changing its application programmer interface and without depending on system-specific implementations of the standard file operations. The ftIO-system is an extension of the porch compiler and its runtime system. The porch compiler automatically generates code to save bookkeeping information about ftIO's transactional file operations in portable checkpoints. These portable checkpoints can be recovered on a binary incompatible architecture. We developed a new algorithm for supporting transactional file operations in ftIO. Rather than using the well-known two-phase commit protocol, this algorithm uses only a single bit of information and an atomic rename file operation to guarantee fault tolerance. In this paper, we describe our new ftIO algorithm, discuss design choices for ftIO, and provide experimental data of our ftIO prototype.
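The abstract above relies on an atomic rename as the commit point of a file transaction. As a rough illustration of that general idea only (not the ftIO algorithm itself, whose single-bit bookkeeping is not described here), the following Python sketch writes a new version to a temporary file and commits it with one rename. The function name `atomic_write` is hypothetical.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers see the old or the new content, never a mix.

    Hypothetical helper: a generic write-then-rename pattern, not ftIO's algorithm.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to stable storage before committing
        os.replace(tmp_path, path)  # commit point: a single rename
    except BaseException:
        # Roll back: discard the uncommitted temporary file if it still exists.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process fails before `os.replace`, the original file is untouched; if it fails afterwards, the new version is fully in place. That all-or-nothing behaviour is what makes a rename usable as a commit step.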
3.
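A Coordinated Checkpointing Method Supporting Distributed Cooperative Real-Time Transaction Processing (total citations: 1; self-citations: 0; citations by others: 1)
During real-time transaction execution, transaction failures or data contention can cause transactions to restart. To reduce the work lost to restarts, checkpointing can be used to preserve the timing correctness of transactions. In one class of distributed real-time database applications, transactions at different nodes form cooperative relationships by exchanging messages; to guarantee global consistency among cooperating transactions, when one transaction takes a checkpoint, the related transactions must take checkpoints as well. Traditional coordinated checkpointing methods do not consider the timing constraints of the application and therefore do not support distributed cooperative real-time transaction processing well. This paper proposes a graph-theoretic coordinated checkpointing method: using a local directed graph maintained at each computing node for each set of cooperating transactions, a graph-based computation identifies the transactions that should take checkpoints. The method has the minimal coordinated checkpointing property and also minimizes the latency of the global checkpoint. Experiments show that the algorithm reduces global checkpoint latency, which helps real-time transactions meet their deadlines.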
4.
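Key Techniques for Coordinated Checkpointing Based on PVM (total citations: 1; self-citations: 0; citations by others: 1)
This paper discusses the key techniques in the design and implementation of a rollback-recovery system for PVM-based parallel programs: a mechanism for exiting and rejoining PVM, implicit mapping of task identifiers, synchronization before task termination, prevention of re-entry into the PVM library, coordinated triggering by signals and messages, initialization of application tasks, and the function wrapping and renaming mechanism on which all of the above are built. These techniques have been successfully applied in the ChaRM system we developed independently, which demonstrates their correctness and effectiveness.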
5.
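A Real-Time Fault-Tolerant Scheduling Algorithm for Heterogeneous Distributed Systems (total citations: 26; self-citations: 1; citations by others: 26)
The real-time fault-tolerant scheduling algorithms studied in the literature to date all assume a homogeneous distributed system in which all processors are identical. This paper first establishes a real-time fault-tolerant scheduling model for heterogeneous distributed systems, in which the processors all differ from one another. Based on this model, the paper introduces the concept of reliability cost and proposes two static real-time fault-tolerant scheduling algorithms (RTFTNO and RTFTRC) for scheduling periodic real-time fault-tolerant tasks. RTFTRC tries to minimize the system's reliability cost when scheduling tasks, whereas RTFTNO does not take reliability cost into account. The paper discusses the performance of the two algorithms in detail. Simulation experiments compare the reliability cost, timeout ratio, and schedulability of the two algorithms, and study the relationship between task computation time and reliability cost and between the schedule-length threshold and the minimum number of processors. The experimental results show that RTFTRC outperforms RTFTNO.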
6.
《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518
Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes the minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.
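To make the notion of a forced checkpoint concrete, here is a minimal Python sketch of the classic index-based rule (commonly attributed to Briatico, Ciuffoletti, and Simoncini): each message carries the sender's checkpoint index, and a receiver that is behind takes a forced checkpoint before delivering the message. This is a baseline illustration under that assumption, not the algorithm proposed in the paper above; the `Process` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Hypothetical process model for index-based communication-induced checkpointing."""
    name: str
    sn: int = 0  # index of the latest local checkpoint
    checkpoints: list = field(default_factory=list)  # (index, kind) pairs

    def basic_checkpoint(self) -> None:
        """Checkpoint taken independently, on the process's own schedule."""
        self.sn += 1
        self.checkpoints.append((self.sn, "basic"))

    def send(self) -> int:
        """Return the checkpoint index to piggyback on an outgoing message."""
        return self.sn

    def receive(self, piggybacked_sn: int) -> None:
        """Take a forced checkpoint before delivery if the sender's index is ahead."""
        if piggybacked_sn > self.sn:
            self.sn = piggybacked_sn
            self.checkpoints.append((self.sn, "forced"))


p, q = Process("p"), Process("q")
p.basic_checkpoint()   # p moves to index 1 on its own
q.receive(p.send())    # q is behind, so it is forced to checkpoint
q.receive(p.send())    # q has caught up, so no further checkpoint is taken
```

Heuristics of the kind the abstract mentions aim to skip some of these forced checkpoints when extra piggybacked information shows they are unnecessary.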
7.
Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems that employ checkpointing. In roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes a sample comparison approach alongside checkpointing, which further improves performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.
8.
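Fault-Tolerant PVM with Asynchronous Checkpointing (total citations: 1; self-citations: 0; citations by others: 1)
Computing environments typified by workstation clusters are a current research focus in distributed systems and parallel computing, and the message-passing mechanism provided by PVM supports efficient heterogeneous network computing. Standard PVM, however, lacks support for fault tolerance, which can be remedied with checkpoint-based rollback recovery. This paper analyzes the design ideas and implementation techniques for providing global fault tolerance for PVM at user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled through the PVM daemons and a global scheduler process; all operations are transparent to the application. The system can also be extended to provide transparent process migration and load balancing for PVM.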
9.
Best-Case Response Times and Jitter Analysis of Real-Time Tasks (total citations: 6; self-citations: 0; citations by others: 6)
In this paper, we present a simple recursive equation and an iterative procedure to determine the best-case response times of periodic tasks under fixed-priority preemptive scheduling and arbitrary phasings. The approach is similar in nature to the one used to determine worst-case response times (Joseph and Pandya, 1986) in the sense that where a critical instant is considered to determine the latter, we base our analysis on an optimal instant. Such an optimal instant occurs when all higher priority tasks have a simultaneous release that coincides with the completion of an execution of the task under consideration. The resulting recursive equation closely resembles the one for worst-case response times. The iterative procedure is illustrated by means of a small example. Next, we apply the best-case response times to analyze jitter in distributed multiprocessor systems. To this end, we discuss the effect of the best-case response times on completion jitter, as well as the effect of release jitter on the best-case response times. The newly derived best-case response times generally result in tighter bounds on jitter, in turn leading to tighter worst-case response time bounds.
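The two recursions can be sketched in Python as follows. The worst-case iteration is the standard Joseph and Pandya recurrence R = C_i + Σ ⌈R/T_j⌉·C_j over higher-priority tasks j. The best-case form used here, R = C_i + Σ max(⌈R/T_j⌉ − 1, 0)·C_j iterated downward from the worst-case value, is an assumption about the shape of the equation the abstract describes, not a quotation from the paper. The sketch further assumes integer times, constant execution times, deadlines equal to periods, and a task list sorted from highest to lowest priority; all function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def worst_case_response(tasks, i):
    """Iterate upward from C_i to the smallest fixed point (critical instant analysis).

    `tasks` is a list of (C, T) pairs, highest priority first.
    """
    c_i, t_i = tasks[i]
    r = c_i
    while True:
        nxt = c_i + sum(ceil_div(r, t) * c for c, t in tasks[:i])
        if nxt == r:
            return r
        if nxt > t_i:  # assumed deadline = period; stop instead of looping forever
            raise ValueError("task misses its deadline")
        r = nxt


def best_case_response(tasks, i, start):
    """Iterate downward from `start` (e.g. the worst-case value) to a fixed point."""
    c_i, _ = tasks[i]
    r = start
    while True:
        nxt = c_i + sum(max(ceil_div(r, t) - 1, 0) * c for c, t in tasks[:i])
        if nxt == r:
            return r
        r = nxt


# Hypothetical task set: (execution time, period), highest priority first.
tasks = [(1, 4), (2, 6), (3, 12)]
wr = worst_case_response(tasks, 2)
br = best_case_response(tasks, 2, wr)
```

For this task set the lowest-priority task's worst-case iteration runs 3, 6, 7, 9, 10 and the best-case iteration runs 10, 7, 6, 4, 3, so the response time lies between 3 and 10 under the stated assumptions.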
10.
11.
This paper presents an efficient, writer-based logging scheme for recoverable distributed shared memory systems, in which logging of a data item is performed by its writer process, instead of every process that accesses the item logging it. Since the writer process maintains the log of data items, volatile storage can be used for logging. Only the readers' access information needs to be logged into the stable storage of the writer process to tolerate multiple failures. Moreover, to reduce the frequency of stable logging, only the data items accessed by multiple processes are logged with their access information when the items are invalidated, and also semantic-based optimization in logging is considered. Compared with the earlier schemes in which stable logging was performed whenever a new data item was accessed or written by a process, the size of the log and the logging frequency can be significantly reduced in the proposed scheme.
12.
For distributed databases, checkpointing is used to ensure an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent. Most current checkpointing approaches for distributed databases are too expensive during run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn causes longer response time for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading the system performance. This paper presents a low-cost solution, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC), to these problems. LSLFC supports global reconstruction, and our performance study shows that LSLFC has little overhead during run time.
13.
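Application-level checkpointing is a fault-tolerance technique that has attracted considerable attention in large-scale scientific computing. However, it requires users to decide which critical data must be saved, which adds to their burden. This paper introduces ALEC, a source-to-source precompiler based on live-variable analysis of MPI parallel programs that can assist application-level checkpointing. Five Fortran/MPI applications compiled with ALEC were evaluated on a 512-processor cluster. The results show that ALEC effectively reduces checkpoint size and the overhead of saving and restoring application-level checkpoints.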
14.
Lukasz Ziarek Philip Schatz Suresh Jagannathan 《Electronic Notes in Theoretical Computer Science》2007,174(9):85
Transient faults that arise in large-scale software systems can often be repaired by re-executing the code in which they occur. Ascribing a meaningful semantics for safe re-execution in multi-threaded code is not obvious, however. For a thread to correctly re-execute a region of code, it must ensure that all other threads which have witnessed its unwanted effects within that region are also reverted to a meaningful earlier state. If not done properly, data inconsistencies and other undesirable behavior may result. However, automatically determining what constitutes a consistent global checkpoint is not straightforward since thread interactions are a dynamic property of the program. In this paper, we present a safe and efficient checkpointing mechanism for Concurrent ML (CML) that can be used to recover from transient faults. We introduce a new linguistic abstraction called stabilizers that permits the specification of per-thread monitors and the restoration of globally consistent checkpoints. Global states are computed through lightweight monitoring of communication events among threads (e.g. message-passing operations or updates to shared variables). Our checkpointing abstraction provides atomicity and isolation guarantees during state restoration ensuring restored global states are safe. Our experimental results on several realistic, multithreaded, server-style CML applications, including a web server and a windowing toolkit, show that the overheads to use stabilizers are small, and lead us to conclude that they are a viable mechanism for defining safe checkpoints in concurrent functional programs. Our experiments conclude with a case study illustrating how to build open nested transactions from our checkpointing mechanism.
15.
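Restoring File System State on Program Rollback in Checkpointing Algorithms for Distributed Systems (total citations: 3; self-citations: 0; citations by others: 3)
Checkpointing, also known as "rollback recovery", is an important means of software fault tolerance. It is mainly used to save and restore the running state of a program and plays a very important role in distributed and parallel computing systems. With the aim of reducing checkpointing overhead, this paper analyzes, discusses, and further investigates the problem of restoring file system state when a program rolls back under checkpointing algorithms for distributed systems.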
16.
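Analysis and Review of the Real-Time CORBA Specification (total citations: 2; self-citations: 0; citations by others: 2)
Real-time systems are an extremely widely used class of systems, but the general-purpose CORBA specification offers insufficient support for real-time applications, so the OMG developed the Real-Time CORBA 1.0 specification. The specification supports fixed-priority real-time CORBA applications and provides end-to-end predictability for object invocations in such applications. This article describes in detail the policies and mechanisms that the Real-Time CORBA specification defines for controlling and managing system resources, and analyzes and comments on them.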
17.
Ahn Jinho Min Sung-Gi Hwang Chong-Sun Yu Heonchang 《The Journal of supercomputing》2002,22(2):175-196
This paper presents three garbage collection schemes for causal message logging with independent checkpointing. The first scheme allows each process to autonomously remove useless log information in its volatile storage by piggybacking only some additional information, without requiring any extra message or forced checkpoint. Additionally, it supports faster output commit than traditional schemes. The second scheme enables each process to remove a part of log information in the storage if more empty space is required. It reduces the number of processes participating in the garbage collection by using the size of the log information of each process. The third scheme is a hybrid scheme having the advantages of the two proposed schemes. Simulation results show that the third scheme significantly reduces the garbage collection overhead compared with the traditional schemes regardless of specific communication patterns of distributed applications.
18.
Communication-Induced Checkpointing (CIC) protocols are classified into two categories in the literature: Index-based and Model-based. In this paper, we discuss two data structures being used in these two kinds of CIC protocols, and their different roles in helping the checkpointing algorithms to enforce Z-cycle Free (ZCF) property. Then, we present our Fully Informed aNd Efficient (FINE) communication-induced checkpointing algorithm, which not only has less checkpointing overhead than the well-known Fully Informed (FI) CIC protocol proposed by Helary et al. but also has less message overhead. Performance evaluation indicates that our protocol performs better than many of the other existing CIC protocols.
19.
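Fault-Tolerant Control of Nonlinear Dynamic Systems (total citations: 4; self-citations: 0; citations by others: 4)
This paper first gives an overview of the state of the art in fault-tolerant control of nonlinear dynamic systems. It then surveys several fault-tolerant control techniques for nonlinear systems by category, focusing on active fault-tolerant control methods based on artificial intelligence and parameter estimation and on passive nonlinear fault-tolerant control design based on the Hamilton-Jacobi equation. For the other methods, it briefly discusses their scope of applicability, advantages, and drawbacks. Finally, it examines the open problems in the field and possible research directions.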
20.
Low-Power Design for Real-Time Systems (total citations: 1; self-citations: 0; citations by others: 1)
Real-time systems are often deployed in special environments where power consumption is a major concern. In the presence of timing constraints, low-power design for real-time systems has a significant impact on the performance as well as the schedulability of the systems. System developers face the challenge of reducing power consumption while meeting the timing constraints of real-time systems. This paper represents one of the few attempts to address the issue of low-power design for real-time systems. We present two power reduction methods: one at the software compilation level and the other at the operating system level. Given a real-time program, an inter-instruction power reduction technique is proposed to transform the program into another one with lower power consumption. In addition, a scheduling algorithm for real-time operating systems is proposed to reschedule real-time programs when the execution time of the programs changes. The proposed scheduling algorithm thus works together with the proposed power reduction technique to make sure all programs meet their deadlines and to improve system schedulability. We also evaluate the performance of the proposed inter-instruction reduction method by comparing it with the cold scheduling algorithm and show that the proposed method outperforms the cold scheduling algorithm and saves more energy.