1.

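Ding Jun, Tong Weiqin 《Mini-Micro Systems (小型微型计算机系统)》2002,23(6):731-735

Workstation clusters have become one of the mainstream directions in distributed parallel processing. As cluster applications broaden and cluster sizes grow, the demands on their reliability keep rising, and designing a highly reliable cluster requires a close study of system-level fault tolerance. This paper describes parallel heterogeneous rollback recovery and checkpoint derivation, which provide transparent, portable fault tolerance and load balancing and form a globally consistent state without having to adjust checkpoints. This gives BSP applications autonomous fault tolerance and also lets them migrate between clusters to keep the system load balanced. The paper focuses on checkpoint setting, checkpoint derivation, rollback, and process migration.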

2.

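A checkpoint-based rollback recovery and process migration system  Total citations: 14; self-citations: 2; citations by others: 12

Wang Dongsheng, Shen Meiming, Zheng Weimin, Pei Dan 《Journal of Software (软件学报)》1999,10(1):68-73

ChaRM is a backward fault-recovery and process-migration system for parallel programs. It recovers from transient faults in workstation clusters and, by combining mirror storage at checkpoint time with process migration, also recovers from permanent node failures. It further supports online maintenance of system software and hardware, exclusive or time-limited use of processor resources, and dynamic load balancing. The paper describes how ChaRM implements checkpointing, rollback recovery, and process migration, and reports some performance measurements.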

3.

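Research on process migration in cluster systems

Cao Wei, Wang Lei 《Modern Computer (现代计算机)》2007,(9):9-11,15

Process migration is important for dynamic load balancing, fault tolerance, and system management in cluster systems. The paper explains the purpose and general steps of process migration in distributed systems, compares the process migration mechanisms of two typical cluster operating systems, Kerrighed and openMosix, and analyzes their effect on system performance.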

4.

Research on process migration  Total citations: 1; self-citations: 0; citations by others: 1

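Pang Yilin, Jiang Cuiling 《Computer Engineering & Science (计算机工程与科学)》2001,23(5):47-50

Applying process migration in distributed systems improves load balancing and enables efficient fault tolerance. This paper introduces an algorithm for migrating process state, along with checkpoint setting and techniques for migrating other state.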

5.

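Saving and restoring System V shared memory under Linux

Yang Shengchun, Dai Zheng, Fang Lei 《Computer & Digital Engineering (计算机与数字工程)》2005,33(9):125-128

Checkpointing is an important technique for saving and restoring the running state of a process, and a key means of implementing fault tolerance, rollback debugging, and process migration. This paper examines how the fully transparent checkpointing system Epckpt handles System V shared memory and where it falls short, and presents the authors' improvements, which save and restore System V shared memory more effectively.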

6.

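Fault tolerance and recovery in cluster systems

Ding Jun, Tong Weiqin 《Journal of Computer Applications (计算机应用)》2001,21(6):90-92

Workstation clusters have become one of the mainstream directions in distributed parallel processing. As cluster applications broaden and cluster sizes grow, the demands on their reliability keep rising, and designing a highly reliable cluster requires a close study of system-level fault tolerance. This paper discusses fault tolerance and recovery for processes in distributed Linux cluster systems, focusing on the key user-level techniques of checkpoint setting, rollback, and process migration.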

7.

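Fault-tolerant PVM with asynchronous checkpointing  Total citations: 1; self-citations: 0; citations by others: 1

Yu Yang, Lu Xinda 《Computer Engineering and Applications (计算机工程与应用)》1999,35(11):34-37

Computing environments typified by workstation clusters are a current research focus in distributed systems and parallel computing, and the message-passing mechanism provided by PVM supports efficient heterogeneous network computing. Standard PVM, however, lacks support for fault tolerance, which checkpoint-based rollback recovery can supply. This paper analyzes the design and implementation techniques for providing global fault tolerance for PVM at user level. The main idea is to use an asynchronous checkpointing algorithm with message logging, controlled by the PVM daemons and a global scheduler process, with all operations transparent to the application. The system can also be extended to provide transparent process migration and load balancing for PVM.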

8.

Analysis of checkpoint rollback-recovery protocols in cluster systems  Total citations: 2; self-citations: 0; citations by others: 2

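Zhang Yi, Hu Jianping 《Computer Engineering & Science (计算机工程与科学)》2001,23(5):66-69

As a software fault-tolerance mechanism, checkpointing meets the fault-tolerance requirements of cluster systems well. This paper analyzes the various classes of checkpoint rollback-recovery protocols in detail and compares their performance and characteristics.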

9.

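A dynamic load balancing system based on remaining computing capacity

Zhang Wenchang, Xia Xuezhi 《Computer & Digital Engineering (计算机与数字工程)》2010,38(9):135-139

The dynamic load balancing system based on remaining computing capacity is built on a new type of load vector. It uses a new load metric, remaining computing capacity, which accounts for both a node's resource usage and its inherent performance characteristics. The metric better reflects the processing capability of the cluster and the load it is currently handling, and is more flexible and accurate than other commonly used load vectors. The system also combines task scheduling with process migration to balance load more effectively while reducing the extra overhead that load balancing introduces.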

10.

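Research and implementation of database-based cluster checkpointing  Total citations: 1; self-citations: 0; citations by others: 1

Wu Jianfeng, Ge Yi, Li Sanli 《Mini-Micro Systems (小型微型计算机系统)》2002,23(3):257-261

This paper proposes a new application-level checkpointing scheme for clusters that differs from existing schemes in two ways. First, a relational database system replaces files as the store for the cluster's checkpoints, management data, resource status, and similar information, which makes the data easier to index and normalize; when the data volume is very large, database access is also faster than file-system access. Second, a dedicated server is used so that checkpointing and related operations interfere as little as possible with the cluster's own computation, and this dedicated management server is mirrored for fault tolerance, which is better in cost and efficiency than mirroring every compute node.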

11.

The performance of cache-based error recovery in multiprocessors

Janssens B. Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1033-1043

Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.

12.

A method for reducing the checkpoint overhead of parallel programs

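Zhou Xiaocheng, Sun Ninghui, Huo Zhigang, Ma Jie 《Computer Engineering (计算机工程)》2007,33(12):84-86

Checkpointing and rollback recovery are effective ways to improve system reliability and implement fault-tolerant computing. Their performance is usually evaluated by the overhead ratio, and checkpoint overhead is the main factor affecting it. Because parallel programs currently spend considerable time blocked on communication, this paper proposes a method, built on copy-on-write checkpoint buffering, that further reduces checkpoint overhead. By controlling the scheduling of the state-saving thread and choosing a suitable state-saving granularity, the method uses communication blocking time to hide the overhead of running the state-saving thread, further lowering the overhead ratio.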

13.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes the minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.

14.

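File checkpointing based on virtual file operations  Total citations: 1; self-citations: 0; citations by others: 1

Liu Shaofeng, Wang Dongsheng, Zhu Jing 《Journal of Software (软件学报)》2002,13(8):1528-1533

Single-process checkpointing and rollback recovery are the foundation of fault tolerance in distributed and parallel systems, and saving and restoring the state of active files is an important part of that technique. The paper proposes a virtual file operation strategy that checkpoints user files and effectively resolves the inconsistency between user file contents and the global process state after a failure. Using block-based file management and distributed checkpoint operations, the method outperforms other approaches in space overhead, normal running time, and recovery time; it is also transparent to the user and preserves as much completed work as possible.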

15.

Error recovery in shared memory multiprocessors using private caches

Wu K.-L. Fuchs W.K. Patel J.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):231-240

The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.

16.

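Process checkpointing and rollback recovery under Windows NT  Total citations: 6; self-citations: 0; citations by others: 6

Zhang Youhui, Wang Dongsheng, Zheng Weimin 《Journal of Computer Research and Development (计算机研究与发展)》2001,38(1):50-55

The paper describes checkpointing and rollback recovery mechanisms for applications under Windows NT and introduces WinNTCkpt, a checkpointing and recovery tool the authors designed and implemented. WinNTCkpt uses standard Windows API functions and performs checkpointing and rollback recovery through dynamic code injection and wrapping of system calls. Compared with similar tools, WinNTCkpt requires no changes to application source code and no recompiling or relinking of the application, and it supports checkpointing and rollback recovery of user file contents. WinNTCkpt is the core of a high-availability cluster computing environment under development, and is also the technical basis for process migration and load balancing in a cluster environment.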

17.

An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415

Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed and at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint – a snapshot of the whole workflow enactment state at normal execution (without failures) – has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy accomplishes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri nets which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability.

18.

Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system

Jiang Wu Bhavani Thuraisingham 《Information Sciences》2009,179(20):3659-3672

Checkpointing and rollback recovery are well-known techniques for handling failures in distributed systems. The issues related to the design and implementation of efficient checkpointing and recovery techniques for distributed systems have been thoroughly understood. For example, the necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint have been established for distributed computations. In this paper, we address the analogous question for distributed database systems. In distributed database systems, transaction-consistent global checkpoints are useful not only for recovery from failure but also for audit purposes. If each data item of a distributed database is checkpointed independently by a separate transaction, none of the checkpoints taken may be part of any transaction-consistent global checkpoint. However, allowing individual data items to be checkpointed independently results in non-intrusive checkpointing. In this paper, we establish the necessary and sufficient conditions for the checkpoints of a set of data items to be part of a transaction-consistent global checkpoint of the distributed database. Such conditions can also help in the design and implementation of non-intrusive checkpointing algorithms for distributed database systems.

19.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan Q. Jiang Jianchang Yang M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

20.

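Research on MPI fault-tolerance mechanisms

Cui Liqing, Xu Weimin 《Computer Engineering (计算机工程)》2004,30(16):88-90

MPI is a parallel program development environment widely used on cluster systems, and MPI fault tolerance is key to cluster reliability. This paper discusses fault tolerance in the MPI standard and, combining coordinated checkpointing with synchronous rollback, designs MPIChaRR, a checkpoint-based rollback recovery system. The system runs on Linux clusters, and recovery from node failures during the execution of MPICH applications is transparent to the user.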
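
Several of the abstracts above, the last one in particular, rely on coordinated checkpointing followed by a synchronous rollback. As a rough illustration only, the toy Python sketch below shows the general idea: a coordinator commits a global checkpoint once every process has saved its state, and after a simulated failure all processes return to that same checkpoint. The names `Process`, `Coordinator`, and `run_demo` are hypothetical and are not taken from MPIChaRR, ChaRM, or any other system listed here. The sketch also assumes there are no messages in flight, which a real protocol would have to handle.

```python
import copy


class Process:
    """A toy process whose entire state is one dictionary (an assumption)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "total": 0}

    def compute(self):
        # One unit of application work.
        self.state["step"] += 1
        self.state["total"] += self.rank * self.state["step"]


class Coordinator:
    """Hypothetical coordinator that checkpoints and rolls back all processes together."""

    def __init__(self, processes):
        self.processes = processes
        self.committed = None  # last complete global checkpoint, if any

    def checkpoint(self):
        # Phase 1: every process saves a tentative copy of its state.
        tentative = {p.rank: copy.deepcopy(p.state) for p in self.processes}
        # Phase 2: commit only after all tentative copies exist, so the
        # committed set always describes a single global state.
        self.committed = tentative

    def rollback(self):
        # Synchronous rollback: every process returns to the same checkpoint.
        if self.committed is None:
            raise RuntimeError("no committed checkpoint to roll back to")
        for p in self.processes:
            p.state = copy.deepcopy(self.committed[p.rank])


def run_demo():
    procs = [Process(rank) for rank in range(3)]
    coord = Coordinator(procs)
    for _ in range(2):
        for p in procs:
            p.compute()
    coord.checkpoint()
    for p in procs:
        p.compute()   # work done after the checkpoint is lost on failure
    coord.rollback()  # simulated node failure triggers a global rollback
    return [p.state for p in procs]
```

Calling `run_demo()` returns the per-process states as they were at the checkpoint, with the third round of work discarded. Committing in a separate second phase is the design choice that keeps a partially written checkpoint from ever being used for recovery.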