1.

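Implementation Techniques and Performance Testing of an MPI Checkpointing System Based on the Lustre File System (total citations: 1; self-citations: 0; citations by others: 1)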

谢旻 卢宇彤 周恩强 曹宏嘉 杨学军 《计算机研究与发展》2007,44(10):1709-1716

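Rollback recovery based on coordinated checkpointing is an important fault-tolerance technique adopted in large-scale parallel computer systems; its performance overhead is determined mainly by the coordination protocol and by checkpoint image storage. This paper describes an application-transparent parallel checkpointing system implemented in MPICH2. Compared with existing techniques, the system has the following features: 1) the coordination protocol exploits the nearest-neighbor communication pattern of parallel applications and reduces protocol processing overhead through a virtual-connection method; 2) the Lustre file system is used to simplify the management of checkpoint image files; 3) parallel I/O operations improve performance and optimize the storing of checkpoint images. Tests with real applications show that the checkpointing system has low runtime overhead and good scalability.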

2.

Cache-Style Parallel Checkpointing for Large-Scale Computing Systems

刘勇燕 刘勇鹏 冯华 迟万庆 《计算机科学》2011,38(5):287-289,305

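Checkpointing is an important fault-tolerance mechanism in high-performance parallel computing systems. As system scale grows, the scalability of parallel checkpointing is constrained by file access. Targeting the multi-level file system structure of large-scale parallel computing systems, the paper proposes a cache-style parallel checkpointing technique. It converts globally synchronized parallel checkpointing into local file operations and uses the multiprocessor architecture to schedule write-back in an out-of-order, pipelined manner, so that checkpoint write-back times are spread out sensibly. This effectively hides the write-back overhead and ensures high performance and high scalability for parallel checkpoint file access.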

3.

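Research on a Shared-Memory-Based Checkpointing Mechanism for Cluster Services (total citations: 1; self-citations: 0; citations by others: 1)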

梁毅 王磊 樊建平 方娟 《计算机研究与发展》2010,47(4)

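To address the high system cost and long recovery time of existing cluster service checkpointing based on stable storage, the paper proposes a cluster service checkpointing mechanism based on shared memory. It designs a checkpoint information management protocol for a primary-backup storage model of checkpoint information held in shared memory, which ensures the consistency of cluster service checkpoint information. It also designs a checkpoint group management protocol based on a unidirectional logical ring, which ensures a consistent membership view among the checkpoint processes in the logical backup ring. Performance experiments show that the mechanism achieves good read and write performance for checkpoint information, that the group management protocol has low system overhead, and that the mechanism meets the checkpointing needs of cluster services well.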

4.

Design and Implementation of a Linux-Based Checkpointing System

覃颖 杨中志 《计算机与数字工程》2004,32(4):6-9

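Checkpointing has important applications in modern parallel and distributed computing. This paper presents the design and implementation of a Linux-based checkpointing system, which is significant for research on system fault tolerance, process migration, and dynamic load balancing.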

5.

Superstep-Induced Rollback Recovery

丁俊 童维勤 《小型微型计算机系统》2002,23(6):731-735

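Workstation clusters have become one of the mainstream directions in distributed parallel processing. As the application areas of cluster systems expand and their scale grows, the demands on their reliability keep rising. Designing a highly reliable cluster system requires a focus on system fault-tolerance techniques. This paper describes parallel heterogeneous rollback recovery and checkpoint derivation, which provide transparent, portable fault tolerance and load balancing, and which form a globally consistent state without the need to adjust checkpoints. BSP applications thereby gain autonomous fault tolerance and can also migrate between clusters, keeping the system load balanced. The paper focuses on checkpoint setting, checkpoint derivation, rollback, and process migration.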

6.

A Parallel Communication Protocol Supported by Intelligent Network Interface Cards

林基 周小成 孟丹 《计算机研究与发展》2005,42(6):971-978

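The network communication system is an important component of a cluster and a key factor in the cluster's overall processing efficiency. As the computing power of individual nodes increases, network communication capability must improve accordingly. One way to improve it is to send messages over multiple network interface cards (NICs) simultaneously, i.e., parallel communication. Parallel communication is usually implemented on top of an RMA mechanism; for messages smaller than 17 KB, the handshake required by the RMA mechanism limits the performance gain from parallel communication. The paper proposes a parallel communication protocol supported by intelligent NICs. The protocol moves the handshake needed for message reassembly down onto the NIC, which reduces handshake overhead and widens the range of message sizes that benefit from parallel communication. Experimental data show that, compared with the RMA-based parallel protocol, the proposed protocol improves communication performance for messages of 3 KB to 17 KB. For applications such as the FT program, it reduces execution time by 9.4%, whereas the RMA-based parallel protocol reduces it by only 7.8%. Finally, the paper analyzes the main factors that limit further improvement of parallel communication performance.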

7.

A Cluster-Based Distributed Data Acquisition System (total citations: 2; self-citations: 0; citations by others: 2)

张淑英 刘淑芬 包铁 《计算机工程》2006,32(14):46-48

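The paper proposes a cluster-based distributed data acquisition system that uses a multi-layer structure consisting of a task scheduling center, cluster management centers, and acquisition agents. It also proposes three cluster allocation strategies, a mechanism for selecting the management center, and a design for the acquisition agents. Experimental results show that the method is suitable for data acquisition in large-scale networks, increases acquisition speed, and resolves the excessive data transmission load and the data processing bottleneck that traditional acquisition methods face in large-scale networks. The authors conclude that it has good application prospects in the data acquisition field.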

8.

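A Checkpointing System Suited to Parallel Computing on Large-Scale Clusters (total citations: 4; self-citations: 1; citations by others: 4)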

周恩强 卢宇彤 沈志宇 《计算机研究与发展》2005,42(6):987-992

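Distributed checkpointing systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two major bottlenecks that limit the scalability of parallel checkpointing systems. Based on the execution characteristics of parallel applications and the architecture of high-performance clusters, the C system uses dynamic virtual connections to reduce protocol overhead and distributed storage of checkpoint images to reduce storage overhead, which lowers the cost of coordinated checkpointing and improves the scalability of the checkpointing system. Preliminary test results show that the design strategy of the C system suits fault tolerance for large-scale parallel computing.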

9.

JIACKPT: A Recoverable Software DSM System

章隆兵 张福新 胡伟武 唐志敏 《软件学报》2005,16(2):165-173

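Software DSM (distributed shared memory) systems build a shared-memory programming environment on clusters, combining the ease of programming of shared memory with the scalability of clusters, and have attracted wide research interest. Because a software DSM system is a distributed system with a high risk of failure, fault tolerance is needed to make it practical. Using user-level checkpointing, and building on JIAJIA, a software DSM system that supports the scope consistency memory model, the authors designed and implemented JIACKPT (JIAjia with ChecKPoinTing), a recoverable and highly portable software DSM system. Because it adopts a strong global consistent state suited to software DSM systems together with several optimizations, JIACKPT is easy to implement and performs well. Application tests on an 8-node PC cluster show that even when a checkpoint is taken every minute, the checkpointing overhead of most applications is below 10%. JIACKPT is also highly portable. These results indicate that JIACKPT has become a fairly practical system.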

10.

Superstep-Induced Rollback Recovery

丁俊 童维勤 《小型微型计算机系统》2002,23(6):731-735

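Workstation clusters have become one of the mainstream directions in distributed parallel processing. As the application areas of cluster systems expand and their scale grows, the demands on their reliability keep rising. Designing a highly reliable cluster system requires a focus on system fault-tolerance techniques. This paper describes rollback recovery and checkpoint derivation in parallel heterogeneous environments, which provide transparent, portable fault tolerance and load balancing, and which form a globally consistent state without the need to adjust checkpoints. BSP applications thereby gain autonomous fault tolerance and can also migrate between clusters, keeping the system load balanced. The paper focuses on checkpoint setting, checkpoint derivation, rollback, and process migration.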

11.

A Reliable Survivability Assurance Model for Distributed Systems

耿技 陈非 聂鹏 陈伟 秦志光 《计算机应用》2012,32(10):2748-2751

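Checkpoint-based coordinated rollback recovery is an effective mechanism for ensuring the survivability of distributed systems. Existing checkpoint-based rollback recovery mechanisms assume that the distributed channels are reliable, but in real deployments this assumption does not always hold. Targeting the actual environments in which distributed systems run, the paper proposes a coordinated survivability assurance model for distributed computing environments with unreliable channels. While retaining the advantages of checkpoint-based rollback recovery, the model ensures the survivability of a distributed system over unreliable communication channels by establishing redundant communication links and by migrating processes.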

12.

Error detection and diagnosis for fault tolerance in distributed systems

Kassem Saleh Khaled Al-Saqabi 《Information and Software Technology》1998,39(14-15)

The early error detection and the understanding of the nature and conditions of an error occurrence can be useful to make an effective and efficient recovery in distributed systems. Various distributed system extensions were introduced for the implementation of fault tolerance in distributed software systems. These extensions rely mainly on the exchange of contextual information appended to every transmitted application specific message. Ideally, this information should be used for checkpointing, error detection, diagnosis and recovery should a transient failure occur later during the distributed program execution. In this paper, we present a generalized extension suitable for fault-tolerant distributed systems such as communication software systems and its detection capabilities are shown. Our extension is based on the execution of message validity test prior to the transmission of messages and the piggybacking of contextual information to facilitate the detection and diagnosis of transient faults in the distributed system.

13.

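File System State Recovery on Program Rollback in Checkpointing Algorithms for Distributed Systems (total citations: 3; self-citations: 0; citations by others: 3)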

沙丽杰 武秀川 韦鹓 《计算机工程与应用》2002,38(17):131-134

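Checkpointing, also called "rollback recovery", is an important means of software fault tolerance. It is used mainly to save and restore the running state of a program, and it plays a very important role in distributed and parallel computing systems. From the perspective of reducing checkpointing overhead, the paper analyzes and further studies the problem of restoring file system state when a program rolls back in checkpointing algorithms for distributed systems.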

14.

Comparative analysis of different models of checkpointing and recovery

Nicola V.F. van Spanje J.M. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(8):807-821

Different checkpointing strategies are combined with recovery models of different refinement levels in the database systems. The complexity of the resulting model increases with its accuracy in representing a realistic system. Three different analytic approaches are used depending on the complexity of the model: analytic, numerical and simulation. A Markovian queuing model is developed, resulting in a combined Poisson and load-dependent checkpointing strategy with stochastic recovery. A state-space analysis approach is used to derive semianalytic expressions for the performance variables in terms of a set of unknown boundary state probabilities. An efficient numerical algorithm for evaluating unknown probabilities is outlined. The validity of the numerical solution is checked against simulation results and shown to be of acceptable accuracy, particularly in the stable operating range. Simulations have shown that realistic load-dependent checkpointing results in performance close to the optimal deterministic checkpointing. Furthermore, the stochastic recovery model is an accurate representation of a realistic recovery

15.

An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415

Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is the checkpointing technique combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed and at various levels: task- (or activity-) level and workflow-level. At workflow-level, the usually adopted approach is to establish a checkpointing frequency in the system which determines the moment at which a global workflow checkpoint – a snapshot of the whole workflow enactment state at normal execution (without failures) – has to be accomplished. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows in which every workflow node in the hierarchy accomplishes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we utilise the Reference net formalism for expressing the scheme. Reference nets are a particular type of Petri nets which can more effectively provide the abstractions to support and to express hierarchical workflows and their dynamic adaptability.

16.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence, are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes the minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.

17.

An MMDB Recovery Method Based on Shadow Pages and Hybrid Logging

鲍程锋 杨小虎 《计算机工程与设计》2011,32(7):2373-2376

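To speed up the recovery of a main-memory database (MMDB) from various failures, the paper proposes a recovery method based on shadow paging, a hybrid logging strategy, and fuzzy checkpointing. It builds a system model of the main-memory database from an analysis of the main time costs incurred while the database is running. By analyzing the transaction process and the checkpoint process, it discusses how the recovery strategy executes and what advantages it offers. It then describes recovery from transaction failures and from system failures under this system model and recovery strategy, together with a performance analysis of the system.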

18.

An analytical model for hybrid checkpointing in Time Warp distributed simulation

Soliman H.M. Elmaghraby A.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):947-951

The Time Warp distributed simulation algorithm uses checkpointing to save process states after certain event executions for later recovery at the time of a rollback. Two main techniques have been used for checkpointing: periodic state saving and incremental state saving. The former technique introduces large overheads in reconstructing a desired state by coasting forward from an earlier checkpointed state if the computational granularity is large. The latter technique also has large overheads in applications with large rollback distances. A hybrid checkpointing technique is proposed which uses both periodic and incremental state saving simultaneously in such a way that it reduces checkpointing time overheads. A detailed analytical model is developed for the hybrid technique, and comparisons are made using similar analytical models with periodic and incremental state saving techniques. Results show that when the system parameters are chosen to represent large and complex simulated systems, the hybrid approach has less checkpointing time overhead than the other two techniques.
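
The contrast between periodic and incremental state saving in the last abstract can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions and is not the scheme analyzed in that paper: the class name `HybridStateSaver`, its methods, and the policy of rolling back from the nearest full snapshot at or after the target are all hypothetical. It takes a full snapshot every `period` events, records the overwritten values of every event in an undo log, and bounds the number of undo steps per rollback by the snapshot period.

```python
# Hypothetical sketch of hybrid (periodic + incremental) state saving.
# Assumption: state is a flat dict whose values are immutable (ints, strings).

_MISSING = object()  # marks a key that did not exist before an event


class HybridStateSaver:
    def __init__(self, initial_state, period):
        if period < 1:
            raise ValueError("period must be >= 1")
        self.state = dict(initial_state)
        self.period = period
        self.event_count = 0
        # Periodic part: event index -> full copy of the state at that point.
        self.snapshots = {0: dict(self.state)}
        # Incremental part: undo_log[i] holds the values overwritten by event i + 1.
        self.undo_log = []

    def apply_event(self, updates):
        """Apply one event, logging old values and snapshotting periodically."""
        self.undo_log.append({k: self.state.get(k, _MISSING) for k in updates})
        self.state.update(updates)
        self.event_count += 1
        if self.event_count % self.period == 0:
            self.snapshots[self.event_count] = dict(self.state)

    def rollback(self, target):
        """Restore the state as it was after `target` events and return a copy."""
        if not 0 <= target <= self.event_count:
            raise ValueError("target out of range")
        # Start from the nearest full state at or after the target; the
        # current state is used when no such snapshot exists.
        later = [i for i in self.snapshots if i >= target]
        start = min(later, default=self.event_count)
        state = dict(self.snapshots.get(start, self.state))
        # Undo events start, start - 1, ..., target + 1 in reverse order.
        for undo in reversed(self.undo_log[target:start]):
            for key, old in undo.items():
                if old is _MISSING:
                    state.pop(key, None)
                else:
                    state[key] = old
        # Discard everything that belongs to the rolled-back future.
        self.state = state
        self.event_count = target
        del self.undo_log[target:]
        self.snapshots = {i: s for i, s in self.snapshots.items() if i <= target}
        return dict(state)
```

In this sketch a pure undo log would have to walk back across the whole rollback distance, while here a rollback undoes at most `period - 1` events before it reaches a full snapshot; the price is the extra memory and copy time of the periodic snapshots.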