1.

Run-time selection of the checkpoint interval in Time Warp based simulations

《Simulation Practice and Theory》1998,6(5):461-478

Time Warp discrete event simulators take advantage of the parallel processing of simulation events. On the other hand, they suffer from the overhead required to enforce the causality relation. This overhead consists of the time for saving the states of logical processes, the time for the rollback procedures, and the wasted simulation time, that is, the time spent processing events that are later undone by a rollback. Two techniques have been developed for state saving: periodic and incremental. In this paper we study the periodic technique, and we present an analytical model describing the simulation execution time as a function of both the state saving cost and the rollback cost. Furthermore, we derive a methodology that allows each logical process to adapt its state saving period online in order to reduce the simulation execution time. Experimental results show that, in some simulation scenarios, our methodology improves performance in comparison with existing proposals.
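As a rough illustration of the periodic technique described in this abstract, the following Python sketch shows a toy logical process that saves its state every `interval` events and, on rollback, restores the nearest earlier checkpoint and re-executes ("coasts forward" through) the intermediate events. The class and method names are hypothetical and the event model is deliberately trivial; this is not the paper's analytical model or its adaptation methodology.

```python
import copy


class LogicalProcess:
    """Toy logical process with periodic state saving (hypothetical names)."""

    def __init__(self, interval):
        self.interval = interval          # save state every `interval` events
        self.state = {"counter": 0}       # assumed, trivial simulation state
        self.executed = 0                 # number of events executed so far
        self.checkpoints = {0: copy.deepcopy(self.state)}
        self.log = []                     # executed events, kept for coasting forward

    def _apply(self, event):
        # Assumed event semantics: an event is an integer added to the counter.
        self.state["counter"] += event

    def execute(self, event):
        self._apply(event)
        self.log.append(event)
        self.executed += 1
        if self.executed % self.interval == 0:
            self.checkpoints[self.executed] = copy.deepcopy(self.state)

    def rollback(self, target):
        """Restore the state reached after `target` events.

        Returns the number of events re-executed while coasting forward.
        """
        base = max(k for k in self.checkpoints if k <= target)
        self.state = copy.deepcopy(self.checkpoints[base])
        for event in self.log[base:target]:
            self._apply(event)
        # Discard checkpoints and log entries that lie after the rollback point.
        self.checkpoints = {k: v for k, v in self.checkpoints.items() if k <= target}
        del self.log[target:]
        self.executed = target
        return target - base
```

A larger `interval` lowers the state saving cost but lengthens the coast-forward phase of each rollback, which is the trade-off the abstract's model is said to capture.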

2.

Accelerating incremental checkpointing for extreme-scale computing

《Future Generation Computer Systems》2014

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overhead and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
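The general idea of hash-based incremental checkpointing can be sketched in a few lines of Python: hash each fixed-size page of the application data and write out only the pages whose hash differs from the previous checkpoint. This is a CPU-only sketch under stated assumptions; the function names and the 4096-byte page size are hypothetical, and it does not reproduce libhashckpt's API, its page-protection mechanism, or its GPU hashing.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_hashes(data, page_size=PAGE_SIZE):
    """Return one SHA-256 digest per page of `data` (bytes or bytearray)."""
    return [hashlib.sha256(data[i:i + page_size]).digest()
            for i in range(0, len(data), page_size)]


def incremental_checkpoint(data, previous_hashes, page_size=PAGE_SIZE):
    """Find the pages that changed since the previous checkpoint.

    Returns (changed, new_hashes), where `changed` maps a page index to a
    copy of that page's bytes. Pass an empty list as `previous_hashes` for
    the first (full) checkpoint.
    """
    new_hashes = page_hashes(data, page_size)
    changed = {}
    for index, digest in enumerate(new_hashes):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * page_size
            changed[index] = bytes(data[start:start + page_size])
    return changed, new_hashes
```

Only the pages in `changed` would need to be written to stable storage; the returned hash list is kept in memory and passed to the next call.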

3.

Error recovery in shared memory multiprocessors using private caches

Wu K.-L. Fuchs W.K. Patel J.H. 《Parallel and Distributed Systems, IEEE Transactions on》1990,1(2):231-240

The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.

4.

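Design and implementation of a checkpoint-based parallel program debugger (Total citations: 4; self-citations: 1; citations by others: 4)

Liu Jian, Wang Dongsheng, Shen Meiming, Zheng Weimin 《Journal of Computer Research and Development》2002,39(12):1580-1586

To support the debugging of large-scale, long-running parallel programs, a checkpointing mechanism needs to be introduced into the parallel program debugger. Checkpointing and rollback must address four problems: in-transit messages, orphan messages, the domino effect, and livelock. Parallel program debugging must additionally address nondeterminism. The proposed deterministic checkpointing method based on state freezing avoids three of these problems (orphan messages, the domino effect, and livelock), handles in-transit messages through message logging, and resolves nondeterminism in parallel debugging with a record/replay approach. The method effectively solves the problems that arise when a parallel program debugger is combined with checkpointing, has a clear structure, and is easy to implement. Based on this technique, a parallel debugging tool, DENNET, was designed and implemented.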

5.

An uncoordinated asynchronous checkpointing model for hierarchical scientific workflows

Rafael Tolosana-Calasanz José Ángel Bañares Pedro Álvarez Joaquín Ezpeleta Omer Rana 《Journal of Computer and System Sciences》2010,76(6):403-415

Scientific workflow systems often operate in unreliable environments, and have accordingly incorporated different fault tolerance techniques. One of them is checkpointing combined with its corresponding rollback recovery process. Different checkpointing schemes have been developed at various levels: task (or activity) level and workflow level. At workflow level, the usual approach is to establish a checkpointing frequency in the system, which determines the moment at which a global workflow checkpoint (a snapshot of the whole workflow enactment state during normal, failure-free execution) has to be taken. We describe an alternative workflow-level checkpointing scheme and its corresponding rollback recovery process for hierarchical scientific workflows, in which every workflow node in the hierarchy takes its own local checkpoint autonomously and in an uncoordinated way after its enactment. In contrast to other proposals, we use the Reference net formalism to express the scheme. Reference nets are a particular type of Petri net that can more effectively provide the abstractions needed to support and express hierarchical workflows and their dynamic adaptability.

6.

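File checkpointing based on virtual file operations (Total citations: 1; self-citations: 0; citations by others: 1)

Liu Shaofeng, Wang Dongsheng, Zhu Jing 《Journal of Software》2002,13(8):1528-1533

Single-process checkpointing and rollback recovery form the basis of fault tolerance in distributed/parallel systems, and saving and restoring information about active files is an important aspect of this technique. A virtual file operation strategy is proposed that implements checkpointing of user files and effectively solves the inconsistency between user file contents and the global process state when a failure occurs. Through techniques such as block-based file management and distributed checkpoint operations, the method outperforms other methods on metrics such as space overhead, normal running time, and recovery time. It is also transparent to users and preserves as much completed work as possible.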

7.

Design and analysis of an integrated checkpointing and recovery scheme for distributed applications

Ramamurthy B. Upadhyaya S. Bhargava B. 《Knowledge and Data Engineering, IEEE Transactions on》2000,12(2):174-186

An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms, and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

8.

Improving Scalability of Application-Level Checkpoint-Recovery by Reducing Checkpoint Sizes

Iván Cores Gabriel Rodríguez María J. Martín Patricia González Roberto R. Osorio 《New Generation Computing》2013,31(3):163-185

The execution times of large-scale parallel applications on today's multi/many-core systems are usually longer than the mean time between failures. Therefore, parallel applications must tolerate hardware failures so that a machine failure does not wipe out all completed computation. Checkpointing and rollback recovery is one of the most popular techniques for implementing fault-tolerant applications. However, checkpointing parallel applications is expensive in terms of computing time, network utilization, and storage resources. Thus, current checkpoint-recovery techniques should minimize these costs in order to be useful for large-scale systems. In this paper, three different and complementary techniques to reduce the size of the checkpoints generated by application-level checkpointing are proposed and implemented. Detailed experimental results obtained on a multicore cluster show the effectiveness of the proposed methods in reducing checkpointing cost.

9.

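Superstep-induced rollback recovery

Ding Jun, Tong Weiqin 《Mini-Micro Systems》2002,23(6):731-735

Workstation cluster systems have become one of the mainstream directions in distributed parallel processing. As the application areas of cluster systems expand and their scale grows, the demands on their reliability keep rising. Designing highly reliable cluster systems requires focused study of system fault tolerance techniques. This paper describes rollback recovery and checkpoint derivation in parallel heterogeneous environments, which provide transparent, portable fault tolerance and load balancing. A globally consistent state can be formed without adjusting checkpoints. This not only gives BSP applications autonomous fault tolerance but also allows them to migrate between clusters, keeping the system load balanced. The paper focuses on checkpoint setting, checkpoint derivation, rollback, and process migration techniques.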

10.

Staggered consistent checkpointing

Vaidya N.H. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):694-702

A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. The traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce checkpoint overhead. This paper presents a simple approach to arbitrarily stagger the checkpoints. Our approach requires that the processes take consistent logical checkpoints, as compared to the consistent physical checkpoints enforced by existing algorithms. Experimental results on nCube-2 are presented.

11.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features. A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggybacked on the application messages, or through limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message, as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to the current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to stable storage. The tentative checkpoints together with the message logs stored in stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present a performance evaluation of our algorithm.

12.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan Q. Jiang Jianchang Yang M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

13.

The performance of cache-based error recovery in multiprocessors

Janssens B. Fuchs W.K. 《Parallel and Distributed Systems, IEEE Transactions on》1994,5(10):1033-1043

Several variations of cache-based checkpointing for rollback error recovery from transient errors in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, differing in the manner in which they avoid rollback propagation, are evaluated in this paper. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, we evaluate the performance effect of integrating the recovery schemes in the cache coherence protocol. Our results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead, but with uncontrollable high variability in the checkpoint interval.

14.

Simulating Spatially Explicit Problems on High Performance Architectures

《Journal of Parallel and Distributed Computing》2002,62(3):446-467

This paper addresses issues of implementation and performance optimization of simulations designed to model spatially explicit problems with the use of parallel discrete event simulation. A simulation system is presented that uses the optimistic protocol and runs on a distributed memory machine, the IBM SP. The efficiency of parallel discrete event simulations that use the optimistic protocol is strongly dependent on the overhead incurred by rollbacks. This paper introduces a novel approach to rollback processing which limits the number of events rolled back as a result of a straggler or antimessage. The method, called Breadth-First Rollback (BFR), is suitable for spatially explicit problems where the space is discretized and distributed among processes and simulation objects move freely in the space. BFR uses incremental state saving, allowing the recovery of causal relationships between events during rollback. These relationships are then used to determine which events need to be rolled back. This paper presents an application of BFR to the simulation of Lyme disease. Our results demonstrate an almost linear speedup, a dramatic improvement over the traditional approach to rollback processing. Additionally, BFR is used as the basis of a dynamic load balancing algorithm that migrates load between the simulation processes. A brief outline of the algorithm and its potential performance are presented.

15.

Rollback overhead reduction methods for time warp distributed simulation

《Simulation Practice and Theory》1998,6(8):689-702

Parallel discrete event simulation is a useful technique for improving on the performance of sequential discrete event simulation. We consider the Time Warp algorithm for asynchronous distributed discrete event simulation. Time Warp is an optimistic synchronization mechanism for asynchronous distributed systems that allows a system to violate the synchronization constraint and, when that happens, makes the system roll back to a correct state. We focus on the kernel of the Time Warp algorithm, that is, the rollback operation, and we propose some techniques to reduce the overhead due to this operation. In particular, we propose a method to reduce the overhead of the state saving operation, two methods to reduce the overhead of a single rollback operation, and a method to reduce the overall number of rollbacks. These methods have been implemented in a distributed simulation environment on a distributed memory system. Experimental results show the effectiveness of the proposed techniques.

16.

Use of common time base for checkpointing and rollback recovery in a distributed system

Ramanathan P. Shin K.G. 《IEEE transactions on pattern analysis and machine intelligence》1993,19(6):571-583

An approach to checkpointing and rollback recovery in a distributed computing system using a common time base is proposed. A common time base is established in the system using a hardware clock synchronization algorithm. This common time base is coupled with the idea of pseudo-recovery points to develop a checkpointing algorithm that has the following advantages: reduced wait for commitment for establishing recovery lines, fewer messages to be exchanged, and lower memory requirements. These advantages are assessed quantitatively by developing a probabilistic model.

17.

A cost model for selecting checkpoint positions in Time Warp parallel simulation

Quaglia F. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(4):346-362

Recent papers have shown that the performance of Time Warp simulators can be improved by appropriately selecting the positions of checkpoints, instead of taking them on a periodic basis. In this paper, we present a checkpointing technique in which the selection of the positions of checkpoints is based on a checkpointing-recovery cost model. Given the current state S, the model determines the convenience of recording S as a checkpoint before the next event is executed. This is done by taking into account the position of the last taken checkpoint, the granularity (i.e., the execution time) of intermediate events, and an estimate of the probability that S will have to be restored due to a rollback later in the execution. A synthetic benchmark in different configurations is used for evaluating and comparing this approach to classical periodic techniques. As a testing environment we used a cluster of PCs connected through a Myrinet switch coupled with a fast communication layer specifically designed to exploit the potential of this type of switch. The obtained results show that our solution allows faster execution and, in some cases, exhibits the additional advantage that less memory is required for recording state vectors. This possibly contributes to further performance improvements when memory is a critical resource for the specific application. Finally, a performance study for the case of a cellular phone system simulation is reported to demonstrate the effectiveness of this solution for a real-world application.

18.

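Research on full state saving in parallel simulation

Wang Xuehui, Cao Lu, Zhang Lei 《Computer Engineering & Science》2012,34(9):47-50

Optimistic time synchronization can significantly improve the performance of parallel simulation, but causality errors arise during optimistic time advancement, and events must then be rolled back using saved states. The state saving mechanism is therefore an important factor in the efficiency of optimistic time advancement. This paper first briefly introduces the execution of logical processes in parallel simulation and discusses the state saving and rollback mechanisms of optimistic time advancement. It then models and theoretically analyzes full state saving, and tests the performance of the full state saving algorithm experimentally. The test results confirm the theoretical analysis.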

19.

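An improved fast N+1 parity checkpoint

Zhou Junhai, Zhang Dafang, Yang Jinmin 《Computer Engineering & Science》2005,27(4):11-13

This paper combines buffering with incremental disk-based checkpointing to propose a fast and reliable improved N+1 parity checkpointing scheme. While N application processes run, a dedicated checkpoint process implements the N+1 parity, and the checkpoint machine uses the idle time between checkpoints to save the incremental parity checkpoint information to stable storage. The improved algorithm exploits the speed of diskless checkpointing and the high reliability of disk-based checkpointing, eliminates one backup processor, and tolerates two concurrent failures: one application process and one checkpoint process.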
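The parity idea underlying N+1 checkpointing can be illustrated with a short Python sketch: the checkpoint process stores the bytewise XOR of the N application checkpoints, and any single lost checkpoint can be rebuilt by XOR-ing the parity with the N-1 survivors. The function names are hypothetical, equal-length checkpoints are assumed, and the buffering and incremental disk writes described in the abstract are not modeled.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a non-empty list of equal-length byte strings."""
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_lost_checkpoint(surviving, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_blocks(list(surviving) + [parity])
```

For example, with three checkpoints `a`, `b`, `c` and `parity = xor_blocks([a, b, c])`, calling `recover_lost_checkpoint([a, c], parity)` yields `b`, because XOR-ing a value with itself cancels it out.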

20.

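A low-overhead rollback recovery implementation method based on concurrency exploitation

Yuan Gongbiao, Yang Jinmin, Bai Shuren 《Computer Engineering》2013,(11):46-51

Existing rollback recovery fault tolerance techniques suffer from synchronization constraints and blocking, and their time overhead grows sharply with the number of system nodes. To address this, a low-overhead rollback recovery implementation method based on concurrency exploitation is proposed. The method removes the synchronization constraint in message logging by piggybacking message dependency tracking on message passing, analyzes the process workload to uncover its concurrency, builds an architecture for concurrent execution of process workloads, and uses data caching and multithreading to execute the workloads within a process concurrently, thereby reducing failure recovery overhead. Experiments on three NAS NPB 2.3 benchmark programs show that the method reduces checkpoint overhead from 0.63 s, 3.19 s, and 1.21 s to 0.18 s, 0.67 s, and 0.19 s, respectively, and reduces the logging overhead rate from 13.4%, 3.5%, and 18.3% to 0.7%, 0.1%, and 1.0%, respectively.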