1.

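Implementation Techniques and Performance Testing of an MPI Checkpoint System Based on the Lustre File System (total citations: 1; self-citations: 0; citations by others: 1)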

Xie Min, Lu Yutong, Zhou Enqiang, Cao Hongjia, Yang Xuejun 《计算机研究与发展》 (Journal of Computer Research and Development) 2007,44(10):1709-1716

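Rollback recovery based on coordinated checkpointing is an important fault-tolerance technique adopted in large-scale parallel computer systems; its performance overhead is determined mainly by the coordination protocol and by the storage of checkpoint images. This paper describes an application-transparent parallel checkpoint system implemented in MPICH2. Compared with existing techniques, the system has the following features: 1) the coordination protocol exploits the nearest-neighbor communication pattern of parallel applications and uses a virtual-connection method to reduce protocol processing overhead; 2) the Lustre file system is used to simplify the management of checkpoint image files; 3) parallel I/O operations improve performance and optimize the storing of checkpoint images. Tests with real applications show that the checkpoint system has low runtime overhead and good scalability.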

2.

Optimizing checkpoint-based fault-tolerance in distributed stream processing systems: Theory to practice

Sachini Jayasekara, Shanika Karunasekera, Aaron Harwood 《Software》2022,52(1):296-315

Fault-tolerance is an essential part of a stream processing system that guarantees data analysis can continue even after failures. State-of-the-art distributed stream processing systems use checkpointing to support fault-tolerance for stateful computations, where the state of the computations is periodically persisted. However, the frequency of checkpoints impacts the performance (utilization, latency, and throughput) of the system, as the checkpointing process consumes resources and time that could be used for actual computations. In practice, systems are often configured to perform checkpoints based on crude values that ignore factors such as checkpoint and restart costs, leading to suboptimal performance. In our previous work, we proposed a theoretical optimal checkpoint interval that maximizes the system utilization for stream processing systems to minimize the impact of checkpointing on system performance. In this article, we investigate the practical benefits of our proposed theoretical optimal by conducting experiments in a real-world cloud setting using different streaming applications; we use Apache Flink, a well-known stream processing system, for our experiments. The experiment results demonstrate that an optimal interval can achieve better utilization, confirming the practicality of the theoretical model when applied to real-world applications. We observed utilization improvements from 10% to 200% for a range of failure rates from 0.3 failures per hour to 0.075 failures per minute. Moreover, we explore how the performance measures latency and throughput are affected by the optimal interval. Our observations demonstrate that significant improvements can be achieved using the optimal interval for both latency and throughput.
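
The abstract above does not state the authors' formula for the optimal interval. As a rough illustration of how checkpoint cost and failure rate trade off, the sketch below uses Young's classic first-order approximation, T = sqrt(2 · C · MTBF), which is a different and simpler model than the one in the article. The function names and the 5-second checkpoint cost are hypothetical; only the two failure rates come from the abstract.

```python
import math


def mtbf_from_rate(failures: float, per_seconds: float) -> float:
    """Convert a failure rate (failures per time window) to a mean time between failures in seconds."""
    if failures <= 0 or per_seconds <= 0:
        raise ValueError("failures and per_seconds must be positive")
    return per_seconds / failures


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This is NOT the model from the cited article; it is a well-known
    textbook approximation used here only for illustration.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint_cost_s and mtbf_s must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


if __name__ == "__main__":
    cost_s = 5.0  # hypothetical checkpoint cost, not taken from the article
    low = mtbf_from_rate(0.3, 3600.0)    # 0.3 failures per hour
    high = mtbf_from_rate(0.075, 60.0)   # 0.075 failures per minute
    print(round(young_interval(cost_s, low), 1))
    print(round(young_interval(cost_s, high), 1))
```

Under this approximation, a higher failure rate shortens the interval: with the assumed 5-second cost, the interval drops from roughly 346 seconds to roughly 89 seconds across the two failure rates quoted in the abstract.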

3.

FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

Yang Xuejun, Du Yunfei, Wang Panfeng, Fu Hongyi, Jia Jia 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(10):1471-1486

As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease the FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool to automate the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than the performance of the traditional checkpointing approach.

4.

Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems

《Journal of Parallel and Distributed Computing》2001,61(11):1570-1590

Performance prediction of checkpointing systems in the presence of failures is a well-studied research area. While the literature abounds with performance models of checkpointing systems, none addresses the issue of selecting runtime parameters other than the optimal checkpointing interval. In particular, the issue of processor allocation is typically ignored. In this paper, we present a performance model for long-running parallel computations that execute with checkpointing enabled. We then discuss how it is relevant to today's parallel computing environments and software, and present case studies of using the model to select runtime parameters.

5.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan, Q. Jiang, Jianchang Yang, M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

6.

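A Coordinated Checkpointing Method Supporting Distributed Cooperative Real-Time Transaction Processing (total citations: 1; self-citations: 0; citations by others: 1)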

Li Guohui, Wang Hongya, Chen Jixiong, Liu Yunsheng 《计算机学报》 (Chinese Journal of Computers) 2004,27(9):1207-1212

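During the execution of real-time transactions, transaction failures or data contention can cause transactions to restart. To reduce the work lost to restarts, checkpointing can be used to help ensure the timing correctness of transactions. In one class of distributed real-time database applications, transactions on different nodes form cooperative relationships by exchanging messages; to guarantee global consistency among cooperating transactions, when one transaction takes a checkpoint, the related transactions must also take checkpoints. Traditional coordinated checkpointing methods do not consider the timing constraints of the application and cannot support distributed cooperative real-time transaction processing well. This paper proposes a graph-theoretic coordinated checkpointing method: using a local directed graph maintained on each computing node for each set of cooperating transactions, a graph-based computation identifies the transactions that should take checkpoints. The method has the minimal coordinated checkpointing property and also minimizes the latency of global checkpoints. Experiments show that the algorithm reduces global checkpoint latency and helps real-time transactions meet their deadlines.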

7.

A fully informed model-based checkpointing protocol for preventing useless checkpoints

《International Journal of Parallel, Emergent and Distributed Systems》2013,28(6):485-518

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed systems. When processes involved in a distributed computation are allowed to take checkpoints independently without any coordination with each other, some or all of the checkpoints taken may not be part of any consistent global checkpoint, and hence are useless for recovery. Communication-induced checkpointing algorithms allow processes to take checkpoints independently and also ensure that each checkpoint taken is part of a consistent global checkpoint by forcing processes to take some additional checkpoints. It is well known that it is impossible to design an optimal communication-induced checkpointing algorithm (i.e. a checkpointing algorithm that takes the minimum number of forced checkpoints). So, researchers have designed communication-induced checkpointing algorithms that reduce forced checkpoints using different heuristics. In this paper, we present a communication-induced checkpointing algorithm which takes fewer forced checkpoints than some of the existing checkpointing algorithms in its class.

8.

Diskless checkpointing (total citations: 4; self-citations: 0; citations by others: 4)

Plank J.S., Kai Li, Puening M.A. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):972-986

Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.
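
The abstract does not spell out the scheme, so the following is only a minimal sketch of one common way to checkpoint without stable storage: each process keeps its checkpoint in memory, and an extra process holds the bytewise XOR parity of all checkpoints, so that any single lost checkpoint can be rebuilt from the survivors. The function names are hypothetical, and equal-length checkpoints are assumed for simplicity.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length in this sketch")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(checkpoints: Sequence[bytes]) -> bytes:
    """Compute the parity checkpoint held by the (assumed) extra parity process."""
    if not checkpoints:
        raise ValueError("at least one checkpoint is required")
    return reduce(xor_bytes, checkpoints)


def recover_lost(survivors: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


if __name__ == "__main__":
    states = [b"proc0-st", b"proc1-st", b"proc2-st"]  # in-memory checkpoints
    parity = encode_parity(states)
    # Suppose process 1 fails; its state is rebuilt from processes 0 and 2.
    print(recover_lost([states[0], states[2]], parity))
```

A single parity checkpoint tolerates only one failure at a time; tolerating more simultaneous failures needs a stronger encoding, which this sketch does not attempt.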

9.

A Checkpoint System Suited to Large-Scale Cluster Parallel Computing (total citations: 4; self-citations: 1; citations by others: 4)

Zhou Enqiang, Lu Yutong, Shen Zhiyu 《计算机研究与发展》 (Journal of Computer Research and Development) 2005,42(6):987-992

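Distributed checkpoint systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two major bottlenecks that limit the scalability of parallel checkpoint systems. Targeting the execution characteristics of parallel applications and the architectural features of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images, respectively, to effectively reduce the overhead of coordinated checkpointing and improve the scalability of the checkpoint system. Preliminary test results show that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.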

10.

An Efficient Parallel Checkpointing Algorithm for Cooperative Real-Time Transactions

Li Guohui, Wang Hongya, Liu Yunsheng 《计算机科学》 (Computer Science) 2005,32(7):69-71

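Many applications with strong timing requirements on both data and activities are also geographically distributed, and this demand has made distributed real-time databases a hot topic in database research. During the execution of real-time transactions, transaction failures or data contention can cause transactions to restart; to reduce the work lost to restarts, checkpointing can be used to help satisfy the timing correctness of transactions. In some distributed real-time database applications, transactions on different nodes form cooperative relationships by exchanging messages; when one transaction takes a checkpoint, the related transactions must take checkpoints accordingly to guarantee global consistency among the cooperating transactions. Traditional coordinated checkpointing methods do not consider the timing constraints of the application and cannot support distributed real-time transaction processing well. This paper proposes an efficient parallel coordinated checkpointing method that has the minimal coordinated checkpointing property and also minimizes the delay of the global checkpointing process. Experiments show that the algorithm reduces the blocking time of global checkpointing and helps distributed real-time transactions meet their deadlines.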

11.

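Design and Implementation of a Checkpoint-Based Parallel Program Debugger (total citations: 4; self-citations: 1; citations by others: 4)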

Liu Jian, Wang Dongsheng, Shen Meiming, Zheng Weimin 《计算机研究与发展》 (Journal of Computer Research and Development) 2002,39(12):1580-1586

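To support the debugging of large-scale, long-running parallel programs, a checkpointing mechanism needs to be introduced into the parallel program debugger. Checkpointing and rollback must solve four problems: in-transit messages, orphan messages, the domino effect, and livelock. Parallel program debugging must additionally solve the problem of nondeterminism. The proposed deterministic checkpointing method based on state freezing avoids three of these problems (orphan messages, the domino effect, and livelock), handles in-transit messages by logging them, and uses record/replay to resolve nondeterminism in parallel debugging. The method effectively solves the many problems that arise when a parallel program debugger is combined with checkpointing, and it has a clear structure and is easy to implement. Based on this technique, a parallel debugging tool, DENNET, was designed and implemented.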

12.

An analytical model for hybrid checkpointing in time warp distributed simulation

Soliman H.M., Elmaghraby A.S. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):947-951

The Time Warp distributed simulation algorithm uses checkpointing to save process states after certain event executions for later recovery at the time of a rollback. Two main techniques have been used for checkpointing: periodic state saving and incremental state saving. The former technique introduces large overheads in reconstructing a desired state by coasting forward from an earlier checkpointed state if the computational granularity is large. The latter technique also has large overheads in applications with large rollback distances. A hybrid checkpointing technique is proposed which uses both periodic and incremental state saving simultaneously in such a way that it reduces checkpointing time overheads. A detailed analytical model is developed for the hybrid technique, and comparisons are made using similar analytical models with periodic and incremental state saving techniques. Results show that when the system parameters are chosen to represent large and complex simulated systems, the hybrid approach has less checkpointing time overhead than the other two techniques.

13.

TCASM: An asynchronous shared memory interface for high-performance application composition

《Parallel Computing》2017

This paper addresses the growing need for mechanisms supporting intra-node application composition in high-performance computing (HPC) systems. It provides a novel shared memory interface that allows composite applications, two or more coupled applications, to share internal data structures without blocking. This allows independent progress of the applications such that they can proceed in a parallel, overlapped fashion. Composite applications using in-node shared memory can reduce the amount of data to be communicated between nodes, allowing checkpointing and data reduction or analytics to be performed locally and in parallel. The approach is implemented in Linux, and evaluated using benchmarks that represent typical composite applications on a large HPC testbed. The results show that the proposed approach significantly outperforms the traditional ones (up to a 15-fold speed increase on a 200-node machine).

14.

Memory exclusion: optimizing the performance of checkpointing systems

James S. Plank, Yuqun Chen, Kai Li, Micah Beck, Gerry Kingsley 《Software》1999,29(2):125-142

Checkpointing systems are a convenient way for users to make their programs fault-tolerant by intermittently saving program state to disk and restoring that state following a failure. The main concern with checkpointing is the overhead that it adds to the running time of the program. This paper describes memory exclusion, an important class of optimizations that reduce the overhead of checkpointing. Some forms of memory exclusion are well-known in the checkpointing community. Others are relatively new. In this paper, we describe all of them within the same framework. We have implemented these optimization techniques in two checkpointers: libckpt, which works on Unix-based workstations, and CLIP, which works on the Intel Paragon. Both checkpointers are publicly available at no cost. We have checkpointed various long-running applications with both checkpointers and have explored the performance improvements that may be gained through memory exclusion. Results from these experiments are presented and show the improvements in time and space overhead. Copyright © 1999 John Wiley & Sons, Ltd.

15.

A Method for Reducing the Checkpoint Overhead of Parallel Programs

Zhou Xiaocheng, Sun Ninghui, Huo Zhigang, Ma Jie 《计算机工程》 (Computer Engineering) 2007,33(12):84-86

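Checkpointing and rollback recovery are an effective way to improve system reliability and achieve fault-tolerant computing. Their performance is usually evaluated by the overhead ratio, and checkpoint overhead is the main factor affecting that ratio. Given that parallel programs currently spend considerable time blocked on communication, this paper proposes a method that further reduces checkpoint overhead on top of copy-on-write checkpoint buffering. By controlling the scheduling of the state-saving thread and choosing an appropriate state-saving granularity, the method makes good use of communication blocking time to hide the overhead introduced by running the state-saving thread, thereby further reducing the overhead ratio.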

16.

Analysis of a service facility with periodic checkpointing

François Baccelli 《Acta Informatica》1980,15(1):67-81

Summary: This paper applies the theory of Markov renewal and semi-regenerative processes to checkpointing problems. Its main practical contribution is an analytic expression for the mean response time of systems under checkpointing in the presence of intermittent failures (databases, file systems ...).

17.

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Ifeanyi P. Egwutuoha, David Levy, Bran Selic, Shiping Chen 《The Journal of supercomputing》2013,65(3):1302-1326

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and also survey the fault tolerance approaches for HPC systems and issues with these approaches. Rollback-recovery techniques are discussed because they are the most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

18.

A Parallel Checkpoint System Based on Communication System Support

Huo Zhigang, Ma Jie, Sun Ninghui 《计算机工程》 (Computer Engineering) 2007,33(5):217-219

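In large-scale cluster environments, checkpointing and recovery are an indispensable fault-tolerance technique. This paper proposes a method that relies on the reliability mechanism of the cluster communication system to capture the global state of the communication system without global synchronization, and uses it to implement a parallel checkpoint system that is transparent to applications. With support from the underlying communication system, the system reduces the implementation complexity and execution overhead of parallel checkpointing and is suitable for large-scale cluster applications.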

19.

Necessary and sufficient conditions for transaction-consistent global checkpoints in a distributed database system

Jiang Wu, Bhavani Thuraisingham 《Information Sciences》2009,179(20):3659-3672

Checkpointing and rollback recovery are well-known techniques for handling failures in distributed systems. The issues related to the design and implementation of efficient checkpointing and recovery techniques for distributed systems are well understood. For example, the necessary and sufficient conditions for a set of checkpoints to be part of a consistent global checkpoint have been established for distributed computations. In this paper, we address the analogous question for distributed database systems. In distributed database systems, transaction-consistent global checkpoints are useful not only for recovery from failure but also for audit purposes. If each data item of a distributed database is checkpointed independently by a separate transaction, none of the checkpoints taken may be part of any transaction-consistent global checkpoint. However, allowing individual data items to be checkpointed independently results in non-intrusive checkpointing. In this paper, we establish the necessary and sufficient conditions for the checkpoints of a set of data items to be part of a transaction-consistent global checkpoint of the distributed database. Such conditions can also help in the design and implementation of non-intrusive checkpointing algorithms for distributed database systems.

20.

Accelerating incremental checkpointing for extreme-scale computing

《Future Generation Computer Systems》2014

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
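
To make the idea of hash-based incremental checkpointing concrete, the sketch below splits application state into fixed-size pages, hashes each page, and saves only the pages whose hash differs from the previous checkpoint. It is not libhashckpt: it runs on the CPU with Python's `hashlib`, omits page protection entirely, and the 4096-byte page size and all function names are assumptions made for illustration.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple

PAGE_SIZE = 4096  # assumed page size for this sketch


def page_hashes(data: bytes, page_size: int = PAGE_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per fixed-size page of the state."""
    return [
        hashlib.sha256(data[offset:offset + page_size]).digest()
        for offset in range(0, len(data), page_size)
    ]


def incremental_checkpoint(
    data: bytes,
    previous_hashes: Sequence[bytes],
    page_size: int = PAGE_SIZE,
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return (changed pages keyed by page index, hashes of the current state)."""
    current = page_hashes(data, page_size)
    changed: Dict[int, bytes] = {}
    for index, digest in enumerate(current):
        is_new = index >= len(previous_hashes)
        if is_new or previous_hashes[index] != digest:
            start = index * page_size
            changed[index] = data[start:start + page_size]
    return changed, current


if __name__ == "__main__":
    state = bytearray(4 * PAGE_SIZE)
    full, hashes = incremental_checkpoint(bytes(state), [])
    state[5000] = 1  # modify one byte, which falls in page 1
    delta, hashes = incremental_checkpoint(bytes(state), hashes)
    print(sorted(full), sorted(delta))
```

The first call has no previous hashes, so every page is saved; later calls save only pages that changed, which is where the reduction in checkpoint volume comes from.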