
1.

Vaidya N.H. 《Parallel and Distributed Systems, IEEE Transactions on》1999,10(7):694-702

A consistent checkpointing algorithm saves a consistent view of a distributed application's state on stable storage. Traditional consistent checkpointing algorithms require different processes to save their state at about the same time. This causes contention for the stable storage, potentially resulting in large overheads. Staggering the checkpoints taken by various processes can reduce checkpoint overhead. This paper presents a simple approach to arbitrarily stagger the checkpoints. Our approach requires that the processes take consistent logical checkpoints, rather than the consistent physical checkpoints enforced by existing algorithms. Experimental results on nCube-2 are presented.

2.

Diskless checkpointing (total citations: 4; self-citations: 0; citations by others: 4)

Plank J.S. Kai Li Puening M.A. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(10):972-986

Diskless Checkpointing is a technique for checkpointing the state of a long-running computation on a distributed system without relying on stable storage. As such, it eliminates the performance bottleneck of traditional checkpointing on distributed systems. In this paper, we motivate diskless checkpointing and present the basic diskless checkpointing scheme along with several variants for improved performance. The performance of the basic scheme and its variants is evaluated on a high-performance network of workstations and compared to traditional disk-based checkpointing. We conclude that diskless checkpointing is a desirable alternative to disk-based checkpointing that can improve the performance of distributed applications in the face of failures.

3.

Mutable checkpoints: a new checkpointing approach for mobile computing systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):157-172

Mobile computing raises many new issues such as lack of stable storage, low bandwidth of wireless channels, high mobility, and limited battery life. These new issues make traditional checkpointing algorithms unsuitable. Coordinated checkpointing is an attractive approach for transparently adding fault tolerance to distributed applications since it avoids domino effects and minimizes the stable storage requirement. However, it suffers from high overhead associated with the checkpointing process in mobile computing systems. Two approaches have been used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal until the Prakash-Singhal algorithm combined them. However, we found that this algorithm may result in an inconsistency in some situations and we proved that there does not exist a nonblocking algorithm which forces only a minimum number of processes to take their checkpoints. In this paper, we introduce the concept of “mutable checkpoint,” which is neither a tentative checkpoint nor a permanent checkpoint, to design efficient checkpointing algorithms for mobile computing systems. Mutable checkpoints can be saved anywhere, e.g., the main memory or local disk of MHs. In this way, taking a mutable checkpoint avoids the overhead of transferring large amounts of data to the stable storage at MSSs over the wireless network. We present techniques to minimize the number of mutable checkpoints. Simulation results show that the overhead of taking mutable checkpoints is negligible. Based on mutable checkpoints, our nonblocking algorithm avoids the avalanche effect and forces only a minimum number of processes to take their checkpoints on the stable storage.

4.

A Low-Cost Checkpointing Technique for Distributed Databases

Jun-Lin Lin Margaret H. Dunham 《Distributed and Parallel Databases》2001,10(3):241-268

For distributed databases, checkpointing is used to ensure an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent. Most current checkpointing approaches for distributed databases are too expensive during run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn causes longer response time for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading the system performance. This paper presents a low-cost solution, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC), to these problems. LSLFC supports global reconstruction, and our performance study shows that LSLFC has little overhead during run time.

5.

Replication-Based Fault Tolerance for MPI Applications

Walters John Paul Chaudhary Vipin 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(7):997-1010

As computational clusters increase in size, their mean time to failure decreases drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable, particularly after 64 CPUs. We demonstrate the low overhead of our checkpointing and replication scheme with the NAS Parallel Benchmarks and the High-Performance LINPACK benchmark with tests up to 256 nodes while demonstrating that checkpointing and replication can be achieved with a much lower overhead than that provided by current techniques. Finally, we show that the monetary cost of our solution is as low as 25 percent of that of a typical SAN/parallel-file-system-equipped storage system.

6.

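A coordinated checkpointing method supporting distributed cooperative real-time transaction processing (total citations: 1; self-citations: 0; citations by others: 1)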

李国徽 王洪亚 陈基雄 刘云生 《计算机学报》2004,27(9):1207-1212

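During real-time transaction execution, transaction failures or data contention can force transactions to restart. To reduce the work lost on restart, checkpointing can be used to help preserve the timing correctness of transactions. In one class of distributed real-time database applications, transactions on different nodes cooperate by exchanging messages; to guarantee global consistency among cooperating transactions, when one transaction takes a checkpoint, the related transactions must also take checkpoints. Traditional coordinated checkpointing methods do not consider the timing constraints of applications and therefore cannot adequately support distributed cooperative real-time transaction processing. This paper proposes a graph-theoretic coordinated checkpointing method: using a local directed graph maintained on each computing node for each set of cooperating transactions, a graph-based computation identifies the transactions that should take checkpoints. The method has the minimal coordinated checkpointing property and also minimizes the latency of the global checkpoint. Experiments show that the algorithm reduces global checkpoint latency and helps real-time transactions meet their deadlines.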

7.

On coordinated checkpointing in distributed systems

Guohong Cao Singhal M. 《Parallel and Distributed Systems, IEEE Transactions on》1998,9(12):1213-1225

Coordinated checkpointing simplifies failure recovery and eliminates domino effects in case of failures by preserving a consistent global checkpoint on stable storage. However, the approach suffers from high overhead associated with the checkpointing process. Two approaches are used to reduce the overhead: the first is to minimize the number of synchronization messages and the number of checkpoints; the other is to make the checkpointing process nonblocking. These two approaches were orthogonal until the Prakash-Singhal algorithm combined them. In other words, the Prakash-Singhal algorithm forces only a minimum number of processes to take checkpoints and it does not block the underlying computation. However, we found two problems in this algorithm. In this paper, we identify these problems and prove a more general result: there does not exist a nonblocking algorithm that forces only a minimum number of processes to take their checkpoints. Based on this general result, we propose an efficient algorithm that neither forces all processes to take checkpoints nor blocks the underlying computation during checkpointing. Also, we point out future research directions in designing coordinated checkpointing algorithms for distributed computing systems.

8.

An optimistic checkpointing and message logging approach for consistent global checkpoint collection in distributed systems

Qiangfeng Yi D. 《Journal of Parallel and Distributed Computing》2008,68(12):1575-1589

Checkpointing and rollback recovery are widely used techniques for achieving fault-tolerance in distributed systems. In this paper, we present a novel checkpointing algorithm which has the following desirable features: A process can independently initiate consistent global checkpointing by saving its current state, called a tentative checkpoint. Other processes come to know about a consistent global checkpoint initiation through information piggy-backed with the application messages or limited control messages if necessary. When a process comes to know about a new consistent global checkpoint initiation, it takes a tentative checkpoint after processing the message (not before processing the message as in existing communication-induced checkpointing algorithms). After a process takes a tentative checkpoint, it starts logging the messages sent and received in memory. When a process comes to know that every other process has taken a tentative checkpoint corresponding to current consistent global checkpoint initiation, it flushes the tentative checkpoint and the message log to the stable storage. The tentative checkpoints together with the message logs stored in the stable storage form a consistent global checkpoint. Two or more processes can concurrently initiate consistent global checkpointing by taking a new tentative checkpoint; in that case, the tentative checkpoints taken by all these processes will be part of the same consistent global checkpoint. The sequence numbers assigned to checkpoints by a process increase monotonically. Checkpoints with the same sequence number form a consistent global checkpoint. We also present the performance evaluation of our algorithm.

9.

An implementation of using remote memory to checkpoint processes

Shang‐Te Hsu Ruei‐Chuan Chang 《Software》1999,29(11):985-1004

Process checkpointing is a procedure which periodically saves the process states into stable storage. Most checkpointing facilities select hard disks for archiving. However, the disk seek time is limited by the speed of the read‐write heads, so checkpointing a process to a local disk requires extensive disk bandwidth. In this paper, we propose an approach that exploits the memory on idle workstations as a faster storage for checkpointing. In our scheme, autonomous machines which submit jobs to the computation server offer their physical memory to the server for job checkpointing. Eight applications are used to measure the remote memory performance in four checkpointing policies. Experimental results show that remote memory reduces the overhead by at least 34.5 per cent for sequential checkpointing and 32.1 per cent for incremental checkpointing. Additionally, checkpointing a running process into remote memory requires only 60 per cent of the local disk checkpoint latency time. Copyright © 1999 John Wiley & Sons, Ltd.
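The incremental checkpointing policy mentioned in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that process state is modelled as a dictionary from page number to page contents; the function names `incremental_checkpoint` and `restore` are hypothetical and are not taken from the paper, and the sketch ignores pages that disappear between checkpoints.

```python
def incremental_checkpoint(previous_pages, current_pages):
    """Return only the pages that are new or changed since the previous checkpoint.

    Both arguments map a page number to that page's contents (an assumption
    made for this sketch, not the paper's data structure).
    """
    return {
        page: data
        for page, data in current_pages.items()
        if previous_pages.get(page) != data
    }


def restore(full_checkpoint, increments):
    """Rebuild the latest state from one full checkpoint plus ordered increments."""
    state = dict(full_checkpoint)
    for increment in increments:
        state.update(increment)  # later increments overwrite older pages
    return state
```

Only the changed pages would need to travel to the remote memory or disk, which is why an incremental policy can cut the volume of data written per checkpoint.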

10.

An improved fast N+1 parity checkpointing scheme

周军海 张大方 杨金民 《计算机工程与科学》2005,27(4):11-13

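This paper combines buffering with incremental disk-based checkpointing to propose a fast and reliable improved N+1 parity checkpointing scheme. While N application processes run, a dedicated checkpoint process implements the N+1 parity, and the checkpoint machine uses the idle time between checkpoints to save the incremental parity checkpoint information to stable storage. The improved algorithm exploits the speed of diskless checkpointing and the high reliability of disk-based checkpointing, eliminates one backup processor, and can tolerate two concurrent failures: one application process and one checkpoint process.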
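The N+1 parity idea can be sketched as follows. This is a minimal illustration of plain XOR parity over N equal-length checkpoint images, assuming a single lost image; it does not model the buffering or incremental disk writes described in the abstract, and the names `xor_parity` and `recover_lost` are hypothetical.

```python
from functools import reduce


def xor_parity(blocks):
    """Bytewise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all checkpoint blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_lost(surviving_blocks, parity):
    """Rebuild the single missing checkpoint from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])
```

Because XOR is its own inverse, combining the parity with the N-1 surviving images yields the lost one, so only one extra image needs to be stored for N processes.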

11.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (total citations: 1; self-citations: 0; citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

12.

Optimization of checkpointing-related I/O for high-performance parallel and distributed computing

Rajagopal Subramaniyan Eric Grobelny Scott Studham Alan D. George 《The Journal of supercomputing》2008,46(2):150-180

Checkpointing, the process of saving program/application state, usually to a stable storage, has been the most common fault-tolerance methodology for high-performance applications. The rate of checkpointing (how often) is primarily driven by the failure rate of the system. If the checkpointing rate is low, fewer resources are consumed but the chance of high computational loss is increased and vice versa if the checkpointing rate is high. It is important to strike a balance, and an optimum rate of checkpointing is required. In this paper, we analytically model the process of checkpointing in terms of mean-time-between-failure of the system, amount of memory being checkpointed, sustainable I/O bandwidth to the stable storage, and frequency of checkpointing. We identify the optimum frequency of checkpointing to be used on systems with given specifications thereby making way for efficient use of available resources and maximum performance of the system without compromising on the fault-tolerance aspects. Further, we develop discrete-event models simulating the checkpointing process to verify the analytical model for optimum checkpointing. Using the analytical model, we also investigate the optimum rate of checkpointing for systems of varying resource levels ranging from small embedded cluster systems to large supercomputers.
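The trade-off described in this abstract can be made concrete with a classical first-order approximation, Young's formula, which puts the interval between checkpoints at roughly sqrt(2 × checkpoint cost × MTBF). This is a well-known textbook estimate, not the analytical model developed in the paper; the function names and the idea of deriving the checkpoint cost as memory size divided by sustained I/O bandwidth are assumptions made for this sketch.

```python
import math


def checkpoint_cost_seconds(memory_bytes, io_bandwidth_bytes_per_s):
    """Time to write one checkpoint, assuming cost = memory / sustained bandwidth."""
    if memory_bytes <= 0 or io_bandwidth_bytes_per_s <= 0:
        raise ValueError("memory size and bandwidth must be positive")
    return memory_bytes / io_bandwidth_bytes_per_s


def young_interval_seconds(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the interval between checkpoints."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

For example, writing 8 GB at 1 GB/s costs 8 s per checkpoint; with an MTBF of 10,000 s the estimate is sqrt(2 × 8 × 10,000) = 400 s between checkpoints. The approximation is only reasonable when the checkpoint cost is small relative to the MTBF.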


13.

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

James S. Plank Youngbae Kim Jack J. Dongarra 《Journal of Parallel and Distributed Computing》1997,43(2):427

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.

14.

An optimized failure recovery strategy for distributed systems (total citations: 1; self-citations: 0; citations by others: 1)

刘云龙 陈俊亮 《计算机学报》1999,22(3):249-257

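This paper studies in depth failure recovery strategies for distributed systems composed of deterministic process groups, and proposes a method that applies data-flow analysis to statically compute the minimal checkpoint data set of a process.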

15.

Checkpointing for optimistic concurrency control methods

Thomasian A. 《Knowledge and Data Engineering, IEEE Transactions on》1995,7(2):332-339

Restart-oriented concurrency control (CC) methods, such as optimistic CC, outperform blocking-oriented methods, such as standard two-phase locking, in a high data contention environment, but this is at the cost of wasted processing due to restarts. Volatile savepoints are considered in this study as a method to reduce this wasted processing and to improve response time. There is the usual tradeoff between the checkpointing overhead and the wasted processing when a transaction is restarted. Our study shows that in a system where objects are accessed and updated uniformly during the lifetime of transactions, significant improvements in performance at high data conflict levels are attainable only when the checkpointing cost is low. This implies few optimally placed checkpoints per transaction. It is observed that checkpointing may result in a significant improvement in performance when accesses to database hot-spots are deferred to the final steps of transaction execution. The parametric studies reported in this paper are facilitated by closed-form analytic solutions expressing system performance, as well as an iterative solution which takes into account hardware resource contention in addition to data contention.

16.

Asynchronous recovery without using vector timestamps

《Journal of Parallel and Distributed Computing》2002,62(12):1695-1728

A checkpoint of a process involved in a distributed computation is said to be useful if it is part of a consistent global checkpoint. In this paper, we present a quasi-synchronous checkpointing algorithm that makes every checkpoint useful. We also present an efficient asynchronous recovery algorithm based on the checkpointing algorithm. The checkpointing algorithm allows the processes to take checkpoints asynchronously and also forces the processes to take additional checkpoints in order to make every checkpoint useful. The recovery algorithm can handle concurrent failure of multiple processes. The recovery algorithm has no domino effect, and a failed process needs only to roll back to its latest checkpoint and request the other processes to roll back to a consistent checkpoint. Messages are only selectively logged to cope with various types of message abnormalities that arise due to rollback, which results in low message logging overhead. Unlike some existing algorithms, our algorithm does not use vector timestamps for tracking dependency between checkpoints and hence incurs low message overhead during failure-free operation. Moreover, a process can asynchronously identify garbage checkpoints and delete them from the stable storage; garbage checkpoints are the checkpoints that are no longer required for the purpose of recovery.
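The notion of a consistent global checkpoint used in this abstract can be sketched with the standard orphan-message test: a set of local checkpoints is consistent if no message is recorded as received by one checkpoint while its send lies after the sender's checkpoint. The representation below (a cut as a map from process name to the number of local events it covers, and a hypothetical `Message` record) is an assumption for illustration and is not the paper's algorithm.

```python
from typing import NamedTuple


class Message(NamedTuple):
    sender: str
    send_event: int   # index of the send event in the sender's local history
    receiver: str
    recv_event: int   # index of the receive event in the receiver's local history


def is_consistent(cut, messages):
    """Return True if the cut contains no orphan message.

    `cut` maps each process name to the number of local events included in
    that process's checkpoint (events 0 .. cut[p] - 1 are inside the cut).
    """
    for m in messages:
        received_in_cut = m.recv_event < cut[m.receiver]
        sent_in_cut = m.send_event < cut[m.sender]
        if received_in_cut and not sent_in_cut:
            return False  # receive is recorded but the send is not: orphan
    return True
```

A checkpoint is useful in the abstract's sense exactly when some choice of checkpoints on the other processes makes this test pass.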

17.

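A checkpointing system suited to large-scale cluster parallel computing (total citations: 5; self-citations: 1; citations by others: 4)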

周恩强 卢宇彤 沈志宇 《计算机研究与发展》2005,42(6):987-992

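Distributed checkpointing systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two major bottlenecks limiting the scalability of parallel checkpointing systems. Based on the execution characteristics of parallel applications and the architectural features of high-performance clusters, the C system uses dynamic virtual connection techniques and distributed storage of checkpoint images, respectively, to reduce the overhead of coordinated checkpointing and improve the scalability of the checkpointing system. Preliminary test results indicate that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.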

18.

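Research on a shared-memory-based checkpointing mechanism for cluster services (total citations: 1; self-citations: 0; citations by others: 1)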

梁毅 王磊 樊建平 方娟 《计算机研究与发展》2010,47(4)

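To address the high system cost and long recovery time of existing stable-storage-based checkpointing for cluster services, this paper proposes a shared-memory-based checkpointing mechanism for cluster services. It designs a checkpoint information management protocol for a primary-backup storage model of checkpoint information in shared memory, which ensures the consistency of cluster service checkpoint information, and a checkpoint group management protocol based on a unidirectional logical ring, which ensures a consistent membership view among the checkpoint processes in the logical backup ring. Performance experiments show that the mechanism offers good read and write performance for checkpoint information, that the group management protocol has low system overhead, and that the mechanism meets the checkpointing needs of cluster services well.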

19.

Communication-based prevention of useless checkpoints in distributed computations

J.-M. Hélary A. Mostefaoui R.H.B. Netzer M. Raynal 《Distributed Computing》2000,13(1):29-43

Summary. A useless checkpoint is a local checkpoint that cannot be part of a consistent global checkpoint. This paper addresses the following problem. Given a set of processes that take (basic) local checkpoints in an independent and unknown way, the problem is to design communication-induced checkpointing protocols that direct processes to take additional local (forced) checkpoints to ensure no local checkpoint is useless. The paper first proves two properties related to integer timestamps which are associated with each local checkpoint. The first property is a necessary and sufficient condition that these timestamps must satisfy for no checkpoint to be useless. The second property provides an easy timestamp-based determination of consistent global checkpoints. Then, a general communication-induced checkpointing protocol is proposed. This protocol, derived from the two previous properties, actually defines a family of timestamp-based communication-induced checkpointing protocols. It is shown that several existing checkpointing protocols for the same problem are particular instances of the general protocol. The design of this general protocol is motivated by the use of communication-induced checkpointing protocols in “consistent global checkpoint”-based distributed applications such as the detection of stable or unstable properties and the determination of distributed breakpoints. Received: July 1997 / Accepted: August 1999
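A simple member of the timestamp-based family can be sketched as follows. The rule shown is the classical one in which each process keeps an integer clock, stamps every checkpoint with it, piggybacks it on outgoing messages, and takes a forced checkpoint before delivering any message that carries a larger timestamp. It is offered as an illustration of the general idea, not as the paper's general protocol; the class and method names are hypothetical.

```python
class Process:
    """One process in a timestamp-based communication-induced checkpointing sketch."""

    def __init__(self, name):
        self.name = name
        self.clock = 0
        # Each checkpoint is recorded as (kind, timestamp).
        self.checkpoints = [("initial", 0)]

    def basic_checkpoint(self):
        """Checkpoint taken independently by the application or a timer."""
        self.clock += 1
        self.checkpoints.append(("basic", self.clock))

    def send(self):
        """Return the timestamp piggybacked on an outgoing message."""
        return self.clock

    def receive(self, message_timestamp):
        """Take a forced checkpoint first if the message is from a later interval."""
        if message_timestamp > self.clock:
            self.clock = message_timestamp
            self.checkpoints.append(("forced", self.clock))
        # The message would be delivered to the application only after this point.
```

Under this rule, a message sent after a checkpoint with timestamp t is never delivered before the receiver's own checkpoint with timestamp t, which is what rules out orphan messages between checkpoints that share a timestamp.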

20.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids (total citations: 1; self-citations: 0; citations by others: 1)

Chtepen M. Claeys F.H.A. Dhoedt B. De Turck F. Demeester P. Vanrolleghem P.A. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):180-190

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above-mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.