
1.

王一拙 陈旭 计卫星 苏岩 王小军 石峰 《软件学报》 (Journal of Software) 2016,27(7):1789-1804

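The task-parallel programming model has become the mainstream of parallel programming; it improves the system performance of parallel computers by exploiting task parallelism. This paper proposes a fault-tolerant task-parallel programming model that integrates fault-tolerance techniques into the task-parallel programming model, improving system reliability while preserving performance. The model treats the task as the basic unit of scheduling, execution, error detection, and recovery, and implements fault-tolerance support at the application level. A Buffer-Commit computation model supports the detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance against hardware errors at a low performance overhead.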

2.

Modeling of hierarchical distributed systems with fault-tolerance

Shieh Y.-B. Ghosal D. Chintamaneni P.R. Tripathi S.K. 《IEEE transactions on pattern analysis and machine intelligence》1990,16(4):444-457

Since each of the levels in a hierarchical system could have various characteristics, different fault-tolerant schemes could be appropriate at different levels. A stochastic Petri net (SPN) is used to investigate various fault-tolerant schemes in this context. The basic SPN is augmented by parameterized subnet primitives to model the fault-tolerant schemes. Both centralized and distributed fault-tolerant schemes are considered. The two schemes are investigated by considering the individual levels in a hierarchical system independently. In the case of distributed fault tolerance, two different checkpointing strategies are considered. The first scheme is called the arbitrary checkpointing strategy. Each process in this scheme does its checkpointing independently; thus, the domino effect may occur. The second scheme is called the planned strategy. Here, process checkpointing is constrained to ensure no domino effect. The results show that, under certain conditions, an arbitrary checkpointing strategy can perform better than a planned strategy. The effect of integration on the fault-tolerant strategies of the various levels of a hierarchy is studied.

3.

An Improved Fast N+1 Parity Checkpointing Scheme

周军海 张大方 杨金民 《计算机工程与科学》 (Computer Engineering & Science) 2005,27(4):11-13

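Combining buffering with incremental disk-based checkpointing, this paper proposes a fast and reliable improved N+1 parity checkpointing scheme. While N application processes run, a dedicated checkpoint process is set up to implement N+1 parity, and the checkpoint machine uses the idle time between checkpoints to save the incremental parity checkpoint information to stable storage. The improved algorithm exploits the speed of diskless checkpointing and the high reliability of disk-based checkpointing, eliminates one backup processor, and can tolerate two concurrent failures: one application process and one checkpoint process.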
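The N+1 parity idea in the abstract above can be illustrated with a minimal sketch. It assumes equal-sized, in-memory checkpoint images and bytewise XOR parity; the function names are hypothetical, and the buffering and incremental disk writes of the paper's scheme are not modeled.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("checkpoints must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def parity_checkpoint(checkpoints: list[bytes]) -> bytes:
    """Parity block a (hypothetical) dedicated checkpoint process would hold."""
    return reduce(xor_bytes, checkpoints)


def recover_lost(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the N-1 survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Three application processes with equal-sized checkpoint images (assumed).
states = [b"proc0-state", b"proc1-state", b"proc2-state"]
parity = parity_checkpoint(states)

# Suppose process 1 fails: XOR of the surviving images and the parity
# yields its lost image.
rebuilt = recover_lost([states[0], states[2]], parity)
assert rebuilt == states[1]
```

Because XOR is its own inverse, one parity block is enough to rebuild any single lost image, which is why plain N+1 parity tolerates only one failure among the application processes at a time.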

4.

Fault-Tolerant Matrix Operations for Networks of Workstations Using Diskless Checkpointing

James S. Plank Youngbae Kim Jack J. Dongarra 《Journal of Parallel and Distributed Computing》1997,43(2):427

Networks of workstations (NOWs) offer a cost-effective platform for high-performance, long-running parallel computations. However, these computations must be able to tolerate the changing and often faulty nature of NOW environments. We present high-performance implementations of several fault-tolerant algorithms for distributed scientific computing. The fault-tolerance is based on diskless checkpointing, a paradigm that uses processor redundancy rather than stable storage as the fault-tolerant medium. These algorithms are able to run on clusters of workstations that change over time due to failure, load, or availability. As long as there are at least n processors in the cluster, and failures occur singly, the computation will complete in an efficient manner. We discuss the details of how the algorithms are tuned for fault-tolerance and present the performance results on a PVM network of Sun workstations connected by a fast, switched Ethernet.

5.

An efficient checkpointing method for multicomputers with wormhole routing

Kai Li Jeffrey F. Naughton James S. Plank 《International journal of parallel programming》1991,20(3):159-180

Efficient checkpointing and resumption of multicomputer applications is essential if multicomputers are to support time-sharing and the automatic resumption of jobs after a system failure. We present a checkpointing scheme that is transparent, imposes overhead only during checkpoints, requires minimal message logging, and allows for quick resumption of execution from a checkpointed image. Furthermore, the checkpointing algorithm allows each processor p to continue running the application being checkpointed except during the time that p is actively taking a local snapshot, and requires no global stop or freeze of the multicomputer. Since checkpointing multicomputer applications poses requirements different from those posed by checkpointing general distributed systems, existing distributed checkpointing schemes are inadequate for multicomputer checkpointing. Our checkpointing scheme makes use of special properties of wormhole routing networks to satisfy this new set of requirements.

6.

An Index-Based Quasi-Synchronous Checkpointing Protocol (cited by: 3 in total; self-citations: 0; citations by others: 3)

罗元盛 闵应骅 张大方 《计算机学报》 (Chinese Journal of Computers) 2005,28(10):1620-1625

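In index-based distributed checkpointing algorithms, minimizing the number of globally consistent checkpoints and forced checkpoints is important for computational efficiency. Building on existing index-based checkpointing algorithms, this paper proposes a new checkpointing protocol that both reduces the number of checkpoints and keeps the checkpoints of the individual processes synchronized in real time, so that rolling back after a failure does not incur excessive overhead or lose too much useful computation. Simulation experiments show that, under the proposed protocol, the average number of forced checkpoints induced per message is 23.2% lower on average than with the traditional method.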

7.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (cited by: 1 in total; self-citations: 0; citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy-to-use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

8.

Checkpointing in Distributed Computing Systems

《Journal of Parallel and Distributed Computing》1996,35(1):67-75

This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. The analytic results for synchronous checkpointing without load redistribution are compared to measurements of a synthetic parallel algorithm with user-level checkpointing. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are used to determine when load redistribution is advantageous.
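As a rough illustration of what an "optimum checkpoint interval" looks like, the sketch below uses Young's classical first-order approximation, sqrt(2 × checkpoint cost × mean time between failures). This is a swapped-in textbook formula, not the multi-node, state-dependent model of the paper above, and the function and parameter names are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimum checkpoint interval.

    Assumption: this is the classical single-node formula, NOT the model
    developed in the paper; it ignores restart cost and load redistribution.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Example with assumed numbers: a 60 s checkpoint and a 24 h MTBF.
interval = young_interval(60.0, 24 * 3600.0)
print(f"checkpoint roughly every {interval / 60:.1f} minutes")
```

The square-root form captures the basic trade-off: checkpointing more often wastes time writing state, while checkpointing less often loses more work per failure.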

9.

An implementation of using remote memory to checkpoint processes

Shang‐Te Hsu Ruei‐Chuan Chang 《Software》1999,29(11):985-1004

Process checkpointing is a procedure which periodically saves the process states into stable storage. Most checkpointing facilities select hard disks for archiving. However, the disk seek time is limited by the speed of the read‐write heads, thus checkpointing a process to a local disk requires extensive disk bandwidth. In this paper, we propose an approach that exploits the memory on idle workstations as a faster storage for checkpointing. In our scheme, autonomous machines which submit jobs to the computation server offer their physical memory to the server for job checkpointing. Eight applications are used to measure the remote memory performance in four checkpointing policies. Experimental results show that remote memory reduces the overhead by at least 34.5 per cent for sequential checkpointing and 32.1 per cent for incremental checkpointing. Additionally, checkpointing a running process into remote memory requires only 60 per cent of the local disk checkpoint latency time. Copyright © 1999 John Wiley & Sons, Ltd.

10.

Design and analysis of an integrated checkpointing and recovery scheme for distributed applications

Ramamurthy B. Upadhyaya S. Bhargava B. 《Knowledge and Data Engineering, IEEE Transactions on》2000,12(2):174-186

An integrated checkpointing and recovery scheme which exploits the low latency and high coverage characteristics of a concurrent error detection scheme is presented. Message dependency, which is the main source of multistep rollback in distributed systems, is minimized by using a new message validation technique derived from the notion of concurrent error detection. The concept of a new global state matrix is introduced to track error checking and message dependency in a distributed system and assist in the recovery. The analytical model, algorithms and data structures to support an easy implementation of the new scheme are presented. The completeness and correctness of the algorithms are proved. A number of scenarios and illustrations that give the details of the analytical model are presented. The benefits of the integrated checkpointing scheme are quantified by means of simulation using an object-oriented test framework.

11.

Efficient Garbage Collection Schemes for Causal Message Logging with Independent Checkpointing

Ahn Jinho Min Sung-Gi Hwang Chong-Sun Yu Heonchang 《The Journal of supercomputing》2002,22(2):175-196

This paper presents three garbage collection schemes for causal message logging with independent checkpointing. The first scheme allows each process to autonomously remove useless log information in its volatile storage by piggybacking only some additional information without requiring any extra message and forced checkpoint. Additionally, it supports faster output commit than traditional schemes. The second scheme enables each process to remove a part of log information in the storage if more empty space is required. It reduces the number of processes participating in the garbage collection by using the size of the log information of each process. The third scheme is a hybrid scheme having the advantages of the two proposed schemes. Simulation results show that the third scheme significantly reduces the garbage collection overhead compared with the traditional schemes regardless of specific communication patterns of distributed applications.

12.

A New Approach for High Performance Computing Systems with Various Checkpointing Schemes

Gyung-Leen Park Hee Youn Yong 《The Journal of supercomputing》2005,33(1):65-78

Roll-forward recovery schemes were proposed to enhance the performance of fault-tolerant systems employing the checkpointing approach. In the roll-forward schemes, multiple processors are used for simultaneous roll-forward and validation processing. This paper proposes the sample comparison approach along with checkpointing, which further improves the performance by reducing the overhead imposed by checkpointing. We also develop general analytical models for estimating the availability, which are applicable to any checkpointing scheme. Performance comparisons reveal that the availabilities of the checkpointing schemes with sample comparison are higher than those of the schemes without it, while the required checkpoint interval is larger.

13.

A Low-Cost Checkpointing Technique for Distributed Databases

Jun-Lin Lin Margaret H. Dunham 《Distributed and Parallel Databases》2001,10(3):241-268

For distributed databases, checkpointing is used to ensure an efficient way to perform global reconstruction. However, the need for global reconstruction is infrequent. Most current checkpointing approaches for distributed databases are too expensive during run time. Some of them allow the checkpointing process to run in parallel with normal transactions at the cost of more data and resource contention, which in turn causes longer response time for normal transactions. Thus, an efficient way to checkpoint distributed databases is needed to avoid degrading the system performance. This paper presents a low-cost solution, called Loosely Synchronized Local Fuzzy Checkpointing (LSLFC), to these problems. LSLFC supports global reconstruction, and our performance study shows that LSLFC has little overhead during run time.

14.

Key Techniques for Checkpointing Unix Processes (cited by: 4 in total; self-citations: 0; citations by others: 4)

王春露 汪东升 《计算机工程与应用》 (Computer Engineering and Applications) 2002,38(1):90-93,136

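Checkpointing Unix processes is the foundation for implementing fault tolerance in distributed/parallel systems, replay debugging, process migration, system simulation, and job switching. This paper discusses key techniques, including saving and restoring the basic information of a Unix process checkpoint, file checkpointing, and optimizing checkpoint information, and finally introduces checkpointing tools such as Libckpt, Condor, and the authors' own Libcsm.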

15.

Efficient Checkpointing for Main-Memory Databases Serving Update-Intensive Applications

覃雄派 肖艳芹 曹巍 王珊 《计算机学报》 (Chinese Journal of Computers) 2009,32(11)

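For main-memory database systems serving update-intensive applications, the checkpointing technique should meet several key requirements: checkpoint operations should interfere with normal transaction processing as little as possible, access skew should be handled, fast database recovery should be supported, and the system should remain available during recovery. This paper proposes a transaction-consistent partitioned checkpointing technique. It uses a tuple-based dynamic multiversion concurrency control mechanism, which avoids lock conflicts between read and write transactions and raises system throughput. The checkpoint operation is implemented as a read-only transaction, so under multiversion concurrency control it does not block normal transaction processing. Because the checkpoint file is transaction-consistent, only the Redo log information of transactions needs to be recorded, and recovery requires only a single scan of the log file, which speeds up recovery. Priority-based partition loading and recovery lets data access requests from new transactions be satisfied quickly during recovery, which preserves system availability during recovery. Owing to a two-level version management mechanism and dynamic version sharing, the space overhead of multiversion management is reduced to an acceptable level. Experimental results show that the proposed checkpointing scheme achieves 27% higher system throughput than fuzzy checkpointing, while the space overhead of version management remains within an acceptable range, meeting the requirements of high-performance applications.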

16.

Accelerating incremental checkpointing for extreme-scale computing

《Future Generation Computer Systems》2014

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the past 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we describe libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads and a model outlining the viability and application efficiency increase of this technique, we show that hash-based incremental checkpointing can have significantly lower overheads and higher efficiency than traditional coordinated checkpointing approaches at the scales expected for future extreme-class systems.
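The hashing half of hash-based incremental checkpointing can be sketched as follows. This is a minimal CPU-only illustration, assuming a flat in-memory buffer, a 4096-byte page size, and SHA-256 digests; it does not use libhashckpt's API, page protection, or GPU hashing, and all names are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page granularity


def page_hashes(memory: bytearray, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Digest of every page in the buffer."""
    return [hashlib.sha256(memory[i:i + page_size]).digest()
            for i in range(0, len(memory), page_size)]


def incremental_checkpoint(memory: bytearray, previous_hashes: list[bytes],
                           page_size: int = PAGE_SIZE):
    """Return ({page_index: page_bytes} for changed pages, new hash list)."""
    current = page_hashes(memory, page_size)
    changed = {}
    for idx, digest in enumerate(current):
        # A page is saved if it is new or its digest differs from last time.
        if idx >= len(previous_hashes) or previous_hashes[idx] != digest:
            start = idx * page_size
            changed[idx] = bytes(memory[start:start + page_size])
    return changed, current


memory = bytearray(4 * PAGE_SIZE)

# First checkpoint: no previous hashes, so every page is written.
full, hashes = incremental_checkpoint(memory, [])
assert len(full) == 4

# Modify one byte in page 1; only that page appears in the next checkpoint.
memory[PAGE_SIZE + 10] = 0xFF
delta, hashes = incremental_checkpoint(memory, hashes)
assert list(delta) == [1]
```

Comparing digests detects pages that were written but ended up with unchanged content, which write-tracking through page protection alone would still mark as dirty.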

17.

A quasi-synchronous checkpointing algorithm that prevents contention for stable storage

D. Manivannan Q. Jiang Jianchang Yang M. Singhal 《Information Sciences》2008,178(15):3110-3117

Checkpointing and rollback recovery are established techniques for handling failures in distributed systems. Under synchronous checkpointing, each process involved in the distributed computation takes a checkpoint almost simultaneously. This causes contention for network stable storage and hence degrades performance, as processes may have to wait a long time for the checkpointing operation to complete. In this paper, we propose a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.

18.

A multi-cycle checkpointing protocol that ensures strict 1-rollback

Yi-Wei Ci Zhan Zhang De-Cheng Zuo Zhi-Bo Wu Xiao-Zong Yang 《Information Processing Letters》2012,112(20):788-793

In this paper, a checkpointing protocol based on loose synchronization is proposed. The protocol enables processes to take checkpoints at different frequencies so that each process can control its rollback distance. In traditional asynchronous and quasi-synchronous checkpointing protocols, the checkpoints that are not up-to-date may be used for recovery. As a result, the rollback distance is often difficult to control. In the proposed protocol, the checkpoint cycle of each process is dynamically adjusted using a pessimistic scheme so that strict 1-rollback is achieved; namely, one of the last two checkpoints of each process can be utilized for recovery.

19.

Independent checkpointing in a heterogeneous grid environment

Eugen Feller John Mehnert-Spahn Michael Schoettner Christine Morin 《Future Generation Computer Systems》2012,28(1):163-170

The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package. To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

20.

Design and analysis of a fault tolerant hybrid mobile scheme

Mostafa I.H. Abd-El-Barr Salman A. Khan 《Information Sciences》2007,177(12):2602-2620

Mobile computing systems provide users with access to information regardless of their geographical location. In these systems, Mobile Support Stations (MSSs) play the role of providing reliable and uninterrupted communication and computing facilities to mobile hosts. The failure of an MSS can cause interruption of services provided by the mobile system. Two basic schemes for tolerating the failure of MSSs exist in the literature. The first scheme is based on the principle of checkpointing used in distributed systems. The second scheme is based on state information replication of mobile hosts in a number of secondary support stations. Depending on the replication scheme used, the second approach is further classified as a pessimistic or an optimistic technique. In this paper, we propose a hybrid scheme which combines the pessimistic and the optimistic replication schemes. In the proposed scheme, an attempt is made to strike a balance between the long delay caused by the pessimistic scheme and the high memory requirements of the optimistic scheme. In order to find the best ratio of pessimistic to optimistic secondary stations in the proposed scheme, we used fuzzy logic. We also used simulation to compare the performance of the proposed scheme with those of the optimistic and the pessimistic schemes. Simulation results showed that the proposed scheme performs better than either scheme in terms of delay and memory requirements.