
1.

Performance evaluation of fault tolerance techniques in grid computing system

Fiaz Gul Khan Kalim Qureshi Babar Nazir 《Computers & Electrical Engineering》2010,36(6):1110-1122

Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. Different fault tolerance techniques (FTTs) are therefore critical for the efficient utilization of expensive resources in high performance grid computing systems, and are an important component of grid workflow management systems. This paper presents a performance evaluation of the most commonly used FTTs in grid computing systems. In this study, we considered different system-centric parameters, such as throughput, turnaround time, waiting time and network delay, for the evaluation of these FTTs. For a comprehensive evaluation we set up various conditions in which we vary the average percentage of faults in a system, along with different workloads, in order to find out the behavior of FTTs under these conditions. The empirical evaluation shows that the workflow-level alternative task techniques outperform task-level checkpointing techniques. This comparative study will help grid computing researchers understand the behavior and performance of different FTTs in detail.

2.

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Dawei Sun Guiran Chang Changsheng Miao Xingwei Wang 《The Journal of supercomputing》2013,66(1):193-228

Failures are normal rather than exceptional in cloud computing environments. Achieving high fault tolerance is one of the major obstacles to opening up a new era of high-serviceability cloud computing, as fault tolerance plays a key role in ensuring cloud serviceability. Fault tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet a high level of cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed by referring to the fault tolerance theories suitable for large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two different fault tolerance strategies, namely the checkpointing fault tolerance strategy and the data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model by combining the two fault tolerance models together to maximize the serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, response time, etc. Theoretical as well as experimental results conclusively demonstrate that the dynamic adaptive fault tolerance strategy DAFT has high potential, as it provides efficient fault tolerance enhancements, significant cloud serviceability improvement, and great SLO satisfaction. It efficiently and effectively achieves a trade-off for fault tolerance objectives in cloud computing environments.

3.

A reliable checkpoint storage strategy for grid

Sana Malik Babar Nazir Kalim Qureshi Imran Ali Khan 《Computing》2013,95(7):611-632

Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time. This makes the grid infrastructure unreliable in nature, resulting in delay and failure of executing jobs. Thus, fault tolerance becomes a vital aspect of the grid for realizing reliability, availability and quality of service. The most common technique for achieving fault tolerance in High Performance Computing is rollback recovery. It relies on the availability of checkpoints and the stability of storage media. Thus the checkpoints are replicated on storage media. This increases the job execution time if replication is not done in a proper manner. Furthermore, dedicating powerful resources solely to checkpoint storage results in a loss of the computation power of these resources. It may result in bottlenecks when the load on the network is high. To address the problem, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, the checkpoints are replicated on all checkpoint servers in the grid in a distributed manner. This decreases the checkpoint replication time and in turn improves the overall job execution time. Additionally, if a resource fails during execution of a job, RCSS restarts the job from its last valid checkpoint taken from any checkpoint server in the grid. Furthermore, to increase grid performance, CPU cycles of checkpoint servers are also utilized during high load on the network. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers. RCSS also reduces checkpoint wave completion time by 50 % with a varying number of clusters. Additionally, RCSS reduces replication time within a cluster by 39.5 %.
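The replicate-then-restart idea described in this abstract can be illustrated with a small sketch. It is not the RCSS implementation: the `CheckpointServer` class, the `replicate` and `last_valid_checkpoint` functions, and the use of a SHA-256 digest to decide whether a checkpoint is "valid" are all assumptions made for illustration.

```python
import hashlib


class CheckpointServer:
    """Hypothetical in-memory checkpoint server (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}  # job_id -> list of (sequence, payload, digest)

    def save(self, job_id, seq, payload):
        # Keep a digest next to the payload so corruption can be detected later.
        digest = hashlib.sha256(payload).hexdigest()
        self.store.setdefault(job_id, []).append((seq, payload, digest))


def replicate(servers, job_id, seq, payload):
    """Store the checkpoint on every reachable server; return the replica count."""
    count = 0
    for server in servers:
        if server.alive:
            server.save(job_id, seq, payload)
            count += 1
    return count


def last_valid_checkpoint(servers, job_id):
    """Return (sequence, payload) of the newest intact checkpoint on any live server."""
    best = None
    for server in servers:
        if not server.alive:
            continue
        for seq, payload, digest in server.store.get(job_id, []):
            if hashlib.sha256(payload).hexdigest() != digest:
                continue  # skip corrupted copies (assumed validity check)
            if best is None or seq > best[0]:
                best = (seq, payload)
    return best
```

Because every live server holds a copy, a restart can proceed as long as at least one intact replica of some checkpoint survives; the sketch ignores network cost and server load, which the paper evaluates in GridSim.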

4.

Adaptive checkpointing strategy to tolerate faults in economy based grid (total citations: 3, self-citations: 2, citations by others: 1)

Babar Nazir Kalim Qureshi Paul Manuel 《The Journal of supercomputing》2009,50(1):1-18

In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index of grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare the “checkpointing fault tolerant job scheduling strategy” with the well-known time optimization heuristic in an economy based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerating faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.
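A minimal sketch of the fault-index idea follows. The abstract states only that the index is updated on successful or unsuccessful task completion and that it drives checkpointing intensity; the update rule (add one on failure, subtract one on success, floor at zero), the interval formula, and the names `FaultIndexTable`, `record` and `checkpoint_interval` are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FaultIndexTable:
    """Hypothetical per-resource fault index (not the paper's data structure)."""

    index: dict = field(default_factory=dict)

    def record(self, resource: str, succeeded: bool) -> None:
        current = self.index.get(resource, 0)
        if succeeded:
            # Assumed rule: a successful task lowers the index, never below zero.
            self.index[resource] = max(0, current - 1)
        else:
            # Assumed rule: a failed task raises the index by one.
            self.index[resource] = current + 1

    def checkpoint_interval(self, resource: str, base_interval: float = 600.0) -> float:
        # Assumed mapping: a higher fault index gives a shorter interval,
        # i.e. more intensive checkpointing on less reliable resources.
        return base_interval / (1 + self.index.get(resource, 0))
```

With these assumptions, a resource that has never failed is checkpointed at the base interval, while two consecutive failures cut the interval to one third of it.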

5.

Independent checkpointing in a heterogeneous grid environment

Eugen Feller John Mehnert-Spahn Michael Schoettner Christine Morin 《Future Generation Computer Systems》2012,28(1):163-170

The EU-funded XtreemOS project implements an open-source grid operating system based on Linux. In order to provide fault tolerance and migration for grid applications, it integrates a distributed grid-checkpointing service called XtreemGCP. This service is designed to support various checkpointing protocols and different checkpointer packages (e.g. BLCR, LinuxSSI, OpenVZ, etc.) in a transparent manner through a uniform checkpointer interface. In this paper, we present the integration of a backward error recovery protocol based on independent checkpointing into the XtreemGCP service. The solution we propose is not checkpointer bound and thus can be transparently used on top of any checkpointer package. To evaluate the prototype we run it within a heterogeneous environment composed of single-PC nodes and a Single System Image (SSI) cluster. The experimental results demonstrate the capability of the XtreemGCP service to integrate different checkpointing protocols and independently checkpoint a distributed application within a heterogeneous grid environment. Moreover, the performance evaluation also shows that our solution outperforms the existing coordinated checkpointing protocol in terms of scalability.

6.

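A checkpointing system suitable for large-scale cluster parallel computing (total citations: 5, self-citations: 1, citations by others: 4)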

Zhou Enqiang Lu Yutong Shen Zhiyu 《Journal of Computer Research and Development》2005,42(6):987-992

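Distributed checkpointing systems are an important means of fault tolerance for large-scale parallel computing systems. Protocol overhead and checkpoint image storage are the two major bottlenecks limiting the scalability of parallel checkpointing systems. Based on the execution characteristics of parallel applications and the architectural features of high-performance clusters, the C system uses dynamic virtual connections and distributed storage of checkpoint images, respectively, to effectively reduce the overhead of coordinated checkpointing and improve the scalability of the checkpointing system. Preliminary test results show that the design strategy of the C system is suitable for fault tolerance in large-scale parallel computing.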

7.

Key techniques of Unix process checkpointing (total citations: 4, self-citations: 0, citations by others: 4)

Wang Chunlu Wang Dongsheng 《Computer Engineering and Applications》2002,38(1):90-93,136

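Checkpointing of Unix processes is the basis for fault tolerance in distributed/parallel systems, replay debugging, process migration, system simulation, job switching and similar functions. This paper discusses key techniques including saving and restoring the basic information of a Unix process checkpoint, file checkpointing, and optimization of checkpoint information, and finally introduces checkpointing tools such as Libckpt, Condor, and the authors' own Libcsm.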

8.

Survey of fault tolerant techniques for grid

S. Siva Sathya K. Syam Babu 《Computer Science Review》2010,4(2):101-120

Besides the dynamic nature of grids, which means that resources may enter and leave the grid at any time, in many cases outside of the applications’ control, grid resources are also heterogeneous in nature. Many grid applications will be running in environments where interaction faults are more likely to occur between disparate grid nodes. As resources may also be used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious. Due to the diverse faults and failure conditions, developing, deploying, and executing long running applications over the grid remains a challenge. So fault tolerance is an essential factor for grid computing. This paper presents an extensive survey of different fault tolerant techniques such as replication strategies, check-pointing mechanisms, scheduling policies, failure detection mechanisms and finally malleability and migration support for divide-and-conquer applications. These techniques are used according to the needs of the computational grid and the type of environment, resources, virtual organizations and job profile it is supposed to work with. Each has its own merits and demerits, which form the subject matter of this survey.

9.

A resource management and fault tolerance services in grid computing

《Journal of Parallel and Distributed Computing》2005,65(11):1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, among candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

10.

Supporting Cost-Effective Fault Tolerance in Distributed Message-Passing Applications with File Operations (total citations: 1, self-citations: 0, citations by others: 1)

Ouyang Jinsong Maheshwari Piyush 《The Journal of supercomputing》1999,14(3):207-232

In this paper we present an approach to reliable distributed computing, which incorporates fault tolerance into applications at low cost, in terms of both run-time performance and programming effort required to construct reliable application software. In our model fault tolerance is based on distributed consistent checkpointing and rollback-recovery integrated with a user-level reliable transmission protocol. By employing novel techniques and algorithms, our approach is distinguished from other consistent checkpointing schemes by the following features: first, minimum communication overhead for constructing a consistent distributed checkpoint and catching messages in transit during checkpointing; second, tolerance to message losses due to site failures or unreliable non-FIFO networks; and third, efficient checkpointing and recovery of persistent state, i.e., user files. Based on the model, a software library prototype called Libra has been implemented for supporting fault tolerance in distributed message-passing applications with file operations. The library provides an easy to use programming interface including message-passing and file I/O primitives, which hides the complexity of both fault-tolerant network communications and checkpointing and recovering user files from the application level. Experience with a number of long-running distributed applications shows that Libra can provide fault tolerance in a cost-effective manner.

11.

Performance and effectiveness trade‐off for checkpointing in fault‐tolerant distributed systems

Panagiotis Katsaros Lefteris Angelis Constantine Lazos 《Concurrency and Computation》2007,19(1):37-63

Checkpointing has a crucial impact on systems' performance and fault‐tolerance effectiveness: excessive checkpointing results in performance degradation, while deficient checkpointing incurs expensive recovery. In distributed systems with independent checkpoint activities there is no easy way to determine checkpoint frequencies optimizing response‐time and fault‐tolerance costs at the same time. The purpose of this paper is to investigate the potentialities of a statistical decision‐making procedure. We adopt a simulation‐based approach for obtaining performance metrics that are afterwards used for determining a trade‐off between checkpoint interval reductions and efficiency in performance. Statistical methodology including experimental design, regression analysis and optimization provides us with the framework for comparing configurations, which use possibly different fault‐tolerance mechanisms (replication‐based or message‐logging‐based). Systematic research also allows us to take into account additional design factors, such as load balancing. The method is described in terms of a standardized object replication model (OMG FT‐CORBA), but it could also be applied in other (e.g. process‐based) computational models. Copyright © 2006 John Wiley & Sons, Ltd.
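The trade-off named in the first sentence of this abstract (too many checkpoints cost performance, too few make recovery expensive) can be made concrete with a much simpler model than the paper's. The sketch below is not the statistical procedure the authors describe; it uses Young's well-known first-order approximation, in which the overhead per unit of work is roughly C/T + T/(2M) for checkpoint cost C, checkpoint interval T and mean time between failures M, and the minimizing interval is sqrt(2CM). The function names are chosen for illustration.

```python
import math


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order overhead: checkpointing cost plus expected rework after a failure."""
    if interval <= 0 or checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("all arguments must be positive")
    # checkpoint_cost / interval: time spent writing checkpoints.
    # interval / (2 * mtbf): on average half an interval is lost per failure.
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Interval minimizing overhead_fraction (Young's approximation)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("all arguments must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

For example, with a checkpoint cost of 2 time units and a mean time between failures of 100, the approximation gives an interval of 20; both halving and doubling that interval increase the modeled overhead.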

12.

An integrated security-aware job scheduling strategy for large-scale computational grids

Chao-Chin Ren-Yi 《Future Generation Computer Systems》2010,26(2):198-206

All existing fault-tolerance job scheduling algorithms for computational grids were proposed under the assumption that all sites apply the same fault-tolerance strategy. They all ignored that each grid site may have its own fault-tolerance strategy because each site is itself an autonomous domain. In fact, it is very common for multiple fault-tolerance strategies to be adopted at the same time in a large-scale computational grid. Various fault-tolerance strategies may have different hardware and software requirements. For instance, if a grid site employs the job checkpointing mechanism, each computational node must have the following ability: periodically, the computational node transmits the transient state of the job execution to the server. If a job fails, it will migrate to another computational node and resume from the last stored checkpoint. Therefore, in this paper we propose a genetic algorithm for job scheduling to address the problem of heterogeneous fault-tolerance mechanisms in a computational grid. We assume that the system supports four kinds of fault-tolerance mechanisms: job retry, job migration without checkpointing, job migration with checkpointing, and job replication. Because each fault-tolerance mechanism has different requirements for gene encoding, we also propose a new chromosome encoding approach to integrate the four kinds of mechanisms in a chromosome. The risk nature of the grid environment is also taken into account in the algorithm. The risk relationship between jobs and nodes is defined by the security demand and the trust level. Simulation results show that our algorithm achieves a shorter makespan and a lower job failure rate than the Min–Min and sufferage algorithms.
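One way to picture a chromosome that mixes the four mechanisms is sketched below. The abstract does not give the paper's encoding, so everything here is hypothetical: the `Gene` layout (a job, one or more nodes, and the mechanism of the hosting site), the rule that only replication genes carry several nodes, and the helper names `random_chromosome` and `is_valid`.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Mechanism(Enum):
    RETRY = "job retry"
    MIGRATE = "migration without checkpointing"
    MIGRATE_CKPT = "migration with checkpointing"
    REPLICATE = "job replication"


@dataclass(frozen=True)
class Gene:
    job: int
    nodes: tuple  # one node, or several nodes for a replicated job
    mechanism: Mechanism


def random_chromosome(num_jobs, node_mechanisms, replicas=2, rng=None):
    """Build one random schedule; node_mechanisms maps node name -> Mechanism."""
    rng = rng or random.Random()
    nodes = sorted(node_mechanisms)
    chromosome = []
    for job in range(num_jobs):
        primary = rng.choice(nodes)
        mechanism = node_mechanisms[primary]
        if mechanism is Mechanism.REPLICATE:
            # Replicated jobs are placed on several replication-capable nodes.
            peers = [n for n in nodes if node_mechanisms[n] is Mechanism.REPLICATE]
            assigned = tuple(sorted(rng.sample(peers, min(replicas, len(peers)))))
        else:
            assigned = (primary,)
        chromosome.append(Gene(job, assigned, mechanism))
    return chromosome


def is_valid(chromosome, node_mechanisms):
    """Check that every gene only uses nodes supporting its mechanism."""
    for gene in chromosome:
        if any(node_mechanisms[n] is not gene.mechanism for n in gene.nodes):
            return False
        if gene.mechanism is not Mechanism.REPLICATE and len(gene.nodes) != 1:
            return False
    return True
```

A genetic algorithm would then apply crossover and mutation to such chromosomes and score them, for instance by makespan and failure risk; those steps are outside this sketch.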

13.

Computing in the RAIN: a reliable array of independent nodes

Bohossian V. Fan C.C. LeMahieu P.S. Riedel M.D. Lihao Xu Bruck J. 《Parallel and Distributed Systems, IEEE Transactions on》2001,12(2):99-114

The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly-available video server, a highly-available Web server, and a distributed checkpointing system. Also, we describe a commercial product, Rainwall, built with the RAIN technology.

14.

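Research on a job scheduling model for fault-tolerant computational grids (total citations: 14, self-citations: 1, citations by others: 14)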

Jin Hai Chen Gang Zhao Meiping 《Journal of Computer Research and Development》2004,41(8):1382-1388

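The development of grid technology places higher demands on the efficiency and quality of service of grid systems. Based on a comprehensive study of current grid job scheduling environments, a stochastic Petri net model for job scheduling in fault-tolerant computational grids is proposed, together with a grid job dispatching strategy, a job selection strategy within computing sites, and performance evaluation metrics for fault-tolerant computational grids. Simulation experiments effectively analyze the performance of the fault-tolerant computational grid and show the impact of faults on different classes of jobs in the grid.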

15.

Software approaches for resilience of high performance computing systems: a survey

Jie JIA Yi LIU Guozhen ZHANG Yulin GAO Depei QIAN 《Frontiers of Computer Science》2023,17(4):174105

With the scaling up of high-performance computing systems in recent years, their reliability has been descending continuously. Therefore, system resilience has been regarded as one of the critical challenges for large-scale HPC systems. Various techniques and systems have been proposed to ensure the correct execution and completion of parallel programs. This paper provides a comprehensive survey of existing software resilience approaches. Firstly, a classification of software resilience approaches is presented; then we introduce major approaches and techniques, including checkpointing, replication, soft error resilience, algorithm-based fault tolerance, fault detection and prediction. In addition, challenges exposed by system-scale and heterogeneous architecture are also discussed.

16.

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Ifeanyi P. Egwutuoha David Levy Bran Selic Shiping Chen 《The Journal of supercomputing》2013,65(3):1302-1326

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and also survey the fault tolerance approaches for HPC systems and issues with these approaches. Rollback-recovery techniques are discussed because they are the ones most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

17.

Transparent three-phase Byzantine fault tolerance for parallel and distributed simulations

《Simulation Modelling Practice and Theory》2016

A parallel and distributed simulation (federation) is composed of a number of simulation components (federates). Since the federates may be developed by different participants and executed on different platforms, they are subject to Byzantine failures. Moreover, a failure may propagate in the federation, resulting in an epidemic effect. In this article, a three-phase (i.e., detection, location, and recovery) Byzantine Fault Tolerance (BFT) mechanism is proposed based on a transparent middleware approach. The replication, checkpointing and message logging techniques are integrated in the mechanism for the purpose of enhancing simulation performance and reducing fault tolerance cost. In addition, mechanisms are provided to remove the epidemic effects of Byzantine failures. Our experiments have verified the correctness of the three-phase BFT mechanism and illustrated its high efficiency and good scalability. For some simulation executions, the BFT mechanism may even achieve performance enhancement and Byzantine fault tolerance simultaneously.

18.

A survey of recoverable distributed shared virtual memory systems

Morin C. Puaut I. 《Parallel and Distributed Systems, IEEE Transactions on》1997,8(9):959-969

Distributed Shared Virtual Memory (DSVM) systems provide a shared memory abstraction on distributed memory architectures. Such systems ease parallel application programming because the shared-memory programming model is often more natural than the message-passing paradigm. However, the probability of failure of a DSVM increases with the number of sites. Thus, fault tolerance mechanisms must be implemented in order to allow processes to continue their execution in the event of a failure. This paper gives an overview of recoverable DSVMs (RDSVMs) that provide a checkpointing mechanism to restart parallel computations in the event of a site failure.

19.

FTRP: a new fault tolerance framework using process replication and prefetching for high-performance computing

Wei Hu Guang-Ming Liu Yan-Huang Jiang 《Journal of Zhejiang University, Series C (English Edition)》2018,19(10):1273-1290

As the scale of supercomputers rapidly grows, the reliability problem dominates the system availability. Existing fault tolerance mechanisms, such as periodic checkpointing and process redundancy, cannot effectively fix this problem. To address this issue, we present a new fault tolerance framework using process replication and prefetching (FTRP), combining the benefits of proactive and reactive mechanisms. FTRP incorporates a novel cost model and a new proactive fault tolerance mechanism to improve the application execution efficiency. The novel cost model, called the ‘work-most’ (WM) model, makes runtime decisions to adaptively choose an action from a set of fault tolerance mechanisms based on failure prediction results and application status. Similar to program locality, we observe the failure locality phenomenon in supercomputers for the first time. In the new proactive fault tolerance mechanism, process replication with process prefetching is proposed based on the failure locality, significantly avoiding losses caused by the failures regardless of whether they have been predicted. Simulations with real failure traces demonstrate that the FTRP framework outperforms existing fault tolerance mechanisms with up to 10% improvement in application efficiency for common failure prediction accuracy, and is effective for petascale systems and beyond.

20.

A task-parallel programming model with fault tolerance support

Wang Yizhuo Chen Xu Ji Weixing Su Yan Wang Xiaojun Shi Feng 《Journal of Software》2016,27(7):1789-1804

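Task-parallel programming models have become mainstream in parallel programming; they improve the system performance of parallel computers by exploiting task parallelism. This paper proposes a task-parallel programming model with fault tolerance support, which integrates fault tolerance techniques into the task-parallel programming model to improve system reliability while preserving performance. The model takes the task as the basic unit of scheduling, execution, error detection and recovery, and implements fault tolerance support at the application level. A Buffer-Commit computation model supports detection of and recovery from transient errors; application-level diskless checkpointing recovers from permanent errors of the node-failure type; and a fault-tolerant work-stealing task scheduling strategy provides dynamic load balancing. Experimental results show that the model provides fault tolerance for hardware errors at low performance overhead.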