
1.

Adaptive checkpointing strategy to tolerate faults in economy based grid (total citations: 3; self-citations: 2; citations by others: 1)

Babar Nazir, Kalim Qureshi, Paul Manuel. The Journal of Supercomputing, 2009, 50(1): 1-18

In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index of grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare the “checkpointing fault tolerant job scheduling strategy” with the well-known time optimization heuristic in an economy based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerates faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.
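The fault-index idea in this abstract can be illustrated with a short sketch. The abstract does not give the update rule or how the index maps to a checkpoint interval, so everything below (the `Resource` class, the +1/-1 update, and the `base_interval_s / (1 + fault_index)` formula) is a hypothetical assumption chosen only to show the shape of the mechanism, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    fault_index: int = 0  # assumed scale: higher means more observed failures


def update_fault_index(res: Resource, task_succeeded: bool) -> None:
    """Hypothetical rule: +1 on failure, -1 on success, never below 0."""
    if task_succeeded:
        res.fault_index = max(0, res.fault_index - 1)
    else:
        res.fault_index += 1


def checkpoint_interval(res: Resource, base_interval_s: float = 600.0) -> float:
    """Illustrative mapping: checkpoint more often on fault-prone resources."""
    return base_interval_s / (1 + res.fault_index)


# Two failures followed by one success on the same resource.
r = Resource("node-a")
for outcome in (False, False, True):
    update_fault_index(r, outcome)

interval = checkpoint_interval(r)
```

With these assumed rules, a resource that keeps failing is checkpointed at ever shorter intervals, while a reliable one drifts back toward the base interval.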

2.

Optimized storage strategies for the process address space in checkpoint systems

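Li Yanhong (李艳红), Meng Dan (孟丹), Zhou Yingchao (周应超), Wu Linping (武林平). Computer Engineering and Applications (《计算机工程与应用》), 2005, 41(29): 94-96, 113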

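As cluster systems grow in scale and component count, their combined failure rate keeps rising. Node failures cause jobs running on cluster nodes to fail midway, wasting large amounts of resources and even leaving many jobs unable to complete. Checkpoint systems give nodes good fault tolerance and have therefore become an important part of cluster operating system software. The process address space is an important part of what a checkpoint system must record, and how efficiently it is stored directly affects the performance of checkpoint operations. This paper proposes two optimized storage strategies for the process address space in checkpoint systems. The combined checkpoint file write strategy solves the sharp performance drop that the concurrent write mechanism suffers when application memory approaches physical memory. The A-O (Access-Order) process address space storage strategy changes the traditional storage order of the address space and substantially improves checkpoint performance for large-memory applications. In experiments, the A-O strategy cut the time overhead to as little as 50% of that of the traditional storage strategy.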

3.

Replication based fault tolerant job scheduling strategy for economy driven grid

Babar Nazir, Kalim Qureshi, Paul Manuel. The Journal of Supercomputing, 2012, 62(2): 855-873

In this paper, the problem of fault tolerance in grid computing is addressed and a novel adaptive task replication based fault tolerant job scheduling strategy for economy driven grid is proposed. The proposed strategy maintains a fault history of the resources, termed the resource fault index. The fault index entry for a resource is updated based on successful completion or failure of a task assigned to that grid resource. The Grid Resource Broker then replicates the task (submitting the same task to different backup resources) with different intensity, based on the vulnerability of the resource to faults as suggested by the resource fault index. Consequently, if a fault occurs at a resource, the results of the replicated task(s) on other backup resource(s) can be used. Hence, user job(s) can be completed within the specified deadline and assigned budget, even in the event of faults at the grid resource(s). Through extensive simulations, the performance of the proposed strategy is evaluated and compared with the Time Optimization and Checkpointing based Strategy in an economy driven grid environment. The experimental results demonstrate that in the presence of faults, the proposed fault tolerant strategy improves the number of tasks completed with varied deadline and fixed budget as well as the number of tasks completed with varied budget and fixed deadline. Additionally, the proposed strategy used a smaller percentage of deadline time compared with both the Time Optimization and Checkpointing based Strategy. Although the proposed strategy spends a greater percentage of the budget than the Time Optimization Strategy and the Checkpointing based Strategy, this is acceptable because the strategy targets time optimization, where the main objective is to maximize tasks completed within a given deadline. It can be concluded from the experiments that the proposed strategy shows improvement in satisfying the user QoS requirements. It can effectively schedule tasks and tolerate faults gracefully even in the presence of failures, but the costs are slightly higher in terms of budget consumption. Hence, the proposed fault tolerant strategy helps in sustaining the user's faith in the grid, by enabling the grid to deliver reliable and consistent performance in the presence of faults.

4.

A resource management and fault tolerance services in grid computing

Journal of Parallel and Distributed Computing, 2005, 65(11): 1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

5.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids (total citations: 1; self-citations: 0; citations by others: 1)

Chtepen M., Claeys F.H.A., Dhoedt B., De Turck F., Demeester P., Vanrolleghem P.A. IEEE Transactions on Parallel and Distributed Systems, 2009, 20(2): 180-190

A grid is a distributed computational and storage environment often composed of heterogeneous autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

6.

A survey of dynamic replication strategies for improving data availability in data grids

Tehmina Amjad, Muhammad Sher, Ali Daud. Future Generation Computer Systems, 2012, 28(2): 337-349

A data grid is a distributed collection of storage and computational resources that are not bounded within a geophysical location. It is a fast growing area of research, and providing efficient data access and maximum data availability is a challenging task. To achieve this, data is replicated to different sites. A number of data replication techniques have been presented for data grids. All replication techniques address some attributes such as fault tolerance, scalability, improved bandwidth consumption, performance, storage consumption, and data access time. In this paper, different issues involved in data replication are identified and different replication techniques are studied to find out which attributes are addressed in a given technique and which are ignored. A tabular representation of all those parameters is presented to facilitate the future comparison of dynamic replication techniques. The paper also includes some discussion about future work in this direction by identifying some open research problems.

7.

MidCloud: an agent‐based middleware for effective utilization of replicated Cloud services

Nader Mohamed, Jameela Al‐Jaroodi. Software, 2015, 45(3): 343-363

The Cloud relies heavily on resource replication to support the demands of the clients efficiently. Replicated Cloud services are distributed across large geographic areas and are accessible via the Internet. This paper describes MidCloud, an agent‐based middleware that provides Cloud clients with dynamic load balancing and fault tolerance mechanisms for effective utilization of replicated Cloud services and resources. MidCloud can be used to connect clients with multiple replicated Cloud services and provide fast and reliable service delivery from multiple replicas. Several approaches for load balancing and fault tolerance in distributed systems have been introduced; however, they require prior knowledge of the environment's operating conditions and/or constant monitoring of these conditions at run time so that applications can adjust the load and redistribute the tasks when operational conditions change and when failures occur. These techniques work well when there is no high communication delay. Yet, this is not true in the Cloud, where data storage and computation servers are scattered all over the world and communication delays are usually very high. MidCloud deploys approaches to reduce the negative impact of high and dynamic delays on the Cloud servers and the Internet. The experimental results show the positive effects of using MidCloud to provide efficient load balancing and fault tolerance. Copyright © 2013 John Wiley & Sons, Ltd.

8.

A hybrid fault tolerance technique in grid computing system (total citations: 1; self-citations: 0; citations by others: 1)

Kalim Qureshi, Fiaz Gul Khan, Paul Manuel, Babar Nazir. The Journal of Supercomputing, 2011, 56(1): 106-128

In order to achieve a high level of reliability and availability, the grid infrastructure should be foolproof and fault tolerant. Fault tolerance plays a key role in ensuring the availability and reliability of a grid system. Since the failure of resources affects job execution fatally, a fault tolerance service is essential to satisfy QoS requirements in grid computing.

9.

Mohammad Shorfuzzaman, Peter Graham, Rasit Eskicioglu. The Journal of Supercomputing, 2010, 51(3): 374-392

Data grids support access to widely distributed storage for large numbers of users accessing potentially many large files. Efficient access is hindered by the high latency of the Internet. To improve access time, replication at nearby sites may be used. Replication also provides high availability, decreased bandwidth use, enhanced fault tolerance, and improved scalability. Resource availability, network latency, and user requests in a grid environment may vary with time. Any replica placement strategy must be able to adapt to such dynamic behavior. In this paper, we describe a new dynamic replica placement algorithm, Popularity Based Replica Placement (PBRP), for hierarchical data grids which is guided by file “popularity”. Our goal is to place replicas close to clients to reduce data access time while still using network and storage resources efficiently. The effectiveness of PBRP depends on the selection of a threshold value related to file popularity. We also present Adaptive-PBRP (APBRP) that determines this threshold dynamically based on data request arrival rates. We evaluate both algorithms using simulation. Results for a range of data access patterns show that our algorithms can shorten job execution time significantly and reduce bandwidth consumption compared to other dynamic replication methods.

10.

Utility-aware parallel cleanup strategy for Spark checkpoint caches

Song Yixin (宋一鑫), Yu Junyang (于俊洋), He Xin (何欣), Wang Jinjiang (王锦江). Computer Systems & Applications (《计算机系统应用》), 2022, 31(4): 253-259

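Spark checkpoint cache data is cleaned up by the programmer only after a job finishes, which can let stale data accumulate and occupy memory. To address this problem, the paper analyzes the checkpoint execution mechanism and derives a model showing that the existing checkpoint cache cleanup method does not scale as the number of checkpoints grows. It proposes a checkpoint cache utility entropy model that senses how well checkpoint caches match memory slots and, using the principle of best utility matching, derives the best time to clean the checkpoint cache. The resulting utility-entropy-based parallel checkpoint cache cleanup (PCC) strategy optimizes memory resources by making the cache cleanup time approximately equal to the time at which the checkpoint is written to HDFS. Experimental results show that, in a multi-job environment with fair scheduling, the execution efficiency of unoptimized programs degrades as the number of checkpoints increases; with the PCC strategy, program execution time, power consumption, and GC time fall by up to 10.1%, 9.5%, and 19.5%, respectively, which effectively improves execution efficiency when there are many checkpoints.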

11.

Survey of fault tolerant techniques for grid

S. Siva Sathya, K. Syam Babu. Computer Science Review, 2010, 4(2): 101-120

Besides the dynamic nature of grids, which means that resources may enter and leave the grid at any time, in many cases outside of the applications’ control, grid resources are also heterogeneous in nature. Many grid applications will be running in environments where interaction faults are more likely to occur between disparate grid nodes. As resources may also be used outside of organizational boundaries, it becomes increasingly difficult to guarantee that a resource being used is not malicious. Due to the diverse faults and failure conditions, developing, deploying, and executing long running applications over the grid remains a challenge, so fault tolerance is an essential factor for grid computing. This paper presents an extensive survey of different fault tolerant techniques such as replication strategies, check-pointing mechanisms, scheduling policies, failure detection mechanisms and finally malleability and migration support for divide-and-conquer applications. These techniques are used according to the needs of the computational grid and the type of environment, resources, virtual organizations and job profile it is supposed to work with. Each has its own merits and demerits, which form the subject matter of this survey.

12.

Analysis of the grid checkpoint recovery service and its application programming interface

Zhang Lin (张琳), Yang Jing (杨静). Journal of Computer Applications (《计算机应用》), 2004, 24(7): 16-17, 21

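As a software fault-tolerance mechanism, checkpointing can be combined with grids, a newly emerged kind of wide-area distributed system, to better meet the fault-tolerance requirements of grid systems. The paper analyzes in detail the key points of the checkpoint rollback-recovery protocol, examines the GridCPR API in data grids, and proposes some improvements that better support fault detection and fault-tolerance services in grid systems.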

13.

A model of checkpoint behavior for applications that have I/O

León Betzabeth, Méndez Sandra, Franco Daniel, Rexachs Dolores, Luque Emilio. The Journal of Supercomputing, 2022, 78(13): 15404-15436

Due to the growth and complexity of computer systems, reducing the overhead of fault tolerance techniques has become important in recent years. One technique in fault tolerance is checkpointing, which saves a snapshot with the information that has been computed up to a specific moment, suspending the execution of the application and consuming I/O resources and network bandwidth. Characterizing the files that are generated when performing the checkpoint of a parallel application is useful to determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs checkpoints, and one of these characteristics is whether the application does I/O. In this paper, we present a model of checkpoint behavior for parallel applications that perform I/O; this depends on the application and on other factors such as the number of processes, the mapping of processes and the type of I/O used. These characteristics will also influence scalability, the resources consumed and their impact on the I/O system. Our model describes the behavior of the checkpoint size based on the characteristics of the system and the type (or model) of I/O used, such as the number of I/O aggregator processes, the buffering size utilized by the two-phase I/O optimization technique and components of collective file I/O operations. The BT benchmark and FLASH I/O are analyzed under different configurations of aggregator processes and buffer size to explain our approach. The model can be useful when selecting which type of checkpoint configuration is more appropriate according to the applications’ characteristics and resources available. Thus, the user will be able to know how much storage space the checkpoint consumes and how much the application consumes, in order to establish policies that help improve the distribution of resources.


14.

A low-cost coordinated checkpointing algorithm

Dang Hong'en (党红恩), Zhao Erping (赵尔平), Luo Weiqun (雒伟群). Digital Community & Smart Home (《数字社区&智能家居》), 2014, (4): 2394-2396

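Checkpointing, an effective fault-handling technique and fault-tolerance mechanism, is widely used in grid, distributed, and cloud computing systems. This paper proposes a non-blocking coordinated checkpointing algorithm that increases system reliability, allows checkpoints to be set flexibly, substantially reduces the number of synchronization messages, and shortens checkpoint creation time. Compared with typical related algorithms, the proposed algorithm uses fewer synchronization control messages and has lower cost: the time complexity introduced by synchronization control messages drops from the usual O(n²) to O(n), and only n-1 synchronization messages are needed.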

15.

Enhanced Dynamic Hierarchical Replication and Weighted Scheduling Strategy in Data Grid

Najme Mansouri, Gholam Hosein Dastghaibyfard. Journal of Parallel and Distributed Computing, 2013

The Data Grid provides massive aggregated computing resources and distributed storage space to deal with data-intensive applications. Due to the limitation of available resources in the grid as well as the production of large volumes of data, efficient use of the Grid resources becomes an important challenge. Data replication is a key optimization technique for reducing access latency and managing large data by storing data in a wise manner. Effective scheduling in the Grid can reduce the amount of data transferred among nodes by submitting a job to a node where most of the requested data files are available. In this paper two strategies are proposed. The first is a novel job scheduling strategy called Weighted Scheduling Strategy (WSS) that uses hierarchical scheduling to reduce the search time for an appropriate computing node. It considers the number of jobs waiting in a queue, the location of the required data for the job and the computing capacity of the sites. The second is a dynamic data replication strategy called Enhanced Dynamic Hierarchical Replication (EDHR) that improves file access time. This strategy is an enhanced version of the Dynamic Hierarchical Replication strategy. It uses an economic model for file deletion when there is not enough space for the replica. The economic model is based on the future value of a data file. Best replica placement plays an important role in obtaining maximum benefit from replication as well as reducing storage cost and mean job execution time, so it is also considered in this paper. The proposed strategies are implemented in OptorSim, the European Data Grid simulator. Experiment results show that the proposed strategies achieve better performance by minimizing the data access time and avoiding unnecessary replication.

16.

DAGMap: efficient and dependable scheduling of DAG workflow job in Grid (total citations: 1; self-citations: 1; citations by others: 0)

Haijun Cao, Hai Jin, Xiaoxin Wu, Song Wu, Xuanhua Shi. The Journal of Supercomputing, 2010, 51(2): 201-223

DAG has been extensively used in Grid workflow modeling. Since Grid resources tend to be heterogeneous and dynamic, efficient and dependable workflow job scheduling becomes essential. It is a great challenge to achieve minimum job completion time and high resource utilization efficiency while providing fault tolerance. Based on list scheduling and group scheduling, in this paper, we propose a novel scheduling heuristic called DAGMap. DAGMap consists of two phases, namely Static Mapping and Dependable Execution. Four salient features of DAGMap are: (1) Task grouping is based on dependency relationships and task upward priority; (2) Critical tasks are scheduled first; (3) Min-Min and Max-Min selective scheduling are used for independent tasks; and (4) Checkpoint server with cooperative checkpointing is designed for dependable execution. The experimental results show that DAGMap can achieve better performance than other previous algorithms in terms of speedup, efficiency, and dependability.
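Feature (3) of this abstract mentions Min-Min scheduling for independent tasks. The sketch below shows the generic, textbook Min-Min heuristic only; it is not DAGMap's selective Min-Min/Max-Min variant, and the function name, data layout, and tie-breaking by sorted name are assumptions made for illustration.

```python
def min_min(exec_time: dict[str, dict[str, float]]) -> dict[str, str]:
    """Generic Min-Min: exec_time[task][machine] is the estimated run time.

    Returns a mapping task -> machine.
    """
    machines = {m for row in exec_time.values() for m in row}
    ready = {m: 0.0 for m in machines}  # time at which each machine is free
    unscheduled = set(exec_time)
    assignment: dict[str, str] = {}

    while unscheduled:
        best = None  # (completion_time, task, machine)
        # Pick the task whose earliest possible completion time is smallest.
        for task in sorted(unscheduled):
            for machine in sorted(exec_time[task]):
                finish = ready[machine] + exec_time[task][machine]
                if best is None or finish < best[0]:
                    best = (finish, task, machine)
        finish, task, machine = best
        assignment[task] = machine
        ready[machine] = finish  # the machine is busy until this task ends
        unscheduled.remove(task)

    return assignment


# Hypothetical run-time estimates for three independent tasks on two machines.
estimates = {
    "t1": {"m1": 4.0, "m2": 6.0},
    "t2": {"m1": 3.0, "m2": 5.0},
    "t3": {"m1": 10.0, "m2": 2.0},
}
plan = min_min(estimates)
```

Max-Min differs only in the selection step: among the per-task earliest completion times it picks the largest, which tends to place long tasks first.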

17.

Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module

Amirreza Zarrabi, Khairulmizam Samsudin, Wan Azizun Wan Adnan. Journal of Grid Computing, 2013, 11(2): 187-210

Checkpoint/Restart is the ability to save the state of a running application so that it can later resume its execution from the time of the checkpoint. The technique has many potential applications, including establishing a fault-tolerant environment, improving system resource utilization, and true migration of a process. With increasing hardware speed and cluster size, the average time between failures has been reduced. Therefore, fault tolerance and the ability to checkpoint a process have become indispensable. Almost all platforms deployed for high-performance computing support process checkpoint/restart. Linux, one of the popular operating systems, does not provide a general purpose implementation. Some implementations are limited to a specific type of parallel programming library, confined to some unique well-behaved type of applications, or reliant on specific kernel features that are often missing. Most implementations demand the elaborate practice of recompiling a whole kernel to apply required patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux which provides the capability of dynamic extension to increase compatibility and reduce system overhead. It does not impose any requirement on the existence of a special facility in the operating system and can checkpoint/restart an application fully transparently and independently of its behavior. The system is implemented entirely as multiple loadable kernel modules, which results in ease of use and eliminates the burden of complex system administration.

18.

Estimation of fault-tolerance of the parallel control computing systems: A new approach

V. V. Eliseev, V. V. Ignatushchenko, I. Yu. Podshivalova. Automation and Remote Control, 2007, 68(6): 1083-1099

A new approach to estimating the fault-tolerance of parallel control computing systems relies on a mathematical model that determines the probability that an arbitrary set of interdependent jobs (tasks), with random job execution times and asynchronous job redundancy, completes successfully within a given scheduled time. The estimates were determined both for the standard execution of a set of tasks and for the case of a single malfunction (fault or failure) of any computing system processor detected during the execution of any job from the set. The basic distinction of this approach is that the numerical values of the reliability parameters (probabilities or intensities of faults or failures) of the computing resources are neither given nor used.

19.

A busy-time-based energy optimization algorithm for parallel scheduling

Cai Lijun (蔡立军), Pan Jiangbo (潘江波), Chen Lei (陈磊), He Tingqin (何庭钦). Computer Engineering & Science (《计算机工程与科学》), 2017, 39(1): 42-48

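Reducing server busy time is an effective way to save energy in parallel scheduling for cloud computing, but most existing busy-time-based energy-saving strategies sacrifice job scheduling performance and cannot be combined with other job scheduling algorithms that offer better scheduling performance. This paper proposes BTEOA, an effective busy-time-based energy optimization algorithm for parallel scheduling. First, the job request queue is divided into a job window and a non-job window according to the resources currently available on the servers. Second, job requests in the job window are matched to servers and scheduled so that the total busy time of all servers is locally optimal. Finally, once all job requests in the job window have completed, the non-job window is itself divided into a job window and a non-job window, and the process repeats until all job requests have been executed. The job queuing model is kept unchanged throughout scheduling, so job scheduling performance is not affected. Case analysis and experimental results show that BTEOA saves energy without affecting job scheduling performance and can be combined with other job scheduling algorithms.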

20.

Fault-Management in P2P-MPI

Stéphane Genaud, Emmanuel Jeannot, Choopan Rattanapoka. International Journal of Parallel Programming, 2009, 37(5): 433-461

We present in this paper a study on fault management in a grid middleware. The middleware is our home-grown software called P2P-MPI. This framework is MPJ compliant, allows users to execute message passing parallel programs, and its objective is to support environments using commodity hardware. Hence, running programs is failure prone and particular attention must be paid to fault management. Fault management covers two issues: fault-tolerance and fault detection. Fault-tolerance deals with the program execution: P2P-MPI provides a transparent fault tolerance facility based on replication of computations. Fault detection concerns the monitoring of the program execution by the system. The monitoring is done through a distributed set of modules called failure detectors. The contribution of this paper is twofold. The first contribution is the evaluation of the failure probability of an application depending on the replication degree. The failure probability depends on the execution length, and we propose a model to evaluate the duration of a replicated parallel program. Then, we give an expression of the replication degree required to keep the failure probability of an execution under a given threshold. The second contribution is a study of the advantages and drawbacks of several fault detection systems found in the literature. The criteria of our evaluation are the reliability of the failure detection service and the failure detection speed. We retain the binary round-robin protocol for its failure detection speed, and we propose a variant of this protocol which is more reliable than the application execution in any case. Experiments involving up to 256 processes, carried out on Grid’5000, show that the real detection times closely match the predictions.
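The relationship between replication degree and failure probability described in this abstract can be illustrated with a deliberately simplified model. The paper's own model also accounts for execution length and is not reproduced here; the sketch below assumes, purely for illustration, that every replica fails independently with the same probability `p` over the whole run and that the application fails as soon as any one process has lost all of its replicas. All function and parameter names are hypothetical.

```python
def app_failure_probability(p: float, n_procs: int, r: int) -> float:
    """Assumed model: a process is lost only if all r replicas fail (p**r);
    the application fails if at least one of n_procs processes is lost."""
    return 1.0 - (1.0 - p ** r) ** n_procs


def min_replication_degree(p: float, n_procs: int, threshold: float,
                           max_r: int = 64) -> int:
    """Smallest replication degree keeping the failure probability
    at or below the threshold, under the assumed model."""
    for r in range(1, max_r + 1):
        if app_failure_probability(p, n_procs, r) <= threshold:
            return r
    raise ValueError("threshold not reachable within max_r replicas")


# 16 processes, each replica failing with probability 0.1, target 1% failure.
degree = min_replication_degree(p=0.1, n_procs=16, threshold=0.01)
```

Under these assumptions three replicas per process still leave roughly a 1.6% chance of failure, so the search settles on four.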