1.

Adaptive Task Checkpointing and Replication: Toward Efficient Fault-Tolerant Grids (total citations: 1; self-citations: 0; citations by others: 1)

Chtepen M. Claeys F.H.A. Dhoedt B. De Turck F. Demeester P. Vanrolleghem P.A. 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(2):180-190

A grid is a distributed computational and storage environment often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed systems are periodic job checkpointing and replication. While very robust, both techniques can delay job execution if inappropriate checkpointing intervals and replica numbers are chosen. This paper introduces several heuristics that dynamically adapt the above-mentioned parameters based on information on grid status to provide high job throughput in the presence of failure while reducing the system overhead. Furthermore, a novel fault-tolerant algorithm combining checkpointing and replication is presented. The proposed methods are evaluated in a newly developed grid simulation environment, Dynamic Scheduling in Distributed Environments (DSiDE), which allows for easy modeling of dynamic system and job behavior. Simulations are run employing workload and system parameters derived from logs that were collected from several large-scale parallel production systems. Experiments have shown that adaptive approaches can considerably improve system performance, while the preference for one of the solutions depends on particular system characteristics, such as load, job submission patterns, and failure frequency.

2.

Adaptive checkpointing strategy to tolerate faults in economy based grid (total citations: 3; self-citations: 2; citations by others: 1)

Babar Nazir Kalim Qureshi Paul Manuel 《The Journal of supercomputing》2009,50(1):1-18

In this paper, we develop a fault tolerant job scheduling strategy in order to tolerate faults gracefully in an economy based grid environment. We propose a novel adaptive task checkpointing based fault tolerant job scheduling strategy for an economy based grid. The proposed strategy maintains a fault index of grid resources. It dynamically updates the fault index based on successful or unsuccessful completion of an assigned task. Whenever a grid resource broker has tasks to schedule on grid resources, it makes use of the fault index from the fault tolerant schedule manager in addition to using a time optimization heuristic. While scheduling a grid job on a grid resource, the resource broker uses the fault index to apply different intensities of task checkpointing (inserting checkpoints in a task at different intervals). To simulate and evaluate the performance of the proposed strategy, this paper enhances the GridSim Toolkit-4.0 to exhibit fault tolerance related behavior. We also compare the “checkpointing fault tolerant job scheduling strategy” with the well-known time optimization heuristic in an economy based grid environment. From the measured results, we conclude that even in the presence of faults, the proposed strategy effectively schedules grid jobs, tolerating faults gracefully, and executes more jobs successfully within the specified deadline and allotted budget. It also improves the overall execution time and minimizes the execution cost of grid jobs.
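
The fault-index idea in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the `Resource` class, the +1/-1 update rule, the 600-second base interval, and the formula that shortens the interval as the index grows are all assumptions chosen only to show how a per-resource fault index could drive checkpointing intensity.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    """A grid resource with a fault index (hypothetical structure)."""
    name: str
    fault_index: int = 0  # higher means the resource has failed more often


def record_outcome(resource: Resource, succeeded: bool) -> None:
    """Update the fault index after a task finishes (assumed +1/-1 rule)."""
    if succeeded:
        # Never drop below zero, so a long success streak does not go negative.
        resource.fault_index = max(0, resource.fault_index - 1)
    else:
        resource.fault_index += 1


def checkpoint_interval(resource: Resource, base_interval: float = 600.0) -> float:
    """Seconds between checkpoints; shrinks as the fault index grows (assumed formula)."""
    return base_interval / (1 + resource.fault_index)


r = Resource("node-a")
record_outcome(r, succeeded=False)
record_outcome(r, succeeded=False)
record_outcome(r, succeeded=True)
print(r.fault_index, checkpoint_interval(r))
```

A resource that fails often ends up with a shorter interval and therefore more frequent checkpoints, while a reliable resource pays less checkpointing overhead.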

3.

Prediction of Resource Availability in Fine-Grained Cycle Sharing Systems Empirical Evaluation

Xiaojuan Ren Seyong Lee Rudolf Eigenmann Saurabh Bagchi 《Journal of Grid Computing》2007,5(2):173-195

Fine-Grained Cycle Sharing (FGCS) systems aim at utilizing the large amount of computational resources available on the Internet. In FGCS, host computers allow guest jobs to utilize the CPU cycles if the jobs do not significantly impact the local users. Such resources are generally provided voluntarily and their availability fluctuates highly. Guest jobs may fail unexpectedly as resources become unavailable. To improve this situation, we consider methods to predict resource availability. This paper presents empirical studies on resource availability in FGCS systems and a prediction method. From studies on resource contention among guest jobs and local users, we derive a multi-state availability model. The model enables us to detect resource unavailability in a non-intrusive way. We analyzed the traces collected from a production FGCS system for 3 months. The results suggest the feasibility of predicting resource availability, and motivate our method of applying semi-Markov Process models for the prediction. We describe the prediction framework and its implementation in a production FGCS system, named iShare. Through the experiments on an iShare testbed, we demonstrate that the prediction achieves an accuracy of 86% on average and outperforms linear time series models, while the computational cost is negligible. Our experimental results also show that the prediction is robust in the presence of irregular resource availability. We tested the effectiveness of the prediction in a proactive scheduler. Initial results show that applying availability prediction to job scheduling reduces the number of jobs that fail due to resource unavailability. This work was supported, in part, by the National Science Foundation under Grants No. 0103582-EIA, 0429535-CCF, and 0650016-CNS. We thank Ruben Torres for his help with the reference prediction algorithms used in our experiments.

4.

An adaptive dynamic grid job scheduling method based on historical information

许兰 朱巧明 贡正仙 李培峰 《计算机应用与软件》2008,25(10)

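Much research has been done, in China and abroad, on job scheduling algorithms for grids, and many scheduling algorithms have been proposed. However, these algorithms do not adapt well to the dynamic, autonomous, and distributed nature of grids. To address this, a dynamic grid job scheduling method is proposed: ASHI, an adaptive dynamic grid job scheduling method based on historical information. The method uses the execution information of recent jobs on each resource to adaptively adjust a prediction model. Then, taking into account factors such as the grid's dynamic and real-time nature, it selects resources using feedback and submits jobs to lightly loaded resources for execution. Experiments show that ASHI not only schedules jobs promptly and effectively but also improves the throughput of the whole grid and balances the system load.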

5.

Semantic-enabled CARE Resource Broker (SeCRB) for managing grid and cloud environment

Thamarai Selvi Somasundaram Kannan Govindarajan Usha Kiruthika Rajkumar Buyya 《The Journal of supercomputing》2014,68(2):509-556

Grid computing is mainly helpful for executing high-performance computing applications. However, conventional grid resources sometimes fail to offer a dynamic application execution environment and this increases the rate at which the job requests of users are rejected. Integrating emerging virtualization technologies in grid and cloud computing facilitates the provision of dynamic virtual resources in the required execution environment. Resource brokers play a significant role in managing grid and cloud resources as well as identifying potential resources that satisfy users’ application requests. This research paper proposes a semantic-enabled CARE Resource Broker (SeCRB) that provides a common framework to describe grid and cloud resources, and to discover them in an intelligent manner by considering software, hardware and quality of service (QoS) requirements. The proposed semantic resource discovery mechanism classifies the resources into three categories, viz., exact, high-similarity subsume and high-similarity plug-in regions. To achieve the necessary user QoS requirements, we have included a service level agreement (SLA) negotiation mechanism that pairs users’ QoS requirements with matching resources to guarantee the execution of applications, and to achieve the desired QoS of users. Finally, we have implemented the QoS-based resource scheduling mechanism that selects the resources from the SLA negotiation accepted list in an optimal manner. The proposed work is simulated and evaluated by submitting real-world bio-informatics and image processing applications for various test cases. The results of the experiment show that, for jobs submitted to the resource broker, the job rejection rate is reduced while job success and scheduling rates are increased, thus making the resource management system more efficient.

6.

Incentive-Based Scheduling for Market-Like Computational Grids (total citations: 1; self-citations: 0; citations by others: 1)

《Parallel and Distributed Systems, IEEE Transactions on》2008,19(7):903-913

A sustainable, market-like computational grid has two characteristics: it must allow resource providers and resource consumers to make autonomous scheduling decisions; and both providers and consumers must have sufficient incentives to stay and play in the market. In this paper, we formulate this intuition of optimizing incentives for both parties as a dual-objective scheduling problem. The two objectives identified are to maximize the success rate of job execution, and to minimize fairness deviation among resources. The challenge is to develop a grid scheduling scheme that enables individual participants to make autonomous decisions while producing a desirable emergent property in the grid system, namely, that the two objectives are achieved simultaneously. We present an incentive-based scheduling scheme which utilizes a peer-to-peer decentralized scheduling framework, a set of local heuristic algorithms, and three market instruments: job announcement, price, and competition degree. The performance of this scheme is evaluated via extensive simulation using synthetic and real workloads. The results show that our approach outperforms other scheduling schemes in optimizing incentives for both consumers and providers, leading to highly successful job execution and fair profit allocation.

7.

A heterogeneous computing system for data mining workflows in multi-agent environments

Ping Luo Kevin Lü Rui Huang Qing He Zhongzhi Shi 《Expert Systems》2006,23(5):258-272

The computing-intensive data mining (DM) process calls for the support of a heterogeneous computing system, which consists of multiple computers with different configurations connected by a high-speed large-area network for increased computational power and resources. The DM process can be described as a multi-phase pipeline process, and in each phase there could be many optional methods. This makes the workflow for DM very complex and it can be modeled only by a directed acyclic graph (DAG). A heterogeneous computing system needs an effective and efficient scheduling framework, which orchestrates all the computing hardware to perform multiple competitive DM workflows. Motivated by the need for a practical solution of the scheduling problem for the DM workflow, this paper proposes a dynamic DAG scheduling algorithm according to the characteristics of an execution time estimation model for DM jobs. Based on an approximate estimation of job execution time, this algorithm first maps DM jobs to machines in a decentralized and diligent (defined in this paper) manner. Then the performance of this initial mapping can be improved through job migrations when necessary. The scheduling heuristic used considers the factors of both the minimal completion time criterion and the critical path in a DAG. We implement this system in an established multi-agent system environment, in which the reuse of existing DM algorithms is achieved by encapsulating them into agents. The system evaluation and its usage in oil well logging analysis are also discussed.

8.

A resource management and fault tolerance services in grid computing

《Journal of Parallel and Distributed Computing》2005,65(11):1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

9.

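Research on a job scheduling model for fault-tolerant computational grids (total citations: 14; self-citations: 1; citations by others: 14)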

金海 陈刚 赵美平 《计算机研究与发展》2004,41(8):1382-1388

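The development of grid technology places higher demands on the efficiency and quality of service of grid systems. Based on a comprehensive study of current grid job scheduling environments, a stochastic Petri net model of job scheduling for fault-tolerant computational grids is proposed, together with a grid job dispatching policy, a job selection policy within computing sites, and performance evaluation metrics for fault-tolerant computational grids. Simulation experiments analyze the performance of the fault-tolerant computational grid and show the impact of failures on different classes of jobs in the grid.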

10.

A dynamic job configuration framework for mining software repositories

史殿习 尹刚 米海波 袁霖 王怀民 《计算机科学》2011,38(7):113-116

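Building data centers for mining software repositories is a current research focus in software engineering. Software repository data processing jobs vary markedly in execution time and consume substantial resources, which poses many challenges for job configuration. A job configuration framework for mining software repositories, TrustieSDC, is proposed. The framework supports a new remote job deployment and service mode and uses a dynamic job configuration algorithm based on software version partitioning to shorten the response time of long jobs and improve system resource utilization. Experiments on the SVN repository of the Gnome project show that the performance and resource utilization of TrustieSDC improve markedly on those of the parallelized Alitheia.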

11.

A framework for adaptive execution in grids

Eduardo Huedo Ruben S. Montero Ignacio M. Llorente 《Software》2004,34(7):631-651

Grids offer a dramatic increase in the number of available processing and storage resources that can be delivered to applications. However, efficient job submission and management remain far from accessible to ordinary scientists and engineers due to the dynamic and complex nature of grids. This paper describes a new Globus based framework that allows an easier and more efficient execution of jobs in a ‘submit and forget’ fashion. The framework automatically performs the steps involved in job submission and also watches over its efficient execution. In order to obtain a reasonable degree of performance, job execution is adapted to dynamic resource conditions and application demands. Adaptation is achieved by supporting automatic application migration following performance degradation, ‘better’ resource discovery, requirement change, owner decision or remote resource failure. The framework is currently functional on any Grid testbed based on Globus because it does not require new system software to be installed in the resources. The paper also includes practical experiences of the behavior of our framework on the TRGP and UCM‐CAB testbeds. Copyright © 2004 John Wiley & Sons, Ltd.

12.

Processor Allocation in Multiprogrammed Distributed-Memory Parallel Computer Systems

《Journal of Parallel and Distributed Computing》1997,46(1):28-47

In this paper, we examine three general classes of space-sharing scheduling policies under a workload representative of large-scale scientific computing. These policies differ in the way processors are partitioned among the jobs as well as in the way jobs are prioritized for execution on the partitions. We consider new static, adaptive and dynamic policies that differ from previously proposed policies by exploiting user-supplied information about the resource requirements of submitted jobs. We examine the performance characteristics of these policies from both the system and user perspectives. Our results demonstrate that existing static schemes do not perform well under varying workloads, and that the system scheduling policy for such workloads must distinguish between jobs with large differences in execution times. We show that obtaining good performance under adaptive policies requires some a priori knowledge of the job mix in these systems. We further show that a judiciously parameterized dynamic space-sharing policy can outperform adaptive policies from both the system and user perspectives.

13.

An adaptive migration model for grid jobs

王涛 周兴社 杨志义 刘亮 张海辉 《计算机工程》2008,34(12):56-57

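Job migration is an important means of guaranteeing quality of service for grid jobs and achieving high system efficiency. Based on an analysis of traditional process migration techniques and the characteristics of grid systems, this paper proposes an adaptive migration model for grid jobs that combines global jobs with local processes. It presents an adaptive migration strategy for grid jobs, principles for selecting the objects to migrate, a mechanism for determining when to migrate, and an algorithm implementing adaptive migration. Experimental results and deployment in a campus computational grid validate the effectiveness of the model.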

14.

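Research on a resource and job description language based on the grid computing market model (total citations: 1; self-citations: 0; citations by others: 1)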

陈颖 杨寿保 《计算机科学》2005,32(2):90-92

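The grid computing market model applies economic concepts to resource management and job scheduling in grids. This paper analyzes the requirements for resource and job description languages in the grid computing market model, briefly introduces the resource and job description language Classified Advertisements (Classad), and points out its shortcomings in describing resources and jobs in the grid computing market model. It then improves and extends the language accordingly to enable more flexible, fine-grained description of resources and jobs under the economic model.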

15.

A dynamic load balancing system supporting service grids

申德荣 陈翔宇 吕立昂 邵一川 于戈 《计算机工程》2006,32(21):124-126,129

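To distribute load evenly within a service grid system and to improve resource utilization and system throughput, a dynamic load balancing system for service grid environments is designed and implemented. A hierarchical load balancing scheduling mode is proposed, the system architecture is given, and a dynamic double-threshold job allocation algorithm is designed and implemented that jointly considers the number of jobs at each local agent, the performance of each local agent, and the current load. Experimental results show that the algorithm dispatches jobs effectively according to load, improving the utilization of distributed resources in the grid and reducing job scheduling time.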
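
The double-threshold allocation idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the `Agent` class, the load formula (queued jobs divided by a performance score), and the threshold values 0.5 and 0.9 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Agent:
    """A local agent in the hierarchy (hypothetical structure)."""
    name: str
    jobs: int            # jobs currently queued at this agent
    performance: float   # relative capacity score (assumed unit)

    @property
    def load(self) -> float:
        # Assumed load measure: queued jobs normalized by capacity.
        return self.jobs / self.performance


def pick_agent(agents: List[Agent], low: float = 0.5, high: float = 0.9) -> Optional[Agent]:
    """Choose a target agent using two load thresholds (assumed policy).

    Agents below `low` are lightly loaded and preferred; agents between
    `low` and `high` are used only when no light agent exists; agents at
    or above `high` are treated as overloaded and never receive new jobs.
    """
    light = [a for a in agents if a.load < low]
    normal = [a for a in agents if low <= a.load < high]
    candidates = light or normal
    if not candidates:
        return None  # every agent is overloaded; the caller must queue the job
    return min(candidates, key=lambda a: a.load)


agents = [Agent("a", 8, 10.0), Agent("b", 3, 10.0), Agent("c", 19, 20.0)]
chosen = pick_agent(agents)
print(chosen.name if chosen else "no agent available")
```

Using two thresholds rather than one separates "prefer this agent" from "never use this agent", which avoids sending new jobs to an agent that is already close to saturation.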

16.

On the Building of a Job Scheduler System for Globus Grid Environment

Sugree Phatanapherom Putchong Uthayopas 《计算机工程》2002,28(Z1)

17.

A Fair, Secure and Trustworthy Peer-to-Peer Based Cycle-Sharing System

Shuo Yang Ali R. Butt Xing Fang Y. Charlie Hu Samuel P. Midkiff 《Journal of Grid Computing》2006,4(3):265-286

The increased popularity of Grid systems and cycle sharing across organizations requires scalable systems that provide facilities to locate resources, to be fair in the use of those resources, to allow resource providers to host untrusted applications safely, and to allow resource consumers to monitor the progress and correctness of jobs executing on remote machines. This paper presents such a framework that locates computational resources with a peer-to-peer network, assures fair resource usage with a distributed credit accounting system, provides resource contributors a safe environment, for example Java Virtual Machine (JVM), to host untrusted applications, and provides the resource consumers a monitoring system, GridCop, to track the progress and correctness of remotely executing jobs. We present the details of the credit accounting subsystem and the GridCop remote job monitoring subsystem. GridCop and the distributed credit accounting system together enable incremental payments so that the risk for both resource providers and resource consumers is bounded. This work was supported by NSF CAREER award grant ACI-0238379 and NSF grants CCR-0313026 and CCR-0313033.

18.

Research on an SGE-based simulation grid and its job scheduling

张传富 刘云生 张童 查亚兵 《计算机仿真》2006,23(6):274-278

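Grid Engine is a tool for building local and cluster grids; its framework consists of four types of hosts and their corresponding daemons. This paper studies how to build a distributed simulation grid platform using the SGE framework and describes the workflow for executing user-submitted simulation tasks on the platform. It then discusses resource organization and job scheduling in the SGE-based simulation grid and analyzes the job scheduling algorithms used: the FIFO, priority, equal-share, and calendar algorithms for determining job order, and algorithms such as load adjustment and queue sequence number for determining queue order.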

19.

Design of an information-service-oriented grid resource manager (total citations: 2; self-citations: 0; citations by others: 2)

李培峰 朱巧明 支丽艳 《计算机工程》2008,34(3):49-51,5

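The architecture of an information-service-oriented grid resource manager is designed; it is divided into global and local managers. A new job scheduling algorithm is introduced that predicts the execution time of the current job from the execution times of historical jobs and considers two factors during scheduling: job execution time and deadline. Experiments show that the algorithm outperforms the commonly used Max-Min and Min-Min algorithms.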
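
One way to picture history-based runtime prediction combined with deadlines is the sketch below. It is an assumption-laden illustration rather than the algorithm from the paper: the mean-of-history predictor, the 60-second default for jobs with no history, the least-slack-first ordering, and the job dictionary layout are all hypothetical.

```python
from statistics import mean
from typing import Dict, List


def predict_runtime(history: List[float], default: float = 60.0) -> float:
    """Predict a job's runtime as the mean of its past runtimes (assumed predictor)."""
    return mean(history) if history else default


def schedule(jobs: Dict[str, dict]) -> List[str]:
    """Order jobs by slack, i.e. deadline minus predicted runtime (assumed policy).

    Jobs with the least slack run first, so both the predicted execution
    time and the deadline influence the order.
    """
    def slack(name: str) -> float:
        job = jobs[name]
        return job["deadline"] - predict_runtime(job["history"])

    return sorted(jobs, key=slack)


jobs = {
    "render": {"history": [100.0, 120.0, 110.0], "deadline": 400.0},
    "index": {"history": [30.0, 50.0], "deadline": 100.0},
    "backup": {"history": [], "deadline": 500.0},
}
order = schedule(jobs)
print(order)
```

Here `index` has only 60 seconds of slack and is scheduled first, while `backup`, which has no history and a distant deadline, goes last.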

20.

Workload management of cooperatively federated computing clusters

Percival Xavier Wentong Cai Bu-Sung Lee 《The Journal of supercomputing》2006,36(3):309-322

Cooperative resource sharing enables distinct organizations to form a federation of computing resources. The motivation behind cooperation is that organizations are likely to serve each other by trading unused CPU cycles given the existence of irregular usage patterns of their local resources. In this way, resource sharing would enable organizations to purchase resources at a feasible level while meeting peak computational throughput requirements. This federation results in a community grid that must be managed. A functional broker is deployed to facilitate remote resource access within the community grid. A major issue is the problem of correlations in job arrivals caused by seasonal usage and/or coincident resource usage demand patterns. These correlations incur high levels of burstiness in job arrivals, causing the job queue of the broker to grow to such an extent that its performance becomes severely impaired. Since job arrivals cannot be controlled, management strategies must be employed to admit jobs in a manner that can sustain a fair level of resource allocation performance at all participating organizations in the community. In this paper, we present a theoretical analysis of the effect of job traffic burstiness on resource allocation performance in order to elicit the general job management strategies to be employed. Based on the analysis, we define and justify job management strategies for the resource broker to cope with overload conditions caused by job arrival correlations.