1.

Fault-Tolerant Dynamic Rescheduling for Heterogeneous Computing Systems

Jing Mei Kenli Li Xu Zhou Keqin Li 《Journal of Grid Computing》2015,13(4):507-525

As the scale and complexity of heterogeneous computing systems grow, failures occur frequently and have an adverse effect on solving large-scale applications. Hence, fault-tolerant scheduling is an imperative step for large-scale computing systems. The existing fault-tolerant scheduling algorithms are static: they allocate multiple copies of each task to several processors regardless of whether processor failures affect the execution of tasks. Such active replication strategies not only waste resources but also sacrifice the makespan. What is more, they cannot guarantee the successful execution of applications. In this paper, we propose a fault-tolerant dynamic rescheduling algorithm named FTDR, which can overcome the above drawbacks. FTDR keeps listening for processor failures and reschedules the suspended tasks once failures occur. Because FTDR reschedules the tasks that are suspended because of failures, it can tolerate an arbitrary number of failures. Randomly generated DAGs are tested in our experiments. Experimental results show that the proposed algorithm achieves good performance in terms of makespan and resource consumption compared with its direct competitors.
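The rescheduling idea in this abstract (re-place only the tasks suspended by a processor failure) can be illustrated with a minimal Python sketch. This is not the FTDR algorithm itself: the `Processor` class, the earliest-finish-time placement rule, and the choice to restart a suspended task from scratch are all assumptions made for illustration, and task dependencies (the DAG) are ignored.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    speed: float          # work units per time unit (hypothetical model)
    free_at: float = 0.0  # time at which the processor becomes idle
    alive: bool = True


def assign(task_work, processors, now=0.0):
    """Place one task on the live processor that would finish it earliest."""
    live = [p for p in processors if p.alive]
    if not live:
        raise RuntimeError("no live processors left")
    best = min(live, key=lambda p: max(p.free_at, now) + task_work / p.speed)
    finish = max(best.free_at, now) + task_work / best.speed
    best.free_at = finish
    return best.name, finish


def reschedule_on_failure(schedule, work, failed, processors, now):
    """Re-place only the tasks that were on `failed` and unfinished at `now`.

    `schedule` maps task name -> (processor name, finish time) and is
    updated in place; the list of suspended tasks is returned.
    """
    for p in processors:
        if p.name == failed:
            p.alive = False
    suspended = sorted(
        t for t, (proc, finish) in schedule.items()
        if proc == failed and finish > now
    )
    for t in suspended:
        # Assumption: a suspended task restarts from the beginning.
        schedule[t] = assign(work[t], processors, now)
    return suspended
```

Tasks that completed before the failure keep their original placement, so no work is replicated in advance; the cost of a failure is paid only when one actually occurs.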

2.

Pro-active failure handling mechanisms for scheduling in grid computing environments

B.T. Benjamin Khoo Bharadwaj Veeravalli 《Journal of Parallel and Distributed Computing》2010

In this paper, we consider designing pro-active failure handling strategies for grid environments. These strategies estimate the availability of resources in the Grid, and also preemptively calculate the expected long-term capacity of the Grid. Using these strategies, we create modified versions of the backfill and replication algorithms to include all three pro-active strategies and to ascertain the effectiveness of each in preventing job failures during execution. Also, we extend our earlier work on a co-ordinate based allocation strategy. The extended algorithm also shows continual improvement when operating under the same execution environment. In our experiments, we compare these enhanced algorithms to their original forms, and show that pro-active failure handling is able to, in some cases, avoid all job failures during execution. Also, we show that, of the algorithms we have considered, NSA provides the best balance of enhanced throughput and job failures during execution.

3.

Adaptive fault-tolerant scheduling strategies for mobile cloud computing

Lee JongHyuk Gil JoonMin 《The Journal of supercomputing》2019,75(8):4472-4488

Mobile cloud computing is a form of cloud computing that incorporates mobile devices such as smartphones and tablet PCs into the cloud infrastructure. As mobile devices are resource-constrained in nature, new scheduling strategies are required when using them as resource providers. Based on our previous group-based scheduling algorithm, we present fault-tolerant scheduling algorithms considering checkpoint and replication mechanisms to actively cope with faults. We carried out the performance evaluation with simulation to demonstrate that our algorithm is more efficient than the existing one lacking fault tolerance in terms of accuracy rate, resource consumption, and average execution time. In particular, the average execution time was reduced by about 60%, resulting in the reduction of resource consumption.

4.

Reputation-based dependable scheduling of workflow applications in Peer-to-Peer Grids

Mustafizur Rahman Rajiv Ranjan Rajkumar Buyya 《Computer Networks》2010,54(18):3341-3359

Grids facilitate the creation of wide-area collaborative environments for sharing computing or storage resources and various applications. Inter-connecting distributed Grid sites through a peer-to-peer routing and information dissemination structure (also known as Peer-to-Peer Grids) is essential to avoid the problems of scheduling efficiency bottleneck and single point of failure in the centralized or hierarchical scheduling approaches. On the other hand, uncertainty and unreliability are facts in distributed infrastructures such as Peer-to-Peer Grids, which are triggered by multiple factors including scale, dynamism, failures, and incomplete global knowledge. In this paper, a reputation-based Grid workflow scheduling technique is proposed to counter the effect of inherent unreliability and temporal characteristics of computing resources in large-scale, decentralized Peer-to-Peer Grid environments. The proposed approach builds upon structured peer-to-peer indexing and networking techniques to create a scalable wide-area overlay of Grid sites for supporting dependable scheduling of applications. The scheduling algorithm considers reliability of a Grid resource as a statistical property, which is globally computed in the decentralized Grid overlay based on dynamic feedback or reputation scores assigned by individual service consumers mediated via Grid resource brokers. The proposed algorithm dynamically adapts to changing resource conditions and offers significant performance gains as compared to traditional approaches in the event of unsuccessful job execution or resource failure. The results evaluated through an extensive trace-driven simulation show that our scheduling technique can reduce the makespan by up to 50% and successfully isolate the failure-prone resources from the system.

5.

A resource management and fault tolerance services in grid computing

《Journal of Parallel and Distributed Computing》2005,65(11):1305-1317

In grid computing, resource management and fault tolerance services are important issues. The availability of the selected resources for job execution is a primary factor that determines the computing performance. In this paper, we propose a resource manager for optimal resource selection. Our resource manager uses a genetic algorithm to automatically select, from the candidate resources, the set of resources that achieves optimal performance. Typically, the probability of a failure is higher in grid computing than in traditional parallel computing, and the failure of resources affects job execution fatally. Therefore, a fault tolerance service is essential in computational grids. Grid services are also often expected to meet some minimum levels of Quality of Service (QoS) for a desirable operation. To address this issue, we also propose a fault tolerance service that satisfies QoS requirements. We extend the definition of failures from the conventional notion of failures in distributed systems in order to provide a fault tolerance service that deals with various types of resource failures, which include process failures, processor failures, and network failures. We also design and implement a fault detector and a fault manager. The implementation and simulation results indicate that our approaches are promising in that (1) the resource manager finds the optimal set of resources that guarantees efficient job execution, (2) the fault detector detects the occurrence of resource failures and (3) the fault manager guarantees that the submitted jobs complete and the performance of job execution is improved due to job migration even if some failures occur.

6.

A Flexible Framework for Fault Tolerance in the Grid (Cited by: 2 total; 0 self-citations; 2 by others)

Soonwook Hwang Carl Kesselman 《Journal of Grid Computing》2003,1(3):251-272

This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism on the Grid. The FDS enables the detection of both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is to detect those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome the challenge by using a notification mechanism which is based on the interpretation of notification messages being delivered from the underlying Grid resources. The Grid-WFS built on top of FDS allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve the flexibility by the use of workflow structure as a high-level recovery policy specification, which enables support for multiple failure recovery techniques, the separation of failure handling strategies from the application code, and user-defined exception handling. Finally, this paper presents an experimental evaluation of the Grid-WFS using a simulation, demonstrating the value of supporting multiple failure recovery techniques in Grid applications to achieve high performance in the presence of failures.

7.

P-GRADE: A Grid Programming Environment

P. Kacsuk G. Dózsa J. Kovács R. Lovas N. Podhorszki Z. Balaton G. Gombás 《Journal of Grid Computing》2003,1(2):171-197

P-GRADE provides a high-level graphical environment to develop parallel applications transparently both for parallel systems and the Grid. P-GRADE supports the interactive execution of parallel programs as well as the creation of a Condor, Condor-G or Globus job to execute parallel programs in the Grid. In P-GRADE, the user can generate either PVM or MPI code according to the underlying Grid where the parallel application should be executed. PVM applications generated by P-GRADE can migrate between different Grid sites and as a result P-GRADE guarantees reliable, fault-tolerant parallel program execution in the Grid. The GRM/PROVE performance monitoring and visualisation toolset has been extended towards the Grid and connected to a general Grid monitor (Mercury) developed in the EU GridLab project. Using the Mercury/GRM/PROVE Grid application monitoring infrastructure, any parallel application launched by P-GRADE can be remotely monitored and analysed at run time even if the application migrates among Grid sites. P-GRADE supports workflow definition and co-ordinated multi-job execution for the Grid. Such workflow management can provide parallel execution at both inter-job and intra-job level. An automatic checkpoint mechanism for parallel programs supports the migration of parallel jobs inside the workflow, providing a fault-tolerant workflow execution mechanism. The paper describes all of these features of P-GRADE and their implementation concepts.

8.

A Fault-Tolerant Architecture for Web Service Composition at Runtime (Cited by: 1 total; 0 self-citations; 1 by others)

Zou Fang Gao Chunming 《Computer Engineering (计算机工程)》2008,34(18):89-92

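Fault-tolerant handling of Web service compositions is an effective way to improve service availability and reliability. This paper analyzes the faults that commonly occur while a service composition process is running, grades them by severity, and proposes a fault-tolerant architecture for the WebJetFlow execution engine during execution. Online handling strategies targeted at each severity level of runtime faults are implemented, so that both business functionality and quality of service are preserved while the process executes.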

9.

A Framework for Adaptive Fault-Tolerant Execution of Workflows in the Grid: Empirical and Theoretical Analysis

Felipe Pontes Guimaraes Pedro Célestin Daniel Macedo Batista Genaína Nunes Rodrigues Alba Cristina Magalhaes Alves de Melo 《Journal of Grid Computing》2014,12(1):127-151

In this paper, we propose and evaluate a framework for fault-tolerant workflow execution in Grid environments. Different from previous work in the literature, our system dynamically chooses an appropriate fault tolerance technique while using a user-defined rule-based system. We also provide a generic interface that can be used to add fault tolerance techniques to the framework. The results obtained with real workflows in an experimental Grid environment show that the overhead introduced by our framework in a failure-free execution is, in the worst evaluated case, approximately 10%. Moreover, we show that, using our framework, workflows are able to execute successfully in the presence of failures and that the framework can dynamically choose an appropriate fault tolerance technique. The main contributions of our work are twofold: the developed framework and the model-based dependability analysis we performed on it. The purpose of carrying out a model-based dependability analysis is to evaluate the interaction between our framework and the distributed Grid environment beyond the physical limitations of an empirical evaluation. By doing this, we provide means to plan the assurance of QoS in the Grid resource allocation, while applying the fault-tolerance mechanisms we implement in our framework regardless of the underlying middleware.

10.

A novel fault-tolerant execution model by using of mobile agents

Wenyu Qu Masaru Kitsuregawa Hong Shen Zhiguang Shan 《Journal of Network and Computer Applications》2009,32(2):423-432

The exponential expansion of the Internet and the widespread popularity of the World Wide Web pose a challenge to experts on reliable and secure system design, e.g., for e-economy applications. New paradigms are in demand, and mobile agent technology is one such paradigm. In this paper, we propose a fault-tolerant execution model using mobile agents, for the purpose of consistent and correct performance of a required function under stated conditions for a specified period of time. Failures are classified into two classes based on their intrinsically different effects on mobile agents. For each kind of failure, a specified handling method is adopted. The introduction of an exception handling method allows performance improvements during mobile agents' execution. The behaviors of mobile agents are statistically analyzed through several key parameters, including the migration time from node to node, the life expectancy of mobile agents, and the population distribution of mobile agents, to evaluate the performance of our model. The analytical results give new theoretical insights into the fault-tolerant execution of mobile agents and show that our model outperforms the existing fault-tolerant models. Our model provides an effective way to improve the reliability of computer systems.

11.

A Dynamic Real-Time Scheduling Algorithm Based on Software Fault Tolerance

Han Jianjun Li Qinghua Abbas A. Essa 《Journal of Chinese Computer Systems (小型微型计算机系统)》2005,26(4):658-661

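In hard real-time systems, a task that misses its deadline can have catastrophic consequences, so such systems are subject to strict timing and reliability constraints. Most existing real-time fault-tolerant scheduling algorithms target hardware faults and rarely consider faults in running software. This paper proposes PKSA (Probing-step Algorithm), an EDF-like dynamic real-time scheduling algorithm with software fault tolerance. During task execution, the algorithm uses several probing check steps to improve the prediction of whether tasks can be executed, and avoids as far as possible the impact of early task failures on subsequent tasks. It thereby raises the task completion rate and effectively reduces wasted CPU time slices. Experiments show that, compared with known algorithms of the same kind, it achieves a better ratio of scheduling performance to scheduling cost.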
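For readers unfamiliar with EDF (earliest deadline first), the baseline policy that the abstract above compares PKSA to, the following is a minimal Python sketch. It shows plain non-preemptive EDF only, under the simplifying assumption that all tasks are released at time 0; it does not implement PKSA's probing steps, and the function name and tuple layout are hypothetical.

```python
def edf_order(tasks):
    """Run tasks in earliest-deadline-first order.

    tasks: list of (name, exec_time, deadline), all released at time 0.
    Returns (order, missed): the execution order and the names of tasks
    that finish after their deadline.
    """
    order, missed, clock = [], [], 0.0
    for name, exec_time, deadline in sorted(tasks, key=lambda t: t[2]):
        clock += exec_time          # the task runs to completion
        order.append(name)
        if clock > deadline:
            missed.append(name)     # deadline miss in a hard real-time sense
    return order, missed
```

In this sketch a task that is bound to miss its deadline still consumes CPU time before the miss is recorded, which is the kind of wasted time the abstract says the probing steps aim to reduce.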

12.

Prediction-based proactive load balancing approach through VM migration

Anju Bala Inderveer Chana 《Engineering with Computers》2016,32(4):581-592

The ever-growing intricacy and dynamicity of Cloud Computing Systems has created a need for Proactive Load Balancing, which is an effective approach to improve the scalability of today's Cloud services. In order to manage the load proactively on the Cloud system during application execution, load should be predicted through machine learning approaches and handled through VM migration approaches. Thus, this paper focuses on the research problem of designing a prediction-based approach for facilitating proactive load balancing through the prediction of multiple resource utilization parameters in the Cloud. The contribution of this paper is twofold. Firstly, various machine learning approaches have been tested and compared for predicting host overutilization as well as underutilization. Secondly, the load prediction model having maximum accuracy from the tested models has been utilized for implementing the proactive VM migration using multiple resource utilization parameters. Further, the proposed technique has been validated through performance evaluation parameters using the CloudSim and Weka toolkits. The simulation results demonstrate that the proposed approach is effective for handling VM migration and for reducing SLA violations, the number of VM migrations, and the mean and standard deviation of execution time.

13.

FTPA: Supporting Fault-Tolerant Parallel Computing through Parallel Recomputing

Yang Xuejun Du Yunfei Wang Panfeng Fu Hongyi Jia Jia 《Parallel and Distributed Systems, IEEE Transactions on》2009,20(10):1471-1486

As the size of large-scale computer systems increases, their mean time between failures is becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single processor. As a result, the recovery time of all protocols is no less than the time between the last checkpoint and the crash. In this paper, we propose a new application-level fault-tolerant approach for parallel applications called the Fault-Tolerant Parallel Algorithm (FTPA), which provides fast self-recovery. When fail-stop failures occur and are detected, all surviving processes recompute the workload of failed processes in parallel. FTPA, however, requires the user to be involved in fault tolerance. In order to ease the FTPA implementation, we developed Get it Fault-Tolerant (GiFT), a source-to-source precompiler tool to automate the FTPA implementation. We evaluate the performance of FTPA with parallel matrix multiplication and five kernels of NAS Parallel Benchmarks on a cluster system with 1,024 CPUs. The experimental results show that the performance of FTPA is better than the performance of the traditional checkpointing approach.

14.

A multi-dimensional scheduling scheme in a Grid computing environment

B.T. Benjamin Khoo Bharadwaj Veeravalli Terence Hung C.W. Simon See 《Journal of Parallel and Distributed Computing》2007

In this paper, we propose a novel distributed resource-scheduling algorithm capable of handling multiple resource requirements for jobs that arrive in a Grid computing environment. In our proposed algorithm, referred to as the multiple resource scheduling (MRS) algorithm, we take into account both the site capabilities and the resource requirements of jobs. The main objective of the algorithm is to obtain a minimal execution schedule through efficient management of available Grid resources. We first propose a model in which the job and site resource characteristics can be captured together and used in the scheduling algorithm. To do so, we introduce the concept of an n-dimensional virtual map and resource potential. Based on the proposed model, we conduct rigorous simulation experiments with real-life workload traces reported in the literature to quantify the performance. We compare our strategy with most of the commonly used algorithms on performance metrics such as job wait times, queue completion times, and average resource utilization. Our combined consideration of job and resource characteristics is shown to deliver high performance with respect to the above-mentioned metrics in the environment. Our study also reveals that the MRS scheme can adapt to both serial and parallel job requirements, especially when job fragmentation occurs. Our experimental results clearly show that MRS outperforms other strategies and we highlight the impact and importance of our strategy.

15.

Adaptive service scheduling for workflow applications in Service-Oriented Grid (Cited by: 1 total; 1 self-citation; 0 by others)

Sung Ho Chin Taeweon Suh Heon Chang Yu 《The Journal of supercomputing》2010,52(3):253-283

When a workflow application is executed in a Service-Oriented Grid (SOG), performance issues such as service scheduling should be considered to achieve high and stable performance in execution. However, most of the prior works on workflow management neither study the performance issues nor provide evaluation methodologies for the performance of Grid Services. Therefore, they cannot be applied to the service scheduling problem in SOG. In this paper, we propose and model evaluation metrics for Grid Service performance. The metrics are extracted based on common properties of Grid Services and are used to quantify and evaluate the performance of an individual Grid Service. With these metrics, we develop a service scheduling scheme with a list scheduling heuristic, to choose proper and optimal Grid Services for tasks in workflow applications. It ensures high performance in the execution of the workflow applications. In addition, we propose a low-overhead rescheduling method, referred to as Adaptive List Scheduling for Service (ALSS), to adapt to the dynamic nature of a grid environment. ALSS provides stable performance for workflow applications, even in abnormal circumstances. Finally, we design an experimental environment with actual traces and perform simulations to quantify the benefits of our approach. Throughout the experiments, we demonstrate that ALSS outperforms conventional scheduling methods. Our scheme produces a scheduling performance that is superior to AHEFT by 50.2%, SLACK by 50.8%, HEFT by 68.3%, MaxMin by 72.0%, MinMin by 71.0%, and Myopic by 69.8%.

16.

Planning and Resource Allocation for Hard Real-time, Fault-Tolerant Plan Execution

Atkins Ella M. Abdelzaher Tarek F. Shin Kang G. Durfee Edmund H. 《Autonomous Agents and Multi-Agent Systems》2001,4(1-2):57-78

We describe the interface between a real-time resource allocation system and an AI planner in order to create fault-tolerant plans that are guaranteed to execute in hard real-time. The planner specifies the task set and all execution deadlines required to ensure system safety, then the resource utilization. A new interface module combines information from planning and resource allocation to enforce development of plans feasible for execution during a variety of internal system faults. Plans that over-utilize any system resource trigger feedback to the planner, which then searches for an alternate plan. A valid plan for each specified fault, including the nominal no-fault situation, is stored in a plan cache for subsequent real-time execution. We situate this work in the context of CIRCA, the Cooperative Intelligent Real-time Control Architecture, which focuses on developing and scheduling plans that make hard real-time safety guarantees, and provide an example of an autonomous aircraft agent to illustrate how our planner-resource allocation interface improves CIRCA performance.

17.

Fault-tolerant deadlock avoidance algorithm for assembly processes

Fu-Shiung Hsieh 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》2004,34(1):65-79

Unreliable resources pose challenges in the design of deadlock avoidance algorithms, as resource failures have negative impacts on scheduled production activities and may bring the system to dead states or deadlocks. This paper focuses on the development of a suboptimal polynomial complexity deadlock avoidance algorithm that can operate in the presence of unreliable resources for assembly processes. We formulate a fault-tolerant deadlock avoidance controller synthesis problem for assembly processes based on the controlled assembly Petri net (CAPN), a class of Petri nets (PNs) that can model such characteristics as multiple resources and subassembly parts requirements in assembly production processes. The proposed fault-tolerant deadlock avoidance algorithm consists of a nominal algorithm to avoid deadlocks for the nominal system state and an exception handling algorithm to deal with resource failures. We analyze the fault-tolerant property of the nominal deadlock avoidance algorithm based on resource unavailability models. Resource unavailability is modeled as loss of tokens in nominal Petri net models, to represent the unavailability of resources in the course of time-consuming recovery procedures. We define three types of token loss to model 1) resource failures in a single operation, 2) resource failures in multiple operations of a production process and 3) resource failures in multiple operations of multiple production processes. For each type of token loss, we establish sufficient conditions that guarantee the liveness of a CAPN after some tokens are removed. An algorithm is proposed to conduct feasibility analysis by searching for recovery control sequences and to keep as many types of production processes as possible in production so that the impacts on existing production activities can be reduced.

18.

Adapting grid applications to safety using fault-tolerant methods: Design, implementation and evaluations (Cited by: 1 total; 0 self-citations; 1 by others)

Xuanhua Jean-Louis Eric Hai Hongbo 《Future Generation Computer Systems》2010,26(2):236-244

Grid applications have been prone to encountering problems such as failures or malicious attacks during execution in recent years, due to their distributed and large-scale features. The application itself, however, has limited power to address these problems. This paper presents the design, implementation, and evaluation of an adaptive framework, Dynasa, which strives to handle security problems using adaptive fault tolerance (i.e., checkpointing and replication) during the execution of applications according to the status of the Grid environments. We evaluate our adaptive framework experimentally using the Grid5000 testbed, and the experimental results have demonstrated that Dynasa enables the application itself to handle the security problems efficiently. Starting the adaptive component takes less than 1 s and an adaptive action takes less than 0.1 s with a checkpoint interval of 20 s. Compared with a non-adaptive method, experimental results demonstrate that Dynasa achieves better performance in terms of execution time, network bandwidth consumed, and CPU load, resulting in up to a 50% lower overhead.

19.

Dependable Grid Workflow Scheduling Based on Resource Availability

Yongcai Tao Hai Jin Song Wu Xuanhua Shi Lei Shi 《Journal of Grid Computing》2013,11(1):47-61

Because the Grid environment is highly dynamic, dependable workflow scheduling is critical. Various scheduling algorithms have been proposed, but they seldom consider resource reliability. Current Grid systems mainly exploit fault tolerance mechanisms to guarantee dependable workflow execution, which, however, wastes system resources. The paper proposes a dependable Grid workflow scheduling system (called DGWS). It introduces a Markov chain-based resource availability prediction model. Based on the model, a reliability-cost-driven workflow scheduling algorithm is presented. The performance evaluation results, including the simulation on both parametric randomly generated DAGs and two real scientific workflow applications, demonstrate that, compared to present workflow scheduling algorithms, DGWS improves the success ratio of tasks and diminishes the makespan of the workflow, and thus improves the dependability of workflow execution in dynamic Grid environments.
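To make the notion of Markov chain-based availability prediction concrete, here is a minimal Python sketch of the simplest possible case: a two-state (up/down) chain with fixed per-step transition probabilities. The abstract does not specify the DGWS model, so the two-state structure, the function name, and the parameters `p_fail` and `p_repair` are assumptions chosen purely for illustration.

```python
def predict_availability(p_up_now, p_fail, p_repair, steps):
    """Probability that a resource is up after `steps` transitions.

    Hypothetical two-state Markov chain:
      p_fail   = P(up -> down) per step
      p_repair = P(down -> up) per step
    """
    p_up = p_up_now
    for _ in range(steps):
        # Stay up, or recover from down.
        p_up = p_up * (1.0 - p_fail) + (1.0 - p_up) * p_repair
    return p_up
```

For many steps the prediction approaches the steady-state availability `p_repair / (p_fail + p_repair)`, so a scheduler could, for example, prefer resources whose predicted availability over a task's expected runtime is highest.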

20.

Quality of Service Negotiation for Commercial Medical Grid Services

S. E. Middleton M. Surridge S. Benkner G. Engelbrecht 《Journal of Grid Computing》2007,5(4):429-447

The GEMSS project has developed a service-oriented Grid that supports the provision of medical simulation services by service providers to clients such as hospitals. We outline the GEMSS architecture, legal framework and the security features that characterise the GEMSS infrastructure. High levels of quality of service are required, and we describe a reservation-based approach to quality of service, employing a quality of service management system that iteratively finds suitable reservations and uses application-specific performance models. The GEMSS Grid is a commercial environment, so we support flexible pricing models and a FIPA reverse English auction protocol. Signed Web Service Level Agreement contracts are exchanged to commit parties to a quality of service agreement before job execution occurs. We run four experiments across European countries using high performance computing resources running advanced resource reservation schedulers. These experiments provide evidence for our Grid's rational behaviour, both at the level of service provider quality of service management and at the higher level of the client choosing between competing service providers. The results lend support to our economic model and the technology we use for our medical application domain.