

GAMESH: A grid architecture for scalable monitoring and enhanced dependable job scheduling
Affiliation: 1. Department of Computer Science and Engineering, University of California, Riverside, United States; 2. Department of Electrical and Computer Engineering, University of California, Riverside, United States; 1. Research School of Computer Science, Australian National University, Canberra, ACT 2601, Australia; 2. Department of Electronic and Electrical Engineering, University College London, London, UK; 1. Institute of Computer Science, Cracow University of Technology, Poland; 2. Università degli Studi di Salerno, Fisciano, Campania, Italy
Abstract: Grid computing is a widely adopted paradigm for federating geographically distributed data centers. Because of their size and complexity, grid systems are often affected by failures that may hinder the correct and timely execution of jobs, causing a non-negligible waste of computing resources. Despite the relevance of the problem, state-of-the-art management solutions for grid systems usually neglect the identification and handling of failures at runtime. We argue that novel approaches are needed that integrate scalably with efficient monitoring solutions and that suit large, geographically distributed systems, where dynamic and configurable tradeoffs between overhead and targeted granularity are necessary. This paper proposes GAMESH, a Grid Architecture for scalable Monitoring and Enhanced dependable job ScHeduling. GAMESH is conceived as a completely distributed and highly efficient management infrastructure that concentrates on two crucial aspects of large-scale, multi-domain grid environments: (i) the scalable dissemination of monitoring data and (ii) the troubleshooting of job execution failures. GAMESH has been implemented and tested in a real deployment encompassing geographically distributed data centers across Europe. Experimental results show that GAMESH (i) enables the collection of measurements of both computing resources and task scheduling conditions at geographically dispersed sites, while imposing limited overhead on the entire infrastructure, and (ii) provides a failure-aware scheduler that improves overall system performance, even in the presence of failures, by coordinating local job schedulers across multiple domains.
Keywords: Grid  Monitoring  Dependability  Scalability  Scheduling  Fault tolerance  DDS
This article is indexed in ScienceDirect and other databases.