CloudFlow: A data-aware programming model for cloud workflow applications on modern HPC systems
Affiliation: 1. Kavli Institute for Astrophysics and Space Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; 2. KINDI Center for Computing Research, Qatar University, Doha, Qatar; 3. Department of Electrical and Computer Engineering, North Dakota State University, Fargo, ND 58108-6050, USA; 4. Department of Computer Science, State University of New York, New Paltz, NY 12561, USA; 5. School of Information Technologies, The University of Sydney, Sydney, NSW 2006, Australia
Abstract: Traditional High-Performance Computing (HPC) based big-data applications are usually constrained by the need to move large amounts of data to compute facilities for real-time processing. Modern HPC systems, represented by High-Throughput Computing (HTC) and Many-Task Computing (MTC) platforms, instead aim to realize the long-held goal of moving compute to data. This kind of data-aware scheduling, typified by Hadoop MapReduce, has been implemented successfully in the Map phase, where each Map task is dispatched to the compute node that holds the corresponding input data chunk. However, Hadoop MapReduce restricts itself to a one-map-to-one-reduce framework, which makes it difficult to express complex logic such as pipelines or workflows. It also lacks built-in support and optimization for input datasets that are shared among multiple applications and/or jobs; performance can improve significantly when knowledge of shared, frequently accessed data is incorporated into scheduling decisions. To enhance workflow management on modern HPC systems, this paper presents CloudFlow, a Hadoop MapReduce based programming model for cloud workflow applications. CloudFlow is built on top of MapReduce and is designed to be not only data-aware but also shared-data-aware. It identifies the most frequently shared data at both the task level and the job level and replicates it to each compute node to preserve data locality. It also supports multiple user-defined Map and Reduce functions, allowing users to orchestrate the required data-flow logic. We prove the correctness of the whole scheduling framework through theoretical analysis. Furthermore, experimental evaluation shows that execution speedups exceed 4X over a traditional MapReduce implementation, with a manageable time overhead.
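To make the two mechanisms described in the abstract more concrete, the following is a minimal illustrative sketch, not CloudFlow's published API: it chains two Map/Reduce stages into a small pipeline (the workflow shape a single one-map-to-one-reduce job cannot express) and uses stock Hadoop's distributed cache to replicate a frequently shared dataset to every compute node so tasks can read it locally, approximating the "replicate shared data for locality" idea. The class SharedDataPipeline, the pass-through mapper/reducer, and the path hdfs:///data/shared/lookup-table.dat are assumptions introduced for illustration only.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SharedDataPipeline {

  // Trivial pass-through Mapper/Reducer so the sketch compiles;
  // a real workflow would plug in its own logic for each stage.
  public static class PassMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(key.toString()), value);
    }
  }

  public static class PassReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) {
        ctx.write(key, v);
      }
    }
  }

  // Configure one stage of the pipeline and push the shared dataset
  // (via the distributed cache) to every node that runs its tasks.
  private static Job buildStage(Configuration conf, String name,
                                Path in, Path out) throws Exception {
    Job job = Job.getInstance(conf, name);
    job.setJarByClass(SharedDataPipeline.class);
    job.setMapperClass(PassMapper.class);
    job.setReducerClass(PassReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    // Hypothetical frequently shared dataset; localized on each task node
    // so lookups against it are local reads rather than remote transfers.
    job.addCacheFile(new URI("hdfs:///data/shared/lookup-table.dat"));
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path stage1Out = new Path(args[1] + "/stage1");
    Path finalOut = new Path(args[1] + "/final");

    // Stage 1 -> Stage 2: a two-step pipeline, i.e. more than the single
    // map/reduce pair that plain MapReduce expresses on its own.
    Job stage1 = buildStage(conf, "cloudflow-sketch-stage1", input, stage1Out);
    if (!stage1.waitForCompletion(true)) System.exit(1);

    Job stage2 = buildStage(conf, "cloudflow-sketch-stage2", stage1Out, finalOut);
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}
```

The distributed cache is used here only as the closest stock-Hadoop stand-in for replicating frequently shared data to each compute node; the paper's own scheduler additionally decides which data is shared often enough, at the task and job levels, to be worth replicating.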
Keywords:Concurrency  Data aware  MapReduce  HPC  Programming model
This article is indexed by databases such as ScienceDirect.