Title: The HaLoop approach to large-scale iterative data analysis
Authors: Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst
Affiliations: 1. University of California, Irvine, Irvine, CA 92697, USA; 2. University of Washington, Seattle, WA 98195, USA
Abstract: The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design
new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce
lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking,
graph analysis, and model fitting. This paper (an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative
Data Processing on Large Clusters,” PVLDB 3(1):285–296, 2010) presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop
allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves
their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop
retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop
on a variety of real applications and real datasets. Compared with Hadoop, on average, HaLoop improved runtimes by a factor
of 1.85 and shuffled only 4% as much data between mappers and reducers in the applications that we tested.
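To make the abstract's point concrete, the sketch below (a hypothetical illustration, not HaLoop's actual API) shows the iterative MapReduce pattern HaLoop targets: a PageRank-style loop in which the link structure is loop-invariant. In plain Hadoop, that invariant data is reloaded and reshuffled on every iteration; HaLoop's inter-iteration caching keeps it resident across iterations, which is where the reported shuffle savings come from.

```python
# Hypothetical sketch of an iterative MapReduce job (PageRank-style).
# `links` is the loop-invariant input that HaLoop would cache across
# iterations instead of reshuffling; `ranks` is the per-iteration state.

links = {           # loop-invariant: page -> outgoing links
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def map_phase(ranks):
    # Mapper: each page divides its rank among its outgoing links
    # and emits (target, contribution) pairs.
    for page, outs in links.items():
        share = ranks[page] / len(outs)
        for target in outs:
            yield target, share

def reduce_phase(pairs, damping=0.85):
    # Reducer: sum the contributions arriving at each page and
    # apply the standard damping formula.
    totals = {page: 0.0 for page in links}
    for target, share in pairs:
        totals[target] += share
    n = len(links)
    return {p: (1 - damping) / n + damping * s for p, s in totals.items()}

def run(iterations=20):
    # Driver loop: in plain Hadoop each pass is a fresh job that
    # rereads and reshuffles `links`; HaLoop caches it between passes.
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(ranks))
    return ranks

ranks = run()
assert abs(sum(ranks.values()) - 1.0) < 1e-6  # rank mass is conserved
```

The fixed loop count stands in for HaLoop's termination support; the paper also describes fixpoint-style stopping conditions, which would replace the `range(iterations)` loop with a convergence test between successive iterations.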