Title: The HaLoop approach to large-scale iterative data analysis
Authors: Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst
Affiliations: 1. University of California, Irvine, Irvine, CA 92697, USA; 2. University of Washington, Seattle, WA 98195, USA
Abstract: The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design
new types of highly scalable data-intensive computing platforms. MapReduce has enjoyed particular success. However, MapReduce
lacks built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking,
graph analysis, and model fitting. This paper (an extended version of the VLDB 2010 paper “HaLoop: Efficient Iterative
Data Processing on Large Clusters,” PVLDB 3(1):285–296, 2010) presents HaLoop, a modified version of the Hadoop MapReduce framework that is designed to serve these applications. HaLoop
allows iterative applications to be assembled from existing Hadoop programs without modification, and significantly improves
their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler to exploit these caches. HaLoop
retains the fault-tolerance properties of MapReduce through automatic cache recovery and task re-execution. We evaluated HaLoop
on a variety of real applications and real datasets. Compared with Hadoop, on average, HaLoop improved runtimes by a factor
of 1.85 and shuffled only 4% as much data between mappers and reducers in the applications that we tested.
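To make the abstract's point concrete, the sketch below (a hypothetical illustration, not HaLoop's actual API) shows the iterative MapReduce pattern HaLoop targets: a PageRank-style loop in which the link structure is loop-invariant. In plain Hadoop, that invariant data is reloaded and reshuffled on every iteration; HaLoop's inter-iteration caching keeps it resident across iterations, which is where the reported shuffle savings come from.

```python
# Hypothetical sketch of an iterative MapReduce job (PageRank-style).
# `links` is the loop-invariant input that HaLoop would cache across
# iterations instead of reshuffling; `ranks` is the per-iteration state.

links = {           # loop-invariant: page -> outgoing links
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def map_phase(ranks):
    # Mapper: each page divides its rank among its outgoing links
    # and emits (target, contribution) pairs.
    for page, outs in links.items():
        share = ranks[page] / len(outs)
        for target in outs:
            yield target, share

def reduce_phase(pairs, damping=0.85):
    # Reducer: sum the contributions arriving at each page and
    # apply the standard damping formula.
    totals = {page: 0.0 for page in links}
    for target, share in pairs:
        totals[target] += share
    n = len(links)
    return {p: (1 - damping) / n + damping * s for p, s in totals.items()}

def run(iterations=20):
    # Driver loop: in plain Hadoop each pass is a fresh job that
    # rereads and reshuffles `links`; HaLoop caches it between passes.
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        ranks = reduce_phase(map_phase(ranks))
    return ranks

ranks = run()
assert abs(sum(ranks.values()) - 1.0) < 1e-6  # rank mass is conserved
```

The fixed loop count stands in for HaLoop's termination support; the paper also describes fixpoint-style stopping conditions, which would replace the `range(iterations)` loop with a convergence test between successive iterations.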