MapReduce Workload Modeling with Statistical Approach |
| |
Authors: | Hailong Yang Zhongzhi Luan Wenjun Li Depei Qian |
| |
Affiliation: | 1. Sino-German Joint Software Institute, The State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, China
|
| |
Abstract: | Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive for the core business of many
academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful
realization of the MapReduce framework. While MapReduce is easy to use, efficient, and reliable for data-intensive computations,
the large number of configuration parameters in Hadoop makes it unexpectedly challenging to run various workloads on a Hadoop cluster
effectively. Consequently, developers who have less experience with the Hadoop configuration system may devote significant
effort to writing an application that performs poorly, either because they do not know how these configurations influence
performance, or because they are not even aware that these configurations exist. There is a pressing need for comprehensive
analysis and performance modeling to ease MapReduce application development and guide performance optimization under different
Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload
characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis
to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different
Hadoop configurations. We also construct regression models that attempt to predict the performance of various workloads
under different Hadoop configurations. Several non-intuitive relationships between workload characteristics and performance
are revealed through our analysis, and the experimental results demonstrate that our regression models accurately predict the
performance of MapReduce workloads under different Hadoop configurations. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases. |
|
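
The abstract describes reducing 45 workload metrics with principal component analysis and then fitting regression models to predict performance. A minimal sketch of that general pipeline follows, using NumPy only. It is not the authors' implementation: the function names (`pca_fit`, `pca_transform`, `fit_linear`, `predict`), the 95% variance threshold, the ordinary least squares model, and the synthetic data are all assumptions made for illustration, and the cluster analysis step is omitted.

```python
import numpy as np


def pca_fit(X, var_threshold=0.95):
    """Standardize X (runs x metrics) and return (mean, std, components).

    Keeps the smallest number of principal components whose cumulative
    explained variance reaches var_threshold (an assumed cutoff).
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant metrics
    Z = (X - mean) / std
    # Rows of Vt are the principal directions of the standardized data.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    ratio = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(ratio), var_threshold)) + 1
    k = min(k, len(S))
    return mean, std, Vt[:k]


def pca_transform(X, mean, std, components):
    """Project standardized metrics onto the retained components."""
    return ((X - mean) / std) @ components.T


def fit_linear(scores, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(scores)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(scores, coef):
    """Apply the fitted linear model to component scores."""
    return coef[0] + scores @ coef[1:]


def make_synthetic(n_runs=200, n_metrics=45, seed=0):
    """Hypothetical data: 45 metrics driven by 3 latent factors plus noise."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_runs, 3))
    mixing = rng.normal(size=(3, n_metrics))
    X = latent @ mixing + 0.05 * rng.normal(size=(n_runs, n_metrics))
    # Hypothetical job runtime in seconds, linear in two latent factors.
    y = 100 + 20 * latent[:, 0] - 5 * latent[:, 1] + rng.normal(size=n_runs)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    mean, std, comps = pca_fit(X)
    scores = pca_transform(X, mean, std, comps)
    coef = fit_linear(scores, y)
    residual = y - predict(scores, coef)
    r2 = 1 - residual.var() / y.var()
    print(f"kept {comps.shape[0]} components, in-sample R^2 = {r2:.3f}")
```

Regressing on component scores rather than on the raw metrics keeps the predictors uncorrelated, which is one common reason to pair the two techniques; whether the paper does exactly this is not stated in the abstract.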