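Performance Optimization for Short Job Execution in Hadoop MapReduce
Cite this article:Gu Rong, Yan Jinshuang, Yang Xiaoliang, Yuan Chunfeng, Huang Yihua. Performance Optimization for Short Job Execution in Hadoop MapReduce[J]. 计算机研究与发展 (Journal of Computer Research and Development), 2014, 51(6): 1270-1280.
Authors:Gu Rong  Yan Jinshuang  Yang Xiaoliang  Yuan Chunfeng  Huang Yihua
Affiliation:1.(State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210046) (gurongwalker@gmail.com)
Funding:National Natural Science Foundation of China, special fund project (61223003); National High Technology Research and Development Program of China ("863" Program) (2011AA01A202); Intel Labs (USA) university research grant program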
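Abstract:The Hadoop MapReduce parallel computing framework is widely used for large-scale parallel data processing. In recent years, because it handles large-scale data well, Hadoop MapReduce has also been used increasingly in query applications. To process large datasets, Hadoop's basic design places more emphasis on high data throughput. However, for query applications that demand fast response from short jobs, the Hadoop MapReduce framework has clear shortcomings. To improve Hadoop's execution efficiency for short jobs, three optimizations are made to the original Hadoop MapReduce: 1) the way the original setup and cleanup tasks are executed is optimized, which shortens the time spent preparing the environment at job initialization and cleaning it up at job completion; 2) the first task assignment is changed from a "pull" model to a "push" model; 3) the control messages exchanged between the JobTracker and TaskTrackers during job execution are separated from the existing periodic heartbeat mechanism and delivered through an instant delivery mechanism instead. Finally, BLAST, a typical query application parallelized with MapReduce, is used to evaluate the optimizations. Experiments on various types of BLAST jobs show that, compared with standard Hadoop, the optimized Hadoop improves average execution performance by about 23%.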

Keywords:MapReduce  parallel computing  short job  performance optimization  big data processing

Performance Optimization for Short Job Execution in Hadoop MapReduce
Gu Rong, Yan Jinshuang, Yang Xiaoliang, Yuan Chunfeng, and Huang Yihua. Performance Optimization for Short Job Execution in Hadoop MapReduce[J]. Journal of Computer Research and Development, 2014, 51(6): 1270-1280.
Authors:Gu Rong  Yan Jinshuang  Yang Xiaoliang  Yuan Chunfeng  and Huang Yihua
Affiliation:1.(State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 210046)
Abstract:Hadoop MapReduce is a widely used parallel computing framework for solving data-intensive problems. Nowadays, because it handles large-scale data well, Hadoop MapReduce has also been adopted in many query applications. To be able to process large-scale datasets, the fundamental design of standard Hadoop places more emphasis on high data throughput than on job execution performance. This limits performance when Hadoop MapReduce is used to execute short jobs. This paper proposes several optimization methods to improve the execution performance of MapReduce jobs, especially short jobs. We make three major optimizations: 1) reduce the time cost of the initialization and termination stages of a job by optimizing its setup and cleanup tasks; 2) change the assignment model of the first batch of tasks from the pull model to the push model; 3) replace the heartbeat-based communication mechanism with an instant message communication mechanism for event notifications between the JobTracker and TaskTrackers. We also adopt a typical MapReduce-based parallel query application, BLAST, to evaluate the effects of our optimizations. Experimental results show that, for different types of BLAST MapReduce jobs, job execution with our improved version of Hadoop is about 23% faster on average than with standard Hadoop.
Keywords:MapReduce  parallel computing  short job  performance optimization  big data processing
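
The third optimization in the abstract moves event notifications off the periodic heartbeat and onto an instant message channel. The sketch below is only a toy timing model of why that can matter for short jobs. It is not the authors' implementation and uses no Hadoop API. The function names, the 3-second heartbeat period, and the 10 ms message latency are all illustrative assumptions that do not come from the paper.

```python
import math

def heartbeat_delivery_time(event_time: float, interval: float) -> float:
    """Time at which an event raised at event_time is carried by the next heartbeat.

    Assumes heartbeats fire at 0, interval, 2*interval, ... (hypothetical model).
    """
    return math.ceil(event_time / interval) * interval

def instant_delivery_time(event_time: float, latency: float) -> float:
    """Time at which the same event arrives if it is sent immediately."""
    return event_time + latency

def mean_delay(event_times, deliver):
    """Average wait between an event being raised and being delivered."""
    return sum(deliver(t) - t for t in event_times) / len(event_times)

INTERVAL = 3.0   # assumed heartbeat period in seconds (illustrative only)
LATENCY = 0.01   # assumed one-way message latency in seconds (illustrative only)

# Events spread evenly across one heartbeat period: 0.25 s, 0.5 s, ..., 3.0 s.
events = [i * 0.25 for i in range(1, 13)]

heartbeat_mean = mean_delay(events, lambda t: heartbeat_delivery_time(t, INTERVAL))
instant_mean = mean_delay(events, lambda t: instant_delivery_time(t, LATENCY))

print(f"mean delay via heartbeat: {heartbeat_mean:.3f} s")
print(f"mean delay via instant message: {instant_mean:.3f} s")
```

Under these assumptions, an event waits on average a little under half a heartbeat period before it is delivered, while an instant message waits only for the network latency. For a job that runs for only a few seconds, several such waits can add up to a noticeable share of total run time.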