
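A Data-Skew Join Algorithm Based on MapReduce
Cite this article: LIANG Jun-jie, HE Li-min. A data-skew join algorithm based on MapReduce[J]. Computer Science (计算机科学), 2016, 43(9): 27-31
Authors: LIANG Jun-jie (梁俊杰), HE Li-min (何利民)
Affiliation: School of Computer Science and Information Engineering, Hubei University, Wuhan 430062 (both authors)
Funding: This work was supported by the Key Project of the Natural Science Foundation of Hubei Province (2015CFA067, 2013CFA115), the Scientific Research Project Plan of the Hubei Provincial Department of Education (D20151001), and the Wuhan Science and Technology Key Research Program (2013012401010851).
Abstract: The join is the most commonly used operation when analyzing large-scale datasets. Because MapReduce itself cannot efficiently process joins under data skew, a MapReduce-based frequency classification join algorithm is proposed. The whole dataset is divided into 3 classes according to how frequently the data appears in the datasets being joined. Skewed data is redistributed using a partition algorithm and a broadcast algorithm to eliminate the effect of the skew; non-skewed data is redistributed using a hash algorithm. After redistribution, the join can be completed within a single node, which avoids the cross-node transfer cost of joins under the MapReduce framework and also balances the task load across the MapReduce nodes, thereby improving join efficiency under data skew. A comparison with traditional join algorithms shows that the proposed algorithm is effective and practical.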

Keywords: data skew; MapReduce; join algorithm; load balancing
Received: 2015-08-19
Revised: 2015-10-29

Join Algorithm in Skewed Datasets Based on MapReduce
LIANG Jun-jie and HE Li-min. Join Algorithm in Skewed Datasets Based on MapReduce[J]. Computer Science, 2016, 43(9): 27-31
Authors:LIANG Jun-jie and HE Li-min
Affiliation: School of Computer Science and Information Engineering, Hubei University, Wuhan 430062, China (both authors)
Abstract: Join is the most common operation in data analysis applications over large-scale datasets, but MapReduce by itself cannot handle joins efficiently when the data is skewed. This paper proposes a MapReduce-based frequency classified join algorithm. The data is divided into three categories according to how frequently it appears in the joined datasets. Skewed data is redistributed with a partitioning algorithm and a broadcast algorithm to eliminate the impact of the skew, while non-skewed data is redistributed with a hash algorithm. After redistribution, each join can be completed within a single node, which avoids the cost of cross-node communication under MapReduce and balances the workload of the nodes effectively, thereby improving the efficiency of join operations on skewed data. A comparison with traditional join algorithms demonstrates the effectiveness and practicality of the proposed algorithm.
Keywords:Data skew  MapReduce  Join algorithm  Load balancing
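
The redistribution idea in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation: the abstract does not say how the three categories are defined, so the sketch assumes they are (1) keys skewed in the left input R, (2) keys skewed in the right input S, and (3) all other keys. It also assumes a fixed frequency `threshold` and a round-robin spread for skewed tuples. All function names (`classify_keys`, `redistribute`, `local_join`) are hypothetical, and lists of `(key, value)` pairs stand in for MapReduce input splits.

```python
from collections import Counter, defaultdict
from zlib import crc32


def stable_hash(key):
    # Deterministic hash, so that placement does not vary between runs.
    return crc32(repr(key).encode("utf-8"))


def classify_keys(r, s, threshold):
    # Assumed classification: a key is "skewed" on a side when it appears
    # at least `threshold` times there. Keys skewed on both sides are
    # treated as R-skewed to keep the sketch simple.
    r_freq = Counter(key for key, _ in r)
    s_freq = Counter(key for key, _ in s)
    skew_r = {k for k, n in r_freq.items() if n >= threshold}
    skew_s = {k for k, n in s_freq.items() if n >= threshold and k not in skew_r}
    return skew_r, skew_s


def redistribute(r, s, num_nodes, threshold):
    skew_r, skew_s = classify_keys(r, s, threshold)
    nodes = [{"R": [], "S": []} for _ in range(num_nodes)]

    def place(rows, side, spread_keys, broadcast_keys):
        next_node = 0
        for key, value in rows:
            if key in spread_keys:
                # Partition: spread a skewed key's tuples evenly over nodes.
                nodes[next_node % num_nodes][side].append((key, value))
                next_node += 1
            elif key in broadcast_keys:
                # Broadcast: the other side is skewed on this key, so every
                # node needs a copy of this tuple.
                for node in nodes:
                    node[side].append((key, value))
            else:
                # Hash: non-skewed keys go to one node chosen by hash.
                nodes[stable_hash(key) % num_nodes][side].append((key, value))

    place(r, "R", skew_r, skew_s)
    place(s, "S", skew_s, skew_r)
    return nodes


def local_join(node):
    # Each node joins only the tuples it holds; no cross-node lookups.
    index = defaultdict(list)
    for key, value in node["S"]:
        index[key].append(value)
    return [(key, rv, sv) for key, rv in node["R"] for sv in index.get(key, ())]
```

Under these assumptions every matching pair meets on exactly one node: a skewed tuple lives on a single node and finds a broadcast copy of each partner there, while non-skewed partners hash to the same node. The union of the per-node `local_join` results therefore equals the ordinary equi-join, and the tuples of a hot key are spread across nodes instead of piling up on one reducer.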