首页 | 本学科首页   官方微博 | 高级检索  
     

MapReduce连接查询的I/O代价研究
引用本文:宋杰,李甜甜,朱志良,鲍玉斌,于戈. MapReduce连接查询的I/O代价研究[J]. 软件学报, 2015, 26(6): 1438-1456
作者姓名:宋杰  李甜甜  朱志良  鲍玉斌  于戈
作者单位:东北大学 软件学院, 辽宁 沈阳 110819,东北大学 信息科学与工程学院, 辽宁 沈阳 110819,东北大学 软件学院, 辽宁 沈阳 110819,东北大学 信息科学与工程学院, 辽宁 沈阳 110819,东北大学 信息科学与工程学院, 辽宁 沈阳 110819
基金项目:国家自然科学基金(61433008, 61202088, 61402090); 教育部高等学校博士学科点专项科研基金(20130042120006);中国博士后科学基金面上项目(2013M540232); 中央高校基本科研业务费重大科技创新项目(N120817001); 辽宁省博士启动基金(201403314)
摘    要:数据的指数级增长给数据管理和分析带来了严峻的挑战.连接查询是数据分析中一种常用运算,而MapReduce是一种用于大规模数据集并行处理的编程模型,研究基于MapReduce的连接查询代价评估和查询优化,有着学术意义和应用价值.MapReduce连接查询算法的性能主要取决于I/O代价(包括本地和网络I/O),而I/O代价与数据集以及连接运算的特征参数相关,通过对二元连接的I/O代价评估可以优化多元连接执行计划.基于此,首先提出了二元连接查询的I/O代价模型;随后,对现有二元连接算法进行形式化定义和简单扩展,归纳出6种基于MapReduce连接查询算法,并通过算法白盒分析定义它们的I/O代价函数;最后,提出一种多元连接最优执行计划的选择算法.通过实验表明I/O代价模型的正确性且能够准确地反映算法的性能优劣.

关 键 词:连接查询  MapReduce  I/O代价模型  查询优化
收稿时间:2013-06-04
修稿时间:2014-01-21

Research on I/O Cost of MapReduce Join
SONG Jie,LI Tian-Tian,ZHU Zhi-Liang,BAO Yu-Bin and YU Ge. Research on I/O Cost of MapReduce Join[J]. Journal of Software, 2015, 26(6): 1438-1456
Authors:SONG Jie  LI Tian-Tian  ZHU Zhi-Liang  BAO Yu-Bin  YU Ge
Affiliation:Software College, Northeastern University, Shenyang 110819, China,School of Information and Engineering, Northeastern University, Shenyang 110819, China,Software College, Northeastern University, Shenyang 110819, China,School of Information and Engineering, Northeastern University, Shenyang 110819, China and School of Information and Engineering, Northeastern University, Shenyang 110819, China
Abstract:The exponential growth of data has posed serious challenges to the data management and analysis. Join query is a common data analysis operation, and MapReduce is a programming model implemented for parallel processing on large-scale datasets. Therefore the research on MapReduce based join algorithms and its cost model has a certain academic significance and application value. This study believes that the I/O (including the network and the local I/O) cost is the main factor affecting the performance of MapReduce based join algorithm. Furthermore, as the I/O cost is determined by the feature of both datasets and join operation, the executed plan of multi-ways join could be optimized by evaluating the I/O cost of two-ways join. In the study, an I/O cost model of two-ways join is proposed and then formally defined as a simple extension to the existing MapReduce based join algorithms, resulting in six join algorithms and their I/O cost functions through write-box analysis. In addition, an selection algorithm to find the best executed plan of multi-ways join is presented. The correctness and accuracy of the I/O cost model are validated through a series of experiments. The experiment results suggest that the I/O cost can accurately reflect the algorithm performance.
Keywords:join  MapReduce  I/O cost model  query optimization
本文献已被 万方数据 等数据库收录!
点击此处可从《软件学报》浏览原始摘要信息
点击此处可从《软件学报》下载全文
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号