Efficiency Optimization Method for MapReduce Similarity Computing Based on Spark

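Citation:	LIAO Bin, ZHANG Tao, YU Jiong, GUO Bing-lei, LIU Yan. Efficiency Optimization Method for MapReduce Similarity Computing Based on Spark[J]. Computer Science, 2017, 44(8): 46-53.

Authors:	LIAO Bin, ZHANG Tao, YU Jiong, GUO Bing-lei, LIU Yan

Affiliation:	College of Statistics and Information, Xinjiang University of Finance and Economics, Urumqi 830012, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; School of Medical Engineering Technology, Xinjiang Medical University, Urumqi 830011, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China, School of Software, Tsinghua University, Beijing 100084, China

Funding:	This work was supported by the National Natural Science Foundation of China (61562078, 61262088, 71261025), the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2016D01B014), and the Doctoral Start-up Fund of Xinjiang University of Finance and Economics (2015BS007).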

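Abstract:	As internet users and content grow exponentially, similarity computation on large-scale data places higher demands on algorithm efficiency. To improve execution efficiency, this paper analyzes the execution shortcomings of the algorithm under the MapReduce architecture and, exploiting Spark's suitability for iterative and interactive jobs, ports the algorithm from the MapReduce platform to the Spark platform on the basis of a two-dimensional partitioning algorithm; parameter tuning, memory optimization, and similar methods further improve execution efficiency. Experiments with 2 datasets on 3 clusters of different sizes show that, compared with MapReduce, the algorithm's execution efficiency on Spark is 4.715 times higher on average, and its average energy consumption is only 24.86% of Hadoop's, an energy-efficiency improvement of about 4 times.
Keywords:	Similarity computing; MapReduce; Spark optimization; Energy optimization
Received:	2016/6/20
Revised:	2016/9/29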
Efficiency Optimization Method for MapReduce Similarity Computing Based on Spark

LIAO Bin, ZHANG Tao, YU Jiong, GUO Bing-lei and LIU Yan. Efficiency Optimization Method for MapReduce Similarity Computing Based on Spark[J]. Computer Science, 2017, 44(8): 46-53.

Authors:	LIAO Bin, ZHANG Tao, YU Jiong, GUO Bing-lei and LIU Yan

Affiliation:	College of Statistics and Information, Xinjiang University of Finance and Economics, Urumqi 830012, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; School of Medical Engineering Technology, Xinjiang Medical University, Urumqi 830011, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China and School of Software, Tsinghua University, Beijing 100084, China

Abstract:	With the exponential growth of both internet users and content, similarity computation on big data demands more efficient algorithms. To improve performance, the shortcomings of executing the algorithm under the MapReduce architecture were analyzed. Because Spark is well suited to iterative and interactive tasks, the algorithm, based on the 2D partition algorithm, was ported from MapReduce to Spark, and its efficiency was further improved through parameter adjustment, memory optimization, and similar measures. Experiments with 2 data sets on 3 clusters of different sizes indicate that, compared with MapReduce, the execution efficiency of the algorithm on the Spark platform is 4.715 times higher on average, and its average energy consumption is only 24.86% of that of Hadoop, so its energy efficiency is about 4 times higher.
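
As a quick check on the last figure, the following is a minimal worked calculation. It assumes that both platforms complete the same workload and that energy efficiency is work done per unit of energy; the symbols $\eta$ and $E$ are introduced here for illustration and do not appear in the abstract.

```latex
\[
\frac{\eta_{\text{Spark}}}{\eta_{\text{Hadoop}}}
  = \frac{E_{\text{Hadoop}}}{E_{\text{Spark}}}
  = \frac{1}{0.2486}
  \approx 4.02
\]
```

Under these assumptions, consuming 24.86% of the energy corresponds to roughly a fourfold energy-efficiency ratio, which matches the "about 4 times" stated above.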

Keywords:	Similarity computing; MapReduce; Spark optimization; Energy optimization
