
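基于Spark的大规模文本k-means并行聚类算法 (Parallel k-means Clustering Algorithm for Large-Scale Text Based on Spark)
Cite this article: LIU Peng, TENG Jiayu, DING Enjie, MENG Lei. 基于Spark的大规模文本k-means并行聚类算法 [J]. 中文信息学报 (Journal of Chinese Information Processing), 2017, 31(4): 145-153.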
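Authors: LIU Peng (刘鹏), TENG Jiayu (滕家雨), DING Enjie (丁恩杰), MENG Lei (孟磊)
Affiliations: 1. Internet of Things (Perception Mine) Research Center, China University of Mining and Technology, Xuzhou, Jiangsu 221008, China;
2. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou, Jiangsu 221008, China;
3. School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China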
Funding: National Natural Science Foundation of China (41302203)
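Abstract (translated from the Chinese): The surge in Internet text data has significantly lengthened the processing time of clustering operations on it. Although researchers have studied k-means parallelization on the Hadoop architecture, Hadoop struggles to accommodate the frequent iterations that k-means requires, so execution efficiency remains unsatisfactory. This paper proposes a parallel k-means text clustering algorithm based on Spark, a new-generation parallel computing system, and uses the RDD programming model to meet the need of k-means for frequent iterative computation. Experimental results show that, on the same large text clustering dataset and in the same computing environment, the Spark-based parallel k-means text clustering algorithm clearly outperforms the Hadoop-based implementation on key performance metrics such as speedup and scalability, and can therefore better meet the needs of large-scale text data mining algorithms.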

Keywords (translated from the Chinese): k-means  parallelization  text clustering  Spark  RDD  Hadoop  MapReduce

Parallel K-means Algorithm for Massive Texts on Spark
LIU Peng,TENG Jiayu,DING Enjie,MENG Lei.Parallel K-means Algorithm for Massive Texts on Spark[J].Journal of Chinese Information Processing,2017,31(4):145-153.
Authors:LIU Peng  TENG Jiayu  DING Enjie  MENG Lei
Affiliation:1. Internet of Things Perception Mine Research Centre, China University of Mining and Technology, Xuzhou, Jiangsu 221008, China;
2. National and Local Joint Engineering Laboratory of Internet Application Technology on Mine, Xuzhou, Jiangsu 221008, China;
3. School of Information and Electrical Engineering, China University of Mining and Technology, Xuzhou, Jiangsu 221116, China
Abstract: The sharp increase in Internet text data has considerably lengthened the time needed to run k-means clustering on it. Parallel architectures such as Hadoop have been applied to k-means, but execution efficiency remains unsatisfactory because they handle the frequent iterations of the algorithm poorly. This paper proposes a parallel k-means text clustering algorithm based on Spark. It uses Spark's in-memory RDD programming model to meet the frequent-iteration requirement of k-means. Experimental results show that, on the same dataset and in the same computing environment, k-means runs much more efficiently on Spark than on Hadoop.
Keywords:k-means  parallelization  text clustering  Spark  RDD  Hadoop  MapReduce  
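Illustrative sketch (not the authors' implementation): the abstract does not give the algorithm's details, so the following plain-Python code only mimics the general map/reduce shape of one parallel k-means iteration. The list `partitions` stands in for the partitions of a cached RDD, the small numeric tuples stand in for document vectors, and the function names `closest`, `kmeans_step`, and `kmeans` are hypothetical. The point it tries to show is that every iteration rereads the same input, which is why keeping that input in memory between iterations matters for k-means.

```python
import math


def closest(point, centers):
    """Return the index of the nearest center (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])),
    )


def kmeans_step(partitions, centers):
    """One iteration: per-partition partial sums, then a merge by cluster."""
    # "Map" side: each partition independently accumulates, for every
    # cluster index, the vector sum and count of the points assigned to it.
    partials = []
    for part in partitions:
        acc = {}
        for point in part:
            i = closest(point, centers)
            s, n = acc.get(i, ([0.0] * len(point), 0))
            acc[i] = ([a + b for a, b in zip(s, point)], n + 1)
        partials.append(acc)

    # "Reduce" side: merge the partial (sum, count) pairs by cluster index.
    merged = {}
    for acc in partials:
        for i, (s, n) in acc.items():
            ms, mn = merged.get(i, ([0.0] * len(s), 0))
            merged[i] = ([a + b for a, b in zip(ms, s)], mn + n)

    # New center = mean of assigned points; an empty cluster keeps its center.
    return [
        [x / merged[i][1] for x in merged[i][0]] if i in merged else list(centers[i])
        for i in range(len(centers))
    ]


def kmeans(partitions, centers, max_iter=20, tol=1e-9):
    """Repeat kmeans_step until the centers stop moving or max_iter is hit."""
    for _ in range(max_iter):
        new_centers = kmeans_step(partitions, centers)
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


# Toy data: two "partitions" of 2-D points and two initial centers.
partitions = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
result = kmeans(partitions, [(0.0, 0.0), (10.0, 10.0)])
print(result)
```

In this sketch only the small per-cluster sums and counts cross partition boundaries, while the points themselves stay where they are and are scanned again in every iteration. On a real cluster the same pattern would presumably be written with RDD transformations over a cached dataset; the paper's actual design may differ.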