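Research on a Text Clustering Method Based on MapReduce
Cite this article: LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng, YANG Chun. Research on a Text Clustering Method Based on MapReduce[J]. Computer Science, 2016, 43(1): 246-250, 269
Authors: LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng, YANG Chun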
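Affiliations:
LI Zhao: School of Software Engineering, Beijing Jiaotong University, Beijing 100044; Shandong Computer Science Center (National Supercomputing Center in Jinan), Jinan 250014; Shandong Provincial Key Laboratory of Computer Network, Jinan 250014; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014
LI Xiao: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014
WANG Chun-mei: Shandong Computer Science Center (National Supercomputing Center in Jinan), Jinan 250014; Shandong Provincial Key Laboratory of Computer Network, Jinan 250014
LI Cheng: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014
YANG Chun: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014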
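Funding: Supported by the National Natural Science Foundation of China (61472230) and the Shandong Province Science and Technology Development Plan (2013GZC20102).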
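Abstract: In text clustering, the similarity measure is an important factor in clustering quality. Commonly used similarity measures, such as Euclidean distance and the correlation coefficient, can only describe low-order correlations between texts. Because the relationships between texts are very complex, clustering based on low-order correlation measures performs poorly. Some text clustering methods based on more complex measures have been proposed, but the computational cost of text clustering keeps growing with the scale of the data, and traditional clustering methods are no longer suitable for large-scale text clustering. To address these problems, a distributed clustering method based on MapReduce is proposed. It improves the traditional K-means algorithm by adopting a similarity measure based on information loss. To further improve clustering efficiency, the method is combined with a MapReduce-based principal component analysis (PCA) method that reduces the dimensionality of the text feature vectors. A case analysis shows that the proposed large-scale text clustering method achieves better clustering performance than existing clustering methods.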

Keywords: Text clustering; MapReduce; K-means; Information loss
Received: 2015-06-01
Revised: 2015-10-24
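
The abstract describes K-means driven by an information-loss similarity measure and run as MapReduce jobs, but it does not give the formula. The sketch below is only an illustration under stated assumptions: documents are term-count vectors turned into probability distributions, the information loss of assigning a document to a cluster is approximated by the KL divergence from the document to the cluster centroid, and the map and reduce steps are plain Python functions rather than Hadoop jobs. The names `normalize`, `info_loss`, `map_phase`, `reduce_phase`, and `kmeans` are hypothetical and do not come from the paper.

```python
import math
from collections import defaultdict

EPS = 1e-9  # smoothing constant (an assumption) so log() never sees zero


def normalize(vec):
    """Turn non-negative term counts into a smoothed probability distribution."""
    total = sum(vec) + EPS * len(vec)
    return [(v + EPS) / total for v in vec]


def info_loss(p, q):
    """KL(p || q), used here as a stand-in for the paper's information loss."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def map_phase(docs, centroids):
    """Map: emit (index of the lowest-loss centroid, (distribution, 1))."""
    for doc in docs:
        p = normalize(doc)
        best = min(range(len(centroids)),
                   key=lambda k: info_loss(p, centroids[k]))
        yield best, (p, 1)


def reduce_phase(pairs, centroids):
    """Reduce: average the distributions assigned to each centroid."""
    sums = {}
    counts = defaultdict(int)
    for k, (p, n) in pairs:
        acc = sums.setdefault(k, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[k] += n
    updated = list(centroids)  # a cluster with no documents keeps its centroid
    for k, acc in sums.items():
        updated[k] = [v / counts[k] for v in acc]
    return updated


def kmeans(docs, init_centroids, iterations=10):
    """Run a fixed number of map/reduce rounds, then label every document."""
    centroids = [normalize(c) for c in init_centroids]
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(docs, centroids), centroids)
    labels = [k for k, _ in map_phase(docs, centroids)]
    return centroids, labels


# Toy term-count vectors: two documents per topic.
docs = [[5, 1, 0], [4, 2, 0], [0, 1, 5], [0, 2, 4]]
centroids, labels = kmeans(docs, [docs[0], docs[2]], iterations=5)
print(labels)
```

In a real MapReduce deployment the map function would run on document shards and the reduce function would receive pairs grouped by centroid index. The single-process generator here only mirrors that data flow.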

Text Clustering Method Study Based on MapReduce
LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng and YANG Chun. Text Clustering Method Study Based on MapReduce[J]. Computer Science, 2016, 43(1): 246-250, 269
Authors: LI Zhao, LI Xiao, WANG Chun-mei, LI Cheng, YANG Chun
Affiliation:
LI Zhao: School of Software Engineering, Beijing Jiaotong University, Beijing 100044, China; Shandong Computer Science Center (National Supercomputing Center in Jinan), Jinan 250014, China; Shandong Provincial Key Laboratory of Computer Network, Jinan 250014, China; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014, China
LI Xiao: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014, China; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014, China
WANG Chun-mei: Shandong Computer Science Center (National Supercomputing Center in Jinan), Jinan 250014, China; Shandong Provincial Key Laboratory of Computer Network, Jinan 250014, China
LI Cheng: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014, China; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014, China
YANG Chun: Shandong Provincial Key Laboratory of Computer Network, Jinan 250014, China; Shandong E-Government Big Data Engineering Technology Research Center, Jinan 250014, China
Abstract: Text clustering is a key technology for text organization, information extraction, and topic retrieval. Choosing an appropriate similarity measure is an important part of clustering and has a large effect on the results. Classical similarity measures, such as distance functions and the correlation coefficient, can only describe linear relationships between documents, so clustering based on them is usually unsatisfactory given the complicated relationships among text documents. More sophisticated clustering methods have been studied, but their computational cost rises markedly as the volume of text data grows, and classical clustering methods cannot cope with large-scale datasets. This paper proposes a distributed clustering method based on MapReduce for large-scale text clustering. It uses an improved version of the k-means algorithm that takes information loss as the similarity function. To speed up clustering, a parallel PCA method based on MapReduce reduces the dimensionality of the document vectors. Experimental results show that the proposed method is more efficient for text clustering than classic clustering methods.
Keywords: Text clustering; MapReduce; K-means; Information loss
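
The abstract also mentions a parallel PCA step based on MapReduce for reducing document vector dimension, without detailing it. One common way to parallelize PCA, shown here purely as an assumption and not as the authors' procedure, is to have each mapper emit partial sufficient statistics (count, feature sums, sum of outer products), merge them in a reducer, and then extract principal directions from the resulting covariance matrix. The helper names below (`map_stats`, `reduce_stats`, `covariance`, `top_component`, `project`) are hypothetical, and the sketch extracts only the leading component by power iteration.

```python
import math
from functools import reduce


def map_stats(partition):
    """Map: per-partition count, feature sums, and sum of outer products."""
    d = len(partition[0])
    n, s = 0, [0.0] * d
    outer = [[0.0] * d for _ in range(d)]
    for x in partition:
        n += 1
        for i in range(d):
            s[i] += x[i]
            for j in range(d):
                outer[i][j] += x[i] * x[j]
    return n, s, outer


def reduce_stats(a, b):
    """Reduce: merge two partial statistics by element-wise addition."""
    n = a[0] + b[0]
    s = [u + v for u, v in zip(a[1], b[1])]
    outer = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, s, outer


def covariance(stats):
    """Population covariance from merged statistics: E[xx^T] - mean mean^T."""
    n, s, outer = stats
    d = len(s)
    mean = [v / n for v in s]
    cov = [[outer[i][j] / n - mean[i] * mean[j] for j in range(d)]
           for i in range(d)]
    return mean, cov


def top_component(cov, steps=200):
    """Power iteration for the leading eigenvector of the covariance matrix.

    Starts from a uniform vector, which fails only if that vector happens to
    be orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(steps):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def project(x, mean, component):
    """Coordinate of a centered vector along one principal direction."""
    return sum((xi - mi) * ci for xi, mi, ci in zip(x, mean, component))


# Two toy partitions whose points all lie on the line y = 2x.
partitions = [[[1.0, 2.0], [2.0, 4.0]], [[3.0, 6.0], [4.0, 8.0]]]
stats = reduce(reduce_stats, (map_stats(p) for p in partitions))
mean, cov = covariance(stats)
component = top_component(cov)
reduced = [project(x, mean, component) for p in partitions for x in p]
```

Only the small d-by-d statistics cross the network in this layout, which is why the approach suits many documents with a moderate feature dimension. For very high-dimensional term vectors the outer-product matrix itself becomes the bottleneck and a different scheme would be needed.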