基于MapReduce的大规模文本聚类并行化 (Parallel clustering of very large document datasets with MapReduce)

Cite this article:	武森, 冯小东, 杨杰, 张晓楠. 基于MapReduce的大规模文本聚类并行化[J]. 工程科学学报, 2014, 36(10): 1411-1419. DOI: 10.13374/j.issn1001-053x.2014.10.019

Authors:	武森, 冯小东, 杨杰, 张晓楠

Affiliation:	1. Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083

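Funding:	National Natural Science Foundation of China; Specialized Research Fund for the Doctoral Program of Higher Education; Fundamental Research Funds for the Central Universities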

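Abstract:	Establishing fast and effective clustering methods for large-scale text data is a hot topic in current data mining research and applications. To preserve clustering quality while improving clustering efficiency, this paper proposes, based on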
Keywords:	cloud computing; text clustering; similarity
Received:	2013-09-30
Parallel clustering of very large document datasets with MapReduce

WU Sen, FENG Xiao-dong, YANG Jie, ZHANG Xiao-nan. Parallel clustering of very large document datasets with MapReduce[J]. Chinese Journal of Engineering, 2014, 36(10): 1411-1419. DOI: 10.13374/j.issn1001-053x.2014.10.019

Authors:	WU Sen, FENG Xiao-dong, YANG Jie, ZHANG Xiao-nan

Affiliation:	1. Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

Abstract:	Developing fast and efficient methods for clustering massive document collections is a hot topic in current data mining research and applications. To preserve clustering quality while improving clustering efficiency, a document clustering algorithm is proposed that searches for a pair of documents with minimum similarity to each other, and distributed parallel computing models for it are provided. First, a document similarity measure is defined using a vector space model (VSM). Next, a bisecting clustering procedure is introduced that combines bisecting K-means with the proposed initial cluster center selection approach, so that the optimized cluster centroids are found in a single partitioning. Finally, a distributed parallel document clustering model is designed for cloud computing on the MapReduce framework. Experiments on the Hadoop platform with real document datasets show that the new algorithm is clearly more efficient than the original bisecting K-means while producing an equivalent clustering result. The scalability of the parallel clustering is also evaluated for different data sizes and different numbers of computation nodes.

Keywords:	cloud computing; document clustering; similarity
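
Illustration:	The abstract names three ingredients: a VSM-based similarity measure, initial centers chosen from a document pair with minimum similarity, and a bisecting split. The sketch below shows one way these could fit together on a single machine. It is not the authors' implementation. Cosine similarity over raw term-frequency vectors, the brute-force pair search, and every function name here are assumptions made for illustration, and the MapReduce distribution is not attempted.

```python
import math
from collections import Counter
from itertools import combinations


def tf_vector(text):
    """Term-frequency vector of a lowercased, whitespace-tokenized document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors (assumed measure)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def least_similar_pair(vectors):
    """Indices (i, j) of the pair with the lowest similarity (brute force, O(n^2))."""
    return min(combinations(range(len(vectors)), 2),
               key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))


def bisect_once(vectors):
    """Split documents into two clusters, seeded by the least similar pair."""
    i, j = least_similar_pair(vectors)
    left, right = [], []
    for k, v in enumerate(vectors):
        # Assign each document to the closer of the two seed documents.
        if cosine(v, vectors[i]) >= cosine(v, vectors[j]):
            left.append(k)
        else:
            right.append(k)
    return left, right


# Hypothetical toy corpus: two documents about Hadoop, two about football.
docs = [
    "hadoop mapreduce cluster node",
    "mapreduce hadoop job node",
    "football match goal team",
    "team goal football league",
]
vectors = [tf_vector(d) for d in docs]
left, right = bisect_once(vectors)
print(left, right)
```

On this toy corpus the two Hadoop documents land in one cluster and the two football documents in the other. The pairwise search is quadratic in the number of documents, which is the kind of cost a distributed version would need to spread across nodes.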
This article is indexed in Wanfang Data and other databases.