
基于向量空间模型的文档聚类研究 (Vector Space Model-Based Document Clustering Research)

Cite this article:	许伟佳. 基于向量空间模型的文档聚类研究[J]. 数字社区&智能家居, 2009, 5(9): 7281-7283, 7286.

Author:	许伟佳 (XU Wei-jia)

Affiliation:	School of Software Engineering, Tongji University, Shanghai 201804, China

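Abstract:	Document clustering, the application of cluster analysis to text processing, plays an important role in Web text mining. This article introduces text representation based on the vector space model, then analyzes and optimizes the evaluation function for feature-term weights in that model so that distance-based similarity measures become more accurate. It focuses on the partition-based k-means algorithm, which is widely used in Web document clustering. To address the weakness of k-means in choosing its initial cluster centers at random, it describes in detail how the max-min distance principle, combined with sampling, can stabilize the selection of initial cluster centers and improve the clustering result.
Keywords:	document clustering; k-means algorithm; vector space model; weight evaluation function; max-min distance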
Vector Space Model-Based Document Clustering Research

XU Wei-jia.Vector Space Model-Based Document Clustering Research[J].Digital Community & Smart Home,2009,5(9):7281-7283,7286.

Authors:	XU Wei-jia

Affiliation:	XU Wei-jia (School of Software Engineering, Tongji University, Shanghai 201804, China)

Abstract:	Document clustering, the application of cluster analysis to text processing, plays an important role in web text mining. This paper first introduces the Vector Space Model, which represents documents as vectors (or points) in a multidimensional space. To improve the accuracy of similarity measurement between documents, it defines a more reasonable way to evaluate the weight of the terms contained in a document. It then analyzes in detail the partitioning-based K-means algorithm, which is widely used in document clustering. Because K-means selects its initial start points at random, the paper adopts an iterative max-min distance method combined with sampling techniques to optimize the selection of initial clustering points, which helps improve the final clustering result.

Keywords:	document clustering; k-means algorithm; Vector Space Model; term weight evaluation; max-min distance
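
The initialization idea summarized in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions: the abstract does not state the paper's optimized weight function, so plain TF-IDF stands in for it; the sampling step is simple random sampling; and the names `tfidf_vectors`, `distance`, and `max_min_seeds` are hypothetical, chosen only for this illustration.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors (dicts).

    Assumption: standard TF-IDF, not the paper's optimized weight function.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def distance(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def max_min_seeds(vectors, k, sample_size=None, seed=0):
    """Return indices of k initial centers chosen by the max-min distance rule.

    Each new center is the candidate whose distance to its nearest
    already-chosen center is largest. Candidates come from a random
    sample when sample_size is given (assumed sampling scheme).
    """
    rng = random.Random(seed)
    pool = list(range(len(vectors)))
    if sample_size is not None and sample_size < len(pool):
        pool = rng.sample(pool, sample_size)
    if k > len(pool):
        raise ValueError("k exceeds the number of candidate documents")
    centers = [pool[0]]  # first center: first candidate in the pool
    while len(centers) < k:
        best = max(
            (i for i in pool if i not in centers),
            key=lambda i: min(distance(vectors[i], vectors[c]) for c in centers),
        )
        centers.append(best)
    return centers


docs = [
    ["cat", "dog"],
    ["cat", "dog", "dog"],
    ["stock", "market"],
    ["stock", "price"],
]
seeds = max_min_seeds(tfidf_vectors(docs), k=2)
```

The returned indices would then serve as the starting centers for an ordinary k-means loop. Because the seeds are spread as far apart as possible within the candidate pool, two near-duplicate documents are unlikely to both be chosen, which is the instability of random initialization that the abstract describes.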
This article is indexed in VIP (维普) and other databases.
