
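Research on Document Clustering Based on the Vector Space Model
Cite this article: XU Wei-jia. Research on Document Clustering Based on the Vector Space Model[J]. Digital Community & Smart Home, 2009, 5(9): 7281-7283, 7286.
Author: XU Wei-jia
Affiliation: School of Software Engineering, Tongji University, Shanghai 201804
Abstract: Document clustering plays an important role in Web text mining and is an application of cluster analysis to text processing. This article introduces a text representation method based on the vector space model, and analyzes and optimizes the evaluation function for feature term weights in that model so that distance-based similarity measures become more accurate. It focuses on the partitioning-based k-means algorithm, which is widely used in Web document clustering. To address the weakness that k-means selects its initial cluster centers at random, it describes in detail an approach that applies the max-min distance principle, combined with the idea of sampling, to stabilize the selection of the initial cluster centers and improve the clustering results.

Keywords: document clustering  k-means algorithm  vector space model  weight evaluation function  max-min distance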

Vector Space Model-Based Document Clustering Research
XU Wei-jia.Vector Space Model-Based Document Clustering Research[J].Digital Community & Smart Home,2009,5(9):7281-7283,7286.
Authors:XU Wei-jia
Affiliation:XU Wei-jia (School of Software Engineering, Tongji University, Shanghai 201804, China)
Abstract: Document clustering, an application of cluster analysis to text processing, plays an important role in web text mining. This paper first introduces the Vector Space Model, which represents documents as vectors (or points) in a multidimensional space. To improve the accuracy of similarity measurement between documents, it defines a more reasonable way to evaluate the weight of the terms contained in a document. It then analyzes in detail the partitioning-based k-means algorithm, which is widely used in document clustering. Because k-means selects its initial cluster centers randomly, the paper adopts an iterative max-min distance method combined with sampling techniques to optimize the selection of the initial centers, which helps improve the final clustering result.
Keywords: document clustering  k-means algorithm  Vector Space Model  term weight evaluation  max-min distance method
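
The pipeline summarized in the abstract (term weighting in a vector space, max-min distance seeding on a sample, then k-means) can be illustrated with a minimal sketch. This is not the paper's implementation. The abstract does not give the optimized weight function, so plain TF-IDF stands in for it. The seeding rule shown is a common fixed-k simplification of the max-min distance idea, and it assumes the sample holds at least k distinct points. All function names and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Map tokenized documents to L2-normalized TF-IDF vectors."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vectors = []
    for d in docs:
        tf = Counter(d)
        # Standard tf * log(N / df); a stand-in for the paper's weight function.
        v = [tf[t] * math.log(n / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_seeds(points, k, sample_size, rng):
    """Pick k seeds from a random sample with the max-min distance rule."""
    sample = rng.sample(points, min(sample_size, len(points)))
    seeds = [sample[0]]
    while len(seeds) < k:
        # Next seed: the sampled point farthest from its nearest chosen seed.
        seeds.append(max(sample, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds


def kmeans(points, seeds, max_iter=100):
    """Plain k-means started from the given seeds; returns (labels, centers)."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy corpus (hypothetical): two documents per topic.
docs = [
    ["web", "mining", "text"],
    ["text", "mining", "cluster"],
    ["football", "match", "goal"],
    ["goal", "match", "team"],
]
vectors = tfidf_vectors(docs)
seeds = maxmin_seeds(vectors, k=2, sample_size=4, rng=random.Random(0))
labels, centers = kmeans(vectors, seeds)
```

Because each new seed is the point farthest from the seeds already chosen, the two seeds in this toy run come from different topics, so the cluster assignment no longer depends on a lucky random start. On a large corpus, `sample_size` trades seeding cost against how well the sample covers the data.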
This article is indexed in VIP (维普) and other databases.