
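Adaptive Text Clustering Based on Statistical Learning
Cite this article: Wang Zonghu, Liu Zhijing, Chen Donghui. Adaptive Text Clustering Based on Statistical Learning [J]. Journal of Sichuan University (Engineering Science Edition), 2012, 44(1): 106-111.
Authors: Wang Zonghu, Liu Zhijing, Chen Donghui
Affiliation: School of Computer Science, Xidian University, Xi'an, Shaanxi 710071
Funding: National Key Technology R&D Program (2007BAH08802); Shaanxi Province "13115" Science and Technology Innovation Project, Major Special Project (2007ZDKG-57)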
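Abstract: The high dimensionality and sparsity of text data make traditional clustering algorithms perform unsatisfactorily on text. To address this, the method computes a document similarity matrix and, during clustering, dynamically learns statistics about the partitioned and unpartitioned document sets. It detects the largest dense region in the remaining unpartitioned data that has little overlap with the clusters already formed, and in this way gradually generates a predetermined number of initial cluster-center sets. Finally, each remaining document is assigned to the most similar initial cluster-center set, which effectively reduces the sensitivity of partitional clustering to the initial cluster centers. The threshold parameters in the algorithm are all obtained by statistical learning on the dataset during clustering, avoiding the arbitrariness of setting them by experience or experiment as most clustering algorithms do, on different…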

Keywords: clustering; vector space model; similarity; partition; threshold
Received: 2011-05-25
Revised: 2011-09-19
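
The abstract starts from a document similarity matrix over a vector space model. As a minimal sketch of that first step, the block below builds term-frequency vectors and a cosine similarity matrix. The whitespace tokenizer, raw term-frequency weighting, and the function names `tf_vector`, `cosine`, and `similarity_matrix` are assumptions made for illustration; the record does not state which weighting or similarity measure the authors used.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # Empty documents have no direction, so treat their similarity as 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(docs):
    """Pairwise cosine similarities for a list of document strings."""
    vecs = [tf_vector(d) for d in docs]
    n = len(vecs)
    return [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]


docs = [
    "text clustering algorithm",
    "clustering algorithm for text",
    "stock market prices",
]
S = similarity_matrix(docs)
```

Documents that share no terms get a similarity of exactly 0, which is where the sparsity mentioned in the abstract shows up in practice.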

Research of Adaptive Text Clustering Based on the Statistics of the Datasets
Wang Zonghu, Liu Zhijing, Chen Donghui. Research of Adaptive Text Clustering Based on the Statistics of the Datasets [J]. Journal of Sichuan University (Engineering Science Edition), 2012, 44(1): 106-111.
Authors: Wang Zonghu, Liu Zhijing, Chen Donghui
Affiliation: School of Computer Science and Technology, Xidian University (all authors)
Abstract: Because text data are high-dimensional and sparse, traditional clustering algorithms often perform unsatisfactorily on text. By learning the similarity information between the partitioned and remaining sets, the algorithm gradually selects the largest dense region that has a small coverage rate with the already partitioned clusters as an initial cluster centroid set. After the predetermined number of initial cluster centroid sets has been generated, the remaining documents are assigned to their nearest clusters. In this way, the sensitivity of the clustering algorithm to the initial cluster centroids is reduced. Some threshold values used in the algorithm are computed dynamically from statistics of the dataset during clustering, which avoids setting threshold parameters blindly by experience or experiment, as most clustering algorithms do. Experiments on several Chinese and English datasets showed that the algorithm performs better than the clustering algorithms in CLUTO.
Keywords: clustering; VSM; similarity; partition; threshold
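
The procedure in the abstract (pick dense regions with little overlap with existing clusters as initial centroid sets, then assign the remaining documents) can be sketched roughly as follows. This is an illustrative reading, not the authors' algorithm: the abstract does not give the formulas, so the threshold (mean off-diagonal similarity), the density score (neighbours among unassigned documents minus neighbours among assigned ones), and the names `mean_offdiag` and `cluster` are all assumptions.

```python
def mean_offdiag(S):
    """Data-derived threshold: mean of off-diagonal similarities (an assumption)."""
    n = len(S)
    vals = [S[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(vals) / len(vals) if vals else 0.0


def cluster(S, k):
    """Seed k initial centroid sets from dense regions, then assign leftovers.

    S is a symmetric similarity matrix (list of lists); returns one label per row.
    """
    n = len(S)
    thr = mean_offdiag(S)
    label = [-1] * n          # -1 marks a document that is not yet partitioned
    seeds = []                # seeds[c] holds the members of initial centroid set c

    for c in range(k):
        free = [i for i in range(n) if label[i] == -1]
        if not free:
            break

        def score(i):
            # Reward density among unpartitioned documents and penalize
            # coverage by documents that already belong to a cluster.
            dense = sum(1 for j in free if S[i][j] >= thr)
            covered = sum(1 for j in range(n)
                          if label[j] != -1 and S[i][j] >= thr)
            return dense - covered

        center = max(free, key=score)
        members = [center] + [j for j in free
                              if j != center and S[center][j] >= thr]
        for j in members:
            label[j] = c
        seeds.append(members)

    # Assign each remaining document to the seed set it is most similar to on average.
    for i in range(n):
        if label[i] == -1:
            label[i] = max(
                range(len(seeds)),
                key=lambda c: sum(S[i][j] for j in seeds[c]) / len(seeds[c]),
            )
    return label


# Hypothetical similarity matrix with two obvious groups: {0, 1, 2} and {3, 4}.
S_demo = [
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.05, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.10],
    [0.10, 0.05, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.10, 0.90, 1.00],
]
labels = cluster(S_demo, 2)
```

Because the threshold is computed from the matrix itself, no parameter other than the number of clusters `k` has to be chosen by hand, which mirrors the abstract's point about avoiding manually tuned thresholds.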
This article is indexed in CNKI, Wanfang Data, and other databases.