Fast hybrid clustering for Web documents

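Cite this article:	YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao. Fast hybrid clustering for Web documents[J]. Computer Engineering and Applications, 2010, 46(22): 12-15.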

Author names:	YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao

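Author affiliations:	1. College of Computer Science, Chongqing University, Chongqing 400044; 2. College of Computer Science, Chongqing University, Chongqing 400044; Logistical Engineering University, Chongqing 400016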

Funding:	National Science and Technology Support Program of China; Chongqing Science and Technology Support Program

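Abstract:	This paper proposes STK-means, a fast hybrid clustering method that uses the suffix tree clustering (STC) algorithm to optimize the initial values of K-means document clustering. The method first builds a suffix tree model of the document set, uses the suffix tree clustering algorithm to identify initial clusters, and extracts the initial center values for the K-means clustering algorithm. Next, the nodes of the suffix tree model are mapped to feature terms in an M-dimensional vector space model, and phrase-based document vector feature values are computed with the TF-IDF scheme. Finally, the K-means algorithm produces the clustering result. Experimental results show that the method outperforms both the traditional K-means clustering algorithm and the suffix tree clustering algorithm, while retaining the fast clustering speed of these algorithms.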
Keywords:	clustering algorithms; K-means algorithm; suffix tree; Web document clustering; phrase-based similarity
Received:	2010-4-2
Revised:	2010-5-28
Fast hybrid clustering for Web documents

YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao. Fast hybrid clustering for Web documents[J]. Computer Engineering and Applications, 2010, 46(22): 12-15.

Authors:	YANG Rui-long, ZHU Qing-sheng, XIE Hong-tao

Affiliation:	1. College of Computer Science, Chongqing University, Chongqing 400044, China; 2. Logistical Engineering University, Chongqing 400016, China

Abstract:	A fast hybrid clustering algorithm for Web documents is proposed that optimizes the initial center values of the K-means algorithm by means of the suffix tree clustering (STC) algorithm. First, the Web document set is clustered by the STC algorithm and the initial center values are extracted. Second, each internal node of the suffix tree is mapped to a feature term in an M-dimensional vector space model (VSM), and each feature term weight is computed using TF-IDF extended with phrases. Finally, the K-means algorithm generates the final result. The evaluation experiments indicate that the new hybrid algorithm clusters documents more effectively than the ordinary K-means and STC algorithms. Moreover, it is as fast as the K-means and STC algorithms.

Keywords:	clustering algorithms; K-means algorithm; suffix tree; Web document clustering; phrase-based similarity
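
The three steps named in the abstract (phrase-based initial clusters, phrase TF-IDF vectors, seeded K-means) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it makes several simplifying assumptions: word n-grams shared by at least two documents stand in for the phrases at suffix tree nodes, initial clusters are picked greedily rather than by STC's base-cluster merging, weights are raw term frequency times log(N/df), and distance is squared Euclidean. All function names (`base_clusters`, `stk_means_sketch`, and so on) are hypothetical.

```python
import math
from collections import defaultdict

def phrases(tokens, max_len=3):
    """All word n-grams up to max_len (assumed stand-in for suffix tree node labels)."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(tokens) - n + 1)]

def base_clusters(docs, max_len=3):
    """Map each phrase shared by >= 2 documents to the set of documents containing it."""
    index = defaultdict(set)
    for d, tokens in enumerate(docs):
        for p in phrases(tokens, max_len):
            index[p].add(d)
    return {p: ds for p, ds in index.items() if len(ds) >= 2}

def initial_clusters(features, k):
    """Greedily pick up to k high-scoring, non-overlapping base clusters (simplified STC)."""
    ranked = sorted(features.items(),
                    key=lambda kv: (-len(kv[1]) * len(kv[0]), kv[0]))
    chosen, covered = [], set()
    for _, ds in ranked:
        if not ds & covered:
            chosen.append(sorted(ds))
            covered |= ds
        if len(chosen) == k:
            break
    return chosen

def tfidf_vectors(docs, features, max_len=3):
    """Phrase-based TF-IDF: one dimension per shared phrase."""
    n = len(docs)
    col = {p: j for j, p in enumerate(sorted(features))}
    vecs = [[0.0] * len(col) for _ in docs]
    for d, tokens in enumerate(docs):
        for p in phrases(tokens, max_len):
            if p in col:
                vecs[d][col[p]] += 1.0            # raw term frequency
    for p, j in col.items():
        idf = math.log(n / len(features[p]))      # log(N / document frequency)
        for v in vecs:
            v[j] *= idf
    return vecs

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vecs, centers, max_iter=20):
    """Plain K-means that starts from the supplied centers instead of random ones."""
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda c: sqdist(v, centers[c]))
               for v in vecs]
        if new == labels:                         # assignments stable: converged
            break
        labels = new
        for c in range(len(centers)):
            members = [v for v, l in zip(vecs, labels) if l == c]
            if members:
                centers[c] = mean(members)
    return labels

def stk_means_sketch(docs, k):
    feats = base_clusters(docs)
    seeds = initial_clusters(feats, k)
    if not seeds:
        raise ValueError("no phrase is shared by two documents; cannot seed K-means")
    vecs = tfidf_vectors(docs, feats)
    centers = [mean([vecs[d] for d in seed]) for seed in seeds]
    return kmeans(vecs, centers)

docs = [
    "suffix tree clustering groups web documents".split(),
    "suffix tree clustering uses shared phrases".split(),
    "cheap flights to paris this summer".split(),
    "book cheap flights to rome".split(),
]
print(stk_means_sketch(docs, k=2))  # documents 0-1 and 2-3 fall into separate clusters
```

Because the centers are derived from phrase-sharing groups rather than drawn at random, this toy K-means run is deterministic and converges in two passes here. Whether that carries over to real Web document collections depends on the actual STC step, which this sketch only approximates.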
This article is indexed in databases including VIP and Wanfang Data.