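A Web Document Clustering Algorithm Based on Association Rule
Cite this article: SONG Qin-bao, SHEN Jun-yi. A Web Document Clustering Algorithm Based on Association Rule[J]. Journal of Software, 2002, 13(3): 417-423.
Authors: SONG Qin-bao, SHEN Jun-yi
Affiliation: Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
Funding: National Natural Science Foundation of China (60173058); National 863 Program Youth Fund (863-306-QN2000-5)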
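Abstract: Web document clustering can effectively compress the search space, speed up retrieval, and improve query precision. This paper proposes a clustering algorithm for Web documents. The algorithm first represents topics with the vector space model (VSM) and represents documents in terms of topics. It then treats each document as a transaction and each topic as an item, so that the relation between documents and topics takes a transactional form, and applies an association rule mining algorithm to find frequent topic sets; the documents corresponding to each frequent set form a preliminary cluster. Finally, clusters are merged or split according to thresholds on inter-cluster distance and intra-cluster connection strength, which yields the final clustering. Experimental results show that the algorithm is effective, can handle the overlap inherent among document clusters, and has practical value.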

Keywords: document clustering; association rule; Web mining; WWW
Article ID: 1000-9825/2002/13(03)0417-07
Received: April 4, 2000
Revised: April 4, 2000

A Web Document Clustering Algorithm Based on Association Rule
SONG Qin-bao and SHEN Jun-yi. A Web Document Clustering Algorithm Based on Association Rule[J]. Journal of Software, 2002, 13(3): 417-423.
Authors: SONG Qin-bao and SHEN Jun-yi
Abstract: By grouping similar Web documents into clusters, the search space can be reduced, the search accelerated, and its precision improved. In this paper, a new clustering algorithm is introduced. In this technique, topics are represented according to the VSM (vector space model), documents are represented according to topics, and the relation between documents and topics is viewed in a transactional form: each document corresponds to a transaction and each topic corresponds to an item. Frequent itemsets are found using an association rule discovery algorithm, and the corresponding documents are taken as initial clusters. These clusters are merged according to the distance between clusters, or divided according to the strength of connection among the documents of a cluster. Experimental results on real Web documents show the algorithm's effectiveness and its suitability for handling the overlapping clusters inherent in documents.
Keywords: document clustering; association rule; Web mining; WWW
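
The pipeline outlined in the abstract (documents as transactions, topics as items, frequent topic sets as initial clusters, then merging by inter-cluster distance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the brute-force level-wise search, the Jaccard distance, and the greedy merge rule are all assumptions made for illustration; the abstract does not specify the paper's distance or connection-strength measures, and the splitting step is omitted here.

```python
from itertools import combinations


def frequent_topic_sets(doc_topics, min_support, max_size=3):
    """Level-wise frequent itemset search (documents = transactions, topics = items).

    Returns {frozenset(topics): set(doc_ids)} for every topic set that occurs
    in at least `min_support` documents. Hypothetical helper, not from the paper.
    """
    # Level 1: the documents that contain each single topic.
    level = {}
    for doc_id, topics in doc_topics.items():
        for topic in topics:
            level.setdefault(frozenset([topic]), set()).add(doc_id)
    level = {s: d for s, d in level.items() if len(d) >= min_support}
    frequent = dict(level)
    size = 1
    while level and size < max_size:
        size += 1
        next_level = {}
        # Join pairs of frequent sets whose union has exactly `size` topics.
        for (a, docs_a), (b, docs_b) in combinations(level.items(), 2):
            candidate = a | b
            if len(candidate) != size or candidate in next_level:
                continue
            docs = docs_a & docs_b  # documents containing every topic in the union
            if len(docs) >= min_support:
                next_level[candidate] = docs
        frequent.update(next_level)
        level = next_level
    return frequent


def jaccard_distance(a, b):
    """Assumed inter-cluster distance: 1 - |a & b| / |a | b| on document sets."""
    return 1.0 - len(a & b) / len(a | b)


def merge_close_clusters(clusters, max_distance):
    """Greedily merge any two clusters whose distance is at most `max_distance`."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            if jaccard_distance(clusters[i], clusters[j]) <= max_distance:
                clusters[i] |= clusters[j]
                del clusters[j]
                merged = True
                break  # indices changed, restart the scan
    return clusters


# Toy data: each document is described by the set of topics it covers.
docs = {
    "d1": {"python", "web"},
    "d2": {"python", "web", "mining"},
    "d3": {"web", "mining"},
    "d4": {"sports", "football"},
    "d5": {"sports", "football"},
}
frequent = frequent_topic_sets(docs, min_support=2)
# Initial clusters: the distinct document sets behind the frequent topic sets.
initial = sorted({frozenset(d) for d in frequent.values()}, key=sorted)
final = merge_close_clusters(initial, max_distance=0.5)
print(final)
```

Because the initial clusters are the document sets of different frequent topic sets, one document can belong to several of them, which is how an approach of this kind can represent overlapping clusters before any merging takes place.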
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.