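An Incremental Clustering Algorithm for Large-Scale Data and Its Application to Document Clustering
Cite this article: Tang Chunsheng, Jin Yihui. An Incremental Clustering Algorithm for Large-Scale Data and Its Application to Document Clustering[J]. Computer Engineering and Applications, 2002, 38(11): 187-190, 195.
Authors: Tang Chunsheng, Jin Yihui
Affiliation: Department of Automation, Tsinghua University, Beijing 100084, China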
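Abstract: Clustering partitions data and is an effective means of discovering useful information in it; it has important applications in many fields. K-means is one of the more commonly used clustering methods, but for large-scale data under constraints on computing resources and time, it can no longer meet requirements. The CFK-means method proposed in this paper is a fast, efficient incremental clustering method suited to large-scale data. It represents clusters with the clustering features (CF) structure, which retains and exploits cluster information more effectively. It needs only one scan of the data to obtain a clustering partition, its computation time and file-exchange time are several times lower than those of k-means, and its clustering accuracy is comparable to that of k-means. Comparative experiments on simulated data and on real text-collection data demonstrate the effectiveness of CFK-means.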

Keywords: clustering features (CF)  CFK-means algorithm  k-means algorithm  document clustering
Article ID: 1002-8331-(2002)11-0187-04

An Incremental Clustering Algorithm for Large-Scale Data and Its Application on Document Clustering
Tang Chunsheng, Jin Yihui. An Incremental Clustering Algorithm for Large-Scale Data and Its Application on Document Clustering[J]. Computer Engineering and Applications, 2002, 38(11): 187-190, 195.
Authors: Tang Chunsheng, Jin Yihui
Abstract: Clustering is an effective way to discover valuable information in data, and it is applied in many domains. K-means is an important clustering method, but it is difficult to use on large-scale data, especially when computing resources and time are limited. This paper presents an incremental algorithm, the CFK-means method, to solve this problem. By incorporating the clustering features (CF) structure into the k-means algorithm, more cluster information can be retained and utilized. Clustering results are obtained quickly in a single scan of the data. The computing and file-exchange time of CFK-means is several times lower than that of k-means, and the accuracy of its results is almost equal to that of k-means. Experiments on simulated data and real text sets demonstrate the effectiveness of the method.
Keywords: Clustering Features (CF)  CFK-means algorithm  k-means algorithm  document clustering
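
The abstract names the clustering features (CF) structure but does not define it. As a rough illustration only, the Python sketch below assumes the BIRCH-style triple (N, LS, SS): point count, per-dimension linear sum, and sum of squared norms. The `CF` class, the `one_pass_cluster` function, and the rule of seeding clusters with the first k points are hypothetical choices made for this sketch; they are not the CFK-means procedure from the paper.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


@dataclass
class CF:
    """Assumed clustering feature: a compact (N, LS, SS) summary of a point set."""
    n: int            # number of points absorbed
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "CF":
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, x: Sequence[float]) -> None:
        """Absorb one point; the point itself need not be kept afterwards."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other: "CF") -> "CF":
        """CFs are additive: the summary of a union is the component-wise sum."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        """Root-mean-square distance of the points from the centroid."""
        c = self.centroid()
        var = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding error


def one_pass_cluster(points: Iterable[Sequence[float]], k: int) -> List[CF]:
    """Hypothetical single-scan assignment (an assumption, not the paper's method).

    The first k points seed the clusters; every later point is absorbed by
    the CF whose current centroid is nearest.
    """
    cfs: List[CF] = []
    for x in points:
        if len(cfs) < k:
            cfs.append(CF.from_point(x))
            continue
        nearest = min(cfs, key=lambda cf: _sq_dist(cf.centroid(), x))
        nearest.add(x)
    return cfs


if __name__ == "__main__":
    data = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0)]
    for cf in one_pass_cluster(data, k=2):
        print(cf.n, cf.centroid(), cf.radius())
```

Because the triple is additive, each point can be folded into a summary and then discarded, while the centroid and radius remain recoverable from the summary alone. That property is what makes a single scan of the data plausible under this assumed representation.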
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.