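An improved BIRCH clustering algorithm
Cite this article: JIANG Sheng-yi (蒋盛益), LI Xia (李霞). An improved BIRCH clustering algorithm [J]. Journal of Computer Applications (计算机应用), 2009, 29(1): 293-296.
Authors: JIANG Sheng-yi  LI Xia
Affiliations: Guangdong University of Foreign Studies; Guangdong University of Foreign Studies
Funding: National Natural Science Foundation of China; Key Natural Science Research Project of Guangdong Higher Education Institutions; Research and Innovation Team Project of Guangdong University of Foreign Studies
Abstract: BIRCH is a clustering algorithm suited to large-scale data sets. It builds a clustering feature (CF) tree by applying a uniform threshold T to all leaf nodes and rebuilds the tree with a different threshold at each stage, but it gives no concrete method for choosing a reasonable initial value of T or for raising the threshold from stage to stage. In addition, BIRCH can process only numeric data, which limits its applicability. To address these shortcomings, the following improvements are made to BIRCH: 1) the CF structure of the original algorithm is modified so that it can handle data sets with mixed attributes; 2) the initial threshold T is selected heuristically, and a concrete procedure is given for raising the threshold in the second stage; 3) the parameters B and L are examined, showing that the algorithm performs similarly when B = L, and a range of B values that yields good clustering results is proposed. Experimental results show that the improved BIRCH algorithm performs well.

Keywords: BIRCH algorithm; clustering; threshold; mixed-attribute data; data mining
Received: 2008-09-17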

Improved BIRCH clustering algorithm
JIANG Sheng-yi, LI Xia. Improved BIRCH clustering algorithm [J]. Journal of Computer Applications, 2009, 29(1): 293-296.
Authors: JIANG Sheng-yi  LI Xia
Affiliation: School of Information, Guangdong University of Foreign Studies, Guangzhou, Guangdong 510006, China
Abstract: BIRCH is a clustering algorithm suited to very large data sets. It builds a clustering feature (CF) tree in which every entry in each leaf node must satisfy a uniform threshold T, and it rebuilds the tree at each stage with a different threshold. However, the algorithm does not specify how to set the initial threshold or how to raise the threshold at each stage. In addition, it works only with numeric ("metric") attributes, which limits its applicability. This paper makes three improvements to BIRCH: 1) it changes the CF structure so that heterogeneous attributes can be handled; 2) it gives a heuristic for choosing the initial threshold and a procedure for raising the threshold in the second stage of the algorithm; 3) it examines the parameters B and L, finds that the algorithm performs similarly when B = L, and proposes a range for B that yields good clustering results. Experimental results on public data sets show that the improved algorithm performs well.
Keywords: BIRCH algorithm; clustering; threshold; heterogeneous attributes; data mining
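
To make the role of the threshold T concrete, the following is a minimal sketch of the standard numeric clustering feature from the original BIRCH algorithm, not of the improved CF structure proposed in this paper (which the abstract does not spell out). A CF stores the point count N, the linear sum LS, and the squared sum SS, and a leaf entry absorbs a new point only if the merged radius stays within T. The names `CF`, `try_absorb`, and the use of the radius (rather than the diameter) as the tightness measure are assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class CF:
    """Numeric clustering feature (N, LS, SS) as in standard BIRCH."""
    n: int            # number of points summarized
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "CF":
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other: "CF") -> "CF":
        # CFs are additive: merging two clusters just adds their components.
        return CF(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def radius(self) -> float:
        # Root-mean-square distance of the points to the centroid:
        # sqrt(SS/N - ||LS/N||^2), clamped at 0 against rounding error.
        centroid_sq = sum(v * v for v in self.ls) / (self.n ** 2)
        return math.sqrt(max(self.ss / self.n - centroid_sq, 0.0))


def try_absorb(entry: CF, x: Sequence[float], threshold: float) -> Tuple[CF, bool]:
    """Absorb x into a leaf entry only if the merged radius stays <= threshold.

    Assumption: the radius is the tightness measure; the diameter is an
    equally common choice.
    """
    candidate = entry.merge(CF.from_point(x))
    if candidate.radius() <= threshold:
        return candidate, True
    return entry, False


if __name__ == "__main__":
    entry = CF.from_point([0.0, 0.0])
    entry, ok_near = try_absorb(entry, [2.0, 0.0], threshold=1.0)
    entry, ok_far = try_absorb(entry, [10.0, 0.0], threshold=1.0)
    print(entry.n, ok_near, ok_far)
```

In this sketch a small T rejects more points and so produces more leaf entries and a larger tree, while a large T produces a coarser one; this is why the choice of the initial T and the rule for raising it matter.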
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.