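Research on a Density-Based Clustering Algorithm for Mixed-Attribute Data with Automatic Determination of Cluster Centers
Cite this article: CHEN Jin-Yin, HE Hui-Hao. Research on a Density-Based Clustering Algorithm for Mixed-Attribute Data with Automatic Determination of Cluster Centers[J]. Acta Automatica Sinica, 2015, 41(10): 1798-1813.
Authors: CHEN Jin-Yin  HE Hui-Hao
Affiliation: 1. Institute of Information Engineering, Zhejiang University of Technology, Hangzhou 310023
Funding: Supported by the Zhejiang Provincial Natural Science Foundation (Y14F020092) and the Ningbo Natural Science Foundation (2013A610070)
Abstract: Mixed-attribute data are widespread, yet most existing mixed-attribute clustering algorithms suffer from low clustering quality, heavy dependence on algorithm parameters, and an inability to determine the number of clusters and the cluster centers accurately and automatically. To address these problems, this paper proposes a density-based clustering algorithm for mixed-attribute data that determines cluster centers automatically. By analyzing the characteristics of mixed-attribute data, the algorithm divides such data into three types (numeric-dominant, categorical-dominant, and balanced) and selects a distance measure suited to the characteristics of each type. From the distribution plot of density and distance computed for every point in the data set, the following regularity is derived: a data point that has high density and lies relatively far from any point of higher density is most likely to be a cluster center. A linear regression model and residual analysis are used to identify singular points, and it is argued theoretically that these singular points are the cluster centers, so the cluster centers are determined automatically. Particle swarm optimization (PSO) is used to search for the optimal value of dc; given the parameter dc, the density of any data object and its minimum distance to a point of higher density can be computed. Each cluster center is then determined by the automatic center-determination method, and every other point is assigned to the cluster of its nearest neighbor of higher density, which completes the clustering. Finally, the proposed algorithm is compared with several existing mixed-attribute clustering algorithms on multiple data sets, and the results verify that it achieves high clustering quality.

Keywords: Data mining  mixed attributes  data clustering  density  mixed distance measure
Received: 2015-02-03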
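
The density and distance quantities described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the helper names `mixed_distance` and `density_and_delta`, the simple metric (Euclidean distance on numeric attributes plus a mismatch count on categorical attributes), and the definition of density as the number of neighbors closer than the cutoff dc. The paper selects a different metric for each data type and tunes dc with PSO; here dc is simply passed in.

```python
import math


def mixed_distance(a, b, num_idx, cat_idx):
    """Illustrative mixed metric (an assumption, not the paper's metric):
    Euclidean distance on numeric attributes plus the number of
    mismatching categorical attributes."""
    numeric = math.sqrt(sum((a[i] - b[i]) ** 2 for i in num_idx))
    categorical = sum(1 for i in cat_idx if a[i] != b[i])
    return numeric + categorical


def density_and_delta(points, num_idx, cat_idx, dc):
    """Return (rho, delta, nearest_higher) for every point.

    rho[i]            number of other points closer than the cutoff dc
    delta[i]          distance to the nearest point of strictly higher rho
    nearest_higher[i] index of that point, or -1 if none exists
    """
    n = len(points)
    dist = [[mixed_distance(points[i], points[j], num_idx, cat_idx)
             for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]

    delta = [0.0] * n
    nearest_higher = [-1] * n
    for i in range(n):
        higher = [j for j in range(n) if rho[j] > rho[i]]
        if higher:
            j = min(higher, key=lambda k: dist[i][k])
            delta[i] = dist[i][j]
            nearest_higher[i] = j
        else:
            # No strictly denser point: use the largest distance from i.
            delta[i] = max(dist[i])
    return rho, delta, nearest_higher


# Toy data: one numeric attribute (index 0), one categorical (index 1).
points = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (5.0, "b"), (5.1, "b")]
rho, delta, nearest_higher = density_and_delta(points, [0], [1], dc=0.5)
print(rho, delta, nearest_higher)
```

The pairwise distance matrix makes this quadratic in the number of points, which is acceptable for a sketch but would need an index structure or vectorized code for large data sets.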

Research on Density-based Clustering Algorithm for Mixed Data with Determine Cluster Centers Automatically
CHEN Jin-Yin,HE Hui-Hao.Research on Density-based Clustering Algorithm for Mixed Data with Determine Cluster Centers Automatically[J].Acta Automatica Sinica,2015,41(10):1798-1813.
Authors:CHEN Jin-Yin  HE Hui-Hao
Affiliation:1.Institute of Information Engineering, Zhejiang University of Technology, Hangzhou 310023
Abstract:Most current algorithms for clustering mixed data have shortcomings such as low clustering quality, sensitivity to clustering parameters, and difficulty in determining the number of clusters and the cluster centers. This paper proposes a density-based clustering algorithm for mixed data that determines cluster centers automatically. First, mixed data are divided into three types (numeric-dominant, categorical-dominant, and balanced) based on an analysis of their attributes, and a corresponding similarity metric is designed for each type. Then, from the relationship between density and distance for each data object, an important conclusion is drawn: data objects that have both a high density and a large distance from any object of higher density are the most likely to be cluster centers. A linear regression model and residual analysis are therefore used to find these outliers, which are taken to be the cluster centers automatically. Because the initial value of dc is crucial to clustering efficiency, a particle swarm optimization (PSO) algorithm is adopted to search for the optimal dc, with the distance and density of each data object calculated according to the automatic method for determining the cluster centers. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density. Finally, the performance of the proposed method is verified by a series of simulations on real-world datasets in comparison with other well-known clustering algorithms.
Keywords:Data mining  mixed attributes  data clustering  peak density  mixed distance measure methods
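
The center-selection and assignment steps described in the abstract might look roughly like the following Python sketch. The abstract does not give the form of the regression, so several choices here are assumptions made for illustration: regressing the distance `delta` on the density `rho` by ordinary least squares, flagging points whose standardized residual exceeds a hypothetical `z_threshold`, and additionally requiring a density at or above the mean. The function names `select_centers` and `assign_clusters` are likewise hypothetical, and the inputs `rho`, `delta`, and `nearest_higher` are assumed to have been computed beforehand.

```python
import math


def select_centers(rho, delta, z_threshold=2.0):
    """Flag candidate cluster centers as regression outliers.

    Fits delta = intercept + slope * rho by least squares and returns the
    indices whose standardized residual exceeds z_threshold and whose
    density is at least the mean density (both criteria are assumptions).
    """
    n = len(rho)
    mean_r = sum(rho) / n
    mean_d = sum(delta) / n
    sxx = sum((r - mean_r) ** 2 for r in rho)
    sxy = sum((r - mean_r) * (d - mean_d) for r, d in zip(rho, delta))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_d - slope * mean_r

    residuals = [d - (intercept + slope * r) for r, d in zip(rho, delta)]
    sigma = math.sqrt(sum(e * e for e in residuals) / n)
    if sigma == 0:
        return []  # perfect fit: no outliers to report
    return [i for i, e in enumerate(residuals)
            if e / sigma > z_threshold and rho[i] >= mean_r]


def assign_clusters(rho, nearest_higher, centers):
    """Give each point the label of its nearest neighbor of higher density.

    nearest_higher[i] is the index of the closest point with strictly
    higher density, or -1 if there is none. Points are visited in order
    of decreasing density so that a neighbor is labeled before the points
    that depend on it. A non-center point without a denser neighbor keeps
    the label -1.
    """
    labels = [-1] * len(rho)
    for cluster_id, center in enumerate(centers):
        labels[center] = cluster_id
    for i in sorted(range(len(rho)), key=lambda k: -rho[k]):
        if labels[i] == -1 and nearest_higher[i] != -1:
            labels[i] = labels[nearest_higher[i]]
    return labels


# Small hand-made example: points 0 and 3 are taken as centers.
labels = assign_clusters(rho=[5, 4, 3, 4, 2],
                         nearest_higher=[-1, 0, 1, 0, 3],
                         centers=[0, 3])
print(labels)  # [0, 0, 0, 1, 1]
```

A fixed residual threshold is only one way to turn residual analysis into a decision rule; the appropriate cutoff depends on the data set and is left as a parameter here.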