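Bayesian clustering algorithm for categorical data
Cite this article: ZHU Jie, CHEN Lifei. Bayesian clustering algorithm for categorical data[J]. Journal of Computer Applications (计算机应用), 2017, 37(4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026
Authors: ZHU Jie  CHEN Lifei
Affiliation: 1. Southwest China Institute of Electronic Technology, Chengdu 610036; 2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou 350117
Funding: National Natural Science Foundation of China (61175123); Natural Science Foundation of Fujian Province (2015J01238).
Abstract: To address the difficulty of defining a distance function between objects in categorical data clustering, a categorical data clustering algorithm based on Bayesian probability estimation is proposed. First, an attribute-weighted probability model is proposed, in which each categorical attribute is assigned a weight reflecting its importance. Second, through a transformation by Bayes' formula, a clustering objective function based on maximum likelihood estimation is defined, and a partition-based clustering algorithm is proposed; the algorithm no longer depends on distances between objects but clusters according to the weighted likelihood between each object and the partitions of the dataset. Third, an expression for computing the attribute weights is derived, leading to the conclusion that the weight of a categorical attribute is inversely proportional to the information entropy of its symbol distribution. Experiments on real and synthetic datasets show that, compared with existing distance-based clustering algorithms, the proposed algorithm improves clustering accuracy, notably by 5%-48% on bioinformatics data, and yields practically meaningful attribute weights.

Keywords: data clustering; categorical attribute; attribute weighting; Bayesian clustering; probability model
Received: 2016-09-12
Revised: 2016-12-23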

Bayesian clustering algorithm for categorical data
ZHU Jie, CHEN Lifei. Bayesian clustering algorithm for categorical data[J]. Journal of Computer Applications, 2017, 37(4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026
Authors: ZHU Jie  CHEN Lifei
Affiliation: 1. Southwest China Institute of Electronic Technology, Chengdu, Sichuan 610036, China; 2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350117, China
Abstract: To address the difficulty of defining a meaningful distance measure for categorical data clustering, a new categorical data clustering algorithm based on Bayesian probability estimation was proposed. Firstly, a probability model with automatic attribute weighting was proposed, in which each categorical attribute is assigned an individual weight indicating its importance for clustering. Secondly, a clustering objective function was derived using maximum likelihood estimation and a transformation by Bayes' formula, and a partitioning algorithm was proposed to optimize it; the algorithm groups data according to the weighted likelihood between objects and clusters rather than pairwise distances. Thirdly, an expression for estimating the attribute weights was derived, indicating that the weight of an attribute should be inversely proportional to the entropy of its category distribution. Experiments were conducted on real-world and synthetic datasets. The results show that the proposed algorithm yields higher clustering accuracy than existing distance-based algorithms, achieving improvements of 5%-48% on bioinformatics data, and produces meaningful attribute weights for the categorical attributes.
Keywords: data clustering; categorical attribute; attribute weighting; Bayesian clustering; probability model
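
The three steps outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' formulation: the abstract gives neither the objective function nor the weight expression, so the sketch assumes Laplace-smoothed category probabilities, weights set to the normalized reciprocal of each attribute's within-cluster entropy, and a simple reassign-until-stable loop. The names `fit_cluster`, `score`, and `cluster`, and the parameters `alpha` and `eps`, are hypothetical.

```python
import math
import random
from collections import Counter


def fit_cluster(rows, n_attrs, domains, alpha=1.0, eps=1e-6):
    """Estimate per-attribute category probabilities and weights for one cluster.

    Assumption: weights are the normalized reciprocals of the entropy of each
    attribute's smoothed category distribution within the cluster.
    """
    probs, inverse_entropy = [], []
    n = len(rows)
    for d in range(n_attrs):
        counts = Counter(r[d] for r in rows)
        denom = n + alpha * len(domains[d])
        # Laplace smoothing keeps every probability strictly positive.
        p = {v: (counts.get(v, 0) + alpha) / denom for v in domains[d]}
        h = -sum(q * math.log(q) for q in p.values())
        probs.append(p)
        inverse_entropy.append(1.0 / (h + eps))
    total = sum(inverse_entropy)
    weights = [w / total for w in inverse_entropy]
    return probs, weights


def score(row, probs, weights):
    """Weighted log-likelihood of one object under one cluster model."""
    return sum(w * math.log(p[v]) for v, p, w in zip(row, probs, weights))


def cluster(data, k, n_iter=20, seed=0):
    """Partition data into k clusters by weighted likelihood, not distance."""
    rng = random.Random(seed)
    n_attrs = len(data[0])
    domains = [sorted({r[d] for r in data}) for d in range(n_attrs)]
    labels = [rng.randrange(k) for _ in data]  # random initial partition
    models = []
    for _ in range(n_iter):
        # Re-estimate each cluster's model from its current members.
        models = [
            fit_cluster([r for r, l in zip(data, labels) if l == j],
                        n_attrs, domains)
            for j in range(k)
        ]
        # Reassign each object to the cluster with the highest score.
        new_labels = [
            max(range(k), key=lambda j: score(r, *models[j])) for r in data
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, models
```

Calling `cluster(data, k=2)` on a list of equal-length tuples of category symbols returns one label per object together with the per-cluster probability tables and weights. Under the stated assumption, an attribute whose values are nearly constant within a cluster receives a larger weight than one whose values are spread evenly, which mirrors the inverse relation between weight and entropy reported in the abstract.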