类属数据的贝叶斯聚类算法 (Bayesian clustering algorithm for categorical data)

Cite this article:	朱杰, 陈黎飞. 类属数据的贝叶斯聚类算法[J]. 计算机应用, 2017, 37(4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026

Authors:	朱杰 (ZHU Jie), 陈黎飞 (CHEN Lifei)

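Affiliations:	1. Southwest China Institute of Electronic Technology (中国西南电子技术研究所), Chengdu 610036; 2. School of Mathematics and Computer Science, Fujian Normal University (福建师范大学数学与计算机科学学院), Fuzhou 350117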

Funding:	National Natural Science Foundation of China (61175123); Natural Science Foundation of Fujian Province (2015J01238).

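Abstract:	To address the difficulty of defining a distance function between objects in categorical data clustering, a clustering algorithm for categorical data based on Bayesian probability estimation is proposed. First, an attribute-weighted probability model is proposed, in which each categorical attribute is assigned a weight reflecting its importance. Second, through a transformation by Bayes' formula, a clustering objective function based on maximum likelihood estimation is defined, and a partition-based clustering algorithm is proposed; the algorithm no longer depends on distances between objects but clusters according to the weighted likelihood between each object and the partitions of the dataset. Third, an expression for computing the attribute weights is derived, leading to the conclusion that the weight of a categorical attribute is inversely proportional to the information entropy of its symbol distribution. Experiments on real and synthetic datasets show that, compared with existing distance-based clustering algorithms, the proposed algorithm improves clustering accuracy, with gains of 5% to 48% on bioinformatics data in particular, and yields attribute weights that are meaningful in practice.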
Keywords:	data clustering; categorical attribute; attribute weighting; Bayesian clustering; probability model
Received:	2016-09-12
Revised:	2016-12-23
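
The abstract's third point, that an attribute's weight is inversely proportional to the entropy of its category distribution, can be illustrated with a small sketch. The abstract does not give the paper's actual weight expression, so everything below is an assumption made for illustration: the function names `attribute_entropy` and `entropy_weights`, the `eps` guard, the normalization to a sum of 1, and the choice to compute entropy over the whole dataset rather than per cluster.

```python
import math
from collections import Counter

def attribute_entropy(values):
    """Shannon entropy (in nats) of the category distribution of one attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def entropy_weights(rows, eps=1e-9):
    """Hypothetical weighting: w_d proportional to 1 / (H_d + eps), scaled to sum to 1.

    This is not the paper's formula; it only mirrors the stated inverse relation.
    An attribute with zero entropy would dominate the weights because of eps.
    """
    columns = list(zip(*rows))  # one tuple of values per attribute
    inverse = [1.0 / (attribute_entropy(col) + eps) for col in columns]
    total = sum(inverse)
    return [v / total for v in inverse]

# Toy data: attribute 0 is skewed (lower entropy), attribute 1 is uniform.
rows = [
    ("a", "x"),
    ("a", "y"),
    ("a", "x"),
    ("b", "y"),
]
weights = entropy_weights(rows)
```

In this toy dataset the skewed first attribute has lower entropy than the evenly split second one, so it receives the larger weight.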
Bayesian clustering algorithm for categorical data

ZHU Jie,CHEN Lifei. Bayesian clustering algorithm for categorical data[J]. Journal of Computer Applications, 2017, 37(4): 1026-1031. DOI: 10.11772/j.issn.1001-9081.2017.04.1026

Authors:	ZHU Jie, CHEN Lifei

Affiliations:	1. Southwest China Institute of Electronic Technology, Chengdu, Sichuan 610036, China; 2. School of Mathematics and Computer Science, Fujian Normal University, Fuzhou, Fujian 350117, China

Abstract:	To address the difficulty of defining a meaningful distance measure for categorical data clustering, a categorical data clustering algorithm based on Bayesian probability estimation is proposed. First, a probability model with automatic attribute weighting is proposed, in which each categorical attribute is assigned an individual weight indicating its importance for clustering. Second, a clustering objective function is derived by applying Bayes' formula and maximum likelihood estimation, and a partitioning algorithm is proposed to optimize it; the algorithm groups data according to the weighted likelihood between objects and clusters rather than pairwise distances. Third, an expression for estimating the attribute weights is derived, which indicates that the weight of a categorical attribute is inversely proportional to the entropy of its category distribution. Experiments were conducted on real datasets and a synthetic dataset. The results show that the proposed algorithm achieves higher clustering accuracy than existing distance-based algorithms, with improvements of 5% to 48% on the bioinformatics data, and that it produces meaningful weights for the categorical attributes.

Keywords:	data clustering; categorical attribute; attribute weighting; Bayesian clustering; probability model
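
The idea of partitioning by weighted likelihood instead of pairwise distance can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the abstract gives no formulas, so the Laplace smoothing (`alpha`), the round-robin initial partition, the weight rule (inverse of the average within-cluster entropy, normalized), the omission of any cluster prior, and all function names are hypothetical.

```python
import math
from collections import Counter

def fit_cluster(rows, members, domains, alpha=1.0):
    """Laplace-smoothed category probabilities, one dict per attribute."""
    probs = []
    for d, domain in enumerate(domains):
        counts = Counter(rows[i][d] for i in members)
        denom = len(members) + alpha * len(domain)
        probs.append({v: (counts[v] + alpha) / denom for v in domain})
    return probs

def entropy(dist):
    """Shannon entropy (in nats) of a probability dict."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def weighted_loglik(row, probs, weights):
    """Sum over attributes of w_d * log P(x_d | cluster)."""
    return sum(w * math.log(probs[d][row[d]]) for d, w in enumerate(weights))

def cluster(rows, k, n_iter=20):
    n_attrs = len(rows[0])
    domains = [sorted({r[d] for r in rows}) for d in range(n_attrs)]
    labels = [i % k for i in range(len(rows))]  # assumed deterministic start
    weights = [1.0 / n_attrs] * n_attrs
    for _ in range(n_iter):
        # Estimate one categorical model per cluster from the current partition.
        models = [
            fit_cluster(rows, [i for i, l in enumerate(labels) if l == c], domains)
            for c in range(k)
        ]
        # Assumed weight rule: inverse of the mean within-cluster entropy.
        inverse = [
            1.0 / (sum(entropy(m[d]) for m in models) / k + 1e-9)
            for d in range(n_attrs)
        ]
        total = sum(inverse)
        weights = [v / total for v in inverse]
        # Reassign each object to the cluster with the highest weighted likelihood.
        new_labels = [
            max(range(k), key=lambda c: weighted_loglik(r, models[c], weights))
            for r in rows
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, weights

# Toy data: attributes 0 and 1 carry the cluster structure, attribute 2 is noise.
rows = [
    ("a", "p", "x"), ("a", "p", "y"), ("a", "p", "x"), ("b", "q", "y"),
    ("b", "q", "x"), ("b", "q", "y"), ("a", "p", "y"), ("b", "q", "x"),
]
labels, weights = cluster(rows, k=2)
```

On this toy data the loop separates the `("a", "p", ...)` rows from the `("b", "q", ...)` rows, and the noise attribute ends up with the smallest weight. No pairwise distance between objects is computed at any point; each object is compared only against the per-cluster category distributions.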
