Incomplete Data Clustering Algorithm Based on Mutual Information Attributes Ranking
QIAN Xiaodong, LUO Yanfu. Incomplete Data Clustering Algorithm Based on Mutual Information Attributes Ranking[J]. Information and Control, 2019, 48(1): 80-87.
Authors:QIAN Xiaodong  LUO Yanfu
Affiliation:1. Graduate School, Lanzhou Jiaotong University, Lanzhou 730070, China;
2. School of Automation and Electrical Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Funding:National Natural Science Foundation of China (71461017)
Received:2017-12-28
Abstract:Missing data poses a challenge for clustering algorithms. Traditional methods usually fill incomplete data by mean or regression imputation and then cluster the filled data. Both the filling accuracy and the clustering quality of these methods degrade as the missing ratio increases. To address this problem, we propose a new method for calculating similarity on incomplete data. The attributes of the dataset are ranked by expected mutual information, which takes full account of the position-related characteristics of the attribute values. Elements of the dataset itself serve as the source of fill values, and similarity-based filling is computed on the ranked incomplete dataset. Finally, the data are clustered with an algorithm based on local density. The proposed filling and clustering algorithm is validated on datasets from the UCI Machine Learning Repository. The experimental results show that, as the number of missing values increases, the algorithm tolerates missing values well, recovers missing elements well, and performs well in both filling accuracy and final clustering results. Because the similarity-filling method considers every attribute value in the dataset and fills the missing values one by one, it is comparatively time-consuming.
Keywords:incomplete data  mutual information  missing value filling  local density
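
The attribute-ranking and filling steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact ranking criterion or similarity measure, so the sketch makes several assumptions. Attributes are discrete, missing cells are marked with `None`, each attribute is scored by its average empirical mutual information with the other attributes, and a missing cell is filled from the most similar row under a score-weighted match count. The names `MISSING`, `mutual_information`, `rank_attributes`, and `fill_from_donors` are hypothetical. The local-density clustering step is omitted.

```python
import math
from collections import Counter

MISSING = None  # hypothetical marker for a missing value


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete columns,
    using only the rows where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not MISSING and y is not MISSING]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    # sum of p(x,y) * log(p(x,y) / (p(x) * p(y))), written with raw counts
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())


def rank_attributes(rows):
    """Return (order, scores): attribute indices sorted by decreasing average
    mutual information with the other attributes (an assumed criterion)."""
    cols = list(zip(*rows))
    m = len(cols)
    scores = []
    for i in range(m):
        others = [mutual_information(cols[i], cols[j])
                  for j in range(m) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    order = sorted(range(m), key=lambda i: -scores[i])
    return order, scores


def fill_from_donors(rows, scores):
    """Fill each missing cell with the value of the most similar row that
    observes it. Similarity is the score-weighted count of matching observed
    attributes: a simple stand-in, not the paper's similarity measure."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for k, value in enumerate(row):
            if value is not MISSING:
                continue
            best, best_sim = None, -1.0
            for j, donor in enumerate(rows):
                if j == i or donor[k] is MISSING:
                    continue
                sim = sum(scores[a] for a in range(len(row))
                          if a != k and row[a] is not MISSING
                          and row[a] == donor[a])
                if sim > best_sim:
                    best, best_sim = donor, sim
            if best is not None:
                filled[i][k] = best[k]  # the value comes from the data itself
    return filled


if __name__ == "__main__":
    rows = [
        ("a", "x", "p"), ("a", "x", "q"),
        ("b", "y", "p"), ("b", "y", "q"),
        ("a", MISSING, "p"), (MISSING, "y", "q"),
    ]
    order, scores = rank_attributes(rows)
    print(order, fill_from_donors(rows, scores))
```

In this toy dataset the first two attributes determine each other and the third is nearly independent of both, so the third attribute ranks last and contributes least to the donor similarity. The fill step compares every incomplete row against every candidate donor for each missing cell, which is consistent with the abstract's remark that filling missing values one by one is time-consuming.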