Categorical Data Clustering Based on Extraction of Associations from Co-association Matrix

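Cite this article:	Guan Yunpeng, Liu Yulong. Categorical Data Clustering Based on Extraction of Associations from Co-association Matrix[J]. Computer and Modernization (计算机与现代化), 2022, 0(11): 1-8.

Authors:	Guan Yunpeng (关云鹏), Liu Yulong (刘玉龙)

Funding:	Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0105100)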

Abstract:	Categorical data clustering is widely used in real-world fields such as medical science and computer science. Conventional categorical data clustering is built on dissimilarity measures, so across data sets with different characteristics the clustering results are affected by the properties of the data set itself and by noise. Clustering based on representation learning, in turn, is complex to implement, and its results depend heavily on the quality of the learned representation. Building on the co-association matrix, this paper proposes a clustering method that directly considers the association relationships in the original categorical data: categorical data clustering based on extraction of associations from co-association matrix (CDCBCM). The co-association matrix can be regarded as a summary of how information is associated in the original data space. The method constructs the co-association matrix by computing the co-association frequency of different objects in each attribute subspace, removes some noise information from the matrix, and then applies normalized cut to obtain the clustering result. The method is tested on 16 public data sets from different domains, compared with 8 existing methods, and evaluated with the F1-score. The experimental results show that it performs best on 7 of the data sets and achieves the best average rank, indicating that it is better suited to the task of clustering categorical data.

Keywords:	categorical data; categorical data clustering; machine learning; co-association matrix; normalized cut

Received:	2022-11-30
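
The three stages named in the abstract (build a co-association matrix, remove noise, apply normalized cut) can be illustrated with a minimal sketch. This is not the authors' CDCBCM implementation, since the abstract does not give its details. The sketch assumes that each single attribute is one subspace, that noise removal is a plain threshold on co-association values, and that the cut is the two-way spectral relaxation of normalized cut. The function names `co_association_matrix`, `prune_weak_links`, and `two_way_normalized_cut`, the `threshold` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def co_association_matrix(X):
    """Fraction of attributes on which each pair of objects shares a value.

    Assumption: every single attribute is treated as one subspace.
    """
    X = np.asarray(X)
    n, m = X.shape
    C = np.zeros((n, n))
    for a in range(m):
        col = X[:, a]
        # 1 where objects i and j take the same value on attribute a
        C += (col[:, None] == col[None, :])
    return C / m


def prune_weak_links(C, threshold=0.5):
    """Assumed noise filter: drop co-association values below `threshold`."""
    W = np.where(C >= threshold, C, 0.0)
    np.fill_diagonal(W, 0.0)  # no self-loops in the similarity graph
    return W


def two_way_normalized_cut(W):
    """Two-way spectral relaxation of normalized cut.

    Uses the eigenvector of the second-smallest eigenvalue of the
    symmetric normalized Laplacian, split by sign. Assumes the graph
    described by W is connected.
    """
    d = W.sum(axis=1)
    d_safe = np.where(d > 0, d, 1.0)  # guard against isolated objects
    inv_sqrt = 1.0 / np.sqrt(d_safe)
    L_sym = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = vecs[:, 1] * inv_sqrt
    return (fiedler > 0).astype(int)


# Toy data: six objects, four categorical attributes (made up for illustration).
toy = [
    ["a", "x", "p", "s"],
    ["a", "x", "p", "t"],
    ["a", "x", "q", "s"],
    ["b", "y", "q", "t"],
    ["b", "y", "r", "t"],
    ["b", "y", "q", "s"],
]
labels = two_way_normalized_cut(prune_weak_links(co_association_matrix(toy)))
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, a common choice is to cluster the first k eigenvectors instead of thresholding a single one.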
