     

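Research on Unbalanced Data Labeling Technology Based on Clustering Analysis
Cite this article:ZHAO Jun-jie, HUANG Si-niu, WU Zheng-wu, WANG Shuai. Research on Unbalanced Data Labeling Technology Based on Clustering Analysis[J]. Computer Simulation (计算机仿真), 2020, 0(2): 476-480
Authors:ZHAO Jun-jie  HUANG Si-niu  WU Zheng-wu  WANG Shuai
Affiliation:Beijing Institute of Control and Electronic Technology (北京控制与电子技术研究所)
Funding:Supported by the National Defense Science and Technology Innovation Special Zone project.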
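Abstract:When unevenly distributed data are labeled by conventional clustering analysis, the clustering result tends to be biased toward the classes with more samples, which introduces labeling errors. To address this problem, a labeling method built on an improved clustering algorithm with a balance constraint is proposed, and it improves the accuracy of cluster-based labeling of imbalanced data fairly effectively. The method comprises four steps: initial clustering of the data, adjustment by expert knowledge, data balancing, and clustering with a balance constraint. Initial clustering assigns initial class labels to the imbalanced data; expert-knowledge adjustment corrects the labels of the part of the data that was labeled wrongly; balancing produces a balanced dataset; and balance-constrained clustering makes the final, precise label assignment on the balanced data. Simulation results indicate that the method improves the labeling accuracy of imbalanced data fairly effectively.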

Keywords:Imbalanced data  Data labeling  Clustering analysis  Balancing  Simulation verification

Research on Unbalanced Data Labeling Technology Based on Clustering Analysis
ZHAO Jun-jie,HUANG Si-niu,WU Zheng-wu,WANG Shuai. Research on Unbalanced Data Labeling Technology Based on Clustering Analysis[J]. Computer Simulation, 2020, 0(2): 476-480
Authors:ZHAO Jun-jie  HUANG Si-niu  WU Zheng-wu  WANG Shuai
Affiliation:(Science and Technology on Information System Engineering Laboratory,Beijing 100038,China)
Abstract:When unbalanced datasets are labeled by clustering analysis, the clustering result tends to favor the 'big' clusters, which causes labeling errors. To address this problem, we propose a labeling method based on a new clustering algorithm. The method comprises initial clustering, correction of errors under expert knowledge, balancing of the unbalanced dataset, and re-clustering of the balanced dataset. Initial clustering yields the initial clusters. The labels of part of the data are then corrected under the guidance of expert knowledge. After the unbalanced data have been balanced, a new clustering algorithm with a balancing constraint is applied and the data are re-labeled according to its result, which improves the accuracy of the final labels. Simulation results show that the proposed method can improve the accuracy of clustering and labeling.
Keywords:Imbalanced data  Data labeling  Clustering analysis  Balance processing  Simulation verification
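
The four steps named in the abstract (initial clustering, expert adjustment, balancing, balance-constrained clustering) can be illustrated with a small sketch. The abstract does not say which clustering algorithm, balancing technique, or form of balance constraint the authors use, so everything below is an assumption made for illustration: plain k-means for the initial clustering, random oversampling of the smaller class for balancing, and a per-cluster size cap enforced by greedy nearest-first assignment as the balance constraint. The toy points, the `expert_fixes` mapping, and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(x) / len(pts) for x in zip(*pts))


def kmeans(points, centers, iters=20, cap=None):
    """Lloyd-style k-means. If cap is set, no cluster may hold more than
    cap points (an assumed stand-in for the paper's balance constraint)."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iters):
        # All (distance, point index, cluster index) pairs, closest first.
        pairs = sorted((math.dist(p, c), i, k)
                       for i, p in enumerate(points)
                       for k, c in enumerate(centers))
        counts, done = Counter(), set()
        for _, i, k in pairs:
            if i in done or (cap is not None and counts[k] >= cap):
                continue  # point already placed, or cluster is full
            labels[i] = k
            counts[k] += 1
            done.add(i)
        # Move each center to the mean of its current members.
        for k in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = mean(members)
    return labels


def oversample(points, labels, rng):
    """Random oversampling (assumed balancing step): duplicate points of
    the smaller classes until every class matches the largest one."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    target = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced += g + [rng.choice(g) for _ in range(target - len(g))]
    return balanced, target


# Toy data: seven majority points (one of them an outlier) and two minority points.
points = [(0.0, 0.0), (0.5, 0.2), (1.0, 0.1), (0.2, 0.9), (0.8, 0.7),
          (0.4, 0.5), (3.0, 0.4), (5.0, 0.5), (5.2, 0.3)]
n_clusters = 2

# Step 1: initial clustering assigns initial labels.
initial = kmeans(points, [points[0], points[-1]])

# Step 2: expert knowledge corrects some labels (hypothetical correction).
expert_fixes = {6: 0}
adjusted = list(initial)
for idx, lab in expert_fixes.items():
    adjusted[idx] = lab

# Step 3: balance the dataset.
balanced, target = oversample(points, adjusted, random.Random(0))

# Step 4: re-cluster the balanced data under the size cap,
# seeded with the means of the expert-adjusted classes.
seeds = [mean([p for p, l in zip(points, adjusted) if l == c])
         for c in range(n_clusters)]
final = kmeans(balanced, seeds, cap=target)
print(Counter(final))
```

In this toy run the outlier at index 6 is first grouped with the minority points, is moved back by the expert correction, and then stays in the majority cluster because the minority cluster reaches its size cap before the outlier is considered. The cap equals the per-class size after oversampling, so total capacity matches the number of points and every point is assigned.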
This article is indexed in VIP (维普) and other databases.