
An adaptive oversampling method fused with spectral clustering for imbalanced data (面向不均衡数据的融合谱聚类的自适应过采样法)
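Cite this article: 刘金平, 周嘉铭, 贺俊宾, 唐朝晖, 徐鹏飞, 张国勇. 面向不均衡数据的融合谱聚类的自适应过采样法 [An adaptive oversampling method fused with spectral clustering for imbalanced data][J]. 智能系统学报 (CAAI Transactions on Intelligent Systems), 2020, 15(4): 732-739.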
Authors: 刘金平 (LIU Jinping)  周嘉铭 (ZHOU Jiaming)  贺俊宾 (HE Junbin)  唐朝晖 (TANG Zhaohui)  徐鹏飞 (XU Pengfei)  张国勇 (ZHANG Guoyong)
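Affiliations: 1. Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha 410081, Hunan, China; 2. Hunan Institute of Metrology and Test, Changsha 410014, Hunan, China; 3. School of Automation, Central South University, Changsha 410082, Hunan, China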
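Abstract: Classification is a research hotspot in pattern recognition. Most classical classifiers assume by default that the dataset is balanced, whereas real-world datasets often suffer from class imbalance: the number of samples in the normal/majority class differs greatly from the number in the abnormal/minority class. Without preprocessing, the classifier tends to ignore the minority class and favor the majority class, which degrades the classification results. To address the imbalanced distribution of data, this paper proposes a synthetic sampling algorithm fused with spectral clustering. Spectral clustering is first used to analyze the distribution of the minority-class samples in the imbalanced dataset; the minority-class samples are then oversampled according to this distribution information to obtain a relatively balanced sample set for training the classification model. Extensive experiments on multiple imbalanced datasets show that the proposed method effectively handles data imbalance and improves the classifier's accuracy on minority-class samples.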

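Keywords: adaptive synthetic sampling approach  imbalanced dataset  spectral clustering  oversampling  pattern classification  data distribution  biased classifier  data preprocessing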

Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing
LIU Jinping, ZHOU Jiaming, HE Junbin, TANG Zhaohui, XU Pengfei, ZHANG Guoyong. Spectral clustering-fused adaptive synthetic oversampling approach for imbalanced data processing[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 732-739.
Authors: LIU Jinping  ZHOU Jiaming  HE Junbin  TANG Zhaohui  XU Pengfei  ZHANG Guoyong
Affiliation: 1. Hu’nan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hu’nan Normal University, Changsha 410081, China; 2. Hu’nan Institute of Metrology and Test, Changsha 410014, China; 3. School of Automation, Central South University, Changsha 410082, China
Abstract: Classification is a research hotspot in the field of machine learning. Most classic classifiers assume that the dataset is balanced, whereas real-world datasets often suffer from class imbalance: the number of samples in the normal/majority class differs greatly from the number in the anomaly/minority class. If the data are not processed, the classifier tends to ignore the minority class and be biased towards the majority class, which degrades the classification results. To address data imbalance, this paper proposes a spectral clustering-fused synthetic sampling algorithm (SCF-ADASYN). First, spectral clustering is employed to analyze the distribution of the minority-class samples in the imbalanced dataset; the minority-class samples are then oversampled to obtain a relatively balanced dataset for training the classification model. Extensive experiments on multiple imbalanced datasets show that SCF-ADASYN effectively alleviates the imbalance in the data and that the classification accuracy of the tested classifiers on imbalanced datasets can be significantly improved.
Keywords: adaptive synthetic sampling approach (ADASYN)  imbalanced dataset  spectral clustering  oversampling  pattern classification  data distribution  biased classifier  data pre-processing
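
The two-stage idea in the abstract (cluster the minority class, then oversample within the discovered structure) can be illustrated with a minimal NumPy sketch. This is not the authors' SCF-ADASYN: the abstract does not say how the distribution information steers sampling, so the sketch assumes that synthetic samples are allocated to clusters in proportion to cluster size and generated by SMOTE-style interpolation between two points of the same cluster. The function names `spectral_labels` and `cluster_oversample`, the RBF affinity, and the parameters `gamma` and `k` are all hypothetical choices made for illustration.

```python
import numpy as np

def spectral_labels(X, k, gamma=1.0, iters=50):
    """Normalized spectral clustering of the rows of X into k groups."""
    # RBF affinity matrix (assumed kernel; the abstract does not specify one).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-gamma * sq)
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(1), 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    U = vecs[:, :k]                      # k smallest eigenvectors
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # k-means on the embedding with deterministic farthest-point initialization.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[dist.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(0)
    return labels

def cluster_oversample(X_min, n_new, k=2, gamma=1.0, seed=0):
    """Create n_new synthetic minority samples, interpolating only within clusters."""
    rng = np.random.default_rng(seed)
    labels = spectral_labels(X_min, k, gamma)
    clusters = [np.flatnonzero(labels == j) for j in range(k)]
    clusters = [c for c in clusters if len(c) > 0]
    sizes = np.array([len(c) for c in clusters], dtype=float)
    # Assumption: each new sample picks a cluster with probability
    # proportional to that cluster's size.
    picks = rng.choice(len(clusters), size=n_new, p=sizes / sizes.sum())
    synthetic = []
    for c in picks:
        a = X_min[rng.choice(clusters[c])]
        b = X_min[rng.choice(clusters[c])]
        synthetic.append(a + rng.random() * (b - a))  # point on segment a-b
    return np.array(synthetic).reshape(n_new, X_min.shape[1])

# Usage: a minority class made of two well-separated groups.
demo_rng = np.random.default_rng(1)
X_minority = np.vstack([
    demo_rng.normal([0.0, 0.0], 0.3, (15, 2)),
    demo_rng.normal([10.0, 10.0], 0.3, (10, 2)),
])
X_synthetic = cluster_oversample(X_minority, n_new=40, k=2)
print(X_synthetic.shape)
```

Interpolating only between points of the same cluster keeps synthetic samples inside regions the minority class actually occupies; plain interpolation across the whole minority class could place samples in the empty space between the two groups.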