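Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE
Cite this article: DONG Ming-gang, JIANG Zhen-long, JING Chao. Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE[J]. Computer Science (计算机科学), 2020, 47(1): 102-109
Authors: DONG Ming-gang  JIANG Zhen-long  JING Chao
Affiliations: College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004; College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004
Funding: Fund of the Guangxi Key Laboratory of Embedded Technology and Intelligent System; National Natural Science Foundation of China; Guangxi Natural Science Foundation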
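Abstract: Data imbalance is widespread in real-world applications, and traditional machine learning algorithms struggle to achieve satisfactory results on imbalanced data. The Synthetic Minority Oversampling Technique (SMOTE) is an effective remedy, but in multi-class imbalanced data the disordered distribution of boundary points and the discontinuity of class distributions become more complex, so synthesized samples can intrude into the regions of other classes and cause over-generalization. Because decision trees based on the Hellinger distance have been shown to be insensitive to imbalanced data, this paper combines the Hellinger distance with SMOTE and proposes an oversampling algorithm, HDSMOTE (Based on Hellinger Distance and SMOTE Oversampling Algorithm). First, a sampling direction selection strategy based on the Hellinger distance is established: the Hellinger distances within the local neighborhood of each minority-class sample are compared to guide the direction in which samples are synthesized. Second, a sampling quality evaluation strategy based on the Hellinger distance is designed to keep synthesized samples from intruding into the regions of other classes, which reduces the risk of over-generalization. Finally, seven representative oversampling algorithms and HDSMOTE are used to preprocess 15 multi-class imbalanced data sets, a decision tree classifier is used for classification, and Precision, Recall, F-measure, G-mean and MAUC serve as the evaluation metrics. The experimental results show that HDSMOTE improves on the compared algorithms on all of these metrics: by up to 17.07% in Precision, 21.74% in Recall, 19.63% in F-measure, 16.37% in G-mean and 8.51% in MAUC. Compared with the seven representative oversampling methods, HDSMOTE achieves better classification performance on multi-class imbalanced data.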

Keywords: SMOTE  Oversampling  Hellinger distance  Multi-class imbalanced learning  Classification
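
For reference, the two quantities the abstract builds on can be written in their standard textbook forms. This is background only: the abstract does not state which normalization of the Hellinger distance the authors use, nor how the distance enters their direction selection and quality evaluation strategies, so the symbols below (P, Q, p_i, q_i, k, λ) are notational assumptions rather than the paper's own.

```latex
% Hellinger distance between two discrete distributions
% P = (p_1, ..., p_k) and Q = (q_1, ..., q_k); with the 1/sqrt(2)
% factor it lies in [0, 1]. Some authors omit that factor.
H(P, Q) = \frac{1}{\sqrt{2}}
          \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^{2}},
\qquad 0 \le H(P, Q) \le 1

% Classic SMOTE interpolation: a synthetic sample on the segment
% between a minority sample x and one of its minority neighbours x_nn.
x_{\mathrm{new}} = x + \lambda \, (x_{\mathrm{nn}} - x),
\qquad \lambda \sim \mathcal{U}(0, 1)
```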

Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm
DONG Ming-gang, JIANG Zhen-long, JING Chao. Multi-class Imbalanced Learning Algorithm Based on Hellinger Distance and SMOTE Algorithm[J]. Computer Science, 2020, 47(1): 102-109
Authors:DONG Ming-gang  JIANG Zhen-long  JING Chao
Affiliation: College of Information Science and Engineering, Guilin University of Technology, Guilin, Guangxi 541004, China; Guangxi Key Laboratory of Embedded Technology and Intelligent System, Guilin, Guangxi 541004, China
Abstract: Imbalanced data is common in real life, and traditional machine learning algorithms struggle to achieve satisfactory results on it. The synthetic minority oversampling technique (SMOTE) is an effective way to handle this problem. However, in multi-class imbalanced data, the disordered distribution of boundary samples and discontinuous class distributions become more complicated, and synthetic samples may invade the regions of other classes, leading to over-generalization. Because decision trees based on the Hellinger distance have been shown to be insensitive to imbalanced data, this paper combines the Hellinger distance with SMOTE and proposes an oversampling method, SMOTE with Hellinger distance (HDSMOTE). Firstly, a sampling direction selection strategy is presented based on the Hellinger distances within the local neighborhood of each minority-class sample, which guides the direction in which samples are synthesized. Secondly, a sampling quality evaluation strategy based on the Hellinger distance is designed to keep synthesized samples out of the regions of other classes, which reduces the risk of over-generalization. Finally, to demonstrate the performance of HDSMOTE, 15 multi-class imbalanced data sets were preprocessed by seven representative oversampling algorithms and by HDSMOTE, and were then classified with a C4.5 decision tree. Precision, Recall, F-measure, G-mean and MAUC are employed as the evaluation metrics. The experimental results show that HDSMOTE improves on the competing oversampling methods on all of these metrics: by up to 17.07% in Precision, 21.74% in Recall, 19.63% in F-measure, 16.37% in G-mean and 8.51% in MAUC. HDSMOTE achieves better classification performance than the seven representative oversampling methods on multi-class imbalanced data.
Keywords:SMOTE  Oversampling  Hellinger distance  Multi-class imbalanced learning  Classification
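
The abstract names two building blocks, the Hellinger distance and SMOTE interpolation, without giving formulas. The sketch below illustrates only those two generic pieces in plain Python, under stated assumptions: the function names `hellinger_distance` and `smote_interpolate` are hypothetical, the distance uses the common normalization to [0, 1], and nothing here reproduces HDSMOTE's direction selection or quality evaluation strategies, which the abstract does not specify.

```python
import math
import random


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1].

    p and q are sequences of non-negative weights over the same bins.
    They are normalized here, so raw class counts are accepted too.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    sp, sq = sum(p), sum(q)
    if sp <= 0 or sq <= 0:
        raise ValueError("each distribution needs positive total mass")
    total = sum(
        (math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2 for a, b in zip(p, q)
    )
    return math.sqrt(total / 2.0)


def smote_interpolate(x, neighbor, rng=random):
    """Classic SMOTE step: a random point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]


if __name__ == "__main__":
    # Distributions with disjoint support are maximally far apart.
    print(hellinger_distance([1, 0], [0, 1]))
    # Identical distributions (given as raw counts) have distance zero.
    print(hellinger_distance([3, 1], [3, 1]))
    # One synthetic point between a minority sample and a neighbour.
    print(smote_interpolate([0.0, 0.0], [1.0, 2.0], random.Random(0)))
```

A method in the spirit of the abstract would presumably compare such distances across the local neighborhoods of minority samples before accepting an interpolated point, but the exact rule would have to be taken from the paper itself.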
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).