Machine Learning Classification Strategy for Imbalanced Data Sets
Cite this article: XU Lingling, CHI Dongxiang. Machine Learning Classification Strategy for Imbalanced Data Sets[J]. Computer Engineering and Applications, 2020, 56(24): 12-27. DOI: 10.3778/j.issn.1002-8331.2007-0120
Authors: XU Lingling, CHI Dongxiang
Affiliation: School of Electronic Information Engineering, Shanghai Dianji University, Shanghai 201306, China
Funding: National Natural Science Foundation of China, Young Scientists Fund
Abstract: Because of the inherent characteristics of imbalanced data sets, classification results are often dominated by the majority class, which degrades classification performance. In recent years, a series of strategies for improving the accuracy of machine learning classifiers on imbalanced data sets have been proposed, with the aim of learning the underlying patterns in class-imbalanced data and exploiting its potential value. These strategies address the difficulty of imbalanced classification mainly at two levels: the data level and the classification-model level. This paper reviews machine learning classification strategies for imbalanced data sets from these two perspectives, analyzes and discusses the evaluation metrics for classifiers trained on imbalanced data, summarizes the open problems in imbalanced classification, and outlines directions for future research. In particular, the studies discussed focus mainly on the difficulties of binary classification under extreme class imbalance.

Keywords: imbalanced data set  resampling strategy  classification model  evaluation metric
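
As an illustration of two points raised in the abstract (majority-class dominance, and a data-level remedy), the following is a minimal sketch rather than anything taken from the paper. The 990:10 class ratio, the helper names `binary_metrics` and `random_oversample`, and the choice of random oversampling as the resampling strategy are all assumptions made for this example.

```python
import random
from collections import Counter


def binary_metrics(y_true, y_pred, positive=1):
    """Confusion-matrix metrics for a binary problem (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0        # minority-class hit rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # majority-class hit rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    g_mean = (recall * specificity) ** 0.5
    return {"accuracy": accuracy, "recall": recall,
            "specificity": specificity, "f1": f1, "g_mean": g_mean}


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until every
    class is as large as the largest one (assumed data-level strategy)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


# Assumed toy data: 990 majority (0) and 10 minority (1) samples.
y_true = [0] * 990 + [1] * 10
X = [[float(i)] for i in range(len(y_true))]

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # high accuracy, but recall, F1 and G-mean are all zero

# Data-level remedy: balance the class counts before training.
X_bal, y_bal = random_oversample(X, y_true)
print(Counter(y_bal))
```

In this toy setting, accuracy alone makes the always-majority classifier look strong, while recall, F1 and G-mean expose that no minority sample is detected. Random oversampling is only one of the simplest data-level options and can encourage overfitting to the duplicated samples.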
This article is indexed in Wanfang Data and other databases.