Foundation item: National Natural Science Foundation of China (61572406).
Received: 2018-07-31; Revised: 2018-09-13

NIBoost: new imbalanced dataset classification method based on cost sensitive ensemble learning
WANG Li,CHEN Hongmei,WANG Shengwu. NIBoost: new imbalanced dataset classification method based on cost sensitive ensemble learning[J]. Journal of Computer Applications, 2019, 39(3): 629-633. DOI: 10.11772/j.issn.1001-9081.2018071598
Authors:WANG Li  CHEN Hongmei  WANG Shengwu
Affiliation:School of Information Science and Technology, Southwest Jiaotong University, Chengdu Sichuan 611756, China
Abstract: A large amount of imbalanced data exists in real life, and most traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so minority class samples are frequently misclassified. To address this problem, a new imbalanced-data classification algorithm based on cost-sensitive ensemble learning and oversampling, named New Imbalanced Boost (NIBoost), was proposed. Firstly, in each iteration, an oversampling algorithm was used to add a certain number of minority class samples to balance the dataset, and a classifier was trained on this new dataset. Secondly, the classifier was used to classify the dataset, yielding the predicted class label of each sample and the classification error rate of the classifier. Finally, the weight coefficient of the classifier and the new weight of each sample were calculated from the classification error rate and the predicted class labels. With decision tree and Naive Bayes used as the weak classifiers, experimental results on UCI datasets show that, with the decision tree as the base classifier, NIBoost improves F-value by up to 5.91 percentage points, G-mean by up to 7.44 percentage points, and AUC by up to 4.38 percentage points compared with the RareBoost algorithm. Therefore, the proposed algorithm has advantages in imbalanced data classification.
Keywords:imbalanced dataset   classification   cost sensitive   over-sampling   Adaboost algorithm
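The three steps described in the abstract (balance the set by oversampling each round, train a weak classifier and measure its weighted error, then update the classifier and sample weights) can be sketched as an AdaBoost-style loop. This is a minimal illustrative reconstruction, not the paper's implementation: it assumes random duplication as a stand-in for the oversampling algorithm, a decision stump as the weak classifier, and the standard AdaBoost weight-update rule; all function names are hypothetical.

```python
import math
import random

def _stump_predict(stump, x):
    """Predict +1/-1 with a single-feature threshold stump."""
    f, t, pol = stump
    return pol if x[f] >= t else -pol

def _best_stump(X, y):
    """Exhaustively pick the stump with fewest training errors."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for pol in (+1, -1):
                err = sum(1 for x, yi in zip(X, y)
                          if (pol if x[f] >= t else -pol) != yi)
                if err < best_err:
                    best, best_err = (f, t, pol), err
    return best

def niboost_sketch(X, y, n_rounds=5, seed=0):
    """AdaBoost-style loop with per-round minority oversampling.

    Labels: +1 = majority class, -1 = minority class.
    """
    rng = random.Random(seed)
    n = len(X)
    w = [1.0 / n] * n          # sample weights over the ORIGINAL set
    ensemble = []              # list of (alpha, stump)
    for _ in range(n_rounds):
        # Step 1: balance the training set by duplicating random
        # minority samples (stand-in for the oversampling algorithm).
        minority = [i for i in range(n) if y[i] == -1]
        majority = [i for i in range(n) if y[i] == +1]
        extra = [rng.choice(minority)
                 for _ in range(len(majority) - len(minority))]
        idx = list(range(n)) + extra
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]

        # Step 2: train the weak classifier on the balanced set,
        # then measure its weighted error on the original samples.
        stump = _best_stump(Xb, yb)
        preds = [_stump_predict(stump, x) for x in X]
        err = sum(w[i] for i in range(n) if preds[i] != y[i])
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)

        # Step 3: classifier weight and sample-weight update
        # (standard AdaBoost rule).
        alpha = 0.5 * math.log((1 - err) / err)
        w = [w[i] * math.exp(-alpha * y[i] * preds[i]) for i in range(n)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, x):
    """Weighted vote of the weak classifiers."""
    s = sum(a * _stump_predict(st, x) for a, st in ensemble)
    return 1 if s >= 0 else -1

if __name__ == "__main__":
    # Tiny imbalanced toy set: 8 majority (+1), 2 minority (-1).
    X = [[i / 10] for i in range(10)]
    y = [1] * 8 + [-1] * 2
    ens = niboost_sketch(X, y, n_rounds=3)
    print(predict(ens, [0.85]), predict(ens, [0.1]))
```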
This article is indexed by Wanfang Data and other databases.