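Improved Oversampling and Random Forest Algorithm for Imbalanced Data
Cite this article:ZHANG Jiawei, GUO Linming, YANG Xiaomei. Improved Oversampling and Random Forest Algorithm for Imbalanced Data[J]. Computer Engineering and Applications, 2020, 56(11): 39-45. DOI: 10.3778/j.issn.1002-8331.1908-0338
Authors:ZHANG Jiawei  GUO Linming  YANG Xiaomei
Affiliation:School of Electrical Engineering, Sichuan University, Chengdu 610000
Abstract:To address the low recognition rate of minority-class samples caused by data imbalance, this paper proposes an algorithm that improves oversampling and random forest through a weighting strategy, reducing the effect of data imbalance on the classifier from two directions: data preprocessing and the algorithm itself. In the data preprocessing stage, the Synthetic Minority Oversampling Technique (SMOTE) is applied to lower the degree of imbalance; each minority-class sample is assigned a weight based on its Euclidean distance to the remaining samples, so that each sample synthesizes a different number of new samples. In the algorithm improvement stage, the Kappa coefficient is used to evaluate the classification performance of each decision tree in the random forest after training, and each tree is assigned a corresponding weight so that trees with better classification ability carry more weight in the voting stage, improving the overall classification performance of the random forest on imbalanced data. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the improved algorithm raises both the classification accuracy on minority-class samples and the overall classification performance.

Keywords:data imbalance  Synthetic Minority Oversampling Technique (SMOTE)  Kappa coefficient  random forest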
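
The weighted oversampling step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the weighting formula, so the sketch assumes that a minority sample's weight is proportional to its mean Euclidean distance to the other minority samples; the paper's actual rule may differ. The function names (`distance_weights`, `allocate_counts`, `weighted_smote`) and the parameters `k` and `seed` are hypothetical and used only for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_weights(minority):
    """Weight each minority sample by its mean distance to the other minority samples.

    Assumption (not from the paper): the weight is proportional to the mean
    distance and the weights are normalized to sum to 1.
    """
    n = len(minority)
    mean_dist = [
        sum(euclidean(minority[i], minority[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    total = sum(mean_dist)
    if total == 0:
        return [1.0 / n] * n  # all samples identical: fall back to equal weights
    return [d / total for d in mean_dist]


def allocate_counts(weights, n_new):
    """Split n_new synthetic samples in proportion to the weights (largest-remainder rounding)."""
    raw = [w * n_new for w in weights]
    counts = [int(r) for r in raw]
    leftover = n_new - sum(counts)
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


def weighted_smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples, a different number per seed sample."""
    rng = random.Random(seed)
    counts = allocate_counts(distance_weights(minority), n_new)
    synthetic = []
    for i, count in enumerate(counts):
        # k nearest minority neighbours of sample i, excluding the sample itself
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: euclidean(minority[i], minority[j]),
        )[:k]
        for _ in range(count):
            j = rng.choice(neighbours)
            gap = rng.random()  # SMOTE interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(minority[i], minority[j])])
    return synthetic, counts


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
    synthetic, counts = weighted_smote(minority, n_new=10)
    print(counts)
```

Under this assumed rule, the outlying sample `[5.0, 5.0]` receives the largest share of the ten synthetic samples. If the paper instead favors samples in dense regions, inverting the weights in `distance_weights` would be the only change needed.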

Improved Oversampling and Random Forest Algorithm for Imbalanced Data
ZHANG Jiawei,GUO Linming,YANG Xiaomei. Improved Oversampling and Random Forest Algorithm for Imbalanced Data[J]. Computer Engineering and Applications, 2020, 56(11): 39-45. DOI: 10.3778/j.issn.1002-8331.1908-0338
Authors:ZHANG Jiawei  GUO Linming  YANG Xiaomei
Affiliation:School of Electrical Engineering, Sichuan University, Chengdu 610000, China
Abstract:To address the low recognition rate of minority-class samples caused by imbalanced data, an improved algorithm based on weighted oversampling and a weighted random forest is proposed to reduce the influence of data imbalance on the classifier. In the data preprocessing step, weighted oversampling based on the Synthetic Minority Oversampling Technique (SMOTE) is applied to reduce the degree of imbalance. Each minority-class sample is assigned a weight determined by its Euclidean distance to the rest of the minority class, so that different samples generate different numbers of new samples. To improve the random forest, the Kappa coefficient is used to evaluate the classification performance of each trained decision tree, and each tree is given a corresponding weight, so that trees with better performance have more voting power at the final voting stage. Experiments on KEEL datasets show that, compared with the unimproved algorithm, the proposed algorithm improves both the classification accuracy for minority-class samples and the overall classification performance on imbalanced datasets.
Keywords:imbalanced data  Synthetic Minority Oversampling Technique(SMOTE)  Kappa coefficient  random forest  
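
The Kappa-weighted voting idea in the abstract can be sketched in a few lines of Python. The abstract does not say which data the trees are evaluated on or how a Kappa value is mapped to a weight, so this sketch assumes a held-out validation set and uses the Kappa value itself as the weight, clipped at zero. The function names (`cohen_kappa`, `tree_weights`, `weighted_vote`) are hypothetical, and the per-tree predictions are plain lists standing in for the outputs of trained decision trees.

```python
from collections import Counter, defaultdict


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        return 0.0  # Kappa is undefined when only one class occurs; treated as 0 here
    return (observed - expected) / (1.0 - expected)


def tree_weights(y_val, tree_val_preds):
    """One weight per tree from its Kappa on validation labels.

    Assumption (not from the paper): trees that do no better than chance
    (Kappa <= 0) get weight 0.
    """
    return [max(0.0, cohen_kappa(y_val, preds)) for preds in tree_val_preds]


def weighted_vote(tree_preds_for_sample, weights):
    """Return the label with the largest total weight among the trees' votes."""
    scores = defaultdict(float)
    for label, w in zip(tree_preds_for_sample, weights):
        scores[label] += w
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Imbalanced validation labels: class 1 is the minority class.
    y_val = [1, 1, 0, 0, 0, 0, 0, 0]
    tree_val_preds = [
        [1, 1, 0, 0, 0, 0, 0, 0],  # tree A finds both minority samples
        [0, 0, 0, 0, 0, 0, 0, 0],  # tree B always predicts the majority class
        [0, 0, 0, 0, 0, 0, 0, 0],  # tree C always predicts the majority class
    ]
    weights = tree_weights(y_val, tree_val_preds)
    # For a new sample, tree A votes 1 while trees B and C vote 0.
    print(weights, weighted_vote([1, 0, 0], weights))
```

In this toy case, trees B and C reach 75% accuracy by always predicting the majority class, yet their Kappa is 0, so the single tree that recognizes the minority class decides the vote. Plain majority voting would have returned class 0.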
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).