New over-sampling SVM classification algorithm based on unbalanced data sample characteristics
Cite this article: HUANG Hai-song, WEI Jian-an, KANG Pei-dong. New over-sampling SVM classification algorithm based on unbalanced data sample characteristics[J]. Control and Decision, 2018, 33(9): 1549-1558.
Authors: HUANG Hai-song, WEI Jian-an, KANG Pei-dong
Affiliation: Key Laboratory of Advanced Manufacturing Technology of Ministry of Education, Guizhou University, Guiyang 550025, China (all three authors)
Funding: Guizhou Industrial Key Research Project (黔科合GZ字[2015]3009); Natural Science Foundation of Guizhou Province (黔科合J字[2015]2043); Guizhou Province Major Special Project (黔科合JZ字[2014]2001); Guizhou Provincial Department of Education Project (黔教合协同创新字[2015]02); Guizhou University Graduate Innovation Fund (研理工2017037).
Abstract: Traditional sampling methods offer limited gains in accuracy and robustness: under-sampling tends to discard important sample information, while over-sampling tends to introduce redundant information. To address these problems, a new over-sampling method based on sample characteristics is proposed. It takes the unbalanced Pima-Indians dataset from the UCI public datasets as an example and jointly considers the relationship among the inter-class distance, the intra-class distance, and the degree of imbalance of the positive and negative class samples. First, the original dataset is divided into distance belts. Then an improved adaptive variable-neighborhood SMOTE algorithm based on sample characteristics is proposed, which synthesizes new samples from the minority class samples within each distance belt. The method is then extended to five other unbalanced UCI datasets. Finally, experiments with an SVM classifier show that, on the six unbalanced datasets, the new over-sampling SVM algorithm markedly improves the classification accuracy of both minority and majority class samples compared with existing sampling methods, and that the algorithm is more robust.

Keywords: unbalanced datasets; sample distance; ANBSC-Smote over-sampling; dataset reconstruction; support vector machine
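
The abstract names two steps (dividing the data into distance belts, then synthesizing minority samples inside each belt) but does not give their exact definitions. The sketch below is only an illustration of that general shape, not the authors' algorithm. Several choices in it are assumptions made for the example: belts are equal-width bins of each minority sample's distance to the majority-class centroid, the neighborhood size `k` is fixed rather than adaptive, and new samples are allocated to belts in proportion to belt size. All function names are hypothetical, and the SVM training step is left out.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def split_into_bands(minority, majority, n_bands=3):
    """Bin minority samples by distance to the majority centroid.

    ASSUMPTION: this belt criterion is illustrative; the abstract does
    not define how the distance belts are computed.
    """
    center = centroid(majority)
    dists = [euclid(p, center) for p in minority]
    lo, hi = min(dists), max(dists)
    width = (hi - lo) / n_bands or 1.0  # avoid zero width if all equal
    bands = [[] for _ in range(n_bands)]
    for p, d in zip(minority, dists):
        idx = min(int((d - lo) / width), n_bands - 1)
        bands[idx].append(p)
    return bands


def smote_in_band(band, n_new, k, rng):
    """Plain SMOTE interpolation restricted to the samples of one belt."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(band))
        base = band[i]
        others = band[:i] + band[i + 1:]
        # k nearest neighbours of `base` within the same belt
        neighbours = sorted(others, key=lambda p: euclid(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def oversample(minority, majority, n_bands=3, k=3, seed=0):
    """Return synthetic minority samples that roughly close the class gap."""
    rng = random.Random(seed)
    bands = split_into_bands(minority, majority, n_bands)
    usable = [b for b in bands if len(b) >= 2]  # need a neighbour to interpolate
    total = sum(len(b) for b in usable)
    deficit = max(len(majority) - len(minority), 0)
    new_points = []
    for band in usable:
        # ASSUMPTION: allocate new samples in proportion to belt size
        share = round(deficit * len(band) / total)
        new_points.extend(smote_in_band(band, share, min(k, len(band) - 1), rng))
    return new_points
```

Because every synthetic point is an interpolation between two minority samples of the same belt, it stays inside the region that belt already occupies. The resulting points would be appended to the minority class before training a classifier such as an SVM.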