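A Novel Classification Algorithm for Imbalanced Datasets Based on Hybrid Resampling Strategy
Cite this article: GU Qiong, YUAN Lei, NING Bin, WU Zhao, HUA Li, LI Wen-xin. A Novel Classification Algorithm for Imbalanced Datasets Based on Hybrid Resampling Strategy[J]. Computer Engineering & Science, 2012, 34(10): 128-134.
Authors: GU Qiong  YUAN Lei  NING Bin  WU Zhao  HUA Li  LI Wen-xin
Affiliation: School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang, Hubei 441053
Funding: National Natural Science Foundation of China; Natural Science Foundation of Hubei Province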
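Abstract: Imbalanced data is a common problem in classification: a dataset is class-imbalanced when the instances of one class far outnumber those of another. Many real-world classification problems are imbalanced, and this has drawn the attention of many researchers. The classification of imbalanced data has become a new research focus in data mining and pattern recognition and poses a major challenge to traditional classification algorithms. This paper proposes a new resampling algorithm. It first oversamples the minority class with an improved SMOTE algorithm, generating new minority-class samples so that the class sizes are roughly balanced. Then, in light of the characteristics of the SMO algorithm, it applies a clustering-based undersampling method to delete redundant or noisy data. After the dataset has been oversampled and cleaned, the useful samples are retained and the dataset is smaller, which improves the efficiency of support vector machine training. Experimental results show that the method effectively improves classification accuracy on the minority class while maintaining overall classification performance.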

Keywords: classification  imbalanced dataset  preprocessing  hybrid resampling  SMOTE  clustering

A Novel Classification Algorithm for Imbalanced Datasets Based on Hybrid Resampling Strategy
GU Qiong, YUAN Lei, NING Bin, WU Zhao, HUA Li, LI Wen-xin. A Novel Classification Algorithm for Imbalanced Datasets Based on Hybrid Resampling Strategy[J]. Computer Engineering & Science, 2012, 34(10): 128-134.
Authors: GU Qiong  YUAN Lei  NING Bin  WU Zhao  HUA Li  LI Wen-xin
Affiliation: School of Mathematics and Computer Science, Hubei University of Arts and Science, Xiangyang 441053, China
Abstract: Imbalanced data is a common problem in classification; it arises when one class has far fewer examples than the others. Its presence in many real-world applications has attracted growing attention from researchers. Learning classifiers from datasets with imbalanced class distributions is a challenging problem in the data mining and pattern recognition communities. In this paper, we present a novel preprocessing approach that combines unsupervised clustering and supervised learning to handle imbalanced datasets, and we apply it to training SMO. The proposed algorithm reduces the imbalance ratio by constructing new samples with an improved synthetic minority oversampling technique (SMOTE), and then clusters both classes to delete redundant or noisy samples. The useful samples are thus retained, which improves computational efficiency. Experimental results show that the proposed approach effectively improves classification accuracy on the minority classes while maintaining overall classification performance.
Keywords: classification  imbalanced dataset  preprocessing  hybrid resampling  SMOTE  clustering
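
The two stages described in the abstract (oversample the minority class, then cluster each class and delete samples) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses plain SMOTE rather than the paper's improved variant, and a simple k-means rule (keep only the samples nearest each centroid) stands in for the clustering-based cleaning, whose details the abstract does not give. All function names, parameters, and the toy data are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE (assumption: not the paper's improved variant).

    Each synthetic point lies on the segment between a random minority
    sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def assign(points, centroids):
    """Group points by their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        clusters[nearest].append(p)
    return clusters


def cluster_undersample(points, n_clusters, keep_per_cluster, n_iter=20, seed=0):
    """Hypothetical cleaning rule: run k-means, then keep only the
    keep_per_cluster samples closest to each centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        clusters = assign(points, centroids)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
    kept = []
    for centre, members in zip(centroids, assign(points, centroids)):
        members.sort(key=lambda p: math.dist(p, centre))
        kept.extend(members[:keep_per_cluster])
    return kept


# Toy data: 50 majority points on a grid, 5 minority points elsewhere.
majority = [(x / 10, y / 10) for x in range(10) for y in range(5)]
minority = [(3.0, 3.0), (3.2, 3.1), (3.1, 3.4), (3.4, 3.3), (3.3, 3.0)]

# Stage 1: oversample the minority class until the classes are the same size.
minority_balanced = minority + smote(minority, n_new=len(majority) - len(minority))

# Stage 2: cluster each class and drop the samples far from every centroid.
majority_reduced = cluster_undersample(majority, n_clusters=5, keep_per_cluster=4)
minority_reduced = cluster_undersample(minority_balanced, n_clusters=5, keep_per_cluster=4)
print(len(minority_balanced), len(majority_reduced), len(minority_reduced))
```

In the pipeline the abstract describes, the reduced sets would then be used to train the SVM with SMO; that step is omitted here. The cluster count and the number of samples kept per cluster are arbitrary values chosen for the toy data.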
This article is indexed in CNKI, Wanfang Data, and other databases.