
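A Classification Method for Imbalanced Data Sets Based on a Hybrid Strategy
Cite this article: Li Peng, Wang Xiaolong, Liu Yuanchao, Wang Baoxun. A classification method for imbalanced data sets based on a hybrid strategy [J]. Acta Electronica Sinica, 2007, 35(11): 2161-2165.
Authors: Li Peng, Wang Xiaolong, Liu Yuanchao, Wang Baoxun
Affiliation (all four authors): School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001
Funding: National Natural Science Foundation of China; National High Technology Research and Development Program of China (863 Program)
Abstract: This paper proposes a classification method for imbalanced data sets. Its core idea is to approach the problem from two sides, sample preprocessing and classifier improvement, so as to provide a comprehensive solution to imbalanced classification. First, the imbalanced data set is resampled by dynamic self-organizing map clustering, which effectively avoids the drawbacks of traditional resampling methods: strong randomness, subjective human interference, and information loss. Next, drawing on the K-nearest-neighbor rule, the newly sampled data are pruned, which effectively resolves the data overlap that occurs in practice. The algorithm then applies a conformal transformation to the SVM kernel function, thereby calibrating the class boundary to suit the class imbalance. Comparative experiments on three algorithms demonstrate the effectiveness of the method on imbalanced data sets. The algorithm has been applied successfully to answer extraction and was validated in the international TREC 2006 QA evaluation.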

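Keywords: imbalanced data set; classification; support vector machine; dynamic self-organizing map; K-nearest neighbor
Article ID: 0372-2112(2007)11-2161-06
Received: 2007-04-11
Revised: 2007-08-30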
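
The K-nearest-neighbor pruning step named in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact rule, so the version below is an assumption: it drops any sample whose k nearest neighbors mostly carry a different label (a simple edited-nearest-neighbor style rule). The function name `knn_prune` and the default `k=3` are hypothetical choices made for illustration only.

```python
import math
from collections import Counter

def knn_prune(points, labels, k=3):
    """Drop samples whose k nearest neighbours mostly carry another label.

    Assumed rule for illustration; the paper's exact pruning criterion
    is not given in the abstract.
    """
    kept_points, kept_labels = [], []
    for i, (p, y) in enumerate(zip(points, labels)):
        # Distance from sample i to every other sample, paired with that
        # sample's label, sorted by distance only.
        others = [(math.dist(p, q), labels[j])
                  for j, q in enumerate(points) if j != i]
        others.sort(key=lambda pair: pair[0])
        neighbours = others[:k]
        votes = Counter(label for _, label in neighbours)
        # Keep the sample only if its own label holds a strict majority.
        if votes[y] * 2 > len(neighbours):
            kept_points.append(p)
            kept_labels.append(y)
    return kept_points, kept_labels
```

A sample that sits inside the other class's region is removed, while samples in clean regions are kept, which is the intended effect on overlapping data.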

A Classification Method for Imbalance Data Set Based on Hybrid Strategy
LI Peng, WANG Xiao-long, LIU Yuan-chao, WANG Bao-xun. A Classification Method for Imbalance Data Set Based on Hybrid Strategy[J]. Acta Electronica Sinica, 2007, 35(11): 2161-2165.
Authors: LI Peng, WANG Xiao-long, LIU Yuan-chao, WANG Bao-xun
Affiliation: School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
Abstract: This paper presents a novel and effective classification method for imbalanced data sets (IDS). The core idea of the algorithm, which is composed of three parts, is to provide a general solution for IDS classification through both sample preprocessing and classifier improvement. First, we resample the imbalanced data using variable SOM clustering to overcome the flaws of traditional resampling methods, such as strong randomness, subjective interference, and information loss. Then we prune the sampled data sets according to the K-NN rule to resolve data overlap, which improves the generalization of the SVM. Finally, to adapt to the class imbalance, class boundary alignment is introduced through a conformal transformation of the kernel function. Comparative experiments on three algorithms show the effectiveness of the method. The algorithm has also been used in our question answering system, which obtained outstanding results in the international TREC 2006 QA track.
Keywords: imbalanced data sets (IDS); classification; support vector machine (SVM); variable self-organizing maps (VSOM); K-nearest neighbor (K-NN)
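
The conformal transformation of the kernel mentioned in the abstract can be sketched as follows. In general, a conformal transformation rescales a kernel K into K~(x, z) = D(x) K(x, z) D(z) for some positive function D, which magnifies the induced metric where D is large. The specific D used below (a weighted sum of Gaussians centred on hypothetical `anchors`), the RBF base kernel, and the parameters `gamma`, `tau`, and `weights` are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def rbf_kernel(x, z, gamma=0.5):
    """Base kernel K(x, z) = exp(-gamma * ||x - z||^2) (assumed choice)."""
    return math.exp(-gamma * math.dist(x, z) ** 2)

def make_scaling(anchors, weights, tau=1.0):
    """Build D(x) = sum_i w_i * exp(-||x - a_i||^2 / tau^2).

    With positive weights, D is positive everywhere. Giving larger weights
    to anchors near the minority class is one hypothetical way to stretch
    the space around that class.
    """
    def scaling(x):
        return sum(w * math.exp(-math.dist(x, a) ** 2 / tau ** 2)
                   for a, w in zip(anchors, weights))
    return scaling

def conformal_kernel(kernel, scaling):
    """Return the transformed kernel K~(x, z) = D(x) * K(x, z) * D(z)."""
    def transformed(x, z):
        return scaling(x) * kernel(x, z) * scaling(z)
    return transformed
```

Because D(x) K(x, z) D(z) is symmetric and remains positive semidefinite whenever K is, the transformed function is still a valid kernel and could be passed to any SVM implementation that accepts a user-supplied kernel.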
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.