A New Learning Algorithm for Imbalanced Data: PCBoost
Cite this article: LI Xiong-Fei, LI Jun, DONG Yuan-Fang, QU Cheng-Wei. A New Learning Algorithm for Imbalanced Data: PCBoost[J]. Chinese Journal of Computers, 2012, 35(2): 2202-2209.
Authors: LI Xiong-Fei  LI Jun  DONG Yuan-Fang  QU Cheng-Wei
Affiliations: 1. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012
2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; Department of Applied Mathematics, Changchun University of Science and Technology, Changchun 130022
3. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012; School of Economics and Management, Changchun University of Science and Technology, Changchun 130022
Funding: This work was supported by the National Key Technology R&D Program of China and the Science and Technology Development Program of Jilin Province.
Abstract: Imbalanced data is widespread in the real world, and classifying it is an active research topic in machine learning. Most traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so they perform poorly on imbalanced data. This paper proposes PCBoost, a classification algorithm for imbalanced data. The algorithm builds decision trees with information gain ratio as the splitting criterion and uses them as weak classifiers. At the start of each iteration, it adds synthetic minority-class examples produced by a data synthesis method in order to balance the training information. After each sub-classifier is formed, it corrects the "perturbation" by deleting the synthetic examples that were not classified correctly. The paper discusses the data synthesis method, gives a theoretical analysis of the training error bound, and analyzes the choice of ensemble learning parameters. Experimental results show that PCBoost has advantages on imbalanced data classification problems.

Keywords: data mining  imbalanced data  ensemble learning  boosting  perturbation
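
The loop described in the abstract (add synthetic minority examples, train a weak classifier, delete the synthetic examples it misclassifies) can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: the synthesis step is a simple SMOTE-like interpolation between random pairs of minority points, a one-level decision stump stands in for the gain-ratio decision tree, new synthetic examples receive the mean current weight, and the weight update is the standard AdaBoost rule. All function names (`synthesize`, `fit_stump`, `pcboost_sketch`, `predict`) are hypothetical.

```python
import math
import random

def synthesize(minority, n_new, rng):
    """Assumed synthesis step: interpolate between random pairs of minority points."""
    out = []
    for _ in range(n_new):
        a, b = rng.choice(minority), rng.choice(minority)
        t = rng.random()
        out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return out

def stump_predict(stump, x):
    """Predict +1 or -1 with a stump stored as (error, feature, threshold, sign)."""
    _, j, thr, sign = stump
    return sign if x[j] > thr else -sign

def fit_stump(X, y, w):
    """Weighted decision stump (a stand-in for the paper's gain-ratio tree)."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            for sign in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if (sign if x[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def pcboost_sketch(X, y, rounds=5, seed=0):
    """Labels: +1 = minority class, -1 = majority class. Returns [(alpha, stump)]."""
    rng = random.Random(seed)
    X, y = [list(x) for x in X], list(y)
    w = [1.0 / len(X)] * len(X)
    synthetic = [False] * len(X)
    ensemble = []
    for _ in range(rounds):
        # Step 1: add synthetic minority examples until the classes are balanced.
        minority = [x for x, yi in zip(X, y) if yi == 1]
        n_new = max(0, (len(X) - len(minority)) - len(minority))
        new = synthesize(minority, n_new, rng) if minority else []
        mean_w = sum(w) / len(w)  # assumption: new points get the mean weight
        X += new
        y += [1] * len(new)
        w += [mean_w] * len(new)
        synthetic += [True] * len(new)
        total = sum(w)
        w = [wi / total for wi in w]
        # Step 2: train the weak classifier on the augmented, weighted data.
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Step 3: correct the perturbation by dropping misclassified synthetic points.
        pred = [stump_predict(stump, x) for x in X]
        kept = [i for i in range(len(X)) if not (synthetic[i] and pred[i] != y[i])]
        # Step 4: standard AdaBoost reweighting of the examples that remain.
        w = [w[i] * math.exp(-alpha * y[i] * pred[i]) for i in kept]
        X = [X[i] for i in kept]
        y = [y[i] for i in kept]
        synthetic = [synthetic[i] for i in kept]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1
```

Because only synthetic examples are ever deleted, the original training set is never shrunk; the deletion step merely limits how far the added points can distort later rounds.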
This article is indexed in CNKI, Wanfang Data, and other databases.