Binary Ensemble Classification for Imbalanced Big Data Based on MapReduce and Upper Sampling
Cite this article: Zhai Junhai, Zhang Mingyang, Wang Chenxi, Liu Xiaomeng, Wang Yaoda. Binary Ensemble Classification for Imbalanced Big Data Based on MapReduce and Upper Sampling[J]. Journal of Data Acquisition & Processing, 2018, 33(3): 416-425.
Authors: Zhai Junhai  Zhang Mingyang  Wang Chenxi  Liu Xiaomeng  Wang Yaoda
Affiliation: 1. Key Lab of Machine Learning and Computational Intelligence of Hebei Province, Baoding, 071002, China; 2. College of Mathematics and Information Science, Hebei University, Baoding, 071002, China; 3. College of Computer Science and Technology, Hebei University, Baoding, 071002, China
Funding: Supported by the National Natural Science Foundation of China (71371063); the Natural Science Foundation of Hebei Province (F2017201026); the Natural Science Research Program of Hebei University (799207217071).
Abstract: This paper proposes a classification method for two-class imbalanced big data based on MapReduce and upper sampling (oversampling). The method has five steps: (1) For each positive instance, MapReduce is used to find its nearest neighbor from the opposite class. (2) Several new positive instances are generated on the line between the two points. (3) Using the size of the enlarged set of positive instances as the reference, the negative instances are randomly partitioned into several subsets. (4) Each subset of negative instances is combined with the set of positive instances to form a balanced subset. (5) An extreme learning machine classifier is trained on each balanced subset, and the trained classifiers are combined by majority voting to classify new instances. The method is compared experimentally with three related methods on five two-class imbalanced big data sets, and the results show that it outperforms all three.

Keywords: big data  imbalanced classification  upper sampling  nearest neighbor
Received: 2016/6/7
Revised: 2016/11/29
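
The five steps in the abstract can be illustrated with a small in-memory sketch. This is not the authors' implementation: it uses NumPy instead of a MapReduce cluster, with a chunked scan over the negative instances standing in for the map and reduce phases of step (1). All names (`nearest_opposite_neighbor`, `oversample`, `partition_negatives`, `TinyELM`, `train_ensemble`, `predict_majority`) and parameters (`chunk_size`, `per_instance`, `max_frac`, `n_hidden`) are hypothetical. The choice to keep synthetic points in the half of the segment nearer the positive instance is an assumption, as is the tanh hidden layer of the toy extreme learning machine.

```python
import numpy as np

def nearest_opposite_neighbor(pos, neg, chunk_size=1024):
    """Step 1: index of the nearest negative instance for each positive one.
    Each chunk plays the role of a mapper; merging minima is the reduce."""
    best_dist = np.full(len(pos), np.inf)
    best_idx = np.zeros(len(pos), dtype=int)
    for start in range(0, len(neg), chunk_size):
        chunk = neg[start:start + chunk_size]
        # Squared Euclidean distances, shape (n_pos, n_chunk).
        d = ((pos[:, None, :] - chunk[None, :, :]) ** 2).sum(axis=2)
        local_idx = d.argmin(axis=1)
        local_dist = d[np.arange(len(pos)), local_idx]
        better = local_dist < best_dist
        best_dist[better] = local_dist[better]
        best_idx[better] = start + local_idx[better]
    return best_idx

def oversample(pos, neg, per_instance, rng, max_frac=0.5):
    """Step 2: create per_instance synthetic positives on the segment from
    each positive to its nearest negative (assumption: t in [0, max_frac))."""
    nn = neg[nearest_opposite_neighbor(pos, neg)]
    t = rng.uniform(0.0, max_frac, size=(len(pos), per_instance, 1))
    synthetic = pos[:, None, :] + t * (nn - pos)[:, None, :]
    return np.vstack([pos, synthetic.reshape(-1, pos.shape[1])])

def partition_negatives(neg, target_size, rng):
    """Step 3: random split of negatives into subsets of about target_size."""
    n_subsets = max(1, len(neg) // target_size)
    return np.array_split(neg[rng.permutation(len(neg))], n_subsets)

class TinyELM:
    """Toy extreme learning machine: random hidden layer, least-squares output."""
    def __init__(self, n_hidden, rng):
        self.n_hidden, self.rng = n_hidden, rng

    def fit(self, X, y):  # y holds labels in {-1, +1}
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        scores = np.tanh(X @ self.W + self.b) @ self.beta
        return np.where(scores >= 0, 1, -1)

def train_ensemble(pos, neg, per_instance=2, n_hidden=20, seed=0):
    """Steps 4 and 5: one classifier per balanced subset."""
    rng = np.random.default_rng(seed)
    new_pos = oversample(pos, neg, per_instance, rng)
    models = []
    for neg_subset in partition_negatives(neg, len(new_pos), rng):
        X = np.vstack([new_pos, neg_subset])
        y = np.concatenate([np.ones(len(new_pos)), -np.ones(len(neg_subset))])
        models.append(TinyELM(n_hidden, rng).fit(X, y))
    return models

def predict_majority(models, X):
    """Majority vote over the ensemble; a tie is resolved as positive (+1)."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.where(votes >= 0, 1, -1)
```

With 20 positive and 300 negative instances and `per_instance=2`, the enlarged positive set has 60 instances, so the negatives are split into 5 subsets and 5 classifiers are trained. On a real cluster the distance computation in step (1) would be distributed across mappers, which this sketch only imitates.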