
Ensemble classification model for distributed drifted data streams
Citation: YIN Chunyong (尹春勇), ZHANG Guojie (张帼杰). Ensemble classification model for distributed drifted data streams (面向分布式漂移数据流的集成分类模型) [J]. Journal of Computer Applications (计算机应用), 2021, 41(7): 1947-1955.
Authors: YIN Chunyong, ZHANG Guojie
Affiliation: School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, Jiangsu 210044, China
Funding: National Natural Science Foundation of China (61772282).
Abstract: To address low classification accuracy in big data environments, an ensemble classification model for distributed data streams is proposed. First, micro-clusters are used to reduce the amount of data transmitted from the local nodes to the central node, which lowers the communication cost. Second, a sample reconstruction algorithm generates the training samples for the global classifier. Finally, an ensemble classification model for drifting data streams is proposed: it adopts a weighted combination of dynamic classifiers and stable classifiers, and uses a mixed labeling strategy to label the most representative instances for updating the ensemble. Experiments on two synthetic datasets and two real datasets show that the model fluctuates less under concept drift than two distributed mining models, DS-means and BDS-ensemble, and is more accurate than the Online Active Learning Ensemble model (OALEnsemble), with accuracy on the four datasets improved by 1.58, 0.97, 0.77 and 1.91 percentage points respectively. Although the model consumes slightly more memory than DS-means and BDS-ensemble, it obtains a comparatively large gain in classification performance for a small memory cost. The model is therefore suited to classifying big data that is both distributed and streaming, for example in network monitoring and banking systems.
Keywords: distributed; data stream; ensemble; classification; concept drift
Received: 2020-08-21
Revised: 2020-11-27
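
The abstract names two mechanisms that a short sketch may help make concrete: summarizing local data as micro-clusters before sending it to the central node, and combining classifiers by weight. The Python below is only an illustration under stated assumptions. It assumes a cluster-feature style summary (count, linear sum, squared sum) and a plain weighted vote; the abstract does not give the paper's actual micro-cluster definition, sample reconstruction algorithm, or weighting scheme, and the names `MicroCluster` and `weighted_vote` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class MicroCluster:
    """Additive summary of a group of points (assumed form, not the paper's).

    Only the count, per-dimension linear sum, and per-dimension squared sum
    are stored, so a local node can transmit 2 * dim + 1 numbers instead of
    every raw instance.
    """
    dim: int
    n: int = 0
    linear_sum: List[float] = field(default_factory=list)
    square_sum: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start from zero vectors when no sums are supplied.
        if not self.linear_sum:
            self.linear_sum = [0.0] * self.dim
        if not self.square_sum:
            self.square_sum = [0.0] * self.dim

    def add(self, x: Sequence[float]) -> None:
        """Absorb one instance into the summary."""
        self.n += 1
        for i, v in enumerate(x):
            self.linear_sum[i] += v
            self.square_sum[i] += v * v

    def merge(self, other: "MicroCluster") -> None:
        """Combine two summaries; all three statistics are additive."""
        self.n += other.n
        for i in range(self.dim):
            self.linear_sum[i] += other.linear_sum[i]
            self.square_sum[i] += other.square_sum[i]

    def centroid(self) -> List[float]:
        """Mean of the absorbed instances, recovered from the summary."""
        return [s / self.n for s in self.linear_sum]


def weighted_vote(predictions: Sequence[str], weights: Sequence[float]) -> str:
    """Return the label with the largest total weight (a generic weighted vote)."""
    scores: Dict[str, float] = {}
    for label, w in zip(predictions, weights):
        scores[label] = scores.get(label, 0.0) + w
    return max(scores, key=scores.get)


# A local node summarizes two instances, then merges a summary from a peer.
local = MicroCluster(dim=2)
local.add([1.0, 2.0])
local.add([3.0, 4.0])
peer = MicroCluster(dim=2)
peer.add([5.0, 6.0])
local.merge(peer)

# One heavily weighted classifier can outvote two lightly weighted ones.
decision = weighted_vote(["normal", "attack", "attack"], [0.7, 0.2, 0.2])
```

Because the three statistics are additive, summaries from different local nodes can be merged at the central node without access to the raw instances, which is the property that makes this kind of summary attractive for reducing communication.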
This article is indexed in Wanfang Data (万方数据) and other databases.