
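A Data Stream Ensemble Classification Algorithm Based on Tri-training (一种基于Tri-training的数据流集成分类算法)
Cite as: Hu Xuegang, Ma Liwei, Li Peipei. A Data Stream Ensemble Classification Algorithm Based on Tri-training [J]. Journal of Data Acquisition & Processing (数据采集与处理), 2017, 32(5): 853-860.
Authors: Hu Xuegang (胡学钢)  Ma Liwei (马利伟)  Li Peipei (李培培)
Affiliation: Data Mining and Intelligent Computing Laboratory, School of Computer and Information, Hefei University of Technology, Hefei 230009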
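Abstract: Data stream classification is an important research task in data mining. Most existing data stream classification algorithms are trained on labeled data sets, yet in real applications only a very small share of the data in a stream is labeled. Labels can be obtained by manual annotation, but this is expensive and time-consuming. Because unlabeled data are abundant and carry a large amount of implicit information, this paper proposes a data stream ensemble classification algorithm based on Tri-training that exploits the unlabeled data while maintaining accuracy. The algorithm uses a sliding-window mechanism to divide the data stream into chunks. On the first k chunks, each containing both labeled and unlabeled data, it trains base classifiers with Tri-training, updating the classifiers through iterative weighted voting until every unlabeled instance has been assigned a label. The resulting k Tri-training ensemble models then predict chunk k+1; classifiers with a high classification error rate are discarded and new classifiers are rebuilt on the current chunk, which updates the current model. Experiments on 10 UCI data sets show that, compared with classical algorithms, the proposed algorithm significantly improves classification accuracy on data streams containing 80% unlabeled data.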

Keywords: data stream classification  Tri-training  unlabeled data  ensemble  weighted voting

Data Stream Ensemble Classification Algorithm Based on Tri-training
Hu Xuegang,Ma Liwei,Li Peipei.Data Stream Ensemble Classification Algorithm Based on Tri-training[J].Journal of Data Acquisition & Processing,2017,32(5):853-860.
Authors:Hu Xuegang  Ma Liwei  Li Peipei
Affiliation:Data Mining and Intelligence Computing Laboratory, School of Computer and Information, Hefei University of Technology, Hefei, 230009, China
Abstract:Data stream classification is one of the important research tasks in the field of data mining. Most existing data stream classification algorithms require labeled data for training. However, there are few labeled data in data streams in real applications. Labeled data can be obtained by manual labeling, but this is expensive and time-consuming. Considering that unlabeled data are abundant and rich in information, a data stream ensemble classification algorithm based on Tri-training for labeled and unlabeled data is proposed in this paper. The proposed algorithm divides the data stream into chunks with a sliding window and trains base classifiers with Tri-training on the first k incoming chunks, which contain both labeled and unlabeled data. The classifiers are then iteratively updated by weighted voting until all unlabeled data are labeled. The (k+1)-th data chunk is predicted with the ensemble model of k Tri-training classifiers; classifiers with a high classification error are discarded, and new classifiers are built on the current data chunk to update the model. Experiments on 10 UCI data sets show that, in comparison with traditional algorithms, the proposed algorithm significantly improves classification accuracy on data streams even with 80% unlabeled data.
Keywords:data stream classification  Tri-training  unlabeled data  ensemble  weighted voting
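
The chunk-based procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the base learner, the weighting formula, or the pseudo-labeling criteria, so `NearestCentroid`, the accuracy-based weights, and all function names here (`tri_train`, `run_stream`, `make_chunk`) are hypothetical. The Tri-training step is simplified (an unlabeled instance is pseudo-labeled for one learner when the other two agree, with the usual error-rate checks omitted), and errors are estimated on the labeled part of each incoming chunk.

```python
import random
from collections import Counter


class NearestCentroid:
    """Tiny stand-in base learner (an assumption, not the paper's choice)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for j, v in enumerate(x):
                acc[j] += v
        self.centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
        return self

    def predict_one(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=dist)


def tri_train(X_lab, y_lab, X_unl, rng, rounds=3):
    """Simplified Tri-training: a point is pseudo-labeled for learner i
    when the other two learners agree on it (error-rate checks omitted)."""
    n = len(X_lab)
    models = []
    for _ in range(3):  # three learners, each on its own bootstrap sample
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(NearestCentroid().fit([X_lab[i] for i in idx],
                                            [y_lab[i] for i in idx]))
    for _ in range(rounds):
        updated = []
        for i in range(3):
            a, b = [models[j] for j in range(3) if j != i]
            X_i, y_i = list(X_lab), list(y_lab)
            for x in X_unl:
                label = a.predict_one(x)
                if label == b.predict_one(x):
                    X_i.append(x)
                    y_i.append(label)
            updated.append(NearestCentroid().fit(X_i, y_i))
        models = updated
    return models


def tri_predict(models, x):
    """Majority vote of the three learners in one Tri-training model."""
    return Counter(m.predict_one(x) for m in models).most_common(1)[0][0]


def accuracy(models, X, y):
    return sum(tri_predict(models, x) == t for x, t in zip(X, y)) / len(y)


def ensemble_predict(ensemble, x):
    """Weighted vote: each member votes with its most recent accuracy."""
    votes = Counter()
    for models, weight in ensemble:
        votes[tri_predict(models, x)] += weight
    return votes.most_common(1)[0][0]


def run_stream(chunks, k=3, seed=0):
    """chunks: iterable of (X, y); y[i] is None for an unlabeled instance."""
    rng = random.Random(seed)
    ensemble, accs = [], []
    for X, y in chunks:
        X_lab = [x for x, t in zip(X, y) if t is not None]
        y_lab = [t for t in y if t is not None]
        X_unl = [x for x, t in zip(X, y) if t is None]
        if not y_lab:
            continue  # nothing to train or evaluate on in this sketch
        if len(ensemble) == k:
            # Test-then-train: score the ensemble on the chunk's labeled part.
            hits = sum(ensemble_predict(ensemble, x) == t
                       for x, t in zip(X_lab, y_lab))
            accs.append(hits / len(y_lab))
            # Re-weight members on this chunk and drop the least accurate one.
            ensemble = [(m, accuracy(m, X_lab, y_lab)) for m, _ in ensemble]
            worst = min(range(k), key=lambda i: ensemble[i][1])
            ensemble.pop(worst)
        models = tri_train(X_lab, y_lab, X_unl, rng)
        ensemble.append((models, accuracy(models, X_lab, y_lab)))
    return ensemble, accs


def make_chunk(rng, n=200, unlabeled=0.8):
    """Synthetic two-class chunk with roughly 80% of labels hidden."""
    X, y = [], []
    for _ in range(n):
        label = rng.randrange(2)
        X.append([rng.gauss(4.0 * label, 1.0), rng.gauss(4.0 * label, 1.0)])
        y.append(label if rng.random() >= unlabeled else None)
    return X, y
```

Calling `run_stream([make_chunk(random.Random(1)) for _ in range(6)], k=3)` returns the final ensemble of k members and one accuracy value per chunk seen after the ensemble was full. Weighting each member by its accuracy on the newest chunk is one plausible reading of "weighted voting"; the paper may define the weights differently.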