Facing the reality of data stream classification: coping with scarcity of labeled data
Authors: Mohammad M. Masud, Clay Woolam, Jing Gao, Latifur Khan, Jiawei Han, Kevin W. Hamlen, Nikunj C. Oza
Affiliations:
1. Department of Computer Science, University of Texas at Dallas, Richardson, TX, 75080, USA
2. Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
3. Intelligent Systems Division, NASA Ames Research Center, Moffett Field, CA, 94035, USA
Abstract: Recent approaches for classifying data streams are mostly based on supervised learning algorithms, which can only be trained with labeled data. Manual labeling of data is both costly and time-consuming. Therefore, in a real streaming environment where large volumes of data arrive at high speed, only a small fraction of the data can be labeled. As a result, only a limited number of instances are available for training and updating the classification models, which leads to poorly trained classifiers. We apply a novel technique to overcome this problem by using both unlabeled and labeled instances to train and update the classification model. Each classification model is built as a collection of micro-clusters using semi-supervised clustering, and an ensemble of these models is used to classify unlabeled data. Empirical evaluation on both synthetic and real data reveals that our approach outperforms state-of-the-art stream classification algorithms that use ten times more labeled data than our approach.
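To make the idea in the abstract concrete, here is a minimal sketch of an ensemble of micro-cluster models. It is not the authors' algorithm: plain k-means with naive seeding stands in for the semi-supervised clustering, a micro-cluster is reduced to a centroid plus the majority label of its labeled members, and the names `nearest`, `build_model`, and `classify` are hypothetical.

```python
import math
from collections import Counter


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def build_model(points, labels, k, iterations=10):
    """Cluster one data chunk and return micro-clusters as (centroid, label) pairs.

    labels[i] is None for an unlabeled point. Plain k-means is an assumption
    standing in for the semi-supervised clustering named in the abstract.
    """
    centroids = [list(p) for p in points[:k]]  # naive deterministic seeding
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:  # labeled and unlabeled points both shape the clusters
            groups[nearest(p, centroids)].append(p)
        for i, group in enumerate(groups):
            if group:
                centroids[i] = [sum(c) / len(group) for c in zip(*group)]
    # Count the known labels that fall into each cluster.
    counts = [Counter() for _ in range(k)]
    for p, y in zip(points, labels):
        if y is not None:
            counts[nearest(p, centroids)][y] += 1
    # Keep only clusters that contain at least one labeled point.
    return [(centroids[i], counts[i].most_common(1)[0][0])
            for i in range(k) if counts[i]]


def classify(ensemble, point):
    """Majority vote: each model votes with the label of its nearest micro-cluster."""
    votes = Counter()
    for model in ensemble:
        if model:  # skip models that ended up with no labeled cluster
            idx = nearest(point, [c for c, _ in model])
            votes[model[idx][1]] += 1
    return votes.most_common(1)[0][0] if votes else None
```

In this sketch each incoming chunk would yield one model, and the ensemble is simply a list of such models. Only a few points per chunk need labels, since unlabeled points still influence where the centroids end up.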
This article is indexed in SpringerLink and other databases.