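A Noise-Robust Classification Algorithm for Data Streams
Cite this article: Wang Yong, Li Zhanhuai, Zhang Yang. A Noise-Robust Classification Algorithm for Data Streams[J]. Journal of Northwestern Polytechnical University, 2007, 25(4): 603-607
Authors: Wang Yong  Li Zhanhuai  Zhang Yang
Affiliations: 1. Department of Computer Science and Software, Northwestern Polytechnical University, Xi'an, Shaanxi 710072
2. College of Information Engineering, Northwest A&F University, Xi'an, Shaanxi 712100
Abstract: The main problems in data stream mining are concept drift and noise contamination. Current data stream mining algorithms cannot handle noise in data streams effectively, whereas an ideal learning algorithm should be both sensitive to concept drift and robust to noise. This paper explores how to use clustering to distinguish noisy instances from hard-to-learn instances in a data stream, and proposes a corresponding concept drift detection method. On this basis, it designs RobustBoosting, an ensemble classifier algorithm based on boosting. Experiments on synthetic and real-world data sets show that, even with class noise as high as 40%, the proposed algorithm maintains higher classification accuracy than the AdaptiveBoosting algorithm [1] and converges to the new target concept faster.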

Keywords: RobustBoosting algorithm  data stream  concept drift  noisy instances
Article ID: 1000-2758(2007)04-0603-05
Revised: 2006-04-21

A Better Algorithm for Classifying Data Streams with Concept Drifting
Wang Yong, Li Zhanhuai, Zhang Yang. A Better Algorithm for Classifying Data Streams with Concept Drifting[J]. Journal of Northwestern Polytechnical University, 2007, 25(4): 603-607
Authors: Wang Yong  Li Zhanhuai  Zhang Yang
Abstract: Existing algorithms cannot strike a good balance between robustness to data noise and sensitivity to concept drift. We propose an algorithm, RobustBoosting, that we believe strikes this balance better than existing algorithms. The full paper explains the algorithm in detail; this abstract lists its four topics with brief remarks. The first topic is distinguishing data noise from hard-to-learn samples. The second topic is separating hard-to-learn samples from data noise with a density-based clustering method. The separation is not absolute but probabilistic: it yields two groups, one consisting mostly of hard-to-learn samples with an insignificant amount of data noise, and another that is the reverse. The third topic is discovering concept drift, for which we derive three equations. The fourth topic is the design of the RobustBoosting algorithm. We compared RobustBoosting with the AdaptiveBoosting algorithm [1] on both synthetic and real-life data sets. The experimental results, given in two figures in the full paper, show preliminarily that the proposed method has a substantial advantage over AdaptiveBoosting in prediction accuracy, and that it converges to target concepts quickly and accurately even when 40% of the samples are noisy.
Keywords:RobustBoosting algorithm  data stream  concept drifting  data noise
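
The density-based separation mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's method, which the abstract does not specify; it assumes a simple heuristic in which a misclassified sample counts as hard-to-learn when enough other misclassified samples with the same label lie nearby, and as probable noise otherwise. The function name `split_misclassified` and the parameters `eps` and `min_pts` are hypothetical.

```python
from math import dist


def split_misclassified(samples, eps=1.0, min_pts=2):
    """Split misclassified samples into (hard_to_learn, probable_noise).

    Assumed heuristic, not taken from the paper: randomly flipped labels
    tend to be isolated, while samples from a genuinely difficult region
    tend to cluster together under the same label.

    samples: list of (point, label) pairs the current classifier got wrong,
             where point is a tuple of floats.
    eps:     neighborhood radius (hypothetical parameter).
    min_pts: minimum number of same-label neighbors within eps for a
             sample to count as hard-to-learn (hypothetical parameter).
    """
    hard, noise = [], []
    for i, (point, label) in enumerate(samples):
        # Count other misclassified samples with the same label within eps.
        neighbors = sum(
            1
            for j, (other_point, other_label) in enumerate(samples)
            if j != i and other_label == label and dist(point, other_point) <= eps
        )
        if neighbors >= min_pts:
            hard.append((point, label))
        else:
            noise.append((point, label))
    return hard, noise


if __name__ == "__main__":
    wrong = [
        ((0.0, 0.0), 1), ((0.1, 0.0), 1), ((0.0, 0.1), 1),  # dense group
        ((5.0, 5.0), 0), ((9.0, 1.0), 1),                    # isolated points
    ]
    hard, noise = split_misclassified(wrong, eps=0.5, min_pts=2)
    print(len(hard), "hard-to-learn;", len(noise), "probable noise")
```

As the abstract notes, a split of this kind is probabilistic rather than absolute: each group is only mostly of one type, so the choice of `eps` and `min_pts` trades one kind of error for the other.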
This article is indexed in VIP (维普), Wanfang Data (万方数据), and other databases.