首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 109 毫秒
1.
数据流中的不平衡问题会严重影响算法的分类性能,其中概念漂移更是流数据挖掘研究领域的一个难点问题。为了提高此类问题下的分类性能,提出了一种新的基于Hellinger距离的不平衡漂移数据流Boosting分类BCA-HD算法。该算法创新性地采用实例级和分类器级的权重组合方式来动态更新分类器,以适应概念漂移的发生,在底层采用集成算法SMOTEBoost作为基分类器,该分类器内部使用重采样技术处理数据的不平衡。在16个突变型和渐变型的数据集上将所提算法与9种不同算法进行比较,实验结果表明,所提算法的G-mean和AUC的平均值和平均排名均为第1名。因此,该算法能更好地适应概念漂移和不平衡现象的同时发生,有助于提高分类性能。  相似文献   

2.
数据流分类已成为当前研究热点之一,如何解决其中的概念漂移和噪声是关键问题,为此提出了一种新的基 于分类器相似性的动态集成算法。由于数据流中相部数据具有相同概念的概率较大,因此用最新基分类器代表数据 流中即将出现的概念,同时基于此分类器求出基分类器之间的相似性作为权值进行加权多数投票,并根据相似性大小 淘汰较弱基分类器以适应概念漂移和噪声。在标准仿真数据集上进行了仿真实验,结果表明该算法相比其他集成方 法在抗噪性能和分类准确性方面均得到显著提高。  相似文献   

3.
张杰  赵峰 《控制与决策》2013,28(1):29-35
鉴于流数据具有实时、连续、有序和无限等特点,使用近似方法便可检测连续分时段的流数据序列,基于此,运用目标分布数据,结合相似分布理论,提出了利用 Tr-OEM 算法对流数据中的概念漂移现象进行检测.该算法能够动态地判断流数据概念漂移的发生,自适应地优化概念漂移的检测值,适用于不同类型的流数据.通过分析和实验仿真可以表明,该算法在处理流数据概念漂移时具有较好的适应性.  相似文献   

4.
互联网环境日新月异,使得网络数据流中存在概念漂移,对数据流的分类也由传统的静态分类变为动态分类,而如何对概念漂移进行检测是动态分类的关键。本文提出一种基于概念漂移检测的网络数据流自适应分类算法,通过比较滑动窗口中数据与历史数据的分布差异来检测概念漂移,然后将窗口中数据过采样来减少样本间的不均衡性,最后将处理后的数据集输入到OS-ELM分类器中进行在线学习,从而更新分类器使其应对数据流中的概念漂移。本文在MOA实验平台中使用合成数据集和真实数据集对提出的算法进行验证,结果表明,该算法较集成学习算法在分类准确率和稳定性上有一定的提升,并且随着数据流量的增加,时间性能上的优势开始体现,适合复杂多变的网络环境。  相似文献   

5.
大部分数据流分类算法解决了数据流无限长度和概念漂移这两个问题。但是,这些算法需要人工专家将全部实例都标记好作为训练集来训练分类器,这在数据流高速到达并需要快速分类的环境中是不现实的,因为标记实例需要时间和成本。此时,如果采用监督学习的方法来训练分类器,由于标记数据稀少将得到一个弱分类器。提出一种基于主动学习的数据流分类算法,该算法通过选择全部实例中的一小部分来人工标记,其中这小部分实例是分类置信度较低的样本,从而可以极大地减少需要人工标记的实例数量。实验结果表明,该算法可以在数据流存在概念漂移情况下,使用较少的标记数据对数据流训练出分类器,并且分类效果良好。  相似文献   

6.
近年来,数据流挖掘已成为知识发现领域中的一个研究热点.数据流中数据的无限性和概念漂移等特征使得传统的分类算法不能很好地适用于数据流环境.提出了一种基于eEP的分类器集成算法CEEPCE(classification by eEP-based classifiers ensemble)对数据流进行分类.CEEPCE使用eEP建立基分类器,当新数据块流入时训练新的分类器,并调整集成分类器中的基分类器.依据基分类器在新流入数据上的分类误差对其进行加权,集成权重最高的若干个基分类器来分类未来数据.实验表明,与单分类器相比,CEEPCE具有更好的分类准确率,并足以与以C4.5为基分类器的集成方法相媲美.  相似文献   

7.
基于子空间集成的概念漂移数据流分类算法   总被引:4,自引:2,他引:2  
具有概念漂移的复杂结构数据流分类问题已成为数据挖掘领域研究的热点之一。提出了一种新颖的子空间分类算法,并采用层次结构将其构成集成分类器用于解决带概念漂移的数据流的分类问题。在将数据流划分为数据块后,在每个数据块上利用子空间分类算法建立若干个底层分类器,然后由这几个底层分类器组成集成分类模型的基分类器。同时,引入数理统计中的参数估计方法检测概念漂移,动态调整模型。实验结果表明:该子空间集成算法不但能够提高分类模型对复杂类别结构数据流的分类精度,而且还能够快速适应概念漂移的情况。  相似文献   

8.
现有概念漂移处理算法在检测到概念漂移发生后,通常需要在新到概念上重新训练分类器,同时“遗忘”以往训练的分类器。在概念漂移发生初期,由于能够获取到的属于新到概念的样本较少,导致新建的分类器在短时间内无法得到充分训练,分类性能通常较差。进一步,现有的基于在线迁移学习的数据流分类算法仅能使用单个分类器的知识辅助新到概念进行学习,在历史概念与新到概念相似性较差时,分类模型的分类准确率不理想。针对以上问题,文中提出一种能够利用多个历史分类器知识的数据流分类算法——CMOL。CMOL算法采取分类器权重动态调节机制,根据分类器的权重对分类器池进行更新,使得分类器池能够尽可能地包含更多的概念。实验表明,相较于其他相关算法,CMOL算法能够在概念漂移发生时更快地适应新到概念,显示出更高的分类准确率。  相似文献   

9.
一种基于双层窗口的概念漂移数据流分类算法   总被引:1,自引:0,他引:1  
数据流中概念漂移问题的研究已成为近年来流数据挖掘领域的研究热点之一. 已有的研究工作多依据单窗口中错误率的变化来检测概念漂移,难以适应不同类型的漂移. 为此, 本文提出一种新的基于双层窗口机制的数据流分类算法(Double-windows-based classification algorithm for concept drifting data streams, DWCDS),该算法采用随机决策树模型构建集成分类器, 利用双层窗口机制周期性地检测滑动窗口中流数据分布的变化,并动态地更新模型以适应概念漂移. 分析与实验结果表明: 该算法可以快速有效地跟踪检测含噪数据流中的概念漂移,且抗噪性能与分类精度显著提高.  相似文献   

10.
为了克服数据流概念漂移现象对分类模型的影响,提高数据流分类准确率,提出了一种基于概念漂移检测算法的数据流分类模型.针对不同概念漂移类型使用不同的方法进行检测,该模型通过对概念漂移进行监控,从而有效控制分类模型的更新频率,做到有的放矢地更新分类器模型,提高分类模型的分类性能.通过使用两种不同的数据集进行实验,并与传统分类模型进行比较,验证了该模型的有效性和正确性.  相似文献   

11.
Recent research shows that rule based models perform well while classifying large data sets such as data streams with concept drifts. A genetic algorithm is a strong rule based classification algorithm which is used only for mining static small data sets. If the genetic algorithm can be made scalable and adaptable by reducing its I/O intensity, it will become an efficient and effective tool for mining large data sets like data streams. In this paper a scalable and adaptable online genetic algorithm is proposed to mine classification rules for the data streams with concept drifts. Since the data streams are generated continuously in a rapid rate, the proposed method does not use a fixed static data set for fitness calculation. Instead, it extracts a small snapshot of the training example from the current part of data stream whenever data is required for the fitness calculation. The proposed method also builds rules for all the classes separately in a parallel independent iterative manner. This makes the proposed method scalable to the data streams and also adaptable to the concept drifts that occur in the data stream in a fast and more natural way without storing the whole stream or a part of the stream in a compressed form as done by the other rule based algorithms. The results of the proposed method are comparable with the other standard methods which are used for mining the data streams.  相似文献   

12.
张玉红  陈伟  胡学钢 《计算机科学》2016,43(12):179-182, 194
现实生活中网络监控、网络评论以及微博等应用领域涌现了大量文本数据流,这些数据的不完全标记和频繁概念漂移给已有的数据流分类方法带来了挑战。为此,面向不完全标记的文本数据流提出了一种自适应的数据流分类算法。该算法以一个标记数据块作为起始数据块,对未标记数据块首先提取标记数据块与未标记数据块之间的特征集,并利用特征在两个数据块间的相似度进行概念漂移检测,最后计算未标记数据中特征的极性并对数据进行预测。实验表明了算法在分类精度上的优越性,尤其在标记信息较少和概念漂移较为频繁时。  相似文献   

13.
Data stream classification with artificial endocrine system   总被引:3,自引:3,他引:0  
Due to concept drifts, maintaining an up-to-date model is a challenging task for most of the current classification approaches used in data stream mining. Both the incremental classifiers and the ensemble classifiers spend most of their time in updating their temporary models and at the same time, a big sample buffer for training a classifier is necessary for most of them. These two drawbacks constrain further application in classifying a data stream. In this paper, we present a hormone based nearest neighbor classification algorithm for data stream classification, in which the classifier is updated every time a new record arrives. The records could be seen as locations in the feature space, and each location can accommodate only one endocrine cell. The classifier consists of endocrine cells on the boundaries of different classes. Every time a new record arrives, the cell that resides in the most unfit location will move to the new arrived record. In this way, the changing boundaries between different classes are recorded by the locations where endocrine cells reside in. The main advantages of the proposed method are the saving of the sample buffer and the improving of the classification accuracy. It is very important for conditions where the hardware resources are very expensive or the main memory is limited. Experiments on synthetic and real life data sets show that the proposed algorithm is able to classify data streams with less memory space and classification error.  相似文献   

14.
Classifiers deployed in the real world operate in a dynamic environment, where the data distribution can change over time. These changes, referred to as concept drift, can cause the predictive performance of the classifier to drop over time, thereby making it obsolete. To be of any real use, these classifiers need to detect drifts and be able to adapt to them, over time. Detecting drifts has traditionally been approached as a supervised task, with labeled data constantly being used for validating the learned model. Although effective in detecting drifts, these techniques are impractical, as labeling is a difficult, costly and time consuming activity. On the other hand, unsupervised change detection techniques are unreliable, as they produce a large number of false alarms. The inefficacy of the unsupervised techniques stems from the exclusion of the characteristics of the learned classifier, from the detection process. In this paper, we propose the Margin Density Drift Detection (MD3) algorithm, which tracks the number of samples in the uncertainty region of a classifier, as a metric to detect drift. The MD3 algorithm is a distribution independent, application independent, model independent, unsupervised and incremental algorithm for reliably detecting drifts from data streams. Experimental evaluation on 6 drift induced datasets and 4 additional datasets from the cybersecurity domain demonstrates that the MD3 approach can reliably detect drifts, with significantly fewer false alarms compared to unsupervised feature based drift detectors. At the same time, it produces performance comparable to that of a fully labeled drift detector. The reduced false alarms enables the signaling of drifts only when they are most likely to affect classification performance. As such, the MD3 approach leads to a detection scheme which is credible, label efficient and general in its applicability.  相似文献   

15.
基于主要特征抽取的重现概念漂移处理算法   总被引:1,自引:1,他引:0  
针对重现概念漂移检测中的概念表征和分类器选择问题,提出了一种适用于含重现概念漂移的数据流分类的算法——基于主要特征抽取的概念聚类和预测算法(Conceptual clustering and prediction through main feature extraction, MFCCP)。MFCCP通过计算不同批次样本的主要特征及影响因子的差异度以识别重复出现的概念,为每个概念维持且及时更新一个分类器,并依据Hoeffding不等式选择最合适的分类器对当前样本集实施分类,以 提高对概念漂移的反应能力。在3个数据集上的实验表明:MFCCP在含重现概念漂移的数据集上的分类准确率,对概念漂移的反应能力及对概念漂移检测的准确率均明显优于其他4种 对比算法,且MFCCP也适用于对不含重现概念漂移的数据流进行分类。  相似文献   

16.
赵强利  蒋艳凰  卢宇彤 《软件学报》2015,26(10):2567-2580
集成式数据流挖掘是对存在概念漂移的数据流进行学习的重要方法.针对传统集成式数据流挖掘存在的缺陷,将人类的回忆和遗忘机制引入到数据流挖掘中,提出基于记忆的数据流挖掘模型MDSM(memorizing based data stream mining).该模型将基分类器看作是系统获得的知识,通过"回忆与遗忘"机制,不仅使历史上有用的基分类器因记忆强度高而保存在"记忆库"中,提高预测的稳定性,而且从"记忆库"中选取当前分类效果好的基分类器参与集成预测,以提高对概念变化的适应能力.基于MDSM模型,提出了一种集成式数据流挖掘算法MAE(memorizing based adaptive ensemble),该算法利用Ebbinghaus遗忘曲线对系统的遗忘机制进行设计,并利用选择性集成来模拟人类的"回忆"机制.与4种典型的数据流挖掘算法进行比较,结果表明:MAE算法分类精度高,对概念漂移的整体适应能力强,尤其对重复出现的概念漂移以及实际应用中存在的复杂概念漂移具有很好的适应能力.不仅能够快速适应新的概念变化,并且能够有效抵御随机的概念波动对系统性能的影响.  相似文献   

17.
Incremental learning has been used extensively for data stream classification. Most attention on the data stream classification paid on non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new classification algorithm for the classification of batch data called harmony-based classifier and then give its incremental version for classification of data streams called incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in absence of drifts and increase its robustness in presence of noise. This improved version is called improved incremental harmony-based classifier. The proposed methods are evaluated on some real world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and also the proposed incremental methods can effectively address the issues usually encountered in the data stream environments. Improved incremental harmony-based classifier has significantly better speed and accuracy on capturing concept drifts than the non-incremental harmony based method and its accuracy is comparable to non-evolutionary algorithms. The experimental results also show the robustness of improved incremental harmony-based classifier.  相似文献   

18.
挖掘带有概念漂移的数据流对于许多实时决策是十分重要的.本文使用统计学理论估计某一确定模型在最新概念上的真实错误率的置信区间,在一定概率保证下检测数据流中是否发生了概念漂移,并将此方法和KMM(核平均匹配)算法引入集成分类器框架中,提出一种数据流分类的新算法WSEC.在仿真和真实数据流上的试验结果表明该算法是有效的.  相似文献   

19.
Many researchers have applied clustering to handle semi-supervised classification of data streams with concept drifts.However,the generalization ability for each specific concept cannot be steadily improved,and the concept drift detection method without considering the local structural information of data cannot accurately detect concept drifts.This paper proposes to solve these problems by BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies)ensemble and local structure mapping.The local structure mapping strategy is utilized to compute local similarity around each sample and combined with semi-supervised Bayesian method to perform concept detection.If a recurrent concept is detected,a historical BIRCH ensemble classifier is selected to be incrementally updated;otherwise a new BIRCH ensemble classifier is constructed and added into the classifier pool.The extensive experiments on several synthetic and real datasets demonstrate the advantage of the proposed algorithm.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号