首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Almost all drift detection mechanisms designed for classification problems work reactively: after receiving the complete data set (input patterns and class labels) they apply a sequence of procedures to identify some change in the class-conditional distribution – a concept drift. However, detecting changes after its occurrence can be in some situations harmful to the process under analysis. This paper proposes a proactive approach for abrupt drift detection, called DetectA (Detect Abrupt Drift). Briefly, this method is composed of three steps: (i) label the patterns from the test set (an unlabelled data block), using an unsupervised method; (ii) compute some statistics from the train and test sets, conditioned to the given class labels for train set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect the drift on the test set, before the real labels are obtained. A procedure for creating datasets with abrupt drift has been proposed to perform a sensitivity analysis of the DetectA model. The result of the sensitivity analysis suggests that the detector is efficient and suitable for datasets of high-dimensionality, blocks with any proportion of drifts, and datasets with class imbalance. The performance of the DetectA method, with different configurations, was also evaluated on real and artificial datasets, using an MLP as a classifier. The best results were obtained using one of the detection methods, being the proactive manner a top contender regarding improving the underlying base classifier accuracy.  相似文献   

2.
针对网络流特征会随网络环境变化而发生改变,从而导致基于流特征的机器学习分类方法精度明显降低的问题。提出一种基于概念漂移检测的自适应流量分类方法,该方法借助Kolmogorov-Smirnov检验对出现的流量进行概念漂移检测,然后通过多视图协同学习策略引入新流量样本修正概念漂移导致的模型变化,使分类器得到有效更新。实验结果表明该方法可以有效检测概念漂移并更新分类器,表现出较好的分类性能和泛化能力。  相似文献   

3.

Recently, sequence anomaly detection has been widely used in many fields. Sequence data in these fields are usually multi-dimensional over the data stream. It is a challenge to design an anomaly detection method for a multi-dimensional sequence over the data stream to satisfy the requirements of accuracy and high speed. It is because: (1) Redundant dimensions in sequence data and large state space lead to a poor ability for sequence modeling; (2) Anomaly detection cannot adapt to the high-speed nature of the data stream, especially when concept drift occurs, and it will reduce the detection rate. On one hand, most existing methods of sequence anomaly detection focus on the single-dimension sequence. On the other hand, some studies concerning multi-dimensional sequence concentrate mainly on the static database rather than the data stream. To improve the performance of anomaly detection for a multi-dimensional sequence over the data stream, we propose a novel unsupervised fast and accurate anomaly detection (FAAD) method which includes three algorithms. First, a method called “information calculation and minimum spanning tree cluster” is adopted to reduce redundant dimensions. Second, to speed up model construction and ensure the detection rate for the sequence over the data stream, we propose a method called “random sampling and subsequence partitioning based on the index probabilistic suffix tree.” Last, the method called “anomaly buffer based on model dynamic adjustment” dramatically reduces the effects of concept drift in the data stream. FAAD is implemented on the streaming platform Storm to detect multi-dimensional log audit data. Compared with the existing anomaly detection methods, FAAD has a good performance in detection rate and speed without being affected by concept drift.

  相似文献   

4.
概念漂移是动态流数据挖掘中一类常见的问题,但混杂噪声或训练样本规模过小而产生的伪概念漂移会引起与真实概念漂移相似的结果,即模型在线测试性能的不稳定波动,导致二者容易混淆,发生概念漂移的误报.针对流数据中真伪概念漂移的混淆问题,提出一种基于在线性能测试的概念漂移检测方法(concept drift detection method based on online performance test,简称CDPT).该方法将最新获得的数据集进行均匀分组,在每组子数据集上分别进行在线学习,同时记录每组子数据集训练测试得到的分类精度向量,并计算相邻学习时间单元之间的精度落差,依据测试精度下降阈值得到有效波动位点.然后采用交叉检验的方式整合不同分组中的有效波动位点,以消除流数据在线学习过程中由于训练样本过小导致模型不稳定造成的检测干扰,根据精度波动一致性得到一致波动位点.最后,通过跟踪在线学习分类准确率,得到一致波动位点邻域参照点的测试精度变化,比较一致波动位点邻域参照点对应的模型测试精度下降幅度及收敛情况,以有效检测一致波动位点当中真实的概念漂移位点.实验结果表明,该方法能够有效辨识流数据在线学习过程中发生的真实概念漂移,并能有效避免训练样本过小或者流数据中噪声对检测结果的负面影响,同时提高模型的泛化性能.  相似文献   

5.
Concept drift detectors are online learning software that mostly attempt to estimate the drift positions in data streams in order to modify the base classifier after these changes and improve accuracy. This is very important in applications such as the detection of anomalies in TCP/IP traffic and/or frauds in financial transactions. Drift Detection Method (DDM) is a simple, efficient, well-known method whose performance is often impaired when the concepts are very long. This article proposes the Reactive Drift Detection Method (RDDM), which is based on DDM and, among other modifications, discards older instances of very long concepts aiming to detect drifts earlier, improving the final accuracy. Experiments run in MOA, using abrupt and gradual concept drift versions of different dataset generators and sizes (48 artificial datasets in total), as well as three real-world datasets, suggest RDDM beats the accuracy results of DDM, ECDD, and STEPD in most scenarios.  相似文献   

6.
In an online data stream, the composition and distribution of the data may change over time, which is a phenomenon known as concept drift. The occurrence of concept drift can affect considerably the performance of a data stream mining method, especially in relation to mining accuracy. In this paper, we study the problem of mining frequent patterns from transactional data streams in the presence of concept drift, considering the important issue of mining accuracy preservation. In terms of frequent-pattern mining, we give the definitions of concept and concept drift with respect to streaming data; moreover, we present a categorization for concept drift. The concept of streaming data is considered the relationships of frequency between different patterns. Accordingly, we devise approaches to describe the concept concretely and to learn the concept through frequency relationship modeling. Based on concept learning, we propose a method of support approximation for discovering data stream frequent patterns. Our analyses and experimental results have shown that in several studied cases of concept drift, the proposed method not only performs efficiently in terms of time and memory but also preserves mining accuracy well on concept-drifting data streams.  相似文献   

7.
针对重现概念漂移检测中的概念表征和分类器选择问题,提出了一种适用于含重现概念漂移的数据流分类的算法——基于主要特征抽取的概念聚类和预测算法(Conceptual clustering and prediction through main feature extraction, MFCCP)。MFCCP通过计算不同批次样本的主要特征及影响因子的差异度以识别重复出现的概念,为每个概念维持且及时更新一个分类器,并依据Hoeffding不等式选择最合适的分类器对当前样本集实施分类,以 提高对概念漂移的反应能力。在3个数据集上的实验表明:MFCCP在含重现概念漂移的数据集上的分类准确率,对概念漂移的反应能力及对概念漂移检测的准确率均明显优于其他4种 对比算法,且MFCCP也适用于对不含重现概念漂移的数据流进行分类。  相似文献   

8.
Many researchers have applied clustering to handle semi-supervised classification of data streams with concept drifts.However,the generalization ability for each specific concept cannot be steadily improved,and the concept drift detection method without considering the local structural information of data cannot accurately detect concept drifts.This paper proposes to solve these problems by BIRCH(Balanced Iterative Reducing and Clustering Using Hierarchies)ensemble and local structure mapping.The local structure mapping strategy is utilized to compute local similarity around each sample and combined with semi-supervised Bayesian method to perform concept detection.If a recurrent concept is detected,a historical BIRCH ensemble classifier is selected to be incrementally updated;otherwise a new BIRCH ensemble classifier is constructed and added into the classifier pool.The extensive experiments on several synthetic and real datasets demonstrate the advantage of the proposed algorithm.  相似文献   

9.
动态非平衡数据分类是在线学习和类不平衡学习领域重要的研究问题,用于处理类分布非常倾斜的数据流。这类问题在实际场景中普遍存在,如实时控制监控系统的故障诊断和计算机网络中的入侵检测等。由于动态数据流中存在概念漂移现象和不平衡问题,因此数据流分类算法既要处理概念漂移,又要解决类不平衡问题。针对以上问题,提出了在检测概念漂移的同时对非平衡数据进行处理的一种方法。该方法采用Kappa系数检测概念漂移,进而检测平衡率,利用非平衡数据分类方法更新分类器。实验结果表明,在不同的评价指标上,该算法对非平衡数据流具有较好的分类性能。  相似文献   

10.
袁泉  郭江帆 《计算机应用》2018,38(6):1591-1595
针对数据流中概念漂移和噪声问题,提出一种新型的增量式学习的数据流集成分类算法。首先,引入噪声过滤机制过滤噪声;然后,引入假设检验方法对概念漂移进行检测,以增量式C4.5决策树为基分类器构建加权集成模型;最后,实现增量式学习实例并随之动态更新分类模型。实验结果表明,该集成分类器对概念漂移的检测精度达到95%~97%,对数据流抗噪性保持在90%以上。该算法分类精度较高,且在检测概念漂移的准确性和抗噪性方面有较好的表现。  相似文献   

11.
针对大多数概念漂移检测算法都存在高延迟和对噪声过于敏感的问题,提出了一种基于四分位距交叠滑动窗口的概念漂移检测方法,该方法使用四分位距窗口中的样本和改进的Hoeffding不等式进行概念漂移检测。为更好地避免噪声对分类器性能的影响,算法在Hoeffding不等式中引入了一个基于当前样本分类正确率的动态系数。实验结果表明,改进后的方法可以有效提高概念漂移检测的准确率,减少漂移检测延迟。  相似文献   

12.
Yan  Zhang  Hongle  Du  Gang  Ke  Lin  Zhang  Chen  Yeh-Cheng 《The Journal of supercomputing》2022,78(4):5394-5419

Data stream mining is one of the hot topics in data mining. Most existing algorithms assume that data stream with concept drift is balanced. However, in real-world, the data streams are imbalanced with concept drift. The learning algorithm will be more complex for the imbalanced data stream with concept drift. In online learning algorithm, the oversampling method is used to select a small number of samples from the previous data block through a certain strategy and add them into the current data block to amplify the current minority class. However, in this method, the number of stored samples, the method of oversampling and the weight calculation of base-classifier all affect the classification performance of ensemble classifier. This paper proposes a dynamic weighted selective ensemble (DWSE) learning algorithm for imbalanced data stream with concept drift. On the one hand, through resampling the minority samples in previous data block, the minority samples of the current data block can be amplified, and the information in the previous data block can be absorbed into building a classifier to reduce the impact of concept drift. The calculation method of information content of every sample is defined, and the resampling method and updating method of the minority samples are given in this paper. On the other hand, because of concept drift, the performance of the base-classifier will be degraded, and the decay factor is usually used to describe the performance degradation of base-classifier. However, the static decay factor cannot accurately describe the performance degradation of the base-classifier with the concept drift. The calculation method of dynamic decay factor of the base-classifier is defined in DWSE algorithm to select sub-classifiers to eliminate according to the attenuation situation, which makes the algorithm better deal with concept drift. Compared with other algorithms, the results show that the DWSE algorithm has better classification performance for majority class samples and minority samples.

  相似文献   

13.
The last decade has seen a surge of interest in adaptive learning algorithms for data stream classification, with applications ranging from predicting ozone level peaks, learning stock market indicators, to detecting computer security violations. In addition, a number of methods have been developed to detect concept drifts in these streams. Consider a scenario where we have a number of classifiers with diverse learning styles and different drift detectors. Intuitively, the current ‘best’ (classifier, detector) pair is application dependent and may change as a result of the stream evolution. Our research builds on this observation. We introduce the Tornado framework that implements a reservoir of diverse classifiers, together with a variety of drift detection algorithms. In our framework, all (classifier, detector) pairs proceed, in parallel, to construct models against the evolving data streams. At any point in time, we select the pair which currently yields the best performance. To this end, we introduce the CAR measure, which is employed to balance classification, adaptation and resource utilization requirements. We further incorporate two novel stacking-based drift detection methods, namely the FHDDMS and \(\hbox {FHDDMS}_{\mathrm{add}}\) approaches. The experimental evaluation confirms that the current ‘best’ (classifier, detector) pair is not only heavily dependent on the characteristics of the stream, but also that this selection evolves as the stream flows. Further, our FHDDMS variants detect concept drifts accurately in a timely fashion while outperforming the state-of-the-art.  相似文献   

14.
互联网环境日新月异,使得网络数据流中存在概念漂移,对数据流的分类也由传统的静态分类变为动态分类,而如何对概念漂移进行检测是动态分类的关键。本文提出一种基于概念漂移检测的网络数据流自适应分类算法,通过比较滑动窗口中数据与历史数据的分布差异来检测概念漂移,然后将窗口中数据过采样来减少样本间的不均衡性,最后将处理后的数据集输入到OS-ELM分类器中进行在线学习,从而更新分类器使其应对数据流中的概念漂移。本文在MOA实验平台中使用合成数据集和真实数据集对提出的算法进行验证,结果表明,该算法较集成学习算法在分类准确率和稳定性上有一定的提升,并且随着数据流量的增加,时间性能上的优势开始体现,适合复杂多变的网络环境。  相似文献   

15.
《Information Fusion》2008,9(1):56-68
In the real world concepts are often not stable but change with time. A typical example of this in the biomedical context is antibiotic resistance, where pathogen sensitivity may change over time as new pathogen strains develop resistance to antibiotics that were previously effective. This problem, known as concept drift, complicates the task of learning a model from data and requires special approaches, different from commonly used techniques that treat arriving instances as equally important contributors to the final concept. The underlying data distribution may change as well, making previously built models useless. This is known as virtual concept drift. Both types of concept drifts make regular updates of the model necessary. Among the most popular and effective approaches to handle concept drift is ensemble learning, where a set of models built over different time periods is maintained and the best model is selected or the predictions of models are combined, usually according to their expertise level regarding the current concept. In this paper we propose the use of an ensemble integration technique that would help to better handle concept drift at an instance level. In dynamic integration of classifiers, each base classifier is given a weight proportional to its local accuracy with regard to the instance tested, and the best base classifier is selected, or the classifiers are integrated using weighted voting. Our experiments with synthetic data sets simulating abrupt and gradual concept drifts and with a real-world antibiotic resistance data set demonstrate that dynamic integration of classifiers built over small time intervals or fixed-sized data blocks can be significantly better than majority voting and weighted voting, which are currently the most commonly used integration techniques for handling concept drift with ensembles.  相似文献   

16.
Classifiers deployed in the real world operate in a dynamic environment, where the data distribution can change over time. These changes, referred to as concept drift, can cause the predictive performance of the classifier to drop over time, thereby making it obsolete. To be of any real use, these classifiers need to detect drifts and be able to adapt to them, over time. Detecting drifts has traditionally been approached as a supervised task, with labeled data constantly being used for validating the learned model. Although effective in detecting drifts, these techniques are impractical, as labeling is a difficult, costly and time consuming activity. On the other hand, unsupervised change detection techniques are unreliable, as they produce a large number of false alarms. The inefficacy of the unsupervised techniques stems from the exclusion of the characteristics of the learned classifier, from the detection process. In this paper, we propose the Margin Density Drift Detection (MD3) algorithm, which tracks the number of samples in the uncertainty region of a classifier, as a metric to detect drift. The MD3 algorithm is a distribution independent, application independent, model independent, unsupervised and incremental algorithm for reliably detecting drifts from data streams. Experimental evaluation on 6 drift induced datasets and 4 additional datasets from the cybersecurity domain demonstrates that the MD3 approach can reliably detect drifts, with significantly fewer false alarms compared to unsupervised feature based drift detectors. At the same time, it produces performance comparable to that of a fully labeled drift detector. The reduced false alarms enables the signaling of drifts only when they are most likely to affect classification performance. As such, the MD3 approach leads to a detection scheme which is credible, label efficient and general in its applicability.  相似文献   

17.
传统的多分类器选择算法产生较大的计算和存储开销。另外,多分类器对异常数据流的预测稳定性是解决概念飘移的重要因素。通过引入改进的决策轮廓矩阵和支持熵解决了每个分类器集合之间模糊差异度问题,并将支持熵作为差异度度量的输入衡量标准,使分类器集合之间的差异度计算更加稳定高效,并在此基础上提出了一种基于差异度集成的异常数据流检测方法并实现其算法;该方法应用在异常分类器选择模块,主要包括三个步骤:构建决策轮廓矩阵、整合支持熵、分类器集合差异度度量。实验结果表明,该算法对异常流量的预测精度和稳定性相比其他算法较好,由于分类器训练时间达到10-2 s左右,基本上能够适应数据流量检测的实时性需求。  相似文献   

18.
现有概念漂移处理算法在检测到概念漂移发生后,通常需要在新到概念上重新训练分类器,同时“遗忘”以往训练的分类器。在概念漂移发生初期,由于能够获取到的属于新到概念的样本较少,导致新建的分类器在短时间内无法得到充分训练,分类性能通常较差。进一步,现有的基于在线迁移学习的数据流分类算法仅能使用单个分类器的知识辅助新到概念进行学习,在历史概念与新到概念相似性较差时,分类模型的分类准确率不理想。针对以上问题,文中提出一种能够利用多个历史分类器知识的数据流分类算法——CMOL。CMOL算法采取分类器权重动态调节机制,根据分类器的权重对分类器池进行更新,使得分类器池能够尽可能地包含更多的概念。实验表明,相较于其他相关算法,CMOL算法能够在概念漂移发生时更快地适应新到概念,显示出更高的分类准确率。  相似文献   

19.
While feature point recognition is a key component of modern approaches to object detection, existing approaches require computationally expensive patch preprocessing to handle perspective distortion. In this paper, we show that formulating the problem in a naive Bayesian classification framework makes such preprocessing unnecessary and produces an algorithm that is simple, efficient, and robust. Furthermore, it scales well as the number of classes grows. To recognize the patches surrounding keypoints, our classifier uses hundreds of simple binary features and models class posterior probabilities. We make the problem computationally tractable by assuming independence between arbitrary sets of features. Even though this is not strictly true, we demonstrate that our classifier nevertheless performs remarkably well on image data sets containing very significant perspective changes.  相似文献   

20.
Robust automated vortex detection algorithms are needed to facilitate the exploration of large‐scale turbulent fluid flow simulations. Unfortunately, robust non‐local vortex detection algorithms are computationally intractable for large data sets and local algorithms, while computationally tractable, lack robustness. We argue that the deficiencies inherent to the local definitions occur because of two fundamental issues: the lack of a rigorous definition of a vortex and the fact that a vortex is an intrinsically non‐local phenomenon. As a first step towards addressing this problem, we demonstrate the use of machine learning techniques to enhance the robustness of local vortex detection algorithms. We motivate the presence of an expert‐in‐the‐loop using empirical results based on machine learning techniques. We employ adaptive boosting to combine a suite of widely used, local vortex detection algorithms, which we term weak classifiers, into a robust compound classifier. Fundamentally, the training phase of the algorithm, in which an expert manually labels small, spatially contiguous regions of the data, incorporates non‐local information into the resulting compound classifier. We demonstrate the efficacy of our approach by applying the compound classifier to two data sets obtained from computational fluid dynamical simulations. Our results demonstrate that the compound classifier has a reduced misclassification rate relative to the component classifiers.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号