Similar documents
A total of 20 similar documents were found (search time: 31 ms).
1.
Existing concept drift handling algorithms, after detecting that a drift has occurred, typically retrain a classifier on the new concept while "forgetting" previously trained classifiers. In the early stage of a drift, only a few samples of the new concept are available, so the newly built classifier cannot be trained sufficiently in a short time and its classification performance is usually poor. Furthermore, existing data stream classification algorithms based on online transfer learning can only use the knowledge of a single classifier to assist learning of the new concept; when the historical concept is not sufficiently similar to the new one, the classification accuracy of the model is unsatisfactory. To address these problems, this paper proposes CMOL, a data stream classification algorithm that can exploit the knowledge of multiple historical classifiers. CMOL adopts a dynamic classifier-weight adjustment mechanism and updates the classifier pool according to the classifier weights, so that the pool covers as many concepts as possible. Experiments show that, compared with related algorithms, CMOL adapts to new concepts more quickly when concept drift occurs and achieves higher classification accuracy.
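A minimal sketch of the classifier-pool idea described in this abstract, assuming accuracy on the newest data block as the weight signal and a drop-the-lowest-weight replacement rule; the names (`ClassifierPool`) and these choices are illustrative and do not reproduce CMOL's actual weighting scheme.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier


class ClassifierPool:
    """A bounded pool of historical classifiers, weighted by accuracy on the newest block."""

    def __init__(self, capacity=5, base=None):
        self.capacity = capacity
        self.base = base if base is not None else DecisionTreeClassifier(max_depth=5)
        self.members = []                      # list of (classifier, weight) pairs

    def update(self, X_block, y_block):
        # Re-weight every member by its accuracy on the newest data block, so that
        # classifiers covering still-relevant concepts keep high weights.
        self.members = [(clf, clf.score(X_block, y_block)) for clf, _ in self.members]
        # Train a classifier on the newest block and add it to the pool.
        new_clf = clone(self.base).fit(X_block, y_block)
        self.members.append((new_clf, 1.0))
        # Keep the pool bounded: drop the lowest-weight member when over capacity,
        # so the pool tends to retain classifiers for as many concepts as possible.
        if len(self.members) > self.capacity:
            self.members.remove(min(self.members, key=lambda m: m[1]))

    def predict(self, X):
        # Weighted majority vote over the pool (assumes all members saw the same label set).
        labels = np.unique(np.concatenate([clf.classes_ for clf, _ in self.members]))
        scores = np.zeros((len(X), len(labels)))
        for clf, w in self.members:
            pred = clf.predict(X)
            for j, lab in enumerate(labels):
                scores[:, j] += w * (pred == lab)
        return labels[np.argmax(scores, axis=1)]
```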

2.
周胜  刘三民 《计算机工程》2020,46(5):139-143,149
To deal with concept drift and noise in data stream classification, a multi-source transfer learning method based on sample certainty is proposed. The method stores the classifiers trained on multiple source domains and, for each sample in the current target-domain data block, computes the class posterior probability and the sample-certainty value given by each source-domain classifier. On this basis, the source-domain classifiers whose sample-certainty values satisfy the current threshold are combined online with the target-domain classifier, thereby transferring knowledge from multiple source domains to the target domain. Experimental results show that the method effectively eliminates the adverse effect of noisy data streams on unreliable classifiers and, compared with a multi-source transfer learning method using accuracy-based ensemble selection, achieves higher classification accuracy and better noise-resistance stability.
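A minimal sketch of the certainty-thresholded ensemble step described above. Here the sample-certainty value is taken to be the maximum class posterior from `predict_proba`, and the function name and threshold are illustrative; the sketch also assumes all classifiers were trained with the same label set, so their probability columns align.

```python
import numpy as np


def certainty_vote(source_clfs, target_clf, X, threshold=0.7):
    """Combine source-domain classifiers with the target-domain classifier online,
    letting a source classifier vote on a sample only when its certainty
    (maximum class posterior) for that sample exceeds the threshold."""
    classes = target_clf.classes_
    scores = np.zeros((len(X), len(classes)))
    scores += target_clf.predict_proba(X)       # the target-domain classifier always votes
    for clf in source_clfs:
        proba = clf.predict_proba(X)            # class posteriors, shape (n_samples, n_classes)
        certainty = proba.max(axis=1)           # per-sample certainty value
        mask = certainty >= threshold           # samples this source classifier may vote on
        scores[mask] += proba[mask]
    return classes[np.argmax(scores, axis=1)]
```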

3.
In open environments, data streams are generated at high speed, are unbounded in volume, and are subject to concept drift. In data stream classification tasks, producing large amounts of training data by manual annotation is expensive and impractical. Data streams that contain only a few labeled samples, a large number of unlabeled samples, and concept drift therefore pose a great challenge to machine learning. However, existing research mainly focuses on supervised data stream classification, and semi-supervised classification of concept-drifting data streams has not yet received sufficient attention...

4.
Prediction in streaming data is an important activity in the modern society. Two major challenges posed by data streams are (1) the data may grow without limit so that it is difficult to retain a long history of raw data; and (2) the underlying concept of the data may change over time. The novelties of this paper are fourfold. First, it uses a measure of conceptual equivalence to organize the data history into a history of concepts. This contrasts with the common practice that only keeps recent raw data. The concept history is compact while still retaining essential information for learning. Second, it learns concept-transition patterns from the concept history and anticipates what the concept will be in the case of a concept change. It then proactively prepares a prediction model for the future change. This contrasts with the conventional methodology that passively waits until the change happens. Third, it incorporates proactive and reactive predictions. If the anticipation turns out to be correct, a proper prediction model can be launched instantly upon the concept change. If not, it promptly resorts to a reactive mode: adapting a prediction model to the new data. Finally, an efficient and effective system, RePro, is proposed to implement these new ideas. It carries out prediction at two levels, a general level of predicting each oncoming concept and a specific level of predicting each instance's class. Experiments are conducted to compare RePro with representative existing prediction methods on various benchmark data sets that represent diversified scenarios of concept change. Empirical evidence offers inspiring insights and demonstrates that the proposed methodology is an advisable solution to prediction in data streams. A preliminary and shorter version of this paper was published in the Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2005), pp. 710–715.
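A minimal sketch of the concept-history idea described above: record transitions between recognised concepts and, when a drift is signalled, proactively propose the most frequent successor of the current concept. The names (`ConceptHistory`, `propose_next`) are illustrative, and RePro's actual conceptual-equivalence measure and trigger logic are not reproduced.

```python
from collections import defaultdict


class ConceptHistory:
    """Organise the stream's history as a sequence of concepts and learn
    concept-transition patterns from it."""

    def __init__(self):
        self.transitions = defaultdict(lambda: defaultdict(int))
        self.current = None

    def record(self, concept_id):
        # Called whenever the active concept is identified (new or recurring).
        if self.current is not None and concept_id != self.current:
            self.transitions[self.current][concept_id] += 1
        self.current = concept_id

    def propose_next(self):
        # Proactive step: anticipate the most likely next concept so that a
        # prediction model for it can be prepared before the drift completes.
        successors = self.transitions.get(self.current)
        if not successors:
            return None          # no pattern yet -> fall back to the reactive mode
        return max(successors, key=successors.get)


# Usage: replay an observed concept sequence, then ask for the likely successor.
history = ConceptHistory()
for c in ["A", "B", "A", "B", "A", "C", "A", "B"]:
    history.record(c)
print(history.propose_next())    # after "B", concept "A" has been the usual successor
```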

5.
Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different and most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in the classification accuracy. On the other hand, this approach suffers from a major drawback, which is the high computational cost of generating new models. The problem gets worse when concept drift is detected more frequently and, hence, a compromise in terms of computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of a concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called “Info-Fuzzy Network” (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.

6.
Most existing works on data stream classification assume the streaming data is precise and definite. Such an assumption, however, does not always hold in practice, since data uncertainty is ubiquitous in data stream applications due to imprecise measurement, missing values, privacy protection, etc. The goal of this paper is to learn accurate decision tree models from uncertain data streams for classification analysis. On the basis of very fast decision tree (VFDT) algorithms, we propose an algorithm for constructing an uncertain VFDT tree with classifiers at tree leaves (uVFDTc). The uVFDTc algorithm can exploit uncertain information effectively and efficiently in both the learning and the classification phases. In the learning phase, it uses Hoeffding bound theory to learn from uncertain data streams and yield fast and reasonable decision trees. In the classification phase, it uses uncertain naive Bayes (UNB) classifiers at tree leaves to improve classification performance. Experimental results on both synthetic and real-life datasets demonstrate the strong ability of uVFDTc to classify uncertain data streams. The use of UNB at tree leaves improves the performance of uVFDTc, especially the any-time property, the benefit of exploiting uncertain information, and the robustness against uncertainty.
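The Hoeffding bound the abstract refers to is standard; below is a minimal sketch of how a VFDT-style learner uses it to decide whether the best split attribute is reliably better than the runner-up. The surrounding tree construction and uVFDTc's handling of uncertain data are not shown, and the parameter values in the example are placeholders.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a random variable with range
    `value_range` lies within epsilon of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(gain_best, gain_second, n, delta=1e-6, value_range=1.0):
    """VFDT-style test: split when the observed gain advantage of the best
    attribute over the second-best attribute exceeds the Hoeffding bound."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)


# Example: after 2000 examples, best gain 0.30 vs. runner-up 0.22.
print(should_split(0.30, 0.22, n=2000))   # True: the 0.08 gap exceeds the bound (~0.059)
```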

7.
Incremental learning has been used extensively for data stream classification, but most attention on data stream classification has been paid to non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new algorithm for the classification of batch data, called the harmony-based classifier, and then give its incremental version for the classification of data streams, called the incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in the absence of drift and to increase its robustness in the presence of noise. This improved version is called the improved incremental harmony-based classifier. The proposed methods are evaluated on several real-world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers, and that the proposed incremental methods can effectively address the issues usually encountered in data stream environments. The improved incremental harmony-based classifier captures concept drift with significantly better speed and accuracy than the non-incremental harmony-based method, and its accuracy is comparable to that of non-evolutionary algorithms. The experimental results also show the robustness of the improved incremental harmony-based classifier.

8.
Data stream classification has become one of the current research hotspots, and handling concept drift and noise is the key problem. To this end, a new dynamic ensemble algorithm based on classifier similarity is proposed. Since adjacent data in a stream are likely to share the same concept, the most recent base classifier is used to represent the concept about to appear in the stream; the similarity between this classifier and each base classifier is then computed and used as a weight for weighted majority voting, and the weaker base classifiers are discarded according to their similarity so as to adapt to concept drift and noise. Simulation experiments on standard synthetic datasets show that, compared with other ensemble methods, the algorithm significantly improves both noise resistance and classification accuracy.
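A minimal sketch of the weighting idea in this abstract, under the assumption that "similarity to the newest classifier" can be measured as the agreement rate of each base classifier with the newest one on the latest data block; the function names are illustrative, and all classifiers are assumed to share the same label set.

```python
import numpy as np


def similarity_weights(classifiers, newest, X_block):
    """Weight each base classifier by how often it agrees with the newest
    classifier, which is taken to represent the upcoming concept."""
    reference = newest.predict(X_block)
    return np.array([np.mean(clf.predict(X_block) == reference)
                     for clf in classifiers])


def weighted_vote(classifiers, weights, X):
    """Similarity-weighted majority voting over the ensemble."""
    labels = classifiers[0].classes_
    scores = np.zeros((len(X), len(labels)))
    for clf, w in zip(classifiers, weights):
        pred = clf.predict(X)
        for j, lab in enumerate(labels):
            scores[:, j] += w * (pred == lab)
    return labels[np.argmax(scores, axis=1)]


def prune_weakest(classifiers, weights):
    """Discard the base classifier least similar to the current concept,
    which is how the ensemble adapts to drift and noise."""
    keep = np.argsort(weights)[1:]          # drop the single lowest-weight member
    return [classifiers[i] for i in keep], weights[keep]
```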

9.
The Internet environment changes rapidly, so concept drift exists in network data streams, and stream classification has shifted from traditional static classification to dynamic classification; detecting concept drift is the key to dynamic classification. This paper proposes an adaptive classification algorithm for network data streams based on concept drift detection. Concept drift is detected by comparing the distribution difference between the data in a sliding window and the historical data; the data in the window are then oversampled to reduce class imbalance; finally, the processed dataset is fed to an OS-ELM classifier for online learning, so that the classifier is updated to cope with concept drift in the stream. The algorithm is validated on the MOA platform with both synthetic and real datasets. The results show that, compared with ensemble learning algorithms, it improves classification accuracy and stability, and its advantage in runtime becomes apparent as the volume of the stream grows, making it suitable for complex and changeable network environments.
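A minimal sketch of the drift-detection step described above. The abstract does not specify which distribution-difference measure is used, so this sketch assumes a Hellinger distance between histograms of a historical reference window and the current sliding window, with an illustrative threshold; the oversampling and OS-ELM steps are not shown.

```python
import numpy as np


def hellinger_distance(window_a, window_b, bins=20):
    """Distribution difference between two 1-D sample windows, in [0, 1]."""
    lo = min(window_a.min(), window_b.min())
    hi = max(window_a.max(), window_b.max())
    p, _ = np.histogram(window_a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(window_b, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def drift_detected(history, sliding_window, threshold=0.3):
    """Signal drift when the sliding window's distribution moves away from
    the historical reference by more than the threshold."""
    return hellinger_distance(np.asarray(history), np.asarray(sliding_window)) > threshold


# Example: the second window is drawn from a shifted distribution, so drift is flagged.
rng = np.random.default_rng(0)
old = rng.normal(0.0, 1.0, 500)
new = rng.normal(1.5, 1.0, 500)
print(drift_detected(old, new))   # True
```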

10.
袁泉  郭江帆 《计算机应用》2018,38(6):1591-1595
To address concept drift and noise in data streams, a new incremental-learning ensemble classification algorithm for data streams is proposed. First, a noise-filtering mechanism is introduced to filter out noise; then, a hypothesis-testing method is introduced to detect concept drift, and a weighted ensemble model is built with incremental C4.5 decision trees as base classifiers; finally, instances are learned incrementally and the classification model is updated dynamically as they arrive. Experimental results show that the ensemble classifier detects concept drift with 95%-97% accuracy and keeps its noise resistance on data streams above 90%. The algorithm achieves high classification accuracy and performs well in both drift-detection accuracy and noise resistance.

11.
A concept drift detection method for data streams based on a hybrid ensemble approach
In recent years, data stream classification has received widespread attention, and drift detection is one of its important problems. Existing classification models include single ensemble models and hybrid models, and their drift-detection mechanisms are mostly based on idealized distribution assumptions. Ensembles of a single model type may amplify classification errors, and their performance suffers in noisy environments, while hybrid ensemble models often find it difficult to balance classification accuracy and time performance. Therefore, based on the simple WE ensemble framework, an ensemble classification method named WE-DTB, which combines decision trees and a naive Bayes model, is constructed, and the typical concept-drift detection mechanisms based on Hoeffding bounds and the μ test are used to detect concept drift and classify data streams. Extensive experiments show that WE-DTB can effectively detect concept drift while maintaining good classification accuracy as well as good time and space performance.

12.
Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real‐world data sets, however, often contain unknown values. Even many classification algorithms that are designed to operate with missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this article, we present a technique for unsupervised learning called unsupervised backpropagation (UBP), which trains a multilayer perceptron to fit to the manifold sampled by a set of observed point vectors. We evaluate UBP with the task of imputing missing values in data sets and show that UBP is able to predict missing values with significantly lower sum of squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 data sets and nine supervised learning algorithms that classification accuracy is usually higher when randomly withheld values are imputed using UBP, rather than with other methods.

13.
Data stream learning has been largely studied for extracting knowledge structures from continuous and rapid data records. As data evolves on a temporal basis, its underlying knowledge is subject to many challenges. Concept drift, one of the core challenges identified by the stream learning community, is described as a change in the statistical properties of the data over time, which causes most machine learning models to become less accurate because the changes occur in unforeseen ways. This is particularly problematic as the evolution of the data can lead to dramatic changes in knowledge. We address this problem by studying the semantic representation of data streams in the Semantic Web, i.e., ontology streams. Such streams are ordered sequences of data annotated with ontological vocabulary. In particular we exploit three levels of knowledge encoded in ontology streams to deal with concept drift: (i) existence of novel knowledge gained from stream dynamics, (ii) significance of knowledge change and evolution, and (iii) (in)consistency of knowledge evolution. Such knowledge is encoded as knowledge graph embeddings through a combination of novel representations: entailment vectors, entailment weights, and a consistency vector. We illustrate our approach on classification tasks of supervised learning. Key contributions of the study include: (i) an effective knowledge graph embedding approach for stream ontologies, and (ii) a generic consistent prediction framework with integrated knowledge graph embeddings for dealing with concept drift. The experiments have shown that our approach provides accurate predictions of air quality in Beijing and bus delays in Dublin with real-world ontology streams.

14.
To address sample labeling and concept drift in data stream classification, a data stream classification and mining model based on instance transfer is proposed. First, the model uses a support vector machine as the learner, the support vectors of the resulting classification model form the source domain, and the current data block to be classified serves as the target domain. Then, guided by the mutual nearest neighbor idea, true neighbors of the target-domain samples are selected from the source domain for instance transfer, which avoids negative transfer. Finally, the target domain and the transferred samples are merged to form the training set, which increases the number of labeled samples and enhances the generalization ability of the model. Theoretical analysis and experimental results show that the proposed method is feasible and outperforms other learning methods in classification accuracy.
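A minimal sketch of the mutual-nearest-neighbor selection step described in this abstract: a source-domain instance (here, a support vector) is transferred only if it and some target-domain instance appear in each other's k-nearest-neighbor lists. The function name and the choice of k are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def mutual_nearest_neighbors(source_X, target_X, k=3):
    """Return indices of source instances that form a mutual k-NN pair with at
    least one target instance; transferring only these 'true neighbours' is
    intended to avoid negative transfer."""
    nn_in_target = NearestNeighbors(n_neighbors=k).fit(target_X)
    nn_in_source = NearestNeighbors(n_neighbors=k).fit(source_X)
    src_to_tgt = nn_in_target.kneighbors(source_X, return_distance=False)
    tgt_to_src = nn_in_source.kneighbors(target_X, return_distance=False)
    selected = []
    for s, tgt_neighbors in enumerate(src_to_tgt):
        # Source instance s is transferred if one of its target neighbours t
        # also lists s among its own k nearest source instances.
        if any(s in tgt_to_src[t] for t in tgt_neighbors):
            selected.append(s)
    return np.array(selected)


# Usage: merge the selected source instances with the labelled target block
# to form the enlarged training set for the SVM learner.
```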

15.
Processing data streams imposes demands that do not exist in static environments. In online learning, the probability distribution of the data can often change over time (concept drift). The prequential assessment methodology is commonly used to evaluate the performance of classifiers in data streams with stationary and non-stationary distributions. It is based on the premise that the purpose of statistical inference is to make sequential probability forecasts for future observations, rather than to express information about the past accuracy achieved. This article empirically evaluates the prequential methodology considering its three common strategies used to update the prediction model, namely, Basic Window, Sliding Window, and Fading Factors. Specifically, it aims to identify which of these variations is the most accurate for the experimental evaluation of past results in scenarios where concept drift occurs, with particular interest in the accuracy observed over the total data flow. The prequential accuracy of the three variations and the real accuracy obtained in the learning process of each dataset are the basis for this evaluation. The results of the experiments suggest that the use of Prequential with the Sliding Window variation is the best alternative.
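A minimal sketch of the three prequential update strategies compared in this article (basic/cumulative window, sliding window, fading factor), written as a simple test-then-train loop. The `model` argument is a placeholder assumed to expose `predict()` and `partial_fit()` and to have been initialised with the full label set; window size and fading factor are illustrative values.

```python
from collections import deque


def prequential_accuracy(model, stream, window_size=500, fading=0.995):
    """Test-then-train evaluation returning the basic, sliding-window and
    fading-factor prequential accuracy estimates."""
    basic = sliding = faded = 0.0
    correct_total = seen = 0
    window = deque(maxlen=window_size)          # sliding-window strategy
    faded_hits = faded_count = 0.0              # fading-factor strategy
    for x, y in stream:                         # stream yields (features, label) pairs
        hit = int(model.predict([x])[0] == y)   # test first ...
        model.partial_fit([x], [y])             # ... then train on the same example
        correct_total += hit
        seen += 1
        window.append(hit)
        faded_hits = fading * faded_hits + hit
        faded_count = fading * faded_count + 1.0
        basic = correct_total / seen            # basic (cumulative) window
        sliding = sum(window) / len(window)     # only the most recent hits count
        faded = faded_hits / faded_count        # older hits decay geometrically
    return basic, sliding, faded
```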

16.
A survey of mining algorithms for concept-drifting data streams
丁剑  韩萌  李娟 《计算机科学》2016,43(12):24-29, 62
A data stream is a new kind of data model characterized by being dynamic, unbounded, high-dimensional, ordered, high-speed and changing. In real data stream environments, some data distributions change over time, i.e., they exhibit concept drift; such streams are called evolving or concept-drifting data streams. Methods that handle the data stream model therefore need to cope with time and space constraints and adapt to concept changes. This paper surveys the concept drift problem and the classification, clustering and pattern mining of concept-drifting data streams. It first introduces the types of concept drift and the commonly used concept-change detection methods. To deal with concept drift, data stream mining often uses the sliding window model to process recent transactions. Data stream classification models include single-classifier models and ensemble models, and common methods include decision trees and classification association rules. Data stream clustering approaches are usually divided into k-means-based and non-k-means-based ones. Pattern mining can provide useful information for classification, clustering and association rule mining. Patterns in concept-drifting data streams include frequent patterns, sequential patterns, episodes, pattern trees, pattern graphs and high-utility patterns. Finally, the frequent pattern mining algorithms and high-utility pattern mining algorithms among them are described in detail.

17.
In an online data stream, the composition and distribution of the data may change over time, a phenomenon known as concept drift. The occurrence of concept drift can considerably affect the performance of a data stream mining method, especially in relation to mining accuracy. In this paper, we study the problem of mining frequent patterns from transactional data streams in the presence of concept drift, considering the important issue of preserving mining accuracy. In terms of frequent-pattern mining, we give the definitions of concept and concept drift with respect to streaming data; moreover, we present a categorization of concept drift. The concept of streaming data is characterized by the frequency relationships between different patterns. Accordingly, we devise approaches to describe the concept concretely and to learn the concept through frequency relationship modeling. Based on concept learning, we propose a method of support approximation for discovering frequent patterns in data streams. Our analyses and experimental results show that in several studied cases of concept drift, the proposed method not only performs efficiently in terms of time and memory but also preserves mining accuracy well on concept-drifting data streams.

18.
李南 《计算机系统应用》2016,25(12):187-192
Most existing data stream classification algorithms use supervised learning, but labeling samples on a high-speed data stream is very costly, which limits their practicality. To address this problem, a low-cost data stream classification algorithm, 2SDC, is proposed. The algorithm uses a small number of labeled samples and a large number of unlabeled samples to train and update the classification model, and dynamically monitors concept drift that may occur in the stream. Experiments on real data streams show that 2SDC not only achieves classification accuracy comparable to current supervised classification algorithms but also adapts to concept drift in the stream.

19.
Because existing machine learning algorithms are, in essence, optimization processes built on a static learning environment with the goal of preserving the generalization ability of the learning system as far as possible, the classification of concept-drifting data streams poses a huge challenge to machine learning. This paper surveys the literature from four aspects: data streams and concept drift; the development and trends of research on concept-drifting data stream classification; the main research areas of concept-drifting data stream classification; and new developments in this research. It also analyzes the problems of existing concept-drifting data stream classification algorithms.

20.
Concept drift and class imbalance in complex data streams degrade classifier performance. Traditional batch learning algorithms have to take memory and running time into account and do not perform well on massive, fast-arriving data streams, which in addition contain a great deal of drift and class imbalance; using online ensemble algorithms to handle complex data streams has therefore become an important research topic in data mining. From the perspective of ensemble strategies, this paper introduces and summarizes the online versions of the bagging, boosting and stacking ensemble methods and compares the performance of the different models. It provides the first detailed summary and analysis of online ensemble classification algorithms for complex data streams: detection and classification algorithms for concept-drifting data streams are introduced from the two perspectives of active detection and passive adaptation; imbalanced data streams are introduced from the two perspectives of data preprocessing and cost sensitivity; the time and space efficiency of representative algorithms is analyzed; and the performance of algorithms evaluated on the same datasets is compared. Finally, future research directions are proposed for the challenges in online ensemble classification of complex data streams.
