首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 109 毫秒
1.
《Information Fusion》2008,9(3):344-353
In real-world sensor networks, the monitored processes generating time-stamped data may change drastically over time. An online data-mining algorithm called OLIN (on-line information network) adapts itself automatically to the rate of concept drift in a non-stationary data stream by repeatedly constructing a classification model from every sliding window of training examples. In this paper, we introduce a new real-time data-mining algorithm called IOLIN (incremental on-line information network), which saves a significant amount of computational effort by updating an existing model as long as no major concept drift is detected. The proposed algorithm builds upon the oblivious decision-tree classification model called “information network” (IN) and it implements three different types of model updating operations. In the experiments with multi-year streams of traffic sensors data, no statistically significant difference between the accuracy of the incremental algorithm (IOLIN) vs. the regenerative one (OLIN) has been observed.  相似文献   

2.
由于在信用卡欺诈分析等领域的广泛应用,学者们开始关注概念漂移数据流分类问题.现有算法通常假设数据一旦分类后类标已知,利用所有待分类实例的真实类别来检测数据流是否发生概念漂移以及调整分类模型.然而,由于标记实例需要耗费大量的时间和精力,该解决方案在实际应用中无法实现.据此,提出一种基于KNNModel和增量贝叶斯的概念漂移检测算法KnnM-IB.新算法在具有KNNModel算法分类被模型簇覆盖的实例分类精度高、速度快优点的同时,利用增量贝叶斯算法对难处理样本进行分类,从而保证了分类效果.算法同时利用可变滑动窗口大小的变化以及主动学习标记的少量样本进行概念漂移检测.当数据流稳定时,半监督学习被用于扩大标记实例的数量以对模型进行更新,因而更符合实际应用的要求.实验结果表明,该方法能够在对数据流进行有效分类的同时检测数据流概念漂移及相应地更新模型.  相似文献   

3.
Liang  Shunpan  Pan  Weiwei  You  Dianlong  Liu  Ze  Yin  Ling 《Applied Intelligence》2022,52(12):13398-13414

Multi-label learning has attracted many attentions. However, the continuous data generated in the fields of sensors, network access, etc., that is data streams, the scenario brings challenges such as real-time, limited memory, once pass. Several learning algorithms have been proposed for offline multi-label classification, but few researches develop it for dynamic multi-label incremental learning models based on cascading schemes. Deep forest can perform representation learning layer by layer, and does not rely on backpropagation, using this cascading scheme, this paper proposes a multi-label data stream deep forest (VDSDF) learning algorithm based on cascaded Very Fast Decision Tree (VFDT) forest, which can receive examples successively, perform incremental learning, and adapt to concept drift. Experimental results show that the proposed VDSDF algorithm, as an incremental classification algorithm, is more competitive than batch classification algorithms on multiple indicators. Moreover, in dynamic flow scenarios, the adaptability of VDSDF to concept drift is better than that of the contrast algorithm.

  相似文献   

4.
现有概念漂移处理算法在检测到概念漂移发生后,通常需要在新到概念上重新训练分类器,同时“遗忘”以往训练的分类器。在概念漂移发生初期,由于能够获取到的属于新到概念的样本较少,导致新建的分类器在短时间内无法得到充分训练,分类性能通常较差。进一步,现有的基于在线迁移学习的数据流分类算法仅能使用单个分类器的知识辅助新到概念进行学习,在历史概念与新到概念相似性较差时,分类模型的分类准确率不理想。针对以上问题,文中提出一种能够利用多个历史分类器知识的数据流分类算法——CMOL。CMOL算法采取分类器权重动态调节机制,根据分类器的权重对分类器池进行更新,使得分类器池能够尽可能地包含更多的概念。实验表明,相较于其他相关算法,CMOL算法能够在概念漂移发生时更快地适应新到概念,显示出更高的分类准确率。  相似文献   

5.
由于现有各种机器学习算法本质上都基于一个静态学习环境,而以尽量保证学习系统泛化能力为目标的寻优过程,概念漂移数据流分类给机器学习带来了巨大挑战.从数据流与概念漂移、概念漂移数据流分类研究的发展与趋势、概念漂移数据流分类的主要研究领域、概念漂移数据流分类研究的新动态4个方面展开了文献综述,并分析了当前概念漂移数据流分类算法存在的问题.  相似文献   

6.
复杂数据流中所存在的概念漂移及不平衡问题降低了分类器的性能。传统的批量学习算法需要考虑内存以及运行时间等因素,在快速到达的海量数据流中性能并不突出,并且其中还包含着大量的漂移及类失衡现象,利用在线集成算法处理复杂数据流问题已经成为数据挖掘领域重要的研究课题。从集成策略的角度对bagging、boosting、stacking集成方法的在线版本进行了介绍与总结,并对比了不同模型之间的性能。首次对复杂数据流的在线集成分类算法进行了详细的总结与分析,从主动检测和被动自适应两个方面对概念漂移数据流检测与分类算法进行了介绍,从数据预处理和代价敏感两个方面介绍不平衡数据流,并分析了代表性算法的时空效率,之后对使用相同数据集的算法性能进行了对比。最后,针对复杂数据流在线集成分类研究领域的挑战提出了下一步研究方向。  相似文献   

7.
IKnnM-DHecoc:一种解决概念漂移问题的方法   总被引:2,自引:0,他引:2  
随着数据流挖掘的应用日趋广泛,带概念漂移的数据流分类问题已成为一项重要且充满挑战的工作.根据带概念漂移的数据流的特点,一个有效的学习器必须能跟踪并快速适应这种变化.一种基于增量KnnModel的动态层次编码算法被提出用于解决数据流的概念漂移问题.在将数据流划分为数据块后,根据增量KnnModel算法对每块的预学习结果构建并更新类别层次树、层次编码,用可增量学习的分类算法对照编码划分进行学习,并生成备选分类器集.最后依据活跃度对结点进行剪枝处理以减少计算代价.在预测阶段,利用增量KnnModel算法和动态层次纠错输出编码算法的各自优势进行联合预测.实验结果表明:基于增量KnnModel算法的动态层次纠错输出编码算法不但能够提高模型学习的动态性和分类的正确性,而且还能够快速适应概念漂移的情况.  相似文献   

8.

针对流数据中概念漂移发生后,在线学习模型不能对分布变化后的数据做出及时响应且难以提取数据分布的最新信息,导致学习模型收敛较慢的问题,提出一种基于在线集成的概念漂移自适应分类方法(adaptive classification method for concept drift based on online ensemble,AC_OE). 一方面,该方法利用在线集成策略构建在线集成学习器,对数据块中的训练样本进行局部预测以动态调整学习器权重,有助于深入提取漂移位点附近流数据的演化信息,对数据分布变化进行精准响应,提升在线学习模型对概念漂移发生后新数据分布的适应能力,提高学习模型的实时泛化性能;另一方面,利用增量学习策略构建增量学习器,并随新样本的进入进行增量式的训练更新,提取流数据的全局分布信息,使模型在平稳的流数据状态下保持较好的鲁棒性. 实验结果表明,该方法能够对概念漂移做出及时响应并加速在线学习模型的收敛速度,同时有效提高学习器的整体泛化性能.

  相似文献   

9.
在开放环境下,数据流具有数据高速生成、数据量无限和概念漂移等特性.在数据流分类任务中,利用人工标注产生大量训练数据的方式昂贵且不切实际.包含少量有标记样本和大量无标记样本且还带概念漂移的数据流给机器学习带来了极大挑战.然而,现有研究主要关注有监督的数据流分类,针对带概念漂移的数据流的半监督分类的研究尚未引起足够的重视....  相似文献   

10.
Incremental learning has been used extensively for data stream classification. Most attention on the data stream classification paid on non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new classification algorithm for the classification of batch data called harmony-based classifier and then give its incremental version for classification of data streams called incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in absence of drifts and increase its robustness in presence of noise. This improved version is called improved incremental harmony-based classifier. The proposed methods are evaluated on some real world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and also the proposed incremental methods can effectively address the issues usually encountered in the data stream environments. Improved incremental harmony-based classifier has significantly better speed and accuracy on capturing concept drifts than the non-incremental harmony based method and its accuracy is comparable to non-evolutionary algorithms. The experimental results also show the robustness of improved incremental harmony-based classifier.  相似文献   

11.
数据流分类研究在开放、动态环境中如何提供更可靠的数据驱动预测模型, 关键在于从实时到达且不断变化的数据流中检测并适应概念漂移. 目前, 为检测概念漂移和更新分类模型, 数据流分类方法通常假设所有样本的标签都是已知的, 这一假设在真实场景下是不现实的. 此外, 真实数据流可能表现出较高且不断变化的类不平衡比率, 会进一步增加数据流分类任务的复杂性. 为此, 提出一种非平衡概念漂移数据流主动学习方法(Active learning method for imbalanced concept drift data stream, ALM-ICDDS). 定义基于多预测概率的样本预测确定性度量, 提出边缘阈值矩阵的自适应调整方法, 使得标签查询策略适用于类别数较多的非平衡数据流; 提出基于记忆强度的样本替换策略, 将难区分、少数类样本和代表当前数据分布的样本保存在记忆窗口中, 提升新基分类器的分类性能; 定义基于分类精度的基分类器重要性评价及更新方法, 实现漂移后的集成分类器更新. 在7个合成数据流和3个真实数据流上的对比实验表明, 提出的非平衡概念漂移数据流主动学习方法的分类性能优于6种概念漂移数据流学习方法.  相似文献   

12.
Mining data streams is the process of extracting information from non-stopping, rapidly flowing data records to provide knowledge that is reliable and timely. Streaming data algorithms need to be one pass and operate under strict limitations of memory and response time. In addition, the classification of streaming data requires learning in an environment where the data characteristics might change constantly. Many of the classification algorithms presented in literature assume a 100 % labeling rate, which is impractical and expensive when data records are rapidly flowing in. In this paper, a new incremental grid density based learning framework, the GC3 framework, is proposed to perform classification of streaming data with concept drift and limited labeling. The proposed framework uses grid density clustering to detect changes in the input data space. It maintains an evolving ensemble of classifiers to learn and adapt to the model changes over time. The framework also uses a uniform grid density sampling mechanism to obtain a uniform subset of samples for better classification performance with a lower labeling rate. The entire framework is designed to be one-pass, incremental and work with limited memory to perform any-time classification on demand. Experimental comparison with state of the art concept drift handling systems demonstrate the GC3 frameworks ability to provide high classification performance, using fewer models in the ensemble and with only 4-6 % of the samples labeled. The results show that the GC3 framework is effective and attractive for use in real world data stream classification applications.  相似文献   

13.
Yan  Zhang  Hongle  Du  Gang  Ke  Lin  Zhang  Chen  Yeh-Cheng 《The Journal of supercomputing》2022,78(4):5394-5419

Data stream mining is one of the hot topics in data mining. Most existing algorithms assume that data stream with concept drift is balanced. However, in real-world, the data streams are imbalanced with concept drift. The learning algorithm will be more complex for the imbalanced data stream with concept drift. In online learning algorithm, the oversampling method is used to select a small number of samples from the previous data block through a certain strategy and add them into the current data block to amplify the current minority class. However, in this method, the number of stored samples, the method of oversampling and the weight calculation of base-classifier all affect the classification performance of ensemble classifier. This paper proposes a dynamic weighted selective ensemble (DWSE) learning algorithm for imbalanced data stream with concept drift. On the one hand, through resampling the minority samples in previous data block, the minority samples of the current data block can be amplified, and the information in the previous data block can be absorbed into building a classifier to reduce the impact of concept drift. The calculation method of information content of every sample is defined, and the resampling method and updating method of the minority samples are given in this paper. On the other hand, because of concept drift, the performance of the base-classifier will be degraded, and the decay factor is usually used to describe the performance degradation of base-classifier. However, the static decay factor cannot accurately describe the performance degradation of the base-classifier with the concept drift. The calculation method of dynamic decay factor of the base-classifier is defined in DWSE algorithm to select sub-classifiers to eliminate according to the attenuation situation, which makes the algorithm better deal with concept drift. Compared with other algorithms, the results show that the DWSE algorithm has better classification performance for majority class samples and minority samples.

  相似文献   

14.
在监督或半监督学习的条件下对数据流集成分类进行研究是一个很有意义的方向.从基分类器、关键技术、集成策略等三个方面进行介绍,其中,基分类器主要介绍了决策树、神经网络、支持向量机等;关键技术从增量、在线等方面介绍;集成策略主要介绍了boosting、stacking等.对不同集成方法的优缺点、对比算法和实验数据集进行了总结与分析.最后给出了进一步研究方向,包括监督和半监督学习下对于概念漂移的处理、对于同质集成和异质集成的研究,无监督学习下的数据流集成分类等.  相似文献   

15.
数据流分类是数据挖掘中最重要的任务之一,而数据流的概念漂移特性给分类算法带来了巨大的挑战.基于极限学习机算法进行优化是解决数据流分类问题的一个热门方向,但目前大多数算法都采用提前指定模型参数的方式进行学习,这种做法使得分类模型只能在特定的数据集上才能发挥较好的性能.针对这一问题,提出了一种简单有效的处理概念漂移的算法——自适应在线顺序极限学习机分类算法.算法通过引入自适应模型复杂度机制,从而具有更好的分类性能.然后通过引入自适应遗忘因子与概念漂移检测机制,能够根据动态变化的数据流进行自适应学习,从而可以更好地适应概念漂移.进一步还引入异常点检测机制,避免分类决策边界被异常点破坏.仿真实验表明,所提出算法比同类算法具有更好的稳定性、分类准确性以及概念漂移适应能力.此外,还通过消融实验证实了算法所引入3个机制的有效性.  相似文献   

16.
The number of Internet of Things devices generating data streams is expected to grow exponentially with the support of emergent technologies such as 5G networks. Therefore, the online processing of these data streams requires the design and development of suitable machine learning algorithms, able to learn online, as data is generated. Like their batch-learning counterparts, stream-based learning algorithms require careful hyperparameter settings. However, this problem is exacerbated in online learning settings, especially with the occurrence of concept drifts, which frequently require the reconfiguration of hyperparameters. In this article, we present SSPT, an extension of the Self Parameter Tuning (SPT) optimisation algorithm for data streams. We apply the Nelder–Mead algorithm to dynamically-sized samples, converging to optimal settings in a single pass over data while using a relatively small number of hyperparameter configurations. In addition, our proposal automatically readjusts hyperparameters when concept drift occurs. To assess the effectiveness of SSPT, the algorithm is evaluated with three different machine learning problems: recommendation, regression, and classification. Experiments with well-known data sets show that the proposed algorithm can outperform previous hyperparameter tuning efforts by human experts. Results also show that SSPT converges significantly faster and presents at least similar accuracy when compared with the previous double-pass version of the SPT algorithm.  相似文献   

17.
李南  郭躬德  陈黎飞 《计算机应用》2012,32(8):2176-2185
传统的概念漂移数据流分类算法通常利用测试数据的真实类标来检测数据流是否发生概念漂移,并根据需要调整分类模型。然而,真实类标的标记需要耗费大量的人力、物力,而持续不断到来的高速数据流使得这种解决方案在现实中难以实现。针对上述问题,提出一种基于少量类标签的概念漂移检测算法。它根据快速KNNModel算法利用模型簇分类的特点,在未知分类数据类标的情况下,根据当前数据块不被任一模型簇覆盖的实例数目较之前数据块在一定的显著水平下是否发生显著增大,来判断是否发生概念漂移。在概念漂移发生的情况下,让领域专家针对那些少量的不被模型簇覆盖的数据进行标记,并利用这些数据自我修正模型,较好地解决了概念漂移的检测和模型自我更新问题。实验结果表明,该方法能够在自适应处理数据流概念漂移的前提下对数据流进行快速的分类,并得到和传统数据流分类算法近似或更高的分类精度。  相似文献   

18.
网络流量特征分布的动态变化产生概念漂移问题,造成基于机器学习的网络流量分类模型精度下降.定期更新分类模型耗时且无法保证分类模型的泛化能力.基于此,提出一种基于散度的网络流概念漂移分类方法(ensemble classification based on divergence detection, ECDD),采用双层窗口机制,从信息熵的角度出发,根据流量特征分布的JS散度,记为JSD(Jensen-Shannon divergence)来度量滑动窗口内数据分布的差异,从而检测概念漂移.借鉴增量集成学习的思想,检测到漂移时对于新样本重新训练出新的分类器,之后通过分类器权值排序,保留性能较高的分类器,加权集成分类结果对样本进行分类.抓取常见的网络应用流量,根据应用特征分布的不同构建概念漂移数据集,将该方法与常见的概念漂移检测方法进行实验对比,实验结果表明:该方法可以有效地检测概念漂移和更新分类器,表现出较好的分类性能.  相似文献   

19.
In many applications of information systems learning algorithms have to act in dynamic environments where data are collected in the form of transient data streams. Compared to static data mining, processing streams imposes new computational requirements for algorithms to incrementally process incoming examples while using limited memory and time. Furthermore, due to the non-stationary characteristics of streaming data, prediction models are often also required to adapt to concept drifts. Out of several new proposed stream algorithms, ensembles play an important role, in particular for non-stationary environments. This paper surveys research on ensembles for data stream classification as well as regression tasks. Besides presenting a comprehensive spectrum of ensemble approaches for data streams, we also discuss advanced learning concepts such as imbalanced data streams, novelty detection, active and semi-supervised learning, complex data representations and structured outputs. The paper concludes with a discussion of open research problems and lines of future research.  相似文献   

20.
目前数据流分类算法大多是基于类分布这一理想状态,然而在真实数据流环境中数据分布往往是不均衡的,并且数据流中往往伴随着概念漂移。针对数据流中的不均衡问题和概念漂移问题,提出了一种新的基于集成学习的不均衡数据流分类算法。首先为了解决数据流的不均衡问题,在训练模型前加入混合采样方法平衡数据集,然后采用基分类器加权和淘汰策略处理概念漂移问题,从而提高分类器的分类性能。最后与经典数据流分类算法在人工数据集和真实数据集上进行对比实验,实验结果表明,本文提出的算法在含有概念漂移和不均衡的数据流环境中,其整体分类性能优于其他算法的。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号