Similar literature
 20 similar documents found
1.
The sliding window is a widely used model for data stream mining because it emphasizes recent data and has a bounded memory requirement. The main idea behind a transactional sliding window is to keep a fixed-size window over a data stream. The window size is kept constant by removing old transactions from the window when new transactions arrive. Older transactions are removed from the window irrespective of whether a significant change has occurred. Another challenge of the sliding window model is determining the window size. The classic approach is to obtain it from the user. To determine the precise size of the window, the user must have prior knowledge about the time and scale of changes within the data stream. However, because data streams change unpredictably, this prior knowledge cannot be easily obtained. Moreover, using a fixed window size throughout data stream mining degrades the model's ability to reflect recent changes. Based on these observations, this study relaxes the notion of window size and proposes a new algorithm named VSW (Variable Size sliding Window frequent itemset mining), which is suitable for observing recent changes in the set of frequent itemsets over data streams. The window size is determined dynamically based on the amount of concept change that occurs within the arriving data stream. The window expands as the concept becomes stable and shrinks when a concept change occurs. In this study, it is shown that if stale transactions are removed from the window after a concept change, updated frequent itemsets always belong to the most recent concept. Experimental evaluations on both synthetic and real data show that our algorithm effectively detects the concept change, adjusts the window size, and adapts itself to the new concepts along the data stream.
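
The expand-and-shrink behaviour described in this abstract can be illustrated with a minimal sketch. It is not the VSW algorithm itself: the change measure (one minus the Jaccard similarity between the items seen in the newest transactions and the items seen in the rest of the window) and the names `VariableSizeWindow`, `recent_size`, and `change_threshold` are assumptions made only for this example.

```python
from collections import deque


class VariableSizeWindow:
    """Toy variable-size transaction window (illustrative, not the published VSW)."""

    def __init__(self, recent_size=2, change_threshold=0.5):
        self.window = deque()
        self.recent_size = recent_size            # hypothetical parameter
        self.change_threshold = change_threshold  # hypothetical parameter

    @staticmethod
    def _items(transactions):
        """Union of all items appearing in the given transactions."""
        items = set()
        for transaction in transactions:
            items.update(transaction)
        return items

    def add(self, transaction):
        """Append a transaction; return True if the window shrank."""
        self.window.append(frozenset(transaction))
        if len(self.window) <= self.recent_size:
            return False
        transactions = list(self.window)
        old = self._items(transactions[:-self.recent_size])
        new = self._items(transactions[-self.recent_size:])
        union = old | new
        # Assumed change measure: 1 - Jaccard similarity of the two item sets.
        change = 1.0 - len(old & new) / len(union) if union else 0.0
        if change > self.change_threshold:
            # Concept change: drop the stale transactions, keep the recent ones.
            for _ in range(len(self.window) - self.recent_size):
                self.window.popleft()
            return True
        # Stable concept: the window simply keeps growing.
        return False
```

While the stream keeps producing the same items the window grows by one transaction per call; once the newest transactions share too few items with the older ones, only the newest `recent_size` transactions are retained.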

2.
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely used to improve computation times, reduce computation requirements, decrease the impact of the curse of dimensionality, and enhance the generalization rates of classifiers. In data streams, classifiers benefit from all of the above but, more importantly, from the fact that the relevant subset of features may drift over time. In this paper, we propose a novel dynamic feature selection method for data streams called Adaptive Boosting for Feature Selection (ABFS). ABFS chains decision stumps and drift detectors and, as a result, identifies which features are relevant to the learning task as the stream progresses with reasonable success. In addition to our proposed algorithm, we bring feature selection-specific metrics from batch learning to streaming scenarios. Next, we evaluate ABFS according to these metrics in both synthetic and real-world scenarios. As a result, ABFS improves the classification rates of different types of learners and eventually improves computational resource usage.

3.
This work aims to connect two rarely combined research directions, i.e., non-stationary data stream classification and data analysis with skewed class distributions. We propose a novel framework employing stratified bagging for training base classifiers to integrate data preprocessing and dynamic ensemble selection methods for imbalanced data stream classification. The proposed approach has been evaluated in computer experiments carried out on 135 artificially generated data streams with various imbalance ratios, label noise levels, and types of concept drift, as well as on two selected real streams. Four preprocessing techniques and two dynamic selection methods, used at both the bagging classifier and base estimator levels, were considered. Experimental results showed that, for highly imbalanced data streams, dynamic ensemble selection coupled with data preprocessing could outperform online and chunk-based state-of-the-art methods.

4.
It is challenging to use traditional data mining techniques to deal with real-time data stream classification. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept-drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from the UCI (University of California, Irvine) machine learning repository. The experimental results show that our approach offers great flexibility and robustness in novel class detection under concept drift and outperforms traditional classification models in challenging real-life data stream applications.
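
The separation idea in this abstract (a point far from every existing cluster is a candidate novel-class instance) can be sketched as follows. This is only an illustration under stated assumptions: clusters are represented as hypothetical `(centroid, radius)` pairs and "well separated" is taken to mean "outside every cluster's radius"; the abstract does not specify the actual criterion.

```python
import math


def is_novel(point, clusters):
    """Return True if `point` lies outside every known cluster.

    `clusters` is a list of (centroid, radius) pairs; this representation
    is an assumption made for the example, not taken from the paper.
    """
    return all(math.dist(point, centroid) > radius
               for centroid, radius in clusters)


# Two known clusters, given as (centroid, radius).
known_clusters = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.5)]
inside = is_novel((0.5, 0.5), known_clusters)     # close to the first centroid
outside = is_novel((10.0, 10.0), known_clusters)  # far from both centroids
```

A single distant point may equally be an outlier or noise, so a practical detector would presumably wait for several mutually close distant points before declaring a new class.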

5.
In recent times, data are generated in the form of continuous data streams in many applications. Since handling data streams is necessary and discovering knowledge behind data streams can often yield substantial benefits, mining over data streams has become one of the most important issues. Many approaches for mining frequent itemsets over data streams have been proposed. These approaches often consist of two procedures: continuously maintaining synopses for data streams and finding frequent itemsets from the synopses. However, most of the approaches assume that the synopses of data streams can be saved in memory and ignore the fact that the information of the non-frequent itemsets kept in the synopses may significantly degrade memory utilization. In this paper, we consider compressing the information of all the itemsets into a structure with a fixed size using a hash-based technique. This hash-based approach summarizes the information of the whole data stream by using a hash table, provides a novel technique to estimate the support counts of the non-frequent itemsets, and keeps only the frequent itemsets for speeding up the mining process. Therefore, the goal of optimizing memory space utilization can be achieved. The correctness guarantee, error analysis, and parameter setting of this approach are presented, and a series of experiments is performed to show the effectiveness and the efficiency of this approach.

6.
Mining frequent itemsets from transactional data streams is challenging due to the exponential explosion of itemsets and the limited memory space available for mining frequent itemsets. Given a domain of I unique items, the possible number of itemsets can be up to 2^I − 1. When the length of a data stream approaches a very large number N, the possibility of an itemset being frequent becomes larger and harder to track with limited memory. The existing studies on finding frequent items from high speed data streams are false-positive oriented. That is, they control memory consumption in the counting processes by an error parameter ε, and allow items with support below the specified minimum support s but above s − ε to be counted as frequent ones. However, such false-positive oriented approaches cannot be effectively applied to frequent itemset mining for two reasons. First, the false-positive items found increase the number of false-positive frequent itemsets exponentially. Second, minimizing the number of false-positive items found, by using a small ε, will make memory consumption large. Therefore, such approaches may make the problem computationally intractable with bounded memory consumption. In this paper, we developed algorithms that can effectively mine frequent item(set)s from high speed transactional data streams with a bound on memory consumption. Our algorithms are based on the Chernoff bound, in which we use a running error parameter to prune item(set)s and use a reliability parameter to control memory. While our algorithms are false-negative oriented, that is, certain frequent itemsets may not appear in the results, the number of false-negative itemsets can be controlled by a predefined parameter so that the desired recall rate of frequent itemsets can be guaranteed. Our extensive experimental studies show that the proposed algorithms have high accuracy, require less memory, and consume less CPU time. They significantly outperform the existing false-positive algorithms.
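
To make the false-positive behaviour criticized in this abstract concrete, the sketch below implements Lossy Counting for single items, a well-known false-positive counting scheme. It is a swapped-in illustration, not the Chernoff-bound algorithm the abstract proposes. The function names and the example stream are assumptions made for this example.

```python
import math


def lossy_counting(stream, epsilon):
    """Approximate item counts with error at most epsilon * N (Lossy Counting)."""
    width = math.ceil(1 / epsilon)   # bucket width
    counts = {}                      # item -> (frequency, max possible undercount)
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            freq, delta = counts[item]
            counts[item] = (freq + 1, delta)
        else:
            counts[item] = (1, bucket - 1)
        if n % width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket}
    return counts, n


def frequent_items(counts, n, s, epsilon):
    """Report every item whose stored count reaches (s - epsilon) * N."""
    return {k for k, (f, _) in counts.items() if f >= (s - epsilon) * n}


stream = ["a", "b", "a", "b", "a", "a", "b", "c"]
counts, n = lossy_counting(stream, epsilon=0.25)
reported = frequent_items(counts, n, s=0.5, epsilon=0.25)
```

In this stream "b" occurs 3 times out of 8 (support 0.375, below s = 0.5 but above s − ε = 0.25), so it is reported alongside "a": a false positive of exactly the kind the abstract describes.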

7.
The field of fault detection and diagnosis has been the subject of considerable interest in industry. Fault detection may increase the availability of products, thereby improving their quality. Fault detection and diagnosis methods can be classified into three categories: data-driven, analytically based, and knowledge-based methods.

8.
宋擒豹  杜磊 《计算机应用》2012,32(2):299-303
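A data stream is a kind of dynamic data that may change over time under the influence of certain factors, and such changes often imply some event in the real world. How to detect changes in data streams promptly and accurately has become a research hotspot in data stream mining and has very broad practical applications. This paper describes data stream changes and the core tasks of change detection, summarizes a general framework for change detection, analyzes and evaluates the currently known data stream change detection methods and their technical characteristics, and finally looks ahead to the development directions of data stream change detection technology.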

9.
A data stream is a massive and unbounded sequence of data elements that are continuously generated at a fast speed. Compared with traditional approaches, data mining in data streams is more challenging since several extra requirements need to be satisfied. In this paper, we propose a mining algorithm for finding frequent itemsets over a transactional data stream. Unlike most existing algorithms, our method is based on the theory of Approximate Inclusion–Exclusion. Without incrementally maintaining the overall synopsis of the stream, we can approximate the itemsets' counts according to certain kept information and the count-bounding technique. Some additional techniques are designed and integrated into the algorithm for performance improvement. In addition, the performance of the proposed algorithm is tested and analyzed through a series of experiments.

10.
Learning from continuous streams of data has been receiving increasing attention in recent years. Among the many challenges related to mining data streams, change detection is one frequently addressed topic. Being able to determine whether or not data characteristics are changing over time is a major concern for data stream algorithms, be it in the supervised or the unsupervised scenario. The unsupervised scenario is particularly relevant because many practical applications do not provide target labeling information. In this scenario, most strategies induce consecutive models over time and compare them in order to detect data changes. In this situation, model changes are assumed to be a consequence of data modifications. However, there is no guarantee that this assumption is true, since those algorithms do not rely on any theoretical background to ensure that model divergences truly indicate data changes. The need for such a theoretical framework has motivated this paper to propose a new stability concept to establish bounds on the learning abilities of unsupervised algorithms designed to detect changes in data streams. This stability concept, based on the surrogate data strategy from time series analysis, provides learning guarantees for online unsupervised algorithms even in the case of time dependency among observations. Furthermore, we propose a new change detection algorithm that meets the requirements of this stability concept. Experimental results on different synthetic scenarios illustrate how the stability concept proposed in this paper is applied to detect changes in unsupervised data streams.

11.
Smartphones centralize a great deal of users' private information and are thus a primary target for cyber-attack. The main goal of the attacker is to access and exfiltrate the private information stored in the smartphone without detection. In situations where explicit information is lacking, these attackers can still be detected in an automated way by analyzing data streams (continuously sampled information such as an application's CPU consumption, accelerometer readings, etc.). Once the data stream is clustered, anomaly detection techniques may be applied to it in order to detect attacks in progress. In this paper we utilize an algorithm called pcStream that is well suited for detecting clusters in real world data streams and propose extensions to the pcStream algorithm designed to detect point, contextual, and collective anomalies. We provide a comprehensive evaluation that addresses mobile security issues on a unique dataset collected from 30 volunteers over eight months. Our evaluations show that the pcStream extensions can be used to effectively detect data leakage (point anomalies) and malicious activities (contextual anomalies) associated with malicious applications. Moreover, the algorithm can be used to detect when a device is being used by an unauthorized user (collective anomaly) within approximately 30 s with 1 false positive every two days.

12.
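A data stream mining algorithm called ICEA (incremental classification ensemble algorithm) is proposed. It uses ensemble classifier techniques to achieve incremental detection and mining of concept drift in data streams. Experimental results show that ICEA achieves high accuracy and good time efficiency when handling rapid concept drift in data streams.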

13.
Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different: most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in the classification accuracy. On the other hand, this approach suffers from a major drawback, which is the high computational cost of generating new models. The problem gets worse when concept drift is detected more frequently; hence, a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called "Info-Fuzzy Network" (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.

14.
The rapid evolution of technology has led to the generation of high dimensional data streams in a wide range of fields, such as genomics, signal processing, and finance. The combination of the streaming scenario and high dimensionality is particularly challenging, especially for the outlier detection task. This is due to the special characteristics of the data stream, such as concept drift and limited time and space, in addition to the impact of the well-known curse of dimensionality in high dimensional space. To the best of our knowledge, few studies have addressed these challenges simultaneously, and therefore detecting anomalies in this context requires a great deal of attention. The main objective of this work is to study the main approaches existing in the literature and to identify a set of comparison criteria, such as the computational cost and the interpretation of outliers, which will help us to reveal the different challenges and additional research directions associated with this problem. At the end of this study, we draw up a summary of the main limitations identified and detail the different directions of research related to this issue in order to promote research in this community.

15.
In recent years, classification learning for data streams has become an important and active research topic. A major challenge posed by data streams is that their underlying concepts can change over time, which requires current classifiers to be revised accordingly and in a timely manner. To detect concept change, a common methodology is to observe the online classification accuracy. If accuracy drops below some threshold value, a concept change is deemed to have taken place. An implicit assumption behind this methodology is that any drop in classification accuracy can be interpreted as a symptom of concept change. Unfortunately, this assumption is often violated in the real world, where data streams carry noise that can also introduce a significant reduction in classification accuracy. To compound this problem, traditional noise cleansing methods are unsuitable for data streams. Those methods normally need to scan data multiple times, whereas learning for data streams can only afford a one-pass scan because of the data's high speed and huge volume. Another open problem in data stream classification is how to deal with missing values. When new instances containing missing values arrive, how a learning model classifies them and how the learning model updates itself according to them is an issue that remains largely unexplored. To solve these problems, this paper proposes a novel classification algorithm, flexible decision tree (FlexDT), which extends fuzzy logic to data stream classification. The advantages are three-fold. First, FlexDT offers a flexible structure to effectively and efficiently handle concept change. Second, FlexDT is robust to noise. Hence it can prevent noise from interfering with classification accuracy, and an accuracy drop can be safely attributed to concept change. Third, it deals with missing values in an elegant way. Extensive evaluations are conducted to compare FlexDT with representative existing data stream classification algorithms using a large suite of data streams and various statistical tests. Experimental results suggest that FlexDT offers a significant benefit to data stream classification in real-world scenarios where concept change, noise and missing values coexist.
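
The common methodology this abstract criticizes (flag a concept change when online accuracy falls below a threshold) can be sketched in a few lines. This is the baseline scheme, not FlexDT. The class name `AccuracyDropDetector` and the parameters `window_size` and `threshold` are assumptions made for this example.

```python
from collections import deque


class AccuracyDropDetector:
    """Flags a change when accuracy over the last few predictions is too low."""

    def __init__(self, window_size=5, threshold=0.6):
        # Hypothetical parameters: number of recent predictions to track and
        # the accuracy level below which a concept change is declared.
        self.outcomes = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, predicted, actual):
        """Record one prediction; return True if a change is signalled."""
        self.outcomes.append(predicted == actual)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        accuracy = sum(self.outcomes) / len(self.outcomes)
        return accuracy < self.threshold
```

Because the detector sees only whether each prediction was right or wrong, a burst of noisy labels lowers the measured accuracy just as a real concept change does, which is the weakness the abstract points out.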

16.
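Because existing machine learning algorithms are essentially based on a static learning environment and on an optimization process that aims to preserve the generalization ability of the learning system as far as possible, the classification of concept-drifting data streams poses a great challenge to machine learning. This paper reviews the literature from four perspectives: data streams and concept drift, the development and trends of research on concept-drifting data stream classification, the main research areas of concept-drifting data stream classification, and new developments in this research. It also analyzes the problems of current concept-drifting data stream classification algorithms.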

17.
ILP-based concept discovery in multi-relational data mining   Total citations: 1, self-citations: 0, citations by others: 1
Multi-relational data mining has become popular due to the limitations of propositional problem definition in structured domains and the tendency to store data in relational databases. Several relational knowledge discovery systems have been developed employing various search strategies, heuristics, language pattern limitations and hypothesis evaluation criteria, in order to cope with an intractably large search space and to be able to generate high-quality patterns. In this work, an ILP-based concept discovery method, namely Confidence-based Concept Discovery (C2D), is described in which strong declarative biases and user-defined specifications are relaxed. Moreover, this new method works directly on relational databases. In addition, a new confidence-based pruning is used in this technique. We also describe how to define and use aggregate predicates as background knowledge in the proposed method. In order to use aggregate predicates, we show how to handle numerical attributes by using comparison operators on them. Finally, we analyze the effect of incorporating unrelated facts for generating transitive rules on the proposed method. A set of experiments is conducted on real-world problems to test the performance of the proposed method.

18.
Online mining of frequent sets in data streams with error guarantee   Total citations: 7, self-citations: 5, citations by others: 2
For most data stream applications, the volume of data is too huge to be stored in permanent devices or to be thoroughly scanned more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than the exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms to meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data while guaranteeing that the support error of frequent patterns stays strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.

19.
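Research and progress on burst detection in data streams   Total citations: 2, self-citations: 0, citations by others: 2
Data streams change constantly and are hard to predict. Burst detection is therefore one of the inherent problems of data streams. A burst means that the volume of data within a particular time period is significantly abnormal compared with other time periods. How to detect bursts in a data stream in real time and with reasonable accuracy, and how to present them well to users, has been studied both in China and abroad and has become one of the hot topics in data stream mining. This paper surveys the state of research on burst detection in data streams in China and abroad, summarizes and analyzes the scenarios to which existing work applies, identifies the focal points and hot topics of research, and finally looks ahead to the prospects of the field.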
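
The definition of a burst given in this abstract (the data volume in one time period is significantly abnormal compared with other periods) can be illustrated with a minimal sketch. The rule used here, flagging a period whose count exceeds the mean of the preceding periods by more than `k` standard deviations, is an assumption chosen for this example and is not taken from the surveyed methods. `detect_bursts`, `history`, and `k` are hypothetical names.

```python
from statistics import mean, pstdev


def detect_bursts(counts, history=4, k=3.0):
    """Return indices of periods whose count is abnormally high.

    counts  -- number of data elements observed in each time period
    history -- how many preceding periods form the baseline (assumed)
    k       -- how many standard deviations count as "abnormal" (assumed)
    """
    bursts = []
    for i in range(history, len(counts)):
        past = counts[i - history:i]
        baseline, spread = mean(past), pstdev(past)
        if counts[i] > baseline + k * spread:
            bursts.append(i)
    return bursts


# Per-period element counts with one obvious spike in period 4.
per_period = [10, 12, 11, 9, 50, 10, 11]
burst_periods = detect_bursts(per_period)
```

Note that the spike itself enters the baseline of the following periods and inflates both the mean and the spread, so a practical detector might exclude flagged periods from the history.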

20.
Krleža  Dalibor  Vrdoljak  Boris  Brčić  Mario 《Machine Learning》2021,110(1):139-184
Machine Learning - Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse...
