Similar literature
1.
Yi Wang, Tao Li 《Applied Intelligence》2018, 48(10): 3248-3262
Semi-supervised learning, which uses a large amount of unlabeled data to improve the performance of a classifier when only a limited amount of labeled data is available, has recently become a hot topic in machine learning research. In this paper, we propose a semi-supervised ensemble-of-classifiers approach for learning in time-varying data streams. The algorithm maintains all the desirable properties of the semi-supervised Co-trained random FOREST algorithm (Co-Forest) and extends it to evolving data streams. It assigns each example a weight drawn from Poisson(1) to simulate bootstrap sampling in data streams, which preserves the diversity of the Random Forest. By using incremental learning, it avoids unnecessary retraining and improves the accuracy of the base models. In addition, ADaptive WINdowing (ADWIN2) is introduced to deal with concept drift, allowing the method to adapt to a changing environment. Empirical evaluation on both synthetic and UCI data shows that the proposed method outperforms state-of-the-art semi-supervised and supervised methods on time-varying data streams and also achieves relatively high performance on stationary streams.
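
The Poisson(1) weighting mentioned in this abstract can be illustrated with a minimal sketch of online bagging. This is not the authors' implementation: the names `poisson1`, `MajorityClassLearner`, and `OnlineBagging` are hypothetical, the base learner is a trivial majority-class stand-in rather than a tree, and the semi-supervised and ADWIN2 parts are left out.

```python
import math
import random
from collections import Counter


def poisson1(rng):
    """Draw k ~ Poisson(1) using Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


class MajorityClassLearner:
    """Hypothetical stand-in base learner: predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y, weight=1):
        self.counts[y] += weight

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    """Each model sees every example k ~ Poisson(1) times, which
    approximates bootstrap resampling without storing the stream."""

    def __init__(self, n_models=10, seed=0):
        self.rng = random.Random(seed)
        self.models = [MajorityClassLearner() for _ in range(n_models)]

    def learn_one(self, x, y):
        for model in self.models:
            k = poisson1(self.rng)
            if k > 0:  # k == 0 means this model skips the example
                model.learn_one(x, y, weight=k)

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.models)
        votes.pop(None, None)  # ignore models that have seen no data
        return votes.most_common(1)[0][0] if votes else None
```

Because each model draws its own weights, the models end up trained on different effective samples, which is what keeps the ensemble diverse.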

2.
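Sliding-window-based anomaly detection is an important topic in data stream mining research. In many applications, data streams are transmitted over a distributed network, and distributed computing techniques are commonly used to obtain real-time, high-quality results. This paper gives a formal statement of the problem of continuous anomaly detection over distributed evolving data streams and proposes two definitions of anomalies based on kernel density estimation, together with corresponding detection algorithms. Experiments on a large number of real data sets show that the algorithms are efficient and scalable and meet the requirements of data stream applications.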
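
As a rough illustration of density-based anomaly detection over a sliding window, the sketch below scores each arriving value by a Gaussian kernel density estimate computed from the current window. It assumes a single node and one-dimensional data; the class name `SlidingWindowKDE` and its parameters are hypothetical, and the paper's two anomaly definitions and its distributed computation are not reproduced here.

```python
import math
from collections import deque


class SlidingWindowKDE:
    """Flags a value as anomalous when its estimated density in the window is low."""

    def __init__(self, window_size=100, bandwidth=1.0, density_threshold=0.01):
        self.window = deque(maxlen=window_size)  # oldest values expire automatically
        self.bandwidth = bandwidth
        self.density_threshold = density_threshold

    def density(self, x):
        """Gaussian kernel density estimate at x from the values in the window."""
        if not self.window:
            return 0.0
        h = self.bandwidth
        norm = 1.0 / (len(self.window) * h * math.sqrt(2.0 * math.pi))
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in self.window)

    def update(self, x):
        """Score x against the current window, then add it to the window."""
        is_anomaly = bool(self.window) and self.density(x) < self.density_threshold
        self.window.append(x)
        return is_anomaly
```

The bandwidth and the density threshold are the two tuning knobs in this sketch; both would need to be chosen for the data at hand.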

3.
A data stream is a potentially uninterrupted flow of data. Mining this flow makes it necessary to cope with uncertainty, as only a part of the stream can be stored. In this paper, we evaluate a statistical technique which biases the estimation of the support of patterns, so as to maximize either the precision or the recall, as chosen by the user, and limit the degradation of the other criterion. Theoretical results show that the technique is not far from the optimum, from the statistical standpoint. Experiments performed tend to demonstrate its potential, as it remains robust even under significant distribution drifts.

4.
Learning model trees from evolving data streams   (Total citations: 2; self-citations: 0; citations by others: 2)
The problem of real-time extraction of meaningful patterns from time-changing data streams is of increasing importance for the machine learning and data mining communities. Regression in time-changing data streams is a relatively unexplored topic, despite the apparent applications. This paper proposes an efficient and incremental stream mining algorithm which is able to learn regression and model trees from possibly unbounded, high-speed and time-changing data streams. The algorithm is evaluated extensively in a variety of settings involving artificial and real data. To the best of our knowledge, there is no other general-purpose algorithm for incremental learning of regression/model trees that is able to perform explicit change detection and informed adaptation. The algorithm performs online and in real time, observes each example only once at the speed of arrival, and maintains a ready-to-use model tree at any time. The tree leaves contain linear models induced online from the examples assigned to them, a process with low complexity. The algorithm has mechanisms for drift detection and model adaptation, which enable it to maintain accurate and updated regression models at any time. The drift detection mechanism exploits the structure of the tree in the process of local change detection. As a response to local drift, the algorithm is able to update the tree structure only locally. This approach improves the any-time performance and greatly reduces the costs of adaptation.

5.
Tracking clusters in evolving data streams over sliding windows   (Total citations: 6; self-citations: 4; citations by others: 2)
Mining data streams poses great challenges due to the limited memory availability and real-time query response requirement. Clustering an evolving data stream is especially interesting because it captures not only the changing distribution of clusters but also the evolving behaviors of individual clusters. In this paper, we present a novel method for tracking the evolution of clusters over sliding windows. In our SWClustering algorithm, we combine the exponential histogram with temporal cluster features to propose a novel data structure, the Exponential Histogram of Cluster Features (EHCF). The exponential histogram is used to handle the in-cluster evolution, and the temporal cluster features represent the change of the cluster distribution. Our approach has several advantages over existing methods: (1) the quality of the clusters is improved because the EHCF captures the distribution of recent records precisely; (2) compared with previous methods, the mechanism employed to adaptively maintain the in-cluster synopsis can track the cluster evolution better, while consuming much less memory; (3) the EHCF provides a flexible framework for analyzing the cluster evolution and tracking a specific cluster efficiently without interfering with other clusters, thus reducing the consumption of computing resources for data stream clustering. Both the theoretical analysis and extensive experiments show the effectiveness and efficiency of the proposed method. Aoying Zhou is currently a Professor in Computer Science at Fudan University, Shanghai, P.R. China. He received his Bachelor's and Master's degrees in Computer Science from Sichuan University in Chengdu, Sichuan, P.R. China in 1985 and 1988, respectively, and his Ph.D. degree from Fudan University in 1993. He has served as a member or chair of the program committee for many international conferences such as WWW, SIGMOD, VLDB, EDBT, ICDCS, ER, DASFAA, PAKDD, and WAIM. 
His papers have been published in ACM SIGMOD, VLDB, ICDE, and several other international journals. His research interests include data mining and knowledge discovery, XML data management, Web mining and searching, data stream analysis and processing, and peer-to-peer computing. Feng Cao is currently an R&D engineer in IBM China Research Laboratories. He received a B.E. degree from Xi'an Jiao Tong University, Xi'an, P.R. China, in 2000 and an M.E. degree from Huazhong University of Science and Technology, Wuhan, P.R. China, in 2003. From October 2004 to March 2005, he worked in Fudan-NUS Competency Center for Peer-to-Peer Computing, Singapore. In 2006, he received his Ph.D. degree from Fudan University, Shanghai, P.R. China. His current research interests include data mining and data streams. Weining Qian is currently an Assistant Professor in computer science at Fudan University, Shanghai, P.R. China. He received his M.S. and Ph.D. degrees in computer science from Fudan University in 2001 and 2004, respectively. He is supported by Shanghai Rising-Star Program under Grant No. 04QMX1404 and National Natural Science Foundation of China (NSFC) under Grant No. 60673134. He has served as a program committee member of several international conferences, including DASFAA 2006, 2007 and 2008, APWeb/WAIM 2007, INFOSCALE 2007, and ECDM 2007. His papers have been published in ICDE, SIAM DM, and CIKM. His research interests include data stream query processing and mining, and large-scale distributed computing for database applications. Cheqing Jin is currently an Assistant Professor in Computer Science at East China University of Science and Technology. He received his Bachelor's and Master's degrees in Computer Science from Zhejiang University in Hangzhou, P.R. China in 1999 and 2002, respectively, and the Ph.D. degree from Fudan University, Shanghai, P.R. China. He worked as a Research Assistant at E-business Technology Institute, the Hong Kong University from December 2003 to May 2004. 
His current research interests include data mining and data streams.

6.
Many challenging real world problems involve multi-label data streams. Efficient methods exist for multi-label classification in non-streaming scenarios. However, learning in evolving streaming scenarios is more challenging, as classifiers must be able to deal with huge numbers of examples and to adapt to change using limited time and memory while being ready to predict at any point. This paper proposes a new experimental framework for learning and evaluating on multi-label data streams, and uses it to study the performance of various methods. From this study, we develop a multi-label Hoeffding tree with multi-label classifiers at the leaves. We show empirically that this method is well suited to this challenging task. Using our new framework, which allows us to generate realistic multi-label data streams with concept drift (as well as real data), we compare with a selection of baseline methods, as well as new learning methods from the literature, and show that our Hoeffding tree method achieves fast and more accurate performance.
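
For readers unfamiliar with Hoeffding trees, the sketch below shows the standard Hoeffding-bound split test that such trees rely on: a leaf is split once the observed gap between the two best split candidates exceeds a bound that shrinks as more examples arrive. The function names and example numbers are hypothetical, and the multi-label extension described in the abstract is not covered.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability at least 1 - delta, the true
    mean of a variable with the given range lies within epsilon of the mean
    observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best candidate beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

With a gain range of 1 and delta = 1e-7, a gap of 0.15 between the two best candidates is not enough to split after 100 examples but is enough after 1000, because the bound falls from about 0.28 to about 0.09.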

7.
A framework for on-demand classification of evolving data streams   (Total citations: 4; self-citations: 0; citations by others: 4)
Current models of the classification problem do not effectively handle bursts of particular classes coming in at different times. In fact, the current model of the classification problem simply concentrates on methods for one-pass classification modeling of very large data sets. Our model for data stream classification views the data stream classification problem from the point of view of a dynamic approach in which simultaneous training and test streams are used for dynamic classification of data sets. This model reflects real-life situations effectively, since it is desirable to classify test streams in real time over an evolving training and test stream. The aim here is to create a classification system in which the training model can adapt quickly to the changes of the underlying data stream. In order to achieve this goal, we propose an on-demand classification process which can dynamically select the appropriate window of past training data to build the classifier. The empirical results indicate that the system maintains a high classification accuracy in an evolving data stream, while providing an efficient solution to the classification task.

8.
In this research we present a novel approach to the concept change detection problem. Change detection is a fundamental issue in data stream mining, as the classification models generated need to be updated when significant changes in the underlying data distribution occur. A number of change detection approaches have been proposed, but they all suffer from limitations with respect to one or more key performance factors, such as high computational complexity, poor sensitivity to gradual change, or the opposite problem of a high false positive rate. Our approach uses reservoir sampling to build a sequential change detection model that offers statistically sound guarantees on false positive and false negative rates but has much smaller computational complexity than the ADWIN concept drift detector. Extensive experimentation on a wide variety of datasets reveals that the scheme also has a smaller false detection rate while maintaining a true detection rate competitive with ADWIN.
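
The reservoir sampling primitive that this abstract builds on can be sketched as follows. This is the classic single-pass algorithm (often called Algorithm R), shown only to illustrate how a fixed-size uniform sample is kept from a stream of unknown length; the function name is hypothetical and the change detection model built on top of the sample is not shown.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory use stays at k items regardless of stream length, which is what makes the technique attractive for low-cost detectors.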

9.
Dalibor Krleža, Boris Vrdoljak, Mario Brčić 《Machine Learning》2021, 110(1): 139-184
Machine Learning - Anomaly detection is a hard data analysis process that requires constant creation and improvement of data analysis algorithms. Using traditional clustering algorithms to analyse...

10.
Knowledge and Information Systems - Ensemble methods represent an effective way to solve supervised learning problems. Such methods are prevalent for learning from evolving data streams. One of the...

11.
In the era of big data, considerable research focus is being put on designing efficient algorithms capable of learning and extracting high-level knowledge from ubiquitous data streams in an online fashion. While most existing algorithms assume that data samples are drawn from a stationary distribution, several complex environments deal with data streams that are subject to change over time. Taking this aspect into consideration is an important step towards building truly aware and intelligent systems. In this paper, we propose GNG-A, an adaptive method for incremental unsupervised learning from evolving data streams experiencing various types of change. The proposed method maintains a continuously updated network (graph) of neurons by extending the Growing Neural Gas algorithm with three complementary mechanisms, allowing it to closely track both gradual and sudden changes in the data distribution. First, an adaptation mechanism handles local changes where the distribution is only non-stationary in some regions of the feature space. Second, an adaptive forgetting mechanism identifies and removes neurons that become irrelevant due to the evolving nature of the stream. Finally, a probabilistic evolution mechanism creates new neurons when there is a need to represent data in new regions of the feature space. The proposed method is demonstrated for anomaly and novelty detection in non-stationary environments. Results show that the method handles different data distributions and efficiently reacts to various types of change.

12.
Artificial Intelligence Review - Data stream mining has become an important research area over the past decade due to the increasing amount of data available today. Sources from various domains...

13.
Data Mining and Knowledge Discovery - The world is constantly changing, and so is the massive amount of data produced. However, only a few studies deal with online class imbalance learning that...

14.
Existing density-based data stream clustering algorithms use a two-phase scheme consisting of an online phase, in which raw data is processed to gather summary statistics, and an offline phase that generates the clusters from the summary data. In this article we propose a data stream clustering method based on a multi-agent system that uses a decentralized, bottom-up, self-organizing strategy to group similar data points. Data points are associated with agents and deployed onto a 2D space, where they work simultaneously by applying a heuristic strategy based on a bio-inspired model known as the flocking model. Agents move through the space for a fixed time and, when they encounter other agents within a predefined visibility range, they can decide to form a flock if they are similar. Flocks can join to form swarms of similar groups. This strategy makes it possible to merge the two phases of density-based approaches and thus to avoid the computationally demanding offline cluster computation, since a swarm represents a cluster. Experimental results show that the bio-inspired approach can obtain very good results on real and synthetic data sets.

15.
Neural Computing and Applications - Ensemble learning is one of the most frequently used techniques for handling concept drift, which is the greatest challenge for learning high-performance models...

16.
Data streams have recently been explored extensively because they arise in many applications, such as sensor networks, web click streams, and network flows. One of the most important challenges in data streams is concept change, where the underlying data distribution changes over time. The vast majority of research on data stream mining is devoted to labeled data, whereas in real-world practice labels are rarely available to the learning algorithms. Moreover, most methods that detect changes in unlabeled data streams handle only numerical data sets, and they face considerable difficulty as the dimensionality of the data increases. In this paper, we present a Precise Statistical approach for Concept Change Detection in unlabeled data streams, abbreviated PSCCD, which detects changes using an exchangeability test. This hypothesis test is derived from a martingale and is based on Doob's Maximal Inequality. The advantages of our approach are threefold. First, it does not require a sliding window on the data stream, whose size is a well-known challenging issue; second, it works well on multi-dimensional data streams; and last but not least, it is applicable to different types of data, including categorical, numerical, and mixed-attribute data streams. To explore these advantages, we conduct numerous experiments with different settings and specifications. The results are very promising.
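
A martingale-based exchangeability test of the kind this abstract refers to can be sketched as below. This is not PSCCD itself: it assumes a randomized power martingale over one-dimensional data with a simple distance-to-mean strangeness score, and the class name and parameters are hypothetical. Doob's maximal inequality bounds the chance that the martingale ever exceeds `threshold` on exchangeable data by 1/threshold, so the threshold directly controls the false alarm rate.

```python
import random


class PowerMartingaleDetector:
    """Signals a change when the martingale value exceeds the threshold."""

    def __init__(self, epsilon=0.92, threshold=100.0, seed=0):
        self.epsilon = epsilon      # power martingale exponent, in (0, 1)
        self.threshold = threshold  # false alarm probability is at most 1 / threshold
        self.rng = random.Random(seed)
        self.values = []            # observations since the last detected change
        self.martingale = 1.0

    def _p_value(self):
        """Randomized p-value of the newest point's strangeness."""
        mean = sum(self.values) / len(self.values)
        scores = [abs(v - mean) for v in self.values]  # assumed strangeness measure
        current = scores[-1]
        greater = sum(1 for s in scores if s > current)
        equal = sum(1 for s in scores if s == current)
        return (greater + self.rng.random() * equal) / len(scores)

    def update(self, x):
        """Consume one value; return True if a change is signalled."""
        self.values.append(x)
        p = max(self._p_value(), 1e-12)  # guard against a zero p-value
        self.martingale *= self.epsilon * p ** (self.epsilon - 1.0)
        if self.martingale > self.threshold:
            self.values.clear()
            self.martingale = 1.0
            return True
        return False
```

This sketch stores every value since the last detection in order to recompute strangeness scores, so it does not need a window size to be chosen, but its memory use grows between detections.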

17.
On clustering massive text and categorical data streams   (Total citations: 4; self-citations: 4; citations by others: 0)
In this paper, we will study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has been studied recently for numeric data streams, the problems of text and categorical data present different challenges because of the large and un-ordered nature of the corresponding attributes. Therefore, we will propose algorithms for text and categorical data stream clustering. We will propose a condensation based approach for stream clustering which summarizes the stream into a number of fine grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. Thus, this provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We will test the approach for a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.

18.
19.
The last decade has seen a surge of interest in adaptive learning algorithms for data stream classification, with applications ranging from predicting ozone level peaks and learning stock market indicators to detecting computer security violations. In addition, a number of methods have been developed to detect concept drifts in these streams. Consider a scenario where we have a number of classifiers with diverse learning styles and different drift detectors. Intuitively, the current ‘best’ (classifier, detector) pair is application dependent and may change as a result of the stream evolution. Our research builds on this observation. We introduce the Tornado framework that implements a reservoir of diverse classifiers, together with a variety of drift detection algorithms. In our framework, all (classifier, detector) pairs proceed, in parallel, to construct models against the evolving data streams. At any point in time, we select the pair which currently yields the best performance. To this end, we introduce the CAR measure, which is employed to balance classification, adaptation and resource utilization requirements. We further incorporate two novel stacking-based drift detection methods, namely the FHDDMS and \(\mathrm{FHDDMS}_{\mathrm{add}}\) approaches. The experimental evaluation confirms that the current ‘best’ (classifier, detector) pair is not only heavily dependent on the characteristics of the stream, but also that this selection evolves as the stream flows. Further, our FHDDMS variants detect concept drifts accurately in a timely fashion while outperforming the state-of-the-art.

20.