Similar documents
20 similar documents found (search time: 15 ms)
1.
Data stream classification is a hot topic in data mining research. The great challenge is that the class priors may evolve along the data sequence. Algorithms have been proposed to estimate the dynamic class priors and adjust the classifier accordingly. However, the existing algorithms do not perform well on prior estimation due to the lack of samples from the target distribution. Sample size has a great effect on parameter estimation, and small-sample effects severely contaminate the estimation performance. In this paper, we propose a novel parameter estimation method called transfer estimation. Transfer estimation makes use of samples not only from the target distribution but also from similar distributions. We apply this new estimation method to the existing algorithms and obtain an improved algorithm. Experiments on both synthetic and real data sets show that the improved algorithm outperforms the existing algorithms on both class prior estimation and classification.
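To make the idea concrete, the following minimal Python sketch blends the class-prior estimate from a small current window with estimates from earlier windows, weighting each source by its sample size and by a crude similarity score between priors. The weighting scheme and similarity measure are illustrative assumptions, not the paper's actual transfer estimator.

import numpy as np

def transfer_prior_estimate(current_labels, past_windows, n_classes, similarity_power=1.0):
    """Blend class-prior estimates from the small current window with
    estimates from earlier windows, weighted by sample size and by how
    similar each past window's prior is to the current rough estimate.
    This weighting scheme is illustrative, not the paper's estimator."""
    def prior(labels):
        counts = np.bincount(labels, minlength=n_classes).astype(float)
        return counts / counts.sum()

    target = prior(np.asarray(current_labels))
    weights = [len(current_labels)]
    estimates = [target]
    for labels in past_windows:
        p = prior(np.asarray(labels))
        # similarity in [0, 1]: 1 minus half the L1 distance between priors
        sim = 1.0 - 0.5 * np.abs(p - target).sum()
        weights.append(len(labels) * sim ** similarity_power)
        estimates.append(p)
    weights = np.asarray(weights, dtype=float)
    blended = np.average(np.vstack(estimates), axis=0, weights=weights)
    return blended / blended.sum()

# Example: 30 current samples vs. two 500-sample historical windows.
rng = np.random.default_rng(0)
current = rng.choice(3, size=30, p=[0.6, 0.3, 0.1])
past = [rng.choice(3, size=500, p=[0.5, 0.35, 0.15]) for _ in range(2)]
print(transfer_prior_estimate(current, past, n_classes=3))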

2.
Accurate estimation of class membership probability is needed for many applications in data mining and decision-making, to which multiclass classification is often applied. Since existing methods for estimation of class membership probability are designed for binary classification, in which only a single score outputted from a classifier can be used, an approach for multiclass classification requires both a decomposition of a multiclass classifier into binary classifiers and a combination of estimates obtained from each binary classifier to a target estimate. We propose a simple and general method for directly estimating class membership probability for any class in multiclass classification without decomposition and combination, using multiple scores not only for a predicted class but also for other proper classes. To make it possible to use multiple scores, we propose to modify or extend representative existing methods. As a non-parametric method, which refers to the idea of a binning method as proposed by Zadrozny et al., we create an “accuracy table” by a different method. Moreover we smooth accuracies on the table with methods such as the moving average to yield reliable probabilities (accuracies). As a parametric method, we extend Platt’s method to apply a multiple logistic regression. On two different datasets (open-ended data from Japanese social surveys and the 20 Newsgroups) both with Support Vector Machines and naive Bayes classifiers, we empirically show that the use of multiple scores is effective in the estimation of class membership probabilities in multiclass classification in terms of cross entropy, the reliability diagram, the ROC curve and AUC (area under the ROC curve), and that the proposed smoothing method for the accuracy table works quite well. Finally, we show empirically that in terms of MSE (mean squared error), our best proposed method is superior to an expansion for multiclass classification of a PAV method proposed by Zadrozny et al., in both the 20 Newsgroups dataset and the Pendigits dataset, but is slightly worse than the state-of-the-art method, which is an expansion for multiclass classification of a combination of boosting and a PAV method, on the Pendigits dataset.
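The parametric idea (feeding multiple scores into a logistic model rather than calibrating one score per binary sub-problem) can be sketched with scikit-learn as a stand-in: fit a multinomial logistic regression on the full per-class score vector of a held-out calibration set. The dataset and classifiers below are placeholders for illustration, not the authors' setup.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Base multiclass classifier; its decision_function yields one score per class.
svm = LinearSVC(dual=False).fit(X_train, y_train)

# Calibrate on held-out data: multinomial logistic regression over the
# full score vector, i.e. "multiple scores" rather than a single score.
calibrator = LogisticRegression(max_iter=1000)
calibrator.fit(svm.decision_function(X_cal), y_cal)

probs = calibrator.predict_proba(svm.decision_function(X_test))
print("mean predicted probability of the true class:",
      probs[np.arange(len(y_test)), y_test].mean())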

3.
The application of the CD3 decision tree induction algorithm to telecommunications customer call data to obtain classification rules is described. CD3 is robust against drift in the underlying rules over time (concept drift): it both detects drift and protects the induction process from its effects. Specifically, the task is to data mine customer details and call records to determine whether the profile of customers registering for a friends and family service is changing over time and to maintain a rule set profiling such customers. CD3 and the rationale behind it are described and experimental results on customer data are presented.

4.
After concept drift is detected, existing drift handling algorithms typically retrain the classifier on the newly arrived concept and "forget" the previously trained classifiers. In the early stage of a drift, only a few samples of the new concept are available, so the newly built classifier cannot be sufficiently trained in a short time and its classification performance is usually poor. Furthermore, existing data stream classification algorithms based on online transfer learning can use the knowledge of only a single historical classifier to assist learning of the new concept; when the historical concept is not very similar to the new one, the classification accuracy of the model is unsatisfactory. To address these problems, this paper proposes CMOL, a data stream classification algorithm that can exploit the knowledge of multiple historical classifiers. CMOL adopts a dynamic classifier-weight adjustment mechanism and updates the classifier pool according to these weights, so that the pool covers as many concepts as possible. Experiments show that, compared with related algorithms, CMOL adapts to new concepts more quickly when concept drift occurs and achieves higher classification accuracy.
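A rough Python sketch of the pool idea described above, with simplified weight-update and eviction rules that are assumptions rather than the published CMOL algorithm:

import numpy as np

class WeightedClassifierPool:
    """Keep several historical classifiers; weight each by its recent
    accuracy on the incoming stream and predict by weighted voting.
    Simplified illustration of a CMOL-like pool, not the published algorithm.
    Assumes integer class labels 0..n_classes-1."""

    def __init__(self, max_size=5, decay=0.9):
        self.members, self.weights = [], []
        self.max_size, self.decay = max_size, decay

    def add(self, clf):
        if len(self.members) == self.max_size:
            drop = int(np.argmin(self.weights))   # evict the least useful concept
            self.members.pop(drop); self.weights.pop(drop)
        self.members.append(clf); self.weights.append(1.0)

    def update_weights(self, X, y):
        # Exponentially smoothed accuracy on the latest labelled batch.
        for i, clf in enumerate(self.members):
            acc = float(np.mean(clf.predict(X) == y))
            self.weights[i] = self.decay * self.weights[i] + (1 - self.decay) * acc

    def predict(self, X, n_classes):
        votes = np.zeros((len(X), n_classes))
        for clf, w in zip(self.members, self.weights):
            votes[np.arange(len(X)), clf.predict(X)] += w
        return votes.argmax(axis=1)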

5.
Unlabeled training examples are readily available in many applications, but labeled examples are fairly expensive to obtain. For instance, in our previous work on classification of peer-to-peer (P2P) Internet traffic, we observed that only about 25% of examples can be labeled as “P2P” or “NonP2P” using a port-based heuristic rule. We also expect that even fewer examples can be labeled in the future as more and more P2P applications use dynamic ports. This fact motivates us to investigate techniques that enhance the accuracy of P2P traffic classification by exploiting the unlabeled examples. In addition, the Internet data flows dynamically in large volumes (streaming data). In P2P applications, new communities of peers often join and old communities of peers often leave, requiring the classifiers to be capable of updating the model incrementally and dealing with concept drift. Based on these requirements, this paper proposes an incremental Tri-Training (iTT) algorithm. We tested our approach on a real data stream with 7.2 million labeled examples and 20.4 million unlabeled examples. The results show that the iTT algorithm can enhance the accuracy of P2P traffic classification by exploiting unlabeled examples. In addition, it can effectively deal with the dynamic nature of streaming data to detect the changes in communities of peers. We extracted attributes only from the IP layer, eliminating the privacy concern associated with the techniques that use deep packet inspection.
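The Tri-Training core that iTT builds on can be condensed as follows: three classifiers are bootstrapped from the labelled data, and each is retrained with the unlabeled points on which the other two agree. The sketch below uses scikit-learn placeholders and omits iTT's incremental, windowed processing of the stream.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tri_training_round(X_lab, y_lab, X_unlab, base=DecisionTreeClassifier()):
    """One simplified Tri-Training round: bootstrap three classifiers,
    then retrain each on the labelled data plus the unlabeled points
    the other two agree on. Omits iTT's incremental window handling."""
    clfs = []
    for seed in range(3):
        Xb, yb = resample(X_lab, y_lab, random_state=seed)
        clfs.append(clone(base).fit(Xb, yb))

    new_clfs = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        pj, pk = clfs[j].predict(X_unlab), clfs[k].predict(X_unlab)
        agree = pj == pk                      # pseudo-label where the other two agree
        X_aug = np.vstack([X_lab, X_unlab[agree]])
        y_aug = np.concatenate([y_lab, pj[agree]])
        new_clfs.append(clone(base).fit(X_aug, y_aug))
    return new_clfs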

Bijan Raahemi is an assistant professor at the Telfer School of Management, University of Ottawa, Canada, with a cross-appointment with the School of Information Technology and Engineering. He received his Ph.D. in Electrical and Computer Engineering from the University of Waterloo, Canada, in 1997. Prior to joining the University of Ottawa, Dr. Raahemi held several research positions in the telecommunications industry, including Nortel Networks and Alcatel-Lucent, focusing on Computer Networks Architectures and Services, Dynamics of Internet Traffic, Systems Modeling, and Performance Analysis of Data Networks. His current research interests include Knowledge Discovery and Data Mining, Information Systems, and Data Communications Networks. Dr. Raahemi’s work has appeared in several peer-reviewed journals and conference proceedings. He also holds 10 patents in Data Communications. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of the Association for Computing Machinery (ACM). Weicai Zhong is a post-doctoral fellow at the Telfer School of Management, University of Ottawa, Canada. He received a B.S. degree in computer science and technology from Xidian University, Xi’an, China, in 2000 and a Ph.D. in pattern recognition and intelligent systems from Xidian University in 2004. Prior to joining the University of Ottawa, Dr. Zhong was a senior statistician at SPSS Inc. from Jan. 2005 to Dec. 2007. His current research interests include Internet Traffic Identification, Data Mining, and Evolutionary Computation. He is a member of the Institute of Electrical and Electronics Engineers (IEEE). Jing Liu is an Associate Professor at Xidian University, China. She received a B.S. degree in computer science and technology from Xidian University, Xi’an, China, in 2000, and a Ph.D. in circuits and systems from Xidian University in 2004. Her research interests include Data Mining, Evolutionary Computation, and Multiagent Systems. She is a member of the Institute of Electrical and Electronics Engineers (IEEE).

6.
Companies, government agencies, and other organizations are making their data available to the world over the Internet. They often use large online relational tables for this purpose. Users query such tables with front-ends that typically use menus or form fill-in interfaces, but these interfaces rarely give users information about the contents and distribution of the data. Such a situation leads users to waste time and network/server resources posing queries that have zero- or mega-hit results. Generalized query previews enable efficient browsing of large online data tables by supplying data distribution information to users. The data distribution information provides continuous feedback about the size of the result set as the query is being formed. Our paper presents a new user interface architecture and discusses three controlled experiments (with 12, 16, and 48 participants). Our prototype systems provide flexible user interfaces for research and testing of the ideas. The user studies show that for exploratory querying tasks, generalized query previews can speed user performance for certain user domains and can reduce network/server load.

7.
Traditional approaches for text data stream classification usually require the manual labeling of a number of documents, which is an expensive and time-consuming process. In this paper, to overcome this limitation, we propose to classify text streams by keywords without labeled documents, so as to reduce the burden of manual labeling. We build our base text classifiers with the help of keywords and unlabeled documents to classify text streams, and utilize classifier ensemble algorithms to cope with concept drift in text data streams. Experimental results demonstrate that the proposed method can build good classifiers from keywords without manual labeling, and that when the ensemble-based algorithm is used, concept drift in the streams can be well detected and adapted to, performing better than the single-window algorithm.
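The keyword bootstrapping step can be sketched as follows: documents matching a class's keyword list are pseudo-labeled and used to train a Naive Bayes text classifier. The keyword lists and toy documents are invented for illustration, and the chunk-based ensemble and drift handling described in the abstract are not shown.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def keyword_bootstrap_classifier(docs, class_keywords):
    """Pseudo-label unlabeled documents by keyword matching, then train a
    text classifier on the pseudo-labels. Illustrative only; the paper's
    ensemble over stream chunks is not shown."""
    pseudo_X, pseudo_y = [], []
    for doc in docs:
        lowered = doc.lower()
        hits = {c: sum(kw in lowered for kw in kws) for c, kws in class_keywords.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:                    # keep only documents matching some keyword
            pseudo_X.append(doc)
            pseudo_y.append(best)
    vec = TfidfVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(pseudo_X), pseudo_y)
    return vec, clf

docs = ["the striker scored a late goal", "new gpu doubles training speed",
        "the league table after ten matches", "open source compiler released"]
keywords = {"sports": ["goal", "league", "match"], "tech": ["gpu", "compiler", "software"]}
vec, clf = keyword_bootstrap_classifier(docs, keywords)
print(clf.predict(vec.transform(["a stunning goal won the match"])))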

8.
Most data-mining algorithms assume static behavior of the incoming data. In the real world, the situation is different and most continuously collected data streams are generated by dynamic processes, which may change over time, in some cases even drastically. The change in the underlying concept, also known as concept drift, causes the data-mining model generated from past examples to become less accurate and relevant for classifying the current data. Most online learning algorithms deal with concept drift by generating a new model every time a concept drift is detected. On one hand, this solution ensures accurate and relevant models at all times, thus implying an increase in classification accuracy. On the other hand, this approach suffers from a major drawback, which is the high computational cost of generating new models. The problem worsens when concept drift is detected more frequently and, hence, a compromise between computational effort and accuracy is needed. This work describes a series of incremental algorithms that are shown empirically to produce more accurate classification models than the batch algorithms in the presence of concept drift while being computationally cheaper than existing incremental methods. The proposed incremental algorithms are based on an advanced decision-tree learning methodology called “Info-Fuzzy Network” (IFN), which is capable of inducing compact and accurate classification models. The algorithms are evaluated on real-world streams of traffic and intrusion-detection data.

9.
To overcome the impact of concept drift on classification models and improve data stream classification accuracy, a data stream classification model based on a concept drift detection algorithm is proposed. Different detection methods are used for different types of concept drift; by monitoring drift, the model effectively controls how often the classification model is updated, so that the classifier is updated only when needed, improving its classification performance. Experiments on two different datasets, with comparisons against traditional classification models, verify the effectiveness and correctness of the proposed model.
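Since the abstract does not spell out the detectors, the sketch below uses a standard DDM-style monitor as a stand-in: it tracks the streaming error rate and reports warning/drift levels that a wrapper can use to decide when to rebuild the classifier.

import math

class SimpleDDM:
    """DDM-style drift monitor used here as a stand-in for the paper's
    type-specific detectors: warn when p + s >= p_min + 2*s_min,
    signal drift when p + s >= p_min + 3*s_min."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.n, self.errors = 0, 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def update(self, error):              # error: 1 if misclassified, else 0
        self.n += 1
        self.errors += error
        p = self.errors / self.n
        s = math.sqrt(p * (1 - p) / self.n)
        if self.n > 30 and p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if self.n <= 30:
            return "stable"
        if p + s >= self.p_min + 3 * self.s_min:
            self.reset()                  # drift: caller should retrain the model
            return "drift"
        if p + s >= self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"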

10.
Recent machine learning challenges require the capability of learning in non-stationary environments. These challenges imply the development of new algorithms that are able to deal with changes in the underlying problem to be learnt. These changes can be gradual or trend changes, abrupt changes and recurring contexts. As the dynamics of the changes can be very different, existing machine learning algorithms have difficulty coping with them. Several methods using, for instance, ensembles or variable-length windowing have been proposed to approach this task. In this work we propose a new method, for single-layer neural networks, that is based on the introduction of a forgetting function in an incremental online learning algorithm. This forgetting function gives a monotonically increasing importance to new data. Due to the combination of incremental learning and increasing importance assignment, the network forgets rapidly in the presence of changes while maintaining a stable behavior when the context is stationary. The performance of the method has been tested over several regression and classification problems and its results compared with those of previous works. The proposed algorithm has demonstrated high adaptation to changes while maintaining a low consumption of computational resources.
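One concrete realisation of an incremental learner with a forgetting function is recursive least squares with a forgetting factor for a single-layer linear model, sketched below; this is a generic example in the same spirit, not the authors' exact update rule.

import numpy as np

class ForgettingRLS:
    """Single-layer linear model trained online by recursive least squares
    with forgetting factor lam < 1, so newer samples weigh more than older
    ones. Generic sketch in the spirit of the paper, not its exact rule."""

    def __init__(self, n_features, lam=0.98, delta=100.0):
        self.w = np.zeros(n_features)
        self.P = delta * np.eye(n_features)   # inverse correlation estimate
        self.lam = lam

    def partial_fit(self, x, y):
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)          # gain vector
        self.w += k * (y - self.w @ x)        # correct with the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self

    def predict(self, x):
        return np.asarray(x, dtype=float) @ self.w

# Drifting 1-D regression: the slope changes halfway through the stream.
rng = np.random.default_rng(1)
model = ForgettingRLS(n_features=2)           # inputs are [x, bias]
for t in range(2000):
    x = np.array([rng.uniform(-1, 1), 1.0])
    slope = 2.0 if t < 1000 else -1.0
    model.partial_fit(x, slope * x[0] + 0.1 * rng.normal())
print("learned slope and bias after drift:", model.w.round(2))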

11.
Classifying streaming data requires the development of methods which are computationally efficient and able to cope with changes in the underlying distribution of the stream, a phenomenon known in the literature as concept drift. We propose a new method for detecting concept drift which uses an exponentially weighted moving average (EWMA) chart to monitor the misclassification rate of a streaming classifier. Our approach is modular and can hence be run in parallel with any underlying classifier to provide an additional layer of concept drift detection. Moreover, our method is computationally efficient with overhead O(1) and works in a fully online manner with no need to store data points in memory. Unlike many existing approaches to concept drift detection, our method allows the rate of false positive detections to be controlled and kept constant over time.
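The monitoring scheme is concrete enough to sketch directly: keep an EWMA of the 0/1 error stream and flag drift when it rises a few estimated standard deviations above the running error rate. The control-limit constant below is chosen for illustration rather than taken from the paper's analysis of false positive rates.

import math

class EWMADriftDetector:
    """Monitor the 0/1 misclassification stream with an EWMA chart and
    flag drift when the chart exceeds its control limit. O(1) memory and
    update cost. The constant L is illustrative, not the paper's tuned value."""

    def __init__(self, lam=0.2, L=3.0):
        self.lam, self.L = lam, L
        self.n, self.p_hat, self.z = 0, 0.0, 0.0

    def update(self, error):              # error: 1 if the classifier was wrong, else 0
        self.n += 1
        self.p_hat += (error - self.p_hat) / self.n      # running mean error rate
        self.z = (1 - self.lam) * self.z + self.lam * error
        sigma_z = math.sqrt(self.lam / (2 - self.lam)
                            * self.p_hat * (1 - self.p_hat))
        return self.n > 30 and self.z > self.p_hat + self.L * sigma_z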

12.
Analysing online handwritten notes is a challenging problem because of the content heterogeneity and the lack of prior knowledge, as users are free to compose documents that mix text, drawings, tables or diagrams. The task of separating text from non-text strokes is of crucial importance towards automated interpretation and indexing of these documents, but solving this problem requires careful modelling of contextual information, such as the spatial and temporal relationships between strokes. In this work, we present a comprehensive study of contextual information modelling for text/non-text stroke classification in online handwritten documents. Formulating the problem with a conditional random field makes it possible to integrate and combine multiple sources of context, such as several types of spatial and temporal interactions. Experimental results on a publicly available database of freely hand-drawn documents demonstrate the superiority of our approach and the benefit of combining contextual information for solving text/non-text classification.

13.
This study addresses building an interactive system that effectively prompts customers to make their decision while shopping online. It especially targets purchasing as concept articulation, where customers initially have a vague concept of what they want and then gradually clarify it in the course of interaction, a scenario not covered by traditional online shopping systems. This paper proposes information presentation methods to effectively facilitate customers in their concept articulation process, and a framework for interaction design to enable these methods. Specifically, this study builds a system called S-Conart that facilitates purchasing as concept articulation by supporting customers' conception with spatial-arrangement-style information presentation and their conviction with scene information presentation, and then conducts a set of evaluation experiments with the system to verify that the approach used in building the system is effective in facilitating purchasing as concept articulation.

14.
In this paper we introduce two pattern classifiers for non-sparse data (i.e. data with overlapping class distributions) which use the optimal interpolative neural network (OI-net), derived by one of the authors based on a generalized Fock (GF) space formulation. We present a statistical pattern classifier operating as a two-stage algorithm. The first stage consists of a pre-processing operation involving k-NN editing of the original training set T. The operation results in a new training set, Te, which in the second stage is classified by an OI-net constructed by the recursive least squares algorithm. We also propose a new data-specific classifier which has an additional third computational stage, in which samples of the original training set are added to the network piece by piece until satisfactory classification results are obtained. During the computation process the training set is iteratively updated until the number of misclassified samples is minimized. The performance of these two classifiers has been evaluated in some illustrative examples.
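The first, pre-processing stage can be illustrated independently of the OI-net with Wilson-style k-NN editing, which drops every training sample misclassified by the majority vote of its k nearest neighbours; whether this matches the paper's exact editing rule is an assumption.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_edit(X, y, k=3):
    """Wilson-style k-NN editing: drop every training sample misclassified
    by the majority vote of its k nearest neighbours (excluding itself).
    A generic sketch of the paper's first, pre-processing stage."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the point itself
    keep = np.empty(len(X), dtype=bool)
    for i, neigh in enumerate(idx[:, 1:]):
        labels, counts = np.unique(y[neigh], return_counts=True)
        keep[i] = labels[np.argmax(counts)] == y[i]
    return X[keep], y[keep]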

15.
We present an algorithm for robustly analyzing point data arising from sampling a 2D surface embedded in 3D, even in the presence of noise and non-uniform sampling. The algorithm outputs, for each data point, a surface normal, a local surface approximation in the form of a one-ring, the local shape (flat, ridge, bowl, saddle, sharp edge, corner, boundary), the feature size, and a confidence value that can be used to determine areas where the sampling is poor or not surface-like. We show that the normal estimation outperforms traditional fitting approaches, especially when the data points are non-uniformly sampled and in areas of high curvature. We demonstrate surface reconstruction, parameterization, and smoothing using the one-ring neighborhood at each point as an approximation of the full mesh structure.
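A standard building block for this kind of analysis is PCA-based normal estimation: the eigenvector of the local neighbourhood covariance with the smallest eigenvalue approximates the normal, and the smallest eigenvalue's share of the total gives a rough flatness/confidence score. The sketch below is generic, not the paper's specific estimator.

import numpy as np
from scipy.spatial import cKDTree

def estimate_normals(points, k=16):
    """PCA normal estimation: for each point, the eigenvector of the local
    covariance with the smallest eigenvalue approximates the surface normal;
    surface variation = lambda_min / (lambda_0 + lambda_1 + lambda_2) is a
    crude flatness/confidence score. Generic sketch, not the paper's method."""
    points = np.asarray(points, dtype=float)
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)
    normals = np.empty_like(points)
    variation = np.empty(len(points))
    for i, neigh in enumerate(idx):
        cov = np.cov(points[neigh].T)
        evals, evecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
        normals[i] = evecs[:, 0]
        variation[i] = evals[0] / evals.sum()
    return normals, variation

# Noisy samples from the unit sphere: normals should align with the point directions.
rng = np.random.default_rng(0)
pts = rng.normal(size=(2000, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
pts += 0.01 * rng.normal(size=pts.shape)
normals, var = estimate_normals(pts)
print("median |cosine to true normal|:",
      np.median(np.abs(np.sum(normals * pts, axis=1))))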

16.
In recent years, we often deal with an enormous amount of data in a large variety of pattern recognition tasks. Such data require a huge amount of memory space and computation time for processing. One approach to coping with these problems is using prototypes. We propose volume prototypes as an extension of traditional point prototypes. A volume prototype is defined as a geometric configuration that represents some data points inside it. In usage, a volume prototype is akin to a data point rather than to a component of a mixture model. We show a one-pass algorithm for obtaining such prototypes from a data stream, along with an application to classification. An oblivion mechanism is also incorporated to adapt to concept drift.
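A one-pass scheme in this spirit can be sketched as follows: each prototype keeps a centre, a radius and a weight; an incoming point either updates the nearest prototype or spawns a new one, and exponential weight decay plays the role of the oblivion mechanism. Spherical volumes and this particular decay rule are simplifying assumptions.

import numpy as np

class VolumePrototypes:
    """One-pass spherical 'volume prototypes' for streaming data: absorb a
    point into the nearest prototype if it is close enough, otherwise start
    a new prototype; exponential weight decay acts as the oblivion mechanism.
    Spherical volumes and this decay rule are simplifying assumptions."""

    def __init__(self, radius=1.0, decay=0.999, min_weight=0.05):
        self.radius, self.decay, self.min_weight = radius, decay, min_weight
        self.centers, self.weights, self.counts = [], [], []

    def update(self, x):
        x = np.asarray(x, dtype=float)
        self.weights = [w * self.decay for w in self.weights]
        if self.centers:
            d = [np.linalg.norm(x - c) for c in self.centers]
            i = int(np.argmin(d))
            if d[i] <= self.radius:
                self.counts[i] += 1
                self.centers[i] += (x - self.centers[i]) / self.counts[i]  # running mean
                self.weights[i] += 1.0
                self._prune()
                return
        self.centers.append(x.copy()); self.weights.append(1.0); self.counts.append(1)
        self._prune()

    def _prune(self):   # forget prototypes whose weight has decayed away
        keep = [i for i, w in enumerate(self.weights) if w >= self.min_weight]
        self.centers = [self.centers[i] for i in keep]
        self.weights = [self.weights[i] for i in keep]
        self.counts = [self.counts[i] for i in keep]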

17.
Search engines are rapidly emerging as the “go-to” sites for consumers to learn more about a product, concept or term of interest, irrespective of the initial channel in which the interest originated: text, radio, TV, multimedia channels, word of mouth, etc. In this paper we argue that data on the search terms used by consumers can provide valuable measures and indicators of consumer interest in a product, concept or term. Such data can be particularly valuable to managers in gauging potential product interest in a new product launch context or consumption interest in the post-release context. Based on this premise, we develop a model of pre-launch search activity and link the pre-launch search behavior and product characteristics to early sales of the product, thus providing a useful forecasting tool. Applying the model in the context of motion pictures, we find that search term usage follows rather predictable patterns in the pre-launch and post-launch periods and that the model provides significant power in forecasting release-week sales as a function of pre-release search activity. With advertising data included in the model, we find that the pre-release search data offer additional explanatory and forecasting power, highlighting the ability of the search data to capture other factors, possibly including word of mouth, that affect early sales. We offer specific insights into how managers can use search volume data and the model to plan their new product release.
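The forecasting layer can be illustrated with a simple log-linear regression of release-week sales on pre-release search volume and advertising spend; all numbers below are synthetic placeholders, not the paper's data or estimated coefficients.

import numpy as np

# Illustrative log-linear forecast: log(sales_wk1) ~ log(search) + log(adspend).
# The values below are synthetic placeholders, not the paper's data or estimates.
rng = np.random.default_rng(0)
n = 40                                            # past releases used for fitting
log_search = rng.normal(10, 1, n)                 # pre-release search volume (log scale)
log_ads = rng.normal(14, 1, n)                    # advertising spend (log scale)
log_sales = 1.5 + 0.6 * log_search + 0.3 * log_ads + rng.normal(0, 0.2, n)

X = np.column_stack([np.ones(n), log_search, log_ads])
coef, *_ = np.linalg.lstsq(X, log_sales, rcond=None)   # ordinary least squares fit

new_release = np.array([1.0, 11.2, 14.5])          # hypothetical upcoming title
print("forecast release-week sales:", float(np.exp(new_release @ coef)))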

18.
Lin, Jie, Jingyan. Pattern Recognition, 2008, 41(8): 2447-2460
To track multiple objects through occlusion, either depth information of the scene or prior models of the objects, such as spatial models and smooth/predictable motion models, are usually assumed before tracking. When these assumptions are unreasonable, the tracker may fail. To overcome this limitation, we propose a novel online sample-based framework, inspired by the fact that the corresponding local parts of objects in sequential frames are always similar in local color and texture features and in spatial features relative to the centers of the objects. Experimental results illustrate that the proposed approach works robustly under difficult and complex conditions.

19.
Online auction sites are a target for fraud due to their anonymity, number of potential targets and low likelihood of identification. Researchers have developed methods for identifying fraud. However, these methods must be individually tailored for each type of fraud, since each differs in the characteristics important for its identification. Using supervised learning methods, it is possible to produce classifiers for specific types of fraud by providing a dataset where instances with behaviours of interest are assigned to a separate class. However, this requires multiple labelled datasets: one for each fraud type of interest. It is difficult to use real-world datasets for this purpose since they are difficult to label, often limited in size, and contain zero or multiple suspicious behaviours that may or may not be under investigation. The aims of this work are: (1) to demonstrate that supervised learning together with a validated synthetic data generator can create fraud detection models that are experimentally more accurate than existing methods and that are effective over real data, and (2) to evaluate a set of features for use in general fraud detection that is shown to further improve the performance of the created detection models. The approach is as follows: the data generator is an agent-based simulation modelled on users in commercial online auction data. The simulation is extended with fraud agents which model a known type of online auction fraud called competitive shilling. These agents are added to the simulation to produce the synthetic datasets. Features extracted from this data are used as training data for supervised learning. Using this approach, we optimise an existing fraud detection algorithm and produce classifiers capable of detecting shilling fraud. Experimental results with synthetic data show the new models have significant improvements in detection accuracy. Results with commercial data show the models identify users with suspicious behaviour.
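Once the simulator has produced labelled feature vectors, the detection step reduces to ordinary supervised learning. The sketch below uses a random forest and made-up per-user features as placeholders; neither the features nor the learner are taken from the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder "synthetic" training data of the kind an agent-based simulator
# might emit: one feature row per user plus a shill / non-shill label.
rng = np.random.default_rng(0)
n = 1000
features = np.column_stack([
    rng.poisson(20, n),          # bids placed
    rng.uniform(0, 1, n),        # fraction of bids on one seller's auctions
    rng.uniform(0, 1, n),        # fraction of auctions won
])
# Crude stand-in for the simulator's shilling label: concentrates bids, rarely wins.
labels = (features[:, 1] > 0.7) & (features[:, 2] < 0.2)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(features, labels)

# Score users: rank by predicted probability of shilling behaviour.
suspects = clf.predict_proba(features)[:, 1]
print("top suspect indices:", np.argsort(suspects)[::-1][:5])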

20.
The study of a disease using genetic identification has become possible by using haplotype information. Expectation-maximization algorithms are the standard approach in haplotype analysis. These approaches maximize the likelihood function of a genotypic distribution assuming Hardy-Weinberg equilibrium. However, these methods are time-consuming when applied to sequences of many loci. In this study, we used a genetic algorithm to obtain the haplotype frequencies from the frequencies of genotypes. This work was presented in part at the 13th International Symposium on Artificial Life and Robotics, Oita, Japan, January 31–February 2, 2008.
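A toy version of the idea for two biallelic loci: candidate haplotype-frequency vectors (for AB, Ab, aB, ab) are evolved by a genetic algorithm to maximise the Hardy-Weinberg likelihood of observed genotype counts, where only the double heterozygote requires summing over two haplotype pairings. The genotype counts and GA settings below are illustrative, not from the paper.

import numpy as np

rng = np.random.default_rng(0)

# Counts of the 9 two-locus genotypes, in the order used by genotype_probs().
counts = np.array([20, 35, 12, 30, 60, 25, 10, 28, 16])

def genotype_probs(p):                 # p = (p_AB, p_Ab, p_aB, p_ab)
    p0, p1, p2, p3 = p
    return np.array([
        p0**2, 2*p0*p1, p1**2,         # AABB, AABb, AAbb
        2*p0*p2,                       # AaBB
        2*p0*p3 + 2*p1*p2,             # AaBb (ambiguous: two haplotype pairings)
        2*p1*p3,                       # Aabb
        p2**2, 2*p2*p3, p3**2,         # aaBB, aaBb, aabb
    ])

def log_likelihood(p):
    return float(counts @ np.log(genotype_probs(p) + 1e-12))

pop = rng.dirichlet(np.ones(4), size=60)          # initial frequency vectors
for _ in range(200):
    fitness = np.array([log_likelihood(p) for p in pop])
    parents = pop[np.argsort(fitness)[-30:]]      # truncation selection
    children = []
    for _ in range(len(pop)):
        a, b = parents[rng.integers(30, size=2)]
        child = 0.5 * (a + b)                     # arithmetic crossover
        child = np.abs(child + rng.normal(0, 0.02, 4))   # mutation, kept non-negative
        children.append(child / child.sum())      # renormalise onto the simplex
    pop = np.array(children)

best = pop[np.argmax([log_likelihood(p) for p in pop])]
print("estimated haplotype frequencies (AB, Ab, aB, ab):", best.round(3))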
