Similar Literature
20 similar documents found.
1.
To address the difficulty of obtaining class-labeled sample data sets when building a dynamic Bayesian network (DBN) classification model, a semi-supervised active DBN learning algorithm based on EM and classification loss is proposed. The EM algorithm in semi-supervised learning can effectively exploit unlabeled samples to learn a DBN classification model, but erroneous class information is easily incorporated during the iterations, which harms the model's accuracy. Classification-loss-based active learning is borrowed into the EM procedure so that the algorithm can autonomously select useful unlabeled samples and ask the user to label them; once these samples are added to the training set, the model's uncertainty in classifying the remaining unlabeled samples is reduced as much as possible. Experiments show that the algorithm significantly improves the efficiency and performance of the DBN learner and converges quickly to the predefined classification accuracy.
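The key step is choosing the unlabeled samples whose labeling would most reduce the model's classification uncertainty. A minimal, hedged sketch of one common stand-in for that criterion, entropy-based uncertainty sampling with a generic probabilistic classifier (the original work uses a DBN and an expected-classification-loss criterion, which are not reproduced here); all names and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in for the DBN classifier

def select_most_uncertain(model, X_unlabeled, n_queries=5):
    """Return indices of the unlabeled samples the model is least certain about."""
    proba = model.predict_proba(X_unlabeled)                   # class posteriors
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)   # predictive entropy per sample
    return np.argsort(entropy)[-n_queries:]                    # highest-entropy samples

# Usage sketch: fit on the labeled pool, then ask the user to label the selected samples.
rng = np.random.default_rng(0)
X_lab, y_lab = rng.normal(size=(20, 4)), rng.integers(0, 2, 20)
X_unl = rng.normal(size=(200, 4))
model = LogisticRegression().fit(X_lab, y_lab)
query_idx = select_most_uncertain(model, X_unl)
print("samples to send to the user for labeling:", query_idx)
```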

2.
Text Classification from Labeled and Unlabeled Documents using EM   (cited by 51; 0 self-citations, 51 by others)
This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
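A minimal sketch of the basic EM-style procedure described above, using a hard-label variant with scikit-learn's MultinomialNB and a weighting factor lam for the unlabeled documents (the paper's soft, probabilistic E-step and multiple-mixture-components extension are not reproduced); function and parameter names are illustrative:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_lab, y_lab, X_unl, lam=0.3, max_iter=20):
    """EM-style semi-supervised naive Bayes with hard pseudo-labels.

    lam down-weights the contribution of unlabeled documents, echoing the
    paper's weighting-factor extension (the paper uses soft posteriors).
    """
    clf = MultinomialNB().fit(X_lab, y_lab)                       # 1) train on labeled documents
    prev = None
    for _ in range(max_iter):
        pseudo = clf.predict(X_unl)                               # 2) (hard) E-step: label unlabeled docs
        X_all = np.vstack([X_lab, X_unl])
        y_all = np.concatenate([y_lab, pseudo])
        w = np.concatenate([np.ones(len(y_lab)), lam * np.ones(len(pseudo))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w)  # 3) M-step: retrain on everything
        if prev is not None and np.array_equal(prev, pseudo):     # 4) iterate to convergence
            break
        prev = pseudo
    return clf
```

Dense bag-of-words count matrices (for example, from scikit-learn's CountVectorizer) would serve as X_lab and X_unl; the original algorithm's soft-label E-step could replace the hard predict call.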

3.
Semi-supervised learning has attracted a significant amount of attention in pattern recognition and machine learning. Most previous studies have focused on designing special algorithms to effectively exploit the unlabeled data in conjunction with labeled data. Our goal is to improve the classification accuracy of any given supervised learning algorithm by using the available unlabeled examples. We call this the semi-supervised improvement problem, to distinguish the proposed approach from the existing approaches. We design a meta-semi-supervised learning algorithm that wraps around the underlying supervised algorithm and improves its performance using unlabeled data. This problem is particularly important when we need to train a supervised learning algorithm with a limited number of labeled examples and a multitude of unlabeled examples. We present a boosting framework for semi-supervised learning, termed SemiBoost. The key advantages of the proposed semi-supervised learning approach are: 1) performance improvement of any supervised learning algorithm with a multitude of unlabeled data, 2) efficient computation by the iterative boosting algorithm, and 3) exploitation of both the manifold and the cluster assumptions in training classification models. An empirical study on 16 different data sets and text categorization demonstrates that the proposed framework improves the performance of several commonly used supervised learning algorithms, given a large number of unlabeled examples. We also show that the performance of the proposed algorithm, SemiBoost, is comparable to the state-of-the-art semi-supervised learning algorithms.

4.
An Incremental Bayesian Classification Model   (cited by 40; 0 self-citations, 40 by others)
Classification has long been a core problem in machine learning, pattern recognition, and data mining research. When learning classification knowledge from massive data, especially when obtaining large numbers of class-labeled samples is costly, incremental learning is an effective way to address the problem. This paper applies the naive Bayes method to incremental classification and proposes an incremental Bayesian learning model, together with an incremental Bayesian inference procedure that includes incrementally revising the classifier parameters and incrementally classifying test samples. Experimental results show that the algorithm is feasible and effective.
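A minimal sketch of how naive Bayes parameters can be revised incrementally as new labeled samples arrive, assuming discrete features and Laplace smoothing; this follows the general idea of incremental Bayes updating rather than the paper's exact formulation, and all names and the toy data are illustrative:

```python
from collections import defaultdict

class IncrementalNaiveBayes:
    """Discrete naive Bayes whose counts are revised one sample at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)                      # N(c)
        self.feat_counts = defaultdict(lambda: defaultdict(int))  # N(feature i = v, class c)
        self.total = 0

    def update(self, x, c):
        """Incrementally revise the parameters with one sample x (list of feature values)."""
        self.class_counts[c] += 1
        self.total += 1
        for i, v in enumerate(x):
            self.feat_counts[(i, v)][c] += 1

    def predict(self, x):
        """Classify x with Laplace-smoothed counts accumulated so far."""
        best, best_score = None, float("-inf")
        n_classes = len(self.class_counts)
        for c, nc in self.class_counts.items():
            score = (nc + 1) / (self.total + n_classes)                # P(c)
            for i, v in enumerate(x):
                # Smoothing assumes roughly two values per feature; adjust for larger domains.
                score *= (self.feat_counts[(i, v)][c] + 1) / (nc + 2)  # P(x_i | c)
            if score > best_score:
                best, best_score = c, score
        return best

# Usage sketch: update with each incoming labeled sample, classify as needed.
nb = IncrementalNaiveBayes()
nb.update(["sunny", "hot"], "no")
nb.update(["rain", "cool"], "yes")
nb.update(["rain", "mild"], "yes")
print(nb.predict(["rain", "hot"]))   # -> "yes"
```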

5.
This paper proposes a semi-supervised Bayesian ARTMAP (SSBA) which integrates the advantages of both Bayesian ARTMAP (BA) and the Expectation Maximization (EM) algorithm. SSBA adopts the training framework of BA, which makes SSBA adaptively generate categories to represent the distribution of both labeled and unlabeled training samples without any user intervention. In addition, SSBA employs the EM algorithm to adjust its parameters, which realizes a soft assignment of training samples to categories instead of a hard assignment such as winner-takes-all. Experimental results on benchmark and real world data sets indicate that the proposed SSBA achieves significantly improved performance compared with BA and the EM-based semi-supervised learning method; SSBA is appropriate for semi-supervised classification tasks with a large amount of unlabeled samples or with strict demands for classification accuracy.

6.
In real-world data mining applications, it is often the case that unlabeled instances are abundant, while available labeled instances are very limited. Thus, semi-supervised learning, which attempts to benefit from a large amount of unlabeled data together with labeled data, has attracted much attention from researchers. In this paper, we propose a very fast and yet highly effective semi-supervised learning algorithm. We call our proposed algorithm Instance Weighted Naive Bayes (simply IWNB). IWNB first trains a naive Bayes classifier using the labeled instances only, and the trained naive Bayes is used to estimate the class membership probabilities of the unlabeled instances. Then, the estimated class membership probabilities are used to label and weight the unlabeled instances. Finally, a naive Bayes classifier is trained again using both the originally labeled data and the (newly labeled and weighted) unlabeled data. Our experimental results based on a large number of UCI data sets show that IWNB often improves the classification accuracy of the original naive Bayes when the available labeled data are very limited.
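A minimal sketch of the IWNB procedure as described above, using scikit-learn's GaussianNB as the naive Bayes learner (the choice of GaussianNB and all names here are illustrative assumptions; the paper's experiments use UCI data):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def iwnb(X_lab, y_lab, X_unl):
    """Instance Weighted Naive Bayes: label and weight unlabeled instances, then retrain."""
    nb = GaussianNB().fit(X_lab, y_lab)               # 1) train on labeled instances only
    proba = nb.predict_proba(X_unl)                    # 2) estimate class membership probabilities
    pseudo = nb.classes_[np.argmax(proba, axis=1)]     # 3) label each unlabeled instance ...
    weights = np.max(proba, axis=1)                    #    ... and weight it by that probability
    X_all = np.vstack([X_lab, X_unl])
    y_all = np.concatenate([y_lab, pseudo])
    w_all = np.concatenate([np.ones(len(y_lab)), weights])
    return GaussianNB().fit(X_all, y_all, sample_weight=w_all)   # 4) retrain on all data
```

Any naive Bayes variant that accepts per-instance weights could be substituted for GaussianNB here.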

7.
The traditional setting of supervised learning requires a large amount of labeled training examples in order to achieve good generalization. However, in many practical applications, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semisupervised learning has attracted much attention. Previous research on semisupervised learning mainly focuses on semisupervised classification. Although regression is almost as important as classification, semisupervised regression is largely understudied. In particular, although cotraining is a main paradigm in semisupervised learning, few works have been devoted to cotraining-style semisupervised regression algorithms. In this paper, a cotraining-style semisupervised regression algorithm, COREG, is proposed. This algorithm uses two regressors, each of which labels the unlabeled data for the other regressor, where the confidence in labeling an unlabeled example is estimated through the amount of reduction in mean squared error over the labeled neighborhood of that example. Analysis and experiments show that COREG can effectively exploit unlabeled data to improve regression estimates.
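A minimal sketch of COREG's confidence criterion as summarized above: the confidence in pseudo-labeling an unlabeled example is the reduction in mean squared error over that example's labeled neighborhood after the regressor is refit with the pseudo-labeled point (one regressor, one candidate; names and parameters are illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

def coreg_confidence(X_lab, y_lab, x_u, k=3):
    """MSE reduction over x_u's labeled neighborhood when x_u is added with its predicted label."""
    h = KNeighborsRegressor(n_neighbors=k).fit(X_lab, y_lab)
    y_u = h.predict(x_u.reshape(1, -1))[0]                               # pseudo-label for x_u
    nn = NearestNeighbors(n_neighbors=k).fit(X_lab)
    idx = nn.kneighbors(x_u.reshape(1, -1), return_distance=False)[0]    # labeled neighborhood of x_u
    h_new = KNeighborsRegressor(n_neighbors=k).fit(
        np.vstack([X_lab, x_u]), np.append(y_lab, y_u))                  # refit including (x_u, y_u)
    before = np.mean((y_lab[idx] - h.predict(X_lab[idx])) ** 2)
    after = np.mean((y_lab[idx] - h_new.predict(X_lab[idx])) ** 2)
    return before - after, y_u                                           # positive => labeling x_u helps
```

In the full algorithm, two regressors (for example, kNN regressors with different distance settings) alternately pick the highest-confidence unlabeled points for each other.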

8.
The accuracy of a classification-based surrogate model for reliability assessment can be improved by augmenting the training data (labeled data or data with known responses) with a large number of unlabeled data (data with unknown responses) in semi-supervised learning methods. In this research, an enhanced Probabilistic Neural Network (PNN) algorithm is proposed where the Gaussians at each labeled point are not assumed to be spherical. Each of the Gaussians has a ‘full’ covariance matrix instead of simply assuming the Gaussian with a ‘spherical’ covariance matrix. First, the Expectation-Maximization algorithm is applied on the labeled and unlabeled data while assuming that the number of ‘full’ Gaussians is equal to the number of labeled datapoints. The contribution of each of these ‘full’ Gaussians at a particular datapoint is found by using Bayes' theorem. The Bayes decision criterion is then used in the final output layer of the PNN to classify test patterns into either the safe or the failure class. The primary benefit of the proposed method comes from utilizing unlabeled data for better estimation of the ‘full’ covariance matrices of the constituting Gaussian clusters of the underlying data, which are then used to estimate the probability density functions of the classes for classification. This procedure does not require additional computational costs to improve the accuracy of the classification results since the cost of unlabeled data is negligible in general. Two examples, including an analytic problem and a truss problem, are presented in order to validate the proposed reliability estimation process. The results reflect considerable improvements of the classifier performance for estimating reliability while maintaining sufficient accuracy.

9.
In many machine learning settings, labeled examples are difficult to collect while unlabeled data are abundant. Also, for some binary classification problems, positive examples which are elements of the target concept are available. Can these additional data be used to improve the accuracy of supervised learning algorithms? We investigate in this paper the design of learning algorithms from positive and unlabeled data only. Many machine learning and data mining algorithms, such as decision tree induction algorithms and naive Bayes algorithms, use examples only to evaluate statistical queries (SQ-like algorithms). Kearns designed the statistical query learning model in order to describe these algorithms. Here, we design an algorithm scheme which transforms any SQ-like algorithm into an algorithm based on positive statistical queries (estimates of probabilities over the set of positive instances) and instance statistical queries (estimates of probabilities over the instance space). We prove that any class learnable in the statistical query learning model is learnable from positive statistical queries and instance statistical queries only if a lower bound on the weight of any target concept f can be estimated in polynomial time. Then, we design a decision tree induction algorithm POSC4.5, based on C4.5, that uses only positive and unlabeled examples, and we give experimental results for this algorithm. In the case of imbalanced classes, in the sense that one of the two classes (say the positive class) is heavily underrepresented compared to the other class, the learning problem remains open. This problem is challenging because it is encountered in many real-world applications.

10.
As a recently proposed machine learning method, active learning of Gaussian processes can effectively use a small number of labeled examples to train a classifier, which in turn is used to select the most informative examples from unlabeled data for manual labeling. However, in the process of example selection, active learning usually needs to consider all the unlabeled data without exploiting the structural space connectivity among them. This decreases the classification accuracy to some extent, since the selected points may not be the most informative. To overcome this shortcoming, in this paper we present a method which applies the manifold-preserving graph reduction (MPGR) algorithm to the traditional active learning method of Gaussian processes. MPGR is a simple and efficient example sparsification algorithm which can construct a subset to represent the global structure and simultaneously eliminate the influence of noisy points and outliers. Thus, when actively selecting examples to label, we choose only from the subset constructed by MPGR instead of from the whole unlabeled data. We report experimental results on multiple data sets which demonstrate that our method obtains better classification performance compared with the original active learning method of Gaussian processes.
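A minimal sketch of the selection step described above, assuming a candidate subset has already been produced by MPGR (the MPGR construction itself is not shown): a Gaussian process classifier is fit on the labeled data and the most uncertain example is chosen from the subset rather than from all unlabeled data (binary classification assumed; names are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

def select_from_subset(X_lab, y_lab, X_subset):
    """Pick the example in the MPGR subset whose predicted class probability is closest to 0.5."""
    gpc = GaussianProcessClassifier().fit(X_lab, y_lab)
    proba = gpc.predict_proba(X_subset)[:, 1]
    return int(np.argmin(np.abs(proba - 0.5)))   # most uncertain candidate in the subset
```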

11.
Previous partially supervised classification methods can partition unlabeled data into positive examples and negative examples for a given class by learning from positive labeled examples and unlabeled examples, but they cannot further group the negative examples into meaningful clusters even if there are many different classes among the negative examples. Here we propose an automatic method to obtain a natural partitioning of mixed data (labeled data + unlabeled data) by maximizing a stability criterion, defined on classification results from an extended label propagation algorithm, over all possible values of the model order (the number of classes) in the mixed data. Our experimental results on benchmark corpora for the word sense disambiguation task indicate that this model order identification algorithm, with the extended label propagation algorithm as the base classifier, outperforms SVM, a one-class partially supervised classification algorithm, and the model order identification algorithm with semi-supervised k-means clustering as the base classifier when labeled data are incomplete.

12.
Li Nan. 《计算机系统应用》 (Computer Systems & Applications), 2016, 25(12): 187-192
Most existing data stream classification algorithms rely on supervised learning, yet labeling samples on a high-speed data stream is very costly, which limits their practicality. To address this problem, a low-cost data stream classification algorithm, 2SDC, is proposed. The new algorithm uses a small number of class-labeled samples and a large number of unlabeled samples to train and update the classification model, and it dynamically monitors concept drift that may occur on the stream. Experiments on real data streams show that 2SDC not only achieves classification accuracy comparable to current supervised-learning classifiers but can also adapt to concept drift on the stream.

13.
It is a practical and challenging issue to learn cost-sensitive models from datasets with few labeled and plentiful unlabeled data, because labeled data are often difficult, time-consuming, and/or expensive to obtain. To address this issue, in this paper we propose two classification strategies for learning a cost-sensitive classifier from training datasets with both labeled and unlabeled data, based on Expectation Maximization (EM). The first method, Direct-EM, uses EM to build a semi-supervised classifier and then directly computes the optimal class label for each test example using the class probabilities produced by the learned model. The second method, CS-EM, modifies EM by incorporating misclassification cost into the probability estimation process. We conducted extensive experiments to evaluate both methods, and the results show that when using only a small number of labeled training examples, CS-EM outperforms the other competing methods on the majority of the selected UCI data sets across different cost ratios, especially when the cost ratio is high.
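A minimal sketch of the Direct-EM decision step described above: once a (semi-supervised) classifier produces class probabilities, the optimal label for a test example is the one with minimum expected misclassification cost. The cost matrix and all names here are illustrative:

```python
import numpy as np

def cost_sensitive_label(class_proba, cost):
    """Pick the label minimizing expected misclassification cost.

    class_proba: shape (n_classes,), P(true class = j | x) from the learned model.
    cost:        shape (n_classes, n_classes), cost[j, k] = cost of predicting k when truth is j.
    """
    expected_cost = class_proba @ cost        # expected cost of each candidate prediction k
    return int(np.argmin(expected_cost))

# Usage sketch with a 10:1 cost ratio (missing class 1 is expensive).
proba = np.array([0.7, 0.3])                  # model says class 0 is more likely
cost = np.array([[0.0, 1.0],                  # predicting 1 when truth is 0 costs 1
                 [10.0, 0.0]])                # predicting 0 when truth is 1 costs 10
print(cost_sensitive_label(proba, cost))      # prints 1 (expected cost 0.7 vs 3.0 for class 0)
```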

14.
Tri-training: exploiting unlabeled data using three classifiers   (cited by 24; 0 self-citations, 24 by others)
In many practical data mining applications, such as Web page classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm, named tri-training, is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor does it put any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
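A minimal sketch of the tri-training labeling rule summarized above: an unlabeled example is pseudo-labeled for one classifier when the other two classifiers agree on its label. This shows a single round with three bootstrap-trained decision trees; the error-rate conditions that govern when such examples may actually be used, and the iteration until convergence, are omitted, and all names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tri_training_round(X_lab, y_lab, X_unl, random_state=0):
    """One simplified tri-training round: pseudo-label for each classifier where the other two agree."""
    rng = np.random.RandomState(random_state)
    # Three classifiers trained on bootstrap samples of the labeled set.
    clfs = []
    for i in range(3):
        Xb, yb = resample(X_lab, y_lab, random_state=rng)
        clfs.append(DecisionTreeClassifier(random_state=i).fit(Xb, yb))
    preds = [c.predict(X_unl) for c in clfs]
    new_sets = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        agree = preds[j] == preds[k]                       # the other two classifiers agree
        new_sets.append((X_unl[agree], preds[j][agree]))   # pseudo-labeled set offered to classifier i
    return clfs, new_sets
```

In the full algorithm, each pseudo-labeled set is accepted only when the paper's estimated-error conditions hold, and the rounds repeat until no classifier changes.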

15.
An Intrusion Detection Algorithm Based on Evolutionary Semi-Supervised Fuzzy Clustering   (cited by 3; 0 self-citations, 3 by others)
In intrusion detection systems, unlabeled data are easy to obtain while labeled data are hard to obtain. To address this, an intrusion detection algorithm based on evolutionary semi-supervised fuzzy clustering is proposed. The algorithm uses the labeled data to play the role of chromosomes and to guide the evolution of each fuzzy cluster of the unlabeled data; it can generate an intrusion detection classifier from a small amount of labeled data and a large amount of unlabeled data, can handle fuzzy class labels, is not prone to getting stuck in local optima, and is suitable for parallel implementation. Experimental results show that the algorithm achieves a high detection rate.

16.
Word Sense Disambiguation by Learning Decision Trees from Unlabeled Data   (cited by 1; 0 self-citations, 1 by others)
In this paper we describe a machine learning approach to word sense disambiguation that uses unlabeled data. Our method is based on selective sampling with committees of decision trees. The committee members are trained on a small set of labeled examples which are then augmented by a large number of unlabeled examples. Using unlabeled examples is important because obtaining labeled data is expensive and time-consuming while it is easy and inexpensive to collect a large number of unlabeled examples. The idea behind this approach is that the labels of unlabeled examples can be estimated by using committees. Using additional unlabeled examples, therefore, improves the performance of word sense disambiguation and minimizes the cost of manual labeling. The effectiveness of this approach was examined on a raw corpus of one million words. Using unlabeled data, we achieved an accuracy improvement of up to 20.2%.
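A minimal sketch of committee-based selective sampling in the spirit of the description above: decision trees trained on bootstrap replicates of the small labeled set estimate labels for unlabeled examples by voting, and the examples with the most disagreement (highest vote entropy) are selected for manual labeling. Feature extraction for word sense disambiguation is not shown, and all names are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def committee_select(X_lab, y_lab, X_unl, n_members=5, n_queries=10):
    """Select the unlabeled examples the committee disagrees on most (vote entropy)."""
    committee = []
    for m in range(n_members):
        Xb, yb = resample(X_lab, y_lab, random_state=m)      # bootstrap replicate of labeled data
        committee.append(DecisionTreeClassifier(random_state=m).fit(Xb, yb))
    votes = np.array([c.predict(X_unl) for c in committee])   # shape (n_members, n_unlabeled)
    classes = np.unique(y_lab)
    entropy = np.zeros(votes.shape[1])                        # vote entropy per unlabeled example
    for c in classes:
        frac = np.mean(votes == c, axis=0)
        entropy -= frac * np.log(frac + 1e-12)                # tiny constant avoids log(0)
    return np.argsort(entropy)[-n_queries:]                   # most-disagreed-upon examples
```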

17.
Transduction is an inference mechanism adopted from several classification algorithms capable of exploiting both labeled and unlabeled data and making the prediction for the given set of unlabeled data only. Several transductive learning methods have been proposed in the literature to learn transductive classifiers from examples represented as rows of a classical double-entry table (or relational table). In this work we consider the case of examples represented as a set of multiple tables of a relational database and we propose a new relational classification algorithm, named TRANSC, that works in a transductive setting and employs a probabilistic approach to classification. Knowledge on the data model, i.e., foreign keys, is used to guide the search process. The transductive learning strategy iterates on a k-NN based re-classification of labeled and unlabeled examples, in order to identify borderline examples, and uses the relational probabilistic classifier Mr-SBC to bootstrap the transductive algorithm. Experimental results confirm that TRANSC outperforms its inductive counterpart (Mr-SBC).

18.
Learning from labeled and unlabeled data using a minimal number of queries   (cited by 4; 0 self-citations, 4 by others)
The considerable time and expense required for labeling data has prompted the development of algorithms which maximize the classification accuracy for a given amount of labeling effort. On the one hand, the effort has been to develop the so-called "active learning" algorithms which sequentially choose the patterns to be explicitly labeled so as to realize the maximum information gain from each labeling. On the other hand, the effort has been to develop algorithms that can learn from labeled as well as the more abundant unlabeled data. Proposed in this paper is an algorithm that integrates the benefits of active learning with the benefits of learning from labeled and unlabeled data. Our approach is based on reversing the roles of the labeled and unlabeled data. Specifically, we use a Genetic Algorithm (GA) to iteratively refine the class membership of the unlabeled patterns so that the maximum a posteriori (MAP) based predicted labels of the patterns in the labeled dataset are in agreement with the known labels. This reversal of the role of labeled and unlabeled patterns leads to an implicit class assignment of the unlabeled patterns. For active learning, we use a subset of the GA population to construct multiple MAP classifiers. Points in the input space where there is maximal disagreement amongst these classifiers are then selected for explicit labeling. The learning from labeled and unlabeled data and active learning phases are interlaced and together provide accurate classification while minimizing the labeling effort.

19.
Most machine learning tasks in data classification and information retrieval require manually labeled data examples in the training stage. The goal of active learning is to select the most informative examples for manual labeling in these learning tasks. Most of the previous studies in active learning have focused on selecting a single unlabeled example in each iteration. This could be inefficient, since the classification model has to be retrained for every acquired labeled example. It is also inappropriate for the setup of information retrieval tasks where the user's relevance feedback is often provided for the top K retrieved items. In this paper, we present a framework for batch mode active learning, which selects a number of informative examples for manual labeling in each iteration. The key feature of batch mode active learning is to reduce the redundancy among the selected examples such that each example provides unique information for model updating. To this end, we employ the Fisher information matrix as the measurement of model uncertainty, and choose the set of unlabeled examples that can efficiently reduce the Fisher information of the classification model. We apply our batch mode active learning framework to both text categorization and image retrieval. Promising results show that our algorithms are significantly more effective than the active learning approaches that select unlabeled examples based only on their informativeness for the classification model.
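A minimal, simplified sketch of the idea described above for binary logistic regression: the Fisher information contributed by a candidate set S is the sum of p(x)(1-p(x)) x x^T over x in S, and a batch can be built greedily by adding the unlabeled example that most increases a scalar summary of that information (here, the log-determinant). This greedy log-det rule is only a stand-in for the paper's actual Fisher-information criterion, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def greedy_batch(model, X_unl, batch_size=5, ridge=1e-3):
    """Greedily build a batch that maximizes log det of the (regularized) Fisher information."""
    p = model.predict_proba(X_unl)[:, 1]                   # assumes a fitted binary classifier
    weights = p * (1.0 - p)                                # p(x)(1-p(x)) per candidate
    d = X_unl.shape[1]
    info = ridge * np.eye(d)                               # small ridge keeps the matrix invertible
    chosen = []
    for _ in range(batch_size):
        best, best_gain = None, -np.inf
        for i in range(len(X_unl)):
            if i in chosen:
                continue
            x = X_unl[i:i + 1]
            cand = info + weights[i] * (x.T @ x)           # add this candidate's Fisher contribution
            gain = np.linalg.slogdet(cand)[1]
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
        x = X_unl[best:best + 1]
        info = info + weights[best] * (x.T @ x)
    return chosen
```

The original framework instead minimizes a Fisher-information ratio between the selected and unselected data; the greedy log-det heuristic above only approximates the intent of reducing model uncertainty while avoiding redundant selections.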

20.
This paper proposes an algorithm based on semi-supervised active learning to address the difficulty of obtaining a large set of class-labeled samples when building a dynamic Bayesian network (DBN) classification model. Semi-supervised learning can effectively exploit unlabeled samples to learn a DBN classification model, but erroneous class information is easily incorporated during the iterations, which in turn harms the model's accuracy. By borrowing active learning into the semi-supervised process, the algorithm can autonomously select useful unlabeled samples and ask the user to label them. After these samples are added to the training set, the accuracy of the semi-supervised classification of the unlabeled samples can be improved to the greatest extent. Experimental results show that the algorithm significantly improves the efficiency and performance of the DBN learner and converges quickly to the predefined classification accuracy.
