Similar Documents
A total of 20 similar documents were found.
1.
Tri-training: exploiting unlabeled data using three classifiers   Cited by 24 (self-citations: 0, other citations: 24)
In many practical data mining applications, such as Web page classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms such as co-training have attracted much attention. In this paper, a new co-training style semi-supervised learning algorithm, named tri-training, is proposed. This algorithm generates three classifiers from the original labeled example set. These classifiers are then refined using unlabeled examples in the tri-training process. In detail, in each round of tri-training, an unlabeled example is labeled for a classifier if the other two classifiers agree on the labeling, under certain conditions. Since tri-training neither requires the instance space to be described with sufficient and redundant views nor does it put any constraints on the supervised learning algorithm, its applicability is broader than that of previous co-training style algorithms. Experiments on UCI data sets and application to the Web page classification task indicate that tri-training can effectively exploit unlabeled data to enhance the learning performance.
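As a rough illustration of the labeling rule described above, the following Python sketch runs one simplified tri-training round with scikit-learn decision trees: each classifier receives the unlabeled points on which the other two agree. Function and variable names (`tri_training_round`, `clfs`) are illustrative, and the error-based safeguards ("under certain conditions") of the original algorithm are omitted.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

def tri_training_round(clfs, X_labeled, y_labeled, X_unlabeled):
    """One simplified round: each classifier is retrained on the unlabeled points
    that the other two classifiers agree on, labeled with the agreed class."""
    new_sets = []
    for i in range(3):
        j, k = [m for m in range(3) if m != i]
        pred_j = clfs[j].predict(X_unlabeled)
        pred_k = clfs[k].predict(X_unlabeled)
        agree = pred_j == pred_k                       # the other two classifiers agree
        new_sets.append((X_unlabeled[agree], pred_j[agree]))
    for i, (X_new, y_new) in enumerate(new_sets):
        X_aug = np.vstack([X_labeled, X_new])
        y_aug = np.concatenate([y_labeled, y_new])
        clfs[i].fit(X_aug, y_aug)                      # retrain on labeled + agreed points
    return clfs

# initial classifiers trained on bootstrap samples of the labeled set
rng = np.random.RandomState(0)
X_l, y_l = rng.rand(30, 5), rng.randint(0, 2, 30)
X_u = rng.rand(200, 5)
clfs = [DecisionTreeClassifier(random_state=s).fit(*resample(X_l, y_l, random_state=s))
        for s in range(3)]
clfs = tri_training_round(clfs, X_l, y_l, X_u)
```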

2.
Traditional learning algorithms use only labeled data for training. However, labeled examples are often difficult or time consuming to obtain since they require substantial human labeling efforts. On the other hand, unlabeled data are often relatively easy to collect. Semisupervised learning addresses this problem by using large quantities of unlabeled data with labeled data to build better learning algorithms. In this paper, we use the manifold regularization approach to formulate the semisupervised learning problem where a regularization framework which balances a tradeoff between loss and penalty is established. We investigate different implementations of the loss function and identify the methods which have the least computational expense. The regularization hyperparameter, which determines the balance between loss and penalty, is crucial to model selection. Accordingly, we derive an algorithm that can fit the entire path of solutions for every value of the hyperparameter. Its computational complexity after preprocessing is quadratic only in the number of labeled examples rather than the total number of labeled and unlabeled examples.
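The regularization framework balancing loss and penalty can be illustrated with Laplacian-regularized least squares, one standard instance of manifold regularization. The sketch below is a minimal version assuming an RBF kernel and a k-NN graph; the graph-normalization constants are folded into `gamma_I`, and the paper's solution-path algorithm over the hyperparameter is not shown.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import kneighbors_graph

def laprls_fit(X_l, y_l, X_u, gamma_A=1e-2, gamma_I=1e-2, n_neighbors=5):
    """Laplacian-regularized least squares over labeled + unlabeled points."""
    X = np.vstack([X_l, X_u])
    l, n = len(X_l), len(X_l) + len(X_u)
    K = rbf_kernel(X)                                   # kernel expansion over all points
    W = kneighbors_graph(X, n_neighbors, mode='connectivity', include_self=False)
    W = 0.5 * (W + W.T).toarray()                       # symmetrized k-NN graph
    L = np.diag(W.sum(axis=1)) - W                      # graph Laplacian (manifold penalty)
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)         # picks out the labeled part (loss term)
    y = np.concatenate([y_l, np.zeros(n - l)])
    # representer-theorem solution; normalization constants folded into gamma_I
    alpha = np.linalg.solve(J @ K + gamma_A * l * np.eye(n) + gamma_I * l * (L @ K), y)
    return lambda X_new: rbf_kernel(X_new, X) @ alpha   # decision function

rng = np.random.RandomState(1)
f = laprls_fit(rng.rand(20, 3), np.where(rng.rand(20) > 0.5, 1.0, -1.0), rng.rand(100, 3))
print(np.sign(f(rng.rand(5, 3))))
```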

3.
Semi-supervised learning has attracted a significant amount of attention in pattern recognition and machine learning. Most previous studies have focused on designing special algorithms to effectively exploit the unlabeled data in conjunction with labeled data. Our goal is to improve the classification accuracy of any given supervised learning algorithm by using the available unlabeled examples. We call this the semi-supervised improvement problem, to distinguish the proposed approach from the existing approaches. We design a meta-semi-supervised learning algorithm that wraps around the underlying supervised algorithm and improves its performance using unlabeled data. This problem is particularly important when we need to train a supervised learning algorithm with a limited number of labeled examples and a multitude of unlabeled examples. We present a boosting framework for semi-supervised learning, termed SemiBoost. The key advantages of the proposed semi-supervised learning approach are: 1) performance improvement of any supervised learning algorithm with a multitude of unlabeled data, 2) efficient computation by the iterative boosting algorithm, and 3) exploiting both the manifold and cluster assumptions in training classification models. An empirical study on 16 different data sets and text categorization demonstrates that the proposed framework improves the performance of several commonly used supervised learning algorithms, given a large number of unlabeled examples. We also show that the performance of the proposed algorithm, SemiBoost, is comparable to that of state-of-the-art semi-supervised learning algorithms.

4.
An Incremental Bayesian Classification Model   Cited by 40 (self-citations: 0, other citations: 40)
Classification has long been a core problem in machine learning, pattern recognition, and data mining research. Learning classification knowledge from massive data is challenging, especially when obtaining a large number of class-labeled examples is costly; incremental learning is an effective way to address this problem. This paper applies the naive Bayes method to incremental classification and proposes an incremental Bayesian learning model. The incremental Bayesian inference process is presented, including incrementally revising the classifier parameters and incrementally classifying test examples. Experimental results show that the algorithm is feasible and effective.
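The idea of incrementally revising classifier parameters can be sketched with a counts-based naive Bayes model: absorbing a new labeled example only updates class and feature counts, so no retraining from scratch is needed. The class below is an illustrative assumption (discrete features, two values per feature for smoothing), not the paper's exact model.

```python
import math
from collections import defaultdict

class IncrementalNB:
    """Counts-based naive Bayes that can absorb labeled examples one at a time."""
    def __init__(self, alpha=1.0):
        self.alpha = alpha                                # Laplace smoothing
        self.class_counts = defaultdict(int)
        self.feat_counts = defaultdict(lambda: defaultdict(int))

    def update(self, x, y):
        """Incrementally revise the parameters with one example (x: feature -> value)."""
        self.class_counts[y] += 1
        for f, v in x.items():
            self.feat_counts[y][(f, v)] += 1

    def predict(self, x):
        total = sum(self.class_counts.values())
        best, best_lp = None, float('-inf')
        for c, nc in self.class_counts.items():
            lp = math.log((nc + self.alpha) / (total + self.alpha * len(self.class_counts)))
            for f, v in x.items():
                # smoothing denominator assumes two possible values per feature
                lp += math.log((self.feat_counts[c][(f, v)] + self.alpha) / (nc + 2 * self.alpha))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

nb = IncrementalNB()
nb.update({'outlook': 'sunny', 'windy': 'no'}, 'play')
nb.update({'outlook': 'rain', 'windy': 'yes'}, 'stay')
print(nb.predict({'outlook': 'sunny', 'windy': 'yes'}))
```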

5.
Semi-supervised learning trains a classifier with both labeled and unlabeled training examples. This idea is incorporated into the standard SVM training procedure by adding mixed data near the classification surface, yielding a new SVM-based classifier design method, which is then applied to classification problems with small sample sizes. Experiments show that, compared with the traditional SVM, the new SVM-based classifier achieves a considerable improvement in classification accuracy while reducing bias.
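A minimal sketch of the idea of adding mixed data near the classification surface, assuming an RBF-kernel SVM from scikit-learn: train on the labeled data, pick unlabeled points whose decision value lies within a margin threshold `tau` (an illustrative choice), attach their predicted labels, and retrain.

```python
import numpy as np
from sklearn.svm import SVC

def boundary_augmented_svm(X_l, y_l, X_u, tau=0.5, C=1.0):
    svm = SVC(kernel='rbf', C=C).fit(X_l, y_l)
    margin = np.abs(svm.decision_function(X_u))
    near = margin < tau                                 # unlabeled points near the boundary
    X_aug = np.vstack([X_l, X_u[near]])                 # "mixed" data with predicted labels
    y_aug = np.concatenate([y_l, svm.predict(X_u[near])])
    return SVC(kernel='rbf', C=C).fit(X_aug, y_aug)

rng = np.random.RandomState(0)
clf = boundary_augmented_svm(rng.randn(20, 2), (rng.rand(20) > 0.5).astype(int),
                             rng.randn(200, 2))
```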

6.
Most machine learning tasks in data classification and information retrieval require manually labeled data examples in the training stage. The goal of active learning is to select the most informative examples for manual labeling in these learning tasks. Most of the previous studies in active learning have focused on selecting a single unlabeled example in each iteration. This could be inefficient, since the classification model has to be retrained for every acquired labeled example. It is also inappropriate for the setup of information retrieval tasks where the user's relevance feedback is often provided for the top K retrieved items. In this paper, we present a framework for batch mode active learning, which selects a number of informative examples for manual labeling in each iteration. The key feature of batch mode active learning is to reduce the redundancy among the selected examples such that each example provides unique information for model updating. To this end, we employ the Fisher information matrix as the measurement of model uncertainty, and choose the set of unlabeled examples that can efficiently reduce the Fisher information of the classification model. We apply our batch mode active learning framework to both text categorization and image retrieval. Promising results show that our algorithms are significantly more effective than the active learning approaches that select unlabeled examples based only on their informativeness for the classification model.
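A greatly simplified sketch of Fisher-information-driven batch selection for a logistic regression model: greedily add the unlabeled point that most increases the log-determinant of the accumulated Fisher information matrix, so that the chosen batch is both uncertain and non-redundant. The log-det proxy and the helper name `select_batch` are illustrative stand-ins, not the exact criterion optimized in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_batch(model, X_unlabeled, batch_size=5, ridge=1e-3):
    p = model.predict_proba(X_unlabeled)[:, 1]
    weights = p * (1 - p)                               # per-point Fisher weight for logistic regression
    d = X_unlabeled.shape[1]
    F = ridge * np.eye(d)                               # accumulated Fisher information of the batch
    chosen = []
    for _ in range(batch_size):
        best_i, best_gain = None, -np.inf
        for i in range(len(X_unlabeled)):
            if i in chosen:
                continue
            Fi = F + weights[i] * np.outer(X_unlabeled[i], X_unlabeled[i])
            gain = np.linalg.slogdet(Fi)[1]             # log-det as an information proxy
            if gain > best_gain:
                best_i, best_gain = i, gain
        chosen.append(best_i)
        F += weights[best_i] * np.outer(X_unlabeled[best_i], X_unlabeled[best_i])
    return chosen

rng = np.random.RandomState(0)
X_l, y_l = rng.randn(40, 6), (rng.rand(40) > 0.5).astype(int)
model = LogisticRegression().fit(X_l, y_l)
print(select_batch(model, rng.randn(300, 6)))
```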

7.
To address the difficulty of obtaining class-labeled datasets when building a dynamic Bayesian network (DBN) classification model, a semi-supervised active DBN learning algorithm based on EM and classification loss is proposed. The EM algorithm in semi-supervised learning can effectively exploit unlabeled data to learn a DBN classification model, but incorrect class information is easily introduced during the iterations, which degrades model accuracy. By incorporating classification-loss-based active learning into EM learning, the algorithm can autonomously select useful unlabeled samples and ask the user to label them; once these samples are added to the training set, the model's uncertainty in classifying the remaining unlabeled samples is reduced as much as possible. Experiments show that the algorithm significantly improves the efficiency and performance of the DBN learner and converges quickly to a predefined classification accuracy.

8.
On classification with incomplete data   Cited by 4 (self-citations: 0, other citations: 4)
We address the incomplete-data problem in which feature vectors to be classified are missing data (features). A (supervised) logistic regression algorithm for the classification of incomplete data is developed. Single or multiple imputation for the missing data is avoided by performing analytic integration with an estimated conditional density function (conditioned on the observed data). Conditional density functions are estimated using a Gaussian mixture model (GMM), with parameter estimation performed using both expectation-maximization (EM) and variational Bayesian EM (VB-EM). The proposed supervised algorithm is then extended to the semisupervised case by incorporating graph-based regularization. The semisupervised algorithm utilizes all available data, both incomplete and complete, as well as labeled and unlabeled. Experimental results of the proposed classification algorithms are shown.
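The GMM machinery underlying this approach can be sketched as follows: given a mixture fitted to complete data, compute the conditional distribution of the missing features given the observed ones. The paper integrates the classifier analytically over this conditional density; for illustration, the sketch only returns the conditional mean (a soft imputation), which is a simplification rather than the method itself.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def conditional_mean(gmm, x, observed_mask):
    """Conditional mean of the missing features given the observed ones under a GMM."""
    o, m = observed_mask, ~observed_mask
    x_o = x[o]
    log_resp, cond_means = [], []
    for k in range(gmm.n_components):
        mu, S = gmm.means_[k], gmm.covariances_[k]
        S_oo, S_mo = S[np.ix_(o, o)], S[np.ix_(m, o)]
        inv_S_oo = np.linalg.inv(S_oo)
        diff = x_o - mu[o]
        # component responsibility based on the observed block (up to a shared constant)
        log_resp.append(np.log(gmm.weights_[k])
                        - 0.5 * diff @ inv_S_oo @ diff
                        - 0.5 * np.linalg.slogdet(S_oo)[1])
        cond_means.append(mu[m] + S_mo @ inv_S_oo @ diff)   # Gaussian conditioning formula
    resp = np.exp(np.array(log_resp) - max(log_resp))
    resp /= resp.sum()
    return sum(r * cm for r, cm in zip(resp, cond_means))

rng = np.random.RandomState(0)
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(rng.randn(500, 4))
x = np.array([0.1, np.nan, -0.3, np.nan])
print(conditional_mean(gmm, x, ~np.isnan(x)))
```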

9.
Equilibrium-Based Support Vector Machine for Semisupervised Classification   Cited by 2 (self-citations: 0, other citations: 2)
A novel learning algorithm for semisupervised classification is proposed. The proposed method first constructs a support function that estimates a support of a data distribution using both labeled and unlabeled data. Then, it partitions a whole data space into a small number of disjoint regions with the aid of a dynamical system. Finally, it labels the decomposed regions utilizing the labeled data and the cluster structure described by the constructed support function. Simulation results show the effectiveness of the proposed method at labeling out-of-sample unlabeled test data as well as in-sample unlabeled data.

10.
Dataset size continues to increase and data are being collected from numerous applications. Because collecting labeled data is expensive and time consuming, the amount of unlabeled data is increasing. Semi-supervised learning (SSL) has been proposed to improve conventional supervised learning methods by training from both unlabeled and labeled data. In contrast to classification problems, the estimation of labels for unlabeled data presents added uncertainty for regression problems. In this paper, a semi-supervised support vector regression (SS-SVR) method based on self-training is proposed. The proposed method addresses the uncertainty of the estimated labels for unlabeled data. To measure labeling uncertainty, the label distribution of the unlabeled data is estimated with two probabilistic local reconstruction (PLR) models. Then, the training data are generated by oversampling from the unlabeled data and their estimated label distribution. The sampling rate differs based on uncertainty. Finally, expected margin-based pattern selection (EMPS) is employed to reduce training complexity. We verify the proposed method with 30 regression datasets and a real-world problem: virtual metrology (VM) in semiconductor manufacturing. The experimental results show that the proposed method improves accuracy by 8% compared with conventional supervised SVR, and the training time for the proposed method is 20% shorter than that of the benchmark methods.
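A simplified self-training sketch for SVR: pseudo-label the unlabeled points, estimate the uncertainty of each pseudo-label with a small bootstrap ensemble (a stand-in for the paper's probabilistic local reconstruction models), and keep only the most certain points when retraining; the EMPS step is omitted.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.utils import resample

def self_training_svr(X_l, y_l, X_u, n_boot=10, keep_ratio=0.5):
    preds = []
    for b in range(n_boot):
        Xb, yb = resample(X_l, y_l, random_state=b)     # bootstrap replicate of the labeled set
        preds.append(SVR().fit(Xb, yb).predict(X_u))
    preds = np.array(preds)
    y_u_hat = preds.mean(axis=0)                        # pseudo-labels for the unlabeled points
    uncertainty = preds.std(axis=0)                     # disagreement as a label-uncertainty estimate
    keep = np.argsort(uncertainty)[: int(keep_ratio * len(X_u))]   # most certain pseudo-labels
    X_aug = np.vstack([X_l, X_u[keep]])
    y_aug = np.concatenate([y_l, y_u_hat[keep]])
    return SVR().fit(X_aug, y_aug)

rng = np.random.RandomState(0)
model = self_training_svr(rng.rand(30, 4), rng.rand(30), rng.rand(200, 4))
```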

11.
张良  罗祎敏  马洪超  张帆  胡川 《计算机应用》2017,37(6):1768-1771
In hyperspectral remote sensing image classification, traditional active learning algorithms use only the labeled training samples and leave the large amount of unlabeled data unexploited. To address this, an active learning algorithm that incorporates unlabeled information is proposed. First, a triple screening based on the K-nearest-neighbor consistency principle, the consistency of predictions between successive rounds, and the informativeness measure of the active learning algorithm selects unlabeled samples whose predicted labels are highly reliable yet still carry a certain amount of information. Then, the predicted labels of these samples are treated as true labels and the samples are added to the labeled set. Finally, a higher-quality classification model is trained. Experimental results show that, compared with passive learning and traditional active learning algorithms, the proposed algorithm achieves higher classification accuracy at the same labeling cost, together with better parameter sensitivity.
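A simplified sketch of the triple screening described above, assuming scikit-learn classifiers: keep only the unlabeled samples whose prediction agrees with the majority label of their K nearest labeled neighbors, is stable with respect to the previous round's prediction, and still carries some information as measured by prediction entropy. The thresholds, the entropy band, and the base classifiers are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def screen_pseudo_labels(X_l, y_l, X_u, prev_pred, k=5, ent_range=(0.1, 0.8)):
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_l, y_l)
    proba = clf.predict_proba(X_u)
    pred = clf.classes_[proba.argmax(axis=1)]
    knn_pred = KNeighborsClassifier(n_neighbors=k).fit(X_l, y_l).predict(X_u)
    entropy = -(proba * np.log(proba + 1e-12)).sum(axis=1)      # informativeness proxy
    mask = ((pred == knn_pred)                                  # K-nearest-neighbor consistency
            & (pred == prev_pred)                               # consistency with the previous round
            & (entropy > ent_range[0]) & (entropy < ent_range[1]))  # still informative
    return X_u[mask], pred[mask]                                # pseudo-labeled samples to add

rng = np.random.RandomState(0)
X_l, y_l = rng.rand(50, 10), rng.randint(0, 4, 50)
X_u = rng.rand(500, 10)
prev = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_l, y_l).predict(X_u)
X_new, y_new = screen_pseudo_labels(X_l, y_l, X_u, prev)
```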

12.
Learning with Positive and Unlabeled Examples Using Topic-Sensitive PLSA   Cited by 1 (self-citations: 0, other citations: 1)
It is often difficult and time-consuming to provide a large amount of positive and negative examples for training a classification system in many applications such as information retrieval. Instead, users often find it easier to indicate just a few positive examples of what they like, and thus these are the only labeled examples available to the learning system. A large amount of unlabeled data is easier to obtain. How to make use of the positive and unlabeled data for learning is a critical problem in machine learning and information retrieval. Several approaches for solving this problem have been proposed in the past, but most of these methods do not work well when only a small amount of labeled positive data is available. In this paper, we propose a novel algorithm called Topic-Sensitive pLSA to solve this problem. This algorithm extends the original probabilistic latent semantic analysis (pLSA), which is a purely unsupervised framework, by injecting a small amount of supervision information from the user. The supervision from users takes the form of indicating which documents fit the users' interests. The supervision is encoded into a set of constraints. By introducing penalty terms for these constraints, we propose an objective function that trades off the likelihood of the observed data against the enforcement of the constraints. We develop an iterative algorithm that can obtain a local optimum of the objective function. Experimental evaluation on three data corpora shows that the proposed method can improve performance, especially when only a small amount of labeled positive data is available.

13.
Previous partially supervised classification methods can partition unlabeled data into positive and negative examples for a given class by learning from positive labeled examples and unlabeled examples, but they cannot further group the negative examples into meaningful clusters, even if there are many different classes among the negative examples. Here we propose an automatic method to obtain a natural partitioning of mixed data (labeled data + unlabeled data) by maximizing a stability criterion, defined on classification results from an extended label propagation algorithm, over all possible values of the model order (the number of classes) in the mixed data. Our experimental results on benchmark corpora for the word sense disambiguation task indicate that this model order identification algorithm, with the extended label propagation algorithm as the base classifier, outperforms SVM, a one-class partially supervised classification algorithm, and the model order identification algorithm with semi-supervised k-means clustering as the base classifier when labeled data is incomplete.

14.
In many real-world applications, labeled data are usually expensive to obtain, while there may be a large amount of unlabeled data. To reduce the labeling cost, active learning attempts to discover the most informative data points for labeling. The challenge is deciding which unlabeled samples should be labeled to improve the classifier the most. Classical optimal experimental design algorithms are based on least-squares errors over the labeled samples only, while the unlabeled points are ignored. In this paper, we propose a novel active learning algorithm called neighborhood preserving D-optimal design. Our algorithm is based on a neighborhood preserving regression model which simultaneously minimizes the least-squares error on the measured samples and preserves the neighborhood structure of the data space. It selects the most informative samples, namely those which minimize the variance of the regression parameter. We also extend our algorithm to the nonlinear case by using the kernel trick. Experimental results on terrain classification show the effectiveness of the proposed approach.

15.
Semisupervised learning from different information sources   Cited by 2 (self-citations: 1, other citations: 1)
This paper studies the use of a semisupervised learning algorithm from different information sources. We first offer a theoretical explanation as to why minimising the disagreement between individual models can lead to performance improvement. Based on this observation, the paper proposes a semisupervised learning approach that attempts to minimise this disagreement by employing a co-updating method and making use of both labeled and unlabeled data. Three experiments to test the effectiveness of the approach are presented: (i) webpage classification from both content and hyperlinks; (ii) functional classification of genes using gene expression data and phylogenetic data; and (iii) machine self-maintenance from both sensory and image data. The results show the effectiveness and efficiency of our approach and suggest its application potential.
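A minimal sketch of the co-updating idea with two information sources (views): in each round, the model trained on each view pseudo-labels the unlabeled examples it is most confident about, and both models are retrained on the shared augmented labeled set, which tends to reduce their disagreement. Base learners, the confidence ranking, and the round sizes are illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_update(X1_l, X2_l, y_l, X1_u, X2_u, rounds=5, per_round=10):
    X1_aug, X2_aug, y_aug = X1_l.copy(), X2_l.copy(), y_l.copy()
    remaining = np.arange(len(X1_u))
    for _ in range(rounds):
        m1 = GaussianNB().fit(X1_aug, y_aug)
        m2 = GaussianNB().fit(X2_aug, y_aug)
        if len(remaining) == 0:
            break
        newly = {}                                      # unlabeled index -> pseudo-label
        for m, X_view in ((m1, X1_u), (m2, X2_u)):
            proba = m.predict_proba(X_view[remaining])
            top = np.argsort(-proba.max(axis=1))[:per_round]   # most confident points
            for t in top:
                newly[remaining[t]] = m.classes_[proba[t].argmax()]
        idxs = np.array(sorted(newly))
        X1_aug = np.vstack([X1_aug, X1_u[idxs]])        # both views grow with the same points
        X2_aug = np.vstack([X2_aug, X2_u[idxs]])
        y_aug = np.concatenate([y_aug, [newly[i] for i in idxs]])
        remaining = np.setdiff1d(remaining, idxs)
    return m1, m2

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 40)
m1, m2 = co_update(rng.rand(40, 5), rng.rand(40, 3), y,
                   rng.rand(300, 5), rng.rand(300, 3))
```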

16.
The key to PU text classification (training a classifier from a set of positive and unlabeled examples) is to extract as many reliable negative examples as possible from the unlabeled set U, and then to build an effective classifier from the positives and reliable negatives using machine learning methods. Existing methods either extract too few reliable negatives or extract unreliable ones, so the resulting classifiers are not very accurate. This paper proposes a PU text classification method that performs active learning with an SVM and an improved Rocchio classifier, and uses the spy technique to improve the accuracy of the SVM classifier, addressing the practical problem that training samples, especially negative ones, are too costly to obtain in some machine learning settings. Experiments show that this method achieves higher precision and recall than other current active learning methods and PU-oriented text classification methods.
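A minimal sketch of the spy step used to extract reliable negatives from the unlabeled set U: a fraction of the positive documents is hidden inside U as spies, a naive Bayes classifier is trained on P versus U, and the unlabeled documents scoring below the lowest-scoring spy are taken as reliable negatives. The spy ratio and the base classifier are illustrative choices; the SVM/Rocchio active learning stage of the paper is not shown.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def reliable_negatives(X_pos, X_unlabeled, spy_ratio=0.15, random_state=0):
    rng = np.random.RandomState(random_state)
    n_spies = max(1, int(spy_ratio * len(X_pos)))
    spy_idx = rng.choice(len(X_pos), n_spies, replace=False)
    spies = X_pos[spy_idx]
    rest_pos = np.delete(X_pos, spy_idx, axis=0)
    X = np.vstack([rest_pos, X_unlabeled, spies])       # spies hidden on the "unlabeled" side
    y = np.concatenate([np.ones(len(rest_pos)), np.zeros(len(X_unlabeled) + n_spies)])
    nb = MultinomialNB().fit(X, y)
    threshold = nb.predict_proba(spies)[:, 1].min()     # lowest positive score among the spies
    scores = nb.predict_proba(X_unlabeled)[:, 1]
    return X_unlabeled[scores < threshold]              # confidently negative documents

rng = np.random.RandomState(0)
X_pos = rng.poisson(1.0, size=(50, 200))                # toy term-count matrices
X_unl = rng.poisson(1.0, size=(500, 200))
print(len(reliable_negatives(X_pos, X_unl)), "reliable negatives extracted")
```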

17.
18.
Semisupervised learning has been of growing interest over the past years and many methods have been proposed. While existing semisupervised methods have shown some promising empirical performance, their development has been based largely on heuristics. In this paper, we investigate semisupervised multicategory classification with an imperfect mixture density model. In the proposed model, the training data come from a probability distribution which can be modeled imperfectly by an identifiable mixture distribution. Furthermore, we propose a semisupervised multicategory classification method and establish its generalization error bounds. The theoretical analysis illustrates that the proposed method can utilize unlabeled data effectively and can achieve a fast convergence rate.

19.
Label Propagation through Linear Neighborhoods   Cited by 8 (self-citations: 0, other citations: 8)
In many practical data mining applications such as text classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi-supervised learning algorithms have aroused considerable interest from the data mining and machine learning fields. In recent years, graph-based semi-supervised learning has become one of the most active research areas in the semi-supervised learning community. In this paper, a novel graph-based semi-supervised learning approach is proposed based on a linear neighborhood model, which assumes that each data point can be linearly reconstructed from its neighborhood. Our algorithm, named linear neighborhood propagation (LNP), can propagate the labels from the labeled points to the whole data set using these linear neighborhoods with sufficient smoothness. A theoretical analysis of the properties of LNP is presented in this paper. Furthermore, we also derive an easy way to extend LNP to out-of-sample data. Promising experimental results are presented for synthetic data, digit, and text classification tasks.
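A minimal sketch of linear neighborhood propagation: reconstruct each point from its k nearest neighbors with weights that are non-negative and sum to one (approximated here by regularized least squares followed by clipping, instead of the paper's quadratic program), then iterate F <- alpha*W*F + (1-alpha)*Y until the labels have spread through the graph.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lnp(X, y, labeled_mask, k=8, alpha=0.9, n_iter=100):
    n, classes = len(X), np.unique(y[labeled_mask])
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = idx[i, 1:]                                # skip the point itself
        G = (X[nbrs] - X[i]) @ (X[nbrs] - X[i]).T        # local Gram matrix
        w = np.linalg.solve(G + (1e-3 * np.trace(G) + 1e-8) * np.eye(k), np.ones(k))
        w = np.clip(w, 0, None)
        W[i, nbrs] = w / w.sum()                         # reconstruction weights on the simplex
    Y = np.zeros((n, len(classes)))
    for j, c in enumerate(classes):
        Y[labeled_mask & (y == c), j] = 1                # clamp the labeled points
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * W @ F + (1 - alpha) * Y              # propagate labels through the graph
    return classes[F.argmax(axis=1)]

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 4])
y = np.array([0] * 100 + [1] * 100)
mask = np.zeros(200, dtype=bool); mask[[0, 150]] = True  # only two labeled points
print((lnp(X, y, mask) == y).mean())
```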

20.
The problem of learning from both labeled and unlabeled data is considered. In this paper, we present a novel semisupervised multimodal dimensionality reduction (SSMDR) algorithm for feature reduction and extraction. SSMDR can preserve the local and multimodal structures of labeled and unlabeled samples. As a result, data pairs that lie in close vicinity in the original space are projected close to each other in the embedding space. Due to overfitting, supervised dimensionality reduction methods tend to perform poorly when only a few labeled samples are available. In such cases, unlabeled samples play a significant role in boosting the learning performance. The proposed discriminant technique has an analytical form of the embedding transformations that can be effectively obtained by applying eigendecomposition, or by finding two close optimal sets of transforming basis vectors. By employing the standard kernel trick, SSMDR can be extended to nonlinear dimensionality reduction scenarios. We verify the feasibility and effectiveness of SSMDR by conducting extensive simulations, including data visualization and classification, on synthetic and real-world datasets. Our results reveal that SSMDR offers significant advantages over some widely used techniques. Compared with other methods, the proposed SSMDR exhibits superior performance on multimodal cases.
