首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Most machine learning tasks in data classification and information retrieval require manually labeled data examples in the training stage. The goal of active learning is to select the most informative examples for manual labeling in these learning tasks. Most of the previous studies in active learning have focused on selecting a single unlabeled example in each iteration. This could be inefficient, since the classification model has to be retrained for every acquired labeled example. It is also inappropriate for the setup of information retrieval tasks where the user's relevance feedback is often provided for the top K retrieved items. In this paper, we present a framework for batch mode active learning, which selects a number of informative examples for manual labeling in each iteration. The key feature of batch mode active learning is to reduce the redundancy among the selected examples such that each example provides unique information for model updating. To this end, we employ the Fisher information matrix as the measurement of model uncertainty, and choose the set of unlabeled examples that can efficiently reduce the Fisher information of the classification model. We apply our batch mode active learning framework to both text categorization and image retrieval. Promising results show that our algorithms are significantly more effective than the active learning approaches that select unlabeled examples based only on their informativeness for the classification model.  相似文献   

2.
In many real-world applications, labeled data are usually expensive to get, while there may be a large amount of unlabeled data. To reduce the labeling cost, active learning attempts to discover the most informative data points for labeling. The challenge is which unlabeled samples should be labeled to improve the classifier the most. Classical optimal experimental design algorithms are based on least-square errors over the labeled samples only while the unlabeled points are ignored. In this paper, we propose a novel active learning algorithm called neighborhood preserving D-optimal design. Our algorithm is based on a neighborhood preserving regression model which simultaneously minimizes the least-square error on the measured samples and preserves the neighborhood structure of the data space. It selects the most informative samples which minimize the variance of the regression parameter. We also extend our algorithm to nonlinear case by using kernel trick. Experimental results on terrain classification show the effectiveness of proposed approach.  相似文献   

3.
Image classification is an important task in computer vision and machine learning. However, it is known that manually labeling images is time-consuming and expensive, but the unlabeled images are easily available. Active learning is a mechanism which tries to determine which unlabeled data points would be the most informative (i.e., improve the classifier the most) if they are labeled and used as training samples. In this paper, we introduce the idea of column subset selection, which aims to select the most representation columns from a data matrix, into active learning and propose a novel active learning algorithm, column subset selection for active learning (CSSactive). CSSactive selects the most representative images to label, then the other images are reconstructed by these labeled images. The goal of CSSactive is to minimize the reconstruction error. Besides, most of the previous active learning approaches are based on linear model, and hence they only consider linear functions. Therefore, they fail to discover the intrinsic geometry in images when the image space is highly nonlinear. Therefore, we provide a kernel-based column subset selection for active learning (KCSSactive) algorithm which performs the active learning in Reproducing Kernel Hilbert Space (RKHS) instead of the original image space to address this problem. Experimental results on Yale, AT&T and COIL20 data sets demonstrate the effectiveness of our proposed approaches.  相似文献   

4.
对于建立动态贝叶斯网络(DBN)分类模型时,带有类标注样本数据集获得困难的问题,提出一种基于EM和分类损失的半监督主动DBN学习算法.半监督学习中的EM算法可以有效利用未标注样本数据来学习DBN分类模型,但是由于迭代过程中易于加入错误的样本分类信息而影响模型的准确性.基于分类损失的主动学习借鉴到EM学习中,可以自主选择有用的未标注样本来请求用户标注,当把这些样本加入训练集后能够最大程度减少模型对未标注样本分类的不确定性.实验表明,该算法能够显著提高DBN学习器的效率和性能,并快速收敛于预定的分类精度.  相似文献   

5.
In classification tasks, active learning is often used to select out a set of informative examples from a big unlabeled dataset. The objective is to learn a classification pattern that can accurately predict labels of new examples by using the selection result which is expected to contain as few examples as possible. The selection of informative examples also reduces the manual effort for labeling, data complexity, and data redundancy, thus improves learning efficiency. In this paper, a new active learning strategy with pool-based settings, called inconsistency-based active learning, is proposed. This strategy is built up under the guidance of two classical works: (1) the learning philosophy of query-by-committee (QBC) algorithm; and (2) the structure of the traditional concept learning model: from-general-to-specific (GS) ordering. By constructing two extreme hypotheses of the current version space, the strategy evaluates unlabeled examples by a new sample selection criterion as inconsistency value, and the whole learning process could be implemented without any additional knowledge. Besides, since active learning is favorably applied to support vector machine (SVM) and its related applications, the strategy is further restricted to a specific algorithm called inconsistency-based active learning for SVM (I-ALSVM). By building up a GS structure, the sample selection process in our strategy is formed by searching through the initial version space. We compare the proposed I-ALSVM with several other pool-based methods for SVM on selected datasets. The experimental result shows that, in terms of generalization capability, our model exhibits good feasibility and competitiveness.  相似文献   

6.
李延超  肖甫  陈志  李博 《软件学报》2020,31(12):3808-3822
主动学习从大量无标记样本中挑选样本交给专家标记.现有的批抽样主动学习算法主要受3个限制:(1)一些主动学习方法基于单选择准则或对数据、模型设定假设,这类方法很难找到既有不确定性又有代表性的未标记样本;(2)现有批抽样主动学习方法的性能很大程度上依赖于样本之间相似性度量的准确性,例如预定义函数或差异性衡量;(3)噪声标签问题一直影响批抽样主动学习算法的性能.提出一种基于深度学习批抽样的主动学习方法.通过深度神经网络生成标记和未标记样本的学习表示和采用标签循环模式,使得标记样本与未标记样本建立联系,再回到相同标签的标记样本.这样同时考虑了样本的不确定性和代表性,并且算法对噪声标签具有鲁棒性.在提出的批抽样主动学习方法中,算法使用的子模块函数确保选择的样本集合具有多样性.此外,自适应参数的优化,使得主动学习算法可以自动平衡样本的不确定性和代表性.将提出的主动学习方法应用到半监督分类和半监督聚类中,实验结果表明,所提出的主动学习方法的性能优于现有的一些先进的方法.  相似文献   

7.
一种增量贝叶斯分类模型   总被引:40,自引:0,他引:40  
分类一直是机器学习,模型识别和数据挖掘研究的核心问题,从海量数据中学习分类知识,尤其是当获得大量的带有类别标注的样本代价较高时,增量学习是解决该问题的有效途径,该文将简单贝叶期方法应用于增量分类中,提出了一种增量贝叶斯学习模型,给出了增量贝叶斯推理过程,包括增量地修正分类器参数和增量地分类测试样本,实验结果表明,该算法是可行的和有效。  相似文献   

8.
In classification problems, many different active learning techniques are often adopted to find the most informative samples for labeling in order to save human labors. Among them, active learning support vector machine (SVM) is one of the most representative approaches, in which model parameter is usually set as a fixed default value during the whole learning process. Note that model parameter is closely related to the training set. Hence dynamic parameter is desirable to make a satisfactory learning performance. To target this issue, we proposed a novel algorithm, called active learning SVM with regularization path, which can fit the entire solution path of SVM for every value of model parameters. In this algorithm, we first traced the entire solution path of the current classifier to find a series of candidate model parameters, and then used unlabeled samples to select the best model parameter. Besides, in the initial phase of training, we constructed a training sample sets by using an improved K-medoids cluster algorithm. Experimental results conducted from real-world data sets showed the effectiveness of the proposed algorithm for image classification problems.  相似文献   

9.
在很多信息处理任务中,人们容易获得大量的无标签样本,但对样本进行标注是非常费时和费力的。作为机器学习领域中一种重要的学习方法,主动学习通过选择最有信息量的样本进行标注,减少了人工标注的代价。然而,现有的大多数主动学习算法都是基于分类器的监督学习方法,这类算法并不适用于无任何标签信息的样本选择。针对这个问题,借鉴最优实验设计的算法思想,结合自适应稀疏邻域重构理论,提出基于自适应稀疏邻域重构的主动学习算法。该算法可以根据数据集各区域的不同分布自适应地选择邻域规模,同步完成邻域点的搜寻和重构系数的计算,能在无任何标签信息的情况下较好地选择最能代表样本集分布结构的样本。基于人工合成数据集和真实数据集的实验表明,在同等标注代价下,基于自适应稀疏邻域重构的主动学习算法在分类精度和鲁棒性上具有较高的性能。  相似文献   

10.
目前基于PU问题的时间序列分类常采用半监督学习对未标注数据集[U]中数据进行自动标注并构建分类器,但在这种方法中,边界数据样本类别的自动标注难以保证正确性,从而导致构建分类器的效果不佳。针对以上问题,提出一种采用主动学习对未标注数据集[U]中数据进行人工标注从而构建分类器的方法OAL(Only Active Learning),基于投票委员会(QBC)对标注数据集构建多个分类器进行投票,以计算未标注数据样本的类别不一致性,并综合考虑数据样本的分布密度,计算数据样本的信息量,作为主动学习的数据选择策略。鉴于人工标注数据量有限,在上述OAL方法的基础上,将主动学习与半监督学习相结合,即在主动学习迭代过程中,将类别一致性高的部分数据样本自动标注,以增加训练数据中标注数据量,保证构建分类器的训练数据量。实验表明了该方法通过部分人工标注,相比半监督学习,能够为PU数据集构建更高准确率的分类器。  相似文献   

11.
In classification problems, active learning is often adopted to alleviate the laborious human labeling efforts, by finding the most informative samples to query the labels. One of the most popular query strategy is selecting the most uncertain samples for the current classifier. The performance of such an active learning process heavily relies on the learned classifier before each query. Thus, stepwise classifier model/parameter selection is quite critical, which is, however, rarely studied in the literature. In this paper, we propose a novel active learning support vector machine algorithm with adaptive model selection. In this algorithm, before each new query, we trace the full solution path of the base classifier, and then perform efficient model selection using the unlabeled samples. This strategy significantly improves the active learning efficiency with comparatively inexpensive computational cost. Empirical results on both artificial and real world benchmark data sets show the encouraging gains brought by the proposed algorithm in terms of both classification accuracy and computational cost.  相似文献   

12.
Co-training is a good paradigm of semi-supervised, which requires the data set to be described by two views of features. There are a notable characteristic shared by many co-training algorithm: the selected unlabeled instances should be predicted with high confidence, since a high confidence score usually implies that the corresponding prediction is correct. Unfortunately, it is not always able to improve the classification performance with these high confidence unlabeled instances. In this paper, a new semi-supervised learning algorithm was proposed combining the benefits of both co-training and active learning. The algorithm applies co-training to select the most reliable instances according to the two criterions of high confidence and nearest neighbor for boosting the classifier, also exploit the most informative instances with human annotation for improve the classification performance. Experiments on several UCI data sets and natural language processing task, which demonstrate our method achieves more significant improvement for sacrificing the same amount of human effort.  相似文献   

13.
目的在多标签有监督学习框架中,构建具有较强泛化性能的分类器需要大量已标注训练样本,而实际应用中已标注样本少且获取代价十分昂贵。针对多标签图像分类中已标注样本数量不足和分类器再学习效率低的问题,提出一种结合主动学习的多标签图像在线分类算法。方法基于min-max理论,采用查询最具代表性和最具信息量的样本挑选策略主动地选择待标注样本,且基于KKT(Karush-Kuhn-Tucker)条件在线地更新多标签图像分类器。结果在4个公开的数据集上,采用4种多标签分类评价指标对本文算法进行评估。实验结果表明,本文采用的样本挑选方法比随机挑选样本方法和基于间隔的采样方法均占据明显优势;当分类器达到相同或相近的分类准确度时,利用本文的样本挑选策略选择的待标注样本数目要明显少于采用随机挑选样本方法和基于间隔的采样方法所需查询的样本数。结论本文算法一方面可以减少获取已标注样本所需的人工标注代价;另一方面也避免了传统的分类器重新训练时利用所有数据所产生的学习效率低下的问题,达到了当新数据到来时可实时更新分类器的目的。  相似文献   

14.
Label Propagation through Linear Neighborhoods   总被引:8,自引:0,他引:8  
In many practical data mining applications such as text classification, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semi supervised learning algorithms have aroused considerable interests from the data mining and machine learning fields. In recent years, graph-based semi supervised learning has been becoming one of the most active research areas in the semi supervised learning community. In this paper, a novel graph-based semi supervised learning approach is proposed based on a linear neighborhood model, which assumes that each data point can be linearly reconstructed from its neighborhood. Our algorithm, named linear neighborhood propagation (LNP), can propagate the labels from the labeled points to the whole data set using these linear neighborhoods with sufficient smoothness. A theoretical analysis of the properties of LNP is presented in this paper. Furthermore, we also derive an easy way to extend LNP to out-of-sample data. Promising experimental results are presented for synthetic data, digit, and text classification tasks.  相似文献   

15.
Active Sampling for Class Probability Estimation and Ranking   总被引:1,自引:0,他引:1  
In many cost-sensitive environments class probability estimates are used by decision makers to evaluate the expected utility from a set of alternatives. Supervised learning can be used to build class probability estimates; however, it often is very costly to obtain training data with class labels. Active learning acquires data incrementally, at each phase identifying especially useful additional data for labeling, and can be used to economize on examples needed for learning. We outline the critical features of an active learner and present a sampling-based active learning method for estimating class probabilities and class-based rankings. BOOTSTRAP-LV identifies particularly informative new data for learning based on the variance in probability estimates, and uses weighted sampling to account for a potential example's informative value for the rest of the input space. We show empirically that the method reduces the number of data items that must be obtained and labeled, across a wide variety of domains. We investigate the contribution of the components of the algorithm and show that each provides valuable information to help identify informative examples. We also compare BOOTSTRAP-LV with UNCERTAINTY SAMPLING, an existing active learning method designed to maximize classification accuracy. The results show that BOOTSTRAP-LV uses fewer examples to exhibit a certain estimation accuracy and provide insights to the behavior of the algorithms. Finally, we experiment with another new active sampling algorithm drawing from both UNCERTAINTY SAMPLING and BOOTSTRAP-LV and show that it is significantly more competitive with BOOTSTRAP-LV compared to UNCERTAINTY SAMPLING. The analysis suggests more general implications for improving existing active sampling algorithms for classification.  相似文献   

16.
In the past few years, active learning has been reasonably successful and it has drawn a lot of attention. However, recent active learning methods have focused on strategies in which a large unlabeled dataset has to be reprocessed at each learning iteration. As the datasets grow, these strategies become inefficient or even a tremendous computational challenge. In order to address these issues, we propose an effective and efficient active learning paradigm which attains a significant reduction in the size of the learning set by applying an a priori process of identification and organization of a small relevant subset. Furthermore, the concomitant classification and selection processes enable the classification of a very small number of samples, while selecting the informative ones. Experimental results showed that the proposed paradigm allows to achieve high accuracy quickly with minimum user interaction, further improving its efficiency.  相似文献   

17.
为了减少枯燥和耗时的训练进程和提高脑机接口系统的分类率,将半监督学习运用到了运动想象脑电的分类中,提出了一种基于分段重叠共空间模式的自训练算法,将分段重叠共空间模式作为特征提取算法,使用少量标记的数据进行学习,然后使用置信度评估准则从未标记样本中挑选信息量大的样本来提高线性判别分类器的性能。提出的算法在少量标记样本和大量未标记样本的帮助下,能够获得比基于共空间模式作为特征提取的自训练算法和基于滤波带宽共空间模式作为特征提取的自训练算法有更好的分类效果。使用2005 BCI竞赛的数据集Iva来证明算法的有效性,结果表明了提出的算法能有效提高运动想象脑电的分类率。  相似文献   

18.
Semi-supervised document clustering, which takes into account limited supervised data to group unlabeled documents into clusters, has received significant interest recently. Because of getting supervised data may be expensive, it is important to get most informative knowledge to improve the clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to get improved clustering performance. The semi- supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for obtaining user feedbacks. Experimental results show that Cons-DBSCAN with our proposed active learning approach can improve the clustering performance significantly when given a relatively small amount of constraints.  相似文献   

19.
Learning from labeled and unlabeled data using a minimal number of queries   总被引:4,自引:0,他引:4  
The considerable time and expense required for labeling data has prompted the development of algorithms which maximize the classification accuracy for a given amount of labeling effort. On the one hand, the effort has been to develop the so-called "active learning" algorithms which sequentially choose the patterns to be explicitly labeled so as to realize the maximum information gain from each labeling. On the other hand, the effort has been to develop algorithms that can learn from labeled as well as the more abundant unlabeled data. Proposed in this paper is an algorithm that integrates the benefits of active learning with the benefits of learning from labeled and unlabeled data. Our approach is based on reversing the roles of the labeled and unlabeled data. Specifically, we use a Genetic Algorithm (GA) to iteratively refine the class membership of the unlabeled patterns so that the maximum a posteriori (MAP) based predicted labels of the patterns in the labeled dataset are in agreement with the known labels. This reversal of the role of labeled and unlabeled patterns leads to an implicit class assignment of the unlabeled patterns. For active learning, we use a subset of the GA population to construct multiple MAP classifiers. Points in the input space where there is maximal disagreement amongst these classifiers are then selected for explicit labeling. The learning from labeled and unlabeled data and active learning phases are interlaced and together provide accurate classification while minimizing the labeling effort.  相似文献   

20.
The traditional setting of supervised learning requires a large amount of labeled training examples in order to achieve good generalization. However, in many practical applications, unlabeled training examples are readily available, but labeled ones are fairly expensive to obtain. Therefore, semisupervised learning has attracted much attention. Previous research on semisupervised learning mainly focuses on semisupervised classification. Although regression is almost as important as classification, semisupervised regression is largely understudied. In particular, although cotraining is a main paradigm in semisupervised learning, few works has been devoted to cotraining-style semisupervised regression algorithms. In this paper, a cotraining-style semisupervised regression algorithm, that is, COREG, is proposed. This algorithm uses two regressors, each labels the unlabeled data for the other regressor, where the confidence in labeling an unlabeled example is estimated through the amount of reduction in mean squared error over the labeled neighborhood of that example. Analysis and experiments show that COREG can effectively exploit unlabeled data to improve regression estimates.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号