首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 218 毫秒
1.
目的在多标签有监督学习框架中,构建具有较强泛化性能的分类器需要大量已标注训练样本,而实际应用中已标注样本少且获取代价十分昂贵。针对多标签图像分类中已标注样本数量不足和分类器再学习效率低的问题,提出一种结合主动学习的多标签图像在线分类算法。方法基于min-max理论,采用查询最具代表性和最具信息量的样本挑选策略主动地选择待标注样本,且基于KKT(Karush-Kuhn-Tucker)条件在线地更新多标签图像分类器。结果在4个公开的数据集上,采用4种多标签分类评价指标对本文算法进行评估。实验结果表明,本文采用的样本挑选方法比随机挑选样本方法和基于间隔的采样方法均占据明显优势;当分类器达到相同或相近的分类准确度时,利用本文的样本挑选策略选择的待标注样本数目要明显少于采用随机挑选样本方法和基于间隔的采样方法所需查询的样本数。结论本文算法一方面可以减少获取已标注样本所需的人工标注代价;另一方面也避免了传统的分类器重新训练时利用所有数据所产生的学习效率低下的问题,达到了当新数据到来时可实时更新分类器的目的。  相似文献   

2.
在许多分类任务中,存在大量未标记的样本,并且获取样本标签耗时且昂贵。利用主动学习算法确定最应被标记的关键样本,来构建高精度分类器,可以最大限度地减少标记成本。本文提出一种基于PageRank的主动学习算法(PAL),充分利用数据分布信息进行有效的样本选择。利用PageRank根据样本间的相似度关系依次计算邻域、分值矩阵和排名向量;选择代表样本,并根据其相似度关系构建二叉树,利用该二叉树对代表样本进行聚类,标记和预测;将代表样本作为训练集,对其他样本进行分类。实验采用8个公开数据集,与5种传统的分类算法和3种流行的主动学习算法比较,结果表明PAL算法能取得更好的分类效果。  相似文献   

3.
针对主动学习中构造初始分类器难以选取代表性样本的问题,提出一种模糊核聚类采样算法。该算法首先通过聚类分析技术将样本集划分,然后分别在类簇中心和类簇边界区域选取样本进行标注,最后依此构造初始分类器。在该算法中,通过高斯核函数把原始样本空间中的点非线性变换到高维特征空间,以达到线性可聚的目的,并引入了一种基于局部密度的初始聚类中心选择方法,从而改善聚类效果。为了提高采样质量,结合划分后各类簇的样本个数设计了一种采样比例分配策略。同时,在采样结束阶段设计了一种后补采样策略,以确保采样个数达标。实验结果分析表明,所提算法可以有效地减少构造初始分类器所需的人工标注负担,并取得较高的分类正确率。  相似文献   

4.
Active learning (AL) has been shown to be a useful approach to improving the efficiency of the classification process for remote-sensing imagery. Current AL methods are essentially based on pixel-wise classification. In this paper, a new patch-based active learning (PTAL) framework is proposed for spectral-spatial classification on hyperspectral remote-sensing data. The method consists of two major steps. In the initialization stage, the original hyperspectral images are partitioned into overlapping patches. Then, for each patch, the spectral and spatial information as well as the label are extracted. A small set of patches is randomly selected from the data set for annotation, then a patch-based support vector machine (PTSVM) classifier is initially trained with these patches. In the second stage (close-loop stage of query and retraining), the trained PTSVM classifier is combined with one of three query methods, which are margin sampling (MS), entropy query-by-bagging (EQB), and multi-class level uncertainty (MCLU), and is subsequently employed to query the most informative samples from the candidate pool comprising the rest of the patches from the data set. The query selection cycle enables the PTSVM model to select the most informative queries for human annotation. Then, these informative queries are added to the training set. This process runs iteratively until a stopping criterion is met. Finally, the trained PTSVM is employed to patch classification. In order to compare this to pixel-based active learning (PXAL) models, the prediction label of a patch by PTSVM is transformed into a pixel-wise label of a pixel predictor to get the classification maps. Experimental results show better performance of the proposed PTAL methods on classification accuracy and computational time on three different hyperspectral data sets as compared with PXAL methods.  相似文献   

5.
针对现有的主动学习算法在多分类器应用中存在准确率低、速度慢等问题,将基于仿射传播(AP)聚类的主动学习算法引入到多分类支持向量机中,每次迭代主动选择最有利于改善多类SVM分类器性能的N个新样本点添加到训练样本点中进行学习,使得在花费较小标注代价情况下,能够获得较高的分类性能。在多个不同数据集上的实验结果表明,新方法能够有效地减少分类器训练时所需的人工标注样本点的数量,并获得较高的准确率和较好的鲁棒性。  相似文献   

6.
基于主动学习和半监督学习的多类图像分类   总被引:5,自引:0,他引:5  
陈荣  曹永锋  孙洪 《自动化学报》2011,37(8):954-962
多数图像分类算法需要大量的训练样本对分类器模型进行训练.在实际应用中, 对大量样本进行标注非常枯燥、耗时.对于一些特殊图像,如合成孔径雷达 (Synthetic aperture radar, SAR)图像, 对其内容判读非常困难,因此能够获得的标注样本数量非常有限. 本文将基于最优标号和次优标号(Best vs second-best, BvSB)的主动学习和带约束条件的自学习(Constrained self-training, CST) 引入到基于支持向量机(Support vector machine, SVM)分类器的图像分类算法中,提出了一种新的图像分类方法.通过BvSB 主动学习去挖掘那些对当前分类器模型最有价值的样本进行人工标注,并借助CST半 监督学习进一步利用样本集中大量的未标注样本,使得在花费较小标注代价情况下, 能够获得良好的分类性能.将新方法与随机样本选择、基于熵的不确定性采样主动学 习算法以及BvSB主动学习方法进行了性能比较.对3个光学图像集及1个SAR图像集分类 问题的实验结果显示,新方法能够有效地减少分类器训练时所需的人工标注样本的数 量,并获得较高的准确率和较好的鲁棒性.  相似文献   

7.
In this work, we study the problem of cross-domain video concept detection, where the distributions of the source and target domains are different. Active learning can be used to iteratively refine a source domain classifier by querying labels for a few samples in the target domain, which could reduce the labeling effort. However, traditional active learning method which often uses a discriminative query strategy that queries the most ambiguous samples to the source domain classifier for labeling would fail, when the distribution difference between two domains is too large. In this paper, we tackle this problem by proposing a joint active learning approach which combines a novel generative query strategy and the existing discriminative one. The approach adaptively fits the distribution difference and shows higher robustness than the ones using single strategy. Experimental results on two synthetic datasets and the TRECVID video concept detection task highlight the effectiveness of our joint active learning approach.  相似文献   

8.
In multi-label learning,it is rather expensive to label instances since they are simultaneously associated with multiple labels.Therefore,active learning,which reduces the labeling cost by actively querying the labels of the most valuable data,becomes particularly important for multi-label learning.A good multi-label active learning algorithm usually consists of two crucial elements:a reasonable criterion to evaluate the gain of querying the label for an instance,and an effective classification model,based on whose prediction the criterion can be accurately computed.In this paper,we first introduce an effective multi-label classification model by combining label ranking with threshold learning,which is incrementally trained to avoid retraining from scratch after every query.Based on this model,we then propose to exploit both uncertainty and diversity in the instance space as well as the label space,and actively query the instance-label pairs which can improve the classification model most.Extensive experiments on 20 datasets demonstrate the superiority of the proposed approach to state-of-the-art methods.  相似文献   

9.
In order to address the challenge of hyperspectral image (HSI) classification with very limited labelled samples, active learning (AL) has become a hot research issue in recent years. Although lots of AL approaches have been proposed in the literature, most of them concentrate on how to select the most informative samples, while ignore the significance of the input feature. We believe that the input feature and the query selection are both crucial for constituting an efficient AL algorithm. In this article, we propose a new discriminative feature, sparse code histogram (SCH), to conduct the AL procedure. SCH exhibits a much stronger distinguishability than several other widely used features, and thus a better AL performance could be expected. With this novel input feature, a probabilistic classifier, multinomial logistic regression, is trained to obtain the class probability of each sample. Considering that the class probability is usually biased due to the limited labelled samples, a graph-based spatial refinement is proposed to refine the class probability by exploiting the contextual information. Based on the refined class probability, informative samples are selected for manual labelling and classifier retraining. Such a process is iterated until a stopping criterion is met. Experimental results demonstrate that the proposed method could usually achieve above 90% classification accuracy with only two iterations, which significantly outperforms several state-of-the-art approaches.  相似文献   

10.
In this paper, we propose a lazy learning strategy for building classification learning models. Instead of learning the models with the whole training data set before observing the new instance, a selection of patterns is made depending on the new query received and a classification model is learnt with those selected patterns. The selection of patterns is not homogeneous, in the sense that the number of selected patterns depends on the position of the query instance in the input space. That selection is made using a weighting function to give more importance to the training patterns that are more similar to the query instance. Our intention is to provide a lazy learning mechanism suited to any machine learning classification algorithm. For this reason, we study two different methods to avoid fixing any parameter. Experimental results show that classification rates of traditional machine learning algorithms based on trees, rules, or functions can be improved when they are learnt with the lazy learning approach proposed. © 2011 Wiley Periodicals, Inc.  相似文献   

11.
In many real-world applications, labeled data are usually expensive to get, while there may be a large amount of unlabeled data. To reduce the labeling cost, active learning attempts to discover the most informative data points for labeling. The challenge is which unlabeled samples should be labeled to improve the classifier the most. Classical optimal experimental design algorithms are based on least-square errors over the labeled samples only while the unlabeled points are ignored. In this paper, we propose a novel active learning algorithm called neighborhood preserving D-optimal design. Our algorithm is based on a neighborhood preserving regression model which simultaneously minimizes the least-square error on the measured samples and preserves the neighborhood structure of the data space. It selects the most informative samples which minimize the variance of the regression parameter. We also extend our algorithm to nonlinear case by using kernel trick. Experimental results on terrain classification show the effectiveness of proposed approach.  相似文献   

12.
In classification problems, many different active learning techniques are often adopted to find the most informative samples for labeling in order to save human labors. Among them, active learning support vector machine (SVM) is one of the most representative approaches, in which model parameter is usually set as a fixed default value during the whole learning process. Note that model parameter is closely related to the training set. Hence dynamic parameter is desirable to make a satisfactory learning performance. To target this issue, we proposed a novel algorithm, called active learning SVM with regularization path, which can fit the entire solution path of SVM for every value of model parameters. In this algorithm, we first traced the entire solution path of the current classifier to find a series of candidate model parameters, and then used unlabeled samples to select the best model parameter. Besides, in the initial phase of training, we constructed a training sample sets by using an improved K-medoids cluster algorithm. Experimental results conducted from real-world data sets showed the effectiveness of the proposed algorithm for image classification problems.  相似文献   

13.
In this paper, we propose an active learning technique for solving multiclass problems with support vector machine (SVM) classifiers. The technique is based on both uncertainty and diversity criteria. The uncertainty criterion is implemented by analyzing the one-dimensional output space of the SVM classifier. A simple histogram thresholding algorithm is used to find out the low density region in the SVM output space to identify the most uncertain samples. Then the diversity criterion exploits the kernel k-means clustering algorithm to select uncorrelated informative samples among the selected uncertain samples. To assess the effectiveness of the proposed method we compared it with other batch mode active learning techniques presented in the literature using one toy data set and three real data sets. Experimental results confirmed that the proposed technique provided a very good tradeoff among robustness to biased initial training samples, classification accuracy, computational complexity, and number of new labeled samples necessary to reach the convergence.  相似文献   

14.
Active learning is understood as any form of learning in which the learning algorithm has some control over the input samples due to a specific sample selection process based on which it builds up the model. In this paper, we propose a novel active learning strategy for data-driven classifiers, which is based on unsupervised criterion during off-line training phase, followed by a supervised certainty-based criterion during incremental on-line training. In this sense, we call the new strategy hybrid active learning. Sample selection in the first phase is conducted from scratch (i.e. no initial labels/learners are needed) based on purely unsupervised criteria obtained from clusters: samples lying near cluster centers and near the borders of clusters are expected to represent the most informative ones regarding the distribution characteristics of the classes. In the second phase, the task is to update already trained classifiers during on-line mode with the most important samples in order to dynamically guide the classifier to more predictive power. Both strategies are essential for reducing the annotation and supervision effort of operators in off-line and on-line classification systems, as operators only have to label an exquisite subset of the off-line training data resp. give feedback only on specific occasions during on-line phase. The new active learning strategy is evaluated based on real-world data sets from UCI repository and collected at on-line quality control systems. The results show that an active learning based selection of training samples (1) does not weaken the classification accuracies compared to when using all samples in the training process and (2) can out-perform classifiers which are built on randomly selected data samples.  相似文献   

15.
在很多信息处理任务中,人们容易获得大量的无标签样本,但对样本进行标注是非常费时和费力的。作为机器学习领域中一种重要的学习方法,主动学习通过选择最有信息量的样本进行标注,减少了人工标注的代价。然而,现有的大多数主动学习算法都是基于分类器的监督学习方法,这类算法并不适用于无任何标签信息的样本选择。针对这个问题,借鉴最优实验设计的算法思想,结合自适应稀疏邻域重构理论,提出基于自适应稀疏邻域重构的主动学习算法。该算法可以根据数据集各区域的不同分布自适应地选择邻域规模,同步完成邻域点的搜寻和重构系数的计算,能在无任何标签信息的情况下较好地选择最能代表样本集分布结构的样本。基于人工合成数据集和真实数据集的实验表明,在同等标注代价下,基于自适应稀疏邻域重构的主动学习算法在分类精度和鲁棒性上具有较高的性能。  相似文献   

16.
In real applications of inductive learning for classifi cation, labeled instances are often defi cient, and labeling them by an oracle is often expensive and time-consuming. Active learning on a single task aims to select only informative unlabeled instances for querying to improve the classifi cation accuracy while decreasing the querying cost. However, an inevitable problem in active learning is that the informative measures for selecting queries are commonly based on the initial hypotheses sampled from only a few labeled instances. In such a circumstance, the initial hypotheses are not reliable and may deviate from the true distribution underlying the target task. Consequently, the informative measures will possibly select irrelevant instances. A promising way to compensate this problem is to borrow useful knowledge from other sources with abundant labeled information, which is called transfer learning. However, a signifi cant challenge in transfer learning is how to measure the similarity between the source and the target tasks. One needs to be aware of different distributions or label assignments from unrelated source tasks;otherwise, they will lead to degenerated performance while transferring. Also, how to design an effective strategy to avoid selecting irrelevant samples to query is still an open question. To tackle these issues, we propose a hybrid algorithm for active learning with the help of transfer learning by adopting a divergence measure to alleviate the negative transfer caused by distribution differences. To avoid querying irrelevant instances, we also present an adaptive strategy which could eliminate unnecessary instances in the input space and models in the model space. Extensive experiments on both the synthetic and the real data sets show that the proposed algorithm is able to query fewer instances with a higher accuracy and that it converges faster than the state-of-the-art methods.  相似文献   

17.
支持向量机是重要的机器学习方法之一,已成功解决了许多实际的分类问题。围绕如何提高支持向量机的分类精度与训练效率,以分类过程为主线,主要综述了在训练支持向量机之前不同的特征选取方法与学习策略。在此基础上,比较了不同的特征选取方法SFS,IWSS,IWSSr以及BARS的分类精度,分析了主动学习策略与支持向量机融合后获得的分类器在测试集上的分类精度与正确率/召回率平衡点两个性能指标。实验结果表明,包装方法与过滤方法相结合的特征选取方法能有效提高支持向量机的分类精度和减少训练样本量;在标签数据较少的情况下,主动学习能达到更好的分类精度,而为了达到相同的分类精度,被动学习需要的样本数量必须要达到主动学习的6倍。  相似文献   

18.
In this paper, we study the problem of learning from multiple model data for the purpose of document classification. In this problem, each document is composed of two different models of data, i.e., an image and a text. We propose to represent the data of two models by projecting them to a shared data space by using cross-model factor analysis formula and classify them in the shared space by using a linear class label predictor, named cross-model classifier. The parameters of both cross-model classifier and cross-model factor analysis are learned jointly, so that they can regularize the learning of each other. We construct a unified objective function for this learning problem. With this objective function, we minimize the distance between the projections of image and text of the same document, and the classification error of the projections measured by hinge loss function. The objective function is optimized by an alternate optimization strategy in an iterative algorithm. Experiments in two different multiple model document data sets show the advantage of the proposed algorithm over state-of-the-art multimedia data classification methods.  相似文献   

19.
20.
现实生活中存在大量的非平衡数据,大多数传统的分类算法假定类分布平衡或者样本的错分代价相同,因此在对这些非平衡数据进行分类时会出现少数类样本错分的问题。针对上述问题,在代价敏感的理论基础上,提出了一种新的基于代价敏感集成学习的非平衡数据分类算法--NIBoost(New Imbalanced Boost)。首先,在每次迭代过程中利用过采样算法新增一定数目的少数类样本来对数据集进行平衡,在该新数据集上训练分类器;其次,使用该分类器对数据集进行分类,并得到各样本的预测类标及该分类器的分类错误率;最后,根据分类错误率和预测的类标计算该分类器的权重系数及各样本新的权重。实验采用决策树、朴素贝叶斯作为弱分类器算法,在UCI数据集上的实验结果表明,当以决策树作为基分类器时,与RareBoost算法相比,F-value最高提高了5.91个百分点、G-mean最高提高了7.44个百分点、AUC最高提高了4.38个百分点;故该新算法在处理非平衡数据分类问题上具有一定的优势。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号