Similar literature
20 similar documents found.
1.
Learning cost-sensitive models from datasets with few labeled and plentiful unlabeled examples is a practical and challenging problem, because labeled data are often difficult, time-consuming, and/or expensive to obtain. To address this, we propose two strategies, based on Expectation Maximization (EM), for learning cost-sensitive classifiers from training sets containing both labeled and unlabeled data. The first method, Direct-EM, uses EM to build a semi-supervised classifier and then directly computes the optimal class label for each test example from the class probabilities produced by the learned model. The second method, CS-EM, modifies EM by incorporating misclassification cost into the probability estimation process. Extensive experiments show that, when using only a small number of labeled training examples, CS-EM outperforms the competing methods on the majority of the selected UCI data sets across different cost ratios, especially when the cost ratio is high.
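The decision rule Direct-EM applies after EM training, picking the label with minimum expected misclassification cost from the model's class posteriors, can be sketched as below. This is an illustrative reconstruction, not code from the paper; the function name, cost-matrix layout, and numbers are assumptions.

```python
# Hedged sketch: minimum-expected-cost prediction from class posteriors.
# cost_matrix[i][j] = cost of predicting class i when the true class is j.

def predict_cost_sensitive(posteriors, cost_matrix):
    """Return the label whose expected misclassification cost is lowest."""
    n = len(posteriors)
    expected = [sum(cost_matrix[i][j] * posteriors[j] for j in range(n))
                for i in range(n)]
    return min(range(n), key=lambda i: expected[i])

# With a 10:1 cost ratio, an 80% "negative" posterior can still yield a
# "positive" prediction, which is the point of cost-sensitive decisions:
post = [0.8, 0.2]              # P(negative | x), P(positive | x)
costs = [[0, 10],              # missing a positive costs 10
         [1, 0]]               # a false alarm costs 1
label = predict_cost_sensitive(post, costs)
```

With uniform costs the same posteriors would give the usual maximum-probability label instead.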

2.
In this paper, we address the problem of learning aspect models with partially labeled data for the task of document categorization. The motivation of this work is to take advantage of the large amount of available unlabeled data, together with the set of labeled examples, to learn latent models whose structure and underlying hypotheses account for the document generation process more accurately than other mixture-based generative models. We present a semi-supervised variant of the Probabilistic Latent Semantic Analysis (PLSA) model (Hofmann, 2001). In our approach, we try to capture the data mislabeling errors that may occur during the training of our model. This is done by iteratively assigning class labels to unlabeled examples using the current aspect model and re-estimating the probabilities of the mislabeling errors. We perform experiments on the 20Newsgroups, WebKB, and Reuters document collections, as well as on a real-world dataset from a business group of Xerox, and show the effectiveness of our approach compared to a semi-supervised version of Naive Bayes, another semi-supervised version of PLSA, and transductive Support Vector Machines.

3.
This paper addresses classification problems in which the class membership of the training data is only partially known. Each learning sample is assumed to consist of a feature vector x_i ∈ X and an imprecise and/or uncertain "soft" label m_i, defined as a Dempster-Shafer basic belief assignment over the set of classes. This framework thus generalizes many kinds of learning problems, including supervised, unsupervised, and semi-supervised learning. Here, the feature vectors are assumed to be generated from a mixture model. Using the generalized Bayesian theorem, an extension of Bayes' theorem in the belief function framework, we derive a criterion generalizing the likelihood function. A variant of the expectation maximization (EM) algorithm dedicated to the optimization of this criterion is proposed, allowing us to compute estimates of the model parameters. Experimental results demonstrate the ability of this approach to exploit partial information about class labels.

4.
A survey of semi-supervised ensemble learning   Cited by: 3 (self-citations: 0, by others: 3)
Semi-supervised learning and ensemble learning are two very important research directions in machine learning. Semi-supervised learning focuses on exploiting both labeled and unlabeled examples to obtain high-performance classifiers, while ensemble learning aims to combine multiple learners to boost the accuracy of weak learners. Semi-supervised ensemble learning is a new machine learning approach that combines the two to improve the generalization performance of classifiers. This survey first analyzes the development of semi-supervised ensemble learning and observes that it originated from disagreement-based semi-supervised learning methods. It then reviews existing semi-supervised ensemble methods, dividing them into two broad categories, ensemble learning driven by semi-supervision and semi-supervised learning driven by ensembles, and introduces the main methods in each. Finally, it summarizes existing research and discusses problems worth future study.

5.
Extracting fuzzy classification rules from partially labeled data   Cited by: 1 (self-citations: 1, by others: 0)
The interpretability and flexibility of fuzzy if-then rules make them a popular basis for classifiers, and it is common to extract them from a database of examples. However, the data available in many practical applications are often unlabeled and must be labeled manually by the user or by expensive analyses. The idea of semi-supervised learning is to use as much labeled data as is available and to additionally exploit the information in the unlabeled data. In this paper we describe an approach to learning fuzzy classification rules from partially labeled datasets.
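To make the object of the extraction concrete, here is a minimal sketch of a fuzzy if-then rule classifier of the kind such rules feed: each rule pairs triangular membership functions with a class label, and the rule with the highest firing strength (minimum of its memberships) wins. The membership shapes, rule set, and names are illustrative assumptions, not the paper's method.

```python
# Hedged sketch of a fuzzy if-then rule classifier (winner-take-all).

def tri(a, b, c):
    """Triangular membership function rising from a, peaking at b,
    falling to zero at c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def classify(rules, x):
    """rules: list of (membership_fns, class_label); x: feature tuple.
    The firing strength of a rule is the min of its memberships; the
    class of the strongest-firing rule is returned."""
    best = max(rules,
               key=lambda r: min(mu(xi) for mu, xi in zip(r[0], x)))
    return best[1]

# Two single-feature rules: "if x is low then low", "if x is high then high".
rules = [((tri(0.0, 0.25, 0.5),), "low"),
         ((tri(0.5, 0.75, 1.0),), "high")]
label = classify(rules, (0.8,))
```

Rule extraction, semi-supervised or not, amounts to choosing the membership functions and consequents of such rules from (partially) labeled examples.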

6.
盛高斌, 姚明海. 《计算机仿真》 (Computer Simulation), 2009, 26(10): 198-201, 318
To improve learner performance on problems with only a small amount of labeled data, this paper combines semi-supervised learning with selective ensemble learning and proposes SSRES, a selective ensemble algorithm based on semi-supervised regression. Following the basic idea of semi-supervised learning, the algorithm trains learners on both labeled and unlabeled examples, reducing the need for labeled data; it then uses the selective ensemble algorithm GRES to make an appropriate selection among the learners and combines the selected results to improve generalization. Experimental results show that, on problems with scarce labeled data, the algorithm effectively improves learner performance.

7.
The problem of learning in pattern recognition using imperfectly labeled patterns is considered. Using a probabilistic model for the mislabeling of the training patterns, the author discusses the performance of the Bayes and nearest neighbor classifiers with imperfect labels. Schemes are presented for training the classifier using both parametric and nonparametric techniques, and methods are developed for the correction of imperfect labels. To give insight into the learning process, the author derives expressions for the success probability as a function of training time for a one-dimensional increment error correction classifier with imperfect labels. Feature selection with imperfectly labeled patterns is also considered.

8.
In multi-instance learning, the training set is composed of labeled bags, each consisting of many unlabeled instances; that is, an object is represented by a set of feature vectors instead of a single feature vector. Most current multi-instance learning algorithms work by adapting single-instance learning algorithms to the multi-instance representation, while this paper proposes a solution that goes the opposite way: adapting the multi-instance representation to single-instance learning algorithms. In detail, the instances of all the bags are first collected together and clustered into d groups. Each bag is then re-represented by d binary features, where the value of the ith feature is set to one if the bag has instances falling into the ith group and zero otherwise. Thus, each bag is represented by one feature vector, so that single-instance classifiers can be used to distinguish different classes of bags. By repeating the above process with different values of d, many classifiers can be generated and then combined into an ensemble for prediction. Experiments show that the proposed method works well on standard as well as generalized multi-instance problems. Zhi-Hua Zhou is currently Professor in the Department of Computer Science & Technology and head of the LAMDA group at Nanjing University. His main research interests include machine learning, data mining, information retrieval, and pattern recognition. He is associate editor of Knowledge and Information Systems and on the editorial boards of Artificial Intelligence in Medicine, International Journal of Data Warehousing and Mining, Journal of Computer Science & Technology, and Journal of Software. He has also been involved in various conferences. Min-Ling Zhang received his B.Sc. and M.Sc. degrees in computer science from Nanjing University, China, in 2001 and 2004, respectively. He is currently a Ph.D. candidate in the Department of Computer Science & Technology at Nanjing University and a member of the LAMDA group. His main research interests include machine learning and data mining, especially multi-instance learning and multi-label learning.
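The bag re-representation step described above can be sketched as follows. A trivial equal-width bucketing of one-dimensional instances stands in for the real clustering step; the function names and data are illustrative assumptions.

```python
# Hedged sketch: re-represent each multi-instance bag as a d-bit vector,
# bit i set iff the bag has an instance in instance-cluster i.

def cluster_instances(instances, d):
    """Toy stand-in for clustering: bucket 1-D instances into d
    equal-width groups. Returns a function mapping instance -> group."""
    lo, hi = min(instances), max(instances)
    width = (hi - lo) / d or 1.0
    return lambda x: min(d - 1, int((x - lo) / width))

def rerepresent(bags, d):
    """Map each bag (a list of instances) to one d-dimensional binary
    feature vector, so single-instance classifiers can be applied."""
    assign = cluster_instances([x for bag in bags for x in bag], d)
    return [[1 if any(assign(x) == i for x in bag) else 0
             for i in range(d)]
            for bag in bags]

bags = [[0.1, 0.2], [0.9], [0.1, 0.95]]
features = rerepresent(bags, d=2)   # one feature vector per bag
```

Repeating this with different values of d yields the diverse classifiers that the paper combines into an ensemble.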

9.
In this paper, we present a novel semi-supervised dimensionality reduction technique to address the problems of inefficient learning and costly computation when coping with high-dimensional data. Our method, named dual subspace projections (DSP), embeds high-dimensional data in an optimal low-dimensional space, which is learned from a few user-supplied constraints and the structure of the input data. The method projects data into two different subspaces, the kernel space and the original input space. Each projection is designed to enforce one type of constraint, and the projections in the two subspaces interact with each other to satisfy the constraints maximally and preserve the intrinsic data structure. Compared to existing techniques, our method has the following advantages: (1) it benefits from constraints even when only a few are available; (2) it is robust and free from overfitting; and (3) it handles nonlinearly separable data, yet learns a linear data transformation. Consequently, our method can easily be generalized to new data points and is efficient in dealing with large datasets. An empirical study using real data validates our claims, showing that significant improvements in learning accuracy can be obtained after DSP-based dimensionality reduction is applied to high-dimensional data.

10.
One solution to the classification problem is ensemble methods, in particular a hierarchical approach. This method is based on dynamically splitting the original problem during training into smaller subproblems that should be easier to train; the answers are then combined to obtain the final classification. The main problem is how to divide (cluster) the original problem to obtain the best possible accuracy, expressed in terms of the risk function value. The exact value for a given clustering is known only after the whole training process. In this paper we propose a risk estimation method based on the analysis of the root classifier, which makes it possible to evaluate the risks for all subproblems without any training of children classifiers. Together with some earlier theoretical results on the hierarchical approach, we show how to use the proposed method to evaluate the risk for the whole ensemble. A variant that uses a genetic algorithm (GA) is also proposed. We compare this method with an earlier one based on Bayes' law. We show that the subproblem risk evaluation is highly correlated with the true risk, and that the Bayes/GA approaches give hierarchical classifiers that are superior to single ones. Our method works for any classifier that returns a class probability vector for a given example.

11.
Previous partially supervised classification methods can partition unlabeled data into positive and negative examples for a given class by learning from positive labeled examples and unlabeled examples, but they cannot further group the negative examples into meaningful clusters even when the negative examples contain many different classes. Here we propose an automatic method to obtain a natural partitioning of mixed data (labeled data + unlabeled data) by maximizing a stability criterion, defined on the classification results of an extended label propagation algorithm, over all possible values of the model order (the number of classes) in the mixed data. Our experimental results on benchmark corpora for the word sense disambiguation task indicate that this model order identification algorithm, with the extended label propagation algorithm as the base classifier, outperforms SVM, a one-class partially supervised classification algorithm, and the model order identification algorithm with semi-supervised k-means clustering as the base classifier when labeled data is incomplete.
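The base classifier named above, label propagation, spreads labels from labeled nodes to unlabeled ones over a similarity graph until they stabilize. A minimal sketch on an unweighted graph follows; the graph, iteration count, and function name are illustrative assumptions (the paper uses an extended variant).

```python
# Hedged sketch of plain label propagation: unlabeled nodes repeatedly
# average their neighbors' class scores; labeled nodes stay clamped.

def propagate(adj, labels, n_classes, iters=50):
    """adj: adjacency lists; labels: per-node class or None if unlabeled.
    Returns a hard label for every node."""
    # One-hot scores for labeled nodes, uniform scores for unlabeled ones.
    score = [[1.0 if labels[v] == c else 0.0 for c in range(n_classes)]
             if labels[v] is not None else [1.0 / n_classes] * n_classes
             for v in range(len(adj))]
    for _ in range(iters):
        for v in range(len(adj)):
            if labels[v] is not None:          # clamp labeled nodes
                continue
            nbrs = adj[v]
            score[v] = [sum(score[u][c] for u in nbrs) / len(nbrs)
                        for c in range(n_classes)]
    return [max(range(n_classes), key=lambda c: s[c]) for s in score]

# A 4-node chain 0-1-2-3 with only the two ends labeled:
adj = [[1], [0, 2], [1, 3], [2]]
pred = propagate(adj, labels=[0, None, None, 1], n_classes=2)
```

Each interior node ends up with the label of the nearer labeled endpoint, which is the behavior the stability criterion above is computed over.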

12.
古平, 朱庆生. 《计算机科学》 (Computer Science), 2006, 33(4): 159-161
Both Boosting and Bagging must cache large amounts of data when learning a classifier ensemble from a continuous sample set, which is infeasible for very large sample sets. This paper proposes BEPOL, an online learning algorithm for Bayesian ensembles that retains Boosting's weighted-sampling idea while requiring only a single scan of the sample set to update the Bayesian ensemble online. To address the long training time and weak member diversity of sequential training, the algorithm adopts parallel learning, mapping the Bayesian components onto a parallel computing structure to improve the efficiency of ensemble learning. Experiments on UCI data sets show that BEPOL achieves classification performance close to that of batch learning with lower time overhead, making it especially effective for applications with time and space constraints, such as large or continuous data sets.

13.
Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering, but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to improve not only the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative-style algorithms have running times that are quadratic (or faster-growing) functions of the number of instances to be clustered. We present several bounds on the improvement in running time obtainable using constraints. A preliminary version of this paper appeared as Davidson and Ravi (2005b).
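The instance-level constraint mechanics behind such algorithms can be sketched as follows: must-link pairs are merged up front, and a candidate agglomerative merge is rejected if it would join a cannot-link pair. This illustrates the constraint checks only, not the paper's complexity bounds; all names and data are illustrative.

```python
# Hedged sketch of must-link / cannot-link handling in agglomerative
# clustering.

def apply_must_links(points, must_links):
    """Start each point as a singleton cluster, then union every
    must-link pair into the same cluster."""
    clusters = [{p} for p in points]
    for a, b in must_links:
        ca = next(c for c in clusters if a in c)
        cb = next(c for c in clusters if b in c)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb
    return clusters

def violates_cannot_link(c1, c2, cannot_links):
    """True iff merging clusters c1 and c2 would place some
    cannot-link pair (a, b) into the same cluster."""
    merged = set(c1) | set(c2)
    return any(a in merged and b in merged for a, b in cannot_links)

clusters = apply_must_links([0, 1, 2, 3], must_links=[(0, 1)])
ok = violates_cannot_link({0, 1}, {2}, cannot_links=[(1, 2)])
```

Pruning infeasible merges in this way is also what can shrink the search space and hence the running time of agglomerative algorithms.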

14.
A data driven ensemble classifier for credit scoring analysis   Cited by: 2 (self-citations: 0, by others: 2)
This study focuses on predicting whether a credit applicant can be categorized as good, bad, or borderline from the information initially supplied; this is essentially a classification task for credit scoring. Given its importance, many researchers have recently worked on ensembles of classifiers. However, to the best of our knowledge, unrepresentative samples drastically reduce the accuracy of the deployed classifier, and few have attempted to preprocess the input samples into more homogeneous cluster groups and then fit the ensemble classifier accordingly. For this reason, we introduce the concept of class-wise classification as a preprocessing step in order to obtain an efficient ensemble classifier; this strategy works better than a direct ensemble of classifiers without the preprocessing step. The proposed ensemble classifier is constructed by incorporating several data mining techniques, mainly optimal associate binning to discretize continuous values, while a neural network, a support vector machine, and a Bayesian network are used to augment the ensemble classifier. In particular, the Markov blanket concept of the Bayesian network allows a natural form of feature selection, which provides a basis for mining association rules. The learned knowledge is represented in multiple forms, including causal diagrams and constrained association rules. The data-driven nature of the proposed system distinguishes it from existing hybrid/ensemble credit scoring systems.

15.
This article addresses the problem of identifying the most likely music performer, given a set of performances of the same piece by a number of skilled candidate pianists. We propose a set of very simple features for representing stylistic characteristics of a music performer, introducing ‘norm-based’ features that relate to a kind of ‘average’ performance. A database of piano performances of 22 pianists playing two pieces by Frédéric Chopin is used in the presented experiments. Due to the limitations of the training set size and the characteristics of the input features we propose an ensemble of simple classifiers derived by both subsampling the training set and subsampling the input features. Experiments show that the proposed features are able to quantify the differences between music performers. The proposed ensemble can efficiently cope with multi-class music performer recognition under inter-piece conditions, a difficult musical task, displaying a level of accuracy unlikely to be matched by human listeners (under similar conditions).

16.
Visual categorization problems, such as object classification or action recognition, are increasingly often approached using a detection strategy: a classifier function is first applied to candidate subwindows of the image or the video, and then the maximum classifier score is used for class decision. Traditionally, the subwindow classifiers are trained on a large collection of examples manually annotated with masks or bounding boxes. The reliance on time-consuming human labeling effectively limits the application of these methods to problems involving very few categories. Furthermore, the human selection of the masks introduces arbitrary biases (e.g., in terms of window size and location) which may be suboptimal for classification. We propose a novel method for learning a discriminative subwindow classifier from examples annotated with binary labels indicating the presence of an object or action of interest, but not its location. During training, our approach simultaneously localizes the instances of the positive class and learns a subwindow SVM to recognize them. We extend our method to classification of time series by presenting an algorithm that localizes the most discriminative set of temporal segments in the signal. We evaluate our approach on several datasets for object and action recognition and show that it achieves results similar and in many cases superior to those obtained with full supervision.

17.
In software defect prediction, a shortage of labeled examples and class imbalance both degrade prediction results. To address these problems, this paper proposes a software defect prediction method based on semi-supervised ensemble learning. The method exploits the large amount of available unlabeled examples to obtain a better classifier, while ensembling a series of weak classifiers to reduce the bias that majority-class data introduce into predictions. To account for the cost of prediction risk, the paper also adopts a strategy for updating the weight vector of the training set, lowering the risk of predicting defective modules as defect-free. Comparative experiments on the NASA MDP data sets show that the proposed method achieves good prediction performance.

18.
The concept lattice (Galois lattice) is an effective tool for classification learning, but the sheer size of the constructed lattice substantially hurts classification efficiency and accuracy. Applying rough-set theory to concept lattice classification, this paper proposes CACLR, a new integrated learning model for dynamically building approximate concept lattices and mining classifications. Within a roughness interval, the model builds several relatively independent and fairly precise approximate concept lattice classifiers according to the distribution of the sample space, promptly eliminating the many nodes irrelevant to classification knowledge that arise during lattice construction and effectively shrinking the original lattice. The fused classification mining ensemble model attains good rough classification accuracy and knowledge prediction ability. Finally, comparative experiments with the CACLR ensemble model on standard UCI data sets validate its practical value.

19.
Clustering ensemble integrates multiple base clustering results to obtain a consensus result and thus improves the stability and robustness of a single clustering method. Since it is natural to use a hypergraph to represent multiple base clustering results, where instances are represented by nodes and base clusters by hyperedges, several hypergraph-based clustering ensemble methods have been proposed. Conventional hypergraph-based methods obtain the final consensus result by partitioning a pre-defined static hypergraph. However, since base clusters may be imperfect due to the unreliability of the base clustering methods, the pre-defined hypergraph constructed from the base clusters is also unreliable, and directly obtaining the final clustering result by partitioning this unreliable hypergraph is inappropriate. To tackle this problem, in this paper we propose a clustering ensemble method via structured hypergraph learning: instead of being constructed directly, the hypergraph is dynamically learned from the base results, which is more reliable. Moreover, when dynamically learning the hypergraph, we enforce a clear clustering structure, which is more appropriate for clustering tasks, so that no uncertain postprocessing, such as hypergraph partitioning, is needed. Extensive experiments show that our method not only performs better than the conventional hypergraph-based ensemble methods, but also outperforms the state-of-the-art clustering ensemble methods.
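The hypergraph view of a clustering ensemble can be made concrete with the usual binary incidence matrix: each base cluster becomes a hyperedge, and H[i][e] = 1 iff instance i belongs to hyperedge e. The sketch below shows only this conventional static construction (names and data are illustrative); the paper's contribution is to learn a structured hypergraph instead of fixing H this way.

```python
# Hedged sketch: incidence matrix of the hypergraph built from several
# base clusterings (instances = nodes, base clusters = hyperedges).

def incidence_matrix(base_clusterings, n_instances):
    """base_clusterings: one label list per base clustering method.
    Returns (H, n_hyperedges), where the columns of H enumerate all
    base clusters across all base results."""
    hyperedges = []
    for labels in base_clusterings:
        for cluster_id in sorted(set(labels)):
            hyperedges.append({i for i, lab in enumerate(labels)
                               if lab == cluster_id})
    H = [[1 if i in edge else 0 for edge in hyperedges]
         for i in range(n_instances)]
    return H, len(hyperedges)

# Two base clusterings of 4 instances, giving 4 hyperedges in total:
H, m = incidence_matrix([[0, 0, 1, 1], [0, 1, 1, 1]], n_instances=4)
```

When a base clustering is wrong, the corresponding columns of this fixed H are wrong too, which is exactly the unreliability the structured learning approach is meant to avoid.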

20.
Evolutionary semi-supervised fuzzy clustering   Cited by: 3 (self-citations: 0, by others: 3)
To learn a classifier from labeled and unlabeled data, this paper proposes an evolutionary semi-supervised fuzzy clustering algorithm. The class-label information provided by the labeled data is used to guide the evolution of each fuzzy partition of the unlabeled data, which plays the role of a chromosome. The fitness of each chromosome is evaluated with a combination of the fuzzy within-cluster variance of the unlabeled data and the misclassification error on the labeled data. The structure of the resulting clusters can be used to classify a future new pattern. The performance of the proposed approach is evaluated using two benchmark data sets. Experimental results indicate that the proposed approach can improve classification accuracy significantly compared to a classifier trained with only a small number of labeled data, and that it outperforms the similar approach SSFCM.
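The fitness described above can be sketched for one-dimensional data as follows. The weight `alpha`, the fuzzifier `m`, and the nearest-center rule for labeled points are illustrative assumptions; the chromosome is a fuzzy membership matrix, as in the paper.

```python
# Hedged sketch: fitness = fuzzy within-cluster variance of unlabeled
# data + alpha * misclassification error on labeled data (lower = better).

def fitness(memberships, unlabeled, centers, labeled, alpha=1.0, m=2.0):
    """memberships[i][k]: degree of unlabeled point i in cluster k.
    labeled: list of (point, true_cluster) pairs."""
    # Fuzzy within-cluster variance of the unlabeled data.
    var = sum(memberships[i][k] ** m * (unlabeled[i] - centers[k]) ** 2
              for i in range(len(unlabeled))
              for k in range(len(centers)))
    # Misclassification error: labeled points go to the nearest center.
    errors = sum(1 for x, true_k in labeled
                 if min(range(len(centers)),
                        key=lambda k: abs(x - centers[k])) != true_k)
    return var + alpha * errors

u = [[0.9, 0.1], [0.2, 0.8]]        # fuzzy partition (the chromosome)
f = fitness(u, unlabeled=[0.1, 0.9], centers=[0.0, 1.0],
            labeled=[(0.05, 0), (0.95, 1)])
```

In the evolutionary loop, chromosomes with lower fitness survive; a mislabeled-looking partition is penalized by the error term even when its variance term is small.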
