Text Classification from Labeled and Unlabeled Documents using EM
Authors: Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, Tom Mitchell
Affiliation: (1) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (2) Just Research, 4616 Henry Street, Pittsburgh, PA 15213, USA; (3) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (4) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA; (5) School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
Abstract: This paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. This is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. We introduce an algorithm for learning from labeled and unlabeled documents based on the combination of Expectation-Maximization (EM) and a naive Bayes classifier. The algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. It then trains a new classifier using the labels for all the documents, and iterates to convergence. This basic EM procedure works well when the data conform to the generative assumptions of the model. However, these assumptions are often violated in practice, and poor performance can result. We present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. Experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%.
Keywords: text classification; Expectation-Maximization; integrating supervised and unsupervised learning; combining labeled and unlabeled data; Bayesian learning
This article is indexed in SpringerLink and other databases.
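The procedure outlined in the abstract (train on the labeled documents, probabilistically label the unlabeled ones, retrain on everything, repeat) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a multinomial naive Bayes model with Laplace smoothing and bag-of-words count matrices. The function names `train_nb`, `posterior`, and `em_naive_bayes` are hypothetical, `lam` stands in for the weighting factor on the unlabeled data, and the stopping rule (change in posteriors) is a simplification chosen for brevity. The second extension, multiple mixture components per class, is not covered.

```python
import numpy as np

def train_nb(X, R, alpha=1.0):
    """M-step: fit multinomial naive Bayes from soft class memberships.

    X: (n_docs, n_words) word counts.
    R: (n_docs, n_classes) membership weights (one-hot rows for labeled docs).
    alpha: Laplace smoothing constant (an assumption of this sketch).
    """
    class_mass = R.sum(axis=0)
    log_prior = np.log((class_mass + alpha) /
                       (class_mass.sum() + alpha * R.shape[1]))
    word_counts = R.T @ X + alpha                      # (n_classes, n_words)
    log_word = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_word

def posterior(X, log_prior, log_word):
    """E-step: P(class | document) for every row of X."""
    log_joint = X @ log_word.T + log_prior
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(log_joint)
    return p / p.sum(axis=1, keepdims=True)

def em_naive_bayes(X_lab, y_lab, X_unl, n_classes, lam=1.0, n_iter=20, tol=1e-6):
    """Semi-supervised naive Bayes via EM.

    lam in [0, 1] down-weights the unlabeled documents; lam=0 reduces to
    ordinary supervised naive Bayes on the labeled documents alone.
    """
    R_lab = np.eye(n_classes)[y_lab]        # labeled docs keep fixed labels
    params = train_nb(X_lab, R_lab)         # initial classifier, labeled only
    X_all = np.vstack([X_lab, X_unl])
    prev = None
    for _ in range(n_iter):
        R_unl = posterior(X_unl, *params)                # E-step
        R_all = np.vstack([R_lab, lam * R_unl])
        params = train_nb(X_all, R_all)                  # M-step
        # Simplified stopping rule: posteriors stopped changing.
        if prev is not None and np.abs(R_unl - prev).max() < tol:
            break
        prev = R_unl
    return params

# Toy usage: 4-word vocabulary, 2 classes, 2 labeled and 4 unlabeled documents.
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
y_lab = [0, 1]
X_unl = np.array([[2, 2, 0, 0], [4, 0, 1, 0], [0, 1, 3, 2], [0, 0, 1, 4]])
log_prior, log_word = em_naive_bayes(X_lab, y_lab, X_unl, n_classes=2, lam=0.5)
print(posterior(np.array([[5, 0, 0, 0]]), log_prior, log_word))
```

Labeled documents keep their given labels in every iteration, and only the unlabeled rows are re-estimated. Scaling the unlabeled memberships by `lam` before the M-step is one simple way to keep a large unlabeled pool from dominating the parameter estimates when the model's generative assumptions fit the data poorly.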