
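A Closeness-Based Semi-Supervised Text Classification Method (一种基于紧密度的半监督文本分类方法)
Cite this article: ZHENG Hai-qing, LIN Chen, NIU Jun-yu. 一种基于紧密度的半监督文本分类方法 [J]. Journal of Chinese Information Processing (中文信息学报), 2007, 21(3): 54-60.
Authors: ZHENG Hai-qing (郑海清), LIN Chen (林琛), NIU Jun-yu (牛军钰)
Affiliation: Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China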
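Abstract (translated from Chinese): Automatic text classification has become an important research topic. In practice, many training corpora contain only a limited set of positive examples, and the numbers of positive examples and unlabeled documents in the corpus are usually imbalanced. This kind of task therefore differs from traditional text classification, and traditional text classifiers applied directly to it rarely achieve satisfactory results. This paper proposes a method based on a closeness measure to address this class of problems. Because no negative documents are labeled, the method first extracts a set of reliable negative examples, then expands that set according to the closeness measure. The result is a training set containing both positive and negative examples, which improves classifier performance. The method needs no special external knowledge base for feature extraction, so it transfers well to different classification settings. Experiments on the text categorization task corpus of the Genomics track at TREC'05 (the Text REtrieval Conference) show that the algorithm achieves excellent results on semi-supervised text classification.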

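Keywords: computer applications  Chinese information processing  text classification  semi-supervised machine learning  support vector machine  closeness
Article ID: 1003-0077(2007)03-0054-07
Received: 2006-05-27
Revised: 2007-01-31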

A Closeness-Based Semi-Supervised Text Classification Method
ZHENG Hai-qing, LIN Chen, NIU Jun-yu. A Closeness-Based Semi-Supervised Text Classification Method [J]. Journal of Chinese Information Processing, 2007, 21(3): 54-60.
Authors: ZHENG Hai-qing  LIN Chen  NIU Jun-yu
Affiliation: Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China
Abstract: Automatic text categorization has become an important research area. In most applications, the training set contains only a positive document set of limited size together with a large portion of unlabeled data, and the positive and negative sets are also imbalanced in size. This kind of task therefore differs from traditional text categorization, whose training sets contain both labeled positive and labeled negative samples, and traditional classification methods cannot be applied to it directly. This paper proposes a closeness-based method for this semi-supervised text categorization problem. It first extracts a reliable negative set from the unlabeled set, then uses a closeness-based algorithm to enlarge the initially extracted reliable negative set to a proper size. The classifier is then constructed from the labeled positive set and the extracted negative set. The method improves classifier performance without any outside resources to help feature selection, so it can be used in many semi-supervised text categorization tasks across different domains. Experiments on the TREC'05 Genomics track data show that the algorithm performs well on this kind of text categorization task.
Keywords: computer application  Chinese information processing  text categorization  semi-supervised learning  support vector machine  closeness
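
The two-stage construction of the negative set described in the abstract can be sketched roughly as follows. The abstract does not define the closeness measure or the rule for picking reliable negatives, so this sketch assumes cosine similarity to class centroids for both. The names `vectorize`, `centroid`, `cosine`, and `build_training_set`, and the parameters `seed_size` and `margin`, are hypothetical and chosen only for illustration. It is not the authors' algorithm.

```python
import math
from collections import Counter


def vectorize(text):
    """L2-normalized bag-of-words vector (an assumed document representation)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_training_set(positive, unlabeled, seed_size=1, margin=0.1):
    """Return sorted indices of unlabeled documents treated as negatives."""
    pos_c = centroid([vectorize(d) for d in positive])
    unl = [vectorize(d) for d in unlabeled]

    # Stage 1 (assumed rule): seed the reliable negative set with the
    # unlabeled documents least similar to the positive centroid.
    order = sorted(range(len(unl)), key=lambda i: cosine(unl[i], pos_c))
    negatives = set(order[:seed_size])

    # Stage 2 (assumed rule): grow the set with documents that are closer
    # to the current negative centroid than to the positive centroid by
    # at least `margin`, repeating until no document is added.
    changed = True
    while changed:
        changed = False
        neg_c = centroid([unl[i] for i in sorted(negatives)])
        for i in range(len(unl)):
            if i in negatives:
                continue
            if cosine(unl[i], neg_c) - cosine(unl[i], pos_c) > margin:
                negatives.add(i)
                changed = True
    return sorted(negatives)


# Toy data, invented for illustration only.
positive = ["gene mutation mouse tumor", "mouse gene expression tumor"]
unlabeled = [
    "gene expression mouse embryo",
    "stock market price index",
    "market price inflation index",
    "tumor gene mouse mutation",
]
negatives = build_training_set(positive, unlabeled)
print(negatives)
```

A classifier, such as the support vector machine named in the keywords, would then be trained on the labeled positives plus the documents selected here. That step is omitted to keep the sketch free of external dependencies.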
This article is indexed in databases including CNKI, VIP (维普), and Wanfang Data (万方数据).