
A Closeness-Based Semi-Supervised Text Classification Method (一种基于紧密度的半监督文本分类方法)

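Cite this article:	ZHENG Hai-qing, LIN Chen, NIU Jun-yu. A Closeness-Based Semi-Supervised Text Classification Method [J]. Journal of Chinese Information Processing, 2007, 21(3): 54-60.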

Authors:	ZHENG Hai-qing, LIN Chen, NIU Jun-yu

Affiliation:	Department of Computer Science and Engineering, Fudan University, Shanghai 200433

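Abstract:	Automatic text classification has become an important research topic. In practice, many training corpora contain only a limited set of positive examples, and the numbers of positive and unlabeled documents in the corpus are usually imbalanced. This kind of task therefore differs from traditional text classification, and traditional text classifiers applied directly to it rarely achieve satisfactory results. This paper proposes a method based on a closeness measure to address this class of problems. Because no labeled negative documents are available, the method first extracts a set of reliable negative examples, then expands this negative set according to the closeness measure, yielding a training set that contains both positive and negative examples and thereby improving classifier performance. The method does not rely on any special external knowledge base for feature extraction, so it can be applied readily across different classification settings. Experiments on the text categorization task corpus of the TREC'05 (Text REtrieval Conference) Genomics track show that the algorithm achieves excellent results on semi-supervised text classification problems.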
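Keywords:	computer application; Chinese information processing; text classification; semi-supervised machine learning; support vector machine; closeness
Article ID:	1003-0077(2007)03-0054-07
Received:	2006-05-27
Revised:	2007-01-31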
A Closeness-Based Semi-Supervised Text Classification Method

ZHENG Hai-qing, LIN Chen, NIU Jun-yu. A Closeness-Based Semi-Supervised Text Classification Method [J]. Journal of Chinese Information Processing, 2007, 21(3): 54-60.

Authors:	ZHENG Hai-qing, LIN Chen, NIU Jun-yu

Affiliation:	Department of Computer Science and Engineering, Fudan University, Shanghai 200433, China

Abstract:	Automatic text categorization has become a very important research area. In most applications, the training set contains only a positive document set of limited size and a large portion of unlabeled data, and the numbers of positive and negative documents are unbalanced. This kind of text categorization task therefore differs from traditional ones, whose training sets contain both labeled positive and labeled negative samples, and traditional classification methods cannot be used directly in such tasks. This paper proposes a closeness-based method to solve this semi-supervised text categorization problem. It first extracts a reliable negative set from the unlabeled set, and then uses the closeness-based algorithm to enlarge the initially extracted reliable negative set to a proper size. The classifier is then constructed from the labeled positive set and the extracted negative set. The method improves classifier performance without any outside resources to help feature selection, so it can be used in many semi-supervised text categorization tasks in different domains. Experiments on TREC'05 Genomics track data show that the algorithm performs well on this kind of text categorization task.

Keywords:	computer application; Chinese information processing; text categorization; semi-supervised learning; support vector machine; closeness
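
The two-step procedure named in the abstract (extract a reliable negative set, then enlarge it by closeness) might be sketched roughly as follows. The abstract does not define the closeness measure, so this sketch assumes cosine similarity between bag-of-words vectors and class centroids; the function names, `seed_size`, and `margin` are hypothetical and are not taken from the paper. Training the final classifier (the keywords mention a support vector machine) is left out.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (lowercased, whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Sum of term-count vectors; cosine ignores scale, so no averaging."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def build_training_set(positives, unlabeled, seed_size=2, margin=0.0):
    """Return (positives, negatives) built from positive and unlabeled texts.

    ASSUMPTION: "closeness" is approximated here by cosine similarity to
    the class centroids. The paper's actual measure may differ.
    """
    pos_c = centroid([vectorize(d) for d in positives])
    unl_vecs = [vectorize(d) for d in unlabeled]

    # Step 1: take the unlabeled documents least similar to the positive
    # centroid as the initial "reliable" negatives.
    order = sorted(range(len(unl_vecs)),
                   key=lambda i: cosine(unl_vecs[i], pos_c))
    negatives = set(order[:seed_size])

    # Step 2: repeatedly add unlabeled documents that are closer to the
    # current negative centroid than to the positive one.
    changed = True
    while changed:
        changed = False
        neg_c = centroid([unl_vecs[i] for i in negatives])
        for i, v in enumerate(unl_vecs):
            if i in negatives:
                continue
            if cosine(v, neg_c) - cosine(v, pos_c) > margin:
                negatives.add(i)
                changed = True

    return positives, [unlabeled[i] for i in sorted(negatives)]
```

Documents that resemble neither class stay unlabeled under this rule, which keeps the enlarged negative set conservative; raising `margin` makes the expansion stricter.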
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.