
Evaluation of Feature Selection for Chinese Text Categorization (中文文本分类的特征选取评价)

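Cite this article:	SUN Guo-ju (孙国菊), ZHANG Jie (张杰). Evaluation of Feature Selection for Chinese Text Categorization [J]. Journal of Harbin University of Science and Technology (哈尔滨理工大学学报), 2005, 10(1): 76-78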

Author names:	SUN Guo-ju (孙国菊), ZHANG Jie (张杰)

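Affiliations:	Liaoning Information Vocational Technical College (辽宁信息职业技术学院), Liaoyang, Liaoning 111000; Operations Research Teaching and Research Group, PLA Artillery Academy (解放军炮兵学院), Hefei, Anhui 230031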

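Abstract:	Building on a comprehensive assessment of feature selection for Chinese text categorization, this paper evaluates five currently popular feature selection methods: document frequency (DF), mutual information (MI), information gain (IG), the χ² statistic, and term strength (TS). Naive Bayes is used as the text classifier, and classification is evaluated on a Chinese text categorization corpus. The experimental results show that DF and χ² perform very similarly and at a good level, that TS performs slightly worse, and that IG and MI lag well behind the other methods. MI and IG do especially poorly when the number of features is small. With 1000 features, the F1 score is 64.60% for MI and 69.36% for IG, whereas DF reaches 87.01%.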
Keywords:	text categorization; feature selection; text representation
Article ID:	1007-2683(2005)01-0076-03
Revision date:	2004-06-04
An Evaluation of Feature Selection Methods for Text Categorization

SUN Guo-ju, ZHANG Jie. An Evaluation of Feature Selection Methods for Text Categorization[J]. Journal of Harbin University of Science and Technology, 2005, 10(1): 76-78

Authors:	SUN Guo-ju, ZHANG Jie

Abstract:	This paper evaluates five feature selection methods for text categorization: Document Frequency (DF), Mutual Information (MI), Information Gain (IG), the χ² statistic (χ²), and Term Strength (TS). We use naive Bayes as the text classifier and conduct the experiments on a Chinese text corpus. The experimental results show that DF and χ² are the top performers in this evaluation. In contrast, IG and MI perform worse. In particular, MI and IG perform poorly when the feature size is small. When the feature size is 1000, MI yields an F1 of 64.60% and IG 69.36%, while DF reaches 87.01%.

Keywords:	text categorization; feature selection; text representation
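
The abstract names the scoring methods without giving their formulas. As a rough illustration only, the sketch below computes the two best-performing scores, DF and χ², assuming the common textbook definitions (a per-term document count, and the χ² statistic of a 2×2 term/category contingency table). The paper's exact variants are not stated in this record, and the toy corpus and the function names `document_frequency`, `chi_square`, and `select_top_k` are hypothetical.

```python
from collections import Counter

# Hypothetical toy corpus: (tokens, category label) pairs, already segmented.
docs = [
    (["football", "match", "goal"], "sports"),
    (["football", "team", "coach"], "sports"),
    (["stock", "market", "bank"], "finance"),
    (["bank", "loan", "market"], "finance"),
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def chi_square(docs, term, category):
    """Chi-square statistic of a term/category pair (assumed textbook form)."""
    a = b = c = d = 0
    for tokens, label in docs:
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1  # term present, document in category
        elif has_term:
            b += 1  # term present, document in another category
        elif in_cat:
            c += 1  # term absent, document in category
        else:
            d += 1  # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or category covers all or no documents
    return n * (a * d - c * b) ** 2 / denom


def select_top_k(scores, k):
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _score in ranked[:k]]


df_scores = document_frequency(docs)
top_terms = select_top_k(df_scores, 3)
football_chi = chi_square(docs, "football", "sports")
```

In a setup like the one the abstract describes, the selected terms would then form the feature set given to the naive Bayes classifier, with the feature count (for example 1000) controlled by `k`.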
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.
