
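An Evaluation of Feature Selection Methods for Chinese Text Categorization
Cite this article:SUN Guo-ju, ZHANG Jie. An Evaluation of Feature Selection Methods for Chinese Text Categorization[J]. Journal of Harbin University of Science and Technology, 2005, 10(1): 76-78
Authors:SUN Guo-ju  ZHANG Jie
Affiliations:Liaoning Information Vocational Technical College, Liaoyang, Liaoning 111000; Operations Research Teaching and Research Group, PLA Artillery Academy, Hefei, Anhui 230031
Abstract:Building on a comprehensive review of feature selection methods for Chinese text categorization, this paper evaluates five currently popular methods: document frequency (DF), mutual information (MI), information gain (IG), the χ² statistic, and term strength (TS). Naive Bayes is used as the text classifier, and classification is evaluated on a Chinese text categorization corpus. The experimental results show that DF and χ² perform very similarly, both at a good level; TS performs slightly worse; and IG and MI lag considerably behind the other methods. MI and IG do especially poorly when the number of features is small: with 1000 features, the F1 value is 64.60% for MI and 69.36% for IG, whereas DF reaches 87.01%.

Keywords:text categorization  feature selection  text representation
Article ID:1007-2683(2005)01-0076-03
Revised:2004-06-04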

An Evaluation of Feature Selection Methods for Text Categorization
SUN Guo-ju,ZHANG Jie. An Evaluation of Feature Selection Methods for Text Categorization[J]. Journal of Harbin University of Science and Technology, 2005, 10(1): 76-78
Authors:SUN Guo-ju  ZHANG Jie
Abstract:This paper evaluates five feature selection methods for text categorization: Document Frequency (DF), Mutual Information (MI), Information Gain (IG), the χ² statistic, and Term Strength (TS). Naive Bayes is used as the text classifier, and the experiments are conducted on a Chinese text corpus. The experimental results show that DF and χ² are the top performers in this evaluation, whereas IG and MI perform noticeably worse. In particular, MI and IG perform poorly when the feature set is small: with 1000 features, MI yields an F1 of 64.60% and IG 69.36%, while DF reaches 87.01%.
Keywords:text categorization  feature selection  text representation
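
To make two of the evaluated criteria concrete, the following is a minimal sketch of how DF and χ² scores could be computed on a tokenized, labeled corpus. It is not the authors' implementation: the toy corpus, the function names, and the choice to aggregate χ² by taking the maximum over classes are all illustrative assumptions.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def chi_square(docs, labels, term, cls):
    """Chi-square statistic of `term` versus class `cls` (2x2 contingency table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels):
    """Score every term by its maximum chi-square over all classes (an assumption)."""
    classes = set(labels)
    vocab = {t for tokens in docs for t in tokens}
    return {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical toy corpus: already-segmented documents with class labels.
docs = [
    ["ball", "team", "win"],
    ["team", "score"],
    ["stock", "market", "win"],
    ["market", "price"],
]
labels = ["sports", "sports", "finance", "finance"]

print(select_top_k(document_frequency(docs), 3))
print(select_top_k(chi_square_scores(docs, labels), 3))
```

On this toy data the two criteria already rank terms differently: "win" is frequent but occurs equally in both classes, so DF keeps it while χ² scores it at zero. The selected terms would then be used as the feature set for a classifier such as Naive Bayes.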
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.