A Comparative Study on Feature Selection in Chinese Text Categorization (中文文本分类中特征抽取方法的比较研究)

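Cite this article:	代六玲, 黄河燕, 陈肇雄. 中文文本分类中特征抽取方法的比较研究 [J]. 中文信息学报, 2004, 18(1): 27-33.

Authors:	代六玲 (DAI Liu-ling), 黄河燕 (HUANG He-yan), 陈肇雄 (CHEN Zhao-xiong)

Affiliation:	1. Department of Computer Science, Nanjing University of Science and Technology; 2. Research Center of Computer and Language Information Engineering, Chinese Academy of Sciences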

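Abstract:	This paper compares how feature selection methods affect classification performance in Chinese text categorization. Four feature selection methods are examined: document frequency (DF), information gain (IG), mutual information (MI), and the χ² statistic (CHI). Two classifiers, a support vector machine (SVM) and KNN, are used to assess the effectiveness of each method. The experimental results show that the feature selection methods that perform well in English text categorization (IG, MI, and CHI) are not suitable for Chinese text categorization without modification. The paper analyzes the causes of this difference theoretically and discusses possible remedies, including using a very large training corpus and using a combined feature selection method. Finally, experiments verify the effectiveness of the combined feature selection method.
Keywords:	computer application; Chinese information processing; automatic text categorization; feature selection; support vector machine; KNN
Article ID:	1003-0077(2004)01-0026-07
Revised:	September 22, 2003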
A Comparative Study on Feature Selection in Chinese Text Categorization

DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong. A Comparative Study on Feature Selection in Chinese Text Categorization [J]. Journal of Chinese Information Processing, 2004, 18(1): 27-33.

Authors:	DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong

Affiliation:	1. Department of Computer Science, NUST; 2. Language Information Engineering, CAS

Abstract:	This paper is a comparative study of feature selection methods in Chinese text categorization. Four methods were evaluated: document frequency (DF), information gain (IG), mutual information (MI), and the χ² test (CHI). A support vector machine (SVM) and a k-nearest-neighbor (KNN) classifier were used to evaluate them. We found that IG, MI, and CHI performed poorly in our tests, although they perform well in English text categorization. We analyze the reasons theoretically and put forward possible solutions. A further experiment showed that the combined feature selection method is effective.

Keywords:	KNN
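
For readers unfamiliar with the four criteria named in the abstract, the following is a minimal sketch of how DF, IG, MI, and CHI are commonly computed for one term and one category from a 2x2 table of document counts. It uses the standard textbook definitions in their two-class form, not necessarily the exact variants used in the paper; the function name `feature_scores` and the counts in the usage lines are hypothetical.

```python
import math


def feature_scores(a, b, c, d):
    """Score one term against one category from a 2x2 document-count table.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    (Illustrative helper; standard two-class definitions are assumed.)
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the contingency table is empty")

    # Document frequency: number of documents containing the term.
    df = a + b

    # Mutual information between term and category;
    # -inf when the term never occurs inside the category.
    mi = math.log(a * n / ((a + c) * (a + b))) if a > 0 else float("-inf")

    # Chi-square statistic of the 2x2 table.
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    chi = n * (a * d - c * b) ** 2 / denom if denom else 0.0

    def entropy(*counts):
        # Shannon entropy (natural log) of a count distribution.
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum(k / total * math.log(k / total) for k in counts if k > 0)

    # Information gain: class entropy minus class entropy
    # conditioned on presence or absence of the term.
    ig = (entropy(a + c, b + d)
          - (a + b) / n * entropy(a, b)
          - (c + d) / n * entropy(c, d))

    return {"DF": df, "IG": ig, "MI": mi, "CHI": chi}


# Hypothetical counts: a term seen once vs. a frequent, discriminative term.
rare = feature_scores(1, 0, 99, 900)
common = feature_scores(80, 20, 20, 880)
print(rare["MI"], common["MI"])    # log(10) ≈ 2.30 vs. log(8) ≈ 2.08
print(rare["CHI"], common["CHI"])
```

With these made-up counts, MI ranks the term that appears in a single document above the frequent, clearly discriminative one, while CHI and IG rank them the other way. This sensitivity of MI to rare terms is a generally known property of the measure; it is offered here only as an illustration and is not taken from the paper's own analysis.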
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.