
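Research on a Feature Selection Method for Text Classification
Cite this article: Zhao Jing, Shao Xiongkai, Liu Jianzhou, Wang Chunzhi. Research on a feature selection method for text classification[J]. Application Research of Computers (计算机应用研究), 2019, 36(8).
Authors: Zhao Jing, Shao Xiongkai, Liu Jianzhou, Wang Chunzhi
Affiliation: School of Computer Science, Hubei University of Technology, Wuhan 430068 (all four authors)
Funding: General Program of the National Natural Science Foundation of China (61772180)
Abstract: This paper analyzes the shortcomings of two traditional feature selection methods used in text classification, the chi-square statistic and information gain. It concludes that the key to feature selection in text classification is to choose feature words that are concentrated in one class of documents and that appear frequently and are evenly distributed within that class. Accordingly, taking into account the document frequency and term frequency of a feature word together with its inter-class concentration and intra-class dispersion, the paper proposes a feature selection evaluation function based on within-class and between-class document frequency and term frequency statistics. The function is used to select a certain proportion of feature words from each class of the training set to form that class's feature word library, and the feature word library of the training set is the union of the per-class libraries. SVM-based Chinese text classification experiments show that, compared with the traditional chi-square statistic and information gain, the method improves text classification performance to a certain extent.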

Keywords: text classification; feature selection; dispersion; concentration; frequency
Received: 2018/1/31
Revised: 2019/6/28

Study on feature selection method in text classification
Zhao Jing, Shao Xiongkai, Liu Jianzhou and Wang Chunzhi. Study on feature selection method in text classification[J]. Application Research of Computers, 2019, 36(8).
Authors: Zhao Jing, Shao Xiongkai, Liu Jianzhou and Wang Chunzhi
Affiliation: School of Computer Science, Hubei University of Technology
Abstract: The traditional feature selection methods used in text classification, the chi-square test and information gain, have inherent defects. This paper analyzes those defects and finds that the key to feature selection in text classification is to select feature words that are concentrated in one class of documents and that occur frequently and evenly within that class. Selection should therefore consider not only the document frequency and term frequency of a feature word, but also its inter-class concentration degree and intra-class scatter degree. The paper proposes a feature selection evaluation function based on within-class and between-class document frequency and on term frequency statistics. The function selects a certain proportion of the feature words in each category of the training set to form the feature word library of that category, and the feature word library of the whole training set is the union of the category libraries. A Chinese text classification experiment based on SVM shows that the proposed method improves the effectiveness of text classification to a certain extent compared with the traditional chi-square test and information gain.
Keywords: text classification; feature selection; dispersion; concentration; frequency
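
The abstract describes the selection procedure but does not give the evaluation function itself. The following is a minimal sketch of that procedure under stated assumptions, not the authors' formula: the score used here (inter-class concentration times within-class document coverage times log-scaled class term frequency) is a hypothetical stand-in, and the names `score_terms`, `select_features` and `ratio` are illustrative only. Documents are assumed to be already tokenized.

```python
import math
from collections import Counter, defaultdict


def score_terms(docs, labels):
    """Return {class: {term: score}} for tokenized docs.

    The score is a hypothetical stand-in, NOT the paper's function:
    concentration * coverage * log(1 + class term frequency).
    """
    n_docs = Counter(labels)      # number of documents per class
    df = defaultdict(Counter)     # per-class document frequency
    tf = defaultdict(Counter)     # per-class term frequency
    for tokens, c in zip(docs, labels):
        tf[c].update(tokens)
        df[c].update(set(tokens))

    scores = {}
    for c in n_docs:
        scores[c] = {}
        for t in df[c]:
            # Fraction of this class's documents containing the term;
            # high when the term is spread evenly within the class.
            coverage = df[c][t] / n_docs[c]
            # Share of the term's coverage that belongs to this class;
            # high when the term is concentrated in this class.
            total = sum(df[k][t] / n_docs[k] for k in n_docs)
            concentration = coverage / total
            scores[c][t] = concentration * coverage * math.log1p(tf[c][t])
    return scores


def select_features(docs, labels, ratio=0.2):
    """Keep the top `ratio` of terms per class and return their union."""
    vocab = set()
    for c, s in score_terms(docs, labels).items():
        k = max(1, int(len(s) * ratio))
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(s, key=lambda t: (-s[t], t))
        vocab.update(ranked[:k])
    return vocab
```

Taking the union of per-class selections, as the abstract describes, keeps small classes represented in the final vocabulary even when a single global ranking would be dominated by terms from large classes.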
This article is indexed in Wanfang Data and other databases.