Feature selection method for text categorization based on frequency in kind (基于类内频率的文本分类特征选择方法)
Cite this article: CUI Cai-xia, WANG Su-ge. Feature selection method for text categorization based on frequency in kind[J]. Computer Engineering and Design (计算机工程与设计), 2007, 28(17): 4249-4251,4265
Authors: CUI Cai-xia (崔彩霞), WANG Su-ge (王素格)
Affiliation: 1. Department of Computer Science, Taiyuan Normal University, Taiyuan 030012, Shanxi, China; 2. School of Mathematical Sciences, Shanxi University, Taiyuan 030006, Shanxi, China
Abstract: With the rapid development of computer technology and the WWW, text categorization has become one of the key technologies of information retrieval, and feature selection plays a crucial role in categorization performance. Four commonly used feature selection methods for text categorization are introduced and analyzed, and a feature selection method based on within-class frequency (the "frequency in kind" of the title) is proposed. With kNN and the support vector machine (SVM) as classifiers, the five feature selection methods are tested on a balanced corpus and an unbalanced corpus. The experimental results indicate that the proposed method effectively selects features that are genuinely meaningful for classification and achieves good classification performance, and that it is especially well suited to the SVM classifier.

Keywords: text categorization; feature selection; document frequency; information gain; mutual information
Article number: 1000-7024(2007)17-4249-03
Revised: 2006-10-09
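
As a rough illustration of two of the baseline criteria named in the keywords, the sketch below scores terms by document frequency and by mutual information. It is not the method proposed in the paper, whose formula is not given on this page. The function names, the toy corpus, and the choice of taking the maximum pointwise mutual information over classes are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (tokens, class label) pairs, invented for illustration.
toy_docs = [
    (["ball", "team", "win"], "sport"),
    (["team", "score", "ball"], "sport"),
    (["stock", "market", "win"], "finance"),
    (["market", "price", "stock"], "finance"),
]


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens, _label in docs:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def mutual_information(docs):
    """Score each term by its maximum pointwise MI with any class.

    MI(t, c) = log(P(t, c) / (P(t) * P(c))), estimated from document counts.
    Classes in which the term never occurs are skipped (their MI is -infinity).
    """
    n = len(docs)
    df = document_frequency(docs)
    class_count = Counter(label for _tokens, label in docs)
    joint = defaultdict(Counter)  # joint[term][label] = docs of label containing term
    for tokens, label in docs:
        for term in set(tokens):
            joint[term][label] += 1
    return {
        term: max(
            math.log((count * n) / (df[term] * class_count[label]))
            for label, count in per_class.items()
        )
        for term, per_class in joint.items()
    }


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _score in ranked[:k]]


df_scores = document_frequency(toy_docs)
mi_scores = mutual_information(toy_docs)
print(top_k(df_scores, 3))
print(top_k(mi_scores, 3))
```

On this toy corpus, "win" appears equally in both classes and receives a mutual information score of zero, while "score" and "price" each occur in a single document yet score as high as "ball". That sensitivity to rare terms is a commonly noted weakness of mutual information as a selection criterion.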
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.