
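Study on information gain-based feature selection in text categorization
Cite this article: GUO Yawei, LIU Xiaoxia. Study on information gain-based feature selection in text categorization[J]. Computer Engineering and Applications (计算机工程与应用), 2012, 48(27): 119-122,127
Authors: GUO Yawei (郭亚维), LIU Xiaoxia (刘晓霞)
Affiliation: College of Information Science and Technology, Northwest University, Xi'an 710127, China
Funding: Special Scientific Research Program of the Shaanxi Provincial Department of Education (No. 11JK1040)
Abstract: This paper analyzes a shortcoming of the traditional information gain (IG) feature selection method: it ignores how a feature term is distributed between and within classes. Factors such as within-class dispersion and between-class concentration are introduced to identify features that are strongly correlated with a class. To address the traditional IG method's poor combination of positively and negatively correlated features, a proportional factor is introduced to balance the information contributed when a feature is present against that contributed when it is absent. This lowers the share of negatively correlated features on imbalanced corpora and improves classification performance. Experiments demonstrate the effectiveness and feasibility of the improved IG feature selection method.

Keywords: text categorization; information gain; feature selection; within-class dispersion; between-class concentration; proportional factor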

Study on information gain-based feature selection in Chinese text categorization
GUO Yawei, LIU Xiaoxia. Study on information gain-based feature selection in Chinese text categorization[J]. Computer Engineering and Applications, 2012, 48(27): 119-122,127
Authors: GUO Yawei, LIU Xiaoxia
Affiliation: College of Information Science & Technology, Northwest University, Xi'an 710127, China
Abstract: The traditional Information Gain (IG) feature selection method ignores how a feature is distributed within a class and between classes, and this shortcoming is analysed. Within-class dispersion and between-class concentration are introduced to distinguish features that are strongly correlated with a class. Because traditional IG also does not combine positive and negative features well, a proportional factor is introduced to balance the information contributed when a feature appears against that contributed when it does not appear. This reduces the influence of negative features on corpora with an uneven category distribution and improves classification performance. The experimental results verify the effectiveness and feasibility of the improved IG approach.
Keywords: text categorization; information gain; feature selection; within-class dispersion; between-class concentration; proportional factor
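
The abstract describes IG as the sum of the information a feature carries when it appears and when it does not appear, with a proportional factor balancing the two. The sketch below illustrates that decomposition for the standard IG definition only. The function names, the `absence_weight` parameter, and the toy corpus are assumptions made for illustration; the abstract does not give the paper's actual formula, and the within-class dispersion and between-class concentration factors are not modelled here.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, term, absence_weight=1.0):
    """IG of `term` over `docs`, a list of (set_of_terms, class_label) pairs.

    With absence_weight=1.0 this is the standard definition
    H(C) - P(t) H(C|t) - P(not t) H(C|not t), split into a presence part
    and an absence part. absence_weight is a hypothetical stand-in for a
    proportional factor that scales the absence part; it is not the
    formula from the paper.
    """
    n = len(docs)
    all_labels = Counter(label for _, label in docs)
    present = Counter(label for terms, label in docs if term in terms)
    absent = Counter(label for terms, label in docs if term not in terms)

    p_t = sum(present.values()) / n
    p_not_t = 1.0 - p_t
    h_c = entropy(all_labels.values())

    presence_part = p_t * (h_c - entropy(present.values()))
    absence_part = p_not_t * (h_c - entropy(absent.values()))
    return presence_part + absence_weight * absence_part


# Toy corpus (invented for illustration): two classes, two documents each.
sample_docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"stock", "market"}, "finance"),
    ({"stock", "bank"}, "finance"),
]

ig_standard = information_gain(sample_docs, "goal")
ig_downweighted = information_gain(sample_docs, "goal", absence_weight=0.5)
```

Setting `absence_weight` below 1 reduces the score contributed by a term's absence, which is the general effect the abstract attributes to the proportional factor on imbalanced corpora.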
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.