
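Research on Feature Selection Methods for Text Categorization on Skewed Datasets
Cite this article:LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong. Research on Feature Selection Methods for Text Categorization on Skewed Datasets[J]. Journal of Chinese Information Processing, 2014, 28(2): 116-121.
Authors:LIU Zhenyan  MENG Dan  WANG Weiping  WANG Yong
Affiliation:1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190;
2. University of Chinese Academy of Sciences, Beijing 100049;
3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093;
4. School of Software, Beijing Institute of Technology, Beijing 100081
Funding:National 242 Information Security Program (2010A007), National 863 Program (2011AA01A203), National Natural Science Foundation of China (60903047, 61272361), Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200)
Abstract:In a skewed text dataset, where the number of samples differs greatly from one category to another, the vast majority of features chosen by traditional feature selection methods come from the large categories. The classifier then favors the large categories and neglects the small ones, which directly degrades classification performance. This paper first analyzes the characteristics of skewed text datasets and identifies two important factors that affect feature selection on them: the category distribution of a term and its between-category difference. The category distribution factor reflects the difference in a term's category frequency across the whole dataset, while the category difference factor reflects the difference in a term's relative document frequency between categories. Based on these two factors, the paper constructs a new feature selection function that is particularly suited to skewed text categorization: Relative Category Difference (RCD). Comparative experiments against traditional feature selection methods show that RCD gives better results on skewed text categorization.

Keywords:text categorization  skewed dataset  feature selection  category difference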

Feature Selection for Skewed Text Categorization
LIU Zhenyan,MENG Dan,WANG Weiping,WANG Yong.Feature Selection for Skewed Text Categorization[J].Journal of Chinese Information Processing,2014,28(2):116-121.
Authors:LIU Zhenyan  MENG Dan  WANG Weiping  WANG Yong
Affiliation:1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
2. University of Chinese Academy of Sciences, Beijing 100049, China;
3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China;
4. School of Software, Beijing Institute of Technology, Beijing 100081, China
Abstract:Existing feature selection methods are not appropriate for skewed corpora, in which most samples belong to a majority class and far fewer belong to a minority class. The reason is that these methods select features without considering the relative distribution of each class. As a result, most of the features they select come from the majority class, so the resulting classifier tends to misclassify minority-class samples. This paper analyzes the characteristics of skewed corpora and identifies two important factors that influence feature selection on skewed data: category distribution and category difference. The category distribution factor indicates the difference in a term's category frequency across the whole dataset, and the category difference factor indicates the difference in a term's relative document frequency between classes. A new feature selection function, Relative Category Difference (RCD), is then constructed from these two factors. Experimental results show that the new method outperforms the traditional feature selection methods it was compared against on skewed text categorization.
Keywords:text categorization  skewed dataset  feature selection  category difference  
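
The abstract does not give the RCD formula, so the following is only a minimal Python sketch of the underlying observation: raw document counts favor terms from the majority class, while per-class relative document frequency does not. The function names, the toy corpus, and the `relative_gap` score are all assumptions made for illustration; `relative_gap` is not the paper's RCD function.

```python
from collections import Counter


def document_counts(docs, labels, term):
    """Raw number of documents in each class that contain `term`."""
    counts = {c: 0 for c in set(labels)}
    for doc, label in zip(docs, labels):
        if term in doc:
            counts[label] += 1
    return counts


def relative_document_frequency(docs, labels, term):
    """Fraction of each class's documents that contain `term`."""
    sizes = Counter(labels)
    counts = document_counts(docs, labels, term)
    return {c: counts[c] / sizes[c] for c in sizes}


def relative_gap(docs, labels, term):
    """Illustrative score (NOT the paper's RCD): spread of the
    per-class relative document frequencies of `term`."""
    rdf = relative_document_frequency(docs, labels, term)
    return max(rdf.values()) - min(rdf.values())


# Hypothetical skewed toy corpus: 8 "sports" documents, 2 "security" documents.
# Each document is represented as a set of terms.
docs = (
    [{"goal", "match"}] * 4
    + [{"match", "team"}] * 4
    + [{"virus", "patch"}] * 2
)
labels = ["sports"] * 8 + ["security"] * 2

for term in ("goal", "virus"):
    print(term, document_counts(docs, labels, term), relative_gap(docs, labels, term))
```

In this toy corpus, "goal" occurs in more documents overall (4 versus 2), yet "virus" covers every minority-class document and none of the majority class, so a score built on relative document frequency ranks it higher.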
This article is indexed in CNKI and other databases.