Feature Selection for Skewed Text Categorization

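Cite this article:	LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong. Feature Selection for Skewed Text Categorization[J]. Journal of Chinese Information Processing, 2014, 28(2): 116-121.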

Authors:	LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong

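Affiliation:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093; 4. School of Software, Beijing Institute of Technology, Beijing 100081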

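Funding:	National 242 Information Security Program (2010A007); National 863 Program (2011AA01A203); National Natural Science Foundation of China (60903047, 61272361); Strategic Priority Research Program of the Chinese Academy of Sciences (XDA06030200)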

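Abstract:	On skewed text datasets, where the number of samples differs greatly from one category to another, most of the features chosen by traditional feature selection methods come from the majority categories. The classifier is then biased toward the majority categories and neglects the minority ones, which directly degrades classification performance. This paper first analyzes the characteristics of skewed text datasets and identifies two important factors that affect feature selection on them: the category distribution of a term and its between-category difference. The category distribution factor reflects the difference in a term's category frequency across the whole dataset, while the category difference factor reflects the difference in a term's relative document frequency between categories. Based on these two factors, the paper constructs a new feature selection function that is particularly suited to skewed text categorization: Relative Category Difference (RCD). Comparative experiments against traditional feature selection methods show that RCD gives better results on skewed text categorization.
Keywords:	text categorization; skewed dataset; feature selection; category difference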
Feature Selection for Skewed Text Categorization

LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong. Feature Selection for Skewed Text Categorization[J]. Journal of Chinese Information Processing, 2014, 28(2): 116-121.

Authors:	LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong

Affiliation:	1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; 4. School of Software, Beijing Institute of Technology, Beijing 100081, China

Abstract:	Existing feature selection methods are not appropriate for a skewed corpus, in which most samples belong to a majority class and far fewer belong to a minority class. The reason is that these methods select features without considering the relative distribution of each class. As a result, most of the features they select come from the majority class, and the resulting classifier tends to misclassify minority class samples. This paper analyzes the characteristics of the skewed corpus and finds two important factors that influence feature selection on skewed data: category distribution and category difference. The category distribution factor indicates the difference in a term's category frequency across the whole dataset, and the category difference factor indicates the difference in its relative document frequency between classes. A new feature selection function called Relative Category Difference (RCD) is then constructed from these two factors. Experimental results show that the new feature selection method outperforms other methods on skewed text categorization.

Keywords:	text categorization; skewed dataset; feature selection; category difference
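
The abstract's core observation, that raw counts favour the majority class while relative document frequency does not, can be illustrated with a small sketch. This is not the RCD function: the abstract does not give its formula, and nothing below should be read as the authors' method. The toy corpus, the function names (`raw_doc_freq`, `relative_doc_freq`, `relative_difference_score`), and the max-minus-min score are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def raw_doc_freq(docs):
    """Number of documents containing each term, ignoring class labels."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return df


def relative_doc_freq(docs, labels):
    """Map term -> {label: fraction of that label's documents containing the term}."""
    class_sizes = Counter(labels)
    counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            counts[term][label] += 1
    return {
        term: {c: counts[term][c] / class_sizes[c] for c in class_sizes}
        for term in counts
    }


def relative_difference_score(rel_df):
    """Illustrative score only (NOT the paper's RCD): the gap between the
    largest and smallest per-class relative document frequency of a term."""
    return {term: max(f.values()) - min(f.values()) for term, f in rel_df.items()}


# Hypothetical skewed corpus: 8 majority-class documents, 2 minority-class documents.
docs = (
    [["match", "team"]] * 4
    + [["match", "goal"]] * 4
    + [["stock", "bank"], ["stock", "team"]]
)
labels = ["sports"] * 8 + ["finance"] * 2

df = raw_doc_freq(docs)
scores = relative_difference_score(relative_doc_freq(docs, labels))

for term in sorted(df, key=df.get, reverse=True):
    print(f"{term:6s} raw_df={df[term]}  relative_gap={scores[term]:.2f}")
```

In this toy corpus, "stock" appears in every minority-class document but in only 2 documents overall, so a ranking by raw document frequency places it behind "match", "team", and "goal". Under the relative score it ties with "match" at the top, while "team", which is equally common in both classes, drops to zero.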
This article is indexed in CNKI and other databases.