Comparison of text feature selection policies and using an adaptive framework期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Comparison of text feature selection policies and using an adaptive framework

Authors:	Şerafettin Taşcı Tunga Güngör

Affiliation:	Bo?aziçi University, Computer Engineering Department, Bebek, 34342 ?stanbul, Turkey

Abstract:	Text categorization is the task of automatically assigning unlabeled text documents to some predefined category labels by means of an induction algorithm. Since the data in text categorization are high-dimensional, often feature selection is used for reducing the dimensionality. In this paper, we make an evaluation and comparison of the feature selection policies used in text categorization by employing some of the popular feature selection metrics. For the experiments, we use datasets which vary in size, complexity, and skewness. We use support vector machine as the classifier and tf-idf weighting for weighting the terms. In addition to the evaluation of the policies, we propose new feature selection metrics which show high success rates especially with low number of keywords. These metrics are two-sided local metrics and are based on the difference of the distributions of a term in the documents belonging to a class and in the documents not belonging to that class. Moreover, we propose a keyword selection framework called adaptive keyword selection. It is based on selecting different number of terms for each class and it shows significant improvement on skewed datasets that have a limited number of training instances for some of the classes.

Keywords:
本文献已被 ScienceDirect 等数据库收录！

设为首页 | 免责声明 | 关于勤云 | 加入收藏