Best terms: an efficient feature-selection algorithm for text categorization

Authors:	Dimitris Fragoudis, Dimitris Meretakis, Spiridon Likothanassis

Affiliation:	(1) Computer Engineering and Informatics Department, University of Patras, Rio-Patras, GR-26500, Greece; (2) Novartis Pharma, Griffith University, Basel, Switzerland; (3) Computer Technology Institute, Patras, Greece

Abstract:	In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, which compare BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT considerably increases the classification accuracy of NB and SVM, as measured by the F1 measure; (3) BT considerably improves the speed of NB and SVM, and in most cases the training time of SVM drops by an order of magnitude; (4) in most cases, combining BT with the simple but very fast NB algorithm yields classification accuracy comparable to that of SVM, and sometimes even higher.
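
The F1 measure named in the abstract is the harmonic mean of precision and recall. As a minimal sketch of how it can be computed for a single category, the snippet below works from true-positive, false-positive, and false-negative counts. The function name `f1_score` and the count-based interface are illustrative assumptions, not taken from the paper, and the abstract does not say how per-category scores were averaged.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1, the harmonic mean of precision and recall.

    tp, fp, fn are the true-positive, false-positive, and
    false-negative counts for one category (hypothetical interface).
    """
    if tp == 0:
        # No correct positive predictions: precision and/or recall is 0
        # (or undefined), so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Example counts, invented purely for illustration.
    print(f1_score(tp=8, fp=2, fn=2))
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which makes it a common single-number summary for text categorization, where categories are often imbalanced.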

Keywords:	Feature selection; Machine learning; Text categorization
This article is indexed in SpringerLink and other databases.
