Best terms: an efficient feature-selection algorithm for text categorization
Authors: Dimitris Fragoudis, Dimitris Meretakis, Spiridon Likothanassis
Affiliation: (1) Computer Engineering and Informatics Department, University of Patras, Rio-Patras, GR-26500, Greece; (2) Novartis Pharma, Griffith University, Basel, Switzerland; (3) Computer Technology Institute, Patras, Greece
Abstract: In this paper, we propose a new feature-selection algorithm for text classification, called best terms (BT). The complexity of BT is linear with respect to the number of training-set documents and is independent of both the vocabulary size and the number of categories. We evaluate BT on two benchmark document collections, Reuters-21578 and 20-Newsgroups, using two classification algorithms, naive Bayes (NB) and support vector machines (SVM). Our experimental results, which compare BT with an extensive and representative list of feature-selection algorithms, show that (1) BT is faster than the existing feature-selection algorithms; (2) BT considerably increases the classification accuracy of NB and SVM, as measured by the F1 measure; (3) BT considerably improves the speed of NB and SVM, and in most cases the training time of SVM drops by an order of magnitude; (4) in most cases, the combination of BT with the simple but very fast NB algorithm achieves classification accuracy comparable to that of SVM, and is sometimes even more accurate.
Keywords: Feature selection; Machine learning; Text categorization
This article is indexed in SpringerLink and other databases.