Turning from TF-IDF to TF-IGM for term weighting in text classification |
| |
Affiliation: | 1. Institute of Information Science, Prešernova 17, SI-2000 Maribor, Slovenia; 2. University of Maribor, FERI, Institute of Informatics, Smetanova 17, SI-2000 Maribor, Slovenia |
| |
Abstract: | The management and mining of massive textual data usually rely on automatic text classification. Term weighting is a basic problem in text classification and directly affects classification accuracy. Because the traditional TF-IDF (term frequency & inverse document frequency) scheme is not fully effective for text classification, researchers have proposed various alternatives. In this paper we compare different term weighting schemes and propose a new one, TF-IGM (term frequency & inverse gravity moment), together with its variants. TF-IGM incorporates a new statistical model to measure precisely the class-distinguishing power of a term. In particular, it makes full use of the fine-grained distribution of a term across the different classes of text. The effectiveness of TF-IGM is validated by extensive text classification experiments using SVM (support vector machine) and kNN (k nearest neighbors) classifiers on three commonly used corpora. The experimental results show that TF-IGM outperforms both the well-known TF-IDF and state-of-the-art supervised term weighting schemes. In addition, some findings that differ from those of previous studies are obtained and analyzed in depth. |
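The abstract does not state the weighting formula itself. As a rough illustration only, the sketch below follows the form in which the inverse gravity moment is usually given for this scheme: a term's class-specific frequencies are sorted in descending order as f_1 >= f_2 >= ... >= f_m, igm = f_1 / sum(f_r * r), and the weight is tf * (1 + lambda * igm). The function names, the choice of raw term frequency for `tf`, and the default `lam=7.0` are assumptions made here for illustration and are not taken from this record.

```python
from typing import Sequence


def igm(class_freqs: Sequence[float]) -> float:
    """Inverse gravity moment of one term (assumed form, see lead-in).

    class_freqs holds the term's frequency in each class, in any order.
    """
    # Rank the class-specific frequencies from largest to smallest.
    ranked = sorted(class_freqs, reverse=True)
    # "Gravity moment": each frequency weighted by its rank (1-based).
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    # A term that never occurs has no distinguishing power.
    return ranked[0] / moment if moment > 0 else 0.0


def tf_igm(tf: float, class_freqs: Sequence[float], lam: float = 7.0) -> float:
    """TF-IGM weight: local factor tf times global factor (1 + lam * igm).

    lam is an adjustable coefficient; 7.0 is an assumed default.
    """
    return tf * (1.0 + lam * igm(class_freqs))


if __name__ == "__main__":
    # Term concentrated in one class: igm reaches its maximum of 1.
    print(igm([10, 0, 0]))
    # Term spread evenly over three classes: igm drops to 1/6.
    print(igm([4, 4, 4]))
    print(tf_igm(2, [10, 0, 0]))
```

Under this form, a term that occurs in only one class gets the largest global factor, while a term spread evenly over all classes gets the smallest, which is the sense in which the scheme uses the term's distribution across classes.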
| |
Keywords: | |
This article is indexed in ScienceDirect and other databases. |
|