

Modified frequency-based term weighting schemes for text classification
Affiliation:1. Department of Computer Engineering and Applications, GLA University, Mathura, Uttar Pradesh, India;2. Department of Electrical Engineering, Dayalbagh Educational Institute, Agra, Uttar Pradesh, India;1. College of Computer Science and Technology, Jilin University, China;2. Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, China;3. School Of Software Engineering, South China University of Technology, China
Abstract:With the rapid growth of textual content on the Internet, automatic text categorization has become a comparatively effective solution for information organization and knowledge management. Feature selection, one of the basic phases in statistical text categorization, depends crucially on the term weighting method. To improve the performance of text categorization, this paper proposes four modified frequency-based term weighting schemes, namely mTF, mTFIDF, TFmIDF, and mTFmIDF. The proposed schemes take the number of missing terms into account when calculating the weight of existing terms. They achieve their highest performance with an SVM classifier, reaching a micro-averaged F1 score of 97%. Moreover, benchmarking results on the Reuters-21578, 20Newsgroups, and WebKB text-classification datasets, using different classification algorithms such as SVM and KNN, show that the proposed schemes mTF, mTFIDF, and mTFmIDF outperform other weighting schemes such as TF, TFIDF, and Entropy. Additionally, statistical significance tests show a significant improvement in classification performance with the modified schemes.
Keywords:Term-weighting  Missing features  Absent terms  Vector Space Model  Text classification
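
For orientation, the baseline TF and TFIDF weights that the abstract compares against can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the abstract does not give the formulas for mTF, mTFIDF, TFmIDF, or mTFmIDF, so the code only computes standard TF-IDF weights and, separately, the number of vocabulary terms missing from each document, which is the quantity the proposed schemes are said to take into account. The function name `tf_idf_weights`, the whitespace tokenization, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(docs):
    """Return (vocabulary, per-document TF-IDF weights, missing-term counts).

    Assumptions (not from the paper): whitespace tokenization, raw term
    counts as TF, and idf = log(N / df) with the natural logarithm.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights, missing = [], []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Standard TF-IDF weight for each term present in the document.
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        # Vocabulary terms that do not occur in this document.
        missing.append(len(vocab) - len(tf))
    return vocab, weights, missing


docs = ["wheat price rises", "wheat export falls", "oil price falls"]
vocab, weights, missing = tf_idf_weights(docs)
```

How the missing-term count would enter the weight is deliberately left out here, since only the full paper defines that step.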
This article is indexed in ScienceDirect and other databases.