On entropy-based term weighting schemes for text categorization
Authors: Wang Tao, Cai Yi, Leung Ho-fung, Lau Raymond Y. K., Xie Haoran, Li Qing
Affiliation:
1. Department of Biostatistics and Health Informatics, King’s College London, London, UK
2. School of Software Engineering, South China University of Technology, Guangzhou, China
3. Key Laboratory of Big Data and Intelligent Robot (South China University of Technology), Ministry of Education, Guangzhou, China
4. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
5. Department of Information Systems, City University of Hong Kong, Hong Kong, China
6. Department of Computing and Decision Sciences, Lingnan University, Hong Kong, China
7. Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
Abstract:

In text categorization, the Vector Space Model (VSM) has been widely used for representing documents: each document is represented by a vector of terms. Since different terms contribute to a document’s semantics to varying degrees, a number of term weighting schemes have been proposed for VSM to improve text categorization performance. Much evidence shows that the performance of a term weighting scheme often varies across text categorization tasks, while the mechanism underlying this variability remains unclear. Moreover, existing schemes often weight a term with respect to a category locally, without considering the global distribution of the term’s occurrences across all categories in a corpus. In this paper, we first systematically examine the pros and cons of existing term weighting schemes in text categorization and explore why some schemes with sound theoretical bases, such as the chi-square test and information gain, perform poorly in empirical evaluations. By measuring how concentrated a term’s distribution is across all categories in a corpus, we then propose a series of entropy-based term weighting schemes that measure the distinguishing power of a term in text categorization. In extensive experiments on five different datasets, the proposed term weighting schemes consistently outperform state-of-the-art schemes. Moreover, our findings shed new light on how to choose and develop an effective term weighting scheme for a specific text categorization task.
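
The abstract does not give the formulas of the proposed schemes, so the following is only a minimal sketch of the general idea of measuring how concentrated a term's occurrences are across categories. It assumes a weight defined as one minus the normalized Shannon entropy of the term's category distribution; the function names `category_entropy` and `concentration_weight` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def category_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrence counts over categories."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def concentration_weight(counts):
    """Illustrative weight in [0, 1]: 1 minus the normalized entropy.

    This is an assumed formula for illustration, not the paper's scheme.
    A term found in a single category gets weight 1; a term spread
    evenly over all categories gets weight 0.
    """
    k = len(counts)
    if k < 2 or sum(counts) == 0:
        return 0.0
    return 1.0 - category_entropy(counts) / math.log2(k)


# Hypothetical occurrence counts of three terms across four categories.
focused = [40, 0, 0, 0]    # appears in one category only
spread = [10, 10, 10, 10]  # appears evenly in every category
skewed = [30, 5, 3, 2]     # mostly, but not only, in one category

for name, counts in [("focused", focused), ("spread", spread), ("skewed", skewed)]:
    print(name, round(concentration_weight(counts), 3))
```

Under this assumed definition, a term that occurs evenly in every category carries no distinguishing power, while a term confined to one category carries the most, which matches the global, cross-category view the abstract contrasts with local per-category weighting.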

Keywords:
This article is indexed in SpringerLink and other databases.