Ternary encoding based feature extraction for binary text classification
Authors: Hakan Altınçay, Zafer Erenel
Affiliation: 1. Department of Computer Engineering, Eastern Mediterranean University, Famagusta, Northern Cyprus, Turkey
2. Department of Computer Engineering, European University of Lefke, Gemikonağı, Lefke, Northern Cyprus, Turkey
Abstract: A novel framework for termset based feature extraction is proposed for binary text classification. The proposed approach is based on the encoding of the terms within a termset. The ternary codes ‘+1’ and ‘−1’ are used to represent the class that the term supports, whereas ‘0’ denotes no support to any of the classes. Four different encoding schemes are proposed where the term weights and the term occurrence probabilities in the positive and negative documents are used to define the ternary code of a given term. The ternary patterns are utilized to define novel features by splitting them into positive and negative codes where each code is treated as a different feature extractor. Use of the derived features individually and together with bag of words representation are both investigated. The histograms of the resultant features are also employed to study the improvements that can be achieved using a small number of additional features to augment bag of words representation. Experiments conducted on four benchmark datasets with different characteristics have shown that the proposed feature extraction framework provides significant improvements compared to the bag of words representation.
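
The abstract does not specify the four encoding schemes, so the following Python sketch only illustrates the general idea under stated assumptions. It assigns a ternary code to each term from the difference of its occurrence probabilities in positive and negative documents, then splits a document's ternary pattern into a positive and a negative binary code. The `margin` threshold, the function names, and the rule that an absent term contributes 0 are all hypothetical choices made for illustration, not the authors' definitions.

```python
def occurrence_probability(term, docs):
    """Fraction of documents (each a set of tokens) that contain the term."""
    return sum(term in doc for doc in docs) / len(docs)


def ternary_code(p_pos, p_neg, margin=0.1):
    """Hypothetical encoding rule: +1 supports the positive class,
    -1 supports the negative class, 0 means no clear support.
    The margin value is an assumption, not taken from the paper."""
    diff = p_pos - p_neg
    if diff > margin:
        return 1
    if diff < -margin:
        return -1
    return 0


def document_pattern(doc, termset, codes):
    """Ternary pattern of a document over a termset.
    Assumption: a term absent from the document contributes 0."""
    return tuple(codes[t] if t in doc else 0 for t in termset)


def split_pattern(pattern):
    """Split a ternary pattern into a positive and a negative binary code."""
    positive = tuple(int(c == 1) for c in pattern)
    negative = tuple(int(c == -1) for c in pattern)
    return positive, negative


# Toy corpus, invented for illustration only.
pos_docs = [{"good", "great"}, {"good", "film"}]
neg_docs = [{"bad", "film"}, {"bad", "boring"}]
termset = ("good", "bad", "film")

codes = {
    t: ternary_code(occurrence_probability(t, pos_docs),
                    occurrence_probability(t, neg_docs))
    for t in termset
}
pattern = document_pattern({"good", "bad", "film"}, termset, codes)
positive_code, negative_code = split_pattern(pattern)
```

In this toy setup, "good" is coded +1, "bad" is coded −1, and "film" is coded 0 because it occurs equally often in both classes. Each of the two binary codes could then serve as a separate feature, for example alongside a bag of words vector.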
This article is indexed in SpringerLink and other databases.