Towards filtering undesired short text messages using an online learning approach with semantic indexing |
| |
Affiliation: | 1. Department of Systems and Energy, University of Campinas – UNICAMP, Campinas, São Paulo, Brazil;2. Department of Computer Science, Federal University of São Carlos – UFSCar, Sorocaba, São Paulo, Brazil |
| |
Abstract: | The popularity and reach of short text messages in electronic communication have led spammers to use them to propagate undesired content. Such content often consists of misleading information, advertisements, viruses, and malware that can be harmful and annoying to users. The dynamic nature of spam messages demands knowledge-based systems with online learning and, therefore, most traditional text categorization techniques cannot be used. In this study, we introduce MDLText, a text classifier based on the minimum description length principle, to the context of filtering undesired short text messages. The proposed approach supports incremental learning and, therefore, its predictive model is scalable and can adapt to continuously evolving spamming techniques. It is also fast, with computational cost increasing linearly with the number of samples and features, which is very desirable for expert systems applied to real-time electronic communication. In addition to being dynamic, these messages are short and usually poorly written, rife with slang, symbols, and abbreviations that make text representation, learning, and filtering difficult. In this scenario, we also investigated the benefits of using text normalization and semantic indexing techniques. We showed that these techniques can improve the text content quality and, consequently, enhance the performance of expert systems for spam detection. Based on these findings, we propose a new hybrid ensemble approach that combines the predictions obtained by the classifiers using the original text samples along with their variations created by applying text normalization and semantic indexing techniques. It has the advantage of being independent of the classification method, and the results indicate that it is effective at filtering undesired short text messages. |
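The ensemble idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption, not the authors' method: the abstract does not state how predictions are combined, so a simple majority vote is used; an incrementally updated multinomial naive Bayes stands in for MDLText; and the `NORMALIZE` and `CONCEPTS` tables are hypothetical toy substitutes for real text normalization and semantic indexing resources.

```python
import math
from collections import Counter, defaultdict

# Hypothetical lookup tables (assumptions for illustration only); real
# normalization and semantic indexing would rely on proper dictionaries.
NORMALIZE = {"u": "you", "txt": "text", "2": "to", "pls": "please"}
CONCEPTS = {"free": "free gratis", "prize": "prize award", "win": "win gain"}


def original(text):
    """View 1: the raw message, lowercased and split on whitespace."""
    return text.lower().split()


def normalized(text):
    """View 2: slang and abbreviations mapped to standard words."""
    return [NORMALIZE.get(t, t) for t in original(text)]


def expanded(text):
    """View 3: normalized tokens plus related terms (toy semantic indexing)."""
    out = []
    for t in normalized(text):
        out.extend(CONCEPTS.get(t, t).split())
    return out


class OnlineNB:
    """Multinomial naive Bayes updated one sample at a time.

    A stand-in for any incremental classifier; it is NOT MDLText.
    """

    def __init__(self):
        self.class_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()

    def update(self, tokens, label):
        # Incremental learning: only counters change, no retraining pass.
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # Laplace smoothing; the extra +1 reserves mass for unseen tokens.
            denom = sum(self.token_counts[label].values()) + len(self.vocab) + 1
            score = math.log(n / total)
            for t in tokens:
                score += math.log((self.token_counts[label][t] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


class VariantEnsemble:
    """One classifier per text view, combined by majority vote (assumed rule)."""

    def __init__(self, views):
        self.members = [(view, OnlineNB()) for view in views]

    def update(self, text, label):
        for view, clf in self.members:
            clf.update(view(text), label)

    def predict(self, text):
        votes = Counter(clf.predict(view(text)) for view, clf in self.members)
        return votes.most_common(1)[0][0]


# Toy training stream, processed one message at a time.
ensemble = VariantEnsemble([original, normalized, expanded])
for message, label in [
    ("win a free prize now", "spam"),
    ("free txt 2 claim prize", "spam"),
    ("are u coming 2 lunch", "ham"),
    ("see you at the meeting", "ham"),
]:
    ensemble.update(message, label)
```

Because the ensemble only calls `update` and `predict` on its members, any other incremental classifier could be substituted for `OnlineNB`, which mirrors the abstract's claim that the approach is independent of the classification method.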
| |
This article is indexed in ScienceDirect and other databases. |
|