Textual case-based reasoning for spam filtering: a comparison of feature-based and feature-free approaches |
| |
Authors: | Sarah Jane Delany, Derek Bridge |
| |
Affiliation: | (1) Dublin Institute of Technology, Dublin, Ireland; (2) University College Cork, Cork, Ireland |
| |
Abstract: | Spam filtering is a text classification task to which Case-Based Reasoning (CBR) has been successfully applied. We describe
the ECUE system, which classifies emails using a feature-based form of textual CBR. Then, we describe an alternative way to
compute the distances between cases in a feature-free fashion, using a distance measure based on text compression. This distance
measure has the advantages of having no set-up costs and being resilient to concept drift. We report an empirical comparison,
which shows the feature-free approach to be more accurate than the feature-based system. These results are fairly robust across
compression algorithms: the accuracy when using a Lempel-Ziv compressor (GZip) is approximately
the same as when using a statistical compressor (PPM). We note, however, that the feature-free systems take much longer to
classify emails than the feature-based system. Improvements in the classification time of both kinds of systems can be obtained
by applying case base editing algorithms, which aim to remove noisy and redundant cases from a case base while maintaining,
or even improving, generalisation accuracy. We report empirical results using the Competence-Based Editing (CBE) technique.
We show that CBE removes more cases when we use the distance measure based on text compression (without significant changes
in generalisation accuracy) than it does when we use the feature-based approach. |
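
The feature-free idea in the abstract can be illustrated with a minimal sketch. The abstract does not state which compression-based formula the authors use, so the normalized compression distance (NCD) below is an assumption, as is the k-nearest-neighbour vote; the function names and toy case base are hypothetical. Python's zlib (DEFLATE, the Lempel-Ziv scheme that GZip also uses) stands in for the compressor.

```python
import zlib
from collections import Counter


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance (assumed measure, not confirmed by the source).

    Similar texts compress well when concatenated, so the value is small;
    unrelated texts give a value close to 1.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def classify(query: str, case_base: list[tuple[str, str]], k: int = 3) -> str:
    """Label a query by majority vote among its k nearest cases under ncd."""
    neighbours = sorted(case_base, key=lambda case: ncd(query, case[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy case base of (email text, label) pairs.
CASES = [
    ("Win a free prize now, click here to claim your free prize today!", "spam"),
    ("Cheap pills online, limited offer, order now and save big money!", "spam"),
    ("Meeting moved to Thursday at ten, agenda attached for the review.", "ham"),
    ("Can you send me the draft of the paper before the weekend please?", "ham"),
]
```

No tokenisation or feature selection happens before classification, which matches the "no set-up costs" point; the price is one compression call per stored case for every query, consistent with the longer classification times the abstract reports.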
| |
Keywords: | Spam filtering; Case-based reasoning; Case-base editing; Case-based maintenance; Feature selection; Distance measures; Text compression |
This article is indexed in SpringerLink and other databases. |
|