SAAD, a content based Web Spam Analyzer and Detector |
| |
Authors: | Víctor M. Prieto, Manuel Álvarez, Fidel Cacheda |
| |
Affiliation: | Communications and Information Technologies Department, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain |
| |
Abstract: | Web Spam is one of the main difficulties that crawlers have to overcome and therefore one of the main problems of the WWW. There are several studies about characterising and detecting Web Spam pages. However, none of them deals with all the possible kinds of Web Spam. This paper presents an analysis of different kinds of Web Spam pages and identifies new elements that characterise them, in order to define heuristics that are able to partially detect them. We also discuss and explain several heuristics from the point of view of their effectiveness and computational efficiency. Taking them into account, we study several sets of heuristics and demonstrate how they improve the current results. Finally, we propose a new Web Spam detection system called SAAD (Spam Analyzer And Detector), which is based on the set of proposed heuristics and their use in a C4.5 classifier improved by means of Bagging and Boosting techniques. We have also tested our system on some well-known Web Spam datasets and found it to be very effective. |
| |
Keywords: | Web characterization; Web Spam; Malware; Data mining; Statistical properties of Web Spam |
This article is indexed in ScienceDirect and other databases. |
|