Web Spam Detection by the Genetic Programming-based Ensemble Learning (基于遗传规划集成学习的网络作弊检测)

Cite this article:	牛小飞, 马军, 马少平, 张冬梅. 基于遗传规划集成学习的网络作弊检测 [J]. 中文信息学报, 2012, 26(5): 94-101.

Authors:	NIU Xiaofei (牛小飞), MA Jun (马军), MA Shaoping (马少平), ZHANG Dongmei (张冬梅)

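Affiliations:	1. School of Computer Science and Technology, Shandong University, Jinan, Shandong 250101; 2. School of Computer Science and Technology, Shandong Jianzhu University, Jinan, Shandong 250101; 3. Department of Computer Science and Technology, Tsinghua University, Beijing 100084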

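Funding:	National Natural Science Foundation of China; Natural Science Foundation of Shandong Province; Shandong Province Domestic Visiting Scholar Program for Outstanding Young University Teachers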

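Abstract:	Web spam detection is one of the major challenges facing search engines. This paper proposes a genetic programming-based ensemble learning method (abbreviated GPENL) to detect web spam. The method first applies under-sampling to the original training set to obtain t different training sets; it then trains c different classification algorithms on the t training sets to obtain t × c base classifiers; finally, it uses genetic programming to derive how the t × c base classifiers are combined. The new method not only merges under-sampling with ensemble learning to improve classification performance on imbalanced datasets, but also makes it easy to integrate base classifiers of different types. Experiments on the WEBSPAM-UK2006 dataset show that GPENL improves classification performance for both homogeneous and heterogeneous ensembles, and that heterogeneous ensembles are more effective than homogeneous ones; GPENL achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority-voting ensembles, the EDKC algorithm, and the method based on Prediction Spamicity.
Keywords:	web spam; ensemble learning; genetic programming; classification on imbalanced datasets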
Web Spam Detection by the Genetic Programming-based Ensemble Learning

NIU Xiaofei,MA Jun,MA Shaoping,ZHANG Dongmei.Web Spam Detection by the Genetic Programming-based Ensemble Learning[J].Journal of Chinese Information Processing,2012,26(5):94-101.

Authors:	NIU Xiaofei, MA Jun, MA Shaoping, ZHANG Dongmei

Affiliation:	1. School of Computer Science and Technology, Shandong University, Jinan, Shandong 250101, China; 2. School of Computer Science and Technology of Shandong Jianzhu University, Jinan, Shandong 250101, China; 3. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract:	Web spam detection is a challenging issue for web search engines. This paper proposes a genetic programming-based ensemble learning approach (GPENL) to detect web spam. First, the method draws t different training sets from the original training set by under-sampling. Then, c different classification algorithms are trained on the t training sets to obtain t × c base classifiers. Finally, genetic programming is used to derive how the t × c base classifiers are combined. The new method not only combines under-sampling with ensemble learning to improve classification performance on imbalanced datasets, but also makes it easy to integrate base classifiers of different types. Experiments on WEBSPAM-UK2006 show that GPENL improves classification performance whether or not the base classifiers are of the same type, and that in most cases heterogeneous ensembles work better than homogeneous ones. GPENL also achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority voting (Vote), the EDKC algorithm, and the method based on Prediction Spamicity.

Keywords:	web spam; ensemble learning; genetic programming; classification on imbalanced datasets
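
The three steps in the abstract (under-sample into t training sets, train c algorithms on each, let genetic programming choose how to combine the t × c base classifiers) can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify the base algorithms, the GP operator set, or the fitness function, so everything below is an assumption made for illustration. The toy two-feature data, the stand-in learners `train_centroid` and `train_stump`, the operators in `OPS`, the 0.5 decision threshold, the depth cap, and the choice of training-set F-measure as fitness are all hypothetical. The variation step is also simplified: it grafts a whole parent tree (crossover-like) or a fresh random tree (mutation-like) into a random position.

```python
import random
from statistics import mean

rng = random.Random(0)

# Toy imbalanced data (assumed): two features, label 1 = spam (minority), 0 = normal.
data = [((rng.gauss(2, 1), rng.gauss(2, 1)), 1) for _ in range(30)] + \
       [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(300)]

def under_sample(data):
    """Keep all minority examples plus an equally sized random majority subset."""
    spam = [d for d in data if d[1] == 1]
    normal = [d for d in data if d[1] == 0]
    return spam + rng.sample(normal, len(spam))

# Two stand-in "classification algorithms" (assumptions, not the paper's choices).
def train_centroid(train):
    cents = {c: [mean(x[j] for x, y in train if y == c) for j in (0, 1)] for c in (0, 1)}
    dist = lambda x, c: sum((a - b) ** 2 for a, b in zip(x, cents[c]))
    return lambda x: 1.0 if dist(x, 1) < dist(x, 0) else 0.0

def train_stump(train):
    thr = mean(x[0] for x, _ in train)  # threshold on feature 0 at the sample mean
    return lambda x: 1.0 if x[0] > thr else 0.0

t, algorithms = 4, [train_centroid, train_stump]          # c = len(algorithms) = 2
samples = [under_sample(data) for _ in range(t)]
base = [algo(s) for s in samples for algo in algorithms]  # t * c base classifiers

# Cache every base classifier's vote on every example; GP trees combine these votes.
outputs = [[clf(x) for clf in base] for x, _ in data]
labels = [y for _, y in data]

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b, "max": max, "min": min}

def random_tree(depth=3):
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(len(base))      # leaf: index of a base classifier
    return (rng.choice(sorted(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, votes):
    if isinstance(tree, int):
        return votes[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, votes), evaluate(right, votes))

def tree_depth(tree):
    return 0 if isinstance(tree, int) else 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))

def f_measure(tree):
    """F-measure of the combined prediction (tree output >= 0.5 means spam)."""
    preds = [1 if evaluate(tree, v) >= 0.5 else 0 for v in outputs]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def vary(tree, donor):
    """Replace one randomly chosen subtree of `tree` with `donor`."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return donor
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, vary(left, donor), right)
    return (op, left, vary(right, donor))

def evolve(pop_size=30, generations=15, max_depth=6):
    pop = [random_tree() for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=f_measure, reverse=True)
        history.append(f_measure(pop[0]))
        parents = pop[: pop_size // 2]               # truncation selection
        children = []
        while len(children) < pop_size - 1:
            a, b = rng.choice(parents), rng.choice(parents)
            donor = b if rng.random() < 0.7 else random_tree(2)  # crossover-like vs mutation-like
            child = vary(a, donor)
            children.append(child if tree_depth(child) <= max_depth else a)
        pop = [pop[0]] + children                    # elitism: keep the best tree
    best = max(pop, key=f_measure)
    history.append(f_measure(best))
    return best, history

best_tree, history = evolve()
print("base classifiers:", len(base))
print("best training F-measure per generation:", [round(h, 3) for h in history])
```

Because the best tree is carried over unchanged each generation, the recorded training F-measure never decreases. A realistic evaluation would score the evolved combination on held-out data rather than on the training set used for fitness.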
This article is indexed in Wanfang Data and other databases.