
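Web spam detection based on random forest and under-sampling ensemble
Cite this article: LU Xiaoyong (卢晓勇), CHEN Musheng (陈木生). Web spam detection based on random forest and under-sampling ensemble[J]. Journal of Computer Applications (计算机应用), 2016, 36(3): 731-734.
Authors: LU Xiaoyong (卢晓勇)  CHEN Musheng (陈木生)
Affiliation: 1. School of Software, Nanchang University, Nanchang 330047; 2. School of Information Engineering, Nanchang University, Nanchang 330031
Funding: Science and Technology Support Program of Jiangxi Province (20131102040039).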
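Abstract: To address the imbalanced classification and "curse of dimensionality" problems in Web spam detection, a binary classifier algorithm based on Random Forest (RF) and under-sampling ensemble is proposed. First, under-sampling is used to draw multiple sub-sample sets from the majority class of the training set, and each is merged with the minority-class sample set to form multiple balanced training subsets. Next, one random forest classifier is trained on each training subset. Finally, the random forest classifiers classify the test set, and the final class of each test sample is decided by voting. Experiments on the WEBSPAM UK-2006 dataset show that, for Web spam detection, the ensemble classifier outperforms the random forest algorithm and its Bagging and Adaboost ensembles, improving accuracy, F1-measure, and area under the ROC curve (AUC) by at least 14%, 13%, and 11%, respectively. Compared with the results of the winning teams of Web spam challenge 2007, the ensemble classifier improves F1-measure by at least 1% and achieves the best result on AUC.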

Keywords: Web spam detection; random forest; under-sampling; ensemble classifier; machine learning
Received: 2015-08-10
Revised: 2015-10-03

Web spam detection based on random forest and under-sampling ensemble
LU Xiaoyong, CHEN Musheng. Web spam detection based on random forest and under-sampling ensemble[J]. Journal of Computer Applications, 2016, 36(3): 731-734.
Authors:LU Xiaoyong  CHEN Musheng
Affiliation:1. School of Software, Nanchang University, Nanchang Jiangxi 330047, China;2. Information Engineering School, Nanchang University, Nanchang Jiangxi 330031, China
Abstract: To solve the problems of imbalanced classification and the "curse of dimensionality" in Web spam detection, a binary classifier algorithm based on Random Forest (RF) and under-sampling ensemble was proposed. Firstly, the majority-class samples in the training dataset were under-sampled into several sub-sample sets, and each of them was combined with the minority-class samples to generate several balanced training subsets. Then one RF classifier was trained on each of these training subsets. Finally, the RF classifiers were used to classify the testing samples, and the class of each testing sample was determined by voting. Experiments on the WEBSPAM UK-2006 dataset show that the ensemble classifier outperformed RF, Bagging with RF, and Adaboost with RF, with accuracy, F1-measure, and Area Under the ROC Curve (AUC) increased by at least 14%, 13%, and 11%, respectively. Compared with the winners of Web spam challenge 2007, the ensemble classifier increased F1-measure by at least 1% and achieved the best result in AUC.
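The three steps in the abstract (under-sample the majority class into balanced subsets, train one random forest per subset, decide by voting) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the number of subsets, the forest size, and the choice to draw each majority sub-sample independently without replacement at the size of the minority class are all assumptions, since the abstract does not specify them. The default base learner assumes scikit-learn's `RandomForestClassifier` is available.

```python
import random
from collections import Counter


def balanced_subsets(X, y, minority_label, n_subsets, seed=0):
    """Build n_subsets balanced training sets.

    Assumption: each set holds all minority samples plus an equally sized
    random draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    majority = [i for i, label in enumerate(y) if label != minority_label]
    size = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        idx = rng.sample(majority, size) + minority
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


def make_forest(seed):
    """Default base learner (hypothetical settings, not from the paper)."""
    # Imported lazily so the sampling and voting helpers run without scikit-learn.
    from sklearn.ensemble import RandomForestClassifier
    return RandomForestClassifier(n_estimators=100, random_state=seed)


def train_ensemble(X, y, minority_label, n_subsets=5,
                   make_classifier=make_forest, seed=0):
    """Train one classifier per balanced subset and return the list of models."""
    models = []
    subsets = balanced_subsets(X, y, minority_label, n_subsets, seed)
    for k, (X_sub, y_sub) in enumerate(subsets):
        model = make_classifier(seed + k)
        model.fit(X_sub, y_sub)
        models.append(model)
    return models


def predict_by_vote(models, X):
    """Return the majority-vote label for every row of X."""
    all_preds = [list(model.predict(X)) for model in models]
    # zip(*all_preds) groups the votes that each sample received.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*all_preds)]
```

With scikit-learn installed, a call such as `models = train_ensemble(X_train, y_train, minority_label=1)` followed by `predict_by_vote(models, X_test)` would run the whole pipeline, where `X_train`, `y_train`, and `X_test` are placeholder names for feature rows and labels. Using an odd number of subsets avoids tied votes in the binary case.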
Keywords: Web spam detection; Random Forest (RF); under-sampling; ensemble classifier; machine learning