
Three random under-sampling based ensemble classifiers for Web spam detection
Cite this article: CHEN Musheng, LU Xiaoyong. Three random under-sampling based ensemble classifiers for Web spam detection[J]. Journal of Computer Applications, 2017, 37(2): 535-539.
Authors: CHEN Musheng, LU Xiaoyong
Affiliations: 1. School of Information Engineering, Nanchang University, Nanchang, Jiangxi 330031, China; 2. School of Software, Nanchang University, Nanchang, Jiangxi 330047, China
Funding: Science and Technology Support Program of Jiangxi Province (20131102040039).
Abstract: To address the slightly imbalanced classification problem in Web spam detection, three ensemble classifiers based on random under-sampling are proposed: Random Under-Sampling once without replacement (RUS-once), Random Under-Sampling multiple times without replacement (RUS-multiple), and Random Under-Sampling with replacement (RUS-replacement). First, one of the under-sampling techniques converts the imbalanced training set into several balanced datasets. Second, a Classification And Regression Tree (CART) classifier is trained on each balanced dataset. Finally, the CART classifiers are combined by simple voting into an ensemble classifier that classifies the test samples. The experimental results show that all three ensemble classifiers achieve good classification results, and that RUS-multiple and RUS-replacement perform better than RUS-once. Compared with CART, Bagging with CART, and AdaBoost with CART, RUS-multiple and RUS-replacement increase the AUC value by about 10% on WEBSPAM UK-2006 and by about 25% on WEBSPAM UK-2007; compared with several state-of-the-art baseline classification models, RUS-multiple and RUS-replacement achieve the best AUC values.

Keywords: Web spam detection; imbalanced classification; ensemble learning; under-sampling; Classification And Regression Tree (CART)
Received: 2016-08-01
Revised: 2016-08-22
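
The three-step procedure in the abstract (under-sample into balanced sets, train one classifier per set, combine by simple voting) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. The abstract does not specify how each sampling scheme draws its subsets, so the three sampler functions below are one plausible reading of the names. The base learner `train_centroid` is a nearest-centroid stand-in on a single feature, NOT CART; in practice a CART-style learner such as scikit-learn's `DecisionTreeClassifier` could be passed in instead. All function names, the toy data, and the convention that label 1 marks the minority (spam) class are assumptions made for this example.

```python
import random
from collections import Counter

def rus_once(majority, n_min, n_sets, rng):
    # Assumed reading of RUS-once: shuffle the majority class a single time
    # and cut it into disjoint chunks, so no sample is reused across subsets.
    pool = majority[:]
    rng.shuffle(pool)
    n_sets = min(n_sets, len(pool) // n_min)
    return [pool[i * n_min:(i + 1) * n_min] for i in range(n_sets)]

def rus_multiple(majority, n_min, n_sets, rng):
    # Assumed reading of RUS-multiple: repeat an independent draw without
    # replacement; subsets may overlap, but no subset contains duplicates.
    return [rng.sample(majority, n_min) for _ in range(n_sets)]

def rus_replacement(majority, n_min, n_sets, rng):
    # Assumed reading of RUS-replacement: draw with replacement, so a sample
    # may appear more than once inside the same subset.
    return [rng.choices(majority, k=n_min) for _ in range(n_sets)]

def train_centroid(samples):
    # Stand-in base learner (not CART): predict the class whose mean
    # feature value is closest to the input.
    mean = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        mean[label] = sum(values) / len(values)
    return lambda x: min(mean, key=lambda label: abs(x - mean[label]))

def build_ensemble(data, sampler, n_sets, train, seed=0):
    # Pair every majority-class subset with the full minority class and
    # train one model per balanced set.
    rng = random.Random(seed)
    minority = [s for s in data if s[1] == 1]
    majority = [s for s in data if s[1] == 0]
    subsets = sampler(majority, len(minority), n_sets, rng)
    return [train(minority + subset) for subset in subsets]

def predict(ensemble, x):
    # Simple majority vote over the individual models.
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]

# Toy data: 30 normal pages (label 0) and 10 spam pages (label 1),
# each described by a single made-up feature value.
data = [(float(v), 0) for v in range(30)] + [(float(v), 1) for v in range(40, 50)]

for sampler in (rus_once, rus_multiple, rus_replacement):
    ensemble = build_ensemble(data, sampler, n_sets=3, train=train_centroid)
    print(sampler.__name__, predict(ensemble, 45.0), predict(ensemble, 5.0))
```

An odd number of balanced sets avoids tied votes in the two-class case. Note that under this reading of RUS-once, the number of disjoint subsets is capped by the ratio of majority to minority samples, which is small when the imbalance is only slight.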