
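Spam Web Page Detection Combining Hybrid Sampling and a Genetic Algorithm
Cite this article: LIU Han. Spam web page detection combining hybrid sampling and a genetic algorithm[J]. Journal of Beijing University of Posts and Telecommunications, 2019, 42(6): 111-117.
Author: LIU Han
Affiliation: School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876; Key Laboratory of Trustworthy Distributed Computing and Service (Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876
Funding: National Key Research and Development Program of China (2017YFC1307705)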
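Abstract: Spam web page detection suffers from imbalanced data and a high-dimensional feature space. To address both problems, an ensemble classification algorithm based on random hybrid sampling and a genetic algorithm is proposed. First, random hybrid sampling is applied: random sampling reduces the number of majority-class samples, and the synthetic minority over-sampling technique (SMOTE) generates minority-class samples, yielding multiple balanced training subsets. Next, an improved genetic algorithm reduces the dimensionality of the training data, producing multiple training subsets with optimal features. Extreme gradient boosting (XGBoost) is then used as the classifier and trained on the balanced subsets, and the resulting classifiers are combined by simple voting into a new classifier. Finally, the test set is predicted to obtain the final result. Experiments show that, compared with XGBoost, the proposed algorithm improves accuracy by about 19.25%, shortens the time needed to build the learning model, and improves classification performance, making it a comparatively good classification algorithm.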

Keywords: spam web page detection; hybrid sampling; ensemble classification; genetic algorithm; extreme gradient boosting
Received: 2019-11-22
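
The random hybrid sampling step described in the abstract can be sketched roughly as follows. This is a minimal standard-library illustration, not the paper's implementation: the function names (`smote_like`, `hybrid_sample`, `balanced_subsets`), the neighbour count `k`, and the per-class size `target` are all assumptions introduced here, and the interpolation is a simplified SMOTE-style procedure. It assumes at least two minority samples and `len(minority) <= target <= len(majority)`.

```python
import random


def smote_like(minority, n_new, k=3, rng=random):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding `base` itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic


def hybrid_sample(majority, minority, target, rng=random):
    """Return one balanced subset with `target` samples per class."""
    kept_majority = rng.sample(majority, target)  # random undersampling
    new_minority = smote_like(minority, target - len(minority), rng=rng)
    X = kept_majority + minority + new_minority
    y = [0] * target + [1] * target  # 0 = majority class, 1 = minority class
    return X, y


def balanced_subsets(majority, minority, target, n_subsets, seed=0):
    """Draw several balanced subsets, one per base classifier."""
    rng = random.Random(seed)
    return [hybrid_sample(majority, minority, target, rng) for _ in range(n_subsets)]


# Toy data (made up for illustration): 100 majority and 10 minority samples.
majority = [[float(i), float(i % 7)] for i in range(100)]
minority = [[200.0 + i, 3.0] for i in range(10)]
subsets = balanced_subsets(majority, minority, target=30, n_subsets=5)
```

Each subset keeps a different random slice of the majority class, which is what lets the later ensemble see more of the majority data than any single undersampled set would.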

Spam Web Detection Based on Hybrid-Sampling and Genetic Algorithm
LIU Han.Spam Web Detection Based on Hybrid-Sampling and Genetic Algorithm[J].Journal of Beijing University of Posts and Telecommunications,2019,42(6):111-117.
Authors:LIU Han
Affiliation:1. School of Software Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Key Laboratory of Trustworthy Distributed Computing and Service(Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876, China
Abstract: Spam web detection is often troubled by unbalanced data and a high-dimensional feature space. To solve these two problems, an ensemble classification algorithm based on random hybrid sampling and a genetic algorithm is proposed. First, a number of balanced training data subsets are obtained by reducing the number of majority samples through random sampling and generating minority samples with the synthetic minority over-sampling technique (SMOTE). Then, an improved genetic algorithm is used to reduce the dimension of the training data set, yielding multiple training data subsets with optimal features. Extreme gradient boosting (XGBoost) is used as the classifier and trained on the multiple balanced data subsets, and a new classifier is obtained by combining the resulting classifiers with simple voting. Finally, the test set is predicted and the final prediction is obtained. Experiments show that, compared with XGBoost, the proposed algorithm improves the accuracy by about 19.25%, reduces the time to build the learning model, and improves the classification performance.
Keywords:spam web detection  hybrid-sampling  ensemble classification  genetic algorithm  extreme gradient boosting  
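
The simple voting step mentioned in the abstract can be illustrated with a short sketch. The threshold functions below are hypothetical stand-ins for the trained XGBoost models, and `majority_vote` is a name introduced here for illustration; the paper's actual ensemble code is not shown on this page.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by the largest number of classifiers.

    Use an odd number of binary classifiers to avoid ties; on a tie,
    the label that was voted first wins.
    """
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


def make_threshold_classifier(threshold):
    """Hypothetical stand-in for one trained model: label 1 (spam) when
    the first feature exceeds `threshold`, otherwise label 0."""
    return lambda x: 1 if x[0] > threshold else 0


# Three stand-in classifiers, as if trained on three balanced subsets.
classifiers = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
prediction = majority_vote(classifiers, [0.6])  # two of the three vote 1
```

Simple (unweighted) voting treats every base classifier equally, which fits the setting where each one is trained on a balanced subset of the same size.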
This article is indexed in Wanfang Data and other databases.