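Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition
Cite this article: QIU Ningjia, SHEN Zhuorui, WANG Hui, WANG Peng. Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition[J]. Computer Engineering and Applications, 2020, 56(17): 121-128. DOI: 10.3778/j.issn.1002-8331.1906-0149
Authors: QIU Ningjia  SHEN Zhuorui  WANG Hui  WANG Peng
Affiliation: School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022
Funding: Science and Technology Project of the Jilin Provincial Department of Education under the 13th Five-Year Plan; Technology Research Project of the Jilin Province Science and Technology Development Plan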
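Abstract: When random down-sampling is applied to imbalanced communication text to improve classifier performance, the randomly drawn samples can yield biased estimates. To address this problem, a down-sampling algorithm based on negative selection density clustering (NSDC-DS) is proposed. The self/nonself anomaly detection mechanism of the negative selection algorithm is used to improve traditional clustering: sample center points and the samples to be clustered serve as the detectors and the self set, respectively, and the two are matched for anomalies. The negative selection density clustering algorithm is then used to assess sample similarity and improve the traditional down-sampling method, and an NBSVM classifier performs spam recognition on the sampled communication samples. PCA is used to evaluate the amount of information carried by the samples, and an improved PCA-SGD algorithm is proposed to tune the model parameters, completing the semi-supervised recognition task for communication spam text. To verify the improved algorithms, several data sets, including imbalanced communication text, are used to compare negative selection density clustering, NSDC-DS, and PCA-SGD against traditional models. The experimental results show that the improved model not only recognizes communication spam text well but also converges quickly and stably.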

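Keywords: imbalanced data  spam text recognition  negative selection density clustering  down-sampling algorithm based on negative selection density clustering (NSDC-DS)  stochastic gradient descent based on principal component analysis (PCA-SGD) algorithm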

Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition
QIU Ningjia,SHEN Zhuorui,WANG Hui,WANG Peng. Semi-supervised Learning Optimization Algorithm for Communication Spam Text Recognition[J]. Computer Engineering and Applications, 2020, 56(17): 121-128. DOI: 10.3778/j.issn.1002-8331.1906-0149
Authors:QIU Ningjia  SHEN Zhuorui  WANG Hui  WANG Peng
Affiliation:School of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
Abstract:Random under-sampling is often used to improve classifier performance on unbalanced communication text, but the randomly drawn samples can give biased estimates. To address this problem, a Down-Sampling algorithm based on Negative Selection Density Clustering (NSDC-DS) is proposed. Firstly, the self/nonself anomaly detection mechanism of the negative selection algorithm is used to improve traditional clustering: sample center points serve as detectors, the samples to be clustered serve as the self set, and the two are matched for anomalies. Then the negative selection density clustering algorithm is used to evaluate the similarity of samples and improve the traditional down-sampling method, and the sampled communication samples are classified as spam or non-spam with the NBSVM classifier. Finally, PCA is used to evaluate the information content of samples, and an improved PCA-SGD algorithm is proposed to tune the model parameters and complete the semi-supervised recognition task for communication spam text. To verify the improved algorithms, multiple data sets, including unbalanced communication text, are used to compare negative selection density clustering, NSDC-DS, and PCA-SGD with traditional models. Experimental results show that the improved model not only recognizes communication spam text well, but also converges quickly and stably.
Keywords:unbalanced data  spam text recognition  negative selection density clustering  Down-Sampling algorithm based on Negative Selection Density Clustering(NSDC-DS)  Stochastic Gradient Descent based on Principal Component Analysis(PCA-SGD) algorithm  
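
The abstract names NBSVM as the classifier applied after down-sampling. As a rough illustration only, the sketch below computes the naive Bayes log-count ratio that NB-SVM style models (in the commonly cited Wang and Manning formulation) use to rescale binarized token features before a linear classifier is trained. The function names, the smoothing value `alpha`, and the toy corpus are assumptions made for this example; the paper's actual feature pipeline and the NSDC-DS sampling step are not reproduced here.

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, alpha=1.0):
    """Return {token: r} with r = log((p / |p|_1) / (q / |q|_1)).

    p and q are smoothed, binarized token counts over spam (label 1)
    and non-spam (label 0) documents. alpha is an assumed smoothing value.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        # set(doc) binarizes: each token counts once per document.
        (pos if y == 1 else neg).update(set(doc))
    p = {t: alpha + pos[t] for t in vocab}
    q = {t: alpha + neg[t] for t in vocab}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {t: math.log((p[t] / p_norm) / (q[t] / q_norm)) for t in vocab}


def nb_features(doc, ratios):
    """Scale binary token indicators by the log-count ratio (sorted vocab order)."""
    present = set(doc)
    return [ratios[t] if t in present else 0.0 for t in sorted(ratios)]


# Toy, made-up corpus (hypothetical; not data from the paper).
docs = [["win", "cash", "now"], ["cash", "prize"],
        ["see", "you", "now"], ["lunch", "now"]]
labels = [1, 1, 0, 0]

ratios = log_count_ratio(docs, labels)
features = nb_features(["cash", "now"], ratios)
```

Tokens that appear mostly in spam receive a positive ratio and tokens that appear mostly in non-spam receive a negative one; the rescaled vectors would then be passed to a linear SVM or logistic regression, which is omitted here.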
This article is indexed in Wanfang Data and other databases.