
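Analysis of complex spam filtering algorithm based on neural network
Cite this article: ZHANG Jian, YAN Ke, MA Xiang. Analysis of complex spam filtering algorithm based on neural network[J]. Journal of Computer Applications, 2022, 42(3): 770-777.
Authors: ZHANG Jian  YAN Ke  MA Xiang
Affiliation: College of Information Engineering, China Jiliang University, Hangzhou 310018
Funding: Natural Science Foundation of Zhejiang Province (LY19F020016, LQ20F050009)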
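Abstract: Spam recognition is one of the main tasks in natural language processing. Traditional methods are based on text features or word frequency, and their recognition accuracy depends mainly on whether specific keywords appear. They therefore perform poorly when keywords are misrecognized or when the spam text contains no such keywords. To address this problem, neural network-based methods were proposed. First, traditional methods were trained and tested on this type of spam text. Then, spam that the traditional methods found difficult to recognize was selected from spam SMS, advertisement and spam email datasets, and the same number of normal messages was randomly selected from the original datasets, forming three new datasets without duplicate data. Finally, three models were built on the basis of convolutional neural networks and recurrent neural networks and trained for recognition on the new datasets. Experimental results show that the neural network-based methods learn better semantic features from text and achieve accuracies above 98% on all three datasets, higher than traditional methods such as Naive Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The results also show that different neural networks suit text classification at different lengths: models composed of recurrent neural networks are good at recognizing sentence-length text, models composed of convolutional neural networks are good at recognizing paragraph-length text, and models composed of both are good at recognizing chapter-length text.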

Keywords: spam  recognition and filtering  text feature  word frequency  neural network
Received: 2021-05-17
Revised: 2021-06-04

Analysis of complex spam filtering algorithm based on neural network
ZHANG Jian, YAN Ke, MA Xiang. Analysis of complex spam filtering algorithm based on neural network[J]. Journal of Computer Applications, 2022, 42(3): 770-777.
Authors:ZHANG Jian  YAN Ke  MA Xiang
Affiliation: College of Information Engineering, China Jiliang University, Hangzhou, Zhejiang 310018, China
Abstract: The recognition of spam is one of the main tasks in natural language processing. Traditional methods are based on text features or word frequency, and their recognition accuracy depends mainly on the presence or absence of specific keywords. When the spam contains no such keywords, or when the keywords are recognized incorrectly, the traditional methods perform poorly. Neural network-based methods were therefore proposed. First, recognition training and testing were conducted on this complex spam using the traditional methods. Then, the spam that the traditional methods could not recognize was collected from spam message, advertisement and spam email datasets, and the same amount of normal information was randomly selected from those datasets, forming three new datasets without duplicate data. Finally, three models were proposed based on convolutional neural networks and recurrent neural networks and were evaluated on the three new datasets for spam recognition. The experimental results show that the neural network-based models learned better semantic features from the text and achieved accuracies of more than 98% on all three datasets, higher than those of traditional methods such as Naive Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The experimental results also show that different neural networks are suitable for classifying text of different lengths: the models composed of recurrent neural networks are good at recognizing sentence-length text, the models composed of convolutional neural networks are good at recognizing paragraph-length text, and the models composed of both are good at recognizing chapter-length text.
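
The dataset construction step described in the abstract (keep the spam a traditional filter misses, add an equal number of randomly chosen normal messages, and avoid duplicates) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the keyword list `SPAM_KEYWORDS`, the toy baseline `keyword_filter`, the helper `build_hard_dataset`, and the example messages are all hypothetical, and the paper's actual baselines are NB, RF and SVM classifiers rather than a keyword lookup.

```python
import random

# Hypothetical keyword list standing in for a keyword-driven traditional filter.
SPAM_KEYWORDS = {"free", "winner", "prize"}


def keyword_filter(text):
    """Toy baseline: flag a message as spam if it contains any keyword."""
    words = set(text.lower().split())
    return bool(words & SPAM_KEYWORDS)


def build_hard_dataset(messages, baseline=keyword_filter, seed=0):
    """Build a balanced dataset of baseline-missed spam plus random normal text.

    messages: iterable of (text, is_spam) pairs.
    Returns a shuffled list of unique (text, is_spam) pairs.
    """
    # Drop duplicate texts while preserving the original order.
    seen = set()
    unique = []
    for text, is_spam in messages:
        if text not in seen:
            seen.add(text)
            unique.append((text, is_spam))

    # "Hard" spam: spam that the baseline fails to flag.
    hard_spam = [(t, y) for t, y in unique if y and not baseline(t)]
    ham = [(t, y) for t, y in unique if not y]

    rng = random.Random(seed)
    # Sample as many normal messages as there are hard spam messages.
    sampled_ham = rng.sample(ham, min(len(hard_spam), len(ham)))

    dataset = hard_spam + sampled_ham
    rng.shuffle(dataset)
    return dataset


# Made-up example messages (illustrative only).
EXAMPLE = [
    ("You are a winner claim your free prize", True),   # caught by the baseline
    ("Your acc0unt needs verification, reply with your PIN", True),
    ("Cheap l0ans approved today", True),
    ("Cheap l0ans approved today", True),                # duplicate text
    ("Lunch at noon tomorrow?", False),
    ("Meeting moved to room 3", False),
    ("Please review the attached report", False),
]

hard = build_hard_dataset(EXAMPLE)
```

In this sketch the first message is excluded because the keyword baseline already catches it, so the resulting set contains only the obfuscated spam plus an equal-sized random sample of normal messages. A neural model would then be trained on such a set.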
Keywords:spam  recognition and filtering  text feature  word frequency  neural network  
This article is indexed in Wanfang Data and other databases.