Analysis of complex spam filtering algorithm based on neural network

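Cite this article:	ZHANG Jian, YAN Ke, MA Xiang. Analysis of complex spam filtering algorithm based on neural network[J]. Journal of Computer Applications, 2022, 42(3): 770-777.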

Authors:	ZHANG Jian, YAN Ke, MA Xiang

Affiliation:	College of Information Engineering, China Jiliang University, Hangzhou 310018

Funding:	Natural Science Foundation of Zhejiang Province (LY19F020016, LQ20F050009)

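Abstract:	Spam recognition is one of the main tasks in natural language processing. Traditional methods rely on text features or word frequency, so their accuracy depends mainly on whether specific keywords appear. They perform poorly when keywords are misrecognized or when the spam text contains no such keywords. To address this, neural network-based methods are proposed. First, traditional methods are trained and tested on this type of spam text. Then, the spam that traditional methods find hard to recognize is selected from spam SMS, advertisement and spam email datasets, and the same number of normal messages is randomly drawn from the original datasets, forming three new datasets with no duplicate data. Finally, three models are built on convolutional neural networks and recurrent neural networks and trained on the new datasets. Experimental results show that the neural network-based methods learn better semantic features from text and reach accuracies above 98% on all three datasets, higher than traditional methods such as Naive Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The results also show that different neural networks suit texts of different lengths: models composed of recurrent neural networks are good at sentence-length text, models composed of convolutional neural networks are good at paragraph-length text, and models composed of both are good at chapter-length text.
Keywords:	spam recognition and filtering; text feature; word frequency; neural network
Received:	2021-05-17
Revised:	2021-06-04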
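
The dataset construction step described in the abstract (keep the spam that traditional methods struggle with, add an equal number of randomly chosen normal messages, allow no duplicates) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the function name `build_hard_dataset`, the `is_hard_spam` callback and the `"spam"`/`"ham"` labels are all assumptions introduced here.

```python
import random


def build_hard_dataset(messages, is_hard_spam, seed=0):
    """Build a balanced dataset of hard spam and random normal messages.

    messages:     list of (text, label) pairs, label is "spam" or "ham"
                  (label names are an assumption of this sketch).
    is_hard_spam: hypothetical callback, text -> bool, standing in for
                  "a traditional classifier found this spam hard to recognize".
    """
    rng = random.Random(seed)

    # Remove duplicate texts while preserving the original order.
    seen = set()
    unique = []
    for text, label in messages:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))

    hard_spam = [(t, l) for t, l in unique if l == "spam" and is_hard_spam(t)]
    ham = [(t, l) for t, l in unique if l == "ham"]
    if len(ham) < len(hard_spam):
        raise ValueError("not enough normal messages to balance the dataset")

    # Draw the same number of normal messages as there are hard spam messages.
    sampled_ham = rng.sample(ham, len(hard_spam))

    dataset = hard_spam + sampled_ham
    rng.shuffle(dataset)
    return dataset
```

In practice `is_hard_spam` would be derived from the errors of a trained baseline classifier; here it is left as a plain callback so the selection logic stays visible.
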
Analysis of complex spam filtering algorithm based on neural network

ZHANG Jian, YAN Ke, MA Xiang. Analysis of complex spam filtering algorithm based on neural network[J]. Journal of Computer Applications, 2022, 42(3): 770-777.

Authors:	ZHANG Jian, YAN Ke, MA Xiang

Affiliation:	College of Information Engineering, China Jiliang University, Hangzhou Zhejiang 310018, China

Abstract:	The recognition of spam is one of the main tasks in natural language processing. Traditional methods are based on text features or word frequency, and their recognition accuracy depends mainly on the presence or absence of specific keywords. When the spam contains no such keywords, or when the keywords are recognized incorrectly, the traditional methods perform poorly. Neural network-based methods were therefore proposed. First, traditional methods were trained and tested on this kind of complex spam. Then, the spam that traditional methods could not recognize was collected from spam message, advertisement and spam email datasets, and the same amount of normal information was randomly selected from the original datasets, forming three new datasets without duplicate data. Finally, three models based on convolutional neural networks and recurrent neural networks were built and trained on the three new datasets for spam recognition. The experimental results show that the neural network-based models learned better semantic features from the text and achieved accuracies of more than 98% on all three datasets, significantly higher than those of traditional methods such as Naive Bayes (NB), Random Forest (RF) and Support Vector Machine (SVM). The experimental results also show that different neural networks are suitable for classifying texts of different lengths. The models composed of recurrent neural networks are good at recognizing sentence-length text, the models composed of convolutional neural networks are good at recognizing paragraph-length text, and the models composed of both neural networks are good at recognizing chapter-length text.

Keywords:	spam recognition and filtering; text feature; word frequency; neural network
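
The keyword dependence that the abstract attributes to word-frequency methods can be illustrated with a small Naive Bayes classifier over word counts. This is a minimal sketch under stated assumptions, not the baseline used in the paper: the functions `train_nb` and `predict_nb`, the whitespace tokenization, the Laplace smoothing and the toy messages are all introduced here for illustration.

```python
import math
from collections import Counter


def train_nb(samples):
    """Count classes and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.lower().split():
            counts[word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the label with the highest log-posterior (Laplace smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data, invented for this sketch.
train = [
    ("win free prize now", "spam"),
    ("free cash click now", "spam"),
    ("meeting moved to monday", "ham"),
    ("see you at lunch", "ham"),
    ("lunch on monday then", "ham"),
]
model = train_nb(train)
```

With this toy model, `predict_nb(model, "free prize")` is classified as spam because both words were seen in spam, whereas `predict_nb(model, "exclusive offer inside")` contains no known word, so the decision falls back to the class prior and the message is labelled ham. That is the failure mode on keyword-free spam that motivates learning semantic features instead.
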
This article is indexed in Wanfang Data and other databases.