
Design and implementation of a spam filtering system based on a topic model
Citation:Xiaohuai KOU,Hua CHENG.Design and implementation of spam filtering system based on topic model[J].Telecommunications Science (电信科学),2017,33(11):73-82.
Authors:Xiaohuai KOU (寇晓淮)  Hua CHENG (程华)
Affiliation:College of Information Science and Engineering,East China University of Science and Technology,Shanghai 200237,China
Abstract:Spam filtering plays an important role in ensuring information security, improving resource utilization, and sorting information. Spam degrades the user experience and causes unnecessary losses of money and time. To address the shortcomings of existing spam filtering techniques, a naive Bayes spam classification method built on multiple topic words is proposed. For topic extraction, the LDA topic model is used to obtain the topics and topic words of each message, and Word2Vec is then used to find synonyms and related words for those topic words, expanding the topic-word set. For classification, the prior probabilities of words are learned statistically from the training dataset. Based on the expanded topic-word set and its probabilities, the joint probability of a topic and a message is derived with Bayes' formula and used as the basis for deciding whether the message is spam. The resulting topic-model-based filtering system is also simple and easy to apply. Comparative experiments against other typical spam filtering methods show that both the topic-model-based classification method and its Word2Vec-based improvement effectively increase the accuracy of spam filtering.

Keywords:text classification  spam  topic model  Bayesian theory  
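
The pipeline outlined in the abstract (topic words, expansion with related words, then a naive Bayes decision) can be illustrated with a small sketch. This is not the authors' implementation or their exact derivation: the seed topic words stand in for LDA output, the `RELATED` table stands in for Word2Vec nearest neighbours, and the function names, toy messages, and Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical stand-ins: in the described system the seed words would come
# from LDA and the related words from Word2Vec; here both are hand-written.
SEED_TOPIC_WORDS = {"free", "prize", "meeting"}
RELATED = {"free": ["discount"], "prize": ["winner"], "meeting": ["agenda"]}

# Toy training data (assumed): tokenized messages with class labels.
MESSAGES = [
    ["free", "prize", "winner", "click"],
    ["discount", "free", "offer"],
    ["meeting", "agenda", "tomorrow"],
    ["project", "meeting", "notes"],
]
LABELS = ["spam", "spam", "ham", "ham"]


def expand_topic_words(seed_words, related):
    """Return the seed words plus every related word listed for them."""
    expanded = set(seed_words)
    for word in seed_words:
        expanded.update(related.get(word, ()))
    return expanded


def train(messages, labels, vocabulary):
    """Estimate class priors and Laplace-smoothed P(word | class),
    counting only words that belong to the topic-word vocabulary."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for tokens, label in zip(messages, labels):
        word_counts[label].update(t for t in tokens if t in vocabulary)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocabulary)
        likelihoods[c] = {w: (counts[w] + 1) / total for w in vocabulary}
    return priors, likelihoods


def log_joint(tokens, label, priors, likelihoods):
    """log P(class, message) under the naive Bayes independence assumption;
    tokens outside the vocabulary are ignored."""
    score = math.log(priors[label])
    for token in tokens:
        if token in likelihoods[label]:
            score += math.log(likelihoods[label][token])
    return score


def classify(tokens, priors, likelihoods):
    """Pick the class with the highest joint log-probability."""
    return max(priors, key=lambda c: log_joint(tokens, c, priors, likelihoods))


vocabulary = expand_topic_words(SEED_TOPIC_WORDS, RELATED)
priors, likelihoods = train(MESSAGES, LABELS, vocabulary)
print(classify(["winner", "discount", "today"], priors, likelihoods))
```

In this toy setup the test message contains none of the seed words, so a model trained on the seed vocabulary alone would score both classes identically; only the expanded vocabulary gives the classifier evidence to work with, which is the role the abstract assigns to the Word2Vec step.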
This article is indexed in Wanfang Data (万方数据) and other databases.