
Spam Filtering Method Based on Learning from Small Samples
Cite this article: PAN Jie-zhu, ZHOU Xiao, WU Gong-qing, HU Xue-gang. Spam Filtering Method Based on Learning from Small Samples[J]. Computer Engineering, 2010, 36(21): 245-247.
Authors: PAN Jie-zhu, ZHOU Xiao, WU Gong-qing, HU Xue-gang
Affiliation: (1. Department of Computer Science and Technology, Hefei Normal University, Hefei 230061, China; 2. School of Computer and Information, Hefei University of Technology, Hefei 230009, China)
Funding: National "973" Program of China; National Natural Science Foundation of China; Anhui Provincial Natural Science Research Fund for Higher Education Institutions
Abstract: Client-side spam filters rarely have access to enough labeled e-mails for training. To address this problem, this paper proposes a spam filtering method based on learning from small samples, which uses readily available unlabeled e-mails to improve filtering performance. An initial Naive Bayes (NB) classifier is trained on a small set of labeled e-mails and used to label the unlabeled e-mails probabilistically. A new classifier is then trained on all of the data, and the labeling and training steps are iterated with the EM algorithm until convergence. Experimental results show that, given only 5 to 20 labeled training e-mails, the method effectively improves spam filtering performance.

Keywords: learning from small samples; EM algorithm; unlabeled data; spam filtering
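
The procedure summarized in the abstract (train an initial Naive Bayes classifier on a few labeled e-mails, label the unlabeled e-mails with it, retrain on all data, and repeat until convergence) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the feature representation, smoothing, or convergence test, so the code assumes a multinomial bag-of-words model, Laplace smoothing (`alpha`), soft (probabilistic) labels, and a stopping rule based on the change in posteriors. The function names `train_nb`, `posterior`, and `em_naive_bayes`, and the toy e-mails, are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, alpha=1.0):
    """M-step: fit multinomial Naive Bayes from (possibly soft) labels.

    docs    : list of token lists
    weights : per-document [P(ham), P(spam)]
    alpha   : Laplace smoothing constant (an assumption, not from the paper)
    """
    class_mass = [alpha, alpha]            # smoothed class counts
    word_mass = [Counter(), Counter()]     # fractional word counts per class
    for tokens, w in zip(docs, weights):
        for c in (0, 1):
            class_mass[c] += w[c]
            for t in tokens:
                word_mass[c][t] += w[c]
    total = sum(class_mass)
    log_prior = [math.log(m / total) for m in class_mass]
    log_like = []
    for c in (0, 1):
        denom = sum(word_mass[c].values()) + alpha * len(vocab)
        log_like.append(
            {t: math.log((word_mass[c][t] + alpha) / denom) for t in vocab}
        )
    return log_prior, log_like


def posterior(tokens, model):
    """Return [P(ham | tokens), P(spam | tokens)]; unknown words are skipped."""
    log_prior, log_like = model
    scores = [
        log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in (0, 1)
    ]
    top = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, unlabeled, max_iter=50, tol=1e-6):
    """labeled: list of (tokens, y) with y = 1 for spam, 0 for ham."""
    l_docs = [d for d, _ in labeled]
    l_weights = [[1.0 - y, float(y)] for _, y in labeled]
    vocab = {t for d in l_docs + unlabeled for t in d}
    model = train_nb(l_docs, l_weights, vocab)   # initial classifier
    prev = None
    for _ in range(max_iter):
        # E-step: probabilistically label the unlabeled e-mails.
        u_weights = [posterior(d, model) for d in unlabeled]
        # M-step: retrain on labeled plus soft-labeled data.
        model = train_nb(l_docs + unlabeled, l_weights + u_weights, vocab)
        if prev is not None:
            change = max(
                (abs(a[1] - b[1]) for a, b in zip(u_weights, prev)),
                default=0.0,
            )
            if change < tol:
                break
        prev = u_weights
    return model


# Toy data, invented for illustration only.
labeled = [
    ("win free cash now".split(), 1),
    ("meeting agenda attached".split(), 0),
]
unlabeled = [
    "free cash prize".split(),
    "claim free prize now".split(),
    "agenda for project meeting".split(),
    "project report attached".split(),
]
model = em_naive_bayes(labeled, unlabeled)
print(posterior("free prize".split(), model))
print(posterior("project report".split(), model))
```

In this toy run, words such as "prize" and "project" never occur in the labeled e-mails; the unlabeled e-mails are what let the classifier associate them with a class, which is the effect the abstract attributes to the unlabeled data.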
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).