Spam Filtering Based on PLS Feature Extraction
Cite this article: WANG Peng-ming, WU Shui-xiu, WANG Ming-wen, HUANG Guo-bin. Spam Filtering Based on PLS Feature Extraction[J]. Journal of Chinese Information Processing, 2008, 22(1): 74-79.
Authors: WANG Peng-ming, WU Shui-xiu, WANG Ming-wen, HUANG Guo-bin
Affiliation: School of Computer Information Engineering, Jiangxi Normal University, Nanchang, Jiangxi 330022, China
Funding: National Natural Science Foundation of China (60663007); Jiangxi Province Science and Technology Research Project (2006-184); Science and Technology Project of the Jiangxi Provincial Department of Education (2007-129)
Abstract: As spam becomes a growing nuisance for network users, research on spam filtering is increasingly important. E-mail data are extremely sparse, high-dimensional, and multicollinear, which sets spam filtering apart from general text classification. This paper proposes a feature extraction method based on partial least squares (PLS). The method forms linear combinations of the original features to extract latent semantic features that reflect both the content of a message and its class, and it copes with multicollinearity. Experiments on the Enron-Spam dataset show that, compared with χ2 feature selection, the method achieves good filtering performance at lower feature dimensions.

Keywords: computer application; Chinese information processing; spam filtering; partial least squares; feature extraction
Article ID: 1003-0077(2008)01-0074-06
Received: 2007-05-27
Revised: 2007-12-06
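
The abstract states only that the method extracts latent features as linear combinations of the original features, guided by the class label. The sketch below illustrates that general idea with single-response PLS (PLS1, computed by NIPALS-style deflation) on a synthetic term-count matrix. It is an assumption-laden illustration, not the authors' implementation: the function names `pls_fit` and `pls_transform`, the toy data, and the choice of two components are all hypothetical.

```python
import numpy as np

def pls_fit(X, y, n_components):
    """Fit PLS1 by deflation; return the feature means and a projection matrix R."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    Xk = X - x_mean              # centered features, deflated in each step
    yk = y - y.mean()            # centered labels, deflated in each step
    W, P = [], []
    for _ in range(n_components):
        w = Xk.T @ yk            # direction of maximal covariance with the label
        norm = np.linalg.norm(w)
        if norm < 1e-12:         # no label-related variation left
            break
        w = w / norm
        t = Xk @ w               # latent score: a linear combination of features
        tt = t @ t
        p = Xk.T @ t / tt        # feature loadings
        q = yk @ t / tt          # label loading
        Xk = Xk - np.outer(t, p) # remove what this component explains
        yk = yk - q * t
        W.append(w)
        P.append(p)
    W = np.array(W).T
    P = np.array(P).T
    # R maps centered original features directly to the latent scores.
    R = W @ np.linalg.inv(P.T @ W)
    return x_mean, R

def pls_transform(X, x_mean, R):
    """Project documents onto the extracted latent features."""
    return (np.asarray(X, dtype=float) - x_mean) @ R

# Hypothetical toy data: 40 messages, 30 sparse term counts, first 20 are spam.
rng = np.random.default_rng(0)
n_docs, n_terms = 40, 30
y = np.array([1.0] * 20 + [0.0] * 20)
X = rng.poisson(0.3, size=(n_docs, n_terms)).astype(float)
X[:20, :5] += rng.poisson(3.0, size=(20, 5))  # five correlated "spam" terms

x_mean, R = pls_fit(X, y, n_components=2)
T = pls_transform(X, x_mean, R)   # 30 term features reduced to 2 latent features
```

The columns of `T` are mutually orthogonal on the training data, which is one way to see why this kind of extraction sidesteps multicollinearity among the original term features. A classifier would then be trained on `T` rather than on `X`; which classifier the paper used is not stated in the abstract.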
This article is indexed in CNKI, VIP (维普), Wanfang Data (万方数据), and other databases.