1.
Clotilde Lopes, Paulo Cortez, Pedro Sousa, Miguel Rocha, Miguel Rio 《Expert systems with applications》2011,38(8):9365-9372
This paper presents a novel spam filtering technique called Symbiotic Filtering (SF) that aggregates distinct local filters from several users to improve the overall performance of spam detection. SF is a hybrid approach combining some features from both Collaborative (CF) and Content-Based Filtering (CBF). It allows for the use of social networks to personalize and tailor the set of filters that serve as input to the filtering. A comparison is performed against the commonly used Naive Bayes CBF algorithm. Several experiments were held with the well-known Enron data, under both fixed and incremental symbiotic groups. We show that our system is competitive in performance and is robust against both dictionary and focused contamination attacks. Moreover, it can be implemented and deployed with little effort and low communication costs, while assuring privacy.
2.
Email spam causes a serious waste of time and resources. This paper addresses the email spam filtering problem and proposes an online active multi-field learning approach, which is based on the following ideas: (1) Email spam filtering is an online application, which suggests an online learning idea; (2) An email document has a multi-field text structure, which suggests a multi-field learning idea; and (3) It is costly to obtain a label for a real-world email spam filter, which suggests an active learning idea. The online learner regards email spam filtering as an incremental supervised binary streaming text classification task. The multi-field learner combines multiple results predicted by field classifiers in a novel compound weight schema, and each field classifier calculates the arithmetic average of multiple conditional probabilities calculated from feature strings according to a string-frequency index data structure. By comparing the current variance of field classifying results with the historical variance, the active learner evaluates the classifying confidence and takes the more uncertain email as the more informative sample for which to request a label. The experimental results show that the proposed approach achieves state-of-the-art performance with greatly reduced label requirements and very low space-time costs. The performance of our online active multi-field learning, measured by the standard (1-ROCA)% metric, even exceeds the full feedback performance of some advanced individual text classification algorithms.
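The variance comparison described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: the class name VarianceActiveLearner, the running-mean definition of "historical variance", and the plain arithmetic mean used to combine field scores are all assumptions (the abstract mentions a compound weight schema whose details are not given here).

```python
from statistics import pvariance


class VarianceActiveLearner:
    """Hypothetical helper: requests a label when field classifiers
    disagree more than they have on average so far."""

    def __init__(self):
        self.history_sum = 0.0   # running sum of past variances
        self.history_count = 0   # number of emails seen so far

    def combine(self, field_scores):
        # Assumption: plain mean of per-field spam scores, standing in
        # for the compound weighting that the abstract only names.
        return sum(field_scores) / len(field_scores)

    def should_request_label(self, field_scores):
        # Disagreement between fields (e.g. header, subject, body scores).
        current = pvariance(field_scores)
        if self.history_count == 0:
            uncertain = True  # nothing to compare against yet
        else:
            historical = self.history_sum / self.history_count
            uncertain = current > historical
        # Update the history after the decision has been made.
        self.history_sum += current
        self.history_count += 1
        return uncertain
```

In this sketch an email whose field scores are all close together is treated as confidently classified and no label is requested, which is one way to reduce labeling cost in a streaming setting.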
3.
4.
Neural Computing and Applications - Email has become extremely popular among people nowadays. In fact, it has been reported to be the cheapest, popular and fastest means of communication in recent...
5.
Carlos Laorden, Igor Santos, Borja Sanz, Gonzalo Alvarez, Pablo G. Bringas 《Electronic Commerce Research and Applications》2012,11(3):290-298
Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms, and phishing. More than 86% of received e-mails are spam. Historical approaches to combating these messages, including simple techniques such as sender blacklisting or the use of e-mail signatures, are no longer completely reliable. Many current solutions feature machine-learning algorithms trained using statistical representations of the terms that most commonly appear in such e-mails. However, these methods are merely syntactic and are unable to account for the underlying semantics of terms within messages. In this paper, we explore the use of semantics in spam filtering by introducing a pre-processing step of Word Sense Disambiguation (WSD). Based upon this disambiguated representation, we apply several well-known machine-learning models and show that the proposed method can detect the internal semantics of spam messages.
6.
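Starting from the current state of image spam, this paper analyzes image spam, compares the differences between image spam and text spam, and summarizes the characteristics of image spam. It also introduces filtering methods commonly used for image spam, such as OCR (optical character recognition) and fingerprinting techniques, analyzes their advantages and disadvantages, and offers constructive suggestions that address their shortcomings. Finally, it briefly describes the latest anti-spam research results and gives an outlook on the future development of spam.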
7.
8.
Unsolicited or spam email has recently become a major threat that can negatively impact the usability of electronic mail. Spam substantially wastes time and money for business users and network administrators, consumes network bandwidth and storage space, and slows down email servers. In addition, it provides a medium for distributing harmful code and/or offensive content. In this paper, we explore the application of the GMDH (Group Method of Data Handling) based inductive learning approach in detecting spam messages by automatically identifying content features that effectively distinguish spam from legitimate emails. We study the performance for various network model complexities using spambase, a publicly available benchmark dataset. Results reveal that classification accuracies of 91.7% can be achieved using only 10 out of the available 57 attributes, selected through abductive learning as the most effective feature subset (i.e. 82.5% data reduction). We also show how to improve classification performance using abductive network ensembles (committees) trained on different subsets of the training data. Comparison with other techniques such as neural networks and naïve Bayesian classifiers shows that the GMDH-based learning approach can provide better spam detection accuracy with false-positive rates as low as 4.3% and yet requires shorter training time.
9.
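Research and Implementation of a Spam Filtering System  Total citations: 2, self-citations: 0, citations by others: 2
曾小宁 (Zeng Xiaoning) 《计算机工程与设计》 (Computer Engineering and Design) 2009,30(15)
Email has become an indispensable part of modern communication, but the growing flood of spam poses a serious threat to computer system security and to people's work and daily life, making anti-spam an important task. Building on traditional blacklist and whitelist filtering, this paper introduces an IP reputation scoring mechanism and combines it with rule-based filtering and content-based Bayesian filtering to build a multi-layer spam filtering system model. Feedback learning is also applied in the system to compensate for losses caused by misclassification and to improve the system's accuracy. Practical testing shows that the system is suitable for use on user terminals and is highly feasible.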
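A layered design of this kind might be sketched as below. Everything here is illustrative and assumed rather than taken from the paper: the class name LayeredSpamFilter, the order of the layers, the thresholds, the neutral default reputation of 0.5, and the fixed-step feedback update are all hypothetical, and the Bayesian layer is passed in as a callable rather than implemented.

```python
from dataclasses import dataclass, field


@dataclass
class LayeredSpamFilter:
    """Hypothetical multi-layer filter: lists, IP reputation, rules, Bayes."""
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)
    ip_reputation: dict = field(default_factory=dict)  # IP -> score in [0, 1], higher is better
    rules: list = field(default_factory=list)          # callables: message text -> bool
    reputation_threshold: float = 0.3                  # assumed cut-off
    bayes_threshold: float = 0.5                       # assumed cut-off

    def classify(self, sender, ip, text, bayes_score):
        # Layer 1: whitelist and blacklist decide immediately.
        if sender in self.whitelist:
            return "ham"
        if sender in self.blacklist:
            return "spam"
        # Layer 2: reject senders whose IP reputation is too low.
        if self.ip_reputation.get(ip, 0.5) < self.reputation_threshold:
            return "spam"
        # Layer 3: hand-written rules.
        if any(rule(text) for rule in self.rules):
            return "spam"
        # Layer 4: content-based score from a caller-supplied Bayesian model.
        return "spam" if bayes_score(text) > self.bayes_threshold else "ham"

    def feedback(self, ip, was_spam, step=0.1):
        # Feedback learning (assumed form): nudge the IP's reputation
        # down for confirmed spam and up for confirmed legitimate mail.
        score = self.ip_reputation.get(ip, 0.5)
        score += -step if was_spam else step
        self.ip_reputation[ip] = min(1.0, max(0.0, score))
```

Putting the cheap checks first means most messages never reach the content-based layer, which is one plausible reason for a layered arrangement on a user terminal.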
10.
11.
Spammers often embed text into images in order to avoid filtering by text-based spam filters, which results in a large number of advertisement spam images. Spam image recognition has become one of the hotspots in the field of Internet spam filtering research. Its goal is to solve the problem that traditional spam information filtering methods encounter a sharp performance decline or even failure when filtering spam image information. Based on the clustering algorithm, this paper proposes a method to expand the data samples, which greatly improves the number of high-quality training samples and meets the needs of model training. Then, we train a convolutional neural network using the enlarged data samples to recognize the SPAM in real time. The experimental results show that the accuracy of the model is increased by more than 14% after using the method of data augmentation. The accuracy of the model can be improved by 6% compared with other methods of data augmentation. Combined with convolutional neural networks and the proposed method of data augmentation, the accuracy of our SPAM filtering model is 7–11% higher than that of the traditional method.
12.
Hans Henseler 《Artificial Intelligence and Law》2010,18(4):413-430
The information overload in E-Discovery proceedings makes reviewing expensive and it increases the risk of failure to produce results on time and consistently. New interactive techniques have been introduced to increase reviewer productivity. In contrast, the techniques presented in this article propose an alternative method that tries to reduce information during culling so that less information needs to be reviewed. The proposed method first focuses on mapping the email collection universe using straightforward statistical methods based on keyword filtering combined with date time and custodian identities. Subsequently, a social network is constructed from the email collection that is analyzed by filtering on date time and keywords. By using the network context we expect to provide a better understanding of the keyword hits and the ability to discard certain parts of the collection.
13.
A suffix tree approach to anti-spam email filtering  Total citations: 1, self-citations: 0, citations by others: 1
We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using
the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the
character level representation of emails and classes facilitated by the suffix tree can significantly improve classification
accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the
classification of documents in other domains.
Editor: Tom Fawcett
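The idea of scoring an email against character-level class models can be illustrated with a much simplified stand-in. The sketch below uses a depth-limited suffix trie with frequency counts rather than a true suffix tree, and its scoring function (sum of relative frequencies along each matched path) is an assumption chosen for brevity, not one of the scoring or normalisation functions tested in the paper. The names SuffixTrie and classify are hypothetical.

```python
class SuffixTrie:
    """Depth-limited suffix trie with counts (simplified stand-in for a suffix tree)."""

    def __init__(self, max_depth=6):
        self.max_depth = max_depth
        self.root = {}   # char -> [count, children dict]
        self.total = 0   # number of suffixes inserted

    def add(self, text):
        # Insert every suffix of the text, truncated to max_depth characters.
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                entry = node.setdefault(ch, [0, {}])
                entry[0] += 1
                node = entry[1]
            self.total += 1

    def score(self, text):
        # Assumed scoring: for each position, follow the longest matching
        # path and add the relative frequency of every node visited.
        if self.total == 0:
            return 0.0
        total_score = 0.0
        for start in range(len(text)):
            node = self.root
            for ch in text[start:start + self.max_depth]:
                entry = node.get(ch)
                if entry is None:
                    break
                total_score += entry[0] / self.total
                node = entry[1]
        return total_score


def classify(text, spam_trie, ham_trie):
    # The class whose trie matches the text more strongly wins.
    return "spam" if spam_trie.score(text) > ham_trie.score(text) else "ham"
```

Because matching is done on characters rather than tokens, obfuscated spellings that share long substrings with known spam still accumulate score, which is the property the abstract attributes to the character-level representation.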
14.
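In spam filtering, feature words contribute differently to classifying legitimate email and spam. Taking this into account, this paper defines a classification contribution ratio coefficient and applies the idea of feature-word classification contribution to feature selection and to the design of a naive Bayes filter. Experiments on an English corpus show that the spam filtering method using feature-word classification contribution effectively improves the filter's ability to recognize legitimate email and spam and lowers its misclassification rate for both.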
15.
16.
17.
Electronic mail is a major revolution taking place over traditional communication systems due to its convenient, economical,
fast, and easy to use nature. A major bottleneck in electronic communications is the enormous dissemination of unwanted, harmful
emails known as spam emails. A major concern is the development of suitable filters that can adequately capture those emails and achieve a high performance
rate. Machine learning (ML) researchers have developed many approaches in order to tackle this problem. Within the context
of machine learning, support vector machines (SVM) have made a large contribution to the development of spam email filtering.
Based on SVM, different schemes have been proposed through text classification approaches (TC). A crucial problem when using
SVM is the choice of kernels as they directly affect the separation of emails in the feature space. This paper presents a thorough
investigation of several distance-based kernels and specifies spam filtering behaviors using SVM. The majority of kernels used
in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose
the use of various string kernels for spam filtering. We show how effectively string kernels suit the spam filtering problem.
On the other hand, data preprocessing is a vital part of text classification where the objective is to generate feature vectors
usable by SVM kernels. We detail feature mapping variants in TC that yield improved performance for the standard SVM in
the filtering task. Furthermore, to cope with real-time scenarios we propose an online active framework for spam filtering. We present
empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in
real time. We show that the active online method using string kernels achieves higher precision and recall rates.
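As a concrete example of what a string kernel computes, the sketch below implements a p-spectrum kernel, which counts shared substrings of length p. It is chosen here only because it is one of the simplest string kernels; the abstract does not say which string kernels the paper evaluates, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def spectrum_features(text, p=3):
    # Count every contiguous substring of length p (empty if text is shorter than p).
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(a, b, p=3):
    # Inner product of the two substring-count vectors.
    fa, fb = spectrum_features(a, p), spectrum_features(b, p)
    if len(fa) > len(fb):
        fa, fb = fb, fa  # iterate over the smaller feature map
    return sum(count * fb[sub] for sub, count in fa.items())


def normalized_kernel(a, b, p=3):
    # Cosine-style normalisation so that long emails do not dominate.
    denom = sqrt(spectrum_kernel(a, a, p) * spectrum_kernel(b, b, p))
    return spectrum_kernel(a, b, p) / denom if denom else 0.0
```

A Gram matrix built from such a function can be supplied to an SVM implementation that accepts precomputed kernels, so the text never has to be converted into a fixed-length numeric vector first.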
18.
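This paper proposes a new two-layer spam filtering method. Based on mechanisms such as immune learning, immune memory, and immune recognition, the method has a degree of adaptability and diversity and makes full use of the features of both spam and non-spam email, thereby reducing the risk of misclassifying non-spam email. Experimental results show that the two-layer filtering method effectively lowers the false-alarm rate (the proportion of non-spam emails misclassified as spam).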
19.
Igor Santos, Carlos Laorden, Borja Sanz, Pablo G. Bringas 《Expert systems with applications》2012,39(1):437-444
Spam has become a major issue in computer security because it is a channel for threats such as computer viruses, worms and phishing. More than 85% of received e-mails are spam. Historical approaches to combat these messages, including simple techniques such as sender blacklisting or the use of e-mail signatures, are no longer completely reliable. Currently, many solutions feature machine-learning algorithms trained using statistical representations of the terms that usually appear in the e-mails. Still, these methods are merely syntactic and are unable to account for the underlying semantics of terms within the messages. In this paper, we explore the use of semantics in spam filtering by representing e-mails with a recently introduced Information Retrieval model: the enhanced Topic-based Vector Space Model (eTVSM). This model is capable of representing linguistic phenomena using a semantic ontology. Based upon this representation, we apply several well-known machine-learning models and show that the proposed method can detect the internal semantics of spam messages.
20.
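Spammers often embed text in images, producing large volumes of image spam. To address this problem, this paper proposes and implements an image spam filtering scheme based on screenshot content. First, the user crops a sub-region image from a spam email; each screenshot corresponds to one class of spam image, and all the screenshots together form a user-defined "blacklist" of spam images. Second, for each incoming image email, the embedded images are matched against the images in the "blacklist". Finally, if a match exists, the email is judged to contain spam image information that the user has already specified. The scheme was applied to a small email sending and receiving system and tested with 3,534 spam images; the results show that the filtering scheme is effective.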
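The three steps in this abstract (build a blacklist of cropped patches, match each incoming image against it, flag the email on a hit) can be sketched with a deliberately naive matcher. The abstract does not state which image matching algorithm is used, so the exhaustive pixel comparison below, the grayscale list-of-lists image format, and the names contains_patch and is_blacklisted are all assumptions for illustration only.

```python
def contains_patch(image, patch, tolerance=0):
    """Return True if patch occurs somewhere inside image.

    Both arguments are 2D lists of grayscale values (an assumed format).
    Each pixel may differ by at most `tolerance`.
    """
    ih, iw = len(image), len(image[0])
    ph, pw = len(patch), len(patch[0])
    # Slide the patch over every position where it fits entirely.
    for top in range(ih - ph + 1):
        for left in range(iw - pw + 1):
            if all(
                abs(image[top + r][left + c] - patch[r][c]) <= tolerance
                for r in range(ph)
                for c in range(pw)
            ):
                return True
    return False


def is_blacklisted(image, blacklist, tolerance=0):
    # An email image is flagged if any user-selected patch matches it.
    return any(contains_patch(image, patch, tolerance) for patch in blacklist)
```

A practical system would likely need a matcher that tolerates scaling, compression noise, and deliberate perturbation; the tolerance parameter here only covers small per-pixel differences.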