首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
支持向量机在垃圾邮件过滤中能达到较高的分类准确率,实际应用中,将正常邮件误判为垃圾邮件会给用户造成更大的损失。该文提出一个基于代价敏感支持向量机的垃圾邮件过滤方案,通过为正类和负类训练样本设置不同的错误惩罚系数对分类器进行训练,在保证较高的垃圾邮件召回率的前提下,尽可能降低正常邮件的误判率(假阳性率)。实验结果表明,该方案能有效地提高过滤器的整体性能,更好地满足垃圾邮件过滤的实际要求。  相似文献   

2.
Spam is a big problem for email users. The battle between spamming and anti-spamming technologies has been going on for many years. Though many advanced anti-spamming technologies are progressing significantly, spam is still able to bombard many email users. The problem worsens when some anti-spamming methods unintentionally filtered legitimate emails instead! In this paper, we first review existing anti-spam technologies, then propose a layered defense framework using a combination of anti-spamming methods. Under this framework, the server-level defense is targeted for common spam while the client-level defense further filters specific spam for individual users. This layered structure improves on filtering accuracy and yet reduces the number of false positives. A sub-system using our pre-challenge method is implemented as an add-on in Microsoft Outlook 2002. In addition, we extend our client-based pre-challenge method to a domain-based solution thus further reducing the individual email users' overheads.  相似文献   

3.
基于粗糙集的加权朴素贝叶斯邮件过滤方法   总被引:5,自引:3,他引:2  
邮件过滤中有两个关键问题,一是如何选择有效的邮件特征集,二是设计较好的邮件过滤算法。在对邮件特性进行分析的基础上,综合邮件头及邮件内容的主要形象特征给出了一种新的邮件特征集提取方法。用粗糙集的信息观点度量了各属性的重要性,并以此为权重进行加权朴素贝叶斯垃圾邮件过滤,有效地解决了朴素贝叶斯分类中的条件依赖性问题。通过在中英文邮件集上的测试实验,证明了所提出的邮件过滤方法的有效性。  相似文献   

4.
Email classification and prioritization expert systems have the potential to automatically group emails and users as communities based on their communication patterns, which is one of the most tedious tasks. The exchange of emails among users along with the time and content information determine the pattern of communication. The intelligent systems extract these patterns from an email corpus of single or all users and are limited to statistical analysis. However, the email information revealed in those methods is either constricted or widespread, i.e. single or all users respectively, which limits the usability of the resultant communities. In contrast to extreme views of the email information, we relax the aforementioned restrictions by considering a subset of all users as multi-user information in an incremental way to extend the personalization concept. Accordingly, we propose a multi-user personalized email community detection method to discover the groupings of email users based on their structural and semantic intimacy. We construct a social graph using multi-user personalized emails. Subsequently, the social graph is uniquely leveraged with expedient attributes, such as semantics, to identify user communities through collaborative similarity measure. The multi-user personalized communities, which are evaluated through different quality measures, enable the email systems to filter spam or malicious emails and suggest contacts while composing emails. The experimental results over two randomly selected users from email network, as constrained information, unveil partial interaction among 80% email users with 14% search space reduction where we notice 25% improvement in the clustering coefficient.  相似文献   

5.
Highly discriminative statistical features for email classification   总被引:2,自引:2,他引:0  
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first one is a classic classification scenario using a 10-fold cross-validation technique for several corpora, including four ground truth standard corpora: Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus, and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the frame of email classification on different benchmarking corpora, and the evidence that especially the technique of biased discriminant analysis offers better discriminative features for the classification, gives stable classification results notwithstanding the amount of features chosen, and robustly retains their discriminative value over time and data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.  相似文献   

6.
垃圾邮件的处理是电子邮件服务中非常重要的功能,该文在对标准邮件集表示为向量空间模型,降维处理处理工作的基础上,运用神经网络集成的方法来构造邮件分类器,对邮件进行过滤;该方法在垃圾邮件语料库上进行了实验,实验证明该方法对于垃圾邮件的过滤有较好的效果。  相似文献   

7.
介绍了一种工作在Linux环境下、基于多层架构邮件过滤系统的实现,通过在邮件传递的不同层次提供过滤,有效地节制垃圾邮件和病毒邮件的泛滥.  相似文献   

8.
Electronic mail is a major revolution taking place over traditional communication systems due to its convenient, economical, fast, and easy to use nature. A major bottleneck in electronic communications is the enormous dissemination of unwanted, harmful emails known as spam emails. A major concern is the developing of suitable filters that can adequately capture those emails and achieve high performance rate. Machine learning (ML) researchers have developed many approaches in order to tackle this problem. Within the context of machine learning, support vector machines (SVM) have made a large contribution to the development of spam email filtering. Based on SVM, different schemes have been proposed through text classification approaches (TC). A crucial problem when using SVM is the choice of kernels as they directly affect the separation of emails in the feature space. This paper presents thorough investigation of several distance-based kernels and specify spam filtering behaviors using SVM. The majority of used kernels in recent studies concern continuous data and neglect the structure of the text. In contrast to classical kernels, we propose the use of various string kernels for spam filtering. We show how effectively string kernels suit spam filtering problem. On the other hand, data preprocessing is a vital part of text classification where the objective is to generate feature vectors usable by SVM kernels. We detail a feature mapping variants in TC that yield improved performance for the standard SVM in filtering task. Furthermore, to cope for realtime scenarios we propose an online active framework for spam filtering. We present empirical results from an extensive study of online, transductive, and online active methods for classifying spam emails in real time. We show that active online method using string kernels achieves higher precision and recall rates.  相似文献   

9.
Bo Yu  Zong-ben Xu   《Knowledge》2008,21(4):355-362
The growth of email users has resulted in the dramatic increasing of the spam emails during the past few years. In this paper, four machine learning algorithms, which are Naïve Bayesian (NB), neural network (NN), support vector machine (SVM) and relevance vector machine (RVM), are proposed for spam classification. An empirical evaluation for them on the benchmark spam filtering corpora is presented. The experiments are performed based on different training set size and extracted feature size. Experimental results show that NN classifier is unsuitable for using alone as a spam rejection tool. Generally, the performances of SVM and RVM classifiers are obviously superior to NB classifier. Compared with SVM, RVM is shown to provide the similar classification result with less relevance vectors and much faster testing time. Despite the slower learning procedure, RVM is more suitable than SVM for spam classification in terms of the applications that require low complexity.  相似文献   

10.
The annoyance of spam emails increasingly plagues both individuals and organizations. In response, most of prior research investigates spam filtering as a classical text categorization task, in which training examples must include both spam (positive examples) and legitimate (negative examples) emails. However, in many spam filtering scenarios, obtaining legitimate emails for training purpose can be more difficult than collecting spam and unclassified emails. Hence, it is more appropriate to construct a classification model for spam filtering that uses positive training examples (i.e., spam) and unlabeled instances only and does not require legitimate emails as negative training examples. Several single-class learning techniques, such as PNB and PEBL, have been proposed in the literature. However, they incur inherent limitations with regard to spam filtering. In this study, we propose and develop an ensemble approach, referred to as E2, to address these limitations. Specifically, we follow the two-stage framework of PEBL but extend each stage with an ensemble strategy. The empirical evaluation results from two spam filtering corpora suggest that our proposed E2 technique generally outperforms benchmark techniques (i.e., PNB and PEBL) and exhibits more stable performance than its counterparts.  相似文献   

11.
The Internet has been flooded with spam emails, and during the last decade there has been an increasing demand for reliable anti-spam email filters. The problem of filtering emails can be considered as a classification problem in the field of supervised learning. Theoretically, many mature technologies, for example, support vector machines (SVM), can be used to solve this problem. However, in real enterprise applications, the training data are typically collected via honeypots and thus are always of huge amounts and highly biased towards spam emails. This challenges both efficiency and effectiveness of conventional technologies. In this article, we propose an undersampling method to compress and balance the training set used for the conventional SVM classifier with minimal information loss. The key observation is that we can make a trade-off between training set size and information loss by carefully defining a similarity measure between data samples. Our experiments show that the SVM classifier provides a better performance by applying our compressing and balancing approach.  相似文献   

12.
A new technique for managing and disseminating Web-based email prefetches messages and generates dynamic pages, displaying them at the network edge. Compared to other popular Web-based email servers, the prefetching and caching emails (PACE) prototype shows an improved performance with respect to user-perceived latency. Additionally, PACE'S centralized neural-network-based personalized spam filter will filter spam and viruses at the server's origin, thus saving bandwidth. Another major concern for users is the email accounts being clogged with spam. Spam filters can be classified as server-side or client-side. Server-side filters are integrated with email servers and filter out spam at the server end.  相似文献   

13.
Artificial immune system inspired behavior-based anti-spam filter   总被引:2,自引:1,他引:1  
This paper proposes a novel behavior-based anti-spam technology for email service based on an artificial immune-inspired clustering algorithm. The suggested method is capable of continuously delivering the most relevant spam emails from the collection of all spam emails that are reported by the members of the network. Mail servers could implement the anti-spam technology by using the “black lists” that have been already recognized. Two main concepts are introduced, which defines the behavior-based characteristics of spam and to continuously identify the similar groups of spam when processing the spam streams. Experiment results using real-world datasets reveal that the proposed technology is reliable, efficient and scalable. Since no single technology can achieve one hundred percent spam detection with zero false positives, the proposed method may be used in conjunction with other filtering systems to minimize errors.  相似文献   

14.
An Operable Email Based Intelligent Personal Assistant   总被引:1,自引:0,他引:1  
The recent phenomena of email-function-overloading and email-centricness in daily life and business have created new problems to users. There is a practical need for developing a software assistant to facilitate the management of personal and organizational emails, and to enable users to complete their email-centric jobs or tasks smoothly. This paper presents the status, goals, and key technical elements of an Email-Centric Intelligent Personal Assistant, called ECIPA. ECIPA provides various assisting functions, including automated and cost-sensitive spam filtering based on corresponding analysis, ontology-mediated email classification, query and archiving. ECIPA can learn from dynamic user behaviors to effectively sort and automatically respond email. Techniques developed in Web Intelligence (WI) are adopted to implement ECIPA. In order to facilitate cooperation of ECIPAs of different users, the concept of operable email, an extension of traditional email with an operable form, is introduced. ECIPA can in fact be viewed as a family of collaborative agents working together on the operable email.  相似文献   

15.
一种无尺度网络上垃圾邮件蠕虫的传播模型   总被引:2,自引:0,他引:2  
用有向图描述了电子邮件网络的结构,并分析了电子邮件网络的无尺度特性。在此基础上,通过用户检查邮件的频率和打开邮件附件的概率建立了一种电子邮件蠕虫的传播模型。分别仿真了电子邮件蠕虫在无尺度网络和随机网络中的传播,结果表明,邮件蠕虫在无尺度网络中的传播速度比在随机网络中更快,与理论分析相一致。  相似文献   

16.
贝叶斯过滤算法和费舍尔过滤算法均是利用统计学知识对于垃圾邮件进行过滤的算法,有着良好的过滤效果。该文设计将某一词组(单词)出现概率使用加权计算的方法,改善了朴素贝叶斯算法和朴素费舍尔的邮件过滤算法对于出现较少的单词误判情况,使系统对于垃圾邮件判断的准确率上升。设计可以使用个性化的垃圾邮件过滤方案,支持使用邮件下载协议(POP3、IMAP协议)从邮件服务器下载邮件,以及使用邮件解析协议(MIME协议)对于邮件进行解析,支持邮件发送协议(SMTP协议)帮助用户发送邮件。  相似文献   

17.
With the incremental use of emails as an essential and popular communication mean over the Internet, there comes a serious threat that impacts the Internet and the society. This problem is known as spam. By receiving spam messages, Internet users are exposed to security issues, and minors are exposed to inappropriate contents. Moreover, spam messages waste resources in terms of storage, bandwidth, and productivity. What makes the problem worse is that spammers keep inventing new techniques to dodge spam filters. On the other side, the massive data flow of hundreds of millions of individuals, and the large number of attributes make the problem more cumbersome and complex. Therefore, proposing evolutionary and adaptable spam detection models becomes a necessity. In this paper, an intelligent detection system that is based on Genetic Algorithm (GA) and Random Weight Network (RWN) is proposed to deal with email spam detection tasks. In addition, an automatic identification capability is also embedded in the proposed system to detect the most relevant features during the detection process. The proposed system is intensively evaluated through a series of extensive experiments based on three email corpora. The experimental results confirm that the proposed system can achieve remarkable results in terms of accuracy, precision, and recall. Furthermore, the proposed detection system can automatically identify the most relevant features of the spam emails.  相似文献   

18.
Spam as unsolicited email has certainly crossed the border of just being bothersome. In 2003, it surpassed legitimate email — growing to more than 50% of all Internet emails. Annually, it causes economic harms of several billion Euros. Fighting spam, beside legal approaches especially technical means are deployed in practical systems, mainly focussing on blocking and filtering mechanisms. This article introduces into the spam field and describes, assesses, and classifies the currently most important approaches against spam.  相似文献   

19.
Email has become one of the fastest and most economical forms of communication. Email is also one of the most ubiquitous and pervasive applications used on a daily basis by millions of people worldwide. However, the increase in email users has resulted in a dramatic increase in spam emails during the past few years. This paper proposes a new spam filtering system using revised back propagation (RBP) neural network and automatic thesaurus construction. The conventional back propagation (BP) neural network has slow learning speed and is prone to trap into a local minimum, so it will lead to poor performance and efficiency. The authors present in this paper the RBP neural network to overcome the limitations of the conventional BP neural network. A well constructed thesaurus has been recognized as a valuable tool in the effective operation of text classification, it can also overcome the problems in keyword-based spam filters which ignore the relationship between words. The authors conduct the experiments on Ling-Spam corpus. Experimental results show that the proposed spam filtering system is able to achieve higher performance, especially for the combination of RBP neural network and automatic thesaurus construction.  相似文献   

20.
Unsolicited or spam email has recently become a major threat that can negatively impact the usability of electronic mail. Spam substantially wastes time and money for business users and network administrators, consumes network bandwidth and storage space, and slows down email servers. In addition, it provides a medium for distributing harmful code and/or offensive content. In this paper, we explore the application of the GMDH (Group Method of Data Handling) based inductive learning approach in detecting spam messages by automatically identifying content features that effectively distinguish spam from legitimate emails. We study the performance for various network model complexities using spambase, a publicly available benchmark dataset. Results reveal that classification accuracies of 91.7% can be achieved using only 10 out of the available 57 attributes, selected through abductive learning as the most effective feature subset (i.e. 82.5% data reduction). We also show how to improve classification performance using abductive network ensembles (committees) trained on different subsets of the training data. Comparison with other techniques such as neural networks and naïve Bayesian classifiers shows that the GMDH-based learning approach can provide better spam detection accuracy with false-positive rates as low as 4.3% and yet requires shorter training time.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号