Similar literature
1.
Email spam is one of the biggest threats to today’s Internet. To deal with this threat, there are long-established measures such as supervised anti-spam filters. In this paper, we report the development and evaluation of sentinel, an anti-spam filter based on natural language and stylometry attributes. The performance of the filter is evaluated not only on non-personalized emails (i.e., emails collected randomly) but also on personalized emails (i.e., emails collected from particular individuals). The non-personalized datasets include CSDMC2010, SpamAssassin, and LingSpam, while the Enron-Spam collection comprises personalized emails. The proposed filter extracts natural language attributes from email text that are closely related to writer stylometry and generates classifiers using multiple learning algorithms. Experimental outcomes show that classifiers generated by meta-learning algorithms such as AdaBoostM1 and bagging are the best, performing equally well and surpassing the performance of a number of filters proposed in previous studies, while a random forest generated classifier is a close second. On the other hand, the performance of classifiers using support vector machine and Naïve Bayes is not satisfactory. In addition, we find much improved results on personalized emails and mixed results on non-personalized emails.

2.
Bo Yu  Zong-ben Xu   《Knowledge》2008,21(4):355-362
The growth in the number of email users has resulted in a dramatic increase in spam emails during the past few years. In this paper, four machine learning algorithms, namely Naïve Bayesian (NB), neural network (NN), support vector machine (SVM) and relevance vector machine (RVM), are proposed for spam classification. An empirical evaluation of them on benchmark spam filtering corpora is presented. The experiments are performed with different training set sizes and extracted feature sizes. Experimental results show that the NN classifier is unsuitable for use alone as a spam rejection tool. Generally, the performance of the SVM and RVM classifiers is clearly superior to that of the NB classifier. Compared with SVM, RVM is shown to provide similar classification results with fewer relevance vectors and much faster testing time. Despite its slower learning procedure, RVM is more suitable than SVM for spam classification in applications that require low complexity.

3.
Without imposing restrictions, many enterprises find non-work-related content consuming network resources. Business communication over email thus incurs undesired delays and inflicts damage on businesses, which explains why many enterprises are concerned with competition for the use of email services. Obviously, enterprises should prioritize business emails over personal ones in their email service. Therefore, previous works present content-based classification methods to categorize enterprise emails into business or personal correspondence. The accuracy of these methods is largely determined by their ability to survey as much information as possible. However, in addition to decreasing the performance of these methods, monitoring the details of email contents may violate privacy rights that are under legal protection, so a careful balance is required between accurately classifying enterprise emails and protecting privacy rights. The proposed email classification method is thus based on social features rather than a survey of email contents. Social-based metrics are designed to characterize emails as social features; the obtained features are treated as input to machine learning-based classifiers for email classification. Experimental results demonstrate the high accuracy of the proposed method in classifying emails. In contrast with content-based methods that examine email contents, the emphasis on social features in the proposed method is a promising alternative for solving similar email classification problems.

4.
We consider the well-studied pattern recognition problem of designing linear classifiers. When dealing with normally distributed classes, it is well known that the optimal Bayes classifier is linear only when the covariance matrices are equal. This was the only known condition for classifier linearity. In a previous work, we presented the theoretical framework for optimal pairwise linear classifiers for two-dimensional normally distributed random vectors. We derived the necessary and sufficient conditions that the distributions have to satisfy so as to yield the optimal linear classifier as a pair of straight lines. In this paper we extend the previous work to d-dimensional normally distributed random vectors. We provide the necessary and sufficient conditions needed so that the optimal Bayes classifier is a pair of hyperplanes. Various scenarios have been considered, including one which resolves the multi-dimensional Minsky's paradox for the perceptron. We have also provided some three-dimensional examples for all the cases, and tested the classification accuracy of the corresponding pairwise-linear classifier. In all the cases, these linear classifiers achieve very good performance. To demonstrate that the current pairwise-linear philosophy yields superior discriminants on real-life data, we have shown how linear classifiers determined using a maximum-likelihood estimate (MLE) applicable for this approach yield better accuracy than the discriminants obtained by the traditional Fisher's classifier on a real-life data set. The multi-dimensional generalization of the MLE for these classifiers is currently being investigated.
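
As a brief worked note on the equal-covariance case mentioned in this abstract, the standard two-class Gaussian discriminant can be written out as follows. The notation is an assumption introduced here and does not appear in the abstract: class means $\mu_1, \mu_2$, covariance matrices $\Sigma_1, \Sigma_2$, and priors $P(\omega_1), P(\omega_2)$.

```latex
\begin{align*}
g(x) &= \ln\frac{p(x\mid\omega_1)\,P(\omega_1)}{p(x\mid\omega_2)\,P(\omega_2)} \\
     &= -\tfrac12 (x-\mu_1)^{\top}\Sigma_1^{-1}(x-\mu_1)
        +\tfrac12 (x-\mu_2)^{\top}\Sigma_2^{-1}(x-\mu_2)
        -\tfrac12 \ln\frac{|\Sigma_1|}{|\Sigma_2|}
        +\ln\frac{P(\omega_1)}{P(\omega_2)} . \\
\intertext{If $\Sigma_1=\Sigma_2=\Sigma$, the terms $x^{\top}\Sigma^{-1}x$ cancel and the log-determinant term vanishes:}
g(x) &= (\mu_1-\mu_2)^{\top}\Sigma^{-1}x
        -\tfrac12\left(\mu_1^{\top}\Sigma^{-1}\mu_1-\mu_2^{\top}\Sigma^{-1}\mu_2\right)
        +\ln\frac{P(\omega_1)}{P(\omega_2)} .
\end{align*}
```

The decision boundary $g(x)=0$ is then a single hyperplane. With unequal covariances the quadratic terms generally remain; the pairwise-linear conditions studied in the paper concern cases where that quadratic boundary degenerates into a pair of hyperplanes.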

5.
An important tool for heart disease diagnosis is the analysis of electrocardiogram (ECG) signals, given the non-invasive nature and simplicity of the ECG exam. Depending on the application, ECG data analysis consists of steps such as preprocessing, segmentation, feature extraction and classification, aiming to detect cardiac arrhythmias (i.e., cardiac rhythm abnormalities). Aiming to make the cardiac arrhythmia signal classification process fast and accurate, we apply and analyze a recent and robust supervised graph-based pattern recognition technique, the optimum-path forest (OPF) classifier. To the best of our knowledge, this is the first time that the OPF classifier has been used for the ECG heartbeat signal classification task. We then compare the performance (in terms of training and testing time, accuracy, specificity, and sensitivity) of the OPF classifier to that of three other well-known expert system classifiers, i.e., support vector machine (SVM), Bayesian and multilayer artificial neural network (MLP), using features extracted from six main approaches considered in the literature for ECG arrhythmia analysis. In our experiments, we use the MIT-BIH Arrhythmia Database and the evaluation protocol recommended by the Association for the Advancement of Medical Instrumentation. A discussion of the obtained results shows that the OPF classifier presents robust performance, i.e., there is no need for parameter setup, as well as high accuracy at an extremely low computational cost. Moreover, on average, the OPF classifier yielded better performance than the MLP and SVM classifiers in terms of classification time and accuracy, and produced performance quite similar to that of the Bayesian classifier, showing it to be a promising technique for ECG signal analysis.

6.
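Guanwei Zhou  Juan Cheng  Xijian Ping   《Computer Engineering》2007,33(15):199-201
How to classify emails effectively using the information in their body text and attachments is an important topic in email processing today. From the perspective of commercial applications, this paper proposes an intelligent email filtering and distribution method based on an image information measure and keywords. By processing email keyword information with a naive Bayes classifier and analyzing attachment images with a text-image detection theory based on normalized PIM, the method combines textual information such as the email body and addresses with attachment image information as evaluation parameters for classification, and effectively achieves intelligent email classification.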

7.
Data preprocessing techniques for classification without discrimination
Recently, the following Discrimination-Aware Classification Problem was introduced: Suppose we are given training data that exhibit unlawful discrimination, e.g., with respect to sensitive attributes such as gender or ethnicity. The task is to learn a classifier that optimizes accuracy, but does not have this discrimination in its predictions on test data. This problem is relevant in many settings, such as when the data are generated by a biased decision process or when the sensitive attribute serves as a proxy for unobserved features. In this paper, we concentrate on the case with only one binary sensitive attribute and a two-class classification problem. We first study the theoretically optimal trade-off between accuracy and non-discrimination for pure classifiers. Then, we look at algorithmic solutions that preprocess the data to remove discrimination before a classifier is learned. We survey and extend our existing data preprocessing techniques, namely suppression of the sensitive attribute, massaging the dataset by changing class labels, and reweighing or resampling the data to remove discrimination without relabeling instances. These preprocessing techniques have been implemented in a modified version of Weka and we present the results of experiments on real-life data.
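
The reweighing idea named in this abstract can be illustrated with a small sketch. The abstract does not give a formula, so the weighting rule below (expected joint frequency under independence divided by observed joint frequency) is an assumption about one common formulation, and the function names and toy data are hypothetical.

```python
from collections import Counter

def reweigh(sensitive, labels):
    """Return one weight per instance (assumed rule, not taken from the abstract).

    weight(s, c) = (count(s) * count(c)) / (n * count(s, c)),
    i.e. the joint frequency expected if the sensitive attribute and the
    class label were independent, divided by the observed joint frequency.
    """
    n = len(labels)
    count_s = Counter(sensitive)
    count_c = Counter(labels)
    count_sc = Counter(zip(sensitive, labels))
    return [
        (count_s[s] * count_c[c]) / (n * count_sc[(s, c)])
        for s, c in zip(sensitive, labels)
    ]

def weighted_positive_rate(group, sensitive, labels, weights):
    """Weighted share of positive labels (label == 1) within one group."""
    total = sum(w for s, w in zip(sensitive, weights) if s == group)
    positive = sum(
        w for s, c, w in zip(sensitive, labels, weights) if s == group and c == 1
    )
    return positive / total

# Hypothetical toy data: group "m" receives the positive label more often.
sensitive = ["f", "f", "f", "f", "m", "m", "m", "m"]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
weights = reweigh(sensitive, labels)

rate_f = weighted_positive_rate("f", sensitive, labels, weights)
rate_m = weighted_positive_rate("m", sensitive, labels, weights)
```

No label is changed in this sketch; under-represented combinations of group and label simply receive larger weights, so that a learner that accepts instance weights sees equal weighted positive rates in both groups.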

8.
Context‐based email classification requires understanding of the semantic and structural attributes of email. Most research has focused on generating semantic properties through structural components of email. By viewing emails as events (a major subset of the class of emails), a rich contextual test‐bed representation for understanding the semantic attributes of emails has been devised. Event‐based emails have traditionally been studied based on simple structural properties. In this paper, we present a novel approach that first represents this class of emails as graphs and then heuristically applies graph mining and matching algorithms to pick templates representing contextual and semantic attributes that help classify emails. The classification templates use three key event classes: social, personal and professional. Results show that our template‐based approach, supported by graph mining and matching, performs consistently well on an event email data set, with high accuracy.

9.
A suffix tree approach to anti-spam email filtering
We present an approach to email filtering based on the suffix tree data structure. A method for the scoring of emails using the suffix tree is developed and a number of scoring and score normalisation functions are tested. Our results show that the character level representation of emails and classes facilitated by the suffix tree can significantly improve classification accuracy when compared with the currently popular methods, such as naive Bayes. We believe the method can be extended to the classification of documents in other domains. Editor: Tom Fawcett

10.
Feature selection and feature weighting are useful techniques for improving the classification accuracy of the K-nearest-neighbor (K-NN) rule. The term feature selection refers to algorithms that select the best subset of the input feature set. In feature weighting, each feature is multiplied by a weight value proportional to the ability of the feature to distinguish pattern classes. In this paper, a novel hybrid approach is proposed for simultaneous feature selection and feature weighting of the K-NN rule based on the Tabu Search (TS) heuristic. The proposed TS heuristic in combination with the K-NN classifier is compared with several classifiers on various available data sets. The results indicate a significant improvement in classification accuracy. The proposed TS heuristic is also compared with various feature selection algorithms. The experiments revealed that the proposed hybrid TS heuristic is superior to both simple TS and sequential search algorithms. We also present results for the classification of prostate cancer using multispectral images, an important problem in biomedicine.
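
A minimal sketch of how feature weighting and feature selection interact with the K-NN rule, as described in this abstract: each feature difference is scaled by its weight before the distance is computed, and a weight of zero removes the feature entirely. The Tabu Search that would choose the weights is not implemented here; the weights, function name, and toy data are all hypothetical.

```python
import math
from collections import Counter

def weighted_knn_predict(train_X, train_y, query, weights, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    Distances are Euclidean after multiplying each feature by its weight;
    a weight of 0.0 drops that feature (feature selection).
    """
    def dist(a, b):
        return math.sqrt(
            sum((w * (ai - bi)) ** 2 for w, ai, bi in zip(weights, a, b))
        )

    nearest = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: feature 0 separates the classes, feature 1 is
# large-scale noise that dominates the unweighted distance.
train_X = [(1.0, 50.0), (1.2, -40.0), (0.9, 10.0),
           (3.0, 45.0), (3.1, -35.0), (2.9, 12.0)]
train_y = ["a", "a", "a", "b", "b", "b"]
query = (1.1, 44.0)

unweighted = weighted_knn_predict(train_X, train_y, query, weights=(1.0, 1.0))
weighted = weighted_knn_predict(train_X, train_y, query, weights=(1.0, 0.0))
```

On this toy data the unweighted rule is misled by the noisy second feature, while setting its weight to zero lets the informative first feature decide the vote.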

11.
Accurate grading of hepatocellular carcinoma (HCC) biopsy images is important for prognosis and treatment planning. In this paper, we propose an automatic system for grading HCC biopsy images. In preprocessing, we use a dual morphological grayscale reconstruction method to remove noise and accentuate nuclear shapes. A marker-controlled watershed transform is applied to obtain the initial contours of nuclei and a snake model is used to segment the shapes of nuclei smoothly and precisely. Fourteen features are then extracted based on six types of characteristics for HCC classification. Finally, we propose an SVM-based decision-graph classifier to classify HCC biopsy images. Experimental results show that a classification accuracy of 94.54% can be achieved using our SVM-based decision-graph classifier, while accuracies of 90.07% and 92.88% are achieved using k-NN and SVM classifiers, respectively.

12.
Unsolicited or spam email has recently become a major threat that can negatively impact the usability of electronic mail. Spam substantially wastes time and money for business users and network administrators, consumes network bandwidth and storage space, and slows down email servers. In addition, it provides a medium for distributing harmful code and/or offensive content. In this paper, we explore the application of the GMDH (Group Method of Data Handling) based inductive learning approach in detecting spam messages by automatically identifying content features that effectively distinguish spam from legitimate emails. We study the performance for various network model complexities using spambase, a publicly available benchmark dataset. Results reveal that classification accuracies of 91.7% can be achieved using only 10 out of the available 57 attributes, selected through abductive learning as the most effective feature subset (i.e. 82.5% data reduction). We also show how to improve classification performance using abductive network ensembles (committees) trained on different subsets of the training data. Comparison with other techniques such as neural networks and naïve Bayesian classifiers shows that the GMDH-based learning approach can provide better spam detection accuracy with false-positive rates as low as 4.3% and yet requires shorter training time.
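
The data-reduction figure quoted in this abstract follows directly from the attribute counts it gives. A short check, using only the numbers stated above (the variable names are illustrative):

```python
# Attribute counts as stated in the abstract.
total_attributes = 57
selected_attributes = 10

# Fraction of attributes discarded by the feature selection step.
reduction = 1 - selected_attributes / total_attributes
reduction_text = f"{reduction:.1%}"  # formatted to one decimal place
```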

13.
The Internet has been flooded with spam emails, and during the last decade there has been an increasing demand for reliable anti-spam email filters. The problem of filtering emails can be considered as a classification problem in the field of supervised learning. Theoretically, many mature technologies, for example, support vector machines (SVM), can be used to solve this problem. However, in real enterprise applications, the training data are typically collected via honeypots and are thus always huge in volume and highly biased towards spam emails. This challenges both the efficiency and the effectiveness of conventional technologies. In this article, we propose an undersampling method to compress and balance the training set used for the conventional SVM classifier with minimal information loss. The key observation is that we can make a trade-off between training set size and information loss by carefully defining a similarity measure between data samples. Our experiments show that the SVM classifier performs better when our compressing and balancing approach is applied.

14.
Spectral features of images, such as Gabor filters and the wavelet transform, can be used for texture image classification. That is, a classifier is trained on labeled texture features to classify unlabeled texture features of images into pre-defined classes. The aim of this paper is twofold. First, it investigates the classification performance of using Gabor filters, the wavelet transform, and their combination, respectively, as the texture feature representation of scenery images (such as mountain, castle, etc.). A k-nearest neighbor (k-NN) classifier and a support vector machine (SVM) are also compared. Second, three k-NN classifiers and three SVMs are combined, respectively, with each of the three combined classifiers using one of the above three texture feature representations, to see whether combining multiple classifiers can outperform a single classifier in scenery image classification. The results show that a single SVM using Gabor filters provides higher classification accuracy than the other two spectral features and than the combined three k-NN classifiers and three SVMs.

15.
Classification is the most used supervised machine learning method. As each of the many existing classification algorithms can perform poorly on some data, different attempts have arisen to improve the original algorithms by combining them. Some of the best-known results are produced by ensemble methods, like bagging or boosting. We developed a new ensemble method called allocation. The allocation method uses the allocator, an algorithm that separates the data instances based on anomaly detection and allocates them to one of the micro classifiers, built with the existing classification algorithms on a subset of training data. The outputs of the micro classifiers are then fused together into one final classification. Our goal was to improve the results of the original classifiers with this new allocation method and to compare the classification results with existing ensemble methods. The allocation method was tested on 30 benchmark datasets and was used with six well-known basic classification algorithms (J48, NaiveBayes, IBk, SMO, OneR and NBTree). The obtained results were compared to those of the basic classifiers as well as other ensemble methods (bagging, MultiBoost and AdaBoost). Results show that our allocation method is superior to the basic classifiers and also to the tested ensembles in classification accuracy and f-score. The statistical analysis conducted, when all of the classification algorithms used are considered, confirmed that our allocation method performs significantly better in both classification accuracy and f-score. Although the differences are not significant for each of the basic classifiers alone, the allocation method achieved the biggest improvements on all six basic classification algorithms. In this manner, the allocation method proved to be a competitive ensemble method for classification that can be used with various classification algorithms and can possibly outperform other ensembles on different types of data.

16.
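To detect spam among large volumes of email, an email classification method based on the Hadoop platform is proposed. Unlike traditional content-based spam detection, the method extracts behavioral features of email accounts by statistically analyzing email sending and receiving records on the MapReduce framework. A random forest classifier is then implemented in parallel on the MapReduce framework, and the classifier is trained on samples carrying these behavioral features and used to classify emails. Experimental results show that the Hadoop-based email classification method greatly improves the efficiency of classifying email at large scale.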

17.
Incremental learning has been used extensively for data stream classification. Most attention in data stream classification has been paid to non-evolutionary methods. In this paper, we introduce new incremental learning algorithms based on harmony search. We first propose a new classification algorithm for batch data, called the harmony-based classifier, and then give its incremental version for the classification of data streams, called the incremental harmony-based classifier. Finally, we improve it to reduce its computational overhead in the absence of drifts and increase its robustness in the presence of noise. This improved version is called the improved incremental harmony-based classifier. The proposed methods are evaluated on some real-world and synthetic data sets. Experimental results show that the proposed batch classifier outperforms some batch classifiers and that the proposed incremental methods can effectively address the issues usually encountered in data stream environments. The improved incremental harmony-based classifier has significantly better speed and accuracy in capturing concept drifts than the non-incremental harmony-based method, and its accuracy is comparable to that of non-evolutionary algorithms. The experimental results also show the robustness of the improved incremental harmony-based classifier.

18.
19.
Remote sensing image classification is a common application of remote sensing images. In order to improve the performance of remote sensing image classification, multiple classifier combinations are used to classify Landsat-8 Operational Land Imager (Landsat-8 OLI) images. Some techniques and classifier combination algorithms are investigated. A classifier ensemble consisting of five member classifiers is constructed. The results of every member classifier are evaluated. A voting strategy is used to combine the classification results of the member classifiers. The results show that all the classifiers have different performances and that the multiple classifier combination provides better performance than a single classifier and achieves higher overall classification accuracy. The experiment shows that the multiple classifier combinations using producer’s accuracy as the voting weight (MCCmod2 and MCCmod3) present higher classification accuracy than the algorithm using overall accuracy as the voting weight (MCCmod1). In addition, the multiple classifier combinations using different voting weights affected the classification result differently for different land-cover types. The multiple classifier combination algorithm presented in this article, which uses voting weights based on the accuracy of multiple classifiers, may have stability problems, which need to be addressed in future studies.
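
The difference between the two kinds of voting weight mentioned in this abstract can be sketched as follows. The abstract does not define MCCmod1, MCCmod2, or MCCmod3 in detail, so this is only an assumed reading: each member classifier votes for its predicted class with a weight that is either its overall accuracy or its producer's accuracy (per-class recall) for the class it predicted. All names and numbers are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, class_weights):
    """Combine one prediction per classifier into a single label.

    predictions:   list of predicted labels, one per member classifier.
    class_weights: list of dicts, one per member classifier, mapping a
                   label to the weight of that classifier's vote for it.
    """
    scores = defaultdict(float)
    for label, weights in zip(predictions, class_weights):
        scores[label] += weights[label]
    return max(scores, key=scores.get)

# Hypothetical pixel: classifier A predicts "water", B and C predict "forest".
predictions = ["water", "forest", "forest"]

# Assumed per-class producer's accuracies for each classifier.
producer_accuracy = [
    {"water": 0.95, "forest": 0.60},
    {"water": 0.70, "forest": 0.40},
    {"water": 0.60, "forest": 0.45},
]

# Assumed overall accuracies, applied identically to every class.
overall_accuracy = [
    {"water": 0.70, "forest": 0.70},
    {"water": 0.72, "forest": 0.72},
    {"water": 0.74, "forest": 0.74},
]

by_producer = weighted_vote(predictions, producer_accuracy)
by_overall = weighted_vote(predictions, overall_accuracy)
```

With these made-up numbers the overall-accuracy weights follow the majority, while the per-class weights let the one classifier that is reliable on "water" outvote two classifiers that are weak on "forest", which illustrates how the choice of weight can change results for individual land-cover types.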

20.
《Applied Soft Computing》2008,8(1):437-445
In this paper we present two methods to create multiple classifier systems based on an initial transformation of the original features to the binary domain and subsequent decompositions (quantisation). Both methods are generally applicable, although in this work they are applied to grey-scale pixel values of facial images, which form the original feature domain. We further investigate the issue of diversity within the generated ensembles of classifiers, which emerges as an important concept in classifier fusion, and propose a formal definition based on statistically independent classifiers, using the κ statistic to quantitatively assess it. Results show that our methods outperform a number of alternative algorithms applied on the same dataset, while our analysis indicates that diversity among the classifiers in a combination scheme is not sufficient to guarantee performance improvements. Rather, some type of trade-off seems to be necessary between participant classifiers’ accuracy and ensemble diversity in order to achieve maximum recognition gains.
