Similar Literature
20 similar documents found
1.
The k-nearest-neighbor classification algorithm is simple in principle and performs well, but its high time complexity makes it unsuitable for online spam filtering in real applications. In the modeling stage, this paper first performs an initial clustering of the training e-mails, partitioning them into initial clusters of nearly equal radius; these clusters are then re-clustered with a shared-nearest-neighbor graph clustering algorithm, and the final clusters are treated as a classification model that can be updated incrementally. The classic k-nearest-neighbor algorithm then classifies unknown e-mails against this model. Experimental results on the public Ling-Spam corpus show that the proposed spam identification algorithm achieves high spam recognition accuracy with low time complexity.
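As a rough illustration of the idea summarized above — cluster the training mails once, then classify new mails with k-NN against a compact cluster-level model — the following Python sketch assumes scikit-learn with a TF-IDF representation and uses plain KMeans in place of the paper's radius-based initial clustering and shared-nearest-neighbor re-clustering; these substitutions, the 0/1 integer labels and the parameter values are assumptions, not the authors' procedure:

```python
# Sketch: cluster the training mails, then run k-NN over cluster centroids.
# KMeans stands in for the paper's initial/SNN clustering; labels are 0 = ham, 1 = spam.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def build_model(mails, labels, n_clusters=50):
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(mails)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # Each centroid inherits the majority label of the mails in its cluster.
    y = np.asarray(labels)
    centroid_labels = np.array([np.bincount(y[km.labels_ == c]).argmax()
                                for c in range(n_clusters)])
    knn = KNeighborsClassifier(n_neighbors=3, algorithm="brute")
    knn.fit(km.cluster_centers_, centroid_labels)
    return vec, knn

def classify(vec, knn, new_mails):
    return knn.predict(vec.transform(new_mails))
```

Classifying against a few dozen centroids instead of the full training set is what keeps the per-message cost low; incremental updates could be approximated by re-assigning new mails to their nearest centroid.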

2.
万年红  谭文安  王雪蓉 《计算机工程》2011,37(9):110-111,114
To address the low efficiency of traditional software engineering knowledge classification methods, an improved classification method is proposed. Component behaviors are clustered according to the Software Engineering Body of Knowledge (SWEBOK); the correlation coefficients, the optimal number of clusters and the fuzzy correlation matrix are determined; a software knowledge classification system is then generated based on the K-NN algorithm and structural modeling; and new knowledge is assigned to the corresponding SWEBOK category according to prior knowledge from training. Experimental results show that the method achieves good classification performance.

3.
A fuzzy rule extraction method based on improved fuzzy C-means clustering is proposed. A classification algorithm built on the extracted fuzzy rules is then presented and evaluated in simulation on the IRIS data set. The results show that the algorithm still achieves good classification performance even with few training samples, demonstrating that the proposed fuzzy rule generation method is effective.

4.
A dynamic-clustering-based method for generating fuzzy classification rules is presented; the method determines the number of rules as well as the positions and shapes of the membership functions. The basic form of fuzzy classification rules with hyper-cone membership functions is first introduced. A dynamic clustering algorithm is then described that dynamically partitions the training patterns of each class into clusters, and one fuzzy rule is built for each cluster. The slopes of the membership functions are adjusted to improve the recognition rate on the training patterns, thereby tuning and optimizing the fuzzy classification rules. The method is evaluated on two typical data sets; the resulting classification system matches a multilayer neural network classifier in recognition rate while requiring far less training time.

5.
For automatic room-temperature control in smart-home systems, an automatic classification method for daily room-temperature curves is proposed that combines fuzzy clustering with support vector machines. Fuzzy clustering groups the daily room-temperature curves of different users for the same day, so that smart-home users are categorized by their temperature preferences. A support vector machine is then trained on the clustered data to build typical daily room-temperature curve models for each user category, which are later used to assign newly joined users to a category. To improve model accuracy, the SVM parameters are selected by cross-validation, and experiments verify the effectiveness of the method.
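The cross-validated SVM parameter selection mentioned above can be sketched as follows, assuming scikit-learn, an RBF-kernel SVC, daily curves stored as fixed-length vectors, and a purely illustrative C/gamma grid:

```python
# Sketch: choose SVM hyper-parameters by cross-validation, then fit the
# classifier on the daily room-temperature curves labelled by fuzzy clustering.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_temperature_classifier(curves, cluster_labels):
    # curves: array of shape (n_users, n_readings_per_day); labels from clustering.
    grid = GridSearchCV(
        SVC(kernel="rbf"),
        param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
        cv=5,  # 5-fold cross-validation
    )
    grid.fit(curves, cluster_labels)
    return grid.best_estimator_, grid.best_params_
```

The best estimator returned here would then be used to assign newly joined users to one of the learned temperature-preference categories.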

6.
A large-margin Bayesian fuzzy clustering algorithm with cross-emerging cluster centers (CECLM-BFC) is proposed for learning rule antecedents. The repulsion between cluster centers of samples from different classes is exploited to maximize the distance between cluster centers, and a particle filter is run alternately over samples of different classes to automatically obtain the optimal clustering result, including the number of clusters, the fuzzy memberships and the cluster centers. For learning the rule-consequent parameters, a large-margin strategy on the classification surface is adopted, and a highly interpretable Bayesian MA-type fuzzy system (BMA-FS) is constructed with the MA-type fuzzy system as the object of study. Experimental results show that BMA-FS achieves satisfactory classification performance and that its fuzzy rules are highly interpretable.

7.
The idea of fuzzy clustering is introduced into a genetic-algorithm-based information filtering system: each individual in the population is clustered directly with a fuzzy similarity matrix, the fitness of the population is then evaluated with a proposed fitness function based on the clustering result, and iterative training yields a more accurate user-interest template, thereby improving filtering accuracy. The method is applied to and validated in the network information filtering system designed in this work.

8.
Clustering is the process of distinguishing and classifying objects according to their similarity. Traditional cluster analysis is a hard partition: every object to be identified is assigned strictly to one class, in an either/or fashion, so the class boundaries are crisp. In reality, however, most objects do not have such strict attributes; they are intermediate in form and category, and are better suited to a soft partition. The fuzzy set theory proposed in 1965 by Zadeh, the founder of fuzzy theory, provides a powerful analytical tool for such soft partitions; fuzzy methods applied to clustering problems became known as fuzzy clustering. This paper studies and implements a fuzzy clustering algorithm based on equivalence relations, which takes membership degrees as the starting point for clustering and a fuzzy equivalence matrix as the heuristic rule. First, the data matrix is obtained from the given samples via data standardization; next, the data matrix is calibrated with the dot-product method to build a fuzzy similarity matrix; then the transitive-closure method converts the fuzzy similarity matrix into a fuzzy equivalence matrix, different elements of which are taken as thresholds λ, and by the definition of the λ-cut matrix the fuzzy equivalence matrix is converted into a 0-1 matrix; finally, columns with identical elements are grouped into the same class. A worked example shows that clustering with this equivalence-relation-based fuzzy clustering algorithm gives correct results.
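A compact NumPy sketch of the pipeline described above (standardize, build a fuzzy similarity matrix, take the transitive closure, apply a λ-cut, group identical rows); the scaling step that maps similarities into [0, 1] is an illustrative stand-in for the dot-product calibration method:

```python
# Sketch of fuzzy clustering via a fuzzy equivalence matrix:
# standardize -> fuzzy similarity matrix -> transitive closure -> lambda-cut -> classes.
import numpy as np

def fuzzy_equivalence_clusters(X, lam):
    # 1. Standardize the data matrix column-wise.
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # 2. Fuzzy similarity matrix; this scaling into [0, 1] is an illustrative
    #    stand-in for the dot-product calibration method.
    R = Z @ Z.T
    R = (R / np.abs(R).max() + 1) / 2
    np.fill_diagonal(R, 1.0)
    # 3. Transitive closure by repeated max-min composition until R∘R = R.
    while True:
        R2 = np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1)
        if np.allclose(R2, R):
            break
        R = R2
    # 4. Lambda-cut: threshold into a 0/1 matrix, then group identical rows
    #    (the matrix is symmetric, so rows and columns give the same grouping).
    cut = (R >= lam).astype(int)
    _, labels = np.unique(cut, axis=0, return_inverse=True)
    return labels
```

Varying λ from high to low merges the 0/1 matrix's identical rows into progressively coarser classes, which is how the λ-cut produces a hierarchy of clusterings.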

9.
Large-scale classification problems arise frequently in practice. The traditional k-NN (k-Nearest Neighbor) method must scan the entire training set, so its classification efficiency is low and it cannot handle tasks with large training sets. To address this, a clustering-based accelerated k-NN classification method, C_kNN (Speeding k-NN Classification Method Based on Clustering), is proposed. The training samples are first clustered and the center of each cluster is computed; the training samples most similar to the cluster centers are selected to form a new, reduced training set. For each test sample, the k most similar samples in the reduced training set are found, and the majority class label among these k neighbors is taken as the predicted class. Experimental results show that C_kNN substantially improves classification efficiency while maintaining high classification accuracy.
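A rough Python sketch of this reduce-then-classify idea, assuming scikit-learn and NumPy arrays for X and y; KMeans plays the role of the initial clustering, Euclidean distance stands in for the similarity measure, and the cluster count and per-cluster sample budget are arbitrary:

```python
# Sketch of clustering-accelerated k-NN: keep only the training samples closest
# to each cluster centre, then run ordinary k-NN on that reduced subset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def fit_reduced_knn(X, y, n_clusters=20, per_cluster=10, k=5):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    keep = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        keep.extend(idx[np.argsort(dist)[:per_cluster]])  # closest to the centre
    keep = np.asarray(keep)
    # Ordinary k-NN, but over n_clusters * per_cluster samples at most.
    return KNeighborsClassifier(n_neighbors=k).fit(X[keep], y[keep])
```

The speed-up comes from the reduced training set size; accuracy depends on how well the retained representatives cover each class.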

10.
胡小生  张润晶  钟勇 《计算机科学》2013,40(11):271-275
Classification of class-imbalanced data is a hot topic in machine learning and data mining. Traditional classification algorithms are strongly biased, and classification of the minority class is unsatisfactory. A two-level clustering cascade mining algorithm for class-imbalanced data is proposed. The algorithm first performs clustering-based undersampling: the majority-class samples are clustered and the cluster centroids are extracted so that their number matches the number of minority-class samples, and these centroids together with all minority-class samples form a new balanced training set. To avoid the drop in accuracy caused by a training set that becomes too small when minority samples are scarce, SMOTE oversampling is combined with the clustering-based undersampling. A cascaded classifier combining K-means clustering with the C4.5 decision tree is then applied on the balanced training set: K-means partitions the training samples into K clusters, a C4.5 decision tree is built within each cluster, and the K per-cluster trees refine and optimize the classification decision boundary. Experimental results show that the algorithm has advantages for classifying class-imbalanced data.
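A simplified sketch of the two-level idea, assuming scikit-learn, the imbalanced-learn package for SMOTE, binary 0/1 labels with 1 as the minority class, and DecisionTreeClassifier standing in for C4.5; the SMOTE ratio and cluster counts are arbitrary choices, not the paper's settings:

```python
# Sketch: SMOTE + centroid undersampling to balance the data, then K-means
# with one decision tree per cluster (DecisionTreeClassifier stands in for C4.5).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed

def fit_cascade(X, y, k_clusters=5):
    # Enlarge the minority class (label 1) so the balanced set is not too small;
    # assumes the minority class starts below 50% of the majority class.
    X, y = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
    X_maj, X_min = X[y == 0], X[y == 1]
    # Undersample the majority class to as many centroids as minority samples.
    km = KMeans(n_clusters=len(X_min), n_init=10, random_state=0).fit(X_maj)
    Xb = np.vstack([km.cluster_centers_, X_min])
    yb = np.hstack([np.zeros(len(X_min), dtype=int), np.ones(len(X_min), dtype=int)])
    # Cascade: partition the balanced set with K-means, one tree per cluster.
    part = KMeans(n_clusters=k_clusters, n_init=10, random_state=0).fit(Xb)
    trees = {c: DecisionTreeClassifier(random_state=0).fit(Xb[part.labels_ == c],
                                                           yb[part.labels_ == c])
             for c in range(k_clusters)}
    return part, trees

def predict_cascade(part, trees, X_new):
    # Route each test sample to its K-means cluster and let that cluster's tree decide.
    return np.array([trees[c].predict(x.reshape(1, -1))[0]
                     for c, x in zip(part.predict(X_new), X_new)])
```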

11.
Despite the enormous importance of e-mail to current worldwide communication, the increase of spam deliveries has had a significant adverse effect on all its users. In order to adequately fight spam, both the filtering industry and the scientific community have developed and deployed the fastest and most accurate filtering techniques. However, the increasing volume of new incoming messages needing classification, together with the lack of adequate support for anti-spam services on the cloud, makes filtering efficiency an absolute necessity. In this context, and given the extensive utilization and increasing significance of rule-based filtering frameworks for the anti-spam domain, this work studies and analyses the importance of both existing and novel scheduling strategies to make the most of currently available anti-spam filtering techniques. Results obtained from the experiments demonstrated that some scheduling alternatives resulted in time savings of up to 26% for filtering messages, while maintaining the same classification accuracy.

12.
曾超  吕钊  顾君忠 《计算机应用》2008,28(12):3248-3250
An e-mail classification method based on a concept vector space model is proposed. When extracting e-mail feature vectors, the WordNet lexical ontology is used as the basis: synset concepts replace individual terms, and hypernym/hyponym relations between synsets are also taken into account, so that a concept vector space model of the e-mail is built as its feature vector. The TF*IWF*IWF method is used to adjust the weights of the concept vector, and a simple vector-distance classifier finally determines the e-mail's category. Experimental results show that the method effectively improves e-mail classification accuracy when the training set is limited.
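The synset-based feature extraction can be illustrated with NLTK's WordNet interface. The sketch below is a loose approximation that takes each token's first synset as its concept and uses a TF·IWF² style weight, with IWF taken as the log of total corpus concept occurrences over the concept's frequency; the weighting formula, the first-synset choice and the omission of hypernym/hyponym expansion are all simplifying assumptions rather than the paper's exact scheme:

```python
# Sketch: map tokens to WordNet synset "concepts", then weight them TF * IWF^2.
# Requires NLTK with the WordNet corpus downloaded (nltk.download("wordnet")).
import math
from collections import Counter
from nltk.corpus import wordnet as wn

def to_concepts(tokens):
    concepts = []
    for tok in tokens:
        syns = wn.synsets(tok)
        concepts.append(syns[0].name() if syns else tok)  # first synset as the concept
    return concepts

def concept_vector(doc_tokens, corpus_concept_counts):
    # corpus_concept_counts: Counter of concept frequencies over the whole corpus.
    total = sum(corpus_concept_counts.values())
    tf = Counter(to_concepts(doc_tokens))
    return {c: n * math.log(total / corpus_concept_counts.get(c, 1)) ** 2
            for c, n in tf.items()}
```

Mapping synonyms to a shared synset is what lets a small training set cover more vocabulary than a plain bag-of-words model would.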

13.
Content-based e-mail filtering is essentially a binary text classification problem. Feature selection reduces the feature space before classification, cutting the classifier's computational and storage costs while filtering out some of the noise to improve accuracy; it is therefore an important factor in both the accuracy and the timeliness of e-mail filtering. However, different feature selection algorithms perform differently under the same evaluation setting and depend on the classifier and on the distribution of the data set. Taking the characteristics of e-mail filtering into account, the performance of various feature selection algorithms in this domain is evaluated and analysed from three aspects: classifier adaptability, data set dependency, and time complexity. Experimental results show that the odds ratio and document frequency yield high spam recognition accuracy and low computation time when used for e-mail filtering.

14.
Classifier performance optimization in machine learning can be stated as a multi-objective optimization problem. In this context, recent works have shown the utility of simple evolutionary multi-objective algorithms (NSGA-II, SPEA2) to conveniently optimize the global performance of different anti-spam filters. The present work extends existing contributions in the spam filtering domain by using three novel indicator-based (SMS-EMOA, CH-EMOA) and decomposition-based (MOEA/D) evolutionary multi-objective algorithms. The proposed approaches are used to optimize the performance of a heterogeneous ensemble of classifiers in two different but complementary scenarios: parsimony maximization and e-mail classification under a low confidence level. Experimental results using a publicly available standard corpus allowed us to identify interesting conclusions regarding both the utility of rule-based classification filters and the appropriateness of a three-way classification system in the spam filtering domain.

15.
This paper proposes a classification method that is based on easily interpretable fuzzy rules and fully capitalizes on two key technologies: pruning the outliers in the training data with SVMs (support vector machines), i.e., eliminating the influence of outliers on the learning process; and finding a fuzzy set with a sound linguistic interpretation to describe each class based on AFS (axiomatic fuzzy set) theory. Compared with other fuzzy rule-based methods, the proposed models are usually more compact and easier for users to understand, since each class is described by far fewer rules. The proposed method also comes with two other advantages: each rule obtained from the algorithm is simply a conjunction of linguistic terms, and no parameters need to be tuned. The proposed classification method is compared with previously published fuzzy rule-based classifiers by testing them on 16 UCI data sets. The results show that the fuzzy rule-based classifier presented in this paper offers a compact, understandable and accurate classification scheme. A balance is achieved between interpretability and accuracy.

16.
A Rough Set Analysis Model for E-mail Filtering Systems   Total citations: 10, self-citations: 2, citations by others: 10
E-mail filtering is a hot topic in network information security research. Rough set theory, a novel mathematical tool for handling vagueness and imprecision, has been widely applied. Combining the data-analysis techniques of rough set theory, this paper studies the modeling of an e-mail filtering system and the discovery of its features. The results of a case analysis show that the method is effective.

17.
In spam filtering, considering that feature terms contribute differently to the classification of legitimate and spam e-mails, a classification-contribution ratio coefficient is defined, and the idea of per-term classification contribution is applied to feature selection and to the design of a naive Bayes filter. Experiments on an English corpus show that the spam filtering method using feature-term classification contribution effectively improves the filter's ability to recognize both legitimate and spam e-mails and reduces its misclassification rates on both.
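As a point of reference for the filter design discussed above, a plain multinomial naive Bayes spam filter can be put together in a few lines with scikit-learn; the paper's classification-contribution ratio coefficient is not reproduced here, so this is only a baseline sketch:

```python
# Baseline sketch: bag-of-words naive Bayes spam filter (no contribution-ratio weighting).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def fit_spam_filter(mails, labels):
    model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    return model.fit(mails, labels)

# Usage: fit_spam_filter(train_texts, train_labels).predict(new_texts)
```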

18.
《Information Sciences》2007,177(10):2167-2187
In this paper we study supervised and semi-supervised classification of e-mails. We consider two tasks: filing e-mails into folders and spam e-mail filtering. Firstly, in a supervised learning setting, we investigate the use of random forest for automatic e-mail filing into folders and spam e-mail filtering. We show that random forest is a good choice for these tasks as it runs fast on large and high-dimensional databases, is easy to tune and is highly accurate, outperforming popular algorithms such as decision trees, support vector machines and naïve Bayes. We introduce a new accurate feature selector with linear time complexity. Secondly, we examine the applicability of the semi-supervised co-training paradigm for spam e-mail filtering by employing random forests, support vector machines, decision trees and naïve Bayes as base classifiers. The study shows that a classifier trained on a small set of labelled examples can be successfully boosted with unlabelled examples to an accuracy rate only 5% lower than a classifier trained on all labelled examples. We investigate the performance of co-training with one natural feature split and show that in the domain of spam e-mail filtering it can be as competitive as co-training with two natural feature splits.

19.
In this paper, we introduce a new adaptive rule-based classifier for multi-class classification of biological data, where several problems of classifying biological data are addressed: overfitting, noisy instances and class-imbalanced data. It is well known that rules are an interesting way of representing data in a human-interpretable form. The proposed rule-based classifier combines the random subspace and boosting approaches with an ensemble of decision trees to construct a set of classification rules without involving global optimisation. The classifier uses the random subspace approach to avoid overfitting, the boosting approach for classifying noisy instances and an ensemble of decision trees to deal with the class-imbalance problem. The classifier uses two popular classification techniques: decision tree and k-nearest-neighbor algorithms. Decision trees are used for evolving classification rules from the training data, while k-nearest-neighbor is used for analysing the misclassified instances and removing vagueness between contradictory rules. It runs a series of k iterations to develop a set of classification rules from the training data, paying more attention to the misclassified instances in the next iteration, which gives it a boosting flavour. This paper particularly focuses on building an optimal ensemble classifier that helps improve the prediction accuracy of the DNA variant identification and classification task. The performance of the proposed classifier is tested and compared with well-established existing machine learning and data mining algorithms on genomic data (148 Exome data sets) of Brugada syndrome and 10 real benchmark life sciences data sets from the UCI (University of California, Irvine) machine learning repository. The experimental results indicate that the proposed classifier has exemplary classification accuracy on different types of biological data. Overall, the proposed classifier offers good prediction accuracy for the classification of new DNA variants, where noisy and misclassified variants are optimised to increase test performance.

20.
Removing or filtering outliers and mislabeled instances prior to training a learning algorithm has been shown to increase classification accuracy, especially in noisy data sets. A popular approach is to remove any instance that is misclassified by a learning algorithm. However, the use of ensemble methods has also been shown to generally increase classification accuracy. In this paper, we extensively examine filtering and ensembling. We examine 9 learning algorithms individually and ensembled together as filtering algorithms, as well as the effects of filtering on the 9 chosen learning algorithms, on a set of 54 data sets. We compare the filtering results with using a majority voting ensemble. We find that the majority voting ensemble significantly outperforms filtering unless there are high amounts of noise present in the data set. Additionally, in most cases, using an ensemble of learning algorithms for filtering produces a greater increase in classification accuracy than using a single learning algorithm for filtering.

