首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 176 毫秒
1.
在文本分类中获得有类别标记训练样本的代价是很高昂的,本文针对这个问题对传统的模糊聚类方法进行改进,提出模糊划分聚类方法FPCM,将聚类的无监督性和样本的先验知识结合起来,通过相似度度量聚类相关文本,取得比较客观的簇和少量标记文本,为监督学习找到分类依据,并结合朴素贝叶斯增量学习方式进行分类器的学习.本文进一步用估计分类误差损失的方法平衡选取候选样本,提高了分类准确率,实现了应用范围更加广泛的无标记文本分类学习模型.  相似文献   

2.
提出了一种基于K近邻(KNN)原理的快速文本分类算法。该算法不仅具有原始K近邻算法分类效果好的优点,还通过对训练样本进行压缩,消除相似度之间的比较,提高了分类效率。实验表明,该算法用于邮件过滤系统时,分类效果要优于基于朴素贝叶斯分类器的二项独立模型和多项式模型,而分类的时间复杂度与其相当,完全可以应用于实时邮件过滤。  相似文献   

3.
I-vector说话人识别系统常用距离来衡量说话人语音间的相似度。加权成对约束度量学习算法(WPCML)利用成对训练样本的加权约束信息训练一个用于计算马氏距离的度量矩阵。该度量矩阵表示的样本空间中,同类样本间的距离更小,非同类样本间的距离更大。在美国国家标准技术局(NIST)2008年说话人识别评测数据库(SRE08)的实验结果表明,WPCML算法训练度量矩阵用于马氏距离相似度打分,比用余弦距离相似度打分的性能更好。选择训练样本对方法用于构造度量学习训练样本集能进一步提高系统实验性能,并优于目前最流行的PLDA分类器。  相似文献   

4.
针对特征选择中存在数据缺乏类别信息的问题,提出一种新型的基于改进ReliefF的无监督特征选择方法UFS-IR.由于ReliefF类算法存在小类样本抽样概率低、无法删除冗余特征的缺陷,该方法以DBSCAN聚类算法指导分类,通过改进抽样策略,使用调整的余弦相似度度量特征间的相关性作为去冗余的凭据.实验表明UFS-IR可以有效缩减数据维度的同时保证特征子集的最大相关最小冗余性,具有很好的性能.  相似文献   

5.
可分性判据在中文网页分类中的应用   总被引:3,自引:0,他引:3  
提出了一种改进的基于统计的中文网页的分类算法。通过对传统的基于计算相似度文本分类方法和基于贝叶斯模型文本分类算法的研究,我们对贝叶斯模型分类算法进行了改进,提出了利用一种基于概率分布的可分性判据分类方法,即用类别密度函数似然比来增加特征词的可分性信息的算法。通过对计算相似度方法,贝叶斯方法及改进的贝叶斯方法的对比实验表明,改进算法可以使类与类的间隔最大化,因而具有较高的分类精确率和召回率。  相似文献   

6.
在基于微博数据训练分类模型的过程当中,我们可以通过主动学习有效的减少需标注数据的数据量,SVM主动学习算法是主动学习中相当著名的算法,但是该算法还存在缺陷,就是没有对微博数据内容多样的特点进行充分考虑,因此在本文中作者提出了一种新的基于支持向量机(SVM)的主动学习算法,该算法通过未标注样本点与所有已标注样本点之间的余弦相似度之和来度量未标注样本与所有已标注样本点之间的相似性,通过选择与已选择的所有样本不相似的样本点进行标注就可以实现对于数据多样性的充分考虑;另外,为了避免太大的余弦相似度值对于余弦相似度之和的影响,该算法通过一种设置阈值的方法来使得被选择样本的最小余弦相似度尽可能大;除此之外,为了选择最佳的样本进行标注,在算法中我们在考虑数据多样性的同时也对样本点和分类超平面之间的距离进行了考虑。  相似文献   

7.
针对朴素贝叶斯分类器在分类过程中,不同类别的同一特征量之间存在相似性,易导致误分类的现象,提出基于引力模型的朴素贝叶斯分类算法。提出以引力公式中的距离变量的平方作为“相似距离”,应用引力模型来刻画特征与其所属类别之间的相似度,从而克服朴素贝叶斯分类算法容易受到条件独立假设的影响,将所有特征同质化的缺点,并能有效地避免噪声干扰,达到修正先验概率、提高分类精度的目的。对遥感图像的分类实验表明,基于引力模型的朴素贝叶斯分类算法易于实现,可操作性强,且具有更高的平均分类准确率。  相似文献   

8.
k近邻方法是文本分类中广泛应用的方法,对其性能的优化具有现实需求。使用一种改进的聚类算法进行样本剪裁以提高训练样本的类别表示能力;根据样本的空间位置先后实现了基于类内和类间分布的样本加权;改善了k近邻算法中的大类别、高密度训练样本占优现象。实验结果表明,提出的改进文本加权方法提高了分类器的分类效率。  相似文献   

9.
现有的时间序列的相似性度量大多基于欧氏距离,并不适用于不同粒度时间序列的相似性匹配,无法直接对其相似性进行有效的度量,为此,提出一种基于对应差值比样本的相似性度量,用于不同粒度时间序列的相似性匹配.首先对不同时间粒度的时序数据进行阐述,并定义了对应差值比样本与相似度计算方法;接着提出基于它们的相似性匹配算法;最后实验证...  相似文献   

10.
针对当前基于距离测度学习的行人再识别算法中因训练样本少而出现的过拟合问题,提出正则化独立测度矩阵的行人再识别算法.该算法首先在4个不同的颜色空间单独学习测度矩阵,然后分别对相应的测度矩阵进行正则化,测试样本通过正则化后的测度矩阵进行相似性度量,最后结合相似性度量结果得到最终相似度.实验表明,相比原有算法,文中算法在性能上有进一步提升,并可改善训练样本少时出现的过拟合问题.  相似文献   

11.
An increasing number of computational and statistical approaches have been used for text classification, including nearest-neighbor classification, naïve Bayes classification, support vector machines, decision tree induction, rule induction, and artificial neural networks. Among these approaches, naïve Bayes classifiers have been widely used because of its simplicity. Due to the simplicity of the Bayes formula, the naïve Bayes classification algorithm requires a relatively small number of training data and shorter time in both the training and classification stages as compared to other classifiers. However, a major short coming of this technique is the fact that the classifier will pick the highest probability category as the one to which the document is annotated too. Doing this is tantamount to classifying using only one dimension of a multi-dimensional data set. The main aim of this work is to utilize the strengths of the self organizing map (SOM) to overcome the inadvertent dimensionality reduction resulting from using only the Bayes formula to classify. Combining the hybrid system with new ranking techniques further improves the performance of the proposed document classification approach. This work describes the implementation of an enhanced hybrid classification approach which affords a better classification accuracy through the utilization of two familiar algorithms, the naïve Bayes classification algorithm which is used to vectorize the document using a probability distribution and the self organizing map (SOM) clustering algorithm which is used as the multi-dimensional unsupervised classifier.  相似文献   

12.
Organizations often manage identity information for their customers, vendors, and employees. Identity management is critical to various organizational practices ranging from customer relationship management to crime investigation. The task of searching for a specific identity is difficult because disparate identity information may exist due to the issues related to unintentional errors and intentional deception. In this paper we propose a hierarchical Naïve Bayes model that improves existing identity matching techniques in terms of searching effectiveness. Experiments show that our proposed model performs significantly better than the exact-match based matching technique. With 50% training instances labeled, the proposed semi-supervised learning achieves a performance comparable to the fully supervised record comparison algorithm. The semi-supervised learning greatly reduces the efforts of manually labeling training instances without significant performance degradation.  相似文献   

13.
庄媛  张鹏程  李雯睿  冯钧  朱跃龙 《软件学报》2016,27(8):1978-1992
面向服务系统的执行能力依赖第三方提供的服务,在复杂多变的网络环境中,这种依赖会带来服务质量(QoS)的不确定性.而QoS是衡量第三方服务质量的重要标准,因此,有效监控QoS是对Web服务实现质量控制的必要过程.现有监控方法都未考虑环境因素的影响,比如服务器位置、用户使用服务的位置和使用时间段负载等,而这些影响在实际监控中是存在的,忽略环境因素会导致监控结果与实际结果有悖.针对这一问题,提出了一种基于加权朴素贝叶斯算法wBSRM(weightednaive Bayes running monitoring)的Web Service QoS监控方法.受机器学习分类方法的启发,通过TF-IDF(term frequency-inverse document frequency)算法计算环境因素的影响,通过对部分样本进行学习,构建加权朴素贝叶斯分类器.将监控结果分类,满足QoS标准为c0,不满足QoS标准为c1,监控时调用分类器得到c0c1的后验概率之比,对比值进行分析,可得监控结果满足QoS属性标准、不满足QoS属性标准和不能判断这3种情况.在网络开源数据以及随机数据集上的实验结果表明:利用TF-IDF算法能够准确地估算环境因子权值,通过加权朴素贝叶斯分类器,能够更好地监控QoS,效率显著优于现有方法.  相似文献   

14.
Each type of classifier has its own advantages as well as certain shortcomings. In this paper, we take the advantages of the associative classifier and the Naïve Bayes Classifier to make up the shortcomings of each other, thus improving the accuracy of text classification. We will classify the training cases with the Naïve Bayes Classifier and set different confidence threshold values for different class association rules (CARs) to different classes by the obtained classification accuracy rate of the Naïve Bayes Classifier to the classes. Since the accuracy rates of all selected CARs of the class are higher than that obtained by the Naïve Bayes Classifier, we could further optimize the classification result through these selected CARs. Moreover, for those unclassified cases, we will classify them with the Naïve Bayes Classifier. The experimental results show that combining the advantages of these two different classifiers better classification result can be obtained than with a single classifier.  相似文献   

15.
文本分类是自然语言处理领域的一项重要任务,具有广泛的应用场景,比如知识问答、文本主题分类、文本情感分析等.解决文本分类任务的方法有很多,如支持向量机(Support Vector Machines,SVM)模型和朴素贝叶斯(Naïve Bayes)模型,现在被广泛使用的是以循环神经网络(Recurrent Neural Network,RNN)和文本卷积网络(TextConventional Neural Network,TextCNN)为代表的神经网络模型.本文分析了文本分类领域中的序列模型和卷积模型,并提出一种组合序列模型和卷积模型的混合模型.在公开数据集上对不同模型进行性能上的对比,验证了组合模型的性能要优于单独的模型.  相似文献   

16.
Privacy-preserving Naïve Bayes classification   总被引:1,自引:0,他引:1  
Privacy-preserving data mining—developing models without seeing the data – is receiving growing attention. This paper assumes a privacy-preserving distributed data mining scenario: data sources collaborate to develop a global model, but must not disclose their data to others. The problem of secure distributed classification is an important one. In many situations, data is split between multiple organizations. These organizations may want to utilize all of the data to create more accurate predictive models while revealing neither their training data/databases nor the instances to be classified. Naïve Bayes is often used as a baseline classifier, consistently providing reasonable classification performance. This paper brings privacy-preservation to that baseline, presenting protocols to develop a Naïve Bayes classifier on both vertically as well as horizontally partitioned data.  相似文献   

17.
One of the simplest, and yet most consistently well-performing set of classifiers is the Naïve Bayes models. These models rely on two assumptions: (i) All the attributes used to describe an instance are conditionally independent given the class of that instance, and (ii) all attributes follow a specific parametric family of distributions. In this paper we propose a new set of models for classification in continuous domains, termed latent classification models. The latent classification model can roughly be seen as combining the Naïve Bayes model with a mixture of factor analyzers, thereby relaxing the assumptions of the Naïve Bayes classifier. In the proposed model the continuous attributes are described by a mixture of multivariate Gaussians, where the conditional dependencies among the attributes are encoded using latent variables. We present algorithms for learning both the parameters and the structure of a latent classification model, and we demonstrate empirically that the accuracy of the proposed model is significantly higher than the accuracy of other probabilistic classifiers.Editors: Pedro Larrañaga, Jose A. Lozano, Jose M. Peña and Iñaki Inza  相似文献   

18.
Classification problems have a long history in the machine learning literature. One of the simplest, and yet most consistently well-performing set of classifiers is the Naïve Bayes models. However, an inherent problem with these classifiers is the assumption that all attributes used to describe an instance are conditionally independent given the class of that instance. When this assumption is violated (which is often the case in practice) it can reduce classification accuracy due to “information double-counting” and interaction omission. In this paper we focus on a relatively new set of models, termed Hierarchical Naïve Bayes models. Hierarchical Naïve Bayes models extend the modeling flexibility of Naïve Bayes models by introducing latent variables to relax some of the independence statements in these models. We propose a simple algorithm for learning Hierarchical Naïve Bayes models in the context of classification. Experimental results show that the learned models can significantly improve classification accuracy as compared to other frameworks.  相似文献   

19.
Accurate classification of microarray data plays a vital role in cancer prediction and diagnosis. Previous studies have demonstrated the usefulness of naïve Bayes classifier in solving various classification problems. In microarray data analysis, however, the conditional independence assumption embedded in the classifier itself and the characteristics of microarray data, e.g. the extremely high dimensionality, may severely affect the classification performance of naïve Bayes classifier. This paper presents a sequential feature extraction approach for naïve Bayes classification of microarray data. The proposed approach consists of feature selection by stepwise regression and feature transformation by class-conditional independent component analysis. Experimental results on five microarray datasets demonstrate the effectiveness of the proposed approach in improving the performance of naïve Bayes classifier in microarray data analysis.  相似文献   

20.
一种半监督集成跨项目软件缺陷预测方法   总被引:2,自引:2,他引:0  
何吉元  孟昭鹏  陈翔  王赞  樊向宇 《软件学报》2017,28(6):1455-1473
软件缺陷预测方法可以在项目的开发初期,通过预先识别出所有可能含有缺陷的软件模块来优化测试资源的分配。早期的缺陷预测研究大多集中于同项目缺陷预测,但同项目缺陷预测需要充足的历史数据,而在实际应用中可能需要预测的项目的历史数据较为稀缺,或这个项目是一个全新项目。因此跨项目缺陷预测问题成为当前软件缺陷预测领域内的一个研究热点,其研究挑战在于源项目与目标项目数据集间存在的分布差异性以及数据集内存在的类不平衡问题。受到基于搜索的软件工程思想的启发,论文提出了一种基于搜索的半监督集成跨项目软件缺陷预测方法S3EL。该方法首先通过调整训练集中各类数据的分布比例,构建出多个朴素贝叶斯基分类器,随后利用具有全局搜索能力的遗传算法,基于少量已标记目标实例对上述基分类器进行集成,并构建出最终的缺陷预测模型。在Promise数据集及AEEEM数据集上和多个经典的跨项目缺陷预测方法(Burak过滤法、Peters过滤法、TCA+、CODEP及HYDRA)进行了对比。以F1值作为评测指标,结果表明在大部分情况下,S3EL方法可以取得最好的预测性能。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号