Similar literature
 Found 20 similar documents; search took 31 ms
1.
A Novel Bayes Model: Hidden Naive Bayes   Total citations: 1, self-citations: 0, citations by others: 1
Because learning an optimal Bayesian network classifier is an NP-hard problem, learning improved naive Bayes classifiers has attracted much attention from researchers. In this paper, we summarize the existing improved algorithms and propose a novel Bayes model: hidden naive Bayes (HNB). In HNB, a hidden parent is created for each attribute which combines the influences from all other attributes. We experimentally test HNB in terms of classification accuracy, using the 36 UCI data sets selected by Weka, and compare it to naive Bayes (NB), selective Bayesian classifiers (SBC), naive Bayes tree (NBTree), tree-augmented naive Bayes (TAN), and averaged one-dependence estimators (AODE). The experimental results show that HNB significantly outperforms NB, SBC, NBTree, TAN, and AODE. In many data mining applications, accurate class probability estimation and ranking are also desirable. We study the class probability estimation and ranking performance, measured by conditional log likelihood (CLL) and the area under the ROC curve (AUC), respectively, of naive Bayes and its improved models, such as SBC, NBTree, TAN, and AODE, and then compare HNB to them in terms of CLL and AUC. Our experiments show that HNB also significantly outperforms all of them.

2.
The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.

3.
This research focused on investigating and benchmarking several high-performance classifiers (J48, random forests, naive Bayes, KStar, and artificial immune recognition systems) for software fault prediction with limited fault data. We also studied a recent semi-supervised classification algorithm called YATSI (Yet Another Two Stage Idea), using each classifier in its first stage. YATSI is a meta algorithm which allows different classifiers to be applied in the first stage. Furthermore, we proposed a semi-supervised classification algorithm which applies the artificial immune systems paradigm. Experimental results showed that YATSI does not always improve the performance of naive Bayes when unlabelled data are used together with labelled data. According to the experiments we performed, the naive Bayes algorithm is the best choice for building a semi-supervised fault prediction model for small data sets, and YATSI may improve the performance of naive Bayes for large data sets. In addition, the YATSI algorithm improved the performance of all the classifiers except naive Bayes on all the data sets.

4.
In real-world data mining applications, it is often the case that unlabeled instances are abundant, while available labeled instances are very limited. Thus, semi-supervised learning, which attempts to benefit from a large amount of unlabeled data together with labeled data, has attracted much attention from researchers. In this paper, we propose a very fast and yet highly effective semi-supervised learning algorithm, which we call Instance Weighted Naive Bayes (IWNB). IWNB first trains a naive Bayes classifier using the labeled instances only. The trained naive Bayes classifier is then used to estimate the class membership probabilities of the unlabeled instances. Next, the estimated class membership probabilities are used to label and weight the unlabeled instances. Finally, a naive Bayes classifier is trained again using both the originally labeled data and the (newly labeled and weighted) unlabeled data. Our experimental results based on a large number of UCI data sets show that IWNB often improves the classification accuracy of the original naive Bayes when available labeled data are very limited.
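The four IWNB steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes categorical attributes, Laplace smoothing, and that each unlabeled instance receives its most probable class as label and that class's probability as weight. The abstract does not specify these details, and the function names (`train_nb`, `predict_proba`, `iwnb`) are hypothetical.

```python
import math
from collections import defaultdict


def train_nb(rows, labels, weights, classes, n_values):
    """Weighted categorical naive Bayes; n_values[j] = number of values of attribute j."""
    class_w = {c: 0.0 for c in classes}
    feat_w = defaultdict(float)  # (class, attribute index, value) -> summed weight
    for x, y, w in zip(rows, labels, weights):
        class_w[y] += w
        for j, v in enumerate(x):
            feat_w[(y, j, v)] += w
    return class_w, feat_w, sum(class_w.values()), classes, n_values


def predict_proba(model, x):
    """Class membership probabilities with Laplace smoothing, computed in log space."""
    class_w, feat_w, total, classes, n_values = model
    log_p = {}
    for c in classes:
        lp = math.log((class_w[c] + 1.0) / (total + len(classes)))
        for j, v in enumerate(x):
            lp += math.log((feat_w.get((c, j, v), 0.0) + 1.0) / (class_w[c] + n_values[j]))
        log_p[c] = lp
    m = max(log_p.values())
    z = sum(math.exp(v - m) for v in log_p.values())
    return {c: math.exp(log_p[c] - m) / z for c in classes}


def iwnb(labeled_x, labeled_y, unlabeled_x, classes, n_values):
    # Step 1: train on the labeled instances only (unit weights).
    base = train_nb(labeled_x, labeled_y, [1.0] * len(labeled_x), classes, n_values)
    # Steps 2-3: estimate class probabilities, then label and weight each
    # unlabeled instance (assumption: label = argmax, weight = its probability).
    new_y, new_w = [], []
    for x in unlabeled_x:
        p = predict_proba(base, x)
        best = max(p, key=p.get)
        new_y.append(best)
        new_w.append(p[best])
    # Step 4: retrain on the labeled data plus the weighted, newly labeled data.
    return train_nb(labeled_x + unlabeled_x, labeled_y + new_y,
                    [1.0] * len(labeled_x) + new_w, classes, n_values)
```

Because the second training pass only accumulates weighted counts, the whole procedure costs two passes over the data, which is consistent with the abstract's description of the algorithm as very fast.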

5.
Bayesian networks are important knowledge representation tools for handling uncertain pieces of information. The success of these models is strongly related to their capacity to represent and handle dependence relations. Some forms of Bayesian networks have been successfully applied in many classification tasks. In particular, naive Bayes classifiers have been used for intrusion detection and alert correlation. This paper analyses the advantage of adding expert knowledge to probabilistic classifiers in the context of intrusion detection and alert correlation. As examples of probabilistic classifiers, we consider the well-known Naive Bayes, Tree Augmented Naive Bayes (TAN), Hidden Naive Bayes (HNB) and decision tree classifiers. Our approach can be applied to any classifier whose outcome is a probability distribution over a set of classes (or decisions). In particular, we study how additional expert knowledge such as “it is expected that 80% of traffic will be normal” can be integrated into classification tasks. Our aim is to revise probabilistic classifiers’ outputs in order to fit expert knowledge. Experimental results show that our approach improves existing results on different benchmarks from the intrusion detection and alert correlation areas.

6.
The naive Bayes model has proven to be a simple yet effective model, which is very popular for pattern recognition applications such as data classification and clustering. This paper explores the possibility of using this model for multidimensional data visualization. To achieve this, a new learning algorithm called naive Bayes self-organizing map (NBSOM) is proposed to enable the naive Bayes model to perform topographic mappings. The training is carried out by means of an online expectation maximization algorithm with a self-organizing principle. The proposed method is compared with principal component analysis, self-organizing maps, and generative topographic mapping on two benchmark data sets and a real-world image processing application. Overall, the results show the effectiveness of NBSOM for multidimensional data visualization.

7.
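A Restricted Double-Level Bayesian Classification Model   Total citations: 28, self-citations: 1, citations by others: 28
The naive Bayes classification model is a simple and effective classification method, but its attribute independence assumption prevents it from expressing the dependencies that exist among attribute variables, which impairs its classification performance. By analyzing the classification principle of the Bayesian classification model and a variant form of Bayes' theorem, a new classification model based on Bayes' theorem, DLBAN (double-level Bayesian network augmented naive Bayes), is proposed. The model establishes dependencies among attributes by selecting key attributes. The method is compared experimentally with the naive Bayes classifier and the TAN (tree augmented naive Bayes) classifier. The experimental results show that DLBAN achieves higher classification accuracy on most data sets.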

8.
Many algorithms have been proposed for the machine learning task of classification. One of the simplest methods, the naive Bayes classifier, has often been found to give good performance despite the fact that its underlying assumptions (of independence and a normal distribution of the variables) are perhaps violated. In previous work, we applied naive Bayes and other standard algorithms to a breast cancer database from Nottingham City Hospital in which the variables are highly non-normal and found that the algorithm performed well when predicting a class that had been derived from the same data. However, when we then applied naive Bayes to predict an alternative clinical variable, it performed much worse than other techniques. This motivated us to propose an alternative method, based on naive Bayes, which removes the requirement for the variables to be normally distributed, but retains the essential structure and other underlying assumptions of the method. We tested our novel algorithm on our breast cancer data and on three UCI datasets which also exhibited strong violations of normality. We found our algorithm outperformed naive Bayes in all four cases and outperformed multinomial logistic regression (MLR) in two cases. We conclude that our method offers a competitive alternative to MLR and naive Bayes when dealing with data sets in which non-normal distributions are observed.

9.
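Speaker recognition is in essence pattern classification. Among traditional classifier algorithms, the main drawback of parametric model methods is that the presumed form of the probability distribution function does not necessarily fit the data to be classified. Nonparametric model methods, such as the PNN classifier, can effectively overcome this drawback, but their huge memory cost and low classification speed make PNN almost infeasible for classifying large, high-dimensional data samples. FCM has good fuzzy clustering ability but cannot directly give probabilistic classification results. The FCM-PNN classifier proposed in this paper performs probabilistic classification with PNN on top of FCM clustering, based on Bayesian confidence. It combines the strengths of FCM clustering and PNN probabilistic classification while overcoming the limitations of traditional parametric model classification and of FCM clustering. Experimental results confirm that the FCM-PNN classifier offers high classification accuracy, high speed, and the ability to reveal details.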

10.
Because they are fast, easy to implement and relatively effective, some state-of-the-art naive Bayes text classifiers with the strong assumption of conditional independence among attributes, such as multinomial naive Bayes, complement naive Bayes and the one-versus-all-but-one model, have received a great deal of attention from researchers in the domain of text classification. In this article, we revisit these naive Bayes text classifiers and empirically compare their classification performance on a large number of widely used text classification benchmark datasets. Then, we propose a locally weighted learning approach to these naive Bayes text classifiers. We call our new approach locally weighted naive Bayes text classifiers (LWNBTC). LWNBTC weakens the attribute conditional independence assumption made by these naive Bayes text classifiers by applying the locally weighted learning approach. The experimental results show that our locally weighted versions significantly outperform these state-of-the-art naive Bayes text classifiers in terms of classification accuracy.

11.
The naive Bayes classifier is known to obtain good results with a simple procedure. The method is based on the independence of the attribute variables given the variable to be classified. In real databases, where this hypothesis is not verified, this classifier continues to give good results. In order to improve the accuracy of the method, various works have been carried out in an attempt to reconstruct the set of the attributes and to join them so that there is independence between the new sets although the elements within each set are dependent. These methods are included in the ones known as semi-naive Bayes classifiers. In this article, we present an application of uncertainty measures on closed and convex sets of probability distributions, also called credal sets, in classification. We represent the information obtained from a database by a set of probability intervals (a credal set) via the imprecise Dirichlet model and we use uncertainty measures on credal sets in order to reconstruct the set of attributes, such as those mentioned, which shall enable us to improve the result of the naive Bayes classifier in a satisfactory way.
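The probability intervals mentioned in this abstract can be written out explicitly. As a sketch in our own notation (the symbols $n_i$, $N$ and $s$ do not appear in the abstract): if value $x_i$ of a variable is observed $n_i$ times among $N$ observations, and $s > 0$ is the hyperparameter of the imprecise Dirichlet model, the model assigns the interval

```latex
P(x_i) \in \left[ \frac{n_i}{N + s},\; \frac{n_i + s}{N + s} \right]
```

The interval has width $s / (N + s)$, so it narrows as more data are observed; the set of all distributions compatible with these intervals forms the credal set.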

12.
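Research on Restricted Gaussian Classification Networks   Total citations: 1, self-citations: 0, citations by others: 1
Wang Shuangcheng (王双成), Gao Rui (高瑞), Du Ruijie (杜瑞杰). Acta Automatica Sinica (《自动化学报》), 2015, 41(12): 2164-2176
The naive Bayes classifier, which estimates attribute marginal densities with univariate Gaussian functions, cannot make effective use of dependency information among attributes, while the full Bayes classifier, which estimates the joint attribute density with a multivariate Gaussian function, tends to overfit the data and requires high-order covariance matrices that are very difficult to compute. To address this, on the basis of a decomposition and combination theorem for the joint attribute density and a computation theorem for attribute conditional densities, the paper combines naive Bayes attribute selection, a classification accuracy criterion, and greedy selection of attribute parent nodes to learn and optimize restricted Gaussian classification networks. Following Bayesian network theory, it also analyzes the composition of the information that attributes provide for the class in Bayesian derivative classifiers. Experiments on continuous-attribute classification data from the UCI repository show that the optimized restricted Gaussian classification network has good classification accuracy.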

13.
The nonlinear dynamic characteristics of expansion and contraction and the sequential time-varying features of the syllable pronunciations greatly complicate the tasks of automatic speech recognition. Each syllable is represented by a sequence of vectors of linear predictive coding cepstra (LPCC). Even if the same speaker utters the same syllable, the duration of stable parts of the sequence of LPCC vectors changes every time. Therefore, the duration of stable parts is contracted such that the compressed speech waveform has the same length. We propose five different simple techniques to contract the stable parts of the sequence of LPCC vectors. A simplified Bayes decision rule with a weighted variance is used to classify 408 speaker-dependent Mandarin syllables. For the 408 speaker-dependent Mandarin syllables, the recognition rate is 94.36%, compared with 79.78% obtained by using hidden Markov models (HMM). A recognition rate of 98.16% is achieved within the top 3 candidates. The features proposed in this paper to represent the syllables are simple and easy to extract. The computation for feature extraction and classification is much faster than with HMM or any other known techniques.

14.
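Li Kewen (李克文), Yang Lei (杨磊), Liu Wenying (刘文英), Liu Lu (刘璐), Liu Hongtai (刘洪太). Computer Science (《计算机科学》), 2015, 42(9): 249-252, 267
Classification of imbalanced data is common across many application domains and has become a research hotspot in data mining and machine learning. A new imbalanced data classification method, RSBoost, is proposed to address the low minority-class recognition rate and low classification efficiency of traditional classification methods. The method oversamples the minority class with SMOTE, then randomly undersamples the whole data set to reduce its imbalance, and combines this with the Boosting algorithm to classify the data. Experiments compare the classification performance and efficiency of five methods on several public data sets; the results show that the proposed method achieves a high classification recognition rate and high classification efficiency.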
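The two resampling steps described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the RSBoost implementation: the function names (`smote`, `rebalance`) and parameters (`n_new`, `n_keep`, `k`) are hypothetical, SMOTE is reduced to its core interpolation between a minority point and one of its nearest minority neighbours, and the Boosting stage that would consume the resampled data is omitted.

```python
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by interpolating between a
    minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


def rebalance(majority, minority, n_new, n_keep, seed=0):
    """SMOTE oversampling of the minority class, followed by random
    undersampling of the whole data set down to n_keep instances."""
    rng = random.Random(seed)
    labelled = [(x, 0) for x in majority]
    labelled += [(x, 1) for x in minority + smote(minority, n_new, rng=rng)]
    return rng.sample(labelled, min(n_keep, len(labelled)))
```

In this sketch the balance improvement comes from the SMOTE step, while undersampling the whole set shrinks the training data, which is one plausible reading of the abstract's claim about classification efficiency.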

15.
Technical Note: Naive Bayes for Regression   Total citations: 1, self-citations: 0, citations by others: 1
Frank, Eibe; Trigg, Leonard; Holmes, Geoffrey; Witten, Ian H. Machine Learning, 2000, 41(1): 5-25
Despite its simplicity, the naive Bayes learning scheme performs well on most classification tasks, and is often significantly more accurate than more sophisticated methods. Although the probability estimates that it produces can be inaccurate, it often assigns maximum probability to the correct class. This suggests that its good performance might be restricted to situations where the output is categorical. It is therefore interesting to see how it performs in domains where the predicted value is numeric, because in this case, predictions are more sensitive to inaccurate probability estimates. This paper shows how to apply the naive Bayes methodology to numeric prediction (i.e., regression) tasks by modeling the probability distribution of the target value with kernel density estimators, and compares it to linear regression, locally weighted linear regression, and a method that produces model trees (decision trees with linear regression functions at the leaves). Although we exhibit an artificial dataset for which naive Bayes is the method of choice, on real-world datasets it is almost uniformly worse than locally weighted linear regression and model trees. The comparison with linear regression depends on the error measure: for one measure naive Bayes performs similarly, while for another it is worse. We also show that standard naive Bayes applied to regression problems by discretizing the target value performs similarly badly. We then present empirical evidence that isolates naive Bayes' independence assumption as the culprit for its poor performance in the regression setting. These results indicate that the simplistic statistical assumption that naive Bayes makes is indeed more restrictive for regression than for classification.

16.
With increasing Internet connectivity and traffic volume, recent intrusion incidents have reemphasized the importance of network intrusion detection systems for combating increasingly sophisticated network attacks. Techniques such as pattern recognition and the data mining of network events are often used by intrusion detection systems to classify the network events as either normal events or attack events. Our research study claims that the Hidden Naïve Bayes (HNB) model can be applied to intrusion detection problems that suffer from dimensionality, highly correlated features and high network data stream volumes. HNB is a data mining model that relaxes the Naïve Bayes method’s conditional independence assumption. Our experimental results show that the HNB model exhibits a superior overall performance in terms of accuracy, error rate and misclassification cost compared with the traditional Naïve Bayes model, leading extended Naïve Bayes models and the Knowledge Discovery and Data Mining (KDD) Cup 1999 winner. Our model performed better than other leading state-of-the-art models, such as SVM, in predictive accuracy. The results also indicate that our model significantly improves the accuracy of detecting denial-of-service (DoS) attacks.

17.
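Zhu Xin (朱欣), Zhao Lei (赵雷), Yang Jiwen (杨季文). Computer Engineering (《计算机工程》), 2011, 37(12): 101-103
To handle the large volume and high dynamic variability of network traffic data, a network traffic identification method is proposed based on a data stream mining technique, the concept-adapting very fast decision tree (CVFDT). CVFDT is suited to streaming data: it updates its model as the distribution of data samples changes and can handle concept drift. Experiments on a network flow data set with 12 optimal attribute features show that, compared with the naive Bayes method, the CVFDT method achieves better classification results and stability.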

18.
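Text classification algorithms typified by naive Bayes commonly suffer from uniform feature weights and reliance on a single indicator. To address this, an improved naive Bayes algorithm based on TF-IDF, the TF-IDF-DL naive Bayes algorithm, is proposed. Building on TF-IDF, it introduces a decentralized term-frequency factor and a feature-word position factor to make the feature weights more accurate. To verify its effect, experiments were run on the Sogou news data set from Sogou Labs. The results show that introducing TF-IDF-DL into naive Bayes classification yields good accuracy, recall, and F1 score in text classification; compared with TF-IDF-dist Bayes, a comparable scheme from Chinese research, classification accuracy improves by 8.6%, recall by 11.7%, and F1 score by 7.4%. The algorithm therefore improves classification performance and also achieves reasonably good results on categories that are hard to distinguish.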

19.
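A Naive Bayes Text Classifier Based on Bootstrap Averaging   Total citations: 1, self-citations: 1, citations by others: 1
To address the low classification accuracy caused by the large bias in probability estimates when a naive Bayes text classifier is trained on word clusters, the method starts from the word clusters produced by a distributional clustering algorithm, builds ordered word subsequences according to the mutual information between words and clusters, constructs sample sets of comparable size from the sequences by random sampling with replacement, and uses the average of the estimated parameters as the trained parameters for classifying unseen texts. Experiments on public text data sets show that the proposed training method achieves higher classification accuracy than the traditional naive Bayes training method, and that the procedure is relatively simple.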
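The resample-and-average step at the core of this abstract can be illustrated with a short sketch. It is deliberately simplified and makes assumptions the abstract does not state: the input is a flat list of word-cluster tokens for one class, each parameter is a Laplace-smoothed relative frequency, and the names `estimate`, `bootstrap_average`, and `n_rounds` are hypothetical. The construction of ordered word subsequences from mutual information is not reproduced.

```python
import random


def estimate(sample, vocab):
    """Laplace-smoothed relative frequency of each vocabulary item in `sample`."""
    total = len(sample) + len(vocab)
    return {w: (sample.count(w) + 1) / total for w in vocab}


def bootstrap_average(items, vocab, n_rounds=50, seed=0):
    """Average the parameter estimates obtained from n_rounds bootstrap
    resamples, each drawn with replacement and the same size as `items`."""
    rng = random.Random(seed)
    sums = {w: 0.0 for w in vocab}
    for _ in range(n_rounds):
        resample = [rng.choice(items) for _ in items]  # sampling with replacement
        est = estimate(resample, vocab)
        for w in vocab:
            sums[w] += est[w]
    return {w: s / n_rounds for w, s in sums.items()}
```

Each resample yields a proper probability distribution over the vocabulary, so the averaged parameters also sum to one and can be used directly as the conditional probabilities of a naive Bayes classifier.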

20.
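To address the low classification accuracy caused by the large bias in probability estimates when a naive Bayes text classifier is trained on word clusters, the method starts from the word clusters produced by a distributional clustering algorithm, builds ordered word subsequences according to the mutual information between words and clusters, constructs sample sets of comparable size from the sequences by random sampling with replacement, and uses the average of the estimated parameters as the trained parameters for classifying unseen texts. Experiments on public text data sets show that the proposed training method achieves higher classification accuracy than the traditional naive Bayes training method, and that the procedure is relatively simple.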
