Similar Documents
1.
As the information available to naïve users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge that these information mediators need to handle is the varying levels of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as QPIAD aim to mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when there are tuples containing missing values for multiple correlated attributes. In this paper, we present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. We learn this distribution in terms of Bayesian networks. Our approach involves mining/“learning” Bayesian networks from a sample of the database, and using them to do both imputation (predict a missing value) and query rewriting (retrieve relevant results with incompleteness on the query-constrained attributes, when the data sources are autonomous). We present empirical studies demonstrating that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayesian networks provide significantly higher classification accuracy and (ii) the relevant possible answers retrieved by queries reformulated using Bayesian networks have higher precision and recall than those based on AFDs, while keeping query processing costs manageable.
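A minimal sketch of the imputation half of such a pipeline, assuming the pgmpy library (class names vary slightly across versions); the file and column names ("make", "model", "body") are hypothetical, and the paper's query-rewriting step is not shown:

```python
# Sketch: learn a Bayesian network from a sample of the database, then
# impute a missing attribute as the most probable completion (MAP query).
import pandas as pd
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import HillClimbSearch, BicScore, MaximumLikelihoodEstimator
from pgmpy.inference import VariableElimination

sample = pd.read_csv("car_sample.csv")  # sample pulled from the autonomous source

# Structure learning ("mining") on the sample, then parameter fitting.
dag = HillClimbSearch(sample).estimate(scoring_method=BicScore(sample))
bn = BayesianNetwork(dag.edges())
bn.fit(sample, estimator=MaximumLikelihoodEstimator)

# An incomplete tuple defines a distribution over its completions; take
# the most probable value of the missing attribute given the known ones.
infer = VariableElimination(bn)
print(infer.map_query(variables=["body"],
                      evidence={"make": "Honda", "model": "Civic"}))
```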

2.
An Attribute-Weighted Naive Bayes Ensemble Classifier
To improve the classification accuracy and generalization ability of the naive Bayes classifier, a weighted Bayes ensemble method based on attribute relevance (WEBNC) is proposed. Each conditional attribute is assigned a weight according to its correlation with the decision attribute, and AdaBoost is then used to train the attribute-weighted BNC. The method was tested on 16 standard UCI data sets and compared with BNC, Bayesian networks, and a BNC trained with AdaBoost; the experimental results show that the proposed classifier achieves higher classification accuracy and better generalization ability.

3.
For learning a Bayesian network classifier, continuous attributes usually need to be discretized. But discretizing continuous attributes can cause information loss, introduce noise, and reduce the sensitivity of the class variable to changes in the attributes. In this paper, we use a Gaussian kernel function with a smoothing parameter to estimate the density of the attributes. A Bayesian network classifier with continuous attributes is built by dependency extension of the naive Bayes classifier. We also analyze the information each attribute provides to the class as a basis for this dependency extension. Experimental studies on UCI data sets show that Bayesian network classifiers using the Gaussian kernel function achieve good classification accuracy compared with other approaches when dealing with continuous attributes.
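A minimal sketch of the kernel-density core of this idea, in pure NumPy: a naive Bayes classifier whose class-conditional densities are Gaussian-kernel estimates with smoothing parameter h. The paper's dependency extension beyond naive Bayes is not shown, and the data layout is an assumption:

```python
# Sketch: naive Bayes with Gaussian-kernel class-conditional densities.
import numpy as np

def kde_log_density(x, train_vals, h):
    """Log of the Gaussian-kernel density estimate at scalar x."""
    z = (x - train_vals) / h
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return np.log(kernel.mean() / h + 1e-300)

def predict(x, X_by_class, priors, h=0.5):
    """x: attribute vector; X_by_class: {label: (n_c, d) array of training rows}."""
    scores = {
        c: np.log(priors[c])
           + sum(kde_log_density(x[j], Xc[:, j], h) for j in range(len(x)))
        for c, Xc in X_by_class.items()
    }
    return max(scores, key=scores.get)
```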

4.
Possibilistic networks, which are compact representations of possibility distributions, are powerful tools for representing and reasoning with uncertain and incomplete information in the framework of possibility theory. They resemble Bayesian networks but rely on possibility theory to deal with uncertainty, imprecision and incompleteness. While classification is a very useful task in many real-world applications, classification with possibilistic networks is not well investigated in general, and possibilistic classification inference under uncertain observations in particular. In this paper, we address the theoretical foundations of inference in possibilistic classifiers under uncertain inputs, and we propose a novel, efficient algorithm for inference in possibilistic network-based classification under uncertain observations. We start by studying and analyzing the counterpart of Jeffrey's rule in the framework of possibility theory. We then address the validity of the Markov-blanket criterion for possibilistic networks used for classification with uncertain inputs. Finally, we propose a novel algorithm suitable for possibilistic classifiers with uncertain observations that does not assume any independence relations between the observations. This algorithm guarantees the same results as if classification were performed using the possibilistic counterpart of Jeffrey's rule, and classification is achieved in polynomial time if the target variable is binary. The basic idea of our algorithm is to search only for totally plausible class instances, through a series of equivalent, polynomial transformations applied to the possibilistic classifier that take the uncertain observations into account.
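For reference, the probabilistic form of Jeffrey's rule, whose possibilistic counterpart the paper develops (the possibilistic version itself is not reproduced here): given a partition {E_i} whose revised weights P'(E_i) encode the uncertain observations, beliefs are revised as

$$P'(A) = \sum_{i} P(A \mid E_i)\, P'(E_i).$$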

5.
Land-cover classification based on multi-temporal satellite images for scenarios where parts of the data are missing due to, for example, clouds, snow or sensor failure has received little attention in the remote-sensing literature. The goal of this article is to introduce support vector machine (SVM) methods capable of handling missing data in land-cover classification. The novelty of this article consists of combining the powerful SVM regularization framework with a recent statistical theory of missing data, resulting in a new method where an SVM is trained for each missing data pattern, and a given incomplete test vector is classified by selecting the corresponding SVM model. The SVM classifiers are evaluated on Landsat Enhanced Thematic Mapper Plus (ETM+) images covering a scene of Norwegian mountain vegetation. The results show that the proposed SVM-based classifier improves the classification accuracy by 5–10% compared with single image classification. The proposed SVM classifier also outperforms recent non-parametric k-nearest neighbours (k-NN) and Parzen window density-based classifiers for incomplete data by about 3%. Moreover, since the resulting SVM classifier may easily be implemented using existing SVM libraries, we consider the new method to be an attractive choice for classification of incomplete data in remote sensing.
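A minimal sketch of the pattern-wise training scheme, assuming scikit-learn and NaN-encoded missing values; for simplicity it also assumes every test pattern was seen during training, and degenerate patterns are not handled:

```python
# Sketch: one SVM per missing-data pattern, dispatching each incomplete
# test vector to the model trained for its pattern of observed features.
import numpy as np
from sklearn.svm import SVC

def fit_per_pattern(X, y):
    models = {}
    for pattern in {tuple(~np.isnan(row)) for row in X}:
        cols = np.flatnonzero(np.array(pattern))   # features observed in pattern
        rows = ~np.isnan(X[:, cols]).any(axis=1)   # training rows complete in them
        models[pattern] = SVC().fit(X[np.ix_(rows, cols)], y[rows])
    return models

def classify(models, x):
    pattern = tuple(~np.isnan(x))
    cols = np.flatnonzero(np.array(pattern))
    return models[pattern].predict(x[cols].reshape(1, -1))[0]
```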

6.
Applied Soft Computing, 2007, 7(3): 1135-1143
Relations and relation matrices are important concepts in set theory and intelligent computation. Some general uncertainty measures for fuzzy relations are proposed by generalizing Shannon's information entropy. The proposed measures are then used to calculate the diversity of multiple classifier systems and the granularity of granulated problem spaces, respectively. As a diversity measure, it is shown that a fusion system whose classifiers have little mutual similarity produces a large uncertainty quantity, meaning that much complementary information is obtained with a diverse multiple classifier system. In granular computing, a “coarse–fine” order is introduced for a family of problem spaces with the proposed granularity measures: a finely granulated problem space yields a larger uncertainty quantity than a coarsely granulated one. Based on this observation, we employ the proposed measure to evaluate the significance of numerical attributes for classification. Each numerical attribute generates a fuzzy similarity relation over the sample space. We compute the conditional entropy of a numerical attribute, or a set of numerical attributes, relative to the decision; the greater the conditional entropy, the less important the attribute subset. A forward greedy search algorithm for numerical feature selection is constructed with the proposed measure. Experimental results show that the proposed method presents an efficient and effective solution for numerical feature analysis.
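A toy sketch of the forward greedy search, here on discretized features with the ordinary conditional entropy (the paper instead works with fuzzy similarity relations and its generalized entropy):

```python
# Sketch: greedily add the feature that minimizes H(decision | selected set).
import numpy as np
from collections import Counter

def cond_entropy(X_sel, y):
    """H(y | X_sel) for discrete features; rows of X_sel are grouped as tuples."""
    n = len(y)
    groups = {}
    for row, label in zip(map(tuple, X_sel), y):
        groups.setdefault(row, []).append(label)
    h = 0.0
    for labels in groups.values():
        p_block = len(labels) / n
        probs = np.array(list(Counter(labels).values())) / len(labels)
        h -= p_block * np.sum(probs * np.log2(probs))
    return h

def greedy_select(X, y, k):
    selected = []
    while len(selected) < k:
        best = min((j for j in range(X.shape[1]) if j not in selected),
                   key=lambda j: cond_entropy(X[:, selected + [j]], y))
        selected.append(best)
    return selected
```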

7.
Research on Restricted Gaussian Classification Networks
王双成, 高瑞, 杜瑞杰. 《自动化学报》, 2015, 41(12): 2164-2176
The naive Bayes classifier, which estimates marginal attribute densities with univariate Gaussian functions, cannot effectively exploit the dependency information between attributes, while the full Bayesian classifier, which estimates the joint attribute density with a multivariate Gaussian, tends to overfit the data and requires the difficult computation of high-order covariance matrices. To address both problems, we establish a decomposition-and-composition theorem for the joint attribute density and a theorem for computing attribute conditional densities, and on that basis combine attribute selection for the naive Bayes classifier, a classification-accuracy criterion, and greedy selection of attribute parent nodes to learn and optimize a restricted Gaussian classification network. Drawing on Bayesian network theory, we also analyze the composition of the information that attributes provide to the class in Bayes-derived classifiers. Experiments on continuous-attribute classification data from the UCI repository show that the optimized restricted Gaussian classification network achieves good classification accuracy.

8.
It has been widely accepted that classification accuracy can be improved by combining the outputs of multiple classifiers. However, how to combine multiple classifiers with various (potentially conflicting) decisions is still an open problem. A rich collection of classifier combination procedures, many of which are heuristic in nature, has been developed for this goal. In this brief, we describe a dynamic approach to combining classifiers that have expertise in different regions of the input space. To this end, we use local classifier accuracy estimates to weight classifier outputs. Specifically, we estimate the local recognition accuracy of each classifier near a query sample by utilizing its nearest neighbors, and then use these estimates to find the best weights of classifiers to label the query. The problem is formulated as a convex quadratic optimization problem, which returns optimal nonnegative classifier weights with respect to the chosen objective function; the weights ensure that locally most accurate classifiers are weighted more heavily when labeling the query sample. Experimental results on several data sets indicate that the proposed weighting scheme outperforms other popular classifier combination schemes, particularly on problems with complex decision boundaries. Hence, local classification-accuracy-based combination techniques are well suited for decision making when the classifiers are trained by focusing on different regions of the input space.
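A simplified sketch of the idea: weight each classifier by its accuracy on the query's k nearest validation neighbours. The weights here are just normalized local accuracies rather than the paper's convex-QP solution; scikit-learn-style classifiers with a predict() method are assumed:

```python
# Sketch: dynamic classifier combination via local accuracy estimates.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_weights(classifiers, X_val, y_val, query, k=10):
    nn = NearestNeighbors(n_neighbors=k).fit(X_val)
    idx = nn.kneighbors(query.reshape(1, -1), return_distance=False)[0]
    acc = np.array([(clf.predict(X_val[idx]) == y_val[idx]).mean()
                    for clf in classifiers])
    return acc / acc.sum() if acc.sum() > 0 else np.full(len(acc), 1 / len(acc))

def combine(classifiers, classes, X_val, y_val, query):
    w = local_weights(classifiers, X_val, y_val, query)
    votes = {c: 0.0 for c in classes}
    for wi, clf in zip(w, classifiers):
        votes[clf.predict(query.reshape(1, -1))[0]] += wi
    return max(votes, key=votes.get)
```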

9.
Classification accuracy is the most important performance measure of a classifier, and feature subset selection is an effective way to improve it. Existing feature subset selection methods mainly target static classifiers; feature subset selection for dynamic classifiers has received little study. We first define a dynamic naive Bayesian network classifier with continuous attributes and an evaluation criterion for dynamic classification accuracy, then build a feature subset selection method for the dynamic naive Bayesian network classifier on this basis, and carry out experiments and analysis using real macroeconomic time series data.

10.
Selective Bayesian Classifiers for Incomplete Data
Selective classifiers improve classification accuracy and efficiency by removing irrelevant and redundant attributes from a data set, and several such classifiers have been proposed. However, because handling incomplete data is complex, most of them target complete data only. In practice, data are usually incomplete for various reasons and contain many redundant or irrelevant attributes, which degrade classification performance just as they do for complete data. Research on selective classifiers for incomplete data is therefore an important topic. By analyzing how incomplete data have previously been handled during classification, we propose two selective Bayesian classifiers for incomplete data: SRBC and CBSRBC. SRBC is built on a robust Bayesian classifier, and CBSRBC extends SRBC using the chi-square (χ²) statistic. Experiments on 12 standard incomplete data sets show that both methods greatly reduce the number of attributes while significantly improving classification accuracy and stability. Overall, CBSRBC outperforms SRBC in classification accuracy and running efficiency, while SRBC requires fewer pre-specified thresholds.
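A minimal sketch of chi-square attribute screening in front of a Bayes classifier, in the spirit of CBSRBC (whose exact construction the abstract does not give); scikit-learn is assumed, along with complete rows and non-negative integer-encoded features, as chi2 and CategoricalNB require:

```python
# Sketch: select the k attributes with the strongest chi-square association
# with the class, then train a categorical naive Bayes classifier on them.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import CategoricalNB
from sklearn.pipeline import make_pipeline

clf = make_pipeline(SelectKBest(chi2, k=5), CategoricalNB())
# clf.fit(X_train, y_train); clf.predict(X_test)
```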

11.
The naive Bayes classifier cannot effectively exploit the dependency information between attributes, and existing dependency extensions emphasize efficiency, so the classification accuracy of the extended classifiers still leaves room for improvement. To address this, we extend the network dependencies of the naive Bayes classifier by estimating attribute densities with a Gaussian kernel function with a smoothing parameter, combined with a classification-accuracy criterion for the classifier and greedy selection of attribute parent nodes. Experiments on continuous-attribute classification data from UCI show that the dependency-extended classifier achieves good classification accuracy.

12.
To address the assignment of case attribute weights in case-based reasoning (CBR) classifiers, an iterative weight-adjustment method based on introspective learning is proposed. The method adjusts attribute weights according to the CBR classifier's results on training cases. Under a success-driven weight-learning strategy, if the current training case is classified correctly, the weights of matching attributes are first increased and those of mismatching attributes decreased according to a weight-adjustment formula; all weights are then normalized to obtain the new weights for that iteration. Experimental results show that the proposed CBR classifier improves accuracy over the traditional CBR classifier by 1.72%, 4.44%, and 1.05% on the UCI data sets PD, Heart, and WDBC, respectively. The success-driven introspective weight-adjustment method thus yields more reasonable weight assignments and, in turn, a more accurate CBR classifier.
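A minimal sketch of one success-driven update; the abstract does not give the adjustment formula, so the multiplicative step with rate eta is an assumption:

```python
# Sketch: success-driven attribute-weight update for a CBR classifier.
import numpy as np

def update_weights(w, matched, eta=0.05):
    """w: attribute weights; matched: boolean mask of attributes on which the
    retrieved case agreed with the query. Call only after a correct
    classification (success-driven strategy). eta is a hypothetical rate."""
    w = w.copy()
    w[matched] *= (1 + eta)   # reward matching attributes
    w[~matched] *= (1 - eta)  # penalize mismatching attributes
    return w / w.sum()        # renormalize to obtain this iteration's weights
```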

13.
Active search on graphs focuses on collecting certain labeled nodes (targets) given global knowledge of the network topology and its edge weights (encoding pairwise similarities) under a query budget constraint. However, in most current networks, nodes, network topology, network size, and edge weights are all initially unknown. In this work we introduce selective harvesting, a variant of active search where the next node to be queried must be chosen among the neighbors of the current queried node set; the available training data for deciding which node to query is restricted to the subgraph induced by the queried set (and their node attributes) and their neighbors (without any node or edge attributes). Therefore, selective harvesting is a sequential decision problem, where we must decide which node to query at each step. A classifier trained in this scenario can suffer from what we call a tunnel vision effect: without any recourse to independent sampling, the urge to only query promising nodes forces classifiers to gather increasingly biased training data, which we show significantly hurts the performance of active search methods and standard classifiers. We demonstrate that it is possible to collect a much larger set of targets by using multiple classifiers, not by combining their predictions as a weighted ensemble, but by switching between the classifiers used at each step, as a way to ease the tunnel vision effect. We discover that switching classifiers collects more targets by (a) diversifying the training data and (b) broadening the choices of nodes that can be queried in the future. This highlights an exploration, exploitation, and diversification trade-off in our problem that goes beyond the exploration and exploitation duality found in classic sequential decision problems. Based on these observations we propose D³TS, a method based on multi-armed bandits for non-stationary stochastic processes that enforces classifier diversity; it outperforms all competing methods on five of the seven real network datasets in our evaluation and exhibits comparable performance on the other two.

14.
The naive Bayes classifier is a popular classification technique for data mining and machine learning, and has been shown to be very effective on a variety of data classification problems. However, the strong assumption that all attributes are conditionally independent given the class is often violated in real-world applications. Numerous methods have been proposed to improve the performance of the naive Bayes classifier by alleviating the attribute independence assumption. However, violation of the independence assumption can increase the expected error. Another alternative is assigning weights to the attributes. In this paper, we propose a novel attribute-weighted naive Bayes classifier that assigns weights to the conditional probabilities. An objective function based on the structure of the naive Bayes classifier and the attribute weights is formulated, and the optimal weights are determined by a local optimization method using the quasisecant method. In the proposed approach, the naive Bayes classifier is taken as a starting point. We report the results of numerical experiments on several real-world binary classification data sets, which show the efficiency of the proposed method.
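For reference, a common form of attribute-weighted naive Bayes consistent with the abstract's description (the paper's exact objective and quasisecant optimization are not reproduced here) raises each conditional probability to its attribute weight:

$$c^{*} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n} P(a_i \mid c)^{\,w_i}$$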

15.
This paper addresses the measurement of uncertainty in rough set data analysis. We first construct a new type of conditional entropy of the target concept and conditional entropy of decision knowledge that take the degree of missingness of conditional attributes into account. On this basis, we propose a conditional-entropy-based technique for determining attribute weights and a minimum-conditional-entropy method for filling in missing attribute values, in order to solve incomplete multi-attribute decision problems in which the attribute weights are completely unknown. A case study shows that the method can effectively incorporate coarse-grained preliminary grading information and determine the values of decision factors objectively; it is highly interpretable, and the resulting decisions are more reasonable and effective.

16.
Classification is one of the most important tasks in machine learning, with a huge number of real-life applications. In many practical classification problems, the available information is partial or incomplete because some attribute values are missing for various reasons, and these missing values can significantly degrade the classification model. It is therefore crucial to develop effective techniques to impute them. A number of methods have been introduced for classification with missing values, but each has its problems. We therefore introduce an effective method for imputing missing values using the correlation among the attributes. Other correlation-based imputation methods work well only for categorical or only for numeric data, or are designed for one particular application; moreover, they fail if every record has at least one missing attribute. Our method, Model based Missing value Imputation using Correlation (MMIC), can effectively impute both categorical and numeric data. It fills in missing values attribute-wise with a model-based technique and reuses the fitted models effectively. Extensive performance analyses show that the proposed approach achieves high imputation performance and thus increases the efficacy of the classifier. The experimental results also show that our method outperforms various existing methods for handling missing data in classification.
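A minimal sketch of attribute-wise, correlation-guided imputation in the spirit of MMIC, for numeric columns only (the paper also handles categorical data, and its exact models are not specified in the abstract); recent pandas and scikit-learn are assumed:

```python
# Sketch: predict an incomplete attribute from its most correlated
# attributes with a simple regressor, and fill in the missing entries.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_numeric(df, target_col, n_predictors=3):
    corr = df.corr(numeric_only=True)[target_col].drop(target_col).abs()
    predictors = corr.nlargest(n_predictors).index.tolist()
    known = df[target_col].notna() & df[predictors].notna().all(axis=1)
    missing = df[target_col].isna() & df[predictors].notna().all(axis=1)
    model = LinearRegression().fit(df.loc[known, predictors],
                                   df.loc[known, target_col])
    df.loc[missing, target_col] = model.predict(df.loc[missing, predictors])
    return df
```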

17.
A rough wavelet network classifier model is proposed. The procedure is as follows: classification knowledge is acquired using rough set theory, and the rough wavelet network classifier is constructed via discretization of the training samples' attribute values, attribute reduction, and value reduction. The classifier effectively overcomes the poor noise tolerance and poor rule generalization of rough set rule-matching methods; at the same time, it simplifies the structure of the wavelet network and speeds up its training. The steps for applying the classifier to intrusion-data recognition and the simulation results are described in detail.

18.
The naive Bayes classifier is known to obtain good results with a simple procedure. The method is based on the independence of the attribute variables given the class variable. In real databases, where this hypothesis rarely holds, the classifier nevertheless continues to give good results. To improve its accuracy, various works have attempted to restructure the set of attributes, joining them so that the new sets are independent of each other even though the elements within each set are dependent. These methods are known as semi-naive Bayes classifiers. In this article, we present an application to classification of uncertainty measures on closed and convex sets of probability distributions, also called credal sets. We represent the information obtained from a database by a set of probability intervals (a credal set) via the imprecise Dirichlet model, and we use uncertainty measures on credal sets to reconstruct the set of attributes as described above, which enables us to improve the result of the naive Bayes classifier in a satisfactory way.
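For reference, the imprecise Dirichlet model used here yields, for a value observed $n_i$ times in $N$ samples with hyperparameter $s$, the standard probability interval

$$P(x_i) \in \left[\frac{n_i}{N+s},\ \frac{n_i+s}{N+s}\right],$$

and the collection of these intervals over all values forms the credal set on which the uncertainty measures operate.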

19.
The inherent uncertainty and incomplete information of the software development process present particular challenges for identifying fault-prone modules and for providing a preferred model early enough in a development cycle to guide software enhancement efforts effectively. Grey relational analysis (GRA), from grey system theory, is a well-known approach for generalizing estimates under small-sample and uncertain conditions. This paper examines the potential benefits of an early software-quality classification based on an improved grey relational classifier. Particle swarm optimization (PSO) is adopted to search for the best-fitting weights on software metrics in the GRA approach, deriving a classifier with a preferred balance of misclassification rates. We demonstrate the approach using data from the medical information system dataset. Empirical results show that the proposed approach provides a better balance of misclassification rates than grey relational classifiers without PSO, and that it outperforms the widely used classification and regression trees (CART) and C4.5 classifiers.
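For reference, the standard grey relational coefficient underlying GRA, whose per-metric aggregation weights the paper tunes with PSO: with deviation $\Delta_i(k) = \lvert x_0(k) - x_i(k) \rvert$ between the reference series $x_0$ and series $x_i$ at metric $k$, and distinguishing coefficient $\rho \in (0,1]$ (commonly 0.5),

$$\xi_i(k) = \frac{\min_j \min_l \Delta_j(l) + \rho \,\max_j \max_l \Delta_j(l)}{\Delta_i(k) + \rho \,\max_j \max_l \Delta_j(l)},$$

and the grey relational grade is a weighted mean of $\xi_i(k)$ over the metrics $k$.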

20.
This paper proposes a new approach based on missing value pattern discovery for classifying incomplete data. The approach is particularly designed for datasets with a small number of samples and a high percentage of missing values, where available missing value treatment approaches do not usually work well. Based on the pattern of the missing values, the proposed approach finds subsets of samples for which most of the features are available and trains a classifier for each subset; it then combines the outputs of the classifiers. Subset selection is translated into a clustering problem, allowing derivation of a mathematical framework for it. A trade-off is established between the computational complexity (number of subsets) and the accuracy of the overall classifier, and a numerical criterion is proposed for predicting the overall performance to manage this trade-off. The proposed method is applied to seven datasets from the popular University of California, Irvine data mining archive and an epilepsy dataset from Henry Ford Hospital, Detroit, Michigan (eight datasets in total). Experimental results show that the classification accuracy of the proposed method is superior to that of the widely used multiple imputation method and four other methods, and that the level of superiority depends on the pattern and percentage of missing values.
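A minimal sketch of the first step, grouping training samples by missing-value pattern so that one classifier can be trained per subset; the paper's clustering-based subset merging and output-combination rule are not reproduced, and the size threshold is a hypothetical stand-in for them:

```python
# Sketch: partition training data by exact missing-value pattern.
import numpy as np

def subsets_by_pattern(X, y, min_size=20):
    """Return {observed-feature mask: (rows, labels)} for patterns with
    enough samples; smaller patterns would be merged by the paper's
    clustering step rather than dropped."""
    observed = ~np.isnan(X)
    subsets = {}
    for pattern in {tuple(row) for row in observed}:
        rows = np.all(observed == np.array(pattern), axis=1)
        if rows.sum() >= min_size:
            cols = np.flatnonzero(np.array(pattern))
            subsets[pattern] = (X[np.ix_(rows, cols)], y[rows])
    return subsets
```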
