Found 20 similar articles (search time: 0 ms)
1.
Data uncertainty can be caused by numerous factors such as measurement precision limitations, network latency, data staleness and sampling errors. When mining knowledge from emerging applications such as sensor networks or location-based services, data uncertainty should be handled cautiously to avoid erroneous results. In this paper, we apply probabilistic and statistical theory to uncertain data and develop a novel method to calculate the conditional probabilities in Bayes' theorem. Based on that, we propose a novel Bayesian classification algorithm for uncertain data. The experimental results show that the proposed method classifies uncertain data with potentially higher accuracy than the Naive Bayesian approach. It also has more stable performance than the existing extended Naive Bayesian method.
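As a minimal sketch of how attribute uncertainty can enter a Bayesian classifier (this is not the method of the paper above, whose details the abstract does not give), assume that each class-conditional density is Gaussian and that each observed value carries independent Gaussian measurement noise with a known standard deviation. Under those assumptions the expected likelihood is again Gaussian, with the two variances added. All function and parameter names below are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def classify_uncertain(obs, noise_sd, class_params, priors):
    """Naive Bayes over noisy observations (illustrative assumption only).

    obs          : observed value per attribute
    noise_sd     : measurement-noise standard deviation per attribute
    class_params : {label: [(mean, var), ...]}, one pair per attribute
    priors       : {label: prior probability}
    Returns (most probable label, {label: posterior}).
    """
    log_scores = {}
    for label, params in class_params.items():
        score = math.log(priors[label])
        for x, sd, (mean, var) in zip(obs, noise_sd, params):
            # Gaussian noise convolved with a Gaussian class density:
            # the variances add.
            score += log_gaussian(x, mean, var + sd ** 2)
        log_scores[label] = score
    # Normalise in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {k: math.exp(v - top) for k, v in log_scores.items()}
    total = sum(unnorm.values())
    posteriors = {k: v / total for k, v in unnorm.items()}
    return max(posteriors, key=posteriors.get), posteriors
```

In this sketch a larger noise standard deviation flattens the effective class densities, so the posterior for the winning class moves closer to its prior.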
2.
Kapil Keshao Wankhade Kalpana C. Jondhale Vijaya R. Thool 《Knowledge and Information Systems》2018,56(1):197-221
Learning from rare-class data is a challenging classification problem. Rare-class, or imbalanced-class, learning is a common problem in many real-world applications, and much research has therefore focused on it. Rare-class data often produce poor results because the majority class dominates overall accuracy at the expense of the minority class. Many methods have been proposed to handle the imbalanced, rare or skewed class problem. This paper proposes a hybrid method, i.e. one based on both classification and clustering, for solving the rare-class problem. The hybrid method uses k-means, ensemble and divide-and-merge methods, and aims to improve the detection rate of every class. The proposed method is tested on real datasets. The experimental results show that it performs well compared with other algorithms.
3.
《Knowledge》2007,20(3):220-224
In many applications, an enormous amount of unlabeled data is available at little cost. Therefore, it is natural to ask whether we can take advantage of these unlabeled data in classification learning. In this paper, we analyze the role of unlabeled data in the context of naive Bayesian learning. Experimental results show that including unlabeled data as part of the training data can significantly improve classification accuracy.
4.
Integrating ontological modelling and Bayesian inference for pattern classification in topographic vector data (cited by: 1; self-citations: 0; citations by others: 1)
Patrick Lüscher Robert Weibel Dirk Burghardt 《Computers, Environment and Urban Systems》2009,33(5):363
This paper presents an ontology-driven approach for spatial database enrichment in support of map generalisation. Ontology-driven spatial database enrichment is a promising means to provide better transparency, flexibility and reusability in comparison to purely algorithmic approaches. Geographic concepts manifested in spatial patterns are formalised by means of ontologies that are used to trigger appropriate low-level pattern recognition techniques. The paper focuses on inference in the presence of vagueness, which is common in definitions of spatial phenomena, and on the influence of the complexity of spatial measures on classification accuracy. The concept of the English terraced house serves as an example to demonstrate how geographic concepts can be modelled in an ontology for spatial database enrichment. Owing to its good integration into ontologies and its ability to deal with vague definitions, supervised Bayesian inference is used for inferring complex concepts. The approach is validated in experiments using large vector datasets representing buildings of four different cities. We compare classification results obtained with the proposed approach to results produced by a more traditional ontology approach. The proposed approach performed considerably better than the traditional ontology approach. Besides clarifying the benefits of using ontologies in spatial database enrichment, our research demonstrates that Bayesian networks are a suitable method to integrate vague knowledge about conceptualisations in cartography and GIScience.
5.
Morales DA Bengoetxea E Larrañaga P García M Franco Y Fresnada M Merino M 《Computer methods and programs in biomedicine》2008,90(2):104-116
In vitro fertilization (IVF) is a medically assisted reproduction technique that enables infertile couples to achieve successful pregnancy. Given the uncertainty of the treatment, we propose an intelligent decision support system based on supervised classification by Bayesian classifiers to aid in the selection of the most promising embryos that will form the batch to be transferred to the woman's uterus. The aim of the supervised classification system is to improve the overall success rate of each IVF treatment in which a batch of embryos is transferred each time, where success is achieved when implantation (i.e. pregnancy) is obtained. For ethical reasons, different legislative restrictions apply to this technique in every country. In Spain, legislation allows a maximum of three embryos to form each transfer batch. As a result, clinicians prefer to select the embryos by non-invasive embryo examination based on simple methods and observation focused on the morphology and dynamics of embryo development after fertilization. This paper proposes the application of Bayesian classifiers to this embryo selection problem in order to provide a decision support system that allows a more accurate selection than the current procedures, which rely fully on the expertise and experience of embryologists. For this, we propose to take into consideration a reduced subset of feature variables related to embryo morphology and clinical data of patients, and to induce Bayesian classification models from these data. Results obtained by applying a filter technique to choose the subset of variables, and the performance of Bayesian classifiers using them, are presented.
6.
Online classification is important for real-time data sequence classification. Its most challenging problem is that the class priors may vary for non-stationary data sequences. Most current online data sequence classification algorithms assume that the class labels of some newly arrived data samples are known and retrain the classifier accordingly. Unfortunately, this assumption is often violated in real applications. But if we were able to estimate the class priors on the test data sequence accurately, we could adjust the classifier without retraining it while preserving reasonable accuracy. There has been some work on class prior estimation for classifying static data sets using the offline iterative EM algorithm, which has proved to be quite effective for adjusting the classifier. Inspired by the offline iterative EM algorithm for static data sets, in this paper we propose an online incremental EM algorithm to estimate the class priors along the data sequence. The classifier is adjusted accordingly to keep pace with the varying distribution. The proposed online algorithm is more computationally efficient because it scans the sequence only once. Experimental results show that the proposed algorithm indeed performs better than the conventional offline iterative EM algorithm when the class priors are non-stationary.
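The offline iterative EM baseline mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the standard prior-shift EM update, not the paper's online incremental variant: it assumes the classifier outputs posterior probabilities under the training priors, and that only the class priors (not the class-conditional densities) differ on the test data. Function names are hypothetical.

```python
def adjust_posterior(posterior, train_priors, new_priors):
    """Re-weight one posterior vector for new class priors and renormalise."""
    weighted = [p * new / old
                for p, old, new in zip(posterior, train_priors, new_priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


def estimate_priors_em(posteriors, train_priors, max_iter=100, tol=1e-8):
    """Estimate test-set class priors from classifier posteriors by EM.

    posteriors   : list of per-sample posterior vectors from the classifier
    train_priors : class priors the classifier was trained under
    """
    priors = list(train_priors)
    for _ in range(max_iter):
        # E-step: posteriors under the current prior estimate.
        adjusted = [adjust_posterior(p, train_priors, priors)
                    for p in posteriors]
        # M-step: new prior of each class is its mean adjusted posterior.
        new_priors = [sum(col) / len(adjusted) for col in zip(*adjusted)]
        change = max(abs(a - b) for a, b in zip(new_priors, priors))
        priors = new_priors
        if change < tol:
            break
    return priors
```

Once the priors are estimated, `adjust_posterior` can be applied to each test sample to correct the classifier's output without retraining it. Note that this version makes repeated passes over the whole data set, which is the cost the single-scan online algorithm is said to avoid.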
7.
《Computers & Mathematics with Applications》2003,45(4-5):737-748
We present ELEM2, a machine learning system that induces classification rules from a set of data based on a heuristic search over a hypothesis space. ELEM2 is distinguished from other rule induction systems in three aspects. First, it uses a new heuristic function to guide the heuristic search. The function reflects the degree of relevance of an attribute-value pair to a target concept and leads to selection of the most relevant pairs for formulating rules. Second, ELEM2 handles inconsistent training examples by defining an unlearnable region of a concept based on the probability distribution of that concept in the training data. The unlearnable region is used as a stopping criterion for the concept learning process, which resolves conflicts without removing inconsistent examples. Third, ELEM2 employs a new rule quality measure in its post-pruning process to prevent rules from overfitting the data. The rule quality formula measures the extent to which a rule can discriminate between the positive and negative examples of a class. We describe features of ELEM2, its rule induction algorithm and its classification procedure. We report experimental results that compare ELEM2 with C4.5 and CN2 on a number of datasets.
8.
Sandra Ramos Antónia Amaral Turkman Marília Antunes 《Computational statistics & data analysis》2010,54(8):2012-2020
A Bayesian optimal screening method (BOSc) is proposed to classify an individual into one of two groups, based on the observation of pairs of covariates, namely the expression levels of pairs of genes (previously selected by a specific method from among the thousands of genes present in the microarray) measured using DNA microarray technology. The method is general and can be applied to any correlated pair of screening variables, either with a bivariate normal distribution or which can be transformed into a bivariate normal. Results on microarray data sets (Leukemia, Prostate and Breast) show that BOSc performance is competitive with, and in some cases significantly better than, quadratic and linear discriminant analyses and support vector machine classifiers. BOSc provides flexible parametric decision rules. Finally, the screening classifier allows the calculation of operating characteristics while addressing information about the prevalence of the disease or type of disease, which is an advantage over other classification methods.
9.
Bayesian networks for imputation in classification problems (cited by: 1; self-citations: 0; citations by others: 1)
Estevam R. Hruschka Jr. Eduardo R. Hruschka Nelson F. F. Ebecken 《Journal of Intelligent Information Systems》2007,29(3):231-252
Missing values are an important problem in data mining. In order to tackle this problem in classification tasks, we propose two imputation methods based on Bayesian networks. These methods are evaluated in the context of both prediction and classification tasks. We compare the obtained results with those achieved by classical imputation methods (Expectation–Maximization, Data Augmentation, Decision Trees, and Mean/Mode). Our simulations were performed by means of four datasets (Congressional Voting Records, Mushroom, Wisconsin Breast Cancer and Adult), which are benchmarks for data mining methods. Missing values were simulated in these datasets by eliminating some known values. Thus, it is possible to assess the prediction capability of an imputation method by comparing the original values with the imputed ones. In addition, we propose a methodology to estimate the bias inserted by imputation methods in classification tasks. To this end, we use four classifiers (One Rule, Naïve Bayes, J4.8 Decision Tree and PART) to evaluate the employed imputation methods in classification scenarios. The computing times consumed to perform imputations are also reported. Simulation results in terms of prediction, classification, and computing times allow us to perform several analyses, leading to interesting conclusions. Bayesian networks have been shown to be competitive with classical imputation methods.
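The evaluation protocol described above (delete some known values, impute them, then compare the imputed values with the originals) can be sketched as follows. This is only an illustration under simple assumptions: it uses mode imputation on a single categorical column as a stand-in for the imputation method, not the Bayesian network methods of the paper, and all function names are hypothetical.

```python
import random
from collections import Counter


def mask_values(column, fraction, rng):
    """Return a copy of column with a fraction of entries set to None,
    together with the sorted indices that were masked."""
    n_mask = int(round(fraction * len(column)))
    masked_idx = sorted(rng.sample(range(len(column)), n_mask))
    masked = list(column)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def impute_mode(column):
    """Fill every missing entry with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in column]


def imputation_accuracy(original, imputed, masked_idx):
    """Fraction of masked entries whose imputed value equals the original."""
    if not masked_idx:
        return 1.0
    hits = sum(1 for i in masked_idx if original[i] == imputed[i])
    return hits / len(masked_idx)
```

A typical use is to call `mask_values` with a seeded `random.Random`, pass the masked column to the imputation method under test, and report `imputation_accuracy` over the masked positions only.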
10.
Bayesian networks are graphical models that describe dependency relationships between variables, and are powerful tools for studying probabilistic classifiers. At present, the causal Bayesian network learning method is used in constructing Bayesian network classifiers, while the contribution of each attribute to the class is overlooked. In this paper, a Bayesian network designed specifically for classification, the restricted Bayesian classification network, is proposed. Combining dependency analysis between variables, classification accuracy evaluation criteria and a search algorithm, a learning method for restricted Bayesian classification networks is presented. Experiments and analysis are carried out using data sets from the UCI machine learning repository. The results show that the restricted Bayesian classification network is more accurate than other well-known classifiers.
11.
Jose Miguel Hernández-Lobato Daniel Hernández-Lobato 《Pattern recognition》2011,44(4):886-900
In some classification problems there is prior information about the joint relevance of groups of features. This knowledge can be encoded in a network whose nodes correspond to features and whose edges connect features that should be either both excluded or both included in the predictive model. In this paper, we introduce a novel network-based sparse Bayesian classifier (NBSBC) that makes use of the information about feature dependencies encoded in such a network to improve its prediction accuracy, especially in problems with a high-dimensional feature space and a limited amount of available training data. Approximate Bayesian inference is efficiently implemented in this model using expectation propagation. The NBSBC method is validated on four real-world classification problems from different domains of application: phonemes, handwritten digits, precipitation records and gene expression measurements. A comparison with state-of-the-art methods (support vector machine, network-based support vector machine and graph lasso) shows that NBSBC has excellent predictive performance. It has the best accuracy in three of the four problems analyzed and ranks second in the modeling of the precipitation data. NBSBC also yields accurate and robust rankings of the individual features according to their relevance to the solution of the classification problem considered. The accuracy and stability of these estimates are an important factor in the good overall performance of this method.
12.
Bellazzi R. Riva A. 《IEEE transactions on systems, man, and cybernetics. Part A, Systems and humans : a publication of the IEEE Systems, Man, and Cybernetics Society》1998,28(5):629-636
Many real applications of Bayesian networks (BN) concern problems in which several observations are collected over time on a certain number of similar plants. This situation is typical of the context of medical monitoring, in which several measurements of the relevant physiological quantities are available over time on a population of patients under treatment, and the conditional probabilities that describe the model are usually obtained from the available data through a suitable learning algorithm. In situations with small data sets for each plant, it is useful to reinforce the parameter estimation process of the BN by taking into account the observations obtained from other similar plants. On the other hand, a desirable feature to be preserved is the ability to learn individualized conditional probability tables, rather than pooling together all the available data. In this work we apply a Bayesian hierarchical model able to preserve individual parameterization and, at the same time, to allow the conditionals of each plant to borrow strength from all the experience contained in the database. A testing example and an application in the context of diabetes monitoring will be shown.
13.
Grouped data occur frequently in practice, either because of limited resolution of instruments, or because data have been summarized in relatively wide bins. A combination of the composite link model with roughness penalties is proposed to estimate smooth densities from such data in a Bayesian framework. A simulation study is used to evaluate the performances of the strategy in the estimation of a density, of its quantiles and first moments. Two illustrations are presented: the first one involves grouped data of lead concentration in the blood and the second one the number of deaths due to tuberculosis in The Netherlands in wide age classes.
14.
Zhihao Zhang 《Pattern recognition》2010,43(9):3151-3161
Data stream classification is a hot topic in data mining research. The great challenge is that the class priors may evolve along the data sequence. Algorithms have been proposed to estimate the dynamic class priors and adjust the classifier accordingly. However, the existing algorithms do not perform well on prior estimation due to the lack of samples from the target distribution. Sample size has a great effect on parameter estimation, and small-sample effects greatly degrade estimation performance. In this paper, we propose a novel parameter estimation method called transfer estimation. Transfer estimation makes use of samples not only from the target distribution but also from similar distributions. We apply this new estimation method to the existing algorithms and obtain an improved algorithm. Experiments on both synthetic and real data sets show that the improved algorithm outperforms the existing algorithms on both class prior estimation and classification.
15.
We propose a scoring criterion, named mixture-based factorized conditional log-likelihood (mfCLL), which allows for efficient hybrid learning of mixtures of Bayesian networks in binary classification tasks. The learning procedure is decoupled into foreground and background learning, the foreground being the single concept of interest that we want to distinguish from a highly complex background. The overall procedure is hybrid in that the foreground is learned discriminatively, whereas the background is learned generatively. The learning algorithm is shown to run in polynomial time for network structures such as trees and consistent κ-graphs. To gauge the performance of the mfCLL scoring criterion, we carry out a comparison with state-of-the-art classifiers. Results obtained with a large suite of benchmark datasets show that mfCLL-trained classifiers are a competitive alternative and should be taken into consideration.
16.
Schizophrenia is a frequent and devastating disorder beginning in early adulthood. Until now, the heterogeneity of this disease has been a major pitfall for identifying the aetiological, genetic or environmental factors. Age at onset or several other quantitative variables could allow categorizing more homogeneous subgroups of patients, although there is little information on the boundaries for such categories. The Bayesian networks classifier (BNs) approach is one of the most popular formalisms for reasoning under uncertainty. Using a data set including genotypes of selected candidate genes for schizophrenia, BNs were used to determine the best cut-off point for three continuous variables (i.e. age at onset of schizophrenia (AFC & AFE) and neurological soft signs (NSS)).
17.
Bayesian networks have received much attention in the recent literature. In this article, we propose an approach to learn Bayesian networks using the stochastic approximation Monte Carlo (SAMC) algorithm. Our approach has two attractive features. Firstly, it possesses a self-adjusting mechanism and thus essentially avoids the local-trap problem suffered by conventional MCMC simulation-based approaches in learning Bayesian networks. Secondly, it falls into the class of dynamic importance sampling algorithms; the network features can be inferred by a dynamically weighted average of the samples generated in the learning process, and the resulting estimates can have much lower variation than single model-based estimates. The numerical results indicate that our approach can mix much faster over the space of Bayesian networks than the conventional MCMC simulation-based approaches.
18.
19.
20.
Gene expression microarray is a rapidly maturing technology that provides the opportunity to assay the expression levels of thousands or tens of thousands of genes in a single experiment. We present a new heuristic to select relevant gene subsets in order to further use them for the classification task. Our method is based on the statistical significance of adding a gene from a ranked list to the final subset. The efficiency and effectiveness of our technique are demonstrated through extensive comparisons with other representative heuristics. Our approach shows excellent performance, not only at identifying relevant genes, but also with respect to computational cost.