Similar Documents
20 similar documents retrieved.
1.
Due to the high dimensionality of microarray gene expression data and the complicated correlations among gene expression levels, statistical methods for analyzing these data are often computationally intensive and require special expertise to implement. Biologists without such expertise will benefit from a computationally efficient and easy-to-implement analytic method. In this article, we develop such a method: a two-stage empirical Bayes method for identifying differentially expressed genes. We use a special technique to reduce the dimension of the parameter space in the preliminary stage, and construct a Bayesian model in the second stage. The computation of our method is efficient and requires little calibration for real microarray gene expression data. Specifically, we employ statistical tools, including empirical Bayes estimation and a distribution approximation approach, to speed up computation while preserving precision. Based on our Bayesian model, we develop a score for evaluating the magnitude of each gene's overall differential expression, and declare differential expression according to the posterior probability that a gene's score exceeds some threshold value. The number of declarations is determined by a false discovery rate procedure.
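The abstract does not spell out the final declaration step; as one illustrative possibility, the standard Benjamini-Hochberg procedure turns per-gene p-values into a declaration set at a chosen FDR level. The sketch below is a generic implementation of that procedure, not the authors' code, and the function name and inputs are assumptions.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Illustrative Benjamini-Hochberg FDR procedure (not the authors' code).

    Returns a boolean mask of hypotheses declared significant at FDR level alpha.
    """
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)                      # sort p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # BH step-up thresholds
    below = pvals[order] <= thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    declared = np.zeros(m, dtype=bool)
    declared[order[:k]] = True                     # declare the k smallest p-values
    return declared
```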

2.
A hybrid Huberized support vector machine (HHSVM) with an elastic-net penalty has been developed for cancer tumor classification based on thousands of gene expression measurements. In this paper, we develop a Bayesian formulation of the hybrid Huberized support vector machine for binary classification. For the coefficients of the linear classification boundary, we propose a new type of prior, which can select variables and group them together simultaneously. Our proposed prior is a scale mixture of normal distributions and independent gamma priors on a transformation of the variance of the normal distributions. We establish a direct connection between the Bayesian HHSVM model with our special prior and the standard HHSVM solution with the elastic-net penalty. We propose a hierarchical Bayes technique and an empirical Bayes technique to select the penalty parameter. In the hierarchical Bayes model, the penalty parameter is selected using a beta prior. For the empirical Bayes model, we estimate the penalty parameter by maximizing the marginal likelihood. The proposed model is applied to two simulated data sets and three real-life gene expression microarray data sets. Results suggest that our Bayesian models are highly successful in selecting groups of similarly behaved important genes and predicting the cancer class. Most of the genes selected by our models have shown strong association with well-studied genetic pathways, further validating our claims.
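As a point of reference for the frequentist objective that the Bayesian model mirrors, here is a minimal sketch of the huberized hinge loss with an elastic-net penalty. The smoothing parameter delta, the default values, and the function names are illustrative assumptions; this shows the standard HHSVM objective, not the authors' Bayesian sampler.

```python
import numpy as np

def huberized_hinge(margin, delta=2.0):
    """Huberized hinge loss: hinge loss with a quadratic smoothing zone.

    margin = y * f(x); delta controls the width of the quadratic region.
    """
    loss = np.zeros_like(margin, dtype=float)
    left = margin < 1 - delta
    mid = (margin >= 1 - delta) & (margin < 1)
    loss[left] = 1 - margin[left] - delta / 2          # linear tail, far from margin
    loss[mid] = (1 - margin[mid]) ** 2 / (2 * delta)   # quadratic smoothing near margin
    return loss                                        # zero when margin >= 1

def hhsvm_objective(beta, X, y, lam1=1.0, lam2=1.0, delta=2.0):
    """Elastic-net-penalized huberized hinge objective (illustrative sketch)."""
    margins = y * (X @ beta)
    return (huberized_hinge(margins, delta).sum()
            + lam1 * np.abs(beta).sum()    # lasso part: variable selection
            + lam2 * (beta ** 2).sum())    # ridge part: grouping effect
```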

3.
A number of statistical approaches have been proposed for evaluating the statistical significance of differential expression in microarray data. The error estimation of these approaches is inaccurate when the number of replicated arrays is small. Consequently, the resulting statistics are often underpowered to detect important differential expression patterns in microarray data with limited replication. In this paper, we propose an empirical Bayes (EB) heterogeneous error model (HEM) with error-pooling prior specifications for the varying technical and biological errors in microarray data. The error estimates of HEM are thereby strengthened by, and shrunk toward, EB priors obtained by error-pooling estimation within each local intensity range. Using simulated and real data sets, we compared HEM with two widely used statistical approaches, significance analysis of microarrays (SAM) and analysis of variance (ANOVA), for identifying differential expression patterns across multiple conditions. The comparison showed that HEM is statistically more powerful than SAM and ANOVA, particularly when the sample size is smaller than five. We also suggest a resampling-based estimate of the Bayesian false discovery rate to provide a biologically relevant cutoff criterion for HEM statistics.
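The error-pooling idea can be written in its most common empirical-Bayes form, shown below as a generic illustration of variance shrinkage (the form popularized by moderated t-statistics); HEM's actual prior specification is more elaborate than this:

```latex
\tilde{s}_g^{2} \;=\; \frac{d_0\, s_0^{2} + d_g\, s_g^{2}}{d_0 + d_g},
```

where \(s_g^2\) is gene \(g\)'s raw error variance with \(d_g\) degrees of freedom and \(s_0^2, d_0\) describe the pooled prior, so small-sample variance estimates are pulled toward the pooled value.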

4.
A Learning Method for the Augmented Bayes Classifier
The Naive Bayes classifier, a computationally simple and reasonably accurate classification method, has been widely applied. However, its underlying assumption that all attributes are mutually independent is easily violated in practice, which limits further improvement of the classifier's accuracy. Bayesian networks model the dependencies among attributes well, but their computation is quite complex. The Augmented Bayes classifier combines the advantages of both: it accounts for dependencies among attributes while keeping the algorithm simple. Starting from the amount of information carried by each attribute, this paper proposes an entropy-based learning method for the Augmented Bayes classifier. Finally, the method is compared with the Naive Bayes classifier and the SuperParent algorithm on test data.
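The abstract does not give the entropy criterion itself; as one illustrative possibility, augmented (TAN-style) classifiers commonly rank candidate attribute-attribute edges by class-conditional mutual information. The sketch below computes that quantity from discrete data; the function and variable names are assumptions, not the paper's algorithm.

```python
import numpy as np

def conditional_mutual_information(x, y, c):
    """I(X; Y | C) from three discrete label arrays (illustrative sketch).

    Higher values indicate stronger dependency between attributes X and Y
    given the class, making the pair a candidate for an augmenting edge.
    """
    x, y, c = (np.asarray(a) for a in (x, y, c))
    cmi = 0.0
    for cv in np.unique(c):
        mask = c == cv
        pc = mask.mean()                      # P(C = cv)
        xs, ys = x[mask], y[mask]
        for xv in np.unique(xs):
            for yv in np.unique(ys):
                pxy = np.mean((xs == xv) & (ys == yv))   # P(x, y | c)
                px, py = np.mean(xs == xv), np.mean(ys == yv)
                if pxy > 0:
                    cmi += pc * pxy * np.log(pxy / (px * py))
    return cmi
```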

5.
Bayesian learning, widely used in many applied data-modeling problems, is often accomplished with approximation schemes because it requires intractable computation of the posterior distributions. In this study, we focus on two approximation methods, variational Bayes and local variational approximation. We show that the variational Bayes approach for statistical models with latent variables can be viewed as a special case of local variational approximation, where the log-sum-exp function is used to form the lower bound of the log-likelihood. The minimum variational free energy, the objective function of variational Bayes, is analyzed and related to the asymptotic theory of Bayesian learning. This analysis additionally implies a relationship between the generalization performance of the variational Bayes approach and the minimum variational free energy.
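For readers unfamiliar with the objective mentioned here, the variational free energy is defined as follows; this is standard textbook material, not a formula quoted from the paper:

```latex
\bar{F}(q) \;=\; \mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(x, z)}\right]
          \;=\; -\log p(x) + \mathrm{KL}\!\left(q(z)\,\|\,p(z \mid x)\right)
          \;\ge\; -\log p(x),
```

with equality exactly when \(q\) equals the true posterior; variational Bayes minimizes \(\bar{F}\) over a tractable family of distributions.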

6.
Bayesian networks are models for reasoning under uncertainty that are gaining importance for the data mining task of classification as well. Credal networks extend Bayesian nets to sets of distributions, or credal sets. This paper extends a state-of-the-art Bayesian net for classification, the tree-augmented naive Bayes classifier, to credal sets originating from probability intervals. This extension is a basis for addressing the fundamental problem of prior ignorance about the distribution that generates the data, which is commonplace in data mining applications. This issue is often neglected, but addressing it properly is key to ultimately drawing reliable conclusions from the inferred models. In this paper we formalize the new model, develop an exact linear-time classification algorithm, and evaluate the credal net-based classifier on a number of real data sets. The empirical analysis shows that the new classifier is good and reliable, and raises a problem of excessive caution that is discussed in the paper. Overall, given the favorable trade-off between expressiveness and efficient computation, the newly proposed classifier appears to be a good candidate for the wide-scale application of reliable classifiers based on credal networks to real and complex tasks.
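Probability intervals of the kind used to build credal sets are often derived with the imprecise Dirichlet model (IDM); as a hedged illustration (the abstract does not state which interval model is used), the IDM assigns a category observed \(n_i\) times out of \(N\) observations the interval

```latex
\underline{p}_i \;=\; \frac{n_i}{N + s},
\qquad
\overline{p}_i \;=\; \frac{n_i + s}{N + s},
```

where \(s > 0\) is a hyperparameter controlling the strength of prior ignorance.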

7.
The Bayesian classifier is a fundamental classification technique. In this work, we focus on programming Bayesian classifiers in SQL. We introduce two classifiers: Naive Bayes and a classifier based on class decomposition using K-means clustering. We consider two complementary tasks: model computation and scoring a data set. We study several layouts for tables and several indexing alternatives. We analyze how to transform equations into efficient SQL queries and introduce several query optimizations. We conduct experiments with real and synthetic data sets to evaluate classification accuracy, query optimizations, and scalability. Our Bayesian classifier is more accurate than Naive Bayes and decision trees. Distance computation is significantly accelerated with horizontal layout for tables, denormalization, and pivoting. We also compare Naive Bayes implementations in SQL and C++: SQL is about four times slower. Our Bayesian classifier in SQL achieves high classification accuracy, can efficiently analyze large data sets, and has linear scalability.
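To give a flavor of what programming a Bayesian classifier in SQL can look like, the sketch below scores rows with Gaussian Naive Bayes in a single aggregate query. The schema (a vertical-layout `points` table plus `stats` and `priors` model tables) and all identifiers are illustrative assumptions, not the authors' table layout or optimized queries.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("LN", 1, math.log)  # portable: not every SQLite build ships ln()
conn.executescript("""
CREATE TABLE points(pid INTEGER, attr TEXT, value REAL);   -- data, vertical layout
CREATE TABLE stats(class TEXT, attr TEXT, mean REAL, var REAL);  -- trained model
CREATE TABLE priors(class TEXT, log_prior REAL);
""")

# Per (point, class): log prior + sum of per-attribute Gaussian log-densities.
# The class with the highest log_score is the predicted label.
score_sql = """
SELECT p.pid, s.class,
       pr.log_prior
       + SUM(-0.5 * LN(2 * 3.141592653589793 * s.var)
             - (p.value - s.mean) * (p.value - s.mean) / (2 * s.var)) AS log_score
FROM points p
JOIN stats s   ON s.attr = p.attr
JOIN priors pr ON pr.class = s.class
GROUP BY p.pid, s.class
ORDER BY p.pid, log_score DESC;
"""
rows = conn.execute(score_sql).fetchall()  # empty until a training pass fills the tables
```

Scoring as a set-oriented aggregate query, rather than row-by-row in application code, is what lets the database engine parallelize and index the computation.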

8.
Identifying differentially expressed genes in microarray data has been studied extensively and several methods have been proposed. The most popular methods for gene expression microarray data analysis rely on a normality assumption and are based on a Wald statistic. These methods may be inefficient when expression levels follow a skewed distribution. To deal with possible violations of the normality assumption, we propose a method based on the Generalized Logistic Distribution of Type II (GLDII). The motivation behind this distributional assumption is to allow longer tails than the normal distribution. This is important in analyzing gene expression data, since extreme values are common in such experiments. The shape parameter of the GLDII allows flexibility in modeling a wide range of distributions. To reduce the computational complexity of carrying out Likelihood Ratio (LR) tests for several thousand genes, an Approximate LR Test (ALRT) is proposed. We also generalize the two-class ALRT method to multi-class microarray data. The performance of the ALRT method under the GLDII assumption is compared to methods based on Wald-type statistics using simulation. The results show that our method performs quite well compared with the significance analysis of microarrays (SAM) approach using standardized Wilcoxon rank statistics and the empirical Bayes (E-B) t-statistics. Our method is also less sensitive to extreme values. We illustrate our method using two publicly available gene expression data sets.

9.
Xintao, Yong. Pattern Recognition, 2006, 39(12): 2439-2449.
DNA microarray provides a powerful basis for analysis of gene expression. Bayesian networks, which are based on directed acyclic graphs (DAGs) and can provide models of causal influence, have been investigated for gene regulatory networks. The difficulty with this technique is that learning the Bayesian network structure is an NP-hard problem, as the number of DAGs is superexponential in the number of genes, and an exhaustive search is intractable. In this paper, we propose an enhanced constraint-based approach for causal structure learning. We integrate graphical Gaussian modeling and use its independence graph as input to our constraint-based causal learning method. We also present graphical decomposition techniques to further improve performance. Our enhanced method makes it feasible to explore causal interactions among genes interactively. We have tested our methodology on two microarray data sets. The results show that the technique is both effective and efficient in exploring causal structures from microarray data.
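The graphical Gaussian step can be pictured concretely: partial correlations come from the inverse covariance (precision) matrix, and thresholding them yields the independence graph that is fed to the constraint-based search. The sketch below shows that step under simple assumptions (a well-conditioned sample covariance and an arbitrary fixed threshold); it is an illustration, not the authors' code.

```python
import numpy as np

def independence_graph(X, threshold=0.1):
    """Undirected independence graph from an (n x p) data matrix X.

    An edge (i, j) is kept when the partial correlation between variables
    i and j, given all others, exceeds the threshold in absolute value.
    """
    precision = np.linalg.inv(np.cov(X, rowvar=False))   # Omega = Sigma^{-1}
    d = np.sqrt(np.diag(precision))
    partial_corr = -precision / np.outer(d, d)           # rho_ij = -w_ij / sqrt(w_ii w_jj)
    np.fill_diagonal(partial_corr, 1.0)
    adjacency = np.abs(partial_corr) > threshold
    np.fill_diagonal(adjacency, False)                   # no self-loops
    return adjacency
```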

10.
Biao Qin, Yuni Xia, Shan Wang, Xiaoyong Du. Knowledge, 2011, 24(8): 1151-1158.
Data uncertainty can be caused by numerous factors, such as measurement precision limitations, network latency, data staleness, and sampling errors. When mining knowledge from emerging applications such as sensor networks or location-based services, data uncertainty should be handled cautiously to avoid erroneous results. In this paper, we apply probability and statistical theory to uncertain data and develop a novel method to calculate the conditional probabilities in Bayes' theorem. Based on that, we propose a novel Bayesian classification algorithm for uncertain data. The experimental results show that the proposed method classifies uncertain data with potentially higher accuracy than the naive Bayesian approach. It also has more stable performance than the existing extended naive Bayesian method.
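To make the idea concrete, one simple way to fold attribute-level uncertainty into Bayes' theorem is to replace the point likelihood with the class-conditional probability mass over the attribute's uncertainty interval. The sketch below does this for Gaussian class-conditional models and interval-valued attributes; it is an illustrative interpretation, not the algorithm from the paper.

```python
import math

def interval_likelihood(lo, hi, mean, std):
    """P(lo <= X <= hi | class) under a Gaussian class-conditional model (std > 0)."""
    cdf = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return cdf((hi - mean) / std) - cdf((lo - mean) / std)

def classify_uncertain(intervals, class_models, priors):
    """Naive Bayes over interval-valued attributes (illustrative sketch).

    intervals: list of (lo, hi) per attribute.
    class_models: {class: [(mean, std), ...]} aligned with the attributes.
    priors: {class: prior probability}.
    """
    scores = {}
    for c, params in class_models.items():
        log_score = math.log(priors[c])
        for (lo, hi), (mean, std) in zip(intervals, params):
            # tiny floor avoids log(0) for intervals far in a distribution's tail
            log_score += math.log(interval_likelihood(lo, hi, mean, std) + 1e-300)
        scores[c] = log_score
    return max(scores, key=scores.get)   # maximum-a-posteriori class
```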

11.
Time-course microarray studies require particular modelling of covariance matrices when measures are repeated on the same individuals. Taking the within-subject correlation into account in the test statistics for differential gene expression, however, requires a large number of parameters when a gene-specific approach is used, which often results in a lack of power due to the small number of individuals usually considered in microarray experiments. Shrinkage approaches can improve detection power in differential gene expression studies by reducing the number of parameters, while offering good flexibility and a small rate of false positives. A natural extension of the shrinkage approach based on a structural mixed model to variance-covariance matrices is proposed. The structural model was used in three configurations to shrink (i) the eigenvalues in an eigenvalue/eigenvector decomposition, (ii) the innovation variances in a Cholesky decomposition, and (iii) both the variances and correlation parameters of a gene-by-gene covariance matrix using a Fisher transformation. The proposed methods were applied both to a publicly available data set and to simulated data. They were found to perform well compared with previously proposed empirical Bayesian approaches, and outperformed the gene-specific or common-covariance methods in many cases.
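Configuration (i) can be illustrated simply: decompose each gene's covariance estimate, shrink the eigenvalues toward a common target, and recompose. The sketch below shrinks toward the mean eigenvalue with a fixed weight; the weight and the target are illustrative assumptions, not the structural-mixed-model estimates used in the paper.

```python
import numpy as np

def shrink_eigenvalues(cov, weight=0.5):
    """Shrink the eigenvalues of a covariance matrix toward their mean.

    weight in [0, 1]: 0 returns cov unchanged, 1 returns a spherical matrix.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)           # cov = V diag(lambda) V'
    target = eigvals.mean()
    shrunk = (1 - weight) * eigvals + weight * target
    return eigvecs @ np.diag(shrunk) @ eigvecs.T     # recompose with shrunk spectrum
```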

12.
Markov chain Monte Carlo (MCMC) algorithms have greatly facilitated the popularity of Bayesian variable selection and model averaging in problems with high-dimensional covariates, where enumeration of the model space is infeasible. A variety of such algorithms have been proposed in the literature for sampling models from the posterior distribution in Bayesian variable selection. Ghosh and Clyde proposed a method that exploits the properties of orthogonal design matrices. Their data augmentation algorithm dramatically speeds up the computation compared with traditional Gibbs samplers, and makes Rao-Blackwellized estimates of quantities of interest available for the original non-orthogonal problem. The algorithm performs excellently when the correlations among the columns of the design matrix are small, but empirical results suggest that moderate to strong multicollinearity leads to slow mixing. This motivates the development of a class of novel sandwich algorithms for Bayesian variable selection that improve upon the algorithm of Ghosh and Clyde. It is proved that, within the parameter expansion data augmentation (PXDA) class of sandwich algorithms, the Haar algorithm with the largest group acting on the space of models is optimal. This result provides theoretical insight, but using the largest group is computationally prohibitive, so two new computationally viable sandwich algorithms are developed; they are inspired by the Haar algorithm but do not necessarily belong to the class of PXDA algorithms. Simulation studies and real data analysis illustrate that several of the sandwich algorithms can offer substantial gains in the presence of multicollinearity.

13.
Bayesian networks, thanks to their ability to express causal relationships among attributes, are a powerful tool for handling incomplete data. However, the vast majority of Bayesian classifiers assume complete data, while real-world data are often incomplete, so building effective Bayesian classifiers from incomplete data is an important and challenging problem. By analyzing the shortcomings of the well-known RBC classifier for incomplete data, this paper presents a classifier construction method for incomplete data based on the BC (Bound and Collapse) method and the EM algorithm. Experimental results demonstrate the effectiveness of the algorithm.

14.
A Naive Bayes Classifier Based on Incomplete Data
Bayesian networks, thanks to their ability to express causal relationships among attributes, are a powerful tool for handling incomplete data. However, the vast majority of Bayesian classifiers assume complete data, while real-world data are often incomplete, so building effective Bayesian classifiers from incomplete data is an important and challenging problem. By analyzing the shortcomings of the well-known RBC classifier for incomplete data, this paper presents a classifier construction method for incomplete data based on the BC (Bound and Collapse) method and the EM algorithm. Experimental results demonstrate the effectiveness of the algorithm.
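The EM component mentioned in entries 13 and 14 can be sketched concretely for a discrete naive Bayes model: the E-step distributes each missing attribute value fractionally across its possible values, and the M-step re-estimates the conditional probability tables from the fractional counts. The following is a minimal sketch under those assumptions, not the papers' RBC/BC-based procedure.

```python
import numpy as np

def em_naive_bayes(X, y, n_values, n_iter=20, alpha=1.0):
    """EM for discrete naive Bayes with missing attribute values (sketch).

    X: (n, d) integer array where -1 marks a missing value.
    y: observed class labels 0..K-1.
    Returns theta, a list where theta[k][j, v] = P(x_j = v | class k).
    """
    X, y = np.asarray(X), np.asarray(y)
    n, d = X.shape
    K = int(y.max()) + 1
    theta = [np.full((d, n_values), 1.0 / n_values) for _ in range(K)]
    for _ in range(n_iter):
        # M-step counts start from a Laplace prior; the E-step contributes
        # fractional counts for missing entries.
        counts = [np.full((d, n_values), alpha) for _ in range(K)]
        for i in range(n):
            k = int(y[i])
            for j in range(d):
                if X[i, j] >= 0:
                    counts[k][j, X[i, j]] += 1.0   # observed value: hard count
                else:
                    counts[k][j] += theta[k][j]    # missing: expected counts
        for k in range(K):
            theta[k] = counts[k] / counts[k].sum(axis=1, keepdims=True)
    return theta
```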

15.
An encompassing prior (EP) approach to facilitate Bayesian model selection for nested models with inequality constraints has been previously proposed. In this approach, samples are drawn from the prior and posterior distributions of an encompassing model that contains an inequality restricted version as a special case. The Bayes factor in favor of the inequality restriction then simplifies to the ratio of the proportions of posterior and prior samples consistent with the inequality restriction. This formalism has been applied almost exclusively to models with inequality or "about equality" constraints. It is shown that the EP approach naturally extends to exact equality constraints by considering the ratio of the heights for the posterior and prior distributions at the point that is subject to test (i.e., the Savage-Dickey density ratio). The EP approach generalizes the Savage-Dickey ratio method, and can accommodate both inequality and exact equality constraints. The general EP approach is found to be a computationally efficient procedure to calculate Bayes factors for nested models. However, the EP approach to exact equality constraints is vulnerable to the Borel-Kolmogorov paradox, the consequences of which warrant careful consideration.
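The key identity referred to here is the Savage-Dickey density ratio; in standard notation (a textbook statement, not a formula quoted from the paper), the Bayes factor for the point restriction \(\theta = \theta_0\) against the encompassing model \(M_e\) is

```latex
BF_{0e} \;=\; \frac{p(\theta = \theta_0 \mid y,\, M_e)}{p(\theta = \theta_0 \mid M_e)},
```

the height of the posterior over the height of the prior at the tested point. For an inequality constraint, the EP approach analogously uses the ratio of the posterior to the prior probability mass consistent with the constraint.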

16.
17.
Boosted Bayesian network classifiers
The use of Bayesian networks for classification problems has received a significant amount of recent attention. Although computationally efficient, the standard maximum likelihood learning method tends to be suboptimal due to the mismatch between its optimization criterion (data likelihood) and the actual goal of classification (label prediction accuracy). Recent approaches to optimizing classification performance during parameter or structure learning show promise, but lack the favorable computational properties of maximum likelihood learning. In this paper we present boosted Bayesian network classifiers, a framework combining discriminative data-weighting with generative training of intermediate models. We show that boosted Bayesian network classifiers encompass the basic generative models in isolation, but improve their classification performance when the model structure is suboptimal. We also demonstrate that structure learning is beneficial in the construction of boosted Bayesian network classifiers. On a large suite of benchmark data sets, this approach outperforms generative graphical models such as naive Bayes and TAN in classification accuracy. Boosted Bayesian network classifiers have comparable or better performance than other discriminatively trained graphical models, including ELR and BNC. Furthermore, boosted Bayesian networks require significantly less training time than the ELR and BNC algorithms.
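The combination described here, discriminative data-weighting around generative training, follows the AdaBoost pattern: fit a generative model on weighted data, up-weight the misclassified examples, repeat. A minimal sketch with a pluggable weighted learner is shown below; the learner interface is an assumption, and the paper's actual base models are Bayesian networks such as naive Bayes or TAN.

```python
import numpy as np

def adaboost(X, y, fit_weighted, n_rounds=10):
    """AdaBoost.M1 skeleton for labels y in {-1, +1} (illustrative sketch).

    fit_weighted(X, y, w) must return a callable model(X) -> {-1, +1}
    trained with sample weights w, e.g. a weighted naive Bayes learner.
    """
    y = np.asarray(y)
    w = np.full(len(y), 1.0 / len(y))
    models, alphas = [], []
    for _ in range(n_rounds):
        model = fit_weighted(X, y, w)
        pred = model(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        if err == 0 or err >= 0.5:               # degenerate round: stop boosting
            break
        alpha = 0.5 * np.log((1 - err) / err)    # confidence of this round
        w = w * np.exp(-alpha * y * pred)        # up-weight misclassified points
        w /= w.sum()
        models.append(model)
        alphas.append(alpha)

    def predict(X_new):
        votes = sum(a * m(X_new) for a, m in zip(alphas, models))
        return np.sign(votes)                    # weighted majority vote

    return predict
```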

18.
We develop an approach for analyzing time-course microarray data obtained from a single sample at multiple time points, and for identifying which genes are cell-cycle regulated. Since some genes have similar gene expression patterns, to reduce the amount of hypothesis testing we first perform a clustering analysis to group genes into classes with similar cell-cycle patterns, including a class with no cell-cycle phenomena at all. Then we build a statistical model and an inference function assuming that genes within a cluster share the same mean model. A varying-coefficient nonparametric approach is employed for the flexibility needed to fit the time-course data. To incorporate the correlation of longitudinal measurements, the quadratic inference function method is applied to obtain more efficient estimators and more powerful tests. Furthermore, this method allows us to perform chi-squared tests to determine whether certain genes are cell-cycle regulated. The approach is illustrated with a cell-cycle microarray data set as well as with simulations.

19.
DNA microarrays have been recognized as an important tool for studying the expression of thousands of genes simultaneously. These experiments allow us to compare two different samples of cDNA obtained under different conditions. A novel method for the analysis of replicated microarray experiments, based upon modelling the gene expression distribution as a mixture of α-stable distributions, is presented. Some features of the distribution of gene expression, such as Pareto tails and the fact that the variance of any given array increases concomitantly with an increase in the number of genes studied, suggest the possibility of modelling the gene expression distribution with an α-stable density. The proposed methodology uses well-known properties of α-stable distributions, such as the scale-mixture-of-normals representation. A Bayesian log-posterior odds is calculated, which allows us to decide whether a gene is differentially expressed or not. The proposed methodology is illustrated using simulated and experimental data, and the results are compared with other existing statistical approaches. The proposed heavy-tailed model improves on the performance of other distributions and is easily applicable to microarray gene data, especially if the data set contains outliers or presents high variance between replicates.
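The scale-mixture property invoked here can be stated precisely; this is a standard result about symmetric α-stable laws, not a formula from the paper. If \(Z \sim N(0, \sigma^2)\) and \(A\) is a positive stable random variable of index \(\alpha/2\) independent of \(Z\), then

```latex
X \;=\; \sqrt{A}\, Z
```

follows a symmetric α-stable distribution. Conditioning on the mixing variable \(A\) therefore reduces the Bayesian computation to ordinary Gaussian steps.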

20.
During drug development, nonlinear mixed effects models are routinely used to study a drug's pharmacokinetics and pharmacodynamics. The distribution of the random effects is of special interest because it describes the heterogeneity of the drug's kinetics or dynamics in the population of individuals studied. Parametric models are widely used, but they rely on a normality assumption which may be too restrictive. In practice, this assumption is often checked using the empirical distribution of the random effects' empirical Bayes estimates. Unfortunately, when data are sparse (as in phase III clinical trials in patients), this method is unreliable. In this context, nonparametric estimators of the random effects distribution are attractive. Several nonparametric methods (estimators and their associated computation algorithms) have been proposed, but their use is limited. Indeed, their practical and theoretical properties are unclear and they have a reputation for being computationally expensive. Four nonparametric methods are evaluated in comparison with the usual parametric method. Statistical and computational features are reviewed, and practical performance is compared in simulation studies mimicking real pharmacokinetic analyses. The nonparametric methods appeared very useful when data are sparse. On a simple pharmacokinetic model, all the nonparametric methods performed roughly equivalently. On a more challenging pharmacokinetic model, differences between the methods were clearer.
