Similar Documents
20 similar documents found (search time: 46 ms)
1.
Data clustering is a fundamental unsupervised learning task in several domains such as data mining, computer vision, information retrieval, and pattern recognition. In this paper, we propose and analyze a new clustering approach based on both hierarchical Dirichlet processes and the generalized Dirichlet distribution, which leads to an interesting statistical framework for data analysis and modelling. Our approach can be viewed as a hierarchical extension of the infinite generalized Dirichlet mixture model previously proposed in Bouguila and Ziou (IEEE Trans Neural Netw 21(1):107–122, 2010). The proposed clustering approach tackles the problem of modelling grouped data, where observations are organized into groups that remain statistically linked by sharing mixture components. The resulting clustering model is learned using a principled variational Bayes inference-based algorithm that we have developed. Extensive experiments and simulations, based on two challenging applications, namely image categorization and web service intrusion detection, demonstrate our model's usefulness and merits.

2.
We consider the problem of determining the structure of high-dimensional data without prior knowledge of the number of clusters. Data are represented by a finite mixture model based on the generalized Dirichlet distribution. The generalized Dirichlet distribution has a more general covariance structure than the Dirichlet distribution and offers high flexibility and ease of use for the approximation of both symmetric and asymmetric distributions. This makes the generalized Dirichlet distribution more practical and useful. An important problem in mixture modeling is the determination of the number of clusters: a mixture with too many or too few components may not be appropriate to approximate the true model. Here, we consider the application of the minimum message length (MML) principle to determine the number of clusters. The MML criterion is derived so as to choose the number of clusters in the mixture model which best describes the data. A comparison with other selection criteria is performed. The validation involves synthetic data, real data clustering, and two interesting real applications: classification of web pages, and texture database summarization for efficient retrieval.
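As a point of reference for the density this model family relies on: in the standard Connor–Mosimann parametrization, a generalized Dirichlet variable carries one (a_i, b_i) shape pair per dimension, which is what gives it a more general covariance structure than the Dirichlet. Below is a minimal sketch of the log-density under that standard parametrization; the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.special import gammaln

def generalized_dirichlet_logpdf(x, a, b):
    """x: D positive values with sum(x) < 1 (last simplex component implicit);
    a, b: positive shape parameters, each of length D."""
    x, a, b = (np.asarray(v, dtype=float) for v in (x, a, b))
    D = len(a)
    # gamma_i = b_i - a_{i+1} - b_{i+1} for i < D, and gamma_D = b_D - 1
    g = np.empty(D)
    g[:-1] = b[:-1] - a[1:] - b[1:]
    g[-1] = b[-1] - 1.0
    tail = 1.0 - np.cumsum(x)  # (1 - x_1 - ... - x_i) for i = 1..D
    log_norm = np.sum(gammaln(a + b) - gammaln(a) - gammaln(b))
    return log_norm + np.sum((a - 1.0) * np.log(x) + g * np.log(tail))
```

Setting b_i = a_{i+1} + b_{i+1} for all i recovers the ordinary Dirichlet as a special case, which is why the generalized form strictly subsumes it.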

3.
The advent of mixture models has opened the possibility of flexible models which are practical to work with. A common assumption is that the data are generated from a Gaussian mixture. The inverted Dirichlet mixture has been shown to be a better alternative to the Gaussian mixture and to be of significant value in a variety of applications involving positive data. The inverted Dirichlet is, however, often undesirable, since it forces an assumption of positive correlation. Our focus here is to develop a Bayesian alternative to both the Gaussian and the inverted Dirichlet mixtures when dealing with positive data. The alternative that we propose is based on the generalized inverted Dirichlet distribution, which offers high flexibility and ease of use, as we show in this paper. Moreover, it has a more general covariance structure than the inverted Dirichlet. The proposed mixture model is subjected to a fully Bayesian analysis based on Markov chain Monte Carlo (MCMC) simulation methods, namely Gibbs sampling and Metropolis–Hastings, to compute the posterior distribution of the parameters, and on the Bayesian information criterion (BIC) for model selection. The adoption of this purely Bayesian learning choice is motivated by the fact that Bayesian inference allows uncertainty to be dealt with in a unified and consistent manner. We evaluate our approach on the basis of two challenging applications concerning object classification and forgery detection.
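The BIC-based model selection mentioned here follows the usual recipe of penalizing the log-likelihood by model size. A hedged sketch under that standard definition, with `fit_mixture` standing in as a hypothetical placeholder for the paper's MCMC-based learning procedure:

```python
import numpy as np

def select_by_bic(data, fit_mixture, max_components=10):
    """fit_mixture(data, K) -> (log_likelihood, n_free_params); hypothetical interface."""
    n = len(data)
    best_bic, best_K = np.inf, 1
    for K in range(1, max_components + 1):
        loglik, n_params = fit_mixture(data, K)
        bic = n_params * np.log(n) - 2.0 * loglik  # BIC = k*ln(n) - 2*ln(L)
        if bic < best_bic:
            best_bic, best_K = bic, K
    return best_K
```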

4.
Mixture modeling is one of the most useful tools in machine learning and data mining applications. An important challenge when applying finite mixture models is the selection of the number of clusters which best describes the data. Recent developments have shown that this problem can be handled by the application of non-parametric Bayesian techniques to mixture modeling. Another crucial preprocessing step in mixture learning is the selection of the most relevant features. To tackle these problems, the main approach in this paper consists of applying non-parametric Bayesian estimation and inference techniques to the generalized Dirichlet mixture model. Specifically, we extend finite generalized Dirichlet mixture models to the infinite case, in which the number of components and relevant features do not need to be known a priori. This extension provides a natural representation of uncertainty regarding the challenging problem of model selection. We propose a Markov chain Monte Carlo algorithm to learn the resulting infinite mixture. Through applications involving text and image categorization, we show that infinite mixture models offer a more powerful and robust performance than classic finite mixtures for both clustering and feature selection.

5.
6.
We describe approaches for positive data modeling and classification using both finite inverted Dirichlet mixture models and support vector machines (SVMs). Inverted Dirichlet mixture models are used to tackle an outstanding challenge in SVMs, namely the generation of accurate kernels. The kernel generation approaches that we consider, grounded in ideas from information theory, allow the incorporation of the data structure and its structural constraints. Inverted Dirichlet mixture models are learned within a principled Bayesian framework using both the Gibbs sampler and Metropolis–Hastings for parameter estimation, and the Bayes factor for model selection (i.e., determining the number of mixture components). Our Bayesian learning approach places priors over the model parameters, which we derive by showing that the inverted Dirichlet distribution belongs to the family of exponential distributions, and then combines these priors with information from the data to build posterior distributions. We illustrate the merits and the effectiveness of the proposed method with two challenging real-world applications, namely object detection and visual scene analysis and classification.
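The abstract does not spell out the kernel construction, but a common information-theoretic choice in this line of work builds a kernel from the symmetrized Kullback–Leibler divergence between fitted per-observation densities. The sketch below illustrates that generic idea only; it is an assumption for illustration, not necessarily the paper's exact kernel.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def kl_kernel(p, q, lam=1.0):
    # Widely used in practice, though not guaranteed positive definite.
    return np.exp(-lam * symmetric_kl(p, q))
```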

7.
This paper presents an unsupervised approach for feature selection and extraction in mixtures of generalized Dirichlet (GD) distributions. Our method defines a new mixture model that is able to extract independent and non-Gaussian features without loss of accuracy. The proposed model is learned using the Expectation-Maximization algorithm by minimizing the message length of the data set. Experimental results show the merits of the proposed methodology in the categorization of object images.

8.
In this paper, we propose a probabilistic framework for efficient retrieval and indexing of image collections. This framework uncovers the hierarchical structure underlying the collection from image features, based on a hybrid model that combines both generative and discriminative learning. We adopt the generalized Dirichlet mixture and maximum likelihood for the generative learning in order to estimate the statistical model of the data accurately. The resulting model is then refined by a new discriminative likelihood that enhances the power of relevant features. Consequently, this new model is suitable for modeling high-dimensional data described by both semantic and low-level (visual) features. The semantic features are defined according to a known ontology, while visual features represent the visual appearance such as color, shape, and texture. For validation purposes, we propose a new visual feature which has desirable invariance properties under image transformations. Experiments on the Microsoft collection (MSRCID) clearly show the merits of our approach in both retrieval and indexing.

9.
Positive vectors clustering using inverted Dirichlet finite mixture models
In this work we present an unsupervised algorithm for learning finite mixture models from multivariate positive data. Indeed, this kind of data appears naturally in many applications, yet it has not been adequately addressed in the past. This mixture model is based on the inverted Dirichlet distribution, which offers a good representation and modeling of positive non-Gaussian data. The proposed approach for estimating the parameters of an inverted Dirichlet mixture is based on maximum likelihood (ML) using the Newton–Raphson method. We also develop an approach, based on the minimum message length (MML) criterion, to select the optimal number of clusters to represent the data using such a mixture. Experimental results are presented using artificial histograms and real data sets. The challenging problem of software module classification is also investigated within the proposed statistical framework.
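The Newton–Raphson ML step referred to above has the generic form θ ← θ − H⁻¹g. A minimal sketch of that iteration, with `grad` and `hess` as placeholders for the inverted Dirichlet mixture's actual log-likelihood derivatives (illustrative, not the paper's code):

```python
import numpy as np

def newton_raphson(theta, grad, hess, tol=1e-6, max_iter=100):
    """Maximize a log-likelihood via the update theta <- theta - H^{-1} g."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))  # solves H @ step = g
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta
```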

10.
The Gaussian mixture model based on the Dirichlet distribution (Dirichlet Gaussian mixture model) has recently received great attention for modeling and processing data. This paper studies the new Dirichlet Gaussian mixture model for image segmentation. First, we propose a new way to incorporate local spatial information between neighboring pixels based on the Dirichlet distribution. Its main advantages are its simplicity, ease of implementation, and fast computational speed. Second, the existing Dirichlet Gaussian model uses a complex log-likelihood function and requires many parameters that are difficult to estimate. The proposed model has fewer parameters, and its log-likelihood function has a simpler form. Finally, to estimate the parameters of the proposed Dirichlet Gaussian mixture model, a gradient method is adopted to minimize the negative log-likelihood function. Numerical experiments are conducted using the proposed model on various synthetic, natural, and color images. We demonstrate through extensive simulations that the proposed model is superior to other model-based techniques for image segmentation.

11.
In this paper, we propose a Bayesian nonparametric approach for modeling and selection based on a mixture of Dirichlet processes with Dirichlet distributions, which can also be seen as an infinite Dirichlet mixture model. The proposed model uses a stick-breaking representation and is learned by a variational inference method. Due to the nature of the Bayesian nonparametric approach, the problems of overfitting and underfitting are prevented. Moreover, the obstacle of estimating the correct number of clusters is sidestepped by assuming an infinite number of clusters. Compared to other approximation techniques, such as Markov chain Monte Carlo (MCMC), which requires high computational cost and whose convergence is difficult to diagnose, the whole inference process in the proposed variational learning framework is analytically tractable with closed-form solutions. Additionally, the proposed infinite Dirichlet mixture model with variational learning requires only a modest amount of computational power, which makes it suitable for large-scale applications. The effectiveness of our model is experimentally investigated through both synthetic data sets and challenging real-life multimedia applications, namely image spam filtering and human action video categorization.
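The stick-breaking representation named here generates mixture weights by repeatedly breaking off Beta(1, α) fractions of a unit-length stick. A minimal sketch, truncated at K components as variational implementations typically are:

```python
import numpy as np

def stick_breaking_weights(alpha, K, rng=None):
    """Truncated stick-breaking: pi_k = v_k * prod_{j<k} (1 - v_j)."""
    rng = rng or np.random.default_rng()
    v = rng.beta(1.0, alpha, size=K)  # v_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining
```

Smaller α concentrates mass on the first few sticks (few effective clusters), while larger α spreads it across many, which is how the model lets the data determine the effective number of components.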

12.
To capture the temporal relevance in power grid net load data, a data-relevance Dirichlet process mixture model (DDPMM) is proposed to characterize the uncertainty of net load. First, a Dirichlet process mixture model is fitted to the observed and forecast net load data to obtain a mixture probability model. Then, a variational Bayesian inference method that accounts for data relevance is proposed, with an improved posterior distribution, to solve the mixture probability model and obtain its optimal parameters. Finally, the marginal probability distribution of the forecast error corresponding to each net load forecast value is derived, thereby characterizing the uncertainty. The method is validated on net load data from the Belgian power grid; the case study results show that, compared with the conventional Dirichlet process mixture model, the Gaussian mixture model (GMM), and other methods, the proposed data-relevance Dirichlet process mixture model characterizes the uncertainty of net load more effectively.

13.
Short text clustering is one of the fundamental tasks in natural language processing. Different from traditional documents, short texts are ambiguous and sparse due to their short form and the lack of recurrence in word usage from one text to another, making it very challenging to apply conventional machine learning algorithms directly. In this article, we propose two novel approaches for short text clustering: the collapsed Gibbs sampling infinite generalized Dirichlet multinomial mixture model (infinite GSGDMM) and the collapsed Gibbs sampling infinite Beta-Liouville multinomial mixture model (infinite GSBLMM). We adopt two flexible and practical priors for the multinomial distribution: in the first, the generalized Dirichlet distribution is integrated, while the second is based on the Beta-Liouville distribution. We evaluate the proposed approaches on two well-known benchmark datasets, namely Google News and Tweet. The experimental results demonstrate the effectiveness of our models compared to basic approaches that use Dirichlet priors. We further propose to improve the performance of our methods with an online clustering procedure. We also evaluate the performance of our methods on the outlier detection task, in which we achieve accurate results.
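For context, the collapsed Gibbs step for the baseline Dirichlet-multinomial mixture that these models generalize reassigns a document to a cluster with probability proportional to (m_k + α) times a ratio of smoothed word-count terms. A sketch of that baseline conditional follows; the generalized Dirichlet and Beta-Liouville variants replace the Dirichlet β terms, and this interface is illustrative rather than the paper's.

```python
import numpy as np

def log_cluster_prob(doc_counts, m_k, word_counts_k, n_k, alpha, beta, V):
    """Unnormalized log p(z_d = k | rest), document held out of cluster k.
    doc_counts: {word_id: count} for the document; m_k: #docs in cluster k;
    word_counts_k: per-word counts in k; n_k: total word tokens in k; V: vocab size."""
    logp = np.log(m_k + alpha)
    i = 0
    for w, c in doc_counts.items():
        for j in range(c):
            logp += np.log(word_counts_k.get(w, 0) + beta + j)  # numerator terms
            logp -= np.log(n_k + V * beta + i)                   # denominator terms
            i += 1
    return logp
```

Sampling proceeds by computing this quantity for every occupied cluster plus one empty cluster, then drawing the new assignment from the normalized probabilities.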

14.
The generalized Gaussian mixture model (GGMM) provides a flexible and suitable tool for many computer vision and pattern recognition problems. However, the generalized Gaussian distribution is unbounded, whereas in many applications the observed data are digitized and have bounded support. In this paper, we propose an extension of the generalized Gaussian distribution and present a new bounded generalized Gaussian mixture model (BGGMM), which includes the Gaussian mixture model (GMM), the Laplace mixture model (LMM), and the GGMM as special cases. The new distribution has the flexibility to fit different shapes of observed data, such as non-Gaussian and bounded-support data. To estimate the model parameters, we propose an alternating approach that minimizes an upper bound on the negative log-likelihood of the data. We quantify the performance of the BGGMM with simulations and real data.
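A bounded generalized Gaussian density can be obtained by renormalizing the standard GGD over a finite support. The sketch below does this by numerical integration under a common GGD parametrization; this is an illustrative assumption, since the paper's exact bounding construction is not given here.

```python
import numpy as np
from scipy.special import gamma
from scipy.integrate import quad

def ggd_pdf(x, mu, alpha, beta):
    """Standard GGD: beta / (2*alpha*Gamma(1/beta)) * exp(-(|x - mu| / alpha)**beta);
    beta = 2 gives a Gaussian shape, beta = 1 a Laplacian shape."""
    coef = beta / (2.0 * alpha * gamma(1.0 / beta))
    return coef * np.exp(-(np.abs(x - mu) / alpha) ** beta)

def bounded_ggd_pdf(x, mu, alpha, beta, lo, hi):
    """Renormalize the GGD over [lo, hi]; zero outside the bounded support."""
    mass, _ = quad(ggd_pdf, lo, hi, args=(mu, alpha, beta))
    x = np.asarray(x, dtype=float)
    inside = (x >= lo) & (x <= hi)
    return np.where(inside, ggd_pdf(x, mu, alpha, beta) / mass, 0.0)
```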

15.
Finite mixture models have been applied to different computer vision, image processing, and pattern recognition tasks. The majority of the work on finite mixture models has focused on mixtures for continuous data. However, many applications involve and generate discrete data, for which discrete mixtures are better suited. In this paper, we investigate the problem of discrete data modeling using finite mixture models. We propose a novel, well-motivated mixture that we call the multinomial generalized Dirichlet mixture, and compare it with other discrete mixtures. We designed experiments involving spatial color image database modeling and summarization, and text classification, to show the robustness, flexibility, and merits of our approach.

16.
In this paper, we present a fully Bayesian approach for generalized Dirichlet mixture estimation and selection. The estimation of the parameters is based on the Monte Carlo simulation technique of Gibbs sampling mixed with a Metropolis–Hastings step. We also obtain a posterior distribution which is conjugate to the generalized Dirichlet likelihood. For the selection of the number of clusters, we use the integrated likelihood. The performance of our Bayesian algorithm is tested and compared with the maximum likelihood approach through the classification of several synthetic and real data sets. The generalized Dirichlet mixture is also applied to the problem of IR eye modeling and is introduced as a probabilistic kernel for Support Vector Machines.

17.
Learning appropriate statistical models is a fundamental data analysis task which has been a topic of continuing interest. Recently, finite Dirichlet mixture models have proved to be an effective and flexible model learning technique in several machine learning and data mining applications. In this article, the problem of learning and selecting finite Dirichlet mixture models is addressed using an expectation propagation (EP) inference framework. Within the proposed EP learning method, all the involved parameters and the model complexity (i.e., the number of mixture components) can be evaluated simultaneously in a single optimization framework. Extensive simulations using synthetic data, along with two challenging real-world applications involving automatic image annotation and human action video categorization, demonstrate that our approach is able to achieve better results than comparable techniques.

18.
The prior distribution of an attribute in a naïve Bayesian classifier is typically assumed to be a Dirichlet distribution; this is called the Dirichlet assumption. The variables in a Dirichlet random vector can never be positively correlated and must have the same confidence level as measured by normalized variance. Both the generalized Dirichlet and the Liouville distributions include the Dirichlet distribution as a special case. These two multivariate distributions, also defined on the unit simplex, are employed to investigate the impact of the Dirichlet assumption in naïve Bayesian classifiers. We propose methods to construct appropriate generalized Dirichlet and Liouville priors for naïve Bayesian classifiers. Our experimental results on 18 data sets reveal that the generalized Dirichlet distribution has the best performance among the three distribution families. Not only is the Dirichlet assumption inappropriate; forcing all the variables in a prior to be positively correlated can also degrade the performance of the naïve Bayesian classifier.
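One way to see why the generalized Dirichlet is a richer prior here: its mean vector is built from per-component (a_i, b_i) pairs rather than a single concentration vector, so components need not share a confidence level and can be positively correlated. A sketch of the prior mean under the standard Connor–Mosimann form (an illustrative helper, not from the paper):

```python
import numpy as np

def generalized_dirichlet_mean(a, b):
    """Mean of GD(a, b), Connor-Mosimann form, including the implicit last component:
    E[x_i] = a_i/(a_i+b_i) * prod_{j<i} b_j/(a_j+b_j)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    frac = b / (a + b)
    lead = np.concatenate(([1.0], np.cumprod(frac)[:-1]))  # prod_{j<i} b_j/(a_j+b_j)
    mean = (a / (a + b)) * lead
    return np.append(mean, np.prod(frac))  # last component: prod_j b_j/(a_j+b_j)
```

The returned vector sums to one, and choosing b_i = a_{i+1} + b_{i+1} collapses it back to the ordinary Dirichlet mean.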

19.
In the Bayesian mixture modeling framework it is possible to infer the number of components needed to model the data, so it is unnecessary to restrict this number explicitly. Nonparametric mixture models sidestep the problem of finding the "correct" number of mixture components by assuming infinitely many components. In this paper, Dirichlet process mixture (DPM) models are cast as infinite mixture models, and inference using Markov chain Monte Carlo is described. The specification of the priors on the model parameters is often guided by mathematical and practical convenience. The primary goal of this paper is to compare the choice of conjugate and non-conjugate base distributions on a particular class of DPM models which is widely used in applications, the Dirichlet process Gaussian mixture model (DPGMM). We compare the computational efficiency and modeling performance of DPGMMs defined using a conjugate and a conditionally conjugate base distribution. We show that better density models can result from using a wider class of priors, with no or only a modest increase in computational effort.
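The infinite-mixture view of the DPM corresponds to a Chinese restaurant process prior over cluster assignments: an observation joins an existing cluster with probability proportional to its size, or opens a new one with probability proportional to the concentration α. A minimal sketch of that assignment rule:

```python
import numpy as np

def crp_assign(cluster_sizes, alpha, rng=None):
    """Sample a cluster index; index == len(cluster_sizes) means 'open a new cluster'."""
    rng = rng or np.random.default_rng()
    weights = np.append(np.asarray(cluster_sizes, dtype=float), alpha)
    return int(rng.choice(len(weights), p=weights / weights.sum()))
```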

20.
Finite mixture models are among the most widely and commonly used probabilistic techniques for image segmentation. Although the Gaussian is the best-known and most commonly used distribution in mixture models, it is certainly not the best approximation for image segmentation and other related image processing problems. In this paper, we propose and investigate the use of several other mixture models for image segmentation, based namely on the Dirichlet, generalized Dirichlet, and Beta–Liouville distributions, which offer more flexibility in data modeling. A maximum likelihood (ML) based algorithm is applied for estimating the resulting segmentation model's parameters. Spatial information is also employed for determining the number of regions in an image, and several color spaces are investigated and compared. The experimental results show that the proposed segmentation framework yields good overall performance on various color scenes, better than that of comparable techniques.
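ML estimation for such segmentation mixtures is typically carried out with EM, whose E-step computes per-pixel responsibilities as normalized products of mixing weights and component densities. A generic sketch, with `component_pdf` as a placeholder for any of the Dirichlet, generalized Dirichlet, or Beta–Liouville densities (the interface is an assumption for illustration):

```python
import numpy as np

def responsibilities(X, weights, params, component_pdf):
    """E-step: r[n, k] = pi_k * f(x_n | theta_k) / sum_j pi_j * f(x_n | theta_j)."""
    lik = np.column_stack([w * component_pdf(X, th) for w, th in zip(weights, params)])
    return lik / lik.sum(axis=1, keepdims=True)
```

The subsequent M-step re-estimates the mixing weights from the column means of this matrix and updates each component's parameters from its responsibility-weighted data.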
