期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Discrete data clustering using finite mixture models

Nizar Bouguila^{Author Vitae} Walid ElGuebaly Author Vitae 《Pattern recognition》2009,42(1):33-42

Finite mixture models have been applied for different computer vision, image processing and pattern recognition tasks. The majority of the work done concerning finite mixture models has focused on mixtures for continuous data. However, many applications involve and generate discrete data for which discrete mixtures are better suited. In this paper, we investigate the problem of discrete data modeling using finite mixture models. We propose a novel, well motivated mixture that we call the multinomial generalized Dirichlet mixture. The novel model is compared with other discrete mixtures. We designed experiments involving spatial color image databases modeling and summarization, and text classification to show the robustness, flexibility and merits of our approach. 相似文献

2.

Proportional data modeling via selection and estimation of a finite mixture of scaled Dirichlet distributions

Nuha Zamzami Rua Alsuroji Oboh Eromonsele Nizar Bouguila 《Computational Intelligence》2020,36(2):459-485

This paper proposes an unsupervised algorithm for learning a finite mixture of scaled Dirichlet distributions. Parameters estimation is based on the maximum likelihood approach, and the minimum message length (MML) criterion is proposed for selecting the optimal number of components. This research work is motivated by the flexibility issues of the Dirichlet distribution, the widely used model for multivariate proportional data, which has prompted a number of scholars to search for generalizations of the Dirichlet. By introducing the extra parameters of the scaled Dirichlet, several useful statistical models could be obtained. Experimental results are presented using both synthetic and real datasets. Moreover, challenging real-world applications are empirically investigated to evaluate the efficiency of our proposed statistical framework. 相似文献

3.

A hierarchical modeling approach for clustering probability density functions

《Computational statistics & data analysis》2014

The problem of clustering probability density functions is emerging in different scientific domains. The methods proposed for clustering probability density functions are mainly focused on univariate settings and are based on heuristic clustering solutions. New aspects of the problem associated with the multivariate setting and a model-based perspective are investigated. The novel approach relies on a hierarchical mixture modeling of the data. The method is introduced in the univariate context and then extended to multivariate densities by means of a factorial model performing dimension reduction. Model fitting is carried out using an EM-algorithm. The proposed method is illustrated through simulated experiments and applied to two real data sets in order to compare its performance with alternative clustering strategies. 相似文献

4.

Classification of large data sets with mixture models via sufficient EM

P.M. Steiner M. Hudec 《Computational statistics & data analysis》2007,51(11):5416-5428

For the classification of very large data sets with a mixture model approach a two-step strategy for the estimation of the mixture is proposed. In the first step data are scaled down using compression techniques. Data compression consists of clustering the single observations into a medium number of groups and the representation of each group by a prototype, i.e. a triple of sufficient statistics (mean vector, covariance matrix, number of observations compressed). In the second step the mixture is estimated by applying an adapted EM algorithm (called sufficient EM) to the sufficient statistics of the compressed data. The estimated mixture allows the classification of observations according to their maximum posterior probability of component membership. The performance of sufficient EM in clustering a real data set from a web-usage mining application is compared to standard EM and the TwoStep clustering algorithm as implemented in SPSS. It turns out that the algorithmic efficiency of the sufficient EM algorithm is much more higher than for standard EM. While the TwoStep algorithm is even faster the results show a lack of stability. 相似文献

5.

A finite mixture model for simultaneous high-dimensional clustering, localized feature selection and outlier rejection

Nizar Bouguila Khaled Almakadmeh 《Expert systems with applications》2012,39(7):6641-6656

Model-based approaches and in particular finite mixture models are widely used for data clustering which is a crucial step in several applications of practical importance. Indeed, many pattern recognition, computer vision and image processing applications can be approached as feature space clustering problems. For complex high-dimensional data, however, the use of these approaches presents several challenges such as the presence of many irrelevant features which may affect the speed and also compromise the accuracy of the used learning algorithm. Another problem is the presence of outliers which potentially influence the resulting model’s parameters. For this purpose, we propose and discuss an algorithm that partitions a given data set without a priori information about the number of clusters, the saliency of the features or the number of outliers. We illustrate the performance of our approach using different applications involving synthetic data, real data and objects shape clustering. 相似文献

6.

Dynamic Type-2 Fuzzy Dependent Dirichlet Regression Mixture clustering model

《Applied Soft Computing》2017

In this paper, a new dynamic Interval Type-2 Fuzzy Dependent Dirichlet Piecewise Regression Mixture (IT2FDDPRM) clustering model is proposed. The model overcomes shortcomings of both Dependent Dirichlet Process Mixture (DDPM) technique and Interval Type-2 Fuzzy C-regression Clustering Model (IT2FCRM). DDPM method demonstrates that the probability of assigning data to a cluster including the maximum number of data among all clusters is higher, and it ignores the similarity of data to a cluster. However, the new IT2FDDPRM clustering technique supports assignment of data to a cluster which has the most similarity to them. It also allows the model to generate infinite number of clusters. Moreover, it has the capability of segmenting functions assigned to clusters. The model is validated using statistical tests, three validity functions, and mean square error of the model. The results of numerical experiments show that the proposed method has superior performance to other clustering techniques in literature. 相似文献

7.

Online short text clustering using infinite extensions of discrete mixture models

Samar Hannachi Fatma Najar Hafsa Ennajari Nizar Bouguila 《Computational Intelligence》2023,39(5):759-782

Short text clustering is one of the fundamental tasks in natural language processing. Different from traditional documents, short texts are ambiguous and sparse due to their short form and the lack of recurrence in word usage from one text to another, making it very challenging to apply conventional machine learning algorithms directly. In this article, we propose two novel approaches for short texts clustering: collapsed Gibbs sampling infinite generalized Dirichlet multinomial mixture model infinite GSGDMM) and collapsed Gibbs sampling infinite Beta-Liouville multinomial mixture model (infinite GSBLMM). We adopt two flexible and practical priors to the multinomial distribution where in the first one the generalized Dirichlet distribution is integrated, while the second one is based on the Beta-Liouville distribution. We evaluate the proposed approaches on two famous benchmark datasets, namely, Google News and Tweet. The experimental results demonstrate the effectiveness of our models compared to basic approaches that use Dirichlet priors. We further propose to improve the performance of our methods with an online clustering procedure. We also evaluate the performance of our methods for the outlier detection task, in which we achieve accurate results. 相似文献

8.

Pair-copula based mixture models and their application in clustering

Anandarup Roy Swapan K. Parui 《Pattern recognition》2014

Finite mixtures are often used to perform model based clustering of multivariate data sets. In real life applications, such data may exhibit complex nonlinear form of dependence among the variables. Also, the individual variables (margins) may follow different families of distributions. Most of the existing mixture models are unable to accommodate these two aspects of the data. This paper presents a finite mixture model that involves a pair-copula based construction of a multivariate distribution. Such a model de-couples the margins and the dependence structures. Hence, the margins can be modeled using different families. Again, many possible dependence structures can also be studied using different copulas. The resulting mixture model (called DVMM) is then capable of capturing a broad family of distributions including non-Gaussian models. Here we study DVMM in the context of clustering of multivariate data. We design an expectation maximization procedure for estimating the mixture parameters. We perform extensive experiments on the basis of a number of well-known data sets. A detailed evaluation of the clustering quality obtained by DVMM in comparison to other mixture models is presented. The experimental results show that the performance of DVMM is quite satisfactory. 相似文献

9.

Competitive EM algorithm for finite mixture models

Baibo Zhang Author Vitae Author Vitae Xing Yi Author Vitae 《Pattern recognition》2004,37(1):131-144

In this paper, we present a novel competitive EM (CEM) algorithm for finite mixture models to overcome the two main drawbacks of the EM algorithm: often getting trapped at local maxima and sometimes converging to the boundary of the parameter space. The proposed algorithm is capable of automatically choosing the clustering number and selecting the “split” or “merge” operations efficiently based on the new competitive mechanism we propose. It is insensitive to the initial configuration of the mixture component number and model parameters.Experiments on synthetic data show that our algorithm has very promising performance for the parameter estimation of mixture models. The algorithm is also applied to the structure analysis of complicated Chinese characters. The results show that the proposed algorithm performs much better than previous methods with slightly heavier computation burden. 相似文献

10.

SNCStream+: Extending a high quality true anytime data stream clustering algorithm

《Information Systems》2016

Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that algorithms undergo both concept drifts and evolutions, which must be taken into account by the clustering algorithm, allowing incremental clustering updates. In this paper we present the Social Network Clusterer Stream⁺ (SNCStream⁺). SNCStream⁺ tackles the data stream clustering problem as a network formation and evolution problem, where instances and micro-clusters form clusters based on homophily. Our proposal has its parameters analyzed and it is evaluated in a broad set of problems against literature baselines. Results show that SNCStream⁺ achieves superior clustering quality (CMM), and feasible processing time and memory space usage when compared to the original SNCStream and other proposals of the literature. 相似文献

11.

Fuzzy clustering to estimate the parameters of block mixture models

G. Govaert M. Nadif 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(5):415-422

Finite mixture models are being increasingly used to provide model-based cluster analysis. To tackle the problem of block clustering which aims to organize the data into homogeneous blocks, recently we have proposed a block mixture model; we have considered this model under the classification maximum likelihood approach and we have developed a new algorithm for simultaneous partitioning based on the classification EM algorithm. From the estimation point of view, classification maximum likelihood approach yields inconsistent estimates of the parameters and in this paper we consider the block clustering problem under the maximum likelihood approach; unfortunately, the application of the classical EM algorithm for the block mixture model is not direct: difficulties arise due to the dependence structure in the model and approximations are required. Considering the block clustering problem under a fuzzy approach, we propose a fuzzy block clustering algorithm to approximate the EM algorithm. To illustrate our approach, we study the case of binary data by using a Bernoulli block mixture. 相似文献

12.

Attribute weighted mercer kernel based fuzzy clustering algorithm for general non-spherical datasets 总被引：4，自引：0，他引：4

Hongbin Shen Jie Yang Shitong Wang Xiaojun Liu 《Soft Computing - A Fusion of Foundations, Methodologies and Applications》2006,10(11):1061-1073

Clustering analysis is an important topic in artificial intelligence, data mining and pattern recognition research. Conventional clustering algorithms, for instance, the famous Fuzzy C-means clustering algorithm (FCM), assume that all the attributes are equally relevant to all the clusters. However in most domains, especially for high-dimensional dataset, some attributes are irrelevant, and some relevant ones are less important than others with respect to a specific class. In this paper, such imbalances between the attributes are considered and a new weighted fuzzy kernel-clustering algorithm (WFKCA) is presented. WFKCA performs clustering in a kernel feature space mapped by mercer kernels. Compared with the conventional hard kernel-clustering algorithm, WFKCA can yield the meaningful prototypes (cluster centers) of the clusters. Numerical convergence properties of WFKCA are also discussed. For in-depth studies, WFKCA is extended to WFKCA2, which has been demonstrated as a useful tool for clustering incomplete data. Numerical examples demonstrate the effectiveness of the new WFKCA algorithm 相似文献

13.

Swendsen-Wang Cuts sampling for spatially constrained Dirichlet process mixture models

《Graphical Models》2014,76(5):496-506

Spatially constrained Dirichlet process mixture models are springing up in image processing in recent years. However, inference for the model is NP-hard. Gibbs sampling which is a generic Markov chain Monte Carlo technique is commonly employed for the model inference. It needs to traverse all the nodes of the constructed graph in each iteration. The sampling process hardly crosses over the intermediate low probabilistic state. In addition, it is not well informed by the spatial relationship in the sampling process. In this paper, a spatially dependent split-merge algorithm for sampling the MRF/DPMM model based on Swendsen-Wang Cuts is proposed. It is a state of the art algorithm which combines the spatial relationship to direct the sampling, and lessen the mixing time drastically. In this algorithm, a set of nodes are being frozen together according to the discriminative probability of the edges between neighboring nodes. The frozen nodes update their states simultaneously in contrast to the single node update in a Gibbs sampling. The final step of the algorithm is to accept the proposed new state according to the Metropolis Hasting scheme, in which only the ratio of posterior distribution needs to be calculated in each iteration. Experimental results demonstrated that the proposed sampling algorithm is able to reduce the mixing time considerably. At the same time, it can obtain comparably stable results with a random initial state. 相似文献

14.

A maximum likelihood approximation method for Dirichlet's parameter estimation

Nicolas Wicker Jean Muller Ravi Kiran Reddy Kalathur 《Computational statistics & data analysis》2008,52(3):1315-1322

Dirichlet distributions are natural choices to analyse data described by frequencies or proportions since they are the simplest known distributions for such data apart from the uniform distribution. They are often used whenever proportions are involved, for example, in text-mining, image analysis, biology or as a prior of a multinomial distribution in Bayesian statistics. As the Dirichlet distribution belongs to the exponential family, its parameters can be easily inferred by maximum likelihood. Parameter estimation is usually performed with the Newton-Raphson algorithm after an initialisation step using either the moments or Ronning's methods. However this initialisation can result in parameters that lie outside the admissible region. A simple and very efficient alternative based on a maximum likelihood approximation is presented. The advantages of the presented method compared to two other methods are demonstrated on synthetic data sets as well as for a practical biological problem: the clustering of protein sequences based on their amino acid compositions. 相似文献

15.

Novel mixture allocation models for topic learning

Kamal Maanicshah Manar Amayri Nizar Bouguila 《Computational Intelligence》2024,40(2):e12641

Latent Dirichlet allocation (LDA) is one of the major models used for topic modelling. A number of models have been proposed extending the basic LDA model. There has also been interesting research to replace the Dirichlet prior of LDA with other pliable distributions like generalized Dirichlet, Beta-Liouville and so forth. Owing to the proven efficiency of using generalized Dirichlet (GD) and Beta-Liouville (BL) priors in topic models, we use these versions of topic models in our paper. Furthermore, to enhance the support of respective topics, we integrate mixture components which gives rise to generalized Dirichlet mixture allocation and Beta-Liouville mixture allocation models respectively. In order to improve the modelling capabilities, we use variational inference method for estimating the parameters. Additionally, we also introduce an online variational approach to cater to specific applications involving streaming data. We evaluate our models based on its performance on applications related to text classification, image categorization and genome sequence classification using a supervised approach where the labels are used as an observed variable within the model. 相似文献

16.

On the use of Bernoulli mixture models for text classification 总被引：1，自引：0，他引：1

A. E. 《Pattern recognition》2002,35(12):2705-2710

Mixture modelling of class-conditional densities is a standard pattern recognition technique. Although most research on mixture models has concentrated on mixtures for continuous data, emerging pattern recognition applications demand extending research efforts to other data types. This paper focuses on the application of mixtures of multivariate Bernoulli distributions to binary data. More concretely, a text classification task aimed at improving language modelling for machine translation is considered. 相似文献

17.

Active curve axis Gaussian mixture models

Baibo Changshui Xing 《Pattern recognition》2005,38(12):2351-2362

Gaussian Mixture Models (GMM) have been broadly applied for the fitting of probability density function. However, due to the intrinsic linearity of GMM, usually many components are needed to appropriately fit the data distribution, when there are curve manifolds in the data cloud.

In order to solve this problem and represent data with curve manifolds better, in this paper we propose a new nonlinear probability model, called active curve axis Gaussian model. Intuitively, this model can be imagined as Gaussian model being bent at the first principal axis. For estimating parameters of mixtures of this model, the EM algorithm is employed.

Experiments on synthetic data and Chinese characters show that the proposed nonlinear mixture models can approximate distributions of data clouds with curve manifolds in a more concise and compact way than GMM does. The performance of the proposed nonlinear mixture models is promising. 相似文献

18.

Learning from partially supervised data using mixture models and belief functions

E. Côme Author Vitae L. Oukhellou Author Vitae T. Denœux Author Vitae Author Vitae 《Pattern recognition》2009,42(3):334-91

This paper addresses classification problems in which the class membership of training data are only partially known. Each learning sample is assumed to consist of a feature vector x_i∈X and an imprecise and/or uncertain “soft” label m_i defined as a Dempster-Shafer basic belief assignment over the set of classes. This framework thus generalizes many kinds of learning problems including supervised, unsupervised and semi-supervised learning. Here, it is assumed that the feature vectors are generated from a mixture model. Using the generalized Bayesian theorem, an extension of Bayes’ theorem in the belief function framework, we derive a criterion generalizing the likelihood function. A variant of the expectation maximization (EM) algorithm, dedicated to the optimization of this criterion is proposed, allowing us to compute estimates of model parameters. Experimental results demonstrate the ability of this approach to exploit partial information about class labels. 相似文献

19.

Test for homogeneity in gamma mixture models using likelihood ratio

《Computational statistics & data analysis》2014

A testing problem of homogeneity in gamma mixture models is studied. It is found that there is a proportion of the penalized likelihood ratio test statistic that degenerates to zero. The limiting distribution of this statistic is found to be the chi-bar-square distributions. The degeneration is due to the negative-definiteness of a complicated random matrix, depending on the shape parameter under the null hypothesis. In light of this dependency, bounds on the distribution are introduced and a weighted average procedure is proposed. Simulation suggests that the results are accurate and consistent, and that the asymptotic result applies to the maximum likelihood estimator, obtained via an Expectation–Maximization algorithm. 相似文献

20.

Gamma mixture models for target recognition 总被引：6，自引：0，他引：6

Andrew R. 《Pattern recognition》2000,33(12):2045-2054

This paper considers a mixture model approach to automatic target recognition using high-resolution radar measurements. The mixture model approach is motivated from several perspectives including requirements that the target classifier is robust to uncertainty in amplitude scaling, rotation and translation of the target. Estimation of the model parameters is achieved using the expectation-maximisation (EM) algorithm. Gamma mixtures are introduced and the re-estimation equations derived. The models are applied to the classification of high-resolution radar range profiles of ships and results compared with a previously published self-organising map approach. 相似文献