Similar Documents
20 similar documents found (search time: 15 ms)
1.
Clustering ensembles: models of consensus and weak partitions   (total citations: 4; self-citations: 0; others: 4)
Clustering ensembles have emerged as a powerful method for improving both the robustness and the stability of unsupervised classification solutions. However, finding a consensus clustering from multiple partitions is a difficult problem that can be approached from graph-based, combinatorial, or statistical perspectives. This study extends previous research on clustering ensembles in several respects. First, we introduce a unified representation for multiple clusterings and formulate the corresponding categorical clustering problem. Second, we propose a probabilistic model of consensus using a finite mixture of multinomial distributions in a space of clusterings. A combined partition is found as a solution to the corresponding maximum-likelihood problem using the EM algorithm. Third, we define a new consensus function that is related to the classical intraclass variance criterion using the generalized mutual information definition. Finally, we demonstrate the efficacy of combining partitions generated by weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak clustering components. Combination accuracy is analyzed as a function of several parameters that control the power and resolution of component partitions as well as the number of partitions. We also analyze clustering ensembles with incomplete information and the effect of missing cluster labels on the quality of the overall consensus. Experimental results demonstrate the effectiveness of the proposed methods on several real-world data sets.
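The consensus model described above lends itself to a compact sketch. Below is a minimal EM loop for a finite mixture of multinomial (categorical) distributions over base-clustering labels, assuming hard integer labels and complete information; the function name, smoothing constants, and initialization are illustrative, not the authors' code.

```python
import numpy as np

def mixture_consensus(labels, k, n_iter=100, seed=0):
    """Sketch: EM for a finite mixture of categorical distributions.

    labels : (n, H) int array; labels[i, h] is the cluster id object i
             received from the h-th base partition (0-indexed).
    k      : number of consensus clusters (mixture components).
    Returns a length-n array of consensus labels.
    """
    rng = np.random.default_rng(seed)
    n, H = labels.shape
    n_vals = [labels[:, h].max() + 1 for h in range(H)]  # labels per partition
    pi = np.full(k, 1.0 / k)                             # mixing weights
    # theta[h][m, v] = P(label v in partition h | component m), random init
    theta = [rng.dirichlet(np.ones(v), size=k) for v in n_vals]
    for _ in range(n_iter):
        # E-step: log r[i, m] = log pi[m] + sum_h log theta[h][m, labels[i, h]]
        log_r = np.log(pi)[None, :].repeat(n, axis=0)
        for h in range(H):
            log_r += np.log(theta[h][:, labels[:, h]].T + 1e-12)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and per-partition categoricals
        pi = r.mean(axis=0)
        for h in range(H):
            counts = np.zeros((k, n_vals[h]))
            for v in range(n_vals[h]):
                counts[:, v] = r[labels[:, h] == v].sum(axis=0)
            theta[h] = (counts + 1e-6) / (counts + 1e-6).sum(axis=1, keepdims=True)
    return r.argmax(axis=1)
```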

2.
Over the past few years, there has been a renewed interest in the consensus clustering problem. Several new methods have been proposed for finding a consensus partition for a set of n data objects that optimally summarizes an ensemble. In this paper, we propose new consensus clustering algorithms with linear computational complexity in n. We consider clusterings generated with a random number of clusters, which we describe by categorical random variables. We introduce the idea of cumulative voting as a solution to the problem of cluster label alignment, where, unlike the common one-to-one voting scheme, a probabilistic mapping is computed. We seek a first summary of the ensemble that minimizes the average squared distance between the mapped partitions and the optimal representation of the ensemble, where the selection criterion for the reference clustering is defined by maximizing the information content as measured by the entropy. We describe cumulative vote weighting schemes and corresponding algorithms to compute an empirical probability distribution summarizing the ensemble. Given the arbitrary number of clusters of the input partitions, we formulate the problem of extracting the optimal consensus as that of finding a compressed summary of the estimated distribution that preserves maximum relevant information. An efficient solution is obtained using an agglomerative algorithm that minimizes the average generalized Jensen-Shannon divergence within each cluster. The empirical study demonstrates significant gains in accuracy and superior performance compared to several recent consensus clustering algorithms.
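As a rough illustration of the cumulative-voting idea, the sketch below maps each base partition onto a fixed reference partition with a probabilistic (many-to-many) mapping and averages the mapped memberships. Assumptions: the reference is simply the first partition (the paper instead selects it by maximizing entropy), and all names are hypothetical.

```python
import numpy as np

def cumulative_vote(partitions, ref_idx=0):
    """Sketch of cumulative voting: map every base partition onto a
    reference partition with a probabilistic (rather than one-to-one)
    mapping, then average the mapped cluster-membership matrices.

    partitions : list of length-n int label arrays (cluster counts may differ).
    Returns an (n, k0) vote matrix, k0 = #clusters in the reference.
    """
    ref = np.asarray(partitions[ref_idx])
    n, k0 = len(ref), ref.max() + 1
    U0 = np.eye(k0)[ref]                      # one-hot reference memberships
    votes = np.zeros((n, k0))
    for labels in partitions:
        labels = np.asarray(labels)
        U = np.eye(labels.max() + 1)[labels]  # one-hot base memberships
        # soft mapping W[c, m] approximating P(ref cluster m | base cluster c)
        W = U.T @ U0
        W /= W.sum(axis=1, keepdims=True)
        votes += U @ W                        # each object votes fractionally
    return votes / len(partitions)
```

A consensus labeling can then be read off as `votes.argmax(axis=1)`.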

3.
On combining multiple clusterings: an overview and a new perspective   (total citations: 1; self-citations: 0; others: 1)
Many problems can be reduced to the problem of combining multiple clusterings. In this paper, we first summarize different application scenarios for combining multiple clusterings and provide a new perspective that views the problem as a categorical clustering problem. We then show the connections between various consensus and clustering criteria and discuss complexity results for the problem. Finally, we propose a new method to determine the final clustering. Experiments on kinship terms and on clustering popular music from heterogeneous feature sets show the effectiveness of combining multiple clusterings.

4.
Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
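A minimal sketch of the adaptive resampling loop described above, assuming k-means as the base clusterer and a co-association-based proxy for the paper's consistency measure: sampling weights grow for points whose pairwise assignments have been ambiguous so far. Parameter names and defaults are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_ensemble(X, k, n_partitions=20, frac=0.7, seed=0):
    """Sketch: adaptive subsampling for ensemble generation. Points whose
    cluster assignments have been inconsistent so far are sampled more often.
    X : (n, d) array. Returns the partitions and a co-association matrix."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n = len(X)
    co = np.zeros((n, n))            # co-association votes
    counts = np.zeros((n, n))        # times a pair appeared together
    weights = np.full(n, 1.0 / n)
    partitions = []
    for t in range(n_partitions):
        idx = rng.choice(n, size=int(frac * n), replace=False, p=weights)
        labels = KMeans(n_clusters=k, n_init=10, random_state=t).fit_predict(X[idx])
        same = labels[:, None] == labels[None, :]
        co[np.ix_(idx, idx)] += same
        counts[np.ix_(idx, idx)] += 1
        partitions.append((idx, labels))
        # consistency proxy: how decisively each point co-occurs with others
        p = co / np.maximum(counts, 1)
        consistency = np.abs(2 * p - 1).mean(axis=1)  # 1 = stable, 0 = ambiguous
        weights = (1.0 - consistency) + 1e-3          # focus on unstable points
        weights /= weights.sum()
    return partitions, co / np.maximum(counts, 1)
```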

5.
We introduce a framework for the optimal extraction of flat clusterings from local cuts through cluster hierarchies. The extraction of a flat clustering from a cluster tree is formulated as an optimization problem, and a linear-complexity algorithm is presented that provides the globally optimal solution to this problem in semi-supervised as well as unsupervised scenarios. A collection of experiments is presented involving clustering hierarchies of different natures, a variety of real data sets, and comparisons with specialized methods from the literature.
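The extraction step can be pictured as a bottom-up dynamic program over the cluster tree: a node is kept as a flat cluster only if its own quality beats the best total quality achievable within its subtree. The toy sketch below, with a hypothetical (quality, children) tree encoding, illustrates that recursion; it is not the paper's algorithm or data structures.

```python
def best_cut(node):
    """Sketch of the optimal local-cut recursion. A node is (quality, children);
    it is selected if its own quality beats the best total achievable by its
    subtrees, otherwise the recursion keeps the subtrees' selections."""
    quality, children = node
    if not children:
        return quality, [node]
    child_best = [best_cut(c) for c in children]
    below = sum(q for q, _ in child_best)
    if quality >= below:
        return quality, [node]
    return below, [n for _, sel in child_best for n in sel]

# Toy tree: root with one leaf child and one internal child
tree = (1.0, [(0.7, []), (0.9, [(0.5, []), (0.6, [])])])
total, selected = best_cut(tree)   # total = 1.8, three leaves selected
```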

6.
An AHP-based model for selecting the optimal fuzzy classification   (total citations: 1; self-citations: 0; others: 1)
Different fuzzy classification algorithms often produce different fuzzy classifications on the same data set, raising the question of which method best reveals the true structure of the data. To address this, we take fuzzy classification validity indices as the evaluation criteria and apply the Analytic Hierarchy Process (AHP) to comprehensively evaluate each fuzzy classification, establishing a model for selecting the optimal fuzzy classification. Extensive experiments show that the optimal fuzzy classification selected by this model achieves a high pattern recognition rate and reveals the true structure of the data.
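For readers unfamiliar with AHP, the sketch below shows the standard principal-eigenvector weighting step applied to hypothetical validity indices; the comparison matrix and candidate scores are invented for illustration, not taken from the paper.

```python
import numpy as np

def ahp_weights(pairwise):
    """Derive criterion weights from an AHP pairwise-comparison matrix
    via its principal eigenvector (standard AHP step)."""
    vals, vecs = np.linalg.eig(np.asarray(pairwise, dtype=float))
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / w.sum()   # normalize (also fixes eigenvector sign)

# Hypothetical pairwise comparison of three validity indices,
# then weighted scoring of three candidate fuzzy classifications:
w = ahp_weights([[1, 3, 5], [1/3, 1, 2], [1/5, 1/2, 1]])
scores = np.array([[0.8, 0.6, 0.7],
                   [0.7, 0.9, 0.6],
                   [0.6, 0.5, 0.9]]) @ w
best = scores.argmax()   # index of the selected classification
```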

7.
Combining multiple clusterings using evidence accumulation   (total citations: 2; self-citations: 0; others: 2)
We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble, i.e., a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: 1) applying different clustering algorithms, and 2) applying the same clustering algorithm with different parameter values or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization, and the individual data partitions are combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm to this matrix. We have developed a theoretical framework for the analysis of the proposed clustering combination strategy and its evaluation, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split-and-merge strategy based on the k-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.
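The EAC pipeline reduces to a few lines: accumulate co-association votes over the ensemble, convert similarities to distances, and cut a hierarchical dendrogram. The sketch below uses average linkage and a fixed cluster count for simplicity, whereas the paper also studies other linkages and cut criteria.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def eac_consensus(partitions, n_clusters):
    """Sketch of evidence accumulation: build the n x n co-association
    matrix from an ensemble of label vectors, then cut an average-link
    dendrogram computed on the induced distances."""
    partitions = [np.asarray(p) for p in partitions]
    n = len(partitions[0])
    co = np.zeros((n, n))
    for labels in partitions:
        co += labels[:, None] == labels[None, :]   # vote: same cluster or not
    co /= len(partitions)
    dist = 1.0 - co                                # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust') - 1
```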

8.
Clustering has a long and rich history in a variety of scientific fields. Finding natural groupings of a data set is a hard task, as attested by the hundreds of clustering algorithms in the literature. Each clustering technique makes some assumptions about the underlying data set. If the assumptions hold, good clusterings can be expected. It is hard, and in some cases impossible, to satisfy all the assumptions. Therefore, it is beneficial to apply different clustering methods to the same data set, or the same method with varying input parameters, or both. The clusterings obtained can then be combined into a final clustering of better overall quality, a problem that has gained significant importance recently. Our contributions are a novel clique-based method for combining a collection of clusterings into a final clustering, and a novel output-sensitive clique-finding algorithm that works on large and dense graphs and produces output in a short amount of time. Extensive experimental studies on real and artificial data sets demonstrate the effectiveness of our contributions.
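As a rough sketch of clique-based combining (not the authors' algorithm, whose clique-finding routine is output-sensitive and tailored to large dense graphs): build a graph whose nodes are the clusters from all partitions, connect clusters with high Jaccard overlap, and read candidate consensus clusters off the maximal cliques. The threshold and names are illustrative.

```python
import numpy as np
import networkx as nx

def clique_combine(partitions, jaccard_thr=0.5):
    """Sketch: treat every cluster from every partition as a node, connect
    clusters whose member sets overlap strongly, and let each maximal
    clique of mutually similar clusters yield one candidate consensus set."""
    clusters = []
    for labels in partitions:
        labels = np.asarray(labels)
        clusters += [set(np.flatnonzero(labels == c)) for c in np.unique(labels)]
    G = nx.Graph()
    G.add_nodes_from(range(len(clusters)))
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            jac = len(clusters[i] & clusters[j]) / len(clusters[i] | clusters[j])
            if jac >= jaccard_thr:
                G.add_edge(i, j)
    # each maximal clique of mutually similar clusters votes for one set
    return [set.union(*(clusters[i] for i in clique))
            for clique in nx.find_cliques(G)]
```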

9.
Clustering is one of the most important unsupervised learning problems; it consists of finding a common structure in a collection of unlabeled data. However, due to the ill-posed nature of the problem, different runs of the same clustering algorithm applied to the same data set usually produce different solutions, and in this scenario choosing a single solution is quite arbitrary. On the other hand, in many applications the problem of multiple solutions becomes intractable, hence it is often more desirable to provide a limited group of “good” clusterings rather than a single solution. In the present paper we propose least squares consensus clustering. This technique extracts a small number of different clustering solutions from an initial (large) ensemble obtained by applying any clustering algorithm to a given data set. We also define a measure of quality and present a graphical visualization of each consensus clustering that makes the strength of the consensus immediately interpretable. We have carried out several numerical experiments on both synthetic and real data sets to illustrate the proposed methodology.

10.
Bagging-based spectral clustering ensemble selection   (total citations: 2; self-citations: 0; others: 2)
Traditional clustering ensemble methods combine all obtained clustering results at hand. However, a better clustering solution can often be achieved if only part of the available clustering results is combined. In this paper, we generalize the selective clustering ensemble algorithm proposed by Azimi and Fern and propose a novel clustering ensemble method, SELective Spectral Clustering Ensemble (SELSCE). The component clusterings of the ensemble system are generated by spectral clustering (SC), which is capable of engendering diverse committees. A random scaling parameter and the Nyström approximation are used to perturb SC into producing the components of the ensemble system. After the generation of component clusterings, the bagging technique, usually applied in supervised learning, is used to assess the component clusterings. We randomly pick part of the available clusterings to get a consensus result and then compute the normalized mutual information (NMI) or adjusted Rand index (ARI) between the consensus result and the component clusterings. Finally, the components are ranked by aggregating multiple NMI or ARI values. Experimental results on UCI datasets and images demonstrate that the proposed algorithm can achieve a better result than traditional clustering ensemble methods.
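The selection stage can be sketched as follows: draw random bags of component clusterings, form a consensus on each bag with any consensus function, and rank components by their average NMI against those consensus results. Everything here (function names, bag sizes) is illustrative, and consensus_fn is a placeholder for the reader's preferred consensus routine.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def rank_components(components, consensus_fn, n_bags=10, bag_frac=0.5, seed=0):
    """Sketch of bagging-style ensemble selection: score each component
    clustering by its average NMI against consensus results built from
    random bags of components, then rank components by that score."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(components))
    for _ in range(n_bags):
        size = max(2, int(bag_frac * len(components)))
        bag = rng.choice(len(components), size=size, replace=False)
        consensus = consensus_fn([components[i] for i in bag])
        for i, labels in enumerate(components):
            scores[i] += normalized_mutual_info_score(consensus, labels)
    order = np.argsort(-scores)          # best-ranked components first
    return order, scores / n_bags
```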

11.
Given a clustering algorithm, how can we adapt it to find multiple, nonredundant, high-quality clusterings? We focus on algorithms based on vector quantization and describe a framework for automatic ‘alternatization’ of such algorithms. Our framework works in both simultaneous and sequential learning formulations and can mine an arbitrary number of alternative clusterings. We demonstrate its applicability to various clustering algorithms (k-means, spectral clustering, constrained clustering, and co-clustering) and its effectiveness in mining a variety of datasets.
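One simple, well-known alternatization recipe (orthogonal projection, shown here only to make the idea concrete; it is not the paper's framework): remove the subspace spanned by the first solution's cluster means and cluster the residual.

```python
import numpy as np
from sklearn.cluster import KMeans

def alternative_clustering(X, k, seed=0):
    """Sketch: find an alternative clustering by projecting the data onto
    the subspace orthogonal to the first solution's cluster means, then
    clustering again (assumes k < d so the projection removes something)."""
    X = np.asarray(X)
    first = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    M = first.cluster_centers_                 # (k, d) matrix of means
    Q, _ = np.linalg.qr(M.T)                   # orthonormal basis of span(means)
    X_res = X - X @ Q @ Q.T                    # remove that subspace
    second = KMeans(n_clusters=k, n_init=10, random_state=seed + 1).fit(X_res)
    return first.labels_, second.labels_
```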

12.
侯勇 (Hou Yong), 郑雪峰 (Zheng Xuefeng). 《计算机应用》 (Journal of Computer Applications), 2013, 33(8): 2204-2207
Current popular clustering ensemble algorithms cannot provide an appropriate treatment scheme for the particular characteristics of different data sets. We therefore propose a new enhanced clustering ensemble algorithm based on data set characteristics, consisting of base clusterer generation, base clusterer selection, and a consensus function. According to the characteristics of the data set, the algorithm selects suitable base clusterers through a heuristic method, builds the final set of base clusterers, and produces the final clustering result. In experiments on the ecoli, leukaemia, and Vehicle benchmark data sets, the clustering errors of the proposed algorithm were 0.014, 0.489, and 0.479 respectively, consistently the lowest compared with the Bagging-based Structured Ensemble Algorithm (BSEA), Heterogeneous Clustering Ensemble (HCE), and Clustering-based Ensemble Classification (COEC) algorithms; and as candidate base clusterers were added, the normalized mutual information (NMI) values of the proposed algorithm remained higher than those of the compared algorithms. The experimental results show that, compared with the other clustering ensemble algorithms, the proposed algorithm achieves the highest clustering accuracy and the best scalability.

13.
Data clustering is a fundamental and very popular method of data analysis. Its subjective nature, however, means that different clustering algorithms or different parameter settings can produce widely varying and sometimes conflicting results. This has led to the use of clustering comparison measures to quantify the degree of similarity between alternative clusterings. Existing measures, though, can be limited in their ability to assess similarity and sometimes generate unintuitive results. They also cannot be applied to compare clusterings which contain different data points, an activity which is important for scenarios such as data stream analysis. In this paper, we introduce a new clustering similarity measure, known as ADCO, which aims to address some limitations of existing measures, by allowing greater flexibility of comparison via the use of density profiles to characterize a clustering. In particular, it adopts a ‘data mining style’ philosophy to clustering comparison, whereby two clusterings are considered to be more similar, if they are likely to give rise to similar types of prediction models. Furthermore, we show that this new measure can be applied as a highly effective objective function within a new algorithm, known as MAXIMUS, for generating alternate clusterings.

14.
International Journal of Computer Mathematics, 2012, 89(12): 2516-2526
In this paper, we generalize the hard clustering paradigm. While in this paradigm a data set is subdivided into disjoint clusters, we allow different clusters to have a nonempty intersection. The concept of hard clustering is then analysed in this general setting, and we show which specific properties hard clusterings possess in comparison to more general clusterings. We also introduce the concept of equivalent clusterings and show that in the case of hard clusterings equivalence and equality coincide. However, if more general clusterings are considered, these two concepts differ, and this implies the undesirable fact that equivalent clusterings can have different representations in the traditional view of clustering. We show how a matrix representation can solve this representation problem.
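To make the representation problem concrete: a binary membership matrix represents an overlapping clustering only up to a permutation of its columns, but the pairwise product U U^T is invariant under that permutation, so equivalent clusterings receive one canonical representation. A tiny sketch follows (the paper's exact matrix construction may differ).

```python
import numpy as np

def comembership(U):
    """Sketch of an order-independent matrix representation: U is an n x k
    binary membership matrix (overlaps allowed); U @ U.T counts, for each
    pair of objects, the clusters they share, and is invariant under any
    permutation of the cluster columns."""
    U = np.asarray(U)
    return U @ U.T

# Two labelings of the same overlapping clustering, columns permuted:
A = np.array([[1, 0], [1, 1], [0, 1]])
B = A[:, ::-1]
assert np.array_equal(comembership(A), comembership(B))
```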

15.
A considerable amount of work has been done in data clustering research during the last four decades, and a myriad of methods has been proposed focusing on different data types, proximity functions, cluster representation models, and cluster presentation. However, clustering remains a challenging problem due to its ill-posed nature: it is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, mainly because every clustering algorithm has its own bias resulting from the optimization of different criteria. This bias becomes even more important as, in almost all real-world applications, data is inherently high-dimensional and multiple clustering solutions might be available for the same data collection. In this respect, the problems of projective clustering and clustering ensembles have recently been defined to deal with the high dimensionality and multiple clusterings issues, respectively. Nevertheless, although these two issues are often encountered together, existing approaches to the two problems have been developed independently of each other. In our earlier work (Gullo et al., Proceedings of the International Conference on Data Mining (ICDM), 2009a) we introduced a novel clustering problem, called projective clustering ensembles (PCE): given a set (ensemble) of projective clustering solutions, the goal is to derive a projective consensus clustering, i.e., a projective clustering that complies with the object-to-cluster and feature-to-cluster assignments given in the ensemble. In this paper, we enhance our previous study and provide theoretical and experimental insights into the PCE problem. PCE is formalized as an optimization problem and is designed to satisfy desirable requirements on independence from the specific clustering ensemble algorithm, the ability to handle hard as well as soft data clustering, and different feature weightings. Two PCE formulations are defined: a two-objective optimization problem, in which the two objective functions respectively account for the object- and feature-based representations of the solutions in the ensemble, and a single-objective optimization problem, in which the object- and feature-based representations are embedded into a single function that measures the distance error between the projective consensus clustering and the projective ensemble. The significance of the proposed methods for solving the PCE problem has been shown through an extensive experimental evaluation based on several datasets, comparing against projective clustering and clustering ensemble baselines.

16.
Clustering algorithms support exploratory data analysis by grouping inputs that share similar features. The clustering of unlabelled data in particular is said to be a fiendishly difficult problem, because users have to choose not only a suitable clustering algorithm but also a suitable number of clusters. The known issues of existing clustering validity measures include instability in the presence of noise and restrictive assumptions about cluster shapes. In addition, they cannot evaluate individual clusters locally. We present a new measure for assessing and comparing different clusterings both on a global and on a local level. Our measure is based on the topological method of persistent homology, which is stable and unbiased towards cluster shapes. Based on our measure, we also describe a new visualization that displays similarities between different clusterings (using a global graph view) and supports their comparison at the individual cluster level (using a local glyph view). We demonstrate how our visualization helps detect different, but equally valid, clusterings of data sets from multiple application domains.

17.
Multiple clusterings are produced for various needs and reasons in both distributed and local environments. Combining multiple clusterings into a final clustering which has better overall quality has gained importance recently. It is also expected that the final clustering is novel, robust, and scalable. In order to solve this challenging problem we introduce a new graph-based method. Our method uses the evidence accumulated in the previously obtained clusterings, and produces a very good quality final clustering. The number of clusters in the final clustering is obtained automatically; this is another important advantage of our technique. Experimental test results on real and synthetically generated data sets demonstrate the effectiveness of our new method.

18.
Clustering is the process of grouping objects that are similar, where similarity between objects is usually measured by a distance metric. The groups formed by a clustering method are referred to as clusters. Clustering is a widely used activity with multiple applications ranging from biology to economics. Each clustering technique has some advantages and disadvantages, and some clustering algorithms may even require input parameters which strongly affect the result. In most cases, it is not possible to choose the best distance metric, the best clustering method, and the best input argument values for an input data set. Therefore, multiple clusterings can be obtained with several distance metrics, several clustering methods, and several input argument values, and these multiple clusterings can then be combined into a new final clustering of better quality. We propose a family of algorithms for combining multiple clusterings that are memory-efficient, scalable, robust, and intuitive. Our new algorithms offer tremendous speed gains and low memory requirements by working at the cluster level, while producing very good quality final clusters. Extensive experimental evaluations on some very challenging artificially generated and real data sets from a diverse set of domains establish the usefulness of our methods.

19.
Clustering algorithms have the annoying habit of finding clusters even when the data are generated randomly. Verifying that potential clusterings are real in some objective sense is receiving more attention as the number of new clustering algorithms and their applications grows. We consider one aspect of this question and study the stability of a hierarchical structure with a variation on a measure of stability proposed in the literature [1, 2]. Our measure of stability is appropriate for proximity matrices whose entries are on an ordinal scale. We randomly split the data set, cluster the two halves, and compare the two hierarchical clusterings with the clustering achieved on the entire data set. Two stability statistics, based on the Goodman-Kruskal rank correlation coefficient, are defined. The distributions of these statistics are estimated with Monte Carlo techniques for two clustering methods (single-link and complete-link) and under two conditions (randomly selected proximity matrices and proximity matrices with good hierarchical structure). The stability measures are applied to some real data sets.
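A hedged sketch of the split-half protocol, using SciPy's cophenetic distances and Kendall's tau as a stand-in for the Goodman-Kruskal coefficient (which SciPy does not provide); the names and the comparison via cophenetic ranks are illustrative simplifications.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform
from scipy.stats import kendalltau

def split_half_stability(X, method='complete', seed=0):
    """Sketch: cluster one random half of the data and the full set, then
    rank-correlate the cophenetic distances both hierarchies induce on the
    shared points. High correlation suggests a stable hierarchy."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    n = len(X)
    half = rng.choice(n, size=n // 2, replace=False)
    full_coph = cophenet(linkage(pdist(X), method=method))
    half_coph = cophenet(linkage(pdist(X[half]), method=method))
    # restrict the full-set cophenetic matrix to pairs within the half
    full_sq = squareform(full_coph)[np.ix_(half, half)]
    tau, _ = kendalltau(squareform(full_sq, checks=False), half_coph)
    return tau
```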

20.
Machine Learning - Cluster ensembles or consensus clusterings have been shown to be better than any standard clustering algorithm at improving accuracy and robustness across various sets of data....
