Similar literature
 20 similar documents found (search time: 7 ms)
1.
A measure of cluster quality is often needed for DNA microarray data analysis. In this paper, we introduce a new cluster validity index, which measures geometrical features of the data. The essential concept of this index is to evaluate the ratio of the squared total length of the data eigen-axes to the between-cluster separation. We show that this cluster validity index works well for data that contain clusters that are closely distributed or of different sizes. We verify the method using three simulated data sets, two real-world data sets and two microarray data sets. The experimental results show that the proposed index is superior to five other cluster validity indices: the partition coefficient (PC), the general silhouette index (GS), Dunn’s index (DI), the CH index and the I-index. We also give a theorem characterizing the situations in which the proposed index works well.

2.
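Clustering, as an unsupervised learning method, usually requires the user to supply the number of clusters. When prior knowledge is lacking, specifying clustering parameters by hand is impractical. Cluster validity indices, studied in recent years, are used to estimate the number of clusters and to assess the quality of a clustering. This paper proposes a new clustering algorithm based on a validity index that requires no clustering parameters. At each step the algorithm merges the two clusters whose merger increases the validity index the most or decreases it the least. The paper uses a gravitational model to measure similarity and applies a homogenization treatment to possible outliers. Experiments show that the algorithm correctly finds the number of clusters for specific data sets, that its clustering results have a lower error rate than those of other clustering methods, and that it is more efficient than typical validity-index-based clustering algorithms.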
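The merge rule described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the paper's actual validity index and its gravitational similarity measure are not given here, so `stand_in_index` (minimum centroid separation minus maximum cluster radius) is a hypothetical substitute, and `greedy_merge` is a name introduced for this example.

```python
from itertools import combinations
from math import dist


def centroid(cluster):
    """Mean point of a cluster given as a list of coordinate tuples."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def stand_in_index(partition):
    """Hypothetical validity index (NOT the paper's): larger is better.

    Separation is the smallest distance between two cluster centroids
    (0 when there is a single cluster); compactness is the largest
    distance from any point to its own cluster centroid.
    """
    cents = [centroid(c) for c in partition]
    separation = min((dist(a, b) for a, b in combinations(cents, 2)), default=0.0)
    compactness = max(dist(p, m) for c, m in zip(partition, cents) for p in c)
    return separation - compactness


def greedy_merge(points):
    """Start from singletons; at each step apply the merge that gives the
    highest index value, and remember the best partition seen overall."""
    partition = [[p] for p in points]
    best = (stand_in_index(partition), partition)
    while len(partition) > 1:
        candidates = []
        for i, j in combinations(range(len(partition)), 2):
            merged = [c for k, c in enumerate(partition) if k not in (i, j)]
            merged.append(partition[i] + partition[j])
            candidates.append((stand_in_index(merged), merged))
        score, partition = max(candidates, key=lambda t: t[0])
        if score > best[0]:
            best = (score, partition)
    return best


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_score, best_partition = greedy_merge(points)
```

On this toy input the highest-scoring partition has two clusters, so the number of clusters falls out of the search rather than being supplied. Evaluating every pair at every step is costly; the sketch makes no attempt to reproduce the efficiency claimed in the abstract.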

3.
Hierarchical clustering algorithms provide a set of nested partitions called a cluster hierarchy. Since the hierarchy is usually too complex, it is reduced to a single partition by using cluster validity indices. We show that the classical method is often not useful and we propose SEP, a new method that efficiently searches in an extended partition set. Furthermore, we propose a new cluster validity index, COP, since many of the commonly used indices cannot be used with SEP. Experiments performed with 80 synthetic and 7 real datasets confirm that SEP/COP is superior to the method currently used and, furthermore, that it is less sensitive to noise.

4.
In this paper we propose a clustering algorithm to cluster data with arbitrary shapes without knowing the number of clusters in advance. The proposed algorithm is a two-stage algorithm. In the first stage, a neural network incorporated with an ART-like training algorithm is used to cluster data into a set of multi-dimensional hyperellipsoids. In the second stage, a dendrogram is built to complement the neural network. We then use dendrograms and so-called tables of relative frequency counts to help analysts select trustworthy clustering results from among many different ones. Several data sets were tested to demonstrate the performance of the proposed algorithm.

5.
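Clustering is a classic data mining technique that has been widely applied in pattern recognition, machine learning, artificial intelligence and many other fields. Through cluster analysis, the deep structure of a target data set can be effectively uncovered. As a commonly used partitional clustering algorithm, K-means is simple to implement and can handle large data sets. However, owing to its convergence rule, the K-means algorithm still has shortcomings such as being very sensitive to the choice of initial cluster centers and being unable to properly...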

6.
The self-organizing map (SOM) has been widely used in many industrial applications. Classical clustering methods based on the SOM often fail to deliver satisfactory results, especially when clusters have arbitrary shapes. In this paper, through some preprocessing techniques for filtering out noise and outliers, we propose a new two-level SOM-based clustering algorithm using a clustering validity index based on inter-cluster and intra-cluster density. Experimental results on synthetic and real data sets demonstrate that the proposed clustering algorithm is able to cluster data better than the classical clustering algorithms based on the SOM, and to find an optimal number of clusters.

7.
This paper investigates a genetic programming (GP) approach aimed at the multi-objective design of hierarchical consensus functions for clustering ensembles. By this means, data partitions obtained via different clustering techniques can be continuously refined (via selection and merging) by a population of fusion hierarchies having complementary validation indices as objective functions. To assess the potential of the novel framework in terms of efficiency and effectiveness, a series of systematic experiments, involving eleven variants of the proposed GP-based algorithm and a comparison with basic as well as advanced clustering methods (of which some are clustering ensembles and/or multi-objective in nature), have been conducted on a number of artificial, benchmark and bioinformatics datasets. Overall, the results corroborate the perspective that having fusion hierarchies operating on well-chosen subsets of data partitions is a fine strategy that may yield significant gains in terms of clustering robustness.

8.
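A model for selecting the optimal fuzzy classification based on the analytic hierarchy process   Cited by: 1 (self-citations: 0; citations by others: 1)
Different fuzzy classification algorithms often produce different fuzzy classifications on the same data set. To determine which method best reveals the true structure of the data, fuzzy classification validity indices are taken as the evaluation criteria and the analytic hierarchy process is applied to evaluate the fuzzy classifications comprehensively, yielding a model for selecting the optimal fuzzy classification. Extensive experiments show that the optimal fuzzy classification selected by the model has a high pattern recognition rate and reveals the true structure of the data.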

9.
In a graph theory model, clustering is the process of dividing vertices into groups, with a higher density of edges within groups than between them. In this paper, we introduce a new clustering method for detecting such groups and use it to analyse some classic social networks. The new method has two distinguishing features: a non-binary hierarchical tree and overlapping clustering. A non-binary hierarchical tree is much smaller than the binary trees constructed by most traditional methods and therefore clearly highlights meaningful clusters, which significantly reduces the manual effort needed for cluster selection. The method is tested on several benchmark data sets for which the community structure was known beforehand, and the results indicate that it is a sensitive and accurate method for extracting community structure from social networks.

10.
As humans, we have innate faculties that allow us to efficiently segment groups of objects. Computers, to some degree, can be programmed with similar categorical capabilities, which stem from exploratory data analysis. Out of the various subsets of data reasoning, clustering provides insight into the structure and relationships of input samples situated in a number of distributions. To determine these relationships, many clustering methods rely on one or more human inputs; the most important being the number of distributions, c, to seek. This work investigates a technique for estimating the number of clusters from a general type of data called relational data. Several numerical examples are presented to illustrate the effectiveness of the proposed method.

11.
In this paper the problem of automatically clustering a data set is posed as a multiobjective optimization (MOO) problem in which a set of cluster validity indices is optimized simultaneously. The proposed multiobjective clustering technique utilizes a recently developed simulated annealing based multiobjective optimization method as the underlying optimization strategy. Here a variable number of cluster centers is encoded in each string, so the number of clusters present in different strings varies over a range. The points are assigned to different clusters based on the newly developed point symmetry based distance rather than the existing Euclidean distance. Two cluster validity indices, one based on the Euclidean distance, XB-index, and another recently developed point symmetry distance based cluster validity index, Sym-index, are optimized simultaneously in order to determine the appropriate number of clusters present in a data set. Thus the proposed clustering technique is able to detect both the proper number of clusters and the appropriate partitioning from data sets having either hyperspherical clusters or point symmetric clusters. A new semi-supervised method is also proposed in the present paper to select a single solution from the final Pareto optimal front of the proposed multiobjective clustering technique. The efficacy of the proposed algorithm is shown for seven artificial data sets and six real-life data sets of varying complexities. Results are also compared with those obtained by another multiobjective clustering technique, MOCK, and by two single-objective genetic algorithm based automatic clustering techniques, VGAPS clustering and GCUK clustering.
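For orientation, the Euclidean XB-index named in this abstract is commonly computed as the membership-weighted within-cluster squared distance divided by n times the smallest squared distance between two centers, with lower values indicating a better partition. The sketch below follows that standard definition; the function name `xie_beni` and the toy data are assumptions for illustration, and neither the Sym-index nor the point symmetry distance of the paper is reproduced.

```python
from itertools import combinations
from math import dist


def xie_beni(points, centers, memberships):
    """Xie-Beni index (lower is better).

    memberships[i][k] is the degree to which point k belongs to cluster i;
    a crisp partition uses only 0.0 and 1.0. Requires at least two centers.
    """
    n = len(points)
    # Compactness: squared memberships times squared point-to-center distances.
    compactness = sum(
        (u_ik ** 2) * dist(x, v) ** 2
        for v, row in zip(centers, memberships)
        for x, u_ik in zip(points, row)
    )
    # Separation: smallest squared distance between any two centers.
    separation = min(dist(a, b) ** 2 for a, b in combinations(centers, 2))
    return compactness / (n * separation)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# Natural split: left pair versus right pair.
good = xie_beni(points, [(0.0, 1.0), (10.0, 1.0)],
                [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])

# Poor split: bottom pair versus top pair.
bad = xie_beni(points, [(5.0, 0.0), (5.0, 2.0)],
               [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])
```

Here `good` evaluates to 0.01 and `bad` to 6.25, so minimizing the index prefers the natural split. In a multiobjective setting such as the one described above, this value would be one of several objectives rather than the sole selection criterion.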

12.
We propose an internal cluster validity index for a fuzzy c-means algorithm which combines a mathematical model for the fuzzy c-partition and a heuristic search for the number of clusters in the data. Our index resorts to information theoretic principles, and aims to assess the congruence between such a model and the data that have been observed. The optimal cluster solution represents a trade-off between discrepancy and the complexity of the underlying fuzzy c-partition. We begin by testing the effectiveness of the proposed index using two sets of synthetic data, one comprising a well-defined cluster structure and the other containing only noise. Then we use datasets arising from real life problems. Our results are compared to those provided by several available indices and their goodness is judged by an external measure of similarity. We find substantial evidence supporting our index as a credible alternative for the cluster validation problem, especially when it concerns structureless data.

13.
We prove a unique property of single-link distance, based on which an algorithm is designed for data clustering. The property states that a single-link cluster is a subset with inter-subset distance greater than intra-subset distance, and vice versa. Among the major linkages (single, complete, average, centroid, median, and Ward's), only single-link distance has this property. Based on this property we introduce monotonic sequences of iclusters (i.e., single-link clusters) to model the phenomenon that a natural cluster has a dense kernel and the density decreases as we move from the kernel to the boundary. A monotonic sequence of iclusters is a sequence of nested iclusters such that an icluster in the sequence is a dominant child (in terms of size) of the icluster before it. Our data clustering algorithm is monotonic sequence based. We classify a data set of one monotonic sequence into two classes by splitting the sequence into two parts: the kernel part and the surrounding part. For a data set of multiple monotonic sequences, each leaf monotonic sequence represents the kernel of a class, which then “grows” by absorbing nearby non-kernel points. Experiments show that this algorithm compares favorably in effectiveness with other clustering algorithms.

14.
There is an interest in the problem of identifying different partitions of a given set of units obtained according to different subsets of the observed variables (multiple cluster structures). A model-based procedure has been previously developed for detecting multiple cluster structures from independent subsets of variables. The method relies on model-based clustering methods and on a comparison among mixture models using the Bayesian Information Criterion. A generalization of this method which allows the use of any model-selection criterion is considered. A new approach combining the generalized model-based procedure with variable-clustering methods is proposed. The usefulness of the new method is shown using simulated and real examples. Monte Carlo methods are employed to evaluate the performance of various approaches. Data matrices with two cluster structures are analyzed taking into account the separation of clusters, the heterogeneity within clusters and the dependence of cluster structures.

15.
Nuclear magnetic resonance (NMR) spectroscopy has emerged as a technology that can provide metabolite information within organ systems in vivo. In this study, we introduced a new method of employing a clustering algorithm to develop a diagnostic model that can differentially diagnose a single unknown subject in a disease with well-defined group boundaries. We used three tests to assess the suitability and the accuracy required for diagnostic purposes of the four clustering algorithms we investigated (K-means, Fuzzy, Hierarchical, and Medoid Partitioning). To accomplish this goal, we studied the striatal metabolomic profile of R6/2 Huntington disease (HD) transgenic mice and that of wild type (WT) mice using high field in vivo proton NMR spectroscopy (9.4 T). We tested all four clustering algorithms (1) with the original R6/2 HD mice and WT mice, (2) with unknown mice, whose status had been determined via genotyping, and (3) with the ability to separate the original R6/2 mice into the two age subgroups (8 and 12 weeks old). Only our diagnostic models that employed ROC-supervised Fuzzy, unsupervised Fuzzy, and ROC-supervised K-means Clustering passed all three stringent tests with 100% accuracy, indicating that they may be used for diagnostic purposes.

16.
The utilisation of clustering algorithms based on the optimisation of prototypes in neural networks is demonstrated for unsupervised learning. Stimulated by common clustering methods of this type (learning vector quantisation [LVQ, GLVQ] and K-means), a globally operating algorithm was developed to cope with known shortcomings of existing tools. This algorithm and K-means (representing the common methods) were applied to the problem of clustering pre-processed EEG patterns. It can be shown that the algorithm based on global random optimisation may find an optimal solution repeatedly, whereas K-means provides different sub-optimal solutions with respect to the quality measure defined as objective function. The results are presented and the performance of the algorithms is discussed.
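The dependence of K-means on its starting prototypes, which this abstract contrasts with global optimisation, can be seen on a very small example. The sketch below is a plain Lloyd-style K-means with the sum of squared errors as the quality measure; the function names and the four-point data set are assumptions for illustration, and the globally operating algorithm of the paper is not reproduced.

```python
from math import dist


def kmeans(points, centers, iterations=20):
    """Lloyd-style K-means from the given initial centers.

    Returns the final centers and the sum of squared errors (SSE).
    """
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    sse = sum(min(dist(p, c) ** 2 for c in centers) for p in points)
    return centers, sse


# Four corners of a wide, flat rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

# Start near the left and right sides: converges to the natural split.
_, sse_good = kmeans(points, [(0.0, 0.5), (10.0, 0.5)])

# Start on the bottom and top edges: stuck in a poor local optimum.
_, sse_poor = kmeans(points, [(5.0, 0.0), (5.0, 1.0)])
```

Both starting configurations are fixed points of the iteration, yet the first gives an SSE of 1.0 and the second an SSE of 100.0. A common mitigation is to restart from several initialisations and keep the best result, which is weaker than the global search the abstract describes.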

17.
In this paper, we present an efficient global illumination technique, and then we discuss the results of its extensive experimental validation. The technique is a hybrid of cluster-based hierarchical and progressive radiosity techniques, which does not require storing links between interacting surfaces and clusters. We tested our technique by applying a multistage validation procedure, which we designed specifically for global illumination solutions. First, we experimentally validate the algorithm against analytically derived and measured real-world data to check how calculation speed is traded for lighting simulation accuracy for various clustering and meshing scenarios. Then we test the algorithm performance and rendering quality by directly comparing the virtual and real-world images of a complex environment.

18.
The problem of classifying an image into different homogeneous regions is viewed as the task of clustering the pixels in the intensity space. In particular, satellite images contain landcover types, some of which cover significantly large areas, while others (e.g., bridges and roads) occupy much smaller regions. Detecting regions or clusters of such widely varying sizes presents a challenging task. A modified differential evolution based fuzzy clustering technique is proposed in this article. Real-coded encoding of the cluster centres is used for this purpose. Results demonstrating the effectiveness of the proposed technique are provided for several synthetic and real-life data sets as well as for some benchmark functions. Different landcover regions in remote sensing imagery have also been classified using the proposed technique to establish its efficiency. Statistical significance tests have been performed to establish the superiority of the proposed algorithm.

19.
In 1996, Erich Novak and Klaus Ritter developed a global optimization algorithm that uses hyperbolic cross points (HCPs). In this paper we develop a hybrid algorithm for clustering called CMHCP that uses a modified version of this HCP algorithm for global search and alternating optimization for local search. The program has been tested extensively with very promising results and high efficiency. This provides a nice addition to the arsenal of global optimization in clustering. In the process, we also analyze the smoothness of some reformulated objective functions.

20.
This paper is concerned with a stepwise mode of objective function-based fuzzy clustering. A revealed structure in data becomes refined in a successive manner by starting with the most dominant relationships and proceeding with its more detailed characterization. Technically, the proposed process develops a so-called hierarchy of clusters. Given the underlying clustering mechanism of the fuzzy C-means (FCM), the produced architecture is referred to as a hierarchical FCM or hierarchical FCM tree (HFCM tree). We discuss the design of the tree, demonstrating how its growth is guided by a certain mapping criterion. It is also shown how a structure at the higher level is effectively used to build clusters at the consecutive level by making use of the conditional FCM. Detailed investigations of computational complexity contrast a stepwise development of clusters with a single-step clustering completed for the equivalent number of clusters occurring in total at all final nodes of the HFCM tree. The analysis quantifies the significant reduction in computation achieved by the stepwise refinement of the clusters. Experimental studies include synthetic data as well as data coming from the machine learning repository.
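As background to the FCM mechanism this abstract builds on, the standard algorithm alternates two updates: each membership is set to u_ik = 1 / Σ_j (d_ik / d_jk)^(2/(m-1)), and each center becomes the mean of the points weighted by u_ik^m. The sketch below implements only this flat, unconditional FCM; the function names, the fuzzifier default m = 2 and the toy data are assumptions for illustration, and the conditional FCM and HFCM tree construction of the paper are not reproduced.

```python
from math import dist


def update_memberships(points, centers, m=2.0):
    """Return u where u[k][i] is the membership of point k in cluster i."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in points:
        d = [dist(x, v) for v in centers]
        if 0.0 in d:
            # The point coincides with one or more centers: split full
            # membership among those centers to avoid dividing by zero.
            hits = d.count(0.0)
            row = [1.0 / hits if di == 0.0 else 0.0 for di in d]
        else:
            row = [1.0 / sum((di / dj) ** power for dj in d) for di in d]
        u.append(row)
    return u


def update_centers(points, u, m=2.0):
    """Each center is the mean of all points weighted by u[k][i] ** m.

    Assumes every cluster has some nonzero membership.
    """
    dims = range(len(points[0]))
    centers = []
    for i in range(len(u[0])):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[d] for w, x in zip(weights, points)) / total for d in dims
        ))
    return centers


def fcm(points, centers, m=2.0, iterations=50):
    """Alternate the two updates for a fixed number of iterations."""
    for _ in range(iterations):
        u = update_memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, update_memberships(points, centers, m)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fcm(points, [(0.0, 0.0), (10.0, 10.0)])
```

Each row of `u` sums to 1, and on this toy input the two centers settle close to the means of the two groups of points. A fixed iteration count is used for brevity; a tolerance on the change in centers would be the more usual stopping rule.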
