Similar literature
 Found 10 similar articles (search time: 188 ms)
1.
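张妨妨, 钱雪忠. 《计算机应用》 (Journal of Computer Applications), 2012, 32(9): 2476-2479
The traditional GK clustering algorithm cannot determine the number of clusters automatically and is sensitive to the initial cluster centers. To address these shortcomings, an improved GK clustering algorithm is proposed. The algorithm first determines the optimal number of clusters using a new validity index based on a weighted sum of inter-cluster separation and intra-cluster compactness. It then determines the initial cluster centers using an improved entropy-clustering approach. Finally, it clusters the data using the number of clusters and the cluster centers so obtained. Experimental results show that the new index accurately identifies the optimal number of clusters for data sets with overlapping clusters, and that the improved algorithm achieves higher clustering accuracy.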

2.
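孙秀娟, 刘希玉. 《计算机应用》 (Journal of Computer Applications), 2008, 28(12): 3244-3247
In the K-means algorithm, the number of clusters k is one of the key factors affecting clustering quality. Many cluster validity methods have been proposed for determining the optimal k, but none of them handles two kinds of data sets well: data sets whose classes (clusters) differ in density, and data sets with small inter-cluster distances (data sets containing merged clusters). To address this, a new cluster validity function is proposed, defined as the ratio of the square of the total length of the data feature axes to the minimum inter-cluster distance; the optimal number of clusters is the k at which this ratio reaches its minimum. In addition, to reduce the sensitivity of K-means to noise and outliers, cluster centers are computed with a weighted, improved K-means method. Experiments show that, compared with other algorithms, the K-wmeans algorithm based on the new validity function not only reduces the influence of noise and outliers on the clustering result but also handles the two kinds of data sets mentioned above effectively, markedly improving clustering quality.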

3.
4.
Cluster analysis is used to explore structure in unlabeled batch data sets in a wide range of applications. An important part of cluster analysis is validating the quality of computationally obtained clusters. A large number of different internal indices have been developed for validation in the offline setting. However, this concept cannot be directly extended to the online setting because streaming algorithms neither retain the data nor maintain a partition of it, both of which are needed by batch cluster validity indices. In this paper, we develop two incremental versions (with and without forgetting factors) of the Xie-Beni and Davies-Bouldin validity indices, and use them to monitor and control two streaming clustering algorithms (sk-means and online ellipsoidal clustering). In this context, our new incremental validity indices are more accurately viewed as performance monitoring functions. We also show that incremental cluster validity indices can send a distress signal to online monitors when evolving structure leads an algorithm astray. Our numerical examples indicate that the incremental Xie-Beni index with a forgetting factor is superior to the other three indices tested.
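For orientation, the batch (offline) Xie-Beni index that the incremental versions start from can be sketched as below. This is a minimal illustration, not the paper's incremental formulation: the function names `sq_dist` and `xie_beni`, the `memberships[i][k]` layout (cluster i, point k), and the default fuzzifier `m=2.0` are assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def xie_beni(points, centers, memberships, m=2.0):
    """Batch Xie-Beni index; lower values mean compact, well-separated clusters.

    memberships[i][k] is the membership of point k in cluster i (assumed layout).
    Requires at least two distinct centers.
    """
    n = len(points)
    c = len(centers)
    # Numerator: membership-weighted squared distances to the centers.
    compactness = sum(
        memberships[i][k] ** m * sq_dist(points[k], centers[i])
        for i in range(c)
        for k in range(n)
    )
    # Denominator: n times the smallest squared distance between two centers.
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(c)
        for j in range(i + 1, c)
    )
    return compactness / (n * separation)


# Two tight, far-apart clusters with crisp memberships.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers = [(0.0, 0.5), (10.0, 0.5)]
memberships = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
xb = xie_beni(points, centers, memberships)
```

Note that this batch form needs every point and its memberships at once, which is exactly what a streaming algorithm does not keep.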

5.
Gene expression data generated by DNA microarray experiments provide a vast resource for medical diagnosis and disease understanding. Unfortunately, the large amount of data makes it hard, sometimes even impossible, to understand the correct behavior of genes. In this work, we develop a possibilistic approach for mining gene microarray data. Our model consists of two steps. In the first step, we use possibilistic clustering to partition the data into groups (or clusters). The optimal number of clusters is evaluated automatically from the data using the Information Entropy as a validity measure. In the second step, we select from each computed cluster the most representative genes and model them as a graph called a proximity graph. This set of graphs (or hyper-graph) will be used to predict the function of new and previously unknown genes. Experimental results using real-world data sets show good performance and high prediction accuracy for our model.

6.
Fuzzy c-means clustering is an important non-supervised classification method for remote-sensing images and is based on type-1 fuzzy set theory. Type-1 fuzzy sets use singleton values to express the membership grade; therefore, such sets cannot describe the uncertainty of the membership grade. Interval type-2 fuzzy c-means (IT2FCM) clustering and relevant methods are based on interval type-2 fuzzy sets. Real vectors are used to describe the clustering centres, and the average values of the upper and lower membership grades are used to determine the classification of each pixel. Thus, the width information for interval clustering centres and interval membership grades is ignored. The main contribution of this article is to propose an improved IT2FCM* algorithm by adopting interval number distance (IND) and ranking methods, which use the width information of interval clustering centres and interval membership grades, thus distinguishing this method from existing fuzzy clustering methods. Three different IND definitions are tested, and the distance definition proposed by Li shows the best performance. The second contribution of this work is that two fuzzy cluster validity indices, FS- and XB-, are improved using the IND. Three types of multi/hyperspectral remote-sensing data sets are used to test this algorithm, and the experimental results show that the IT2FCM* algorithm based on the IND proposed by Li performs better than the IT2FCM algorithm using four cluster validity indices, the confusion matrix, and the kappa coefficient (κ). Additionally, the improved FS- index has more indicative ability than the original FS- index.

7.
Automatically identifying the possible number of clusters in a data set is an important task in unsupervised classification. To address this issue, the current paper focuses on the symmetry property of clusters. Point and line symmetry are two important attributes of data partitions. Here we propose line symmetry versions of eight well-known validity indices (XB, PBM, FCM, PS, FS, K, SV, and DB) to make them capable of identifying the accurate count of partitions in data sets containing clusters with the line symmetry property. The global optimality of two of these newly developed indices is established mathematically. Eight artificially generated data sets of varying dimensions, containing clusters of different convexities and shapes, and three real-life data sets are used in the experiments. Initially, to obtain different partitions, an existing genetic clustering technique that uses the line symmetry property (GALS clustering) is applied to the data sets while varying the count of clusters. We have also provided a comparative study of our proposed line-symmetry-based cluster validity indices with their point-symmetry-based versions and original versions based on Euclidean distance. The experimental results reveal that most of the line-symmetry-distance-based cluster validity indices perform better than their point-symmetry-based and Euclidean-distance-based versions.
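As a point of reference for the Euclidean baseline, the original Davies-Bouldin (DB) index, one of the eight indices named above, can be sketched as follows. This is a minimal sketch of the standard Euclidean definition only, not the line-symmetry version proposed in the paper; the function name `davies_bouldin` and the hard-label input format are assumptions for the example, and every cluster is assumed to be non-empty.

```python
import math


def davies_bouldin(points, labels, centers):
    """Davies-Bouldin index with Euclidean distance; lower is better.

    labels[k] is the cluster index of points[k]; every cluster must be non-empty.
    """
    k = len(centers)
    # Scatter of each cluster: mean distance of its members to the centroid.
    scatter = []
    for i in range(k):
        members = [p for p, lab in zip(points, labels) if lab == i]
        scatter.append(sum(math.dist(p, centers[i]) for p in members) / len(members))
    # For each cluster, take the worst (largest) similarity ratio to any other cluster.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 0.5), (10.0, 0.5)]
db = davies_bouldin(points, labels, centers)
```

A symmetry-based variant would replace the `math.dist` calls with a symmetry-based distance, which the abstract does not specify and so is not attempted here.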

8.
We propose an internal cluster validity index for a fuzzy c-means algorithm which combines a mathematical model for the fuzzy c-partition and a heuristic search for the number of clusters in the data. Our index resorts to information theoretic principles, and aims to assess the congruence between such a model and the data that have been observed. The optimal cluster solution represents a trade-off between discrepancy and the complexity of the underlying fuzzy c-partition. We begin by testing the effectiveness of the proposed index using two sets of synthetic data, one comprising a well-defined cluster structure and the other containing only noise. Then we use datasets arising from real life problems. Our results are compared to those provided by several available indices and their goodness is judged by an external measure of similarity. We find substantial evidence supporting our index as a credible alternative to the cluster validation problem, especially when it concerns structureless data.

9.
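石文峰, 商琳. 《计算机科学》 (Computer Science), 2017, 44(9): 45-48, 66
Fuzzy C-Means (FCM) is a fuzzy clustering algorithm that performs well and is widely used, but its sensitivity to the initial number of clusters makes choosing a good value of C very important. Determining the number of clusters is therefore a crucial step when using FCM for cluster analysis. Cluster validity is analyzed by extending the decision-theoretic rough set model, and the number of clusters for FCM is then determined, which avoids the effects of a poor initialization. This paper proposes a method for determining the number of fuzzy C-means clusters based on an extended rough set model and verifies the clustering quality through image segmentation experiments. By comparing the cost of the classification results under different numbers of clusters, the experiments obtain a good segmentation result, which is compared with the ant colony fuzzy C-means hybrid algorithm (AFHA) proposed by Z. Yu et al. in 2015 and with the improved AFHA algorithm (IAFHA). The results show that the proposed method gives better clustering results and a more evident image segmentation effect, achieves a higher Bezdek partition coefficient than AFHA and IAFHA, and also has a clear advantage on the Xie-Beni coefficient.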
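The Bezdek partition coefficient used in that comparison can be sketched as below. This is a minimal illustration of the standard definition (the mean of the squared memberships over all points), not code from the paper; the function name `partition_coefficient` and the `memberships[i][k]` layout (cluster i, point k) are assumptions for the example.

```python
def partition_coefficient(memberships):
    """Bezdek partition coefficient of a fuzzy partition.

    memberships[i][k] is the membership of point k in cluster i (assumed layout),
    with each point's memberships summing to 1. The value ranges from 1/c
    (maximally fuzzy) to 1 (crisp), where c is the number of clusters.
    """
    n = len(memberships[0])
    return sum(u * u for row in memberships for u in row) / n


# A crisp partition of four points into two clusters.
crisp = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
# A maximally fuzzy partition: every point belongs equally to both clusters.
fuzzy = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
pc_crisp = partition_coefficient(crisp)
pc_fuzzy = partition_coefficient(fuzzy)
```

A higher value indicates a less ambiguous partition, which is why the abstract reports a higher coefficient as an advantage.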

10.
Clustering categorical data, an important problem in data mining, has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is first studied. Based on the theory of rough sets, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed to deal with categorical data. The algorithm employs some of the high-quality rough subspaces to cluster the data and yields a robust and stable solution by exploiting the resulting partitions. We also introduce a cluster index to evaluate the solutions of clustering algorithms for categorical data. Experimental results for selected UCI data sets show that the proposed method produces better results than those obtained by other methods when evaluated in terms of cluster validity indexes.
