首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 468 毫秒
1.
In this paper, a genetic clustering algorithm based on dynamic niching with niche migration (DNNM-clustering) is proposed. It is an effective and robust approach to clustering on the basis of a similarity function relating to the approximate density shape estimation. In the new algorithm, a dynamic identification of the niches with niche migration is performed at each generation to automatically evolve the optimal number of clusters as well as the cluster centers of the data set without invoking cluster validity functions. The niches can move slowly under the migration operator which makes the dynamic niching method independent of the radius of the niches. Compared to other existing methods, the proposed clustering method exhibits the following robust characteristics: (1) robust to the initialization, (2) robust to clusters volumes (ability to detect different volumes of clusters), and (3) robust to noise. Moreover, it is free of the radius of the niches and does not need to pre-specify the number of clusters. Several data sets with widely varying characteristics are used to demonstrate its superiority. An application of the DNNM-clustering algorithm in unsupervised classification of the multispectral remote sensing image is also provided.  相似文献   

2.
Clustering is a useful tool for finding structure in a data set. The mixture likelihood approach to clustering is a popular clustering method, in which the EM algorithm is the most used method. However, the EM algorithm for Gaussian mixture models is quite sensitive to initial values and the number of its components needs to be given a priori. To resolve these drawbacks of the EM, we develop a robust EM clustering algorithm for Gaussian mixture models, first creating a new way to solve these initialization problems. We then construct a schema to automatically obtain an optimal number of clusters. Therefore, the proposed robust EM algorithm is robust to initialization and also different cluster volumes with automatically obtaining an optimal number of clusters. Some experimental examples are used to compare our robust EM algorithm with existing clustering methods. The results demonstrate the superiority and usefulness of our proposed method.  相似文献   

3.
A novel robust validity index is proposed for subtractive clustering (SC) algorithm. Although the SC algorithm is a simple and fast data clustering method with robust properties against outliers and noise; it has two limitations. First, the cluster number generated by the SC algorithm is influenced by a given threshold. Second, the cluster centers obtained by SC are based on data that have the highest potential values but may not be the actual cluster centers. The validity index is a function as a measure of the fitness of a partition for a given data set. To solve the first problem, this study proposes a novel robust validity index that evaluates the fitness of a partition generated by SC algorithm in terms of three properties: compactness, separation and partition index. To solve the second problem, a modified algorithm based on distance relations between data and cluster centers is designed to ascertain the actual centers generated by the SC algorithm. Experiments confirm that the preferences of the proposed index outperform all others.  相似文献   

4.
Identifying the optimal cluster number and generating reliable clustering results are necessary but challenging tasks in cluster analysis. The effectiveness of clustering analysis relies not only on the assumption of cluster number but also on the clustering algorithm employed. This paper proposes a new clustering analysis method that identifies the desired cluster number and produces, at the same time, reliable clustering solutions. It first obtains many clustering results from a specific algorithm, such as Fuzzy C-Means (FCM), and then integrates these different results as a judgement matrix. An iterative graph-partitioning process is implemented to identify the desired cluster number and the final result. The proposed method is a robust approach as it is demonstrated its effectiveness in clustering 2D data sets and multi-dimensional real-world data sets of different shapes. The method is compared with cluster validity analysis and other methods such as spectral clustering and cluster ensemble methods. The method is also shown efficient in mesh segmentation applications. The proposed method is also adaptive because it not only works with the FCM algorithm but also other clustering methods like the k-means algorithm.  相似文献   

5.
Although there have been many researches on cluster analysis considering feature (or variable) weights, little effort has been made regarding sample weights in clustering. In practice, not every sample in a data set has the same importance in cluster analysis. Therefore, it is interesting to obtain the proper sample weights for clustering a data set. In this paper, we consider a probability distribution over a data set to represent its sample weights. We then apply the maximum entropy principle to automatically compute these sample weights for clustering. Such method can generate the sample-weighted versions of most clustering algorithms, such as k-means, fuzzy c-means (FCM) and expectation & maximization (EM), etc. The proposed sample-weighted clustering algorithms will be robust for data sets with noise and outliers. Furthermore, we also analyze the convergence properties of the proposed algorithms. This study also uses some numerical data and real data sets for demonstration and comparison. Experimental results and comparisons actually demonstrate that the proposed sample-weighted clustering algorithms are effective and robust clustering methods.  相似文献   

6.
基于稀疏Parzen窗密度估计的快速自适应相似度聚类方法   总被引:1,自引:1,他引:0  
相似度聚类方法(Similarity-based clustering method,SCM)因其简单易实现和具有鲁棒性而广受关注.但由于内含相似度聚类算法(Similarity clustering algorithm,SCA)的高时间复杂度和凝聚型层次聚类(Agglomerative hierarchicalclu...  相似文献   

7.
针对模糊C-均值聚类(FCM)算法对噪声敏感、容易收敛到局部极小值的问题,提出一种基于交叉熵的模糊聚类算法。通过引入交叉熵重新定义了传统FCM算法的目标函数,利用交叉熵度量样本隶属度之间的差异性,并采用拉格朗日求解方法和朗伯W函数解决了目标函数的优化问题,此外,分析了样本划分矩阵的分布情况,依据分布特性对噪声样本进行识别。人工数据集合和标准数据集加噪的实验结果表明,该算法提高了传统FCM算法的抗干扰能力,具有更强的鲁棒性,噪声样本识别的准确率较高。  相似文献   

8.
江浩  陈兴蜀杜敏 《计算机应用》2013,33(11):3071-3075
热点话题挖掘是舆情监控的重要技术基础。针对现有的论坛热点话题挖掘方法没有解决数据中词汇噪声较多且热度评价方式单一的问题,提出一种基于主题聚簇评价的热点话题挖掘方法。采用潜在狄里克雷分配主题模型对论坛文本数据建模,对映射到主题空间的文档集去除主题噪声后用优化聚类中心选择的K-means++算法进行聚类,最后从主题突发度、主题纯净度和聚簇关注度三个方面对聚簇进行评价。通过实验分析得出主题噪声阈值设置为0.75,聚类中心数设置为50时,可以使聚类质量与聚类速度达到最优。真实数据集上的测试结果表明该方法可以有效地将聚簇按出现热点话题的可能性排序。最后设计了热点话题的展示方法。  相似文献   

9.
Clustering is concerned with the discovery of interesting groupings of records in a database. Many algorithms have been developed to tackle clustering problems in a variety of application domains. In particular, some of them have been used in bioinformatics research to uncover inherent clusters in gene expression microarray data. In this paper, we show how some popular clustering algorithms have been used for this purpose. Based on experiments using simulated and real data, we also show that the performance of these algorithms can be further improved. For more effective clustering of gene expression microarray data, which is typically characterized by a lot of noise, we propose a novel evolutionary algorithm called evolutionary clustering (EvoCluster). EvoCluster encodes an entire cluster grouping in a chromosome so that each gene in the chromosome encodes one cluster. Based on such encoding scheme, it makes use of a set of reproduction operators to facilitate the exchange of grouping information between chromosomes. The fitness function that the EvoCluster adopts is able to differentiate between how relevant a feature value is in determining a particular cluster grouping. As such, instead of just local pairwise distances, it also takes into consideration how clusters are arranged globally. Unlike many popular clustering algorithms, EvoCluster does not require the number of clusters to be decided in advance. Also, patterns hidden in each cluster can be explicitly revealed and presented for easy interpretation even by casual users. For performance evaluation, we have tested EvoCluster using both simulated and real data. Experimental results show that it can be very effective and robust even in the presence of noise and missing values. Also, when correlating the gene expression microarray data with DNA sequences, we were able to uncover significant biological binding sites (both previously known and unknown) in each cluster discovered by EvoCluster.  相似文献   

10.
Bagging for path-based clustering   总被引:3,自引:0,他引:3  
A resampling scheme for clustering with similarity to bootstrap aggregation (bagging) is presented. Bagging is used to improve the quality of path-based clustering, a data clustering method that can extract elongated structures from data in a noise robust way. The results of an agglomerative optimization method are influenced by small fluctuations of the input data. To increase the reliability of clustering solutions, a stochastic resampling method is developed to infer consensus clusters. A related reliability measure allows us to estimate the number of clusters, based on the stability of an optimized cluster solution under resampling. The quality of path-based clustering with resampling is evaluated on a large image data set of human segmentations.  相似文献   

11.
In the fuzzy c-means (FCM) clustering algorithm, almost none of the data points have a membership value of 1. Moreover, noise and outliers may cause difficulties in obtaining appropriate clustering results from the FCM algorithm. The embedding of FCM into switching regressions, called the fuzzy c-regressions (FCRs), still has the same drawbacks as FCM. In this paper, we propose the alpha-cut implemented fuzzy clustering algorithms, referred to as FCMalpha, which allow the data points being able to completely belong to one cluster. The proposed FCMalpha algorithms can form a cluster core for each cluster, where data points inside a cluster core will have a membership value of 1 so that it can resolve the drawbacks of FCM. On the other hand, the fuzziness index m plays different roles for FCM and FCMalpha. We find that the clustering results obtained by FCMalpha are more robust to noise and outliers than FCM when a larger m is used. Moreover, the cluster cores generated by FCMalpha are workable for various data shape clusters, so that FCMalpha is very suitable for embedding into switching regressions. The embedding of FCMalpha into switching regressions is called FCRalpha. The proposed FCRalpha provides better results than FCR for environments with noise or outliers. Numerical examples show the robustness and the superiority of our proposed methods.  相似文献   

12.
覃华  詹娟娟  苏一丹 《控制与决策》2017,32(10):1796-1802
针对近邻传播聚类算法偏向参数难选定、生成的簇数目偏多等问题,提出一种概率无向图模型的近邻传播聚类算法.首先为样本数据构建概率无向图模型,利用极大团和势函数计算无向图中数据样本的概率密度,将此概率密度作为一种聚类先验知识注入近邻传播算法的偏向参数中,提高算法的聚类效率;并用高斯降噪和簇归并方法进一步提升算法的聚类精度.在UCI数据集上的实验结果表明,所提出算法的聚类效率和精度均优于相比较的同类算法.  相似文献   

13.
Clustering categorical data arising as an important problem of data mining has recently attracted much attention. In this paper, the problem of unsupervised dimensionality reduction for categorical data is first studied. Based on the theory of rough sets, the attributes of categorical data are decomposed into a number of rough subspaces. A novel clustering ensemble algorithm based on rough subspaces is then proposed to deal with categorical data. The algorithm employs some of rough subspaces with high quality to cluster the data and yields a robust and stable solution by exploiting the resulting partitions. We also introduce a cluster index to evaluate the solution of clustering algorithm for categorical data. Experimental results for selected UCI data sets show that the proposed method produces better results than those obtained by other methods when being evaluated in terms of cluster validity indexes.  相似文献   

14.
Unsupervised clustering for datasets with severe outliers inside is a difficult task. In this approach, we propose a cluster-dependent multi-metric clustering approach which is robust to severe outliers. A dataset is modeled as clusters each contaminated by noises of cluster-dependent unknown noise level in formulating outliers of the cluster. With such a model, a multi-metric Lp-norm transformation is proposed and learnt which maps each cluster to the most Gaussian distribution by minimizing some non-Gaussianity measure. The approach is composed of two consecutive phases: multi-metric location estimation (MMLE) and multi-metric iterative chi-square cutoff (ICSC). Algorithms for MMLE and ICSC are proposed. It is proved that the MMLE algorithm searches for the solution of a multi-objective optimization problem and in fact learns a cluster-dependent multi-metric Lq-norm distance and/or a cluster-dependent multi-kernel defined in data space for each cluster. Experiments on heavy-tailed alpha-stable mixture datasets, Gaussian mixture datasets with radial and diffuse outliers added respectively, and the real Wisconsin breast cancer dataset and lung cancer dataset show that the proposed method is superior to many existent robust clustering and outlier detection methods in both clustering and outlier detection performances.  相似文献   

15.
一个好的聚类算法应该是用户输入参数少,对噪声不敏感,能够发现任意形状,可以处理高维数据,具有可解释性和可扩展性.将聚类分析应用于地理信息系统中,可以实现对GIS数据信息概括和综合.文中提出一种基于距离阈值相邻的聚类算法,通过距离阈值可达的方式逐个将对象加入到已知聚类中,可以发现任意形状的聚类并对噪声数据有很好的分离效果,实验中将该算法应用于地理信息系统中的数据挖掘实现上,结果证明此算法对于实现GIS聚类具有满意的效果.  相似文献   

16.
孙秀娟  刘希玉 《计算机应用》2008,28(12):3244-3247
在K-means算法中,聚类数k是影响聚类质量的关键因素之一。目前,已经提出了许多确定最佳k值的聚类有效性方法,但这些方法都不能很好地处理两种数据集:类(簇)密度不同的数据集和类间距比较小的数据集(含有合并簇的数据集)。为此,提出了一种新的聚类有效性函数,该函数定义为数据特征轴总长度的平方与最小类间距的比值,最佳聚类数为这个比值达到最小时对应的k值。同时,为减小K-means算法对噪声和孤立点数据的敏感性,使用了基于加权的改进K-平均的方法计算类中心。实验证明,与其他算法相比,基于新聚类有效性函数的K-wmeans算法不仅降低了噪声和孤立点数据对聚类结果的影响,而且能有效地处理上面提到的两种数据集,明显提高了数据聚类质量。  相似文献   

17.
Clustering Incomplete Data Using Kernel-Based Fuzzy C-means Algorithm   总被引:3,自引:0,他引:3  
  相似文献   

18.
现有的非负矩阵分解方法既忽略数据的非局部结构,又难以有效应对噪声和野值点。为了解决上述问题,提出一种新的用于聚类的鲁棒结构正则化非负矩阵分解算法。所提出的算法分别构建一个近邻图和一个最大熵图描述数据的局部结构和非局部结构,并使用L2,1范数代价函数尝试解决噪声问题,从而学习到鲁棒具有判别力的表征。给出一个最优的迭代算法求解两个非负因子,该优化算法的收敛性已被理论和实验证明。在七个图像数据集上的聚类实验结果表明,所提出的算法在无噪声和有噪声情况下聚类均优于其他主流方法。  相似文献   

19.

The data cluster tendency is an emerging need for exploring the big data cluster analysis tasks. The data are evaluated based on the number of clusters is known as cluster tendency. Many visualization techniques have been developed for the detection of cluster tendency. Some of the existing techniques include Visual Assessment Tendency (VAT), spectral-based VAT (SpecVAT), and improved VAT (iVAT), are considerably succeeded for an assessment of cluster tendency for small datasets. A bigVAT is another method that was recently developed for the estimation of cluster tendency of big data. It is perfect for deriving the clustering tendency in visual form for big data. However, it is intractable to explore the data clusters for large volumes of data objects. The proposed work addresses the clustering problem of bigVAT with the derivation of sampling-based crisp partitions. The crisp partitions will accurately predict the cluster labels of data objects. This research is based on big synthetic and big real-life datasets for demonstrating the performance efficiency of the proposed work.

  相似文献   

20.
This paper addresses three major issues associated with conventional partitional clustering, namely, sensitivity to initialization, difficulty in determining the number of clusters, and sensitivity to noise and outliers. The proposed robust competitive agglomeration (RCA) algorithm starts with a large number of clusters to reduce the sensitivity to initialization, and determines the actual number of clusters by a process of competitive agglomeration. Noise immunity is achieved by incorporating concepts from robust statistics into the algorithm. RCA assigns two different sets of weights for each data point: the first set of constrained weights represents degrees of sharing, and is used to create a competitive environment and to generate a fuzzy partition of the data set. The second set corresponds to robust weights, and is used to obtain robust estimates of the cluster prototypes. By choosing an appropriate distance measure in the objective function, RCA can be used to find an unknown number of clusters of various shapes in noisy data sets, as well as to fit an unknown number of parametric models simultaneously. Several examples, such as clustering/mixture decomposition, line/plane fitting, segmentation of range images, and estimation of motion parameters of multiple objects, are shown  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号