Similar literature
 20 similar documents found (search time: 546 ms)
1.
A considerable amount of work has been done in data clustering research during the last four decades, and a myriad of methods have been proposed focusing on different data types, proximity functions, cluster representation models, and cluster presentation. However, clustering remains a challenging problem due to its ill-posed nature: it is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, mainly because every clustering algorithm has its own bias resulting from the optimization of different criteria. This bias becomes even more important because, in almost all real-world applications, data are inherently high-dimensional and multiple clustering solutions might be available for the same data collection. In this respect, the problems of projective clustering and clustering ensembles have recently been defined to deal with the high-dimensionality and multiple-clusterings issues, respectively. Nevertheless, although these two issues are often encountered together, existing approaches to the two problems have been developed independently of each other. In our earlier work (Gullo et al., Proceedings of the International Conference on Data Mining (ICDM), 2009a), we introduced a novel clustering problem, called projective clustering ensembles (PCE): given a set (ensemble) of projective clustering solutions, the goal is to derive a projective consensus clustering, i.e., a projective clustering that complies with the information on the object-to-cluster and feature-to-cluster assignments given in the ensemble. In this paper, we enhance our previous study and provide theoretical and experimental insights into the PCE problem. PCE is formalized as an optimization problem and is designed to satisfy desirable requirements: independence from the specific clustering ensemble algorithm, and the ability to handle hard as well as soft data clustering and different feature weightings. Two PCE formulations are defined: a two-objective optimization problem, in which the two objective functions respectively account for the object-based and feature-based representations of the solutions in the ensemble, and a single-objective optimization problem, in which the object-based and feature-based representations are embedded into a single function that measures the distance error between the projective consensus clustering and the projective ensemble. The significance of the proposed methods for solving the PCE problem is shown through an extensive experimental evaluation on several datasets, in comparison with projective clustering and clustering ensemble baselines.

2.
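Cluster ensembles combine cluster members that differ from one another to some degree and can obtain results superior to those of any single algorithm; they have been one of the hot topics in clustering research in recent years. This paper proposes ANNCE, a cluster ensemble algorithm based on adaptive nearest neighbors, which automatically selects a suitable nearest-neighbor range for each data point according to the local density of the data distribution. Compared with existing KNN-based algorithms, ANNCE not only removes the need to determine many parameters experimentally, a known problem of KNN-based algorithms, but also further improves clustering quality.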

3.
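A study of diversity measures in cluster ensembles   Cited by: 14 (self-citations: 0; citations by others: 14)
The diversity of an ensemble is regarded as a key factor affecting ensemble learning. Many diversity measures have been proposed for classifier ensembles, but how to measure the diversity of a cluster ensemble has so far received little study. The authors study seven diversity measures for cluster ensembles and investigate experimentally how these seven measures relate to the performance of various cluster ensemble algorithms under different average member accuracies, different ensemble sizes, and different data distributions. The experiments show that there is no monotonic relationship between these diversity measures and cluster ensemble performance, but that their correlation with ensemble performance is fairly high when the average member accuracy is high, the ensemble size is moderate, and the data contain uniformly distributed clusters. Finally, some practical suggestions are given for using diversity measures to guide the generation of cluster ensembles.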

4.
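A cluster ensemble algorithm based on spectral clustering   Cited by: 13 (self-citations: 7; citations by others: 6)
Zhou Lin  Ping Xijian  Xu Sen  Zhang Tao 《Acta Automatica Sinica》2012,38(8):1335-1342
Spectral clustering is a family of high-performing clustering algorithms that has emerged in recent years and can cluster data of arbitrary shape, but it is sensitive to the scale parameter. Exploiting the robustness and generalization ability of cluster ensembles, this paper proposes a cluster ensemble algorithm based on spectral clustering. The algorithm first uses the intrinsic properties of spectral clustering to construct diverse cluster members; it then computes a similarity matrix with the connected-triple algorithm, which enriches the similarity information between data points; finally, it applies spectral clustering to the similarity matrix to obtain the final ensemble result. To scale the algorithm to large applications, the Nyström sampling algorithm is used so that only the similarity matrices among the randomly sampled data points, and between the sampled points and the remaining points, are computed, which effectively reduces the computational complexity. The algorithm exploits the strong performance of spectral clustering while avoiding the need to select the scale parameter precisely. Experimental results show that, compared with other common cluster ensemble algorithms, the proposed algorithm is superior and more effective, and handles problems such as data clustering and image segmentation well.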

5.
The problem of obtaining a single “consensus” clustering solution from a multitude or ensemble of clusterings of a set of objects has attracted much interest recently because of its numerous practical applications. While a wide variety of approaches, including graph partitioning, maximum likelihood, genetic algorithms, and voting-merging, have been proposed so far to solve this problem, virtually all of them work on hard partitionings, i.e., partitionings in which an object is a member of exactly one cluster in any individual solution. However, many clustering algorithms, such as fuzzy c-means, naturally output soft partitionings of data, and forcibly hardening these partitions before applying a consensus method potentially involves loss of valuable information. In this article we propose several consensus algorithms that can be applied directly to soft clusterings. Experimental results over a variety of real-life datasets are also provided to show that using soft clusterings as input does offer significant advantages, especially when dealing with vertically partitioned data.
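
To make the contrast between hard and soft inputs concrete, the following is a minimal sketch of one way a consensus method could consume soft memberships directly. It is not one of the algorithms proposed in the article: the function name `soft_coassociation`, the dot-product similarity, and the toy membership values are all assumptions chosen for illustration.

```python
def soft_coassociation(ensemble):
    """Average, over all soft clusterings, of the dot product of the
    membership vectors of each pair of objects (an illustrative choice)."""
    n = len(ensemble[0])
    total = [[0.0] * n for _ in range(n)]
    for memberships in ensemble:
        for i in range(n):
            for j in range(n):
                # Dot product of the two objects' membership vectors.
                total[i][j] += sum(
                    a * b for a, b in zip(memberships[i], memberships[j])
                )
    return [[value / len(ensemble) for value in row] for row in total]


# Two hypothetical soft clusterings of three objects; each row sums to 1.
ensemble = [
    [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]],
    [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]],
]
similarity = soft_coassociation(ensemble)
```

Objects 0 and 1 end up more similar to each other than to object 2, and no membership value had to be rounded to a hard label first. A consensus partition could then be obtained by clustering this similarity matrix.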

6.
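Du Hangyuan  Zhang Jing  Wang Wenjian 《CAAI Transactions on Intelligent Systems》2020,15(6):1113-1120
To address the design of the consensus function in cluster ensembles, this paper proposes a deep self-supervised cluster ensemble algorithm. Starting from the base clustering partitions, the algorithm first computes a similarity matrix between samples using the weighted connected-triple algorithm and uses this matrix to express adjacency relations, transforming the base clusterings from a data representation in feature space into a graph representation. On this basis, the consensus problem for the base clusterings is converted into a graph clustering problem on their graph representation. To solve it, a self-supervised cluster ensemble model is built with graph neural networks: on the one hand, a graph autoencoder learns a low-dimensional embedding of the graph, and the target distribution of the cluster ensemble is estimated from the likelihood distribution of this embedding; on the other hand, the cluster ensemble target guides the embedding process, ensuring that the low-dimensional graph embedding and the cluster ensemble result obtained by the model are jointly optimal. Simulation experiments on a large number of datasets show that, compared with algorithms such as HGPA, CSPA, and MCLA, the proposed algorithm further improves the accuracy of cluster ensemble results.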

7.
The clustering ensemble has emerged as a prominent method for improving the robustness, stability, and accuracy of unsupervised classification solutions. It combines multiple partitions generated by different clustering algorithms into a single clustering solution. Genetic algorithms are known for their strong ability to solve optimization problems, including clustering. To date, significant progress has been made in finding consensus clusterings that yield better results than existing clusterings. This paper presents a survey of genetic algorithms designed for clustering ensembles. It begins with an introduction to clustering ensembles and clustering ensemble algorithms. Subsequently, it describes a number of proposed genetic-guided clustering ensemble algorithms, in particular their genotypes, fitness functions, and genetic operations. Next, the clustering accuracies of the genetic-guided clustering ensemble algorithms are compared. The paper concludes that using genetic algorithms in clustering ensembles improves clustering accuracy, and it discusses open questions for future research.

8.
Clustering is one of the most important unsupervised learning problems and consists of finding a common structure in a collection of unlabeled data. However, due to the ill-posed nature of the problem, different runs of the same clustering algorithm applied to the same dataset usually produce different solutions. In this scenario, choosing a single solution is quite arbitrary. On the other hand, in many applications the problem of multiple solutions becomes intractable, hence it is often more desirable to provide a limited group of “good” clusterings rather than a single solution. In the present paper we propose least squares consensus clustering. This technique allows one to extrapolate a small number of different clustering solutions from an initial (large) ensemble obtained by applying any clustering algorithm to a given dataset. We also define a measure of quality and present a graphical visualization of each consensus clustering so that the strength of the consensus is immediately interpretable. We have carried out several numerical experiments on both synthetic and real datasets to illustrate the proposed methodology.

9.
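Zhang Xiaobo  Yang Yan  Li Tianrui  Lu Fan  Peng Lilan 《Journal of Computer Applications》2020,40(10):3088-3094
To address the early intelligent diagnosis of Parkinson's disease (PD), which mostly affects the elderly, clustering techniques based on medical examination text data are proposed for the analysis and prediction of PD. First, the original dataset is preprocessed to obtain effective feature information, and principal component analysis (PCA) is used to reduce the original features to eight feature spaces of different dimensions. Then, five traditional classical clustering models and three different cluster ensemble methods are applied to cluster the data in each of the eight feature spaces. Finally, four clustering performance metrics are used to predict the PD patients with dopamine abnormality, the healthy subjects, and the patients with scans without evidence of dopaminergic deficit (SWEDD) in the dataset. Simulation results show that the clustering accuracy of the Gaussian mixture model (GMM) reaches 89.12% when the PCA feature dimension is 30, that of spectral clustering (SC) reaches 61.41% when the PCA feature dimension is 70, and that of the meta-clustering algorithm (MCLA) reaches 59.62% when the PCA feature dimension is 80. Comparative experiments show that, among the five classical clustering methods, GMM clusters best when the PCA feature dimension is below 40, and that, among the three cluster ensemble methods, MCLA performs well across the different feature dimensions, thereby providing technical and theoretical support for early intelligent computer-aided diagnosis of PD.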

11.
In recent years, ensemble learning has become a prolific area of study in pattern recognition, based on the assumption that combining different learning models on the same problem could lead to better performance than using a single model. This idea of ensemble learning has traditionally been used for classification tasks, but has more recently been adapted to other machine learning tasks such as clustering and feature selection. We propose several feature selection ensemble configurations based on combining the feature rankings produced by individual rankers; the configurations differ in the combination method and the threshold value used. The performance of each proposed ensemble configuration was tested on synthetic datasets (to assess the adequacy of the selection), real classical datasets (with more samples than features), and DNA microarray datasets (with more features than samples). Five different classifiers were studied in order to test the suitability of the proposed ensemble configurations and assess the results.
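
As a rough illustration of what "combining rankings according to a combination method and a threshold" can look like, here is a minimal sketch. The abstract does not say which combination methods or thresholds were used, so the mean-rank rule, the fractional threshold, the function name `combine_rankings`, and the feature names are all assumptions.

```python
def combine_rankings(rankings, threshold):
    """Combine feature rankings by mean rank and keep the top fraction.

    rankings:  list of rankings, each a list of feature names, best first.
    threshold: fraction of features to keep (hypothetical parameter).
    """
    features = rankings[0]
    # Mean position of each feature across all rankers (0 = best).
    mean_rank = {
        f: sum(r.index(f) for r in rankings) / len(rankings) for f in features
    }
    # Ties are broken by feature name so the result is deterministic.
    ordered = sorted(features, key=lambda f: (mean_rank[f], f))
    keep = max(1, int(threshold * len(ordered)))
    return ordered[:keep]


# Three hypothetical rankers over four features.
rankings = [
    ["f1", "f2", "f3", "f4"],
    ["f2", "f1", "f4", "f3"],
    ["f1", "f3", "f2", "f4"],
]
selected = combine_rankings(rankings, threshold=0.5)
```

Swapping the mean for the median or the minimum rank would give a different combination method with the same interface, which is the kind of variation the configurations in the abstract explore.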

12.
Searching and mining biomedical literature databases are common ways for biomedical researchers to generate scientific hypotheses. Clustering can help researchers form hypotheses by letting them find valuable information in grouped documents effectively. Although a large number of clustering algorithms are available, this paper attempts to answer the question of which algorithm is best suited to clustering biomedical documents accurately. Non-negative matrix factorization (NMF) has been widely applied to clustering general text documents. However, the clustering results are sensitive to the initial values of the NMF parameters. To overcome this drawback, we present ensemble NMF for clustering biomedical documents. The performance of ensemble NMF was evaluated on numerous datasets generated from the TREC Genomics track dataset. On most datasets, the experimental results demonstrate that ensemble NMF significantly outperforms the classical clustering algorithms bisecting K-means and hierarchical clustering. We compared four different methods for constructing an ensemble NMF. For clustering biomedical documents, this research is the first to compare ensemble NMF with typical classical clustering algorithms, and it validates ensemble NMF constructed from different graph-based ensemble algorithms. This is also the first work on ensemble NMF with Hybrid Bipartite Graph Formulation for clustering biomedical documents.

13.
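An improved multi-view cluster ensemble algorithm   Cited by: 1 (self-citations: 0; citations by others: 1)
Deng Qiang  Yang Yan  Wang Hao 《Computer Science》2017,44(1):65-70
In recent years, research on data mining techniques and machine learning algorithms for big data has become increasingly important. In the clustering field, with the emergence of large amounts of multi-view data, multi-view clustering has become an important class of clustering methods. However, most existing multi-view clustering algorithms are affected by parameter settings, data samples, and similar factors, and suffer from unstable clustering results and the need for repeated parameter tuning. Based on the multi-view K-means algorithm and cluster ensemble techniques, an improved multi-view cluster ensemble algorithm is proposed, which improves the accuracy, robustness, and stability of clustering. In addition, because multi-view clustering algorithms running on a single machine have difficulty processing massive data, a distributed parallel multi-view clustering algorithm is implemented using distributed processing techniques. Experiments show that the parallel algorithm greatly improves time efficiency when processing big data and is suitable for multi-view cluster analysis in big-data environments.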

14.
Identifying the optimal cluster number and generating reliable clustering results are necessary but challenging tasks in cluster analysis. The effectiveness of cluster analysis relies not only on the assumed cluster number but also on the clustering algorithm employed. This paper proposes a new cluster analysis method that identifies the desired cluster number and, at the same time, produces reliable clustering solutions. It first obtains many clustering results from a specific algorithm, such as Fuzzy C-Means (FCM), and then integrates these different results into a judgement matrix. An iterative graph-partitioning process is then applied to identify the desired cluster number and the final result. The proposed method is robust, as demonstrated by its effectiveness in clustering 2D data sets and multi-dimensional real-world data sets of different shapes. The method is compared with cluster validity analysis and with other methods such as spectral clustering and cluster ensemble methods. It is also shown to be efficient in mesh segmentation applications. The proposed method is also adaptive, because it works not only with the FCM algorithm but also with other clustering methods such as the k-means algorithm.

15.
A human-computer interactive method for projected clustering   Cited by: 1 (self-citations: 0; citations by others: 1)
Clustering is a central task in data mining applications such as customer segmentation. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Therefore, techniques have recently been proposed to find clusters in hidden subspaces of the data. However, since the behavior of the data can vary considerably in different subspaces, it is often difficult to define the notion of a cluster using simple mathematical formalizations. The widely used practice of treating clustering as the exact problem of optimizing an arbitrarily chosen objective function can often lead to misleading results. In fact, the proper clustering definition may vary not only with the application and data set but also with the perceptions of the end user. This makes it difficult to separate the definition of the clustering problem from the perception of the end user. We propose a system that performs high-dimensional clustering through cooperation between the human and the computer. The complex task of cluster creation is accomplished through a combination of human intuition and the computational support provided by the computer. The result is a system that leverages the best abilities of both the human and the computer for solving the clustering problem.

16.
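To address traffic identification on the mobile Internet, the clustering performance of the K-means and spectral clustering algorithms is analyzed, using multiple performance evaluation metrics, on traffic datasets with different feature sets or different identification targets, and an ensemble clustering method based on multiple feature sets is proposed. Comparative experiments show that the same clustering method performs differently on different feature sets or on datasets with different identification targets, and that the ensemble clustering method effectively improves on the performance of clustering methods that use a single feature set. The ensemble clustering method is further applied to App association analysis, and the results can provide an objective basis for categorizing mobile Apps and analyzing user behavior.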

17.
Singh V  Mukherjee L  Peng J  Xu J 《Machine Learning》2010,79(1-2):177-200
In this paper, we study the ensemble clustering problem, where the input is in the form of multiple clustering solutions. The goal of ensemble clustering algorithms is to aggregate the solutions into one solution that maximizes the agreement in the input ensemble. We obtain several new results for this problem. Specifically, we show that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches. Our optimization proceeds by first constructing a non-linear objective function, which is then transformed into a 0-1 semidefinite program (SDP) using novel convexification techniques. This model can subsequently be relaxed to a polynomial-time solvable SDP. In addition to the theoretical contributions, our experimental results on standard machine learning and synthetic datasets show that this approach leads to improvements not only in terms of the proposed agreement measure but also in terms of the existing agreement measures based on voting strategies. In addition, we identify several new application scenarios for this problem. These include combining multiple image segmentations and generating tissue maps from multiple-channel Diffusion Tensor brain images to identify the underlying structure of the brain.

18.

Graphs are commonly used to express the communication of various data. When the data are uncertain, the result is a probabilistic graph. As a fundamental problem on such graphs, clustering has many applications in analyzing uncertain data. In this paper, we propose a novel method based on ensemble clustering for large probabilistic graphs. To generate ensemble clusters, we develop a set of probable possible worlds of the initial probabilistic graph. Then, we present a probabilistic co-association matrix as a consensus function to integrate the base clustering results. It relies on co-occurrences of node pairs based on the probability of the corresponding common cluster graphs. We also apply two improvements, one in the step before and one in the step after ensemble generation. In the step before, we append neighborhood information based on node features to the initial graph to achieve a more accurate estimate of the probability between nodes. In the step after, we use supervised metric learning based on the Mahalanobis distance to automatically learn a metric from the ensemble clusters; the aim is to capture the crucial features of the base clustering results. We evaluate our work using five real-world datasets and three clustering evaluation metrics, namely the Dunn index, the Davies–Bouldin index, and the Silhouette coefficient. The results show strong performance in clustering large probabilistic graphs.
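
For readers unfamiliar with co-association matrices, the sketch below shows the plain, unweighted version: the fraction of base clusterings in which two nodes share a cluster. The probabilistic variant in the abstract additionally weights co-occurrences by the probability of the corresponding cluster graphs, and that weighting is not reproduced here. The function name `coassociation` and the toy labelings are assumptions.

```python
def coassociation(labelings):
    """Fraction of base clusterings that put objects i and j in the same cluster.

    labelings: list of label lists, one per base clustering, all of equal length.
    """
    n = len(labelings[0])
    return [
        [
            sum(labels[i] == labels[j] for labels in labelings) / len(labelings)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Three hypothetical base clusterings of four nodes. Label values are
# arbitrary: only "same label or not" matters within each clustering.
labelings = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
matrix = coassociation(labelings)
```

Because only label equality is compared, the matrix is unaffected by how each base clustering happens to number its clusters, which is why this representation is a common starting point for consensus functions.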


19.
《Pattern recognition》2014,47(2):833-842
Ensemble clustering is a recently evolving research direction in cluster analysis and has found several different application domains. In this work the complex ensemble clustering problem is reduced to the well-known Euclidean median problem by embedding clusterings in vector spaces. The Euclidean median problem is solved by the Weiszfeld algorithm, and an inverse transformation maps the Euclidean median back into the clustering domain. In the experimental study, different evaluation strategies are considered. The proposed embedding strategy is compared with several state-of-the-art ensemble clustering algorithms and demonstrates superior performance.
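
The Weiszfeld step on its own can be sketched as follows. This covers only the Euclidean median computation: the embedding of clusterings into a vector space and the inverse transformation described in the abstract are not shown. The function name, the iteration count, and the `eps` safeguard against division by zero are assumptions.

```python
import math


def weiszfeld(points, iterations=100, eps=1e-12):
    """Approximate the Euclidean (geometric) median by Weiszfeld iteration."""
    dim = len(points[0])
    # Start from the centroid.
    median = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iterations):
        # Each point is weighted by the inverse of its distance to the
        # current estimate; eps keeps the weight finite if they coincide.
        weights = [1.0 / max(math.dist(p, median), eps) for p in points]
        total = sum(weights)
        median = [
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(dim)
        ]
    return median


# Four points in convex position: their geometric median is the
# intersection of the diagonals, here (0.5, 0.5).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
median = weiszfeld(points)
```

In this example the centroid is (2.75, 2.75), pulled far out by the single distant point, whereas the median stays close to the three nearby points. That insensitivity to outliers is one reason a median is an appealing notion of consensus.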

20.
Multi-view clustering has become an important extension of ensemble clustering. In multi-view clustering, we apply clustering algorithms to different views of the data to obtain different cluster labels for the same set of objects. These results are then combined in such a manner that the final clustering gives a better result than clustering each view individually. Multi-view clustering can be applied at various stages of the clustering paradigm. This paper proposes a novel multi-view clustering algorithm that combines different ensemble techniques. Our approach computes different similarity matrices on the individual datasets and aggregates them to form a combined similarity matrix, which is then used to obtain the final clustering. We tested our approach on several datasets and performed a comparison with other state-of-the-art algorithms. Our results show that the proposed algorithm outperforms several other methods in terms of accuracy while maintaining the overall complexity of the individual approaches.
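
The aggregation of per-view similarity matrices might be sketched as below. The abstract does not state which similarity functions or which aggregation rule the authors use, so the Gaussian kernel, the plain average, the `sigma` parameter, and the function names are assumptions made only to show the data flow.

```python
import math


def gaussian_similarity(view, sigma=1.0):
    """Pairwise Gaussian-kernel similarity within one view of the data."""
    n = len(view)
    return [
        [
            math.exp(-math.dist(view[i], view[j]) ** 2 / (2 * sigma ** 2))
            for j in range(n)
        ]
        for i in range(n)
    ]


def combine_views(views, sigma=1.0):
    """Average the per-view similarity matrices into one combined matrix."""
    matrices = [gaussian_similarity(view, sigma) for view in views]
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical views of the same three objects; the views may have
# different numbers of features.
views = [
    [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)],
    [(1.0,), (1.2,), (9.0,)],
]
combined = combine_views(views)
```

The combined matrix can then be passed to any similarity-based clustering method to obtain the final clustering. Since each view contributes an n-by-n matrix regardless of its feature count, views with different dimensionality combine without extra alignment.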
