Similar documents
1.
In high-dimensional data, clusters of objects usually exist in subspaces; moreover, different clusters may have different shape volumes. Most existing methods for high-dimensional data clustering, however, consider only the former factor and ignore the latter by assuming the same shape volume for every cluster. In this paper we propose a new Gaussian mixture model (GMM) type algorithm for discovering clusters with various shape volumes in subspaces. We extend the GMM clustering method to calculate a local weight vector as well as a local variance within each cluster, and use the weight and variance values to capture the main properties that discriminate different clusters, including their subsets of relevant dimensions and their shape volumes. This is achieved by introducing the negative entropy of the weight vectors, along with adaptively chosen coefficients, into the objective function of the extended GMM. Experimental results on both synthetic and real datasets show that the proposed algorithm outperforms its competitors, especially when applied to high-dimensional datasets.
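The entropy-regularized feature weighting at the heart of such extended mixture models can be illustrated with a minimal sketch. This is not the paper's exact update: it assumes a standard closed-form, softmax-style solution for per-cluster weights, and the function and argument names are illustrative.

```python
import numpy as np

def entropy_feature_weights(dispersion, gamma):
    """Per-cluster feature weights from within-cluster dispersions.

    dispersion : (k, d) array; scatter of each cluster along each dimension.
    gamma      : positive entropy coefficient; larger values give smoother
                 (more uniform) weights, smaller values concentrate weight
                 on the most compact (most relevant) dimensions.
    """
    logits = -dispersion / gamma
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)       # each row sums to 1
```

Dimensions along which a cluster is compact receive large weights, so each cluster effectively selects its own relevant subspace.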

2.
This paper proposes a new method to weight subspaces in feature groups and individual features for clustering high-dimensional data. In this method, the features of high-dimensional data are divided into feature groups based on their natural characteristics. Two types of weights are introduced into the clustering process to simultaneously identify the importance of feature groups and of individual features in each cluster. A new optimization model defines the optimization process, and a new clustering algorithm, FG-k-means, is proposed to solve it. The new algorithm extends k-means by adding two steps that automatically calculate the two types of subspace weights. A new data generation method is presented to generate high-dimensional data with clusters in subspaces of both feature groups and individual features. Experimental results on synthetic and real-life data show that FG-k-means significantly outperformed four k-means-type algorithms, i.e., k-means, W-k-means, LAC and EWKM, in almost all experiments. The new algorithm is robust to the noise and missing values that commonly occur in high-dimensional data.
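The two-level weighting can be sketched as a dissimilarity in which each squared difference is scaled by both its feature-group weight and its individual feature weight. This is a simplified illustration under assumed notation; FG-k-means additionally exponentiates the weights and updates them in closed form within the k-means loop.

```python
import numpy as np

def two_level_distance(x, center, group_w, feat_w, groups):
    """Squared distance where each term is scaled by the weight of its
    feature group and by the weight of the individual feature.

    groups : list of index lists, one per feature group.
    """
    d = 0.0
    for g, idx in enumerate(groups):
        for j in idx:
            d += group_w[g] * feat_w[j] * (x[j] - center[j]) ** 2
    return d
```

Setting a group's weight to zero removes its features from the distance entirely, which is how irrelevant feature groups are suppressed per cluster.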

3.
Pattern Analysis and Applications - The curse of dimensionality in high-dimensional data is one of the major challenges in data clustering. Recently, a considerable amount of literature has been...

4.
Spectral clustering is an important family of clustering methods that relies heavily on the affinity matrix. However, conventional spectral clustering methods (1) treat every data point equally and are therefore easily affected by outliers; (2) are sensitive to initialization; and (3) require the number of clusters to be specified. To overcome these problems, we propose a novel spectral clustering algorithm that learns an intrinsic affinity matrix, uses local PCA to resolve intersections, and further exploits a robust clustering procedure, insensitive to initialization, to generate clusters automatically without a pre-specified number of clusters. Experimental results on both artificial and real high-dimensional datasets show that the proposed method outperforms the compared clustering methods in terms of four clustering metrics.

5.
Competitive learning mechanisms for clustering generally suffer from poor performance on very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model; in fact, it can be considered a batch-mode version of (normalized) competitive learning. The generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality, well-balanced clusters for high-dimensional data. Like kmeans, each iteration of all three algorithms is linear in the number of data points and in the number of clusters. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering high-dimensional text data sets show the effectiveness and applicability of the proposed techniques.
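The core spkmeans iteration described above — unit-normalized inputs, cosine-similarity assignment, renormalized centers — can be sketched as follows. This is a minimal sketch without the frequency-sensitive terms; initialization and tie handling are deliberately simplistic.

```python
import numpy as np

def spkmeans(X, k, iters=20, init=None):
    """Spherical k-means: normalize inputs and centers to unit length,
    assign points by cosine similarity (a bare-bones sketch)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length inputs
    idx = np.arange(k) if init is None else np.asarray(init)
    C = X[idx].copy()                                  # initial unit centers
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)            # cosine-similarity assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)           # renormalized center
    return labels
```

The frequency-sensitive variants in the paper additionally penalize large clusters during assignment, which is what produces the balanced clusterings.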

6.
Identifying clusters is an important aspect of analyzing large datasets. Clustering algorithms classically require access to the complete dataset. However, as huge amounts of data increasingly originate from multiple, dispersed sources in distributed systems, alternative solutions are required. Furthermore, data and network dynamicity in a distributed setting demand adaptable clustering solutions that deliver accurate clustering models at a reasonable pace. In this paper, we propose GoSCAN, a fully decentralized density-based clustering algorithm capable of clustering dynamic and distributed datasets without requiring central control or message flooding. We identify two major tasks, finding the core data points and forming the actual clusters, which we execute in parallel using gossip-based communication. This approach is very efficient, as it gives each peer enough authority to discover the clusters it is interested in. Our algorithm imposes no extra burden of overlay formation on the network, while providing high levels of scalability. We also offer several optimizations to the basic clustering algorithm that reduce communication overhead and processing costs. Coping with dynamic data is made possible by introducing an age factor, which gradually detects dataset changes and enables clustering updates. In our experimental evaluation, we show that GoSCAN can discover the clusters efficiently with scalable transmission cost.

7.
A non-overlapping subspace clustering algorithm for categorical data
Traditional clustering algorithms mainly cluster numerical data, but growing demands have produced more and more algorithms for categorical data. Because categorical data has no direct notion of distance, traditional algorithms cannot handle it; meanwhile, research on subspace clustering of categorical data remains limited. This paper uses the concept of entropy to select the cluster partition and the optimal cluster centers, proposes a new objective function to obtain the relevant subspace of each cluster, and optimizes the clustering partition by minimizing the objective function. Experimental results show that the method is feasible and also reveals the structural characteristics of the data in each cluster.

8.
A survey of clustering methods for high-dimensional data
This paper surveys the state of the art in clustering algorithms for high-dimensional data, analyzes and compares the main performance differences among the algorithms, and points out a future trend: incorporating ideas from traditional clustering methods into the subspace clustering process to improve clustering performance.

9.
Tick data are used in several applications that need to keep track of values changing over time, like prices on the stock market or meteorological measurements. Due to the possibly very frequent changes, the size of tick data tends to increase rapidly. Therefore, it becomes of paramount importance to reduce the storage space of tick data while, at the same time, allowing queries to be executed efficiently. In this paper, we propose an approach to decompose the original tick data matrix by clustering their attributes using a new clustering algorithm called Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC). We additionally propose a method for speeding up SOHAC based on a new lower bounding technique that allows SOHAC to be applied to high-dimensional tick data. Our experimental evaluation shows that the proposed approach compares favorably to several baselines in terms of compression. Additionally, it can lead to significant speedup in terms of running time.

10.
KNN-kernel density-based clustering for high-dimensional multivariate data
Density-based clustering algorithms for multivariate data often have difficulties with high-dimensional data and clusters of very different densities. A new density-based clustering algorithm, called KNNCLUST, is presented in this paper that is able to tackle these situations. It is based on the combination of nonparametric k-nearest-neighbor (KNN) and kernel (KNN-kernel) density estimation. The KNN-kernel density estimation technique makes it possible to model clusters of different densities in high-dimensional data sets. Moreover, the number of clusters is identified automatically by the algorithm. KNNCLUST is tested using simulated data and applied to a multispectral compact airborne spectrographic imager (CASI) image of a floodplain in the Netherlands to illustrate the characteristics of the method.
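The idea of KNN-kernel density estimation — adapting each kernel's bandwidth to the local k-nearest-neighbor distance so that dense and sparse clusters are both modeled — can be sketched as below. This is an illustrative adaptive density estimate, not KNNCLUST's exact estimator, and normalization constants are omitted.

```python
import numpy as np

def knn_kernel_density(X, k):
    """Adaptive kernel density estimate: each contributing point uses a
    Gaussian kernel whose bandwidth is that point's distance to its k-th
    nearest neighbour, so dense regions get narrow kernels and sparse
    regions wide ones."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    h = np.sort(D, axis=1)[:, k]                    # k-th NN distance (col 0 = self)
    K = np.exp(-0.5 * (D / h[None, :]) ** 2) / h[None, :]
    return K.mean(axis=1)                           # density up to a constant
```

Points inside a tight cluster receive many narrow, tall kernel contributions, while an isolated point receives almost none, which is what lets a density threshold separate clusters of different densities.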

11.
A high-dimensional clustering algorithm with optimized subspaces
Most typical soft subspace clustering algorithms fail to consider optimizing the projected subspace of each cluster. This paper proposes a new soft subspace clustering algorithm that takes maximizing the differences among feature weights as the subspace-optimization objective and gives a quantitative formula for it. On this basis, a new objective function is designed that minimizes within-cluster compactness while optimizing the soft subspace of each cluster. A new feature-weight update rule is derived mathematically, and a new clustering algorithm is defined within the k-means framework. Experimental results show that the subspace optimization reduces the likelihood of premature convergence to local optima, improves the stability of the algorithm, and yields good performance and clustering quality, making the algorithm well suited to high-dimensional data analysis.

12.
Subspace clustering is an effective approach to clustering high-dimensional data: it clusters the data in subspaces that are as small as possible while preserving as much of the original information as possible. Building on existing subspace clustering work, this paper introduces a new subspace search strategy that combines cluster size and information entropy to compute the weight of each subspace dimension, and then uses the subspace feature vectors to compute the similarity between clusters. The algorithm clusters in the agglomerative style of hierarchical clustering, overcoming the drawbacks of using information entropy or conventional similarity alone. Tests on three typical categorical datasets, Zoo, Votes and Soybean, show that compared with other algorithms the method not only improves clustering accuracy but is also highly stable.

13.
In several application domains, high-dimensional observations are collected and then analysed in search of naturally occurring data clusters which might provide further insights about the nature of the problem. In this paper we describe a new approach for partitioning such high-dimensional data. Our assumption is that, within each cluster, the data can be approximated well by a linear subspace estimated by means of a principal component analysis (PCA). The proposed algorithm, Predictive Subspace Clustering (PSC), partitions the data into clusters while simultaneously estimating cluster-wise PCA parameters. The algorithm minimises an objective function that depends upon a new measure of influence for PCA models. A penalised version of the algorithm is also described for carrying out simultaneous subspace clustering and variable selection. The convergence of PSC is discussed in detail, and extensive simulation results and comparisons to competing methods are presented. The comparative performance of PSC has been assessed on six real gene expression data sets, for which PSC often provides state-of-the-art results.
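The cluster-wise linear-subspace assumption can be illustrated by scoring points against a q-dimensional PCA subspace via reconstruction error. This is a sketch of the underlying idea only; PSC's actual influence measure for PCA models is predictive and more involved.

```python
import numpy as np

def pca_residuals(X, q):
    """Squared reconstruction error of each row of X under the best
    q-dimensional linear (PCA) subspace fitted to X."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:q].T @ Vt[:q]      # projection onto the top-q subspace
    return ((Xc - proj) ** 2).sum(axis=1)
```

In a subspace clustering loop, each point would be reassigned to the cluster whose fitted subspace reconstructs it with the smallest residual.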

14.
Motivated by the high demand for compact and accurate statistical models that adjust automatically to dynamic changes, in this paper we propose an online probabilistic framework for high-dimensional spherical data modeling. The proposed framework allows simultaneous clustering and feature selection in online settings using finite mixtures of von Mises distributions (movM). The unsupervised learning of the resulting model is approached using expectation maximization (EM) for parameter estimation, along with the minimum message length (MML) criterion to determine the optimal number of mixture components. A stochastic gradient descent approach is also considered for incremental updating of the model parameters. Through empirical experiments, we demonstrate the merits of the proposed learning framework on diverse high-dimensional datasets and challenging applications.

15.
Clustering groups similar data to uncover hidden information about the characteristics of a dataset for further analysis. The notion of dissimilarity between objects is a decisive factor for obtaining good-quality clustering results. When the attributes of data are not just numerical but categorical and high dimensional, it is not simple to discriminate the dissimilarity of objects that have synonymous values or unimportant attributes. We suggest a method to quantify the level of difference between categorical values and to weigh the implicit influence of each attribute on constructing a particular cluster. Our method exploits the distributional information of the data correlated with each categorical value, so that intrinsic relationships between values can be discovered. In addition, it dynamically measures the significance of each attribute in constructing the respective cluster. Experiments on real datasets show the propriety and effectiveness of the method, which improves the results considerably even with simple clustering algorithms. Our approach is not tightly coupled to any particular clustering algorithm and can be applied flexibly to various algorithms.
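One way to quantify the difference between two categorical values in the spirit described above is to compare the conditional distributions they induce over a co-occurring attribute. This is a simplified sketch using total-variation distance over a single companion attribute; the paper's actual measure and attribute-weighting scheme differ in detail.

```python
from collections import Counter

def value_dissimilarity(rows, attr, v1, v2, other):
    """Dissimilarity of values v1 and v2 of attribute `attr`, measured by
    how differently they co-occur with the values of attribute `other`
    (total-variation distance between the two conditional distributions)."""
    def cond_dist(v):
        c = Counter(r[other] for r in rows if r[attr] == v)
        n = sum(c.values())
        return {k: cnt / n for k, cnt in c.items()}
    p, q = cond_dist(v1), cond_dist(v2)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
```

Two "synonymous" values that always appear in the same contexts get dissimilarity near 0, while values appearing in disjoint contexts get dissimilarity near 1.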

16.
A subspace clustering algorithm for high-dimensional spatial data
王生生, 刘大有, 曹斌, 刘杰. 《计算机应用》, 2005, 25(11): 2615-2617
Traditional grid-based clustering methods ignore the influence of data points in neighbouring grid cells on the cell under examination, which leads to non-smooth clustering and unclear cluster boundaries. This paper proposes a subspace clustering algorithm for high-dimensional spatial data that extends the neighbouring clustering space. Experimental results show that it overcomes the non-smoothness of traditional clustering and handles cluster boundaries well.

17.
18.
A novel random-gradient-based algorithm is developed for online tracking of the minor component (MC) associated with the smallest eigenvalue of the autocorrelation matrix of the input vector sequence. The five available learning algorithms for tracking one MC are extended to algorithms for tracking multiple MCs or the minor subspace (MS). To overcome the dynamical divergence of some available random-gradient-based algorithms, we propose a modification of the Oja-type algorithms, called OJAm, which works satisfactorily. The averaging differential equation and the energy function associated with OJAm are given. It is shown that the averaging differential equation globally asymptotically converges to an invariant set. The corresponding energy or Lyapunov functions exhibit a unique global minimum, attained if and only if their state matrices span the MS of the autocorrelation matrix of a vector data stream; the other stationary points are saddle (unstable) points. The global convergence of OJAm is also studied. OJAm provides efficient online learning for tracking the MS, and it can track an orthonormal basis of the MS while the other five available algorithms cannot. The performances of the related algorithms are shown via computer simulations.
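A single stochastic step of a normalized Oja-type minor-component rule can be sketched as follows. This is an illustrative anti-Hebbian update with explicit renormalization for the one-component case, not the exact OJAm recursion from the paper (OJAm's self-stabilizing form avoids per-step normalization).

```python
import numpy as np

def mc_step(w, x, eta):
    """One stochastic anti-Hebbian step toward the minor component:
    move against the Hebbian (Oja) direction, then renormalize the
    weight vector to unit length to keep the iteration bounded."""
    y = w @ x                              # neuron output
    w = w - eta * (y * x - (y * y) * w)    # anti-Hebbian update
    return w / np.linalg.norm(w)
```

On average this performs gradient descent on the Rayleigh quotient, so the weight vector drifts toward the eigenvector with the smallest eigenvalue of the input autocorrelation matrix.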

19.
We propose a new approach, the forward functional testing (FFT) procedure, to cluster number selection for functional data clustering. We present a framework of subspace-projected functional data clustering based on the functional multiplicative random-effects model, and propose to perform functional hypothesis tests on the equivalence of cluster structures to identify the number of clusters. The aim is to find the maximum number of distinctive clusters while retaining significant differences between cluster structures. The null hypotheses comprise equalities between the cluster mean functions and between the sets of cluster eigenfunctions of the covariance kernels. Bootstrap resampling methods are developed to construct reference distributions of the derived test statistics. We compare several other cluster number selection criteria, extended from methods for multivariate data, with the proposed FFT procedure. The performance of the proposed approaches is examined by simulation studies, with applications to clustering gene expression profiles.

20.
In its bottom-up search for clusters in maximal interesting subspaces, the SUBCLU high-dimensional subspace clustering algorithm repeatedly generates intermediate clusters, which consumes a great deal of time. To address this, an improved algorithm, BDFS-SUBCLU, is proposed. It uses a depth-first search strategy with backtracking to mine the clusters in maximal interesting subspaces, avoiding the generation of intermediate clusters and reducing the time complexity. BDFS-SUBCLU also adds a constraint on core points in subspaces, which to some extent prevents neighbouring clusters from being merged into one because of special data points. Experimental results on synthetic and real datasets show that BDFS-SUBCLU improves both efficiency and accuracy compared with SUBCLU.
