Similar literature
1.
This paper presents a new k-means type algorithm for clustering high-dimensional objects in subspaces. In high-dimensional data, clusters of objects often exist in subspaces rather than in the entire space. For example, in text clustering, clusters of documents of different topics are categorized by different subsets of terms or keywords. The keywords for one cluster may not occur in the documents of other clusters. This is a data sparsity problem faced in clustering high-dimensional data. In the new algorithm, we extend the k-means clustering process to calculate a weight for each dimension in each cluster and use the weight values to identify the subsets of important dimensions that categorize different clusters. This is achieved by including the weight entropy in the objective function that is minimized in the k-means clustering process. An additional step is added to the k-means clustering process to automatically compute the weights of all dimensions in each cluster. The experiments on both synthetic and real data have shown that the new algorithm can generate better clustering results than other subspace clustering algorithms. The new algorithm is also scalable to large data sets.
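The per-cluster dimension weighting described above can be sketched as follows. This is only an illustration, not the authors' update rule: it assumes the exponential (softmax-like) weight form that commonly results from adding an entropy term to a k-means objective, and the function name `dimension_weights` and the parameter `gamma` are hypothetical.

```python
import math

def dimension_weights(points, center, gamma=1.0):
    """Return one weight per dimension for a single cluster.

    Assumed form: w_j is proportional to exp(-D_j / gamma), where D_j is the
    summed squared deviation of the cluster's points from its center along
    dimension j. gamma is a hypothetical parameter controlling how sharply
    the weights concentrate on low-dispersion dimensions.
    """
    dims = len(center)
    dispersions = [
        sum((p[j] - center[j]) ** 2 for p in points) for j in range(dims)
    ]
    # Shift by the smallest dispersion before exponentiating for stability;
    # the shift cancels out in the normalization below.
    base = min(dispersions)
    scores = [math.exp(-(d - base) / gamma) for d in dispersions]
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical cluster: tight along dimension 0, spread out along dimension 1.
cluster = [(1.0, 0.0), (1.1, 5.0), (0.9, -5.0)]
center = (1.0, 0.0)
weights = dimension_weights(cluster, center, gamma=10.0)
```

In this made-up example almost all of the weight falls on dimension 0, the dimension along which the cluster is compact, which is the behavior the abstract attributes to the weights.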

2.
Projective clustering by histograms (cited 5 times: 0 self-citations, 5 by others)
Recent research suggests that clustering for high-dimensional data should involve searching for "hidden" subspaces with lower dimensionalities, in which patterns can be observed when data objects are projected onto the subspaces. Discovering such interattribute correlations and the location of the corresponding clusters is known as the projective clustering problem. We propose an efficient projective clustering technique by histogram construction (EPCH). The histograms help to generate "signatures", where a signature corresponds to some region in some subspace, and signatures with a large number of data objects are identified as the regions for subspace clusters. Hence, projected clusters and their corresponding subspaces can be uncovered. Compared with the best previous methods known to us, this approach is more flexible in that less prior knowledge of the data set is required, and it is also much more efficient. Our experiments compare the behavior and performance of this approach and other projective clustering algorithms on data with different characteristics. The results show that our technique is scalable to very large databases and is able to return accurate clustering results.

3.
A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set and the clusters evaluated relative to the known partition according to the distributions, and then the errors are averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and we examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm and a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not. We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that a validity measure is well-correlated to the error rate of the clustering algorithm.
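The rank-correlation step mentioned above can be illustrated with a small sketch. It computes Kendall's tau in its simplest form (tau-a, which ignores ties); the abstract does not say which variant the authors used, and the index values and error rates below are made up purely for illustration.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical validity-index values and clustering error rates
# for four point sets (not taken from the paper).
index_values = [0.2, 0.5, 0.4, 0.9]
error_rates = [0.05, 0.10, 0.20, 0.30]
tau = kendall_tau(index_values, error_rates)
```

A value of tau near 1 or -1 would mean the index orders the point sets almost exactly as the true error does (directly or inversely); a value near 0 would mean the index says little about the error.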

4.
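A random projection clustering algorithm for high-dimensional symbolic data (cited 1 time: 0 self-citations, 1 by others)
Real-world data are often distributed in high-dimensional spaces, and viewed across the whole vector space the relationships among the data are very scattered, so how to reduce dimensionality in order to cluster high-dimensional data has attracted wide attention from researchers. This paper introduces a random projection clustering algorithm suited to high-dimensional symbolic data. It selects the dimensions relevant to clustering according to frequency, randomly generates cluster centers and their relevant dimensions and keeps the best ones according to the quality of the projected clustering, thereby extending projective clustering to the symbolic data space. Experimental results confirm the practicality and effectiveness of the algorithm.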

5.
Cluster analysis is used to explore structure in unlabeled batch data sets in a wide range of applications. An important part of cluster analysis is validating the quality of computationally obtained clusters. A large number of different internal indices have been developed for validation in the offline setting. However, this concept cannot be directly extended to the online setting because streaming algorithms do not retain the data, nor maintain a partition of it, both needed by batch cluster validity indices. In this paper, we develop two incremental versions (with and without forgetting factors) of the Xie-Beni and Davies-Bouldin validity indices, and use them to monitor and control two streaming clustering algorithms (sk-means and online ellipsoidal clustering). In this context, our new incremental validity indices are more accurately viewed as performance monitoring functions. We also show that incremental cluster validity indices can send a distress signal to online monitors when evolving structure leads an algorithm astray. Our numerical examples indicate that the incremental Xie-Beni index with a forgetting factor is superior to the other three indices tested.
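For orientation, the batch form of the Davies-Bouldin index that the abstract takes as its starting point can be sketched as below. This is the standard offline definition, which needs the full data and partition; it is not the incremental version developed in the paper, and the function name and sample data are illustrative assumptions.

```python
import math

def davies_bouldin(clusters):
    """Batch Davies-Bouldin index for a crisp partition (lower is better).

    clusters: list of at least two non-empty lists of points (tuples)
    with distinct centroids.
    """
    def centroid(points):
        dims = len(points[0])
        return tuple(sum(p[j] for p in points) / len(points) for j in range(dims))

    centers = [centroid(c) for c in clusters]
    # Scatter of a cluster: mean Euclidean distance of its points to its centroid.
    scatters = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k

# Two hypothetical partitions: compact and well separated vs. loose and close.
db_tight = davies_bouldin([[(0, 0), (0, 1)], [(10, 0), (10, 1)]])
db_loose = davies_bouldin([[(0, 0), (0, 4)], [(5, 0), (5, 4)]])
```

Because the computation touches every stored point, it cannot be applied as-is to a stream that discards its data, which is the limitation the abstract sets out to remove.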

6.
Clustering algorithms tend to generate clusters even when applied to random data. This paper provides a semi-tutorial review of the state of the art in cluster validity, or the verification of results from clustering algorithms. The paper covers ways of measuring clustering tendency, the fit of hierarchical and partitional structures, and indices of compactness and isolation for individual clusters. Included are structural criteria for validating clusters and the factors involved in choosing criteria, according to which the literature of cluster validity is classified. An application to speaker identification demonstrates several indices. The development of new clustering techniques and the wide availability of clustering programs necessitate vigorous research in cluster validity.

7.
Redefining clustering for high-dimensional applications (cited 1 time: 0 self-citations, 1 by others)
Clustering problems are well-known in the database literature for their use in numerous applications, such as customer segmentation, classification, and trend analysis. High-dimensional data has always been a challenge for clustering algorithms because of the inherent sparsity of the points. Recent research results indicate that, in high-dimensional data, even the concept of proximity or clustering may not be meaningful. We introduce a very general concept of projected clustering which is able to construct clusters in arbitrarily aligned subspaces of lower dimensionality. The subspaces are specific to the clusters themselves. This definition is substantially more general and realistic than the currently available techniques, which limit the method to projections from the original set of attributes. The generalized projected clustering technique may also be viewed as a way of trying to redefine clustering for high-dimensional applications by searching for hidden subspaces with clusters which are created by interattribute correlations. We provide a new concept of using extended cluster feature vectors in order to make the algorithm scalable for very large databases. The running time and space requirements of the algorithm are adjustable and can be traded off against accuracy.

8.
Cluster validity indices are used for estimating the quality of partitions produced by clustering algorithms and for determining the number of clusters in data. Cluster validation is a difficult task because the same data set admits several partitions, depending on the level of detail at which its natural groupings are described. Even though several cluster validity indices exist, they are inefficient when clusters widely differ in density or size. We propose a clustering validity index that addresses these issues. It is based on compactness and overlap measures. The overlap measure, which indicates the degree of overlap between fuzzy clusters, is obtained by calculating the overlap rate of all data objects that belong strongly enough to two or more clusters. The compactness measure, which indicates the degree of similarity of data objects in a cluster, is calculated from membership values of data objects that are strongly enough associated with one cluster. We propose ratio-type and summation-type indices using the same compactness and overlap measures. The maximal value of the index denotes the optimal fuzzy partition, which is expected to have a high compactness and a low degree of overlap among clusters. Testing many well-known previously formulated indices and the proposed indices on well-known data sets showed the superior reliability and effectiveness of the proposed index in comparison to other indices, especially when evaluating partitions with clusters that widely differ in size or density.

9.
Since in practical data mining problems high-dimensional data are clustered, the resulting clusters are high-dimensional geometrical objects, which are difficult to analyze and interpret. Cluster validity measures try to solve this problem by providing a single numerical value. As a low-dimensional graphical representation of the clusters could be much more informative than such a single value, this paper proposes a new tool for the visualization of fuzzy clustering results. By using the basic properties of fuzzy clustering algorithms, this new tool maps the cluster centers and the data such that the distances between the clusters and the data points are preserved. During the iterative mapping process, the algorithm uses the membership values of the data and minimizes an objective function similar to that of the original clustering algorithm. Compared with the original Sammon mapping, not only are reliable cluster shapes obtained, but the numerical complexity of the algorithm is also drastically reduced. The developed tool has been applied to the visualization of reconstructed phase space trajectories of chaotic systems. The case study demonstrates that the proposed FUZZSAMM algorithm is a useful tool in user-guided clustering.

10.
Cluster validity indices are used to validate results of clustering and to find a set of clusters that best fits natural partitions for a given data set. Most of the previous validity indices have been considerably dependent on the number of data objects in clusters, on cluster centroids and on average values. They have a tendency to ignore small clusters and clusters with low density. Two cluster validity indices are proposed for efficient validation of partitions containing clusters that widely differ in sizes and densities. The first proposed index exploits a compactness measure and a separation measure, and the second index is based on an overlap measure and a separation measure. The compactness and the overlap measures are calculated from a few data objects of a cluster, while the separation measure uses all data objects. The compactness measure is calculated only from data objects of a cluster that are far enough away from the cluster centroids, while the overlap measure is calculated from data objects that are near enough to one or more other clusters. A good partition is expected to have a low degree of overlap and a larger separation distance and compactness. The maximum value of the ratio of compactness to separation and the minimum value of the ratio of overlap to separation indicate the optimal partition. Testing of both proposed indices on some artificial and three well-known real data sets showed the effectiveness and reliability of the proposed indices.

11.
Automatically identifying the number of clusters in a data set is an important task in unsupervised classification. To address this issue, in the current paper, we focus on the symmetry property of clusters. Point and line symmetry are two important attributes of data partitions. Here we propose line symmetry versions of eight well-known validity indices (the XB, PBM, FCM, PS, FS, K, SV, and DB indices) to make them capable of identifying the accurate number of partitions in data sets containing clusters with the line symmetry property. The global optimality of two of these newly developed indices is established mathematically. Eight artificially generated data sets of varying dimensions, containing clusters of different convexities and shapes, and three real-life data sets are used in the experiments. Initially, to obtain different partitions, an existing genetic clustering technique which uses the line symmetry property (GALS clustering) is applied to the data sets while varying the number of clusters. We have also provided a comparative study of our proposed line-symmetry-based cluster validity indices with their point-symmetry-based versions and original versions based on Euclidean distance. The experimental results reveal that most of the line-symmetry-distance-based cluster validity indices perform better than their point-symmetry-based and Euclidean-distance-based versions.

12.
Identification of the correct number of clusters and the appropriate partitioning technique are important considerations in clustering, for which several cluster validity indices, primarily utilizing the Euclidean distance, have been used in the literature. In this paper a new measure of connectivity is incorporated in the definitions of seven cluster validity indices, namely the DB-index, Dunn-index, Generalized Dunn-index, PS-index, I-index, XB-index and SV-index, thereby yielding seven new cluster validity indices which are able to automatically detect clusters of any shape, size or convexity as long as they are well-separated. Here connectivity is measured using a novel approach following the concept of the relative neighborhood graph. It is empirically established that incorporating the property of connectivity significantly improves the capabilities of these indices in identifying the appropriate number of clusters. Two well-known clustering techniques, single linkage clustering and K-means clustering, are used as the underlying partitioning algorithms. Results on eight artificially generated and three real-life data sets show that the connectivity-based Dunn-index performs best compared with the other six indices. Comparisons are made with the original versions of these seven cluster validity indices.

13.
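As an unsupervised learning method, clustering usually requires the number of clusters to be supplied by hand. When prior knowledge is lacking, specifying clustering parameters manually is impractical. Cluster validity indices, studied in recent years, are used to estimate the number of clusters and the quality of a clustering. This paper proposes a new clustering algorithm based on a validity index that requires no clustering parameters. At each step the algorithm merges the two clusters whose merger increases the validity index the most or decreases it the least. A gravitational model is used to measure similarity, and possible outliers are handled by a homogenization treatment. Experiments show that the algorithm correctly finds the number of clusters in particular data sets, yields a lower error rate than other clustering methods, and is more efficient than typical validity-index-based clustering algorithms.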

14.
In this paper the problem of automatically clustering a data set is posed as solving a multiobjective optimization (MOO) problem, optimizing a set of cluster validity indices simultaneously. The proposed multiobjective clustering technique utilizes a recently developed simulated annealing based multiobjective optimization method as the underlying optimization strategy. Here a variable number of cluster centers is encoded in the string. The number of clusters present in different strings varies over a range. The points are assigned to different clusters based on the newly developed point symmetry based distance rather than the existing Euclidean distance. Two cluster validity indices, one based on the Euclidean distance, XB-index, and another recently developed point symmetry distance based cluster validity index, Sym-index, are optimized simultaneously in order to determine the appropriate number of clusters present in a data set. Thus the proposed clustering technique is able to detect both the proper number of clusters and the appropriate partitioning from data sets having either hyperspherical clusters or point symmetric clusters. A new semi-supervised method is also proposed in the present paper to select a single solution from the final Pareto optimal front of the proposed multiobjective clustering technique. The efficacy of the proposed algorithm is shown for seven artificial data sets and six real-life data sets of varying complexities. Results are also compared with those obtained by another multiobjective clustering technique, MOCK, and two single objective genetic algorithm based automatic clustering techniques, VGAPS clustering and GCUK clustering.
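The notion of a Pareto optimal front over two validity indices can be made concrete with a short sketch. It is a generic non-dominated filter, not the simulated annealing method of the paper. It assumes every objective is to be minimized, so an index that should be maximized (as the Sym-index presumably is) is negated first; the candidate scores are invented for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better
    on at least one, with all objectives minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(solutions):
    """Keep the solutions that no other solution dominates."""
    return [
        s for s in solutions
        if not any(dominates(other, s) for other in solutions)
    ]

# Hypothetical candidate partitions scored as (XB-index, -Sym-index).
candidates = [(0.30, -1.5), (0.25, -1.0), (0.40, -1.1), (0.20, -0.8)]
front = pareto_front(candidates)
```

Here the third candidate is dropped because the first is better on both objectives, while the remaining three are mutually incomparable. Choosing one of them is a separate step, which the abstract handles with its semi-supervised selection method.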

15.
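Xu Kunpeng, Chen Lifei, Sun Haojun, Wang Beizhan. Journal of Software (软件学报), 2020, 31(11): 3492-3505
Most existing subspace clustering methods for categorical data assume that features are mutually independent and do not consider linear or nonlinear correlations among attributes. This paper proposes a kernel subspace clustering method for categorical data. First, kernel functions originally designed for continuous data are introduced to project categorical data into a kernel space, and a feature-weighted similarity measure for categorical data in the kernel space is defined. Second, based on this measure, an objective function for kernel subspace clustering of categorical data is derived, together with an efficient optimization method for solving it. Finally, a kernel subspace clustering algorithm for categorical data is defined. The algorithm not only considers relationships among attributes in a nonlinear space, but also assigns each attribute a feature weight measuring its relevance to each cluster during clustering, achieving embedded feature selection for categorical attributes. A cluster validity index is also defined to evaluate the quality of categorical data clustering results. Experimental results on synthetic and real data sets show that, compared with existing subspace clustering algorithms, the kernel subspace clustering algorithm can uncover nonlinear relationships among categorical attributes and effectively improves clustering quality.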

16.
In this paper, we present an agglomerative fuzzy k-means clustering algorithm for numerical data, an extension of the standard fuzzy k-means algorithm that introduces a penalty term into the objective function to make the clustering process insensitive to the initial cluster centers. The new algorithm can produce more consistent clustering results from different sets of initial cluster centers. Combined with cluster validation techniques, the new algorithm can determine the number of clusters in a data set, which is a well-known problem in k-means clustering. Experimental results on synthetic data sets (2 to 5 dimensions, 500 to 5000 objects and 3 to 7 clusters), the BIRCH two-dimensional data set of 20000 objects and 100 clusters, and the WINE data set of 178 objects, 17 dimensions and 3 clusters from UCI have demonstrated the effectiveness of the new algorithm in producing consistent clustering results and determining the correct number of clusters in different data sets, some with overlapping inherent clusters.
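As background for the extension described above, the membership update of the standard fuzzy k-means algorithm can be sketched as follows. It deliberately omits the penalty term the paper adds, whose form the abstract does not give; the function name `fuzzy_memberships` and the fuzzifier default `m=2.0` are assumptions for illustration.

```python
import math

def fuzzy_memberships(point, centers, m=2.0):
    """Standard fuzzy k-means membership of one point in each cluster.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to center i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers is shared equally
    # among those centers, which avoids division by zero.
    at_center = [d == 0.0 for d in dists]
    if any(at_center):
        share = 1.0 / sum(at_center)
        return [share if hit else 0.0 for hit in at_center]
    power = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** power for d_j in dists)
        for d_i in dists
    ]

# Hypothetical point lying three times closer to the first center.
memberships = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

With these made-up values the point belongs to the nearer cluster with degree 0.9 and to the farther one with degree 0.1. Since the memberships depend directly on the current centers, the outcome of the standard algorithm depends on how the centers are initialized, the sensitivity the paper's penalty term is meant to reduce.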

17.
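An adaptive subspace clustering algorithm for high-dimensional data streams (cited 1 time: 0 self-citations, 1 by others)
Clustering high-dimensional data streams is an active research topic in data mining. Because data streams are large in volume, change rapidly, and are high-dimensional, many clustering algorithms cannot achieve good clustering quality. This paper proposes SAStream, an adaptive subspace clustering algorithm for high-dimensional data streams. The algorithm improves the micro-cluster structure of HPStream and defines candidate clusters; it computes the distance from a newly arrived data point to the centroids of candidate clusters only within the corresponding subspaces, reducing the number of micro-clusters examined during clustering. The resulting micro-clusters are stored in a pyramidal time frame, and a time decay function is used to delete expired micro-clusters. When the stream volume is high, the boundary radius and the cluster selection factor are adjusted automatically according to monitored system resource usage, thereby tuning the clustering granularity. Experimental results show that the algorithm offers good clustering quality and fast data processing.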

18.
In high-dimensional data, clusters of objects usually exist in subspaces; besides, different clusters probably have different shape volumes. Most existing methods for high-dimensional data clustering, however, only consider the former factor. They ignore the latter factor by assuming the same shape volume value for different clusters. In this paper we propose a new Gaussian mixture model (GMM) type algorithm for discovering clusters with various shape volumes in subspaces. We extend the GMM clustering method to calculate a local weight vector as well as a local variance within each cluster, and use the weight and variance values to capture main properties that discriminate different clusters, including subsets of relevant dimensions and shape volumes. This is achieved by introducing negative entropy of weight vectors, along with adaptively-chosen coefficients, into the objective function of the extended GMM. Experimental results on both synthetic and real datasets show that the proposed algorithm outperforms its competitors, especially when applied to high-dimensional datasets.

19.
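An adaptive entropy projective clustering algorithm (cited 1 time: 0 self-citations, 1 by others)
Affected by the "curse of dimensionality", many traditional clustering methods perform poorly when applied to high-dimensional data. Projective clustering has attracted wide attention in recent years, and soft subspace clustering in particular has been widely studied and applied. However, most existing projective subspace clustering algorithms require users to preset some important parameters and fail to consider optimizing the projected subspace of each cluster, which degrades clustering performance. To address this, a new optimization objective function is defined that optimizes the subspace of each cluster while minimizing the within-cluster compactness measure. A new feature weight calculation method is obtained by mathematical derivation, and an adaptive k-means-type projective clustering algorithm is proposed. During clustering, the algorithm dynamically computes each optimization parameter from information in the data set itself and from the derived formulas. Experimental results show that the new algorithm improves clustering quality by optimizing the projected subspaces and clearly outperforms existing projective clustering algorithms.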

20.
In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters; this is known as the automatic clustering problem. Because of a lack of prior domain knowledge, it is difficult to choose an appropriate number of clusters, especially when the data have many dimensions, when clusters differ widely in shape, size, and density, and when overlapping exists among groups. In the late 1990s, the automatic clustering problem gave rise to a new era in cluster analysis with the application of nature-inspired metaheuristics. Since then, researchers have developed several new algorithms in this field. This paper presents an up-to-date review of all major nature-inspired metaheuristic algorithms used thus far for automatic clustering. Also, the main components involved in the formulation of metaheuristics for automatic clustering are presented, such as encoding schemes, validity indices, and proximity measures. A total of 65 automatic clustering approaches are reviewed, which are based on single-solution, single-objective, and multiobjective metaheuristics, whose usage percentages are 3%, 69%, and 28%, respectively. Single-objective clustering algorithms are adequate to efficiently group linearly separable clusters. However, a strong tendency toward using multiobjective algorithms is found nowadays to address non-linearly separable problems. Finally, a discussion and some emerging research directions are presented.
