Similar Literature
Found 20 similar articles (search time: 15 ms)
1.
An improved spectral clustering algorithm based on random walk   Cited by: 2 (self-citations: 0, citations by others: 2)
The construction of the similarity matrix has an important impact on the performance of spectral clustering algorithms. In this paper, we propose a random walk based approach to process the Gaussian kernel similarity matrix. In this method, the pairwise similarity between two data points depends not only on the two points themselves but also on their neighbors. As a result, the new similarity matrix is closer to the ideal matrix, which would provide the best clustering result. We give a theoretical analysis of the similarity matrix and apply it to spectral clustering. We also propose a method to handle noisy items, which may degrade clustering performance. Experimental results on real-world data sets show that the proposed spectral clustering algorithm significantly outperforms existing algorithms.
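
The idea of letting neighbors influence a pairwise similarity can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes the Gaussian kernel matrix is row-normalized into a transition matrix and that the `steps`-step transition probabilities serve as the new similarity. The function names, `sigma`, and `steps` are hypothetical choices made for this example.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Gaussian kernel similarity: exp(-||x - y||^2 / (2 * sigma^2))."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return sim


def row_normalize(matrix):
    """Scale each row to sum to 1, giving a random-walk transition matrix."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total for value in row])
    return normalized


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def random_walk_similarity(points, sigma=1.0, steps=2):
    """Multi-step transition probabilities (assumed form of the smoothing)."""
    transition = row_normalize(gaussian_similarity(points, sigma))
    result = transition
    for _ in range(steps - 1):
        result = matmul(result, transition)
    return result


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
smoothed = random_walk_similarity(points, sigma=1.0, steps=2)
```

Each row of `smoothed` is a probability distribution, and entries between points in the same tight group dominate entries between the two groups.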

2.
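Discriminative semi-supervised clustering analysis with pairwise constraints   Cited by: 10 (self-citations: 1, citations by others: 9)
Yin Xuesong, Hu Enliang, Chen Songcan. Journal of Software (软件学报), 2008, 19(11): 2791-2802
Some typical existing semi-supervised clustering methods struggle to resolve violations of pairwise constraints effectively and, at the same time, cannot handle high-dimensional data. This paper proposes a discriminative semi-supervised clustering analysis method based on pairwise constraints that addresses both problems. The method uses supervision information effectively to integrate dimensionality reduction and clustering: it clusters the data in a projected space with a pairwise-constrained K-means algorithm, then uses the clustering result to select the projection space. The algorithm also reduces the computational complexity of constraint-based semi-supervised clustering and resolves the violation of pairwise constraints during clustering. Experiments on a set of real-world data sets show that, compared with existing related semi-supervised clustering algorithms, the new method not only handles high-dimensional data but also effectively improves clustering performance.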

3.
Data clustering is a common technique for data analysis, used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them handle a single data type (numerical or nominal) or mixed data types (numerical and nominal), and only a few provide a generic method that clusters all types of data. Most real-world applications, however, must handle both feature types and their mixture. In this paper, we propose an automated technique, called SpectralCAT, for unsupervised clustering of high-dimensional data that contains numerical attributes, nominal attributes, or a mix of both. We propose automatically transforming the high-dimensional input data into categorical values. This is done by discovering the optimal transformation according to the Calinski–Harabasz index for each feature and attribute in the dataset. Then, a method for spectral clustering via dimensionality reduction of the transformed data is applied. This is achieved by automatic non-linear transformations, which identify geometric patterns in the data and find the connections among them while projecting them onto low-dimensional spaces. We compare our method to several clustering algorithms using 16 public datasets from different domains and types. The experiments demonstrate that our method outperforms these algorithms in most cases.
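
The Calinski–Harabasz index mentioned above is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. The sketch below computes it for a labeled point set; how SpectralCAT searches over candidate transformations is not stated in the abstract, so the comparison of two candidate labelings at the end is only an assumed usage.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("the index needs 2 <= k < n clusters")

    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0  # B: size-weighted squared distance of centroids to the mean
    within = 0.0   # W: squared distance of points to their own centroid
    for members in clusters.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        between += size * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )
    if within == 0.0:
        return float("inf")  # every point coincides with its centroid
    return (between / (k - 1)) / (within / (n - k))


values = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = calinski_harabasz(values, [0, 0, 1, 1])  # groups nearby values
bad = calinski_harabasz(values, [0, 1, 0, 1])   # mixes distant values
```

A higher index indicates compact, well-separated groups, so `good` scores far above `bad` here.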

4.
Knowledge, 2005, 18(2-3): 99-105
The discovery of association rules is an important data-mining task for which many algorithms have been proposed. However, the efficiency of these algorithms needs to be improved to handle large real-world datasets. In this paper, we present an efficient algorithm named cluster-based association rule (CBAR). CBAR creates cluster tables by scanning the database once and assigning each transaction record of length k to the k-th cluster table. The large itemsets are then generated by contrasting candidates with only a part of the cluster tables. This not only prunes a considerable amount of data, reducing the time needed for data scans and the number of contrasts required, but also ensures the correctness of the mined results. Experiments with the FoodMart transaction database provided by Microsoft SQL Server show that CBAR outperforms Apriori, a well-known and widely used association rule algorithm.
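
The cluster-table step can be illustrated with a short sketch. Grouping transactions by length in a single pass follows the abstract; the pruning rule in `support_count` (skip every table whose record length is shorter than the candidate itemset) is an assumption about what "partial cluster tables" means, and both function names are hypothetical.

```python
from collections import defaultdict


def build_cluster_tables(transactions):
    """Single pass: a transaction with k distinct items goes to table k."""
    tables = defaultdict(list)
    for transaction in transactions:
        items = frozenset(transaction)
        tables[len(items)].append(items)
    return dict(tables)


def support_count(itemset, tables):
    """Count records containing itemset, scanning only tables with k >= |itemset|."""
    candidate = frozenset(itemset)
    count = 0
    for length, records in tables.items():
        if length < len(candidate):
            continue  # records this short cannot contain the candidate
        count += sum(1 for record in records if candidate <= record)
    return count


transactions = [
    ["milk"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "bread", "butter"],
]
tables = build_cluster_tables(transactions)
pair_support = support_count({"milk", "bread"}, tables)
```

For a 2-itemset, the length-1 table is never scanned, which is the kind of saving the abstract attributes to the cluster tables.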

5.
Spectral clustering: A semi-supervised approach   Cited by: 2 (self-citations: 0, citations by others: 2)
Graph-based spectral clustering algorithms, which are formulated as discrete combinatorial optimization problems and solved approximately by relaxing them into tractable eigenvalue decomposition problems, have developed rapidly in recent years. In this paper, we first review existing spectral clustering algorithms within a unified framework and give a straightforward explanation of spectral clustering. We also present a novel model for generalizing unsupervised spectral clustering to semi-supervised spectral clustering. Under this model, prior information given by some instance-level constraints can be generalized to space-level constraints. We find that the (undirected) graph built on the enlarged prior information is more meaningful, and hence the cluster boundaries are more accurate. Experimental results on toy data, real-world data and image segmentation demonstrate the advantages of the proposed model.

6.
Recently, many methods have appeared in the field of cluster analysis. Most existing clustering algorithms have considerable limitations in dealing with local and nonlinear data patterns. Graph-based algorithms give good results for this problem. However, some widely used graph-based clustering methods, such as spectral clustering algorithms, are sensitive to noise and outliers. In this paper, a cut-point clustering algorithm (CutPC) based on a natural neighbor graph is proposed. The CutPC method performs noise cutting when a cut-point value is above the critical value. Normally, the method can automatically identify clusters with arbitrary shapes and detect outliers without any prior knowledge or preparatory parameter settings. The user can also adjust a coefficient to better adapt clustering solutions to particular problems. Experimental results on various synthetic and real-world datasets demonstrate the clear superiority of CutPC over k-means, DBSCAN, DPC, SC, and DCore.

7.
Clustering ensemble integrates multiple base clustering results to obtain a consensus result and thus improves on the stability and robustness of a single clustering method. Since it is natural to use a hypergraph to represent the multiple base clustering results, where instances are represented by nodes and base clusters by hyperedges, several hypergraph-based clustering ensemble methods have been proposed. Conventional hypergraph-based methods obtain the final consensus result by partitioning a pre-defined static hypergraph. However, since base clusters may be imperfect due to the unreliability of base clustering methods, the pre-defined hypergraph constructed from the base clusters is also unreliable. Therefore, directly obtaining the final clustering result by partitioning the unreliable hypergraph is inappropriate. To tackle this problem, we propose a clustering ensemble method via structured hypergraph learning: instead of being constructed directly, the hypergraph is dynamically learned from the base results, which makes it more reliable. Moreover, when dynamically learning the hypergraph, we constrain it to have a clear clustering structure, which is more appropriate for clustering tasks, so we do not need to perform any uncertain postprocessing such as hypergraph partitioning. Extensive experiments show that our method not only performs better than the conventional hypergraph-based ensemble methods but also outperforms state-of-the-art clustering ensemble methods.
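
The representation described above (instances as nodes, base clusters as hyperedges) can be written as a binary incidence matrix. The sketch below builds only the conventional, pre-defined hypergraph that the abstract criticizes as unreliable; it does not attempt the learned hypergraph the authors propose. The function name and the example labelings are hypothetical.

```python
def incidence_matrix(base_clusterings):
    """Rows are instances; columns are base clusters (hyperedges).

    Entry [i][e] is 1 when instance i belongs to base cluster e.
    """
    n = len(base_clusterings[0])
    hyperedges = []
    for labels in base_clusterings:
        if len(labels) != n:
            raise ValueError("every base clustering must label all instances")
        for cluster_id in sorted(set(labels)):
            hyperedges.append([1 if lab == cluster_id else 0 for lab in labels])
    # Transpose so that each row corresponds to one instance.
    return [[edge[i] for edge in hyperedges] for i in range(n)]


# Two base clusterings of four instances, each given as a list of labels.
base = [[0, 0, 1, 1], [0, 1, 1, 1]]
H = incidence_matrix(base)
```

Every instance belongs to exactly one cluster per base clustering, so each row of `H` sums to the number of base clusterings.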

8.
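Outlier detection in high-dimensional spaces   Cited by: 35 (self-citations: 2, citations by others: 33)
In many KDD (knowledge discovery in databases) applications, such as detecting fraud in e-commerce, finding exceptions or outliers is more meaningful than finding regular knowledge. Most existing outlier detection methods target numerical attributes, and they can only find outliers without explaining their meaning. This paper proposes a definition of outliers based on a hypergraph model; the definition both captures the notion of "locality" and explains the meaning of outliers well. It also presents the HOT (hypergraph-based outlier test) algorithm, which detects outliers by computing the support, membership degree, and size deviation of each point. The algorithm handles both numerical and categorical attributes. Analysis shows that the algorithm can effectively find outliers in high-dimensional data.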

9.
Clustering algorithms are a useful tool for exploring data structures and have been employed in many disciplines. The focus of this paper is the partitioning clustering problem, with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The presented kernel clustering methods are the kernel versions of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral clustering arises from concepts in spectral graph theory, and the clustering problem is configured as a graph cut problem in which an appropriate objective function has to be optimized. Because these two seemingly different approaches have been shown to share the same mathematical foundation, an explicit proof that the two paradigms have the same objective is reported. In addition, fuzzy kernel clustering methods are presented as extensions of the kernel K-means clustering algorithm.
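
For reference, the step that makes a kernel version of K-means possible is that the distance from a point to a cluster mean in feature space can be written with kernel evaluations only. The notation below ($\phi$ for the feature map, $K$ for the kernel, $C_c$ for cluster $c$, $m_c$ for its mean) is chosen for this note and is not taken from the survey.

```latex
m_c = \frac{1}{|C_c|} \sum_{y \in C_c} \phi(y),
\qquad
\lVert \phi(x) - m_c \rVert^2
  = K(x, x)
  - \frac{2}{|C_c|} \sum_{y \in C_c} K(x, y)
  + \frac{1}{|C_c|^2} \sum_{y \in C_c} \sum_{z \in C_c} K(y, z).
```

Because $\phi$ never appears on the right-hand side, the assignment step of K-means can be carried out in a feature space that is never computed explicitly.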

10.
In recent years, spectral clustering has gained attention because of its superior performance. To the best of our knowledge, existing spectral clustering algorithms cannot incrementally update the clustering results given a small change in the data set. However, the capability of incremental updating is essential to some applications, such as the websphere or blogosphere. Unlike traditional stream data, these applications require incremental algorithms to handle not only insertion/deletion of data points but also similarity changes between existing points. In this paper, we extend standard spectral clustering to such evolving data by introducing the incidence vector/matrix to represent both kinds of dynamics in the same framework and by incrementally updating the eigen-system. Our incremental algorithm, initialized by a standard spectral clustering, continuously and efficiently updates the eigenvalue system and generates instant cluster labels as the data set evolves. The algorithm is applied to a blog data set. Compared with recomputing the solution by standard spectral clustering, it achieves similar accuracy at much lower computational cost. It can discover not only the stable blog communities but also the evolution of individual multi-topic blogs. The core technique of incrementally updating the eigenvalue system is a general algorithm with a wide range of applications beyond incremental spectral clustering wherever dynamic graphs are involved. This demonstrates the wide applicability of our incremental algorithm.

11.
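Research on a hypergraph-based high-dimensional clustering algorithm using attribute distribution similarity   Cited by: 4 (self-citations: 0, citations by others: 4)
In many clustering applications, the data objects are high-dimensional, sparse, and binary. Traditional clustering algorithms cannot handle such data effectively. This paper proposes a high-dimensional clustering algorithm based on a hypergraph model: it defines an attribute distribution feature vector for each object and an attribute distribution similarity between objects, builds a hypergraph model, and clusters by hypergraph partitioning. The clustering result is evaluated by the intra-cluster singular feature values. Experimental results and algorithm analysis show that the algorithm can effectively mine clustering knowledge.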

12.
In recent years, semi-supervised clustering (SSC) has attracted considerable interest from the machine learning and data mining communities. In this paper we propose a novel SSC approach with enhanced spectral embedding (ESE), which not only considers the geometric structure information contained in data sets but can also make use of given side information such as pairwise constraints. Specifically, we first construct a symmetry-favored k-NN graph, which is highly robust to noise and outliers and can reflect the underlying manifold structures of data sets. Then we learn the enhanced spectral embedding towards an ideal data representation that is as consistent with the given pairwise constraints as possible. Finally, by using the regularization of spectral embedding, we formulate learning the new data representation as a semidefinite-quadratic-linear programming (SQLP) problem, which can be solved efficiently. Experimental results on a variety of synthetic and real-world data sets show that our ESE approach outperforms state-of-the-art SSC algorithms in terms of speed and quality on both vector-based and graph-based clustering.

13.
Hierarchical clustering of mixed data based on distance hierarchy   Cited by: 1 (self-citations: 0, citations by others: 1)
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data, and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular by integrating it with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degrees of similarity.
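
One way to picture a distance hierarchy is as a tree of concepts with weighted links, where the distance between two categorical values is the total weight of the path joining them. The sketch below is a simplified reading of that idea restricted to tree nodes; the example hierarchy, the unit link weights, and the function names are all hypothetical, and the paper's handling of numerical values is not reproduced.

```python
def distances_to_ancestors(node, parent):
    """Map the node and each of its ancestors to the path weight from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        node, weight = parent[node]
        total += weight
        dist[node] = total
    return dist


def hierarchy_distance(a, b, parent):
    """Path weight between a and b through their lowest common ancestor."""
    dist_a = distances_to_ancestors(a, parent)
    dist_b = distances_to_ancestors(b, parent)
    common = set(dist_a) & set(dist_b)
    if not common:
        raise ValueError("values do not share a hierarchy")
    # With non-negative weights the minimum is attained at the lowest
    # common ancestor.
    return min(dist_a[c] + dist_b[c] for c in common)


# child -> (parent, link weight); "Any" is the root of this toy hierarchy.
parent = {
    "Drink": ("Any", 1.0),
    "Food": ("Any", 1.0),
    "Coke": ("Drink", 1.0),
    "Pepsi": ("Drink", 1.0),
    "Bread": ("Food", 1.0),
}
same_branch = hierarchy_distance("Coke", "Pepsi", parent)
other_branch = hierarchy_distance("Coke", "Bread", parent)
```

Values under the same parent end up closer than values in different branches, which is the graded similarity between categorical values that plain equality matching cannot express.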

14.
Hypergraph Models and Algorithms for Data-Pattern-Based Clustering   Cited by: 2 (self-citations: 0, citations by others: 2)
In traditional approaches for clustering market basket type data, relations among transactions are modeled according to the items occurring in these transactions. However, an individual item might induce different relations in different contexts. Since such contexts might be captured by interesting patterns in the overall data, we represent each transaction as a set of patterns by modifying the conventional pattern semantics. By clustering the patterns in the dataset, we infer a clustering of the transactions represented this way. For this, we propose a novel hypergraph model to represent the relations among the patterns. Instead of a local measure that depends only on common items among patterns, we propose a global measure based on the co-occurrences of these patterns in the overall data. The success of existing hypergraph partitioning based algorithms in other domains depends on the sparsity of the hypergraph and on explicit objective metrics. We therefore propose a two-phase clustering approach for the above hypergraph, which is expected to be dense. In the first phase, the vertices of the hypergraph are merged in a multilevel algorithm to obtain a large number of high-quality clusters. Here, we propose new quality metrics for merging decisions in hypergraph clustering specifically for this domain. To enable the use of existing metrics in the second phase, we introduce a vertex-to-cluster affinity concept to devise a method for constructing a sparse hypergraph based on the obtained clustering. The experiments we have performed show the effectiveness of the proposed framework.

15.
Matrix-variate and higher-order probabilistic projections   Cited by: 1 (self-citations: 0, citations by others: 1)
Feature extraction from two-dimensional or higher-order data, such as face images and surveillance videos, has recently been an active research area. There are several 2D or higher-order PCA-style dimensionality reduction algorithms, but they mostly lack probabilistic interpretations and are difficult to apply to, e.g., incomplete data. It is also hard to extend these algorithms to applications where a certain region of the data point needs special focus in the dimensionality reduction process (e.g., the facial region in a face image). In this paper we propose a probabilistic dimensionality reduction framework for 2D and higher-order data. It specifies a particular generative process for this type of data and leads to a better understanding of some 2D and higher-order PCA-style algorithms. In particular, we show that it takes several existing algorithms as its (non-probabilistic) special cases. We develop efficient iterative learning algorithms within this framework and study the theoretical properties of the stationary points. The model can be easily extended to handle special regions in the high-order data. Empirical studies on several benchmark data sets and real-world cardiac ultrasound images demonstrate the strength of this framework.

16.
Many real-world clustering problems are plagued by incomplete data characterized by missing or absent features for some or all of the data instances. Traditional clustering methods cannot be directly applied to such data without preprocessing by imputation or marginalization techniques. In this article, we overcome this drawback by utilizing a penalized dissimilarity measure which we refer to as the feature weighted penalty based dissimilarity (FWPD). Using the FWPD measure, we modify the traditional k-means clustering algorithm and the standard hierarchical agglomerative clustering algorithms so as to make them directly applicable to datasets with missing features. We present time complexity analyses for these new techniques and also undertake a detailed theoretical analysis showing that the new FWPD based k-means algorithm converges to a local optimum within a finite number of iterations. We also present a detailed method for simulating random as well as feature dependent missingness. We report extensive experiments on various benchmark datasets for different types of missingness, showing that the proposed clustering techniques generally give better results than some of the most well-known imputation methods commonly used to handle such incomplete data. We append a possible extension of the proposed dissimilarity measure to the case of absent features (where the unobserved features are known to be undefined).

17.
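A hypergraph-model-based method for clustering data in high-dimensional space   Cited by: 7 (self-citations: 0, citations by others: 7)
Zhang Rong, Peng Hong. Computer Engineering (计算机工程), 2002, 28(7): 54-55, 164
The problem of clustering data in high-dimensional space is transformed into a hypergraph partitioning optimization problem, and a clustering method for high-dimensional data based on the hypergraph model is proposed. The method does not need to reduce the dimensionality of the high-dimensional data items; it describes the relationships among the original data directly with a hypergraph model and, by choosing an appropriate support threshold, effectively removes noise points and ensures clustering quality.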

18.
The hypergraph is a powerful representation for several computer vision, machine learning, and pattern recognition problems. In the last decade, many researchers have been keen to develop different hypergraph models. In contrast, not much attention has been paid to the design of hyperedge weighting schemes. However, many studies on pairwise graphs have shown that the choice of edge weights can significantly influence the performance of graph algorithms. We argue that this also applies to hypergraphs. In this paper, we empirically study the influence of hyperedge weights on hypergraph learning by proposing three novel hyperedge weighting schemes from the perspectives of geometry, multivariate statistical analysis, and linear regression. Extensive experiments on the ORL, COIL20, JAFFE, Sheffield, Scene15 and Caltech256 datasets verify our hypothesis for both classification and clustering problems. For each of these classes of problems, our empirical study concludes by suggesting a suitable hypergraph weighting scheme. Moreover, the experiments also demonstrate that combinations of such weighting schemes with conventional hypergraph models can achieve competitive classification and clustering performance in comparison with some recent state-of-the-art algorithms.

19.
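Traditional spectral clustering algorithms consider only the pairwise relationships between data points and ignore complex correlations that may be hidden in the data. To address this, a spectral clustering method based on hypergraphs and self-representation is proposed. First, a hypergraph of the data is built and its Laplacian matrix is obtained. Then, the L2,1-norm is used to compute a row-sparse self-representation of the samples, while the hypergraph is incorporated to describe the multi-level relationships among the data. Finally, spectral clustering is performed on the resulting self-representation coefficients. The hypergraph-based self-representation accounts for the complex correlations among samples. Experiments on Hopkins155 and other data sets show that, under the clustering error rate criterion, the algorithm outperforms existing spectral clustering algorithms based on ordinary graphs, such as SSC and SRC.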
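
The abstract does not say which hypergraph Laplacian is used. As an assumption for orientation only, one widely used normalized form is shown below, where $H$ is the $|V| \times |E|$ incidence matrix, $W$ the diagonal matrix of hyperedge weights $w(e)$, and $D_v$ and $D_e$ the diagonal matrices of vertex and hyperedge degrees.

```latex
d(v) = \sum_{e \in E} w(e)\, h(v, e),
\qquad
\delta(e) = \sum_{v \in V} h(v, e),
\qquad
\Delta = I - D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top} D_v^{-1/2}.
```

When every hyperedge contains exactly two vertices, this is the normalized graph Laplacian up to a constant factor of 1/2, which is why hypergraph methods can be read as a generalization of pairwise spectral clustering.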

20.
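In this paper, we propose VBCCD, a new clustering algorithm for non-numerical data. VBCCD computes one-dimensional partitions of a relation from the relational table, constructs a hypergraph from these partitions, and then applies a hypergraph partitioning algorithm to optimally partition the constructed hypergraph, yielding the final clustering result. Experimental results show that the algorithm performs better than traditional clustering algorithms designed for numerical data.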
