Similar literature
1.
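Jia Hongjie, Ding Shifei, Shi Zhongzhi. Journal of Software (软件学报), 2015, 26(11): 2836-2846
Spectral clustering, a clustering method based on algebraic graph theory, recasts clustering as a graph partitioning problem. To optimize the graph partitioning objective, one typically exploits the properties of the Rayleigh quotient: the eigenvectors of the Laplacian matrix map the original data points into a low-dimensional feature space, where clustering is then performed. However, storing the similarity matrix takes O(n^2) space, and the eigendecomposition of the Laplacian matrix generally takes O(n^3) time, which is unacceptable for large-scale data. It has been proved that Normalized Cut graph clustering and weighted kernel k-means are both equivalent to a matrix trace maximization problem. The weighted kernel k-means algorithm can therefore be used to optimize the Normalized Cut objective, which avoids the eigendecomposition of the Laplacian matrix. Weighted kernel k-means still needs the kernel matrix, however, whose space complexity remains O(n^2). To address this challenge, an approximate weighted kernel k-means algorithm is proposed that uses only part of the kernel matrix to solve spectral clustering problems on big data. Theoretical analysis and experimental comparison show that approximate weighted kernel k-means clusters about as well as weighted kernel k-means while greatly reducing time and space complexity.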
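The weighted kernel k-means iteration that the abstract builds on can be sketched as follows. This is a minimal illustration of the exact, full-kernel-matrix baseline, not the approximate algorithm the paper proposes. The function names, the Gaussian kernel, and the `gamma` parameter are assumptions made for the example. The distance from point i to the implicit center of cluster c is computed from kernel values only: K_ii - 2 * sum_j(w_j K_ij) / s_c + sum_jl(w_j w_l K_jl) / s_c^2, where j and l range over the members of c and s_c is the total weight of c.

```python
import math

def rbf_kernel_matrix(points, gamma=1.0):
    """Full n x n Gaussian kernel matrix (O(n^2) memory, as the abstract notes)."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q)))
             for q in points] for p in points]

def weighted_kernel_kmeans(K, weights, labels, k, max_iter=100):
    """Lloyd-style iterations that use kernel values only, never explicit centers."""
    n = len(K)
    labels = list(labels)
    for _ in range(max_iter):
        members = [[j for j in range(n) if labels[j] == c] for c in range(k)]
        mass = [sum(weights[j] for j in m) for m in members]
        # Third term of the distance: weighted within-cluster kernel sum.
        within = [sum(weights[j] * weights[l] * K[j][l] for j in m for l in m)
                  for m in members]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c in range(k):
                if mass[c] == 0:
                    continue  # an empty cluster has no center to compare against
                cross = sum(weights[j] * K[i][j] for j in members[c])
                d = K[i][i] - 2.0 * cross / mass[c] + within[c] / mass[c] ** 2
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels
```

Every iteration reads the whole matrix `K`, which is the O(n^2) storage cost the abstract sets out to reduce by using only part of the kernel matrix.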

2.
Semi-supervised model-based document clustering: A comparative study   Total citations: 4, self-citations: 0, citations by others: 4
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data is not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete. Editor: Andrew Moore
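The seeded and constrained k-means variants mentioned above differ only in how they treat the labeled seeds, which a short sketch can make concrete. The code below uses plain Euclidean k-means rather than the multinomial deterministic-annealing models the paper studies. The function names and the `seeds` dictionary format are assumptions for illustration, and every cluster is assumed to have at least one seed (the "complete labels" case).

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def seeded_kmeans(points, seeds, k, constrained=False, max_iter=100):
    """seeds maps a point index to its known cluster label in range(k).

    constrained=False: seeds only initialize the centers (seeded k-means).
    constrained=True:  seed labels are also kept fixed (constrained k-means).
    """
    centers = [mean([points[i] for i, c in seeds.items() if c == lab])
               for lab in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for i, p in enumerate(points):
            if constrained and i in seeds:
                new_labels.append(seeds[i])  # seed labels never change
            else:
                new_labels.append(
                    min(range(k), key=lambda c: sq_dist(p, centers[c])))
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = mean(members)
    return labels, centers
```

With `constrained=True` a seed keeps its given label even when it lies closer to another center, which is the behavioral difference between the two variants.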

3.
Clustering is a popular data analysis and data mining technique. A popular clustering technique is k-means, which partitions the data into K clusters. However, the k-means algorithm depends heavily on its initial state and converges to a local optimum. This paper presents a new hybrid evolutionary algorithm to solve the nonlinear partitional clustering problem. The proposed hybrid evolutionary algorithm is the combination of FAPSO (fuzzy adaptive particle swarm optimization), ACO (ant colony optimization) and k-means algorithms, called FAPSO-ACO–K, which can find a better cluster partition. The performance of the proposed algorithm is evaluated through several benchmark data sets. The simulation results show that the performance of the proposed algorithm is better than other algorithms such as PSO, ACO, simulated annealing (SA), combination of PSO and SA (PSO–SA), combination of ACO and SA (ACO–SA), combination of PSO and ACO (PSO–ACO), genetic algorithm (GA), Tabu search (TS), honey bee mating optimization (HBMO) and k-means for the partitional clustering problem.
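The sensitivity of k-means to its starting point, which motivates the hybrid search above, can be reproduced on four points. The sketch below is plain Lloyd's algorithm, not FAPSO-ACO–K; the function names and the toy data are assumptions for illustration. Two initializations on the same data settle on different partitions with very different sums of squared errors (SSE).

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, centers, max_iter=100):
    """Standard k-means iterations from the given initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group (keep it if empty).
        new_centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
                       for i, g in enumerate(groups)]
        if new_centers == centers:
            break  # converged, possibly to a local optimum only
        centers = new_centers
    sse = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, sse

# Corners of a wide rectangle: the natural split is left pair vs right pair.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
_, sse_bad = lloyd(points, [(2, 0), (2, 1)])       # splits bottom vs top
_, sse_good = lloyd(points, [(0, 0.5), (4, 0.5)])  # splits left vs right
```

Both runs stop because their assignments no longer change, yet the first one is stuck in a partition whose SSE is much larger than the second's. Global search heuristics such as the one proposed in the abstract aim to escape this kind of local optimum.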

4.
In recent years, semi-supervised clustering (SSC) has attracted considerable interest from the machine learning and data mining communities. In this paper we propose a novel SSC approach with enhanced spectral embedding (ESE), which not only considers the geometric structure information contained in data sets, but also can make use of the given side information such as pairwise constraints. Specifically, we first construct a symmetry-favored k-NN graph, which is highly robust to noise and outliers, and can reflect the underlying manifold structures of data sets. Then we learn the enhanced spectral embedding towards an ideal data representation as consistent with the given pairwise constraints as possible. Finally, by using the regularization of spectral embedding we formulate learning the new data representation as a semidefinite-quadratic-linear programming (SQLP) problem, which can be efficiently solved. Experimental results on a variety of synthetic and real-world data sets show that our ESE approach outperforms the state-of-the-art SSC algorithms in terms of speed and quality on both vector-based and graph-based clustering.

5.
Spectral Curvature Clustering (SCC)   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents novel techniques for improving the performance of a multi-way spectral clustering framework (Govindu in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, pp. 1150–1157, 2005; Chen and Lerman, 2007, preprint in the supplementary webpage) for segmenting affine subspaces. Specifically, it suggests an iterative sampling procedure to improve the uniform sampling strategy, an automatic scheme of inferring the tuning parameter from the data, a precise initialization procedure for K-means, as well as a simple strategy for isolating outliers. The resulting algorithm, Spectral Curvature Clustering (SCC), requires only linear storage and takes linear running time in the size of the data. It is supported by theory which both justifies its successful performance and guides our practical choices. We compare it with other existing methods on a few artificial instances of affine subspaces. Application of the algorithm to several real-world problems is also discussed. This work was supported by NSF grant #0612608. Supplementary webpage: .

6.
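A semi-supervised K-means clustering algorithm for multi-relational data   Total citations: 1, self-citations: 0, citations by others: 1
Gao Ying, Liu Dayou, Qi Hong, Liu He. Journal of Software (软件学报), 2008, 19(11): 2814-2821
A semi-supervised K-means clustering algorithm for multi-relational data is proposed. Building on the K-means clustering algorithm, it extends the method for selecting initial clusters and the object similarity measure so that they can be used for semi-supervised learning on multi-relational data. To obtain high performance, the algorithm makes full use of labeled data, object attributes, and the various kinds of relational information during clustering. Experimental results on the multi-relational database Movie confirm the effectiveness of the algorithm.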

7.
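Wang Liang, Wang Shitong. Computer Engineering (计算机工程), 2012, 38(1): 148-150
To address imbalance among samples, a dynamically weighted semi-supervised fuzzy kernel clustering algorithm based on pairwise constraints is proposed. A semi-supervised learning mechanism is added to the traditional fuzzy clustering algorithm, the original data space is mapped into a feature space through a Mercer kernel, and each vector in the feature space is assigned a dynamic weight, which yields a new objective function. Combined with a simple method for selecting the kernel parameter, this objective is used to classify the data. Theoretical analysis and experimental results show that the algorithm clusters better than the fuzzy kernel clustering algorithm and the pairwise-constrained competitive agglomeration algorithm.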

8.
By using a kernel function, data that are not easily separable in the original space can be clustered into homogeneous groups in the implicitly transformed high-dimensional feature space. Kernel k-means algorithms have recently been shown to perform better than conventional k-means algorithms in unsupervised classification. However, few reports have examined the benefits of using a kernel function and the relative merits of the various kernel clustering algorithms with regard to the data distribution. In this study, we reformulated four representative clustering algorithms based on a kernel function and evaluated their performances for various data sets. The results indicate that each kernel clustering algorithm gives markedly better performance than its conventional counterpart for almost all data sets. Of the kernel clustering algorithms studied in the present work, the kernel average linkage algorithm gives the most accurate clustering results.

9.
Fast and exact out-of-core and distributed k-means clustering   Total citations: 1, self-citations: 2, citations by others: 1
Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset. In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means. This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are downloading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received a BE and a ME degree in computer engineering from Beihang University (BUAA), China in 1996 and 1999, respectively. He earned his MS degree in computer science from University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD. Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he was interested in studying computer science. He then obtained a master's degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he did a Masters thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan's favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer. Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech degree from Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D degrees from University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.

10.
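A fuzzy K-prototypes clustering algorithm for mixed numeric and categorical data   Total citations: 15, self-citations: 0, citations by others: 15
Chen Ning, Chen An, Zhou Longxiang. Journal of Software (软件学报), 2001, 12(8): 1107-1119
Because databases often contain both numeric and categorical attributes, clustering algorithms that can handle mixed data are clearly important. The paper discusses the problem of clustering mixed data and proposes a fuzzy K-prototypes algorithm. The algorithm combines the ways in which K-means and K-modes handle numeric and categorical data, respectively, and can therefore process mixed-type data. The fuzzy technique captures the boundary characteristics of clusters and is better suited to databases that contain noise and missing data. Experimental results show that the fuzzy algorithm produces more accurate results than the corresponding hard (crisp) algorithm.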
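How K-means and K-modes handling can be combined for mixed data is easiest to see in the dissimilarity measure. The sketch below assumes the commonly used K-prototypes form (squared Euclidean distance on numeric attributes plus a weight `gamma` times the number of categorical mismatches) and the standard fuzzy c-means membership formula. The abstract does not state the paper's exact update rules, so both choices, along with the function names and the parameters `gamma` and `m`, are assumptions for illustration.

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """K-means part on numeric attributes plus K-modes part on categorical ones."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    mismatches = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * mismatches

def fuzzy_memberships(dists, m=2.0):
    """Membership of one object in each cluster, given its distances to all prototypes.

    m > 1 is the fuzzifier; larger values give softer memberships.
    """
    if any(d == 0 for d in dists):
        # The object coincides with at least one prototype: share membership there.
        hits = [1.0 if d == 0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    inv = [d ** (-1.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]
```

Memberships sum to one for each object, so a point near a cluster boundary is shared between clusters instead of being forced into one of them, which is the boundary behavior the abstract attributes to the fuzzy technique.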

11.
Adapting k-means for supervised clustering   Total citations: 2, self-citations: 1, citations by others: 1
k-means is traditionally viewed as an algorithm for the unsupervised clustering of a heterogeneous population into a number of more homogeneous groups of objects. However, it is not necessarily guaranteed to group the same types (classes) of objects together. In such cases, some supervision is needed to partition objects which have the same label into one cluster. This paper demonstrates how the popular k-means clustering algorithm can be profitably modified to be used as a classifier algorithm. The output field itself cannot be used in the clustering but it is used in developing a suitable metric defined on other fields. The proposed algorithm combines Simulated Annealing with the modified k-means algorithm. We apply the proposed algorithm to real data sets, and compare the output of the resultant classifier to that of C4.5.

12.
An algorithm for optimizing data clustering in feature space is studied in this work. Using the graph Laplacian and the extreme learning machine (ELM) mapping technique, we develop an optimal weight matrix W for feature mapping. This work explicitly maps the original data into an optimal feature space for clustering, which can further increase the separability of the original data in the feature space while keeping the pattern points of the same cluster close together. Our method, which can be easily implemented, gets better clustering results than some popular clustering algorithms, like k-means on the original data, the kernel clustering method, the spectral clustering method, and ELM k-means, on data including three UCI real data benchmarks (IRIS data, Wisconsin breast cancer database, and Wine database).

13.
Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The presented kernel clustering methods are the kernel version of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral clustering arises from concepts in spectral graph theory, and the clustering problem is configured as a graph cut problem where an appropriate objective function has to be optimized. An explicit proof that these two paradigms have the same objective is reported, since it has been proven that these two seemingly different approaches share the same mathematical foundation. Besides, fuzzy kernel clustering methods are presented as extensions of the kernel K-means clustering algorithm.

14.
In this paper, we propose a novel Fast Affinity Propagation clustering approach (FAP). FAP simultaneously considers both local and global structure information contained in datasets, and is a high-quality multilevel graph partitioning method that can implement both vector-based and graph-based clustering. First, a new Fast Sampling algorithm (FS) is proposed to coarsen the input sparse graph and choose a small number of final representative exemplars. Then a density-weighted spectral clustering method is presented to partition those exemplars on the global underlying structure of data manifold. Finally, the cluster assignments of all data points can be achieved through their corresponding representative exemplars. Experimental results on two synthetic datasets and many real-world datasets show that our algorithm outperforms the state-of-the-art original affinity propagation and spectral clustering algorithms in terms of speed, memory usage, and quality on both vector-based and graph-based clustering.

15.
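Based on the cluster assumption, a new graph-based semi-supervised learning algorithm, called density-sensitive semi-supervised clustering, is proposed. The algorithm introduces a density-sensitive distance measure that reflects the cluster assumption well and fully exploits the complex intrinsic structure of the data set. Combined with graph-based semi-supervised learning, this measure significantly improves clustering performance. Simulation experiments further show that the algorithm has advantages in specific image applications.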

16.
Clustering is a very powerful data mining technique for topic discovery from text documents. The partitional clustering algorithms, such as the family of k-means, are reported to perform well on document clustering. They treat the clustering problem as an optimization process of grouping documents into k clusters so that a particular criterion function is minimized or maximized. Usually, the cosine function is used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. To solve this problem, we applied the concepts of neighbors and link, introduced in [S. Guha, R. Rastogi, K. Shim, ROCK: a robust clustering algorithm for categorical attributes, Information Systems 25 (5) (2000) 345–366], to document clustering. If two documents are similar enough, they are considered as neighbors of each other. The link between two documents represents the number of their common neighbors. Instead of just considering the pairwise similarity, the neighbors and link bring global information into the measurement of the closeness of two documents. In this paper, we propose to use the neighbors and link for the family of k-means algorithms in three aspects: a new method to select initial cluster centroids based on the ranks of candidate documents; a new similarity measure which uses a combination of the cosine and link functions; and a new heuristic function for selecting a cluster to split based on the neighbors of the cluster centroids. Our experimental results on real-life data sets demonstrated that our proposed methods can significantly improve the performance of document clustering in terms of accuracy without increasing the execution time much.
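The neighbor and link concepts described above can be sketched in a few lines. The abstract says only that the new similarity measure combines the cosine and link functions, so the convex combination with weight `alpha`, the normalization by the number of documents, the threshold `theta`, and all function names below are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two term-weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def neighbor_sets(docs, theta):
    """Documents i and j are neighbors when their cosine similarity is >= theta."""
    n = len(docs)
    return [{j for j in range(n) if cosine(docs[i], docs[j]) >= theta}
            for i in range(n)]

def link(neigh, i, j):
    """Number of neighbors that documents i and j have in common."""
    return len(neigh[i] & neigh[j])

def combined_similarity(docs, neigh, i, j, alpha=0.5):
    """Hypothetical mix: alpha * normalized link + (1 - alpha) * cosine."""
    max_link = len(docs)  # a link count can never exceed the number of documents
    return (alpha * link(neigh, i, j) / max_link
            + (1.0 - alpha) * cosine(docs[i], docs[j]))

# Documents 0 and 2 share no terms, but both are neighbors of document 1.
docs = [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 1)]
neigh = neighbor_sets(docs, theta=0.5)
```

In this toy collection documents 0 and 2 have cosine similarity zero but still share a neighbor, so the link term separates them from the unrelated document 3. This is the kind of global information the abstract says pairwise cosine alone misses.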

17.
Robust self-tuning semi-supervised learning   Total citations: 3, self-citations: 0, citations by others: 3
Fei  Changshui. Neurocomputing, 2007, 70(16-18): 2931
We investigate the issue of graph-based semi-supervised learning (SSL). The labeled and unlabeled data points are represented as vertices in an undirected weighted neighborhood graph, with the edge weights encoding the pairwise similarities between data objects in the same neighborhood. The SSL problem can be then formulated as a regularization problem on this graph. In this paper we propose a robust self-tuning graph-based SSL method, which (1) can determine the similarities between pairwise data points automatically; (2) is not sensitive to outliers. Promising experimental results are given for both synthetic and real data sets.

18.
In recent years, there have been numerous attempts to extend the k-means clustering protocol for a single database to a distributed multiple-database setting while preserving the privacy of each data site. Current solutions for multiparty k-means clustering (whether for two or more parties), built on one or more secure two-party computation algorithms, are not equally contributory; in other words, the parties do not contribute equally to k-means clustering. This may lead to a perfidious attack in which a party who learns the outcome before the other parties lies to them about it. In this paper, we present an equally contributory multiparty k-means clustering protocol for vertically partitioned data, in which each party equally contributes to k-means clustering. Our protocol is built on ElGamal's encryption scheme, Jakobsson and Juels's plaintext equivalence test protocol, and mix networks, and protects privacy in the sense that each iteration of k-means clustering can be performed without revealing the intermediate values.

19.
Partitional clustering is a common approach to cluster analysis. Although many algorithms have been proposed, partitional clustering remains a challenging problem with respect to the reliability and efficiency of recovering high quality solutions in terms of its criterion functions. In this paper, we propose a niching genetic k-means algorithm (NGKA) for partitional clustering, which aims at reliably and efficiently identifying high quality solutions in terms of the sum of squared errors criterion. Within the NGKA, we design a niching method, which encourages mating among similar clustering solutions while allowing for some competition among dissimilar solutions, and integrate it into a genetic algorithm to prevent premature convergence during the evolutionary clustering search. Further, we incorporate one step of k-means operation into the regeneration steps of the resulting niching genetic algorithm to improve its computational efficiency. The proposed algorithm was applied to cluster both simulated data and gene expression data and compared with previous work. Experimental results clearly show that the NGKA is an effective clustering algorithm and outperforms two other genetic algorithm based clustering methods implemented for comparison.

20.
Spectral clustering aims to partition a data set into several groups by using the Laplacian of the graph such that data points in the same group are similar while data points in different groups are dissimilar to each other. Spectral clustering is very simple to implement and has many advantages over the traditional clustering algorithms such as k-means. Non-negative matrix factorization (NMF) factorizes a non-negative data matrix into a product of two non-negative (lower rank) matrices so as to achieve dimension reduction and part-based data representation. In this work, we proved that the spectral clustering under some conditions is equivalent to NMF. Unlike the previous work, we formulate the spectral clustering as a factorization of data matrix (or scaled data matrix) rather than the symmetrical factorization of the symmetrical pairwise similarity matrix as the previous study did. Under the NMF framework, where regularization can be easily incorporated into the spectral clustering, we propose several non-negative and sparse spectral clustering algorithms. Empirical studies on real world data show much better clustering accuracy of the proposed algorithms than some state-of-the-art methods such as ratio cut and normalized cut spectral clustering and non-negative Laplacian embedding.
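The factorization step that NMF performs can be illustrated with the standard multiplicative update rules for the squared Frobenius error. This is a generic sketch and not one of the regularized spectral clustering algorithms the paper proposes. The function names, the random initialization, and the parameters `rank`, `iters`, `seed`, and `eps` are assumptions for illustration.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(X, W, H):
    """Squared Frobenius norm of X - W H."""
    WH = matmul(W, H)
    return sum((x - y) ** 2 for rx, ry in zip(X, WH) for x, y in zip(rx, ry))

def nmf(X, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative n x m matrix X into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H
```

Because every update multiplies a non-negative entry by a non-negative ratio, W and H stay non-negative throughout. Reading each row of W as soft cluster indicators is one common way to turn such a factorization into a clustering.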
