1.

Generative model-based document clustering: a comparative study (total citations: 7; self-citations: 2; citations by others: 7)

Shi Zhong Joydeep Ghosh 《Knowledge and Information Systems》2005,8(3):374-384

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
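As a rough illustration of the hard-assignment vMF case mentioned in the abstract: if all clusters are assumed to share one fixed concentration parameter and equal mixing weights, hard assignment amounts to sending each unit-normalized document to the mean direction with the highest cosine similarity, which is the spherical k-means procedure. The sketch below relies on that simplification. The function name `spherical_kmeans`, its parameters, and the requirement that every row of `X` is non-zero are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def spherical_kmeans(X, k, n_iter=20, seed=0):
    """Hard-assignment clustering of documents by cosine similarity (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Project each document vector onto the unit sphere (rows must be non-zero).
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Initialize the k mean directions from k distinct documents.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Hard assignment: each document goes to its most similar mean direction.
        labels = np.argmax(X @ centers.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                # The updated mean direction is the normalized sum of the members.
                s = members.sum(axis=0)
                centers[j] = s / np.linalg.norm(s)
    return labels, centers

# Toy term-count matrix: documents 0-2 use terms 0-1, documents 3-5 use terms 2-3.
docs = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [1, 3, 0, 0],
                 [0, 0, 3, 1], [0, 0, 2, 2], [0, 0, 1, 3]], dtype=float)
labels, _ = spherical_kmeans(docs, k=2)
```

The soft and DA variants studied in the paper replace the `argmax` step with weighted responsibilities, which is where the extra computational cost noted in the abstract comes from.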

2.

Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results

Ian Davidson S. S. Ravi 《Data mining and knowledge discovery》2009,18(2):257-282

Clustering with constraints is a powerful method that allows users to specify background knowledge and the expected cluster properties. Significant work has explored the incorporation of instance-level constraints into non-hierarchical clustering but not into hierarchical clustering algorithms. In this paper we present a formal complexity analysis of the problem and show that constraints can be used to improve not only the quality of the resultant dendrogram but also the efficiency of the algorithms. This is particularly important since many agglomerative style algorithms have running times that are quadratic (or faster growing) functions of the number of instances to be clustered. We present several bounds on the improvement in the running times of algorithms obtainable using constraints. A preliminary version of this paper appeared as Davidson and Ravi (2005b).
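One way constraints can reduce the work of an agglomerative algorithm is sketched below, under the assumption that must-link constraints are transitive: their connected components can be merged before any distance computation, so the algorithm starts from fewer clusters. This is only an illustrative union-find sketch; the function name `must_link_components` and its interface are hypothetical and are not the authors' code.

```python
def must_link_components(n, must_link):
    """Group n instances into the connected components implied by must-link pairs."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Six instances and three must-link pairs leave three starting clusters instead of six.
starting_clusters = must_link_components(6, [(0, 1), (1, 2), (4, 5)])
```

With these starting clusters, a quadratic-time agglomerative pass operates on three items rather than six, which is the kind of saving the abstract refers to.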

3.

Semi-supervised fuzzy clustering: A kernel-based approach (total citations: 1; self-citations: 0; citations by others: 1)

Huaxiang Zhang Jing Lu 《Knowledge》2009,22(6):477-481

Semi-supervised clustering algorithms aim to improve clustering accuracy under the supervision of a limited amount of labeled data. Since kernel-based approaches, such as the kernel-based fuzzy c-means algorithm (KFCM), have been successfully used in classification and clustering problems, in this paper we propose a novel semi-supervised clustering approach based on KFCM and call it the semi-supervised kernel fuzzy c-means algorithm (SSKFCM). The objective function of SSKFCM is defined by adding the classification errors of both the labeled and the unlabeled data, and its global optimum is obtained by repeatedly updating the fuzzy memberships and the optimized kernel parameter. The objective function may have more than one local optimum, so we employ a function transformation technique to reformulate the objective function after a local minimum has been obtained, and select the best optimum as the solution to the objective function. Experimental results on both artificial and several real data sets show that SSKFCM performs better than its conventional counterparts and that it achieves the most accurate clustering results when the parameter is optimized.

4.

Semi-supervised clustering of data with complex structure (total citations: 1; self-citations: 0; citations by others: 1)

Tianqiang Huang Yangqiang Yu Xiaolin Qin 《Control and Decision》2010,25(1)

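Based on pairwise constraints, a semi-supervised clustering algorithm (SCCD) is proposed that can handle data with multiple densities and complex structure and can identify clusters of arbitrary shape. The multi-density distribution information reflected by the pairwise constraints is used to compute the neighborhood radius parameter Eps of the density-based clustering algorithm DBSCAN, and DBSCAN runs with different parameters are used to process datasets with complex shapes and varying density. Experimental results show that SCCD can discover clusters of arbitrary shape and multiple densities in noisy environments and that it outperforms existing algorithms of the same kind.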

5.

Semi-supervised clustering via multi-level random walk

《Pattern recognition》2014,47(2):820-832

A key issue of semi-supervised clustering is how to utilize the limited but informative pairwise constraints. In this paper, we propose a new graph-based constrained clustering algorithm, named SCRAWL. It is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertices (i.e., data points) into constrained and unconstrained ones, according to whether they are in the pairwise constraints. For every constrained vertex, its influence range, or the degrees of influence it exerts on the unconstrained vertices, is encapsulated in an intermediate structure called component. The edge set between each pair of components determines the affecting scope of the pairwise constraints. In the higher-level random walk, SCRAWL enforces the pairwise constraints on the components, so that the constraint influence can be propagated to the unconstrained edges. Finally, we combine the cluster membership of all the components to obtain the cluster assignment for each vertex. The promising experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our method.

6.

Semi-supervised graph clustering: a kernel approach (total citations: 6; self-citations: 0; citations by others: 6)

Brian Kulis Sugato Basu Inderjit Dhillon Raymond Mooney 《Machine Learning》2009,74(1):1-22

Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining, 2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets.

7.

Semi-supervised clustering with metric learning: An adaptive kernel method

Xuesong Yin Songcan Chen Enliang Hu 《Pattern recognition》2010,43(4):1320-1333

Most existing representative works in semi-supervised clustering do not sufficiently solve the violation problem of pairwise constraints. On the other hand, traditional kernel methods for semi-supervised clustering not only face the problem of manually tuning the kernel parameters, because insufficient supervision is provided, but also lack a measure that achieves better effectiveness of clustering. In this paper, we propose an adaptive Semi-supervised Clustering Kernel Method based on Metric learning (SCKMM) to mitigate the above problems. Specifically, we first construct an objective function from pairwise constraints to automatically estimate the parameter of the Gaussian kernel. Then, we use a pairwise constraint-based K-means approach to solve the violation issue of constraints and to cluster the data. Furthermore, we introduce metric learning into nonlinear semi-supervised clustering to improve separability of the data for clustering. Finally, we perform clustering and metric learning simultaneously. Experimental results on a number of real-world data sets validate the effectiveness of the proposed method.

8.

Shape-based clustering for 3D CAD objects: A comparative study of effectiveness

Subramaniam Jayanti Yagnanarayanan Kalyanaraman Karthik Ramani 《Computer aided design》2009,41(12):999-1007

9.

Internet traffic clustering with side information

Yu Wang Yang Xiang Jun Zhang Wanlei Zhou Bailin Xie 《Journal of Computer and System Sciences》2014

Internet traffic classification is a critical and essential functionality for network management and security systems. Due to the limitations of traditional port-based and payload-based classification approaches, the past several years have seen extensive research on utilizing machine learning techniques to classify Internet traffic based on packet and flow level characteristics. For the purpose of learning from unlabeled traffic data, some classic clustering methods have been applied in previous studies but the reported accuracy results are unsatisfactory. In this paper, we propose a semi-supervised approach for accurate Internet traffic clustering, which is motivated by the observation of widely existing partial equivalence relationships among Internet traffic flows. In particular, we formulate the problem using a Gaussian Mixture Model (GMM) with set-based equivalence constraint and propose a constrained Expectation Maximization (EM) algorithm for clustering. Experiments with real-world packet traces show that the proposed approach can significantly improve the quality of resultant traffic clusters.
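To make the idea of a set-based equivalence constraint concrete, the sketch below shows one plausible form of a constrained EM step: all flows in the same equivalence set are forced to share a single mixture component, so the E-step computes one posterior per set from the summed log-likelihoods of its members. This is a simplified illustration, not the authors' algorithm. It assumes spherical Gaussians, caller-supplied initial means, and that `sets` partitions all points (unconstrained points appear as singleton sets); the name `constrained_em` and its parameters are hypothetical.

```python
import numpy as np

def constrained_em(X, sets, init_means, n_iter=50):
    """EM for a spherical GMM in which every equivalence set shares one component."""
    n, d = X.shape
    means = np.array(init_means, dtype=float)
    k = len(means)
    var = np.full(k, X.var() + 1e-6)       # one variance per component
    weights = np.full(k, 1.0 / k)          # prior over components, per set
    resp = np.empty((n, k))
    for _ in range(n_iter):
        # Log density of every point under every spherical Gaussian.
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        log_p = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
        # E-step: one posterior per set, copied to all members of the set.
        set_post = np.empty((len(sets), k))
        for s, members in enumerate(sets):
            log_post = np.log(weights) + log_p[members].sum(axis=0)
            post = np.exp(log_post - log_post.max())
            set_post[s] = post / post.sum()
            resp[members] = set_post[s]
        # M-step: means and variances from points, weights from sets.
        nk = resp.sum(axis=0) + 1e-12
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq).sum(axis=0) / (d * nk) + 1e-6
        weights = set_post.mean(axis=0)
    return resp.argmax(axis=1)

# Toy data: point 6 lies between the two groups but is tied to point 2 by a constraint.
X = np.array([[0, 0], [0.2, 0], [0, 0.2], [5, 5], [5.2, 5], [5, 5.2], [2.6, 2.6]])
sets = [[0], [1], [2, 6], [3], [4], [5]]
labels = constrained_em(X, sets, init_means=[[0, 0], [5, 5]])
```

Because members of a set always receive identical responsibilities, they can never end up in different clusters, which is the effect the equivalence constraint is meant to have.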

10.

Spectral clustering: A semi-supervised approach (total citations: 2; self-citations: 0; citations by others: 2)

Weifu Chen Guocan Feng 《Neurocomputing》2012,77(1):229-242

Graph-based spectral clustering algorithms, which are formulated as discrete combinatorial optimization problems and approximately solved by relaxing them into tractable eigenvalue decomposition problems, have developed rapidly in recent years. In this paper, we first review existing spectral clustering algorithms in a unified framework and give a straightforward explanation of spectral clustering. We also present a novel model for generalizing unsupervised spectral clustering to semi-supervised spectral clustering. Under this model, prior information given by some instance-level constraints can be generalized to space-level constraints. We find that the (undirected) graph built on the enlarged prior information is more meaningful, and hence the boundaries of the clusters are more accurate. Experimental results on toy data, real-world data and image segmentation demonstrate the advantages of the proposed model.
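The relaxation described in the abstract can be sketched for the two-cluster case: build an affinity graph, form the normalized Laplacian, and split the points by the sign of the eigenvector belonging to the second-smallest eigenvalue. The constraint handling shown here, which simply overwrites affinities for must-link and cannot-link pairs, is a basic instance-level scheme chosen for illustration and is not the space-level model the paper proposes. The function name `spectral_bipartition`, the Gaussian affinity, and the `sigma` parameter are assumptions, and every point is assumed to keep at least one positive affinity.

```python
import numpy as np

def spectral_bipartition(X, must_link=(), cannot_link=(), sigma=1.0):
    """Two-way spectral clustering with affinities overwritten by pairwise constraints."""
    # Gaussian affinity between all pairs of points, with no self-loops.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Instance-level constraints: force maximal or zero affinity.
    for i, j in must_link:
        W[i, j] = W[j, i] = 1.0
    for i, j in cannot_link:
        W[i, j] = W[j, i] = 0.0
    # Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the relaxed cut indicator.
    _, vecs = np.linalg.eigh(L)
    indicator = d_inv_sqrt * vecs[:, 1]
    return (indicator > 0).astype(int)

# Toy data: two small groups of points in the plane.
points = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
labels = spectral_bipartition(points, must_link=[(0, 1)])
```

Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping itself is meaningful.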