共查询到20条相似文献,搜索用时 15 毫秒
1.
《Pattern recognition》2014,47(2):820-832
A key issue of semi-supervised clustering is how to utilize the limited but informative pairwise constraints. In this paper, we propose a new graph-based constrained clustering algorithm, named SCRAWL. It is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertices (i.e., data points) into constrained and unconstrained ones, according to whether they are in the pairwise constraints. For every constrained vertex, its influence range, or the degrees of influence it exerts on the unconstrained vertices, is encapsulated in an intermediate structure called component. The edge set between each pair of components determines the affecting scope of the pairwise constraints. In the higher-level random walk, SCRAWL enforces the pairwise constraints on the components, so that the constraint influence can be propagated to the unconstrained edges. At last, we combine the cluster membership of all the components to obtain the cluster assignment for each vertex. The promising experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our method. 相似文献
2.
Semi-supervised graph clustering: a kernel approach 总被引:6,自引:0,他引:6
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally
given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are
designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that
a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean
distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining,
2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors
or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering
objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input
data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International
Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current
state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets. 相似文献
3.
4.
Spectral clustering: A semi-supervised approach 总被引:2,自引:0,他引:2
Recently, graph-based spectral clustering algorithms have been developing rapidly, which are proposed as discrete combinatorial optimization problems and approximately solved by relaxing them into tractable eigenvalue decomposition problems. In this paper, we first review the current existing spectral clustering algorithms in a unified-framework way and give a straightforward explanation about spectral clustering. We also present a novel model for generalizing the unsupervised spectral clustering to semi-supervised spectral clustering. Under this model, prior information given by some instance-level constraints can be generalized to space-level constraints. We find that (undirected) graph built on the enlarged prior information is more meaningful, hence the boundaries of the clusters are more correct. Experimental results based on toy data, real-world data and image segmentation demonstrate the advantages of the proposed model. 相似文献
5.
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts
is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering.
In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF,
users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether
they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the
data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moveover,
we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as
special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance
of SS-NMF for clustering.
相似文献
Ming DongEmail: |
6.
Relevant component analysis (RCA) is a recently proposed metric learning method for semi-supervised learning applications. It is a simple and efficient method that has been applied successfully to give impressive results. However, RCA can make use of supervisory information in the form of positive equivalence constraints only. In this paper, we propose an extension to RCA that allows both positive and negative equivalence constraints to be incorporated. Experimental results show that the extended RCA algorithm is effective. 相似文献
7.
Enliang HuAuthor Vitae Songcan ChenAuthor VitaeLishan QiaoAuthor Vitae 《Neurocomputing》2011,74(17):2725-2733
We introduce a kernel learning algorithm, called kernel propagation (KP), to learn a nonparametric kernel from a mixture of a few pairwise constraints and plentiful unlabeled samples. Specifically, KP consists of two stages: the first is to learn a small-sized sub-kernel matrix just restricted to the samples with constrains, and the second is to propagate this learned sub-kernel matrix into a large-sized full-kernel matrix over all samples. As an interesting fact, our approach exposes a natural connection between KP and label propagation (LP), that is, one LP can naturally induce its KP counterpart. Thus, we develop three KPs from the three typical LPs correspondingly. Following the idea in KP, we also naturally develop an out-of-sample extension to directly capture a kernel matrix for outside-training data without the need of relearning. The final experiments verify that our developments are more efficient, more error-tolerant and also comparably effective in comparison with the state-of-the-art algorithm. 相似文献
8.
Semi-supervised fuzzy clustering: A kernel-based approach 总被引:1,自引:0,他引:1
Semi-supervised clustering algorithms aim to improve the clustering accuracy under the supervisions of a limited amount of labeled data. Since kernel-based approaches, such as kernel-based fuzzy c-means algorithm (KFCM), have been successfully used in classification and clustering problems, in this paper, we propose a novel semi-supervised clustering approach using the kernel-based method based on KFCM and denote it the semi-supervised kernel fuzzy c-mean algorithm (SSKFCM). The objective function of SSKFCM is defined by adding classification errors of both the labeled and the unlabeled data, and its global optimum has been obtained through repeatedly updating the fuzzy memberships and the optimized kernel parameter. The objective function may have more than one local optimum, so we employ a function transformation technique to reformulate the objective function after a local minimum has been obtained, and select the best optimum as the solution to the objective function. Experimental results on both the artificial and several real data sets show SSKFCM performs better than its conventional counterparts and it achieves the best accurate clustering results when the parameter is optimized. 相似文献
9.
Shi Zhong 《Machine Learning》2006,65(1):3-29
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using
unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely,
semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when
the set of labels available in labeled data are not complete, i.e., unlabeled data contain new classes that are not present
in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled
model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised
clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments
on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised
clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach
excels when available labels are incomplete.
Editor: Andrew Moore 相似文献
10.
《Expert systems with applications》2014,41(5):2372-2378
Co-training is a good paradigm of semi-supervised, which requires the data set to be described by two views of features. There are a notable characteristic shared by many co-training algorithm: the selected unlabeled instances should be predicted with high confidence, since a high confidence score usually implies that the corresponding prediction is correct. Unfortunately, it is not always able to improve the classification performance with these high confidence unlabeled instances. In this paper, a new semi-supervised learning algorithm was proposed combining the benefits of both co-training and active learning. The algorithm applies co-training to select the most reliable instances according to the two criterions of high confidence and nearest neighbor for boosting the classifier, also exploit the most informative instances with human annotation for improve the classification performance. Experiments on several UCI data sets and natural language processing task, which demonstrate our method achieves more significant improvement for sacrificing the same amount of human effort. 相似文献
11.
Clustering analysis is to identify inherent structures and discover useful information from large amount of data. However, the decision makers may suffer insufficient understanding the nature of the data and do not know how to set the optimal parameters for the clustering method. To overcome the drawback above, this paper proposes a new entropy clustering method using adaptive learning. The proposed method considers the data spreading to determine the adaptive threshold within parameters optimized by adaptive learning. Four datasets in UCI database are used as the experimental data to compare the accuracy of the proposed method with the listing clustering methods. The experimental results indicate that the proposed method is superior to the listing methods. 相似文献
12.
13.
Matthew B. Blaschko Jacquelyn A. Shelton Christoph H. Lampert 《Pattern recognition letters》2011,32(11):1572-1583
Kernel canonical correlation analysis (KCCA) is a general technique for subspace learning that incorporates principal components analysis (PCA) and Fisher linear discriminant analysis (LDA) as special cases. By finding directions that maximize correlation, KCCA learns representations that are more closely tied to the underlying process that generates the data and can ignore high-variance noise directions. However, for data where acquisition in one or more modalities is expensive or otherwise limited, KCCA may suffer from small sample effects. We propose to use semi-supervised Laplacian regularization to utilize data that are present in only one modality. This approach is able to find highly correlated directions that also lie along the data manifold, resulting in a more robust estimate of correlated subspaces.Functional magnetic resonance imaging (fMRI) acquired data are naturally amenable to subspace techniques as data are well aligned. fMRI data of the human brain are a particularly interesting candidate. In this study we implemented various supervised and semi-supervised versions of KCCA on human fMRI data, with regression to single and multi-variate labels (corresponding to video content subjects viewed during the image acquisition). In each variate condition, the semi-supervised variants of KCCA performed better than the supervised variants, including a supervised variant with Laplacian regularization. We additionally analyze the weights learned by the regression in order to infer brain regions that are important to different types of visual processing. 相似文献
14.
针对网络环境,提出了一种新的半监督聚类入侵检测算法,将主动学习策略应用于半监督聚类过程中,利用少量的标记数据,生成用于初始化算法的种子聚类,通过辅助聚类过程,根据网络数据的特点,检测已知和未知攻击。主动学习策略查询网络中未标记数据与标记数据的约束关系,对标记数据可以快速获得k个不相交的非空近邻集,经检测结果证明,改进了算法的性能,且表明了算法的可行性及有效性。 相似文献
15.
Hong Chang 《Pattern recognition》2006,39(7):1253-1264
Many computer vision and pattern recognition algorithms are very sensitive to the choice of an appropriate distance metric. Some recent research sought to address a variant of the conventional clustering problem called semi-supervised clustering, which performs clustering in the presence of some background knowledge or supervisory information expressed as pairwise similarity or dissimilarity constraints. However, existing metric learning methods for semi-supervised clustering mostly perform global metric learning through a linear transformation. In this paper, we propose a new metric learning method that performs nonlinear transformation globally but linear transformation locally. In particular, we formulate the learning problem as an optimization problem and present three methods for solving it. Through some toy data sets, we show empirically that our locally linear metric adaptation (LLMA) method can handle some difficult cases that cannot be handled satisfactorily by previous methods. We also demonstrate the effectiveness of our method on some UCI data sets. Besides applying LLMA to semi-supervised clustering, we have also used it to improve the performance of content-based image retrieval systems through metric learning. Experimental results based on two real-world image databases show that LLMA significantly outperforms other methods in boosting the image retrieval performance. 相似文献
16.
Photo clustering is an effective way to organize albums and it is useful in many applications, such as photo browsing and tagging. But automatic photo clustering is not an easy task due to the large variation of photo content. In this paper, we propose an interactive photo clustering paradigm that jointly explores human and computer. In this paradigm, the photo clustering task is semi-automatically accomplished: users are allowed to manually adjust clustering results with different operations, such as splitting clusters, merging clusters and moving photos from one cluster to another. Behind users’ operations, we have a learning engine that keeps updating the distance measurements between photos in an online way, such that better clustering can be performed based on the distance measure. Experimental results on multiple photo albums demonstrated that our approach is able to improve automatic photo clustering results, and by exploring distance metric learning, our method is much more effective than pure manual adjustments of photo clustering. 相似文献
17.
This paper presents a fuzzy rough semi-supervised outlier detection (FRSSOD) approach with the help of some labeled samples and fuzzy rough C-means clustering. This method introduces an objective function, which minimizes the sum squared error of clustering results and the deviation from known labeled examples as well as the number of outliers. Each cluster is represented by a center, a crisp lower approximation and a fuzzy boundary by using fuzzy rough C-means clustering and only those points located in boundary can be further discussed the possibility to be reassigned as outliers. As a result, this method can obtain better clustering results for normal points and better accuracy for outlier detection. Experiment results show that the proposed method, on average, keep, or improve the detection precision and reduce false alarm rate as well as reduce the number of candidate outliers to be discussed. 相似文献
18.
World Wide Web - As a common technology in social network, clustering has attracted lots of research interest due to its high performance, and many clustering methods have been presented. The most... 相似文献
19.
Mahdieh Soleymani Baghshah Author Vitae Saeed Bagheri Shouraki Author Vitae 《Pattern recognition》2010,43(8):2982-2992
The problem of clustering with side information has received much recent attention and metric learning has been considered as a powerful approach to this problem. Until now, various metric learning methods have been proposed for semi-supervised clustering. Although some of the existing methods can use both positive (must-link) and negative (cannot-link) constraints, they are usually limited to learning a linear transformation (i.e., finding a global Mahalanobis metric). In this paper, we propose a framework for learning linear and non-linear transformations efficiently. We use both positive and negative constraints and also the intrinsic topological structure of data. We formulate our metric learning method as an appropriate optimization problem and find the global optimum of this problem. The proposed non-linear method can be considered as an efficient kernel learning method that yields an explicit non-linear transformation and thus shows out-of-sample generalization ability. Experimental results on synthetic and real-world data sets show the effectiveness of our metric learning method for semi-supervised clustering tasks. 相似文献
20.
In Gaussian mixture modeling, it is crucial to select the number of Gaussians for a sample set, which becomes much more difficult
when the overlap in the mixture is larger. Under regularization theory, we aim to solve this problem using a semi-supervised
learning algorithm through incorporating pairwise constraints into entropy regularized likelihood (ERL) learning which can
make automatic model selection for Gaussian mixture. The simulation experiments further demonstrate that the presented semi-supervised
learning algorithm (i.e., the constrained ERL learning algorithm) can automatically detect the number of Gaussians with a
good parameter estimation, even when two or more actual Gaussians in the mixture are overlapped at a high degree. Moreover,
the constrained ERL learning algorithm leads to some promising results when applied to iris data classification and image
database categorization. 相似文献