共查询到20条相似文献,搜索用时 0 毫秒
1.
《Pattern recognition》2014,47(2):820-832
A key issue of semi-supervised clustering is how to utilize the limited but informative pairwise constraints. In this paper, we propose a new graph-based constrained clustering algorithm, named SCRAWL. It is composed of two random walks with different granularities. In the lower-level random walk, SCRAWL partitions the vertices (i.e., data points) into constrained and unconstrained ones, according to whether they are in the pairwise constraints. For every constrained vertex, its influence range, or the degrees of influence it exerts on the unconstrained vertices, is encapsulated in an intermediate structure called component. The edge set between each pair of components determines the affecting scope of the pairwise constraints. In the higher-level random walk, SCRAWL enforces the pairwise constraints on the components, so that the constraint influence can be propagated to the unconstrained edges. At last, we combine the cluster membership of all the components to obtain the cluster assignment for each vertex. The promising experimental results on both synthetic and real-world data sets demonstrate the effectiveness of our method. 相似文献
2.
Semi-supervised graph clustering: a kernel approach 总被引:6,自引:0,他引:6
Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally
given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are
designed for data represented as vectors. In this paper, we unify vector-based and graph-based approaches. We first show that
a recently-proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean
distance and a certain class of constraint penalty functions, can be expressed as a special case of the weighted kernel k-means objective (Dhillon et al., in Proceedings of the 10th International Conference on Knowledge Discovery and Data Mining,
2004a). A recent theoretical connection between weighted kernel k-means and several graph clustering objectives enables us to perform semi-supervised clustering of data given either as vectors
or as a graph. For graph data, this result leads to algorithms for optimizing several new semi-supervised graph clustering
objectives. For vector data, the kernel approach also enables us to find clusters with non-linear boundaries in the input
data space. Furthermore, we show that recent work on spectral learning (Kamvar et al., in Proceedings of the 17th International
Joint Conference on Artificial Intelligence, 2003) may be viewed as a special case of our formulation. We empirically show that our algorithm is able to outperform current
state-of-the-art semi-supervised algorithms on both vector-based and graph-based data sets. 相似文献
3.
4.
Most existing semi-supervised clustering algorithms are not designed for handling high-dimensional data. On the other hand, semi-supervised dimensionality reduction methods may not necessarily improve the clustering performance, due to the fact that the inherent relationship between subspace selection and clustering is ignored. In order to mitigate the above problems, we present a semi-supervised clustering algorithm using adaptive distance metric learning (SCADM) which performs semi-supervised clustering and distance metric learning simultaneously. SCADM applies the clustering results to learn a distance metric and then projects the data onto a low-dimensional space where the separability of the data is maximized. Experimental results on real-world data sets show that the proposed method can effectively deal with high-dimensional data and provides an appealing clustering performance. 相似文献
5.
Spectral clustering: A semi-supervised approach 总被引:2,自引:0,他引:2
Recently, graph-based spectral clustering algorithms have been developing rapidly, which are proposed as discrete combinatorial optimization problems and approximately solved by relaxing them into tractable eigenvalue decomposition problems. In this paper, we first review the current existing spectral clustering algorithms in a unified-framework way and give a straightforward explanation about spectral clustering. We also present a novel model for generalizing the unsupervised spectral clustering to semi-supervised spectral clustering. Under this model, prior information given by some instance-level constraints can be generalized to space-level constraints. We find that (undirected) graph built on the enlarged prior information is more meaningful, hence the boundaries of the clusters are more correct. Experimental results based on toy data, real-world data and image segmentation demonstrate the advantages of the proposed model. 相似文献
6.
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts
is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering.
In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF,
users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether
they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the
data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moveover,
we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as
special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance
of SS-NMF for clustering.
相似文献
Ming DongEmail: |
7.
提出一种选择最富信息数据并予以标记的基于主动学习策略的半监督聚类算法。首先, 采用传统K-均值聚类算法对数据集进行粗聚类; 其次, 根据粗聚类结果计算出每个数据隶属于每个类簇的隶属度, 筛选出满足最大与次大隶属度差值小于阈值的候选数据, 并从中选择差值较小的数据作为最富信息的数据进行标记; 最后, 将候选数据集合中未标记数据分组到与每类已被标记数据平均距离最小的类簇中。实验表明, 提出的主动学习策略能够很好地学习到最富信息数据, 基于该学习策略的半监督聚类算法在测试不同数据集时均获得了较高的准确率。 相似文献
8.
现有的半监督聚类集成方法能利用先验信息,使集成的准确性、鲁棒性和稳定性得到提高,但在集成阶段加入成对约束信息时,只考虑了给定的约束信息而忽视了约束点与被约束点的邻域点之间的关系.针对此问题,提出了一种基于数据相关性的半监督模糊聚类集成方法.该方法首先利用半监督模糊聚类算法建立集成信息矩阵,并将其转换为相似性矩阵;然后,利用已知的约束信息及约束点与被约束点的邻域点之间的关系来修改相似性矩阵;最后,利用图划分算法得到最终的聚类结果.真实数据上的实验结果表明,提出的方法可以有效提高聚类质量. 相似文献
9.
Enliang HuAuthor Vitae Songcan ChenAuthor VitaeLishan QiaoAuthor Vitae 《Neurocomputing》2011,74(17):2725-2733
We introduce a kernel learning algorithm, called kernel propagation (KP), to learn a nonparametric kernel from a mixture of a few pairwise constraints and plentiful unlabeled samples. Specifically, KP consists of two stages: the first is to learn a small-sized sub-kernel matrix just restricted to the samples with constrains, and the second is to propagate this learned sub-kernel matrix into a large-sized full-kernel matrix over all samples. As an interesting fact, our approach exposes a natural connection between KP and label propagation (LP), that is, one LP can naturally induce its KP counterpart. Thus, we develop three KPs from the three typical LPs correspondingly. Following the idea in KP, we also naturally develop an out-of-sample extension to directly capture a kernel matrix for outside-training data without the need of relearning. The final experiments verify that our developments are more efficient, more error-tolerant and also comparably effective in comparison with the state-of-the-art algorithm. 相似文献
10.
Relevant component analysis (RCA) is a recently proposed metric learning method for semi-supervised learning applications. It is a simple and efficient method that has been applied successfully to give impressive results. However, RCA can make use of supervisory information in the form of positive equivalence constraints only. In this paper, we propose an extension to RCA that allows both positive and negative equivalence constraints to be incorporated. Experimental results show that the extended RCA algorithm is effective. 相似文献
11.
基于迭代框架的主动半监督聚类框架(IASSCF)是一个流行的半监督聚类框架。该框架存在两个问题:其一,初始先验信息较少导致迭代初期聚类效果不佳,进而影响后续聚类结果;其二,每次迭代只选择信息量最大的一个样本标记,导致运行速度慢、性能提升慢。针对这两个问题,设计了一种基于主动学习先验的半监督K-means聚类算法。该方法包含初始化阶段和迭代阶段。初始化阶段主动选择代表性较高的节点集合,并基于代表节点集合构建各类的先验节点集合和约束先验集合。迭代阶段,每次迭代包含三步:1)基于当前约束先验集合,利用约束半监督聚类算法PCK-means对数据进行聚类;2)依据当前聚类结果,主动选择每个簇中最具价值信息的未标注样本点;3)利用选择样本点扩充先验节点集合及约束集合。迭代此过程至达到收敛阈值。实验结果表明,与基于原IASSCF框架的半监督K-means聚类算法相比,所提算法运行速度更快,性能更优。 相似文献
12.
Semi-supervised fuzzy clustering: A kernel-based approach 总被引:1,自引:0,他引:1
Semi-supervised clustering algorithms aim to improve the clustering accuracy under the supervisions of a limited amount of labeled data. Since kernel-based approaches, such as kernel-based fuzzy c-means algorithm (KFCM), have been successfully used in classification and clustering problems, in this paper, we propose a novel semi-supervised clustering approach using the kernel-based method based on KFCM and denote it the semi-supervised kernel fuzzy c-mean algorithm (SSKFCM). The objective function of SSKFCM is defined by adding classification errors of both the labeled and the unlabeled data, and its global optimum has been obtained through repeatedly updating the fuzzy memberships and the optimized kernel parameter. The objective function may have more than one local optimum, so we employ a function transformation technique to reformulate the objective function after a local minimum has been obtained, and select the best optimum as the solution to the objective function. Experimental results on both the artificial and several real data sets show SSKFCM performs better than its conventional counterparts and it achieves the best accurate clustering results when the parameter is optimized. 相似文献
13.
Shi Zhong 《Machine Learning》2006,65(1):3-29
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using
unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely,
semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when
the set of labels available in labeled data are not complete, i.e., unlabeled data contain new classes that are not present
in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled
model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised
clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments
on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised
clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach
excels when available labels are incomplete.
Editor: Andrew Moore 相似文献
14.
《Expert systems with applications》2014,41(5):2372-2378
Co-training is a good paradigm of semi-supervised, which requires the data set to be described by two views of features. There are a notable characteristic shared by many co-training algorithm: the selected unlabeled instances should be predicted with high confidence, since a high confidence score usually implies that the corresponding prediction is correct. Unfortunately, it is not always able to improve the classification performance with these high confidence unlabeled instances. In this paper, a new semi-supervised learning algorithm was proposed combining the benefits of both co-training and active learning. The algorithm applies co-training to select the most reliable instances according to the two criterions of high confidence and nearest neighbor for boosting the classifier, also exploit the most informative instances with human annotation for improve the classification performance. Experiments on several UCI data sets and natural language processing task, which demonstrate our method achieves more significant improvement for sacrificing the same amount of human effort. 相似文献
15.
16.
Clustering analysis is to identify inherent structures and discover useful information from large amount of data. However, the decision makers may suffer insufficient understanding the nature of the data and do not know how to set the optimal parameters for the clustering method. To overcome the drawback above, this paper proposes a new entropy clustering method using adaptive learning. The proposed method considers the data spreading to determine the adaptive threshold within parameters optimized by adaptive learning. Four datasets in UCI database are used as the experimental data to compare the accuracy of the proposed method with the listing clustering methods. The experimental results indicate that the proposed method is superior to the listing methods. 相似文献
17.
Matthew B. Blaschko Jacquelyn A. Shelton Christoph H. Lampert 《Pattern recognition letters》2011,32(11):1572-1583
Kernel canonical correlation analysis (KCCA) is a general technique for subspace learning that incorporates principal components analysis (PCA) and Fisher linear discriminant analysis (LDA) as special cases. By finding directions that maximize correlation, KCCA learns representations that are more closely tied to the underlying process that generates the data and can ignore high-variance noise directions. However, for data where acquisition in one or more modalities is expensive or otherwise limited, KCCA may suffer from small sample effects. We propose to use semi-supervised Laplacian regularization to utilize data that are present in only one modality. This approach is able to find highly correlated directions that also lie along the data manifold, resulting in a more robust estimate of correlated subspaces.Functional magnetic resonance imaging (fMRI) acquired data are naturally amenable to subspace techniques as data are well aligned. fMRI data of the human brain are a particularly interesting candidate. In this study we implemented various supervised and semi-supervised versions of KCCA on human fMRI data, with regression to single and multi-variate labels (corresponding to video content subjects viewed during the image acquisition). In each variate condition, the semi-supervised variants of KCCA performed better than the supervised variants, including a supervised variant with Laplacian regularization. We additionally analyze the weights learned by the regression in order to infer brain regions that are important to different types of visual processing. 相似文献
18.
针对网络环境,提出了一种新的半监督聚类入侵检测算法,将主动学习策略应用于半监督聚类过程中,利用少量的标记数据,生成用于初始化算法的种子聚类,通过辅助聚类过程,根据网络数据的特点,检测已知和未知攻击。主动学习策略查询网络中未标记数据与标记数据的约束关系,对标记数据可以快速获得k个不相交的非空近邻集,经检测结果证明,改进了算法的性能,且表明了算法的可行性及有效性。 相似文献
19.
Hong Chang 《Pattern recognition》2006,39(7):1253-1264
Many computer vision and pattern recognition algorithms are very sensitive to the choice of an appropriate distance metric. Some recent research sought to address a variant of the conventional clustering problem called semi-supervised clustering, which performs clustering in the presence of some background knowledge or supervisory information expressed as pairwise similarity or dissimilarity constraints. However, existing metric learning methods for semi-supervised clustering mostly perform global metric learning through a linear transformation. In this paper, we propose a new metric learning method that performs nonlinear transformation globally but linear transformation locally. In particular, we formulate the learning problem as an optimization problem and present three methods for solving it. Through some toy data sets, we show empirically that our locally linear metric adaptation (LLMA) method can handle some difficult cases that cannot be handled satisfactorily by previous methods. We also demonstrate the effectiveness of our method on some UCI data sets. Besides applying LLMA to semi-supervised clustering, we have also used it to improve the performance of content-based image retrieval systems through metric learning. Experimental results based on two real-world image databases show that LLMA significantly outperforms other methods in boosting the image retrieval performance. 相似文献
20.
Photo clustering is an effective way to organize albums and it is useful in many applications, such as photo browsing and tagging. But automatic photo clustering is not an easy task due to the large variation of photo content. In this paper, we propose an interactive photo clustering paradigm that jointly explores human and computer. In this paradigm, the photo clustering task is semi-automatically accomplished: users are allowed to manually adjust clustering results with different operations, such as splitting clusters, merging clusters and moving photos from one cluster to another. Behind users’ operations, we have a learning engine that keeps updating the distance measurements between photos in an online way, such that better clustering can be performed based on the distance measure. Experimental results on multiple photo albums demonstrated that our approach is able to improve automatic photo clustering results, and by exploring distance metric learning, our method is much more effective than pure manual adjustments of photo clustering. 相似文献