首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
成对约束的属性加权半监督模糊核聚类算法   总被引:1,自引:0,他引:1  
在机器学习和数据挖掘中,带约束的半监督聚类是一个活跃的研究领域。为了利用约束条件获得表现更优异的聚类效果,提出了一种成对约束的属性加权半监督聚类算法,该方法充分考虑了属性间的不平衡性,在传统模糊聚类算法中融合半监督学习机制并通过Mercer核把原始的观察空间映射到高维特征空间。实验结果表明,该算法优于相似的成对约束的竞争群算法(PCCA)。  相似文献   

2.
Most existing semi-supervised clustering algorithms are not designed for handling high-dimensional data. On the other hand, semi-supervised dimensionality reduction methods may not necessarily improve the clustering performance, due to the fact that the inherent relationship between subspace selection and clustering is ignored. In order to mitigate the above problems, we present a semi-supervised clustering algorithm using adaptive distance metric learning (SCADM) which performs semi-supervised clustering and distance metric learning simultaneously. SCADM applies the clustering results to learn a distance metric and then projects the data onto a low-dimensional space where the separability of the data is maximized. Experimental results on real-world data sets show that the proposed method can effectively deal with high-dimensional data and provides an appealing clustering performance.  相似文献   

3.
Although there have been many researches on cluster analysis considering feature (or variable) weights, little effort has been made regarding sample weights in clustering. In practice, not every sample in a data set has the same importance in cluster analysis. Therefore, it is interesting to obtain the proper sample weights for clustering a data set. In this paper, we consider a probability distribution over a data set to represent its sample weights. We then apply the maximum entropy principle to automatically compute these sample weights for clustering. Such method can generate the sample-weighted versions of most clustering algorithms, such as k-means, fuzzy c-means (FCM) and expectation & maximization (EM), etc. The proposed sample-weighted clustering algorithms will be robust for data sets with noise and outliers. Furthermore, we also analyze the convergence properties of the proposed algorithms. This study also uses some numerical data and real data sets for demonstration and comparison. Experimental results and comparisons actually demonstrate that the proposed sample-weighted clustering algorithms are effective and robust clustering methods.  相似文献   

4.
本文针对超图切割上的半监督学习和聚类算法进行了研究;首先,通过对超图切割和超边展开法及其切割函数的讨论,引入了超图上的总变异作为超图切割的洛瓦兹扩展,并在此基础上提出了一组正则化函数,它对应于图上的拉普拉斯型正则化;然后,基于正则化函数族提出了半监督学习方法,并基于平衡超图切割提出了谱聚类方法;为了求解这两个学习问题,将它们转化为求解凸优化问题,并为此提出了一种主要组成部分为近端映射的可扩展算法,从而实现半监督学习和聚类;仿真实验结果表明,本文提出的基于超图切割实现的半监督学习和聚类方法相比于经典的超边展开法和其他图切割方法有更好的标准偏差和聚类误差性能。  相似文献   

5.
The problem of clustering with side information has received much recent attention and metric learning has been considered as a powerful approach to this problem. Until now, various metric learning methods have been proposed for semi-supervised clustering. Although some of the existing methods can use both positive (must-link) and negative (cannot-link) constraints, they are usually limited to learning a linear transformation (i.e., finding a global Mahalanobis metric). In this paper, we propose a framework for learning linear and non-linear transformations efficiently. We use both positive and negative constraints and also the intrinsic topological structure of data. We formulate our metric learning method as an appropriate optimization problem and find the global optimum of this problem. The proposed non-linear method can be considered as an efficient kernel learning method that yields an explicit non-linear transformation and thus shows out-of-sample generalization ability. Experimental results on synthetic and real-world data sets show the effectiveness of our metric learning method for semi-supervised clustering tasks.  相似文献   

6.
Semi-supervised clustering exploits a small quantity of supervised information to improve the accuracy of data clustering. In this paper, a framework for semi-supervised clustering is proposed. This framework is capable of integrating with a traditional clustering algorithm seamlessly, and particularly useful for the application where a traditional clustering is designated to use.In the proposed framework, discriminative random fields (DRFs) are employed to model the consistency between the result of a traditional clustering algorithm and the supervised information with the assumption of semi-supervised learning. The semi-supervised clustering problem is thus formulated as finding the label configuration with the maximum a posteriori (MAP) probability of the DRF. A procedure based on the iterated conditional modes algorithm and a metric-learning algorithm is developed to find a suboptimal MAP solution of the DRF. The proposed approach has been tested against various data sets. Experimental results demonstrate that our approach can enhance the clustering accuracy, and thus prove the feasibility of the proposed approach.  相似文献   

7.
一种结合主动学习的半监督文档聚类算法   总被引:1,自引:0,他引:1  
半监督文档聚类,即利用少量具有监督信息的数据来辅助无监督文档聚类,近几年来逐渐成为机器学习和数据挖掘领域研究的热点问题.由于获取大量监督信息费时费力,因此,国内外学者考虑如何获得少量但对聚类性能提高显著的监督信息.提出一种结合主动学习的半监督文档聚类算法,通过引入成对约束信息指导DBSCAN的聚类过程来提高聚类性能,得到一种半监督文档聚类算法Cons-DBSCAN.通过对约束集中所含信息量的衡量和对DBSCAN算法本身的分析,提出了一种启发式的主动学习算法,能够选取含信息量大的成对约束集,从而能够更高效地辅助半监督文档聚类.实验结果表明,所提出的算法能够高效地进行文档聚类.通过主动学习算法获得的成对约束集,能够显著地提高聚类性能.并且,算法的性能优于两个代表性的结合主动学习的半监督聚类算法.  相似文献   

8.
Semi-supervised model-based document clustering: A comparative study   总被引:4,自引:0,他引:4  
Semi-supervised learning has become an attractive methodology for improving classification models and is often viewed as using unlabeled data to aid supervised learning. However, it can also be viewed as using labeled data to help clustering, namely, semi-supervised clustering. Viewing semi-supervised learning from a clustering angle is useful in practical situations when the set of labels available in labeled data are not complete, i.e., unlabeled data contain new classes that are not present in labeled data. This paper analyzes several multinomial model-based semi-supervised document clustering methods under a principled model-based clustering framework. The framework naturally leads to a deterministic annealing extension of existing semi-supervised clustering approaches. We compare three (slightly) different semi-supervised approaches for clustering documents: Seeded damnl, Constrained damnl, and Feedback-based damnl, where damnl stands for multinomial model-based deterministic annealing algorithm. The first two are extensions of the seeded k-means and constrained k-means algorithms studied by Basu et al. (2002); the last one is motivated by Cohn et al. (2003). Through empirical experiments on text datasets, we show that: (a) deterministic annealing can often significantly improve the performance of semi-supervised clustering; (b) the constrained approach is the best when available labels are complete whereas the feedback-based approach excels when available labels are incomplete. Editor: Andrew Moore  相似文献   

9.
Image retrieval based on augmented relational graph representation   总被引:1,自引:1,他引:0  
The “semantic gap” problem is one of the main difficulties in image retrieval tasks. Semi-supervised learning, typically integrated with the relevance feedback techniques, is an effective method to narrow down the semantic gap. However, in semi-supervised learning, the amount of unlabeled data is usually much greater than that of labeled data. Therefore, the performance of a semi-supervised learning algorithm relies heavily on its effectiveness of using the relationships between the labeled and unlabeled data. This paper proposes a novel algorithm to better explore those relationships by augmenting the relational graph representation built on the entire data set, expected to increase the intra-class weights while decreasing the inter-class weights and linking the potential intra-class data. The augmented relational matrix can be directly used in any semi-supervised learning algorithms. The experimental results in a range of feedback-based image retrieval tasks show that the proposed algorithm not only achieves good generality, but also outperforms other algorithms in the same semi-supervised learning framework.  相似文献   

10.
How to organize and retrieve images is now a great challenge in various domains. Image clustering is a key tool in some practical applications including image retrieval and understanding. Traditional image clustering algorithms consider a single set of features and use ad hoc distance functions, such as Euclidean distance, to measure the similarity between samples. However, multi-modal features can be extracted from images. The dimension of multi-modal data is very high. In addition, we usually have several, but not many labeled images, which lead to semi-supervised learning. In this paper, we propose a framework of image clustering based on semi-supervised distance learning and multi-modal information. First we fuse multiple features and utilize a small amount of labeled images for semi-supervised metric learning. Then we compute similarity with the Gaussian similarity function and the learned metric. Finally, we construct a semi-supervised Laplace matrix for spectral clustering and propose an effective clustering method. Extensive experiments on some image data sets show the competent performance of the proposed algorithm.  相似文献   

11.
Hierarchical clustering of mixed data based on distance hierarchy   总被引:1,自引:0,他引:1  
Data clustering is an important data mining technique which partitions data according to some similarity criterion. Abundant algorithms have been proposed for clustering numerical data and some recent research tackles the problem of clustering categorical or mixed data. Unlike the subtraction scheme used for numerical attributes, there is no standard for measuring distance between categorical values. In this article, we propose a distance representation scheme, distance hierarchy, which facilitates expressing the similarity between categorical values and also unifies distance measuring of numerical and categorical values. We then apply the scheme to mixed data clustering, in particular, to integrate with a hierarchical clustering algorithm. Consequently, this integrated approach can uniformly handle numerical data and categorical data, and also enables one to take the similarity between categorical values into consideration. Experimental results show that the proposed approach produces better clustering results than conventional clustering algorithms when categorical attributes are present and their values have different degree of similarity.  相似文献   

12.
李延超  肖甫  陈志  李博 《软件学报》2020,31(12):3808-3822
主动学习从大量无标记样本中挑选样本交给专家标记.现有的批抽样主动学习算法主要受3个限制:(1)一些主动学习方法基于单选择准则或对数据、模型设定假设,这类方法很难找到既有不确定性又有代表性的未标记样本;(2)现有批抽样主动学习方法的性能很大程度上依赖于样本之间相似性度量的准确性,例如预定义函数或差异性衡量;(3)噪声标签问题一直影响批抽样主动学习算法的性能.提出一种基于深度学习批抽样的主动学习方法.通过深度神经网络生成标记和未标记样本的学习表示和采用标签循环模式,使得标记样本与未标记样本建立联系,再回到相同标签的标记样本.这样同时考虑了样本的不确定性和代表性,并且算法对噪声标签具有鲁棒性.在提出的批抽样主动学习方法中,算法使用的子模块函数确保选择的样本集合具有多样性.此外,自适应参数的优化,使得主动学习算法可以自动平衡样本的不确定性和代表性.将提出的主动学习方法应用到半监督分类和半监督聚类中,实验结果表明,所提出的主动学习方法的性能优于现有的一些先进的方法.  相似文献   

13.
谱聚类是基于谱图划分理论的一种聚类算法,传统的谱聚类算法属于无监督学习算法,只能利用单一数据来进行聚类。针对这种情况,提出一种基于密度自适应邻域相似图的半监督谱聚类(DAN-SSC)算法。DAN-SSC算法在传统谱聚类算法的基础上结合了半监督学习的思想,很好地解决了传统谱聚类算法无法充分利用所有数据,不得不对一些有标签数据进行舍弃的问题;将少量的成对约束先验信息扩散至整个空间,使其能更好地对聚类过程进行指导。实验结果表明,DAN-SSC算法具有可行性和有效性。  相似文献   

14.
In this paper, we develop a semi-supervised regression algorithm to analyze data sets which contain both categorical and numerical attributes. This algorithm partitions the data sets into several clusters and at the same time fits a multivariate regression model to each cluster. This framework allows one to incorporate both multivariate regression models for numerical variables (supervised learning methods) and k-mode clustering algorithms for categorical variables (unsupervised learning methods). The estimates of regression models and k-mode parameters can be obtained simultaneously by minimizing a function which is the weighted sum of the least-square errors in the multivariate regression models and the dissimilarity measures among the categorical variables. Both synthetic and real data sets are presented to demonstrate the effectiveness of the proposed method.  相似文献   

15.
双层随机游走半监督聚类   总被引:3,自引:0,他引:3  
何萍  徐晓华  陆林  陈崚 《软件学报》2014,25(5):997-1013
半监督聚类旨在根据用户给出的必连和不连约束,把所有数据点划分到不同的簇中,从而获得更准确、更加符合用户要求的聚类结果.目前的半监督聚类算法大多数通过修改已有的聚类算法或者结合度规学习,使聚类结果与点对约束尽可能地保持一致,却很少考虑点对约束对周围无约束数据的显式影响程度.提出一种由在顶点上的低层随机游走和在组件上的高层随机游走两部分构成的双层随机游走半监督聚类算法,其中,低层随机游走主要负责计算选出的约束顶点对其他顶点的影响范围和影响程度,称为组件;高层随机游走则进一步将各个点对约束以自适应的强度在组件上进行约束传播,把它们在每个顶点上的影响综合在一个簇指示矩阵中.UCI数据集和大型真实数据集上的实验结果表明,双层随机游走半监督聚类算法比其他半监督聚类算法更准确,也比较高效.  相似文献   

16.
Most existing representative works in semi-supervised clustering do not sufficiently solve the violation problem of pairwise constraints. On the other hand, traditional kernel methods for semi-supervised clustering not only face the problem of manually tuning the kernel parameters due to the fact that no sufficient supervision is provided, but also lack a measure that achieves better effectiveness of clustering. In this paper, we propose an adaptive Semi-supervised Clustering Kernel Method based on Metric learning (SCKMM) to mitigate the above problems. Specifically, we first construct an objective function from pairwise constraints to automatically estimate the parameter of the Gaussian kernel. Then, we use pairwise constraint-based K-means approach to solve the violation issue of constraints and to cluster the data. Furthermore, we introduce metric learning into nonlinear semi-supervised clustering to improve separability of the data for clustering. Finally, we perform clustering and metric learning simultaneously. Experimental results on a number of real-world data sets validate the effectiveness of the proposed method.  相似文献   

17.
In this paper, we introduce new algorithms that perform clustering and feature weighting simultaneously and in an unsupervised manner. The proposed algorithms are computationally and implementationally simple, and learn a different set of feature weights for each identified cluster. The cluster dependent feature weights offer two advantages. First, they guide the clustering process to partition the data set into more meaningful clusters. Second, they can be used in the subsequent steps of a learning system to improve its learning behavior. An extension of the algorithm to deal with an unknown number of clusters is also proposed. The extension is based on competitive agglomeration, whereby the number of clusters is over-specified, and adjacent clusters are allowed to compete for data points in a manner that causes clusters which lose in the competition to gradually become depleted and vanish. We illustrate the performance of the proposed approach by using it to segment color images, and to build a nearest prototype classifier.  相似文献   

18.
王亮  王士同 《计算机工程》2012,38(1):148-150
针对样本间的不均衡性,提出一种基于成对约束的动态加权半监督模糊核聚类算法。在传统模糊聚类算法中加入半监督学习机制,通过Mercer核将原数据空间映射到特征空间,为特征空间中的每个向量分配一个动态权值,由此得到新的目标函数,并结合一种简单的核参数选择方法实现数据分类。理论分析和实验结果表明,与模糊核聚类算法及成对约束的竞争群算法相比,该算法具有更好的聚类效果。  相似文献   

19.
Semi-supervised learning has attracted a significant amount of attention in pattern recognition and machine learning. Most previous studies have focused on designing special algorithms to effectively exploit the unlabeled data in conjunction with labeled data. Our goal is to improve the classification accuracy of any given supervised learning algorithm by using the available unlabeled examples. We call this as the Semi-supervised improvement problem, to distinguish the proposed approach from the existing approaches. We design a metasemi-supervised learning algorithm that wraps around the underlying supervised algorithm and improves its performance using unlabeled data. This problem is particularly important when we need to train a supervised learning algorithm with a limited number of labeled examples and a multitude of unlabeled examples. We present a boosting framework for semi-supervised learning, termed as SemiBoost. The key advantages of the proposed semi-supervised learning approach are: 1) performance improvement of any supervised learning algorithm with a multitude of unlabeled data, 2) efficient computation by the iterative boosting algorithm, and 3) exploiting both manifold and cluster assumption in training classification models. An empirical study on 16 different data sets and text categorization demonstrates that the proposed framework improves the performance of several commonly used supervised learning algorithms, given a large number of unlabeled examples. We also show that the performance of the proposed algorithm, SemiBoost, is comparable to the state-of-the-art semi-supervised learning algorithms.  相似文献   

20.
将监督信息引入到聚类算法中去,在先前提出的鲁棒联机聚类算法(ROC)的基础上,通过引入以样本类标号形式给出的监督信息,提出了一种半监督的鲁棒联机聚类算法(Semi-ROC).在算法的聚类精度和鲁棒性能上,算法Semi-ROC比ROC和AddC有着更好的性能,在人工数据集和UCI标准数据集上的实验结果表明,Semi-ROC能有效地利用少量的监督信息来提高算法的聚类性能,得到较优的结果.另外,在添加噪声的情况下,算法Semi-ROC比原始的联机聚类算法AddC和ROC都更加鲁棒.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号