Similar literature
1.
Semi-supervised fuzzy co-clustering algorithm for document categorization   Total citations: 1, self-citations: 1, citations by others: 0
In this paper, we propose a new semi-supervised fuzzy co-clustering algorithm called SS-FCC for categorization of large web documents. In this new approach, the clustering process is carried out by incorporating some prior domain knowledge of a dataset, in the form of pairwise constraints provided by users, into the fuzzy co-clustering framework. Each constraint specifies whether a pair of objects “must” or “cannot” be clustered together. With the help of those constraints, the clustering problem is formulated as the problem of maximizing a competitive agglomeration cost function with fuzzy terms, taking into account the provided domain knowledge. The update rules for fuzzy memberships are derived, and an iterative algorithm is designed for the soft co-clustering process. Our experimental studies show that the quality of clustering results can be improved significantly with the proposed approach. Simulations on 10 large benchmark datasets demonstrate the strength and potential of SS-FCC in terms of performance evaluation criteria, stability and operating time, compared with some of the existing semi-supervised algorithms.
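The “must” and “cannot” pairwise constraints described above can be illustrated with a minimal sketch. This is not the SS-FCC cost function, which the abstract does not spell out; it only assumes a hard cluster labeling and counts how many constraints that labeling violates. The function name `count_violations` and the sample data are hypothetical.

```python
def count_violations(labels, must_link, cannot_link):
    """Count the pairwise constraints violated by a hard labeling.

    labels      -- list where labels[i] is the cluster of object i
    must_link   -- pairs (i, j) that should share a cluster
    cannot_link -- pairs (i, j) that should be in different clusters
    """
    violated = 0
    for i, j in must_link:
        if labels[i] != labels[j]:
            violated += 1
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            violated += 1
    return violated


# Hypothetical example: four objects assigned to two clusters.
labels = [0, 0, 1, 1]
must_link = [(0, 1), (1, 2)]
cannot_link = [(0, 3), (2, 3)]
print(count_violations(labels, must_link, cannot_link))  # prints 2
```

A count of this kind is one common way to score a candidate partition against user-supplied constraints; fuzzy methods typically replace the hard comparison with a term built from membership degrees.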

2.
3.
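A semi-supervised document clustering algorithm combined with active learning   Total citations: 1, self-citations: 0, citations by others: 1
Semi-supervised document clustering, which uses a small amount of supervised data to assist unsupervised document clustering, has in recent years become a hot research topic in machine learning and data mining. Because obtaining large amounts of supervised information is time-consuming and laborious, researchers in China and abroad have considered how to obtain a small amount of supervised information that nevertheless improves clustering performance significantly. This paper proposes a semi-supervised document clustering algorithm combined with active learning. Pairwise constraint information is introduced to guide the clustering process of DBSCAN and improve clustering performance, yielding the semi-supervised document clustering algorithm Cons-DBSCAN. By measuring the amount of information contained in the constraint set and analyzing the DBSCAN algorithm itself, a heuristic active learning algorithm is proposed that selects highly informative sets of pairwise constraints and so assists semi-supervised document clustering more efficiently. Experimental results show that the proposed algorithm clusters documents efficiently and that the pairwise constraint sets obtained by the active learning algorithm improve clustering performance significantly. Moreover, the algorithm outperforms two representative semi-supervised clustering algorithms that incorporate active learning.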

4.
Semi-supervised document clustering, which takes into account limited supervised data to group unlabeled documents into clusters, has received significant interest recently. Because obtaining supervised data may be expensive, it is important to acquire the most informative knowledge to improve the clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to improve clustering performance. The semi-supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for obtaining user feedback. Experimental results show that Cons-DBSCAN with our proposed active learning approach can improve the clustering performance significantly when given a relatively small number of constraints.

5.
Many of the real-world clustering problems arising in data mining applications are heterogeneous in nature. Heterogeneous co-clustering involves simultaneous clustering of objects of two or more data types. While pairwise co-clustering of two data types has been well studied in the literature, research on high-order heterogeneous co-clustering is still limited. In this paper, we propose a graph theoretical framework for addressing star-structured co-clustering problems in which a central data type is connected to all the other data types. Partitioning this graph leads to co-clustering of all the data types under the constraints of the star structure. Although the graph partitioning approach has been adopted before to address star-structured heterogeneous complex problems, the main contribution of this work lies in an efficient algorithm that we propose for partitioning the star-structured graph. Computationally, our algorithm is very quick as it requires a simple solution to a sparse system of overdetermined linear equations. Theoretical analysis and extensive experiments performed on toy and real datasets demonstrate the quality, efficiency and stability of the proposed algorithm.

6.
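Density-sensitive semi-supervised spectral clustering   Total citations: 27, self-citations: 0, citations by others: 27
Wang Ling (王玲), Bo Liefeng (薄列峰), Jiao Licheng (焦李成). Journal of Software (《软件学报》), 2007, 18(10): 2412-2422
Clustering is usually regarded as an unsupervised data analysis method, yet in practical problems limited prior information about the samples, such as pairwise constraint information, can easily be obtained. A large body of research shows that making full use of prior information during the clustering search significantly improves the performance of clustering algorithms. This paper first analyzes the shortcomings of using only pairwise constraint information in clustering, then explores a kind of prior information inherent in the data set itself, the space-consistency prior, and proposes a concrete method for exploiting it. Both kinds of prior information are then introduced into the classical spectral clustering algorithm, giving a density-sensitive semi-supervised spectral clustering algorithm (DS-SSC). The two kinds of prior information complement each other in guiding the clustering search, so that DS-SSC performs significantly better than clustering algorithms that use only pairwise constraint information. Experimental results on UCI benchmark data sets, the USPS handwritten digit set, and TREC text data sets confirm this.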

7.
We provide an overall framework for learning in search-based systems that are used to find optimum solutions to problems. This framework assumes that prior knowledge is available in the form of one or more heuristic functions (or features) of the problem domain. An appropriate clustering strategy is used to partition the state space into a number of classes based on the available features. The number of classes formed will depend on the resource constraints of the system. In the training phase, example problems are run using a standard admissible search algorithm. In this phase, heuristic information corresponding to each class is learned. This new information can be used in the problem-solving phase by appropriate search algorithms so that subsequent problem instances can be solved more efficiently. In this framework, we also show that heuristic information of forms other than the conventional single-valued underestimate can be used, since we maintain the heuristic of each class explicitly. We show some novel search algorithms that can work with some such forms. Experimental results are provided for some domains.

8.
Recent advances in clustering consider incorporating background knowledge in the partitioning algorithm, using, e.g., pairwise constraints between objects. As a matter of fact, prior information, when available, often makes it possible to better retrieve meaningful clusters in data. Here, this approach is investigated in the framework of belief functions, which allows us to handle the imprecision and the uncertainty of the clustering process. In this context, the EVCLUS algorithm was proposed for partitioning objects described by a dissimilarity matrix. It is extended here so as to take pairwise constraints into account, by adding to its objective function a penalty term that expresses pairwise constraints in the belief function framework. Various synthetic and real datasets are considered to demonstrate the interest of the proposed method, called CEVCLUS, and two applications are presented. The performance of CEVCLUS is also compared to that of other constrained clustering algorithms.

9.
In traditional co-clustering, the only basis for the clustering task is a given relationship matrix, describing the strengths of the relationships between pairs of elements in the different domains. Relying on this single input matrix, co-clustering discovers relationships holding among groups of elements from the two input domains. In many real life applications, on the other hand, other background knowledge or metadata about one or more of the two input domain dimensions may be available and, if leveraged properly, such metadata might play a significant role in the effectiveness of the co-clustering process. How additional metadata affects co-clustering, however, depends on how the process is modified to be context-aware. In this paper, we propose, compare, and evaluate three alternative strategies (metadata-driven, metadata-constrained, and metadata-injected co-clustering) for embedding available contextual knowledge into the co-clustering process. Experimental results show that it is possible to leverage the available metadata in discovering contextually-relevant co-clusters, without significant overheads in terms of information theoretical co-cluster quality or execution cost.

10.
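To address the problem that most existing clustering ensemble algorithms are unsupervised and cannot handle high-dimensional data well, a pairwise-constrained semi-supervised clustering ensemble algorithm based on PCA dimensionality reduction (SSCEDR) is designed. SSCEDR uses principal component analysis (PCA) to reduce the dimensionality of the original data and, combined with semi-supervised clustering ensemble techniques, incorporates prior knowledge such as pairwise constraints into the clustering ensemble process in the reduced space. Experiments on multiple data sets are used to verify...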

11.
Clustering is often considered an unsupervised data analysis method, but making full use of prior information in the clustering process can significantly improve the performance of the clustering algorithm. Spectral clustering can make good use of prior pairwise constraint information and has become a research hot spot in machine learning in recent years. In this paper, we propose an effective clustering algorithm, a semi-supervised spectral clustering algorithm based on pairwise constraints, in which the similarity matrix of the data points is adjusted and optimized by pairwise constraints. Experiments on real-world data sets demonstrate the effectiveness of this algorithm.
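The idea of adjusting a similarity matrix with pairwise constraints can be sketched as follows. The abstract does not say how the adjustment is done, so this sketch assumes one commonly used heuristic: a Gaussian similarity matrix in which must-link entries are set to 1 and cannot-link entries to 0. The function names, the `sigma` parameter, and the sample points are all hypothetical.

```python
import math


def gaussian_similarity(points, sigma=1.0):
    """Build a symmetric Gaussian (RBF) similarity matrix with a zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def apply_constraints(w, must_link, cannot_link):
    """Return a copy of w with constrained entries overwritten.

    Assumed heuristic: must-link pairs get similarity 1, cannot-link pairs 0.
    """
    adjusted = [row[:] for row in w]
    for i, j in must_link:
        adjusted[i][j] = adjusted[j][i] = 1.0
    for i, j in cannot_link:
        adjusted[i][j] = adjusted[j][i] = 0.0
    return adjusted


# Hypothetical data: three 2-D points, one must-link and one cannot-link pair.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
w = gaussian_similarity(points)
adjusted = apply_constraints(w, must_link=[(0, 2)], cannot_link=[(0, 1)])
```

The adjusted matrix would then be passed to an ordinary spectral clustering step (graph Laplacian, eigenvectors, k-means), which is omitted here.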

12.
Over the last decade there has been an increasing interest in semi-supervised clustering. Several studies have suggested that even a small amount of supervised information can significantly improve the results of unsupervised learning. One popular method of incorporating partial supervised information is through pair-wise constraints indicating whether a certain pair of patterns should belong to the same (Must-link) or different (Dont-link) clusters. In this study we propose a novel semi-supervised fuzzy clustering algorithm (SSFCA). The supervised information is incorporated via a method quantifying Must-link and/or Dont-link constraints. Additionally, we present an extension of SSFCA that allows the algorithm to automatically detect the number of clusters in the data. We apply SSFCA to the intrinsic problem of clustering gene expression profiles. The advantageous properties of fuzzy logic, inherited by SSFCA, allow genes to belong to more than one group, thereby revealing more profound information concerning their multiple functional roles. Finally, we investigate the incorporation of prior biological knowledge derived from Gene Ontology in the process of selecting pair-wise constraints. Simulations on artificial and real-life datasets show that the proposed SSFCA significantly outperforms other standard and semi-supervised clustering methods.

13.
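Semi-supervised clustering of remote sensing images based on the SSKM algorithm   Total citations: 1, self-citations: 0, citations by others: 1
Yan Li (闫利), Cao Jun (曹君). Remote Sensing Information (《遥感信息》), 2010, (2): 8-11
Semi-supervised clustering is a new clustering approach proposed in recent years that offers good clustering performance. However, the vast majority of such methods require complete prior information, that is, at least one labeled sample for every class. This paper proposes a semi-supervised clustering method for remote sensing images based on incomplete information, the SSKM clustering algorithm, which uses prior information about only some of the sample classes to assist remote sensing image clustering. Experiments show that, compared with traditional K-means clustering, the algorithm effectively improves the clustering of remote sensing images.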

14.
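Co-clustering is a class of algorithms that cluster the rows and columns of a data matrix simultaneously. This paper introduces the idea of two-level weighting into co-clustering and proposes a two-level subspace weighting co-clustering algorithm (TLWCC). TLWCC places one level of weights on the co-clusters (blocks) and a second level on the rows and columns, and computes the three groups of weights (block, row, and column) automatically during iteration. TLWCC considers the distance of each block, row, and column from the corresponding block, row, or column center: the larger the distance, the stronger the noise is assumed to be and the smaller the weight assigned; conversely, weaker noise receives a larger weight. By giving noisy information small weights, TLWCC effectively reduces the interference it causes and improves clustering quality. Four groups of experiments demonstrate TLWCC's ability to identify noisy information, the influence of parameter selection on the clustering results, and the algorithm's clustering performance and running time.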

15.
Most variants of fuzzy c-means (FCM) clustering algorithms involving prior knowledge are generally based on the modification of the objective function or the clustering process. This paper proposes a new weighted semi-supervised FCM algorithm (SSFCM-HPR) that transforms the prior knowledge in the labeled samples into constraint conditions in terms of fuzzy membership degrees, assigns different weights according to the representativeness of the samples, and then uses the HPR multiplier to solve the clustering problem. The “representativeness” of the labeled samples is decided by their distances to the cluster centers they belong to. In this paper, we take the ratio of the largest to the second largest fuzzy membership degree from a labeled sample as its weight. This algorithm not only retains the fuzzy partition of the labeled samples, which guarantees the effective guidance on the clustering process, but also can detect whether a sample is an outlier or not. Moreover, when part of the supervised information of the labeled samples is wrong, this algorithm can reduce the influence of the incorrectly labeled samples on the final clustering results. The experimental evaluation on synthetic and real data sets demonstrates the efficiency and effectiveness of our approach.
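The sample weight described above, the ratio of the largest to the second largest fuzzy membership degree, can be sketched in a few lines. The function name `representativeness_weight` is hypothetical, and returning infinity when the second largest membership is zero is an assumption of this sketch; the abstract does not say how that case is handled.

```python
def representativeness_weight(memberships):
    """Ratio of the largest to the second largest membership degree.

    memberships -- fuzzy membership degrees of one labeled sample,
                   one value per cluster (at least two clusters).
    """
    if len(memberships) < 2:
        raise ValueError("need membership degrees for at least two clusters")
    ordered = sorted(memberships, reverse=True)
    largest, second = ordered[0], ordered[1]
    if second == 0.0:
        # Assumed convention: a sample belonging entirely to one cluster
        # is treated as maximally representative.
        return float("inf")
    return largest / second


# A sample close to one center gets a large weight,
# a sample between two centers gets a weight near 1.
clear_sample = representativeness_weight([0.7, 0.2, 0.1])
ambiguous_sample = representativeness_weight([0.5, 0.5])
```

Under this reading, a weight close to 1 flags a sample that sits between clusters, which is consistent with the abstract's remark that the weights help limit the influence of unrepresentative or mislabeled samples.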

16.
Multi-view kernel construction   Total citations: 1, self-citations: 0, citations by others: 1
In many problem domains data may come from multiple sources (or views), such as video and audio from a camera or text on and links to a web page. These multiple views of the data are often not directly comparable to one another, and thus a principled method for their integration is warranted. In this paper we develop a new algorithm to leverage information from multiple views for unsupervised clustering by constructing a custom kernel. We generate a multipartite graph (with the number of parts given by the number of views) that induces a kernel we then use for spectral clustering. Our algorithm can be seen as a generalization of co-clustering and spectral clustering and a relative of Kernel Canonical Correlation Analysis. We demonstrate the algorithm on four data sets: an illustrative artificial data set, synthetic fMRI data, voxels from an fMRI study, and a collection of web pages. Finally, we compare its performance to common alternatives.

17.
To deal with data patterns with linguistic ambiguity and with probabilistic uncertainty in a single framework, we construct an interpretable probabilistic fuzzy rule-based system that requires less human intervention and less prior knowledge than other state-of-the-art methods. Specifically, we present a new iterative fuzzy clustering algorithm that incorporates a supervisory scheme into an unsupervised fuzzy clustering process. The learning process starts in a fully unsupervised manner using the fuzzy c-means (FCM) clustering algorithm and a cluster validity criterion, and then gradually constructs meaningful fuzzy partitions over the input space. The corresponding fuzzy rules with probabilities are obtained through an iterative learning process of selecting clusters with supervisory guidance based on the notions of cluster-pureness and class-separability. The proposed algorithm is tested first with synthetic data sets and benchmark data sets from the UCI Repository of Machine Learning Database and then with real facial expression data and TV viewing data.
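The unsupervised starting point mentioned above, fuzzy c-means, assigns each point a membership degree in every cluster. The sketch below shows only the standard FCM membership formula for a single point given fixed centers, u_i = 1 / Σ_j (d_i / d_j)^(2/(m-1)); it is not the supervisory scheme of the paper. The function name, the fuzzifier default `m=2.0`, and the equal split among coincident centers are assumptions of this sketch.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership degrees of one point.

    point   -- coordinates of the point
    centers -- list of cluster center coordinates
    m       -- fuzzifier, must be greater than 1
    """
    dists = [math.dist(point, c) for c in centers]
    # If the point coincides with one or more centers, split the full
    # membership among them (assumed convention to avoid division by zero).
    zeros = [k for k, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if k in zeros else 0.0
                for k in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]


# Hypothetical example: a point at distance 1 and 2 from two centers.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
```

In a full FCM loop this membership update alternates with recomputing each center as a membership-weighted mean of the points until the partition stabilizes.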

18.
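Mining multi-view consistency is the key to improving multi-view clustering performance. To better learn a consistent representation from multi-view data, a new multi-view clustering algorithm, OMTSC, is proposed. OMTSC simultaneously learns a cluster assignment matrix and a feature embedding for each view, and decomposes the cluster assignment matrix into a shared orthogonal basis matrix and a cluster encoding matrix. The orthogonal basis matrix captures and stores multi-view consistency information to form latent cluster centers, and the multi-view cluster encoding matrices, after weighted fusion, better balance the quality differences among views. Co-clustering based on bipartite graphs is introduced so that knowledge is transferred among the three matrices (orthogonal basis, cluster encoding, and feature embedding), improving the consistency and diversity of multi-view data; the diversity of the feature embeddings is used to maximize multi-view consistency and learn the optimal latent cluster centers, thereby improving multi-view clustering performance. In addition, feature embedding under a group sparsity constraint effectively removes noise from multi-view data and improves the robustness of the algorithm. Experimental results on the WikipediaArticles, COIL20, and ORL data sets show that, compared with advanced multi-view clustering algorithms such as SC-Best and Co-Reg, OMTSC achieves the best overall values on the three evaluation metrics ACC, NMI, and ARI, with NMI above 0.9 on both COIL20 and ORL.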

19.
The event detection problem, which is closely related to clustering, has gained a lot of attention for textual documents. However, although image clustering is a problem that has been treated extensively in both Content-Based Image Retrieval (CBIR) and Text-Based Image Retrieval (TBIR) systems, event detection within image management is a relatively new area. With this in mind, we propose a novel approach for event extraction and clustering of images, taking into account textual annotations, time and geographical positions. Our goal is to develop a clustering method based on the fact that an image may belong to an event cluster. Here, we stress the necessity of having an event clustering and cluster extraction algorithm that is both scalable and suitable for online applications. To achieve this, we extend a well-known clustering algorithm called Suffix Tree Clustering (STC), originally developed to cluster text documents using document snippets. The idea is that we consider an image along with its annotation as a document. Further, we extend it to also include time and geographical position so that we can capture the contextual information from each image during the clustering process. This has proved particularly useful on images gathered from online photo-sharing applications such as Flickr. Hence, our STC-based approach is aimed at dealing with the challenges induced by capturing contextual information from Flickr images and extracting related events. We evaluate our algorithm using different annotated datasets mainly gathered from Flickr. As part of this evaluation we investigate the effects of using different parameters, such as time and space granularities, and compare these effects. In addition, we evaluate the performance of our algorithm with respect to mining events from image collections. Our experimental results clearly demonstrate the effectiveness of our STC-based algorithm in extracting and clustering events.

20.
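Liu Yanqiong (刘琰琼), Zhang Wensheng (张文生), Li Yiqun (李益群), Yang Liu (杨柳). Computer Engineering (《计算机工程》), 2011, 37(5): 207-209, 212
Traditional clustering methods handle homogeneous data and cannot meet the application requirement of clustering heterogeneous data simultaneously; the accuracy of their clustering results is low and their labels are hard to read. To address these problems, a co-clustering algorithm for heterogeneous data based on resistor networks is proposed. The algorithm abstracts heterogeneous related data as a resistor network in the form of a multipartite graph, on which features are computed and clustering is performed. Co-clustering the heterogeneous data yields a cluster structure in which each cluster contains several heterogeneous data types that can serve as labels for one another, giving highly readable labels. Experimental results show that the method is a practical and effective data clustering algorithm.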
