首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Semi-supervised document clustering, which takes into account limited supervised data to group unlabeled documents into clusters, has received significant interest recently. Because of getting supervised data may be expensive, it is important to get most informative knowledge to improve the clustering performance. This paper presents a semi-supervised document clustering algorithm and a new method for actively selecting informative instance-level constraints to get improved clustering performance. The semi- supervised document clustering algorithm is a Constrained DBSCAN (Cons-DBSCAN) algorithm, which incorporates instance-level constraints to guide the clustering process in DBSCAN. An active learning approach is proposed to select informative document pairs for obtaining user feedbacks. Experimental results show that Cons-DBSCAN with our proposed active learning approach can improve the clustering performance significantly when given a relatively small amount of constraints.  相似文献   

2.
一种结合主动学习的半监督文档聚类算法   总被引:1,自引:0,他引:1  
半监督文档聚类,即利用少量具有监督信息的数据来辅助无监督文档聚类,近几年来逐渐成为机器学习和数据挖掘领域研究的热点问题.由于获取大量监督信息费时费力,因此,国内外学者考虑如何获得少量但对聚类性能提高显著的监督信息.提出一种结合主动学习的半监督文档聚类算法,通过引入成对约束信息指导DBSCAN的聚类过程来提高聚类性能,得到一种半监督文档聚类算法Cons-DBSCAN.通过对约束集中所含信息量的衡量和对DBSCAN算法本身的分析,提出了一种启发式的主动学习算法,能够选取含信息量大的成对约束集,从而能够更高效地辅助半监督文档聚类.实验结果表明,所提出的算法能够高效地进行文档聚类.通过主动学习算法获得的成对约束集,能够显著地提高聚类性能.并且,算法的性能优于两个代表性的结合主动学习的半监督聚类算法.  相似文献   

3.
Aspect mining improves the modularity of legacy software systems through identifying their underlying crosscutting concerns (CCs). However, a realistic CC is a composite one that consists of CC seeds and relative program elements, which makes it a great challenge to identify a composite CC. In this paper, inspired by the state‐of‐the‐art information retrieval techniques, we model this problem as a semi‐supervised learning problem. First, the link analysis technique is adopted to generate CC seeds. Second, we construct a coupling graph, which indicates the relationship between CC seeds. Then, we adopt community detection technique to generate groups of CC seeds as constraints for semi‐supervised learning, which can guide the clustering process. Furthermore, we propose a semi‐supervised graph clustering approach named constrained authority‐shift clustering to identify composite CCs. Two measurements, namely, similarity and connectivity, are defined and seeded graph is generated for clustering program elements. We evaluate constrained authority‐shift clustering on numerous software systems including large‐scale distributed software system. The experimental results demonstrate that our semi‐supervised learning is more effective in detecting composite CCs. Copyright © 2013 John Wiley & Sons, Ltd.  相似文献   

4.
对于所提出的建立在成对约束基础之上的半监督凝聚层次聚类算法,对聚类簇进行半监督处理的最主要目的在于借助于对样本监督信息的合理应用,达到提高样本在无监督状态下学习性能的目标.在现阶段的技术条件支持下,以半监督聚类分析为核心,建立在must link以及cannot link基础之上的约束关系被广泛地应用于样本聚类分析的过程当中.从这一角度上来说,为了使聚类簇与聚类簇之间的距离关系表述更加的真实与精确,就要求通过对成对约束关系的综合应用,实现对聚类簇距离的有效调整与优化.  相似文献   

5.
Semi-supervised context characterized by the presence of a few pairs of constraints between learning samples is abundant in many real applications. Analysing these instance constraints by recent spectral scores has shown good performances for semi-supervised feature selection. The performance evaluation of these scores is generally based on classification accuracy and is performed in a ground truth context. However, this supervised context used by the evaluation step is inconsistent with the semi-supervised context in which the feature selection operates. In this paper, we propose a semi-supervised performance evaluation procedure, so that both feature selection and clustering steps take into account the constraints given by the user. In this way, the selection and the evaluation steps are performed in the same context which is close to real life applications. Extensive experiments on benchmark datasets are carried out in the last section. These experiments are performed using a supervised classical evaluation and the semi-supervised proposed one. They demonstrate the effectiveness of feature selection based on constraint analysis that uses both pairwise constraints and the information brought by the unlabeled data.  相似文献   

6.
文本分类和文本聚类在信息过滤系统对用户兴趣进行学习的过程中,都具有很普遍的应用。文中对两者的工作原理进行了对比和分析,从根本上指出了文本分类作为有监督学习方法所存在的固有缺陷,提出了一种在文本聚类后根据词条与聚类的分布特征调整词条权重的方法,并设计和实现了一个基于文本聚类和权重调整的用户兴趣模型构造算法。  相似文献   

7.
针对网络流量特征选择过程中监督信息缺乏的问题,提出一种基于成对约束扩展的半监督网络流量特征选择算法。该算法同时考虑少量成对约束和大量无标记样本,利用样本集合间的相关性和自相关性,扩展成对约束集到无标记样本上,产生更多可靠性强的成对约束,以揭示样本空间分布信息。最后,利用扩展的成对约束集进行特征选择。实验证明:与未进行成对约束扩展的算法相比,该算法在少量初始成对约束的情况下能获得更好的分类性能。  相似文献   

8.
Feature selection is an important preprocessing step in mining high-dimensional data. Generally, supervised feature selection methods with supervision information are superior to unsupervised ones without supervision information. In the literature, nearly all existing supervised feature selection methods use class labels as supervision information. In this paper, we propose to use another form of supervision information for feature selection, i.e. pairwise constraints, which specifies whether a pair of data samples belong to the same class (must-link constraints) or different classes (cannot-link constraints). Pairwise constraints arise naturally in many tasks and are more practical and inexpensive than class labels. This topic has not yet been addressed in feature selection research. We call our pairwise constraints guided feature selection algorithm as Constraint Score and compare it with the well-known Fisher Score and Laplacian Score algorithms. Experiments are carried out on several high-dimensional UCI and face data sets. Experimental results show that, with very few pairwise constraints, Constraint Score achieves similar or even higher performance than Fisher Score with full class labels on the whole training data, and significantly outperforms Laplacian Score.  相似文献   

9.
PCCS部分聚类分类:一种快速的Web文档聚类方法   总被引:16,自引:1,他引:15  
PCCS是为了帮助Web用户从搜索引擎所返回的大量文档片中筛选出自已所需要的文档,而使用的一种对Web文档进行快速聚类的部分聚类分法,首先对一部分文档进行聚类,然后根据聚类结果形成类模型对其余的文档进行分类,采用交互式的一次改进一个聚类摘选的聚类方法快速地创建一个聚类摘选集,将其余的文档使用Naive-Bayes分类器进行划分,为了提高聚类与分类的效率,提出了一种混合特征选取方法以减少文档表示的维数,重新计算文档中各特征的熵,从中选取具有最大熵值的前若干个特征,或者基于持久分类模型中的特征集来进行特征选取,实验证明,部分聚类方法能够快速,准确地根据文档主题内容组织Web文档,使用户在更高的术题层次上来查看搜索引擎返回的结果,从以主题相似的文档所形成的集簇中选取相关文档。  相似文献   

10.
为了在只有少量已知标记的数据集中获得较好的聚类效果,提出了一种基于图收缩的半监督聚类算法。首先将整个样本空间中的数据表达为一个带权图,再根据给出的must-link约束,对图进行边收缩的修改,进而增强must-link约束。在此基础上引入图拉普拉斯算子,结合cannot-link约束将样本空间投影到一个特征子空间。最后在子空间上进行聚类分析。实验结果表明,该方法不仅提高了对复杂数据的聚类结果,而且在约束对数量较少时也能获得较好的结果。  相似文献   

11.
提出一种自动文本聚类方法,应用遗传算法进行全局和快速的文本特征项选择以实现降维处理,引入概率匿名思想,根据文本中不同特征项权重的组合,基于动态规划设计一个优化的多项式时间聚类算法,将文本集划分成适当个数的分区,并对每个分区进行聚类,从而形成初始聚类,采用相同方法对所有初始聚类进行再聚类,形成最终的文本聚类。实验结果表明,该方法既能实现文本特征项的有效选择,又能较好地改善文本聚类效果和性能。  相似文献   

12.
一种用于文章推荐系统中的用户模型表示方法   总被引:2,自引:0,他引:2  
分析了现有文章推荐系统中基于关键词向量的用户模型表示方法存在的不足,提出了基于聚类兴趣点的用户模型表示方法。该方法可通过文章聚类形成兴趣点。由于传统的基于划分的聚类算法存在的不足,提出了基于复杂网络特征的文章聚类算法。实验结果表明该用户模型的表示方法较好地反映了用户多方面的兴趣,提高了文章推荐系统的性能。  相似文献   

13.
王纵虎  刘速 《计算机科学》2016,43(12):183-188
半监督聚类能利用少量标记数据来提高聚类算法性能,但大部分文本聚类算法无法直接应用成对约束等先验信息。针对文本数据高维稀疏的特点,提出了一种半监督文本聚类算法。将成对约束信息扩展后嵌入文档相似度矩阵,在此基础上根据已划分与未划分文档之间的统计信息逐步找出剩余未划分文本集合中密集的且与已划分聚类中心集合相似度较小的K个初始聚类中心集合,然后将剩余的相对较难区分的文档结合成对约束限制信息划分到K个初始聚类中心集合,最后通过融合成对约束违反惩罚的收敛准则函数对聚类结果进行进一步优化。算法在聚类过程中自动确定初始聚类中心集合,避免了K均值算法对初始聚类中心选择的敏感性。在几个中英文数据集上的实验结果表明,所提算法能有效地利用少量的成对约束先验信息提高聚类效果。  相似文献   

14.
基于成对约束的判别型半监督聚类分析   总被引:10,自引:1,他引:9  
尹学松  胡恩良  陈松灿 《软件学报》2008,19(11):2791-2802
现有一些典型的半监督聚类方法一方面难以有效地解决成对约束的违反问题,另一方面未能同时处理高维数据.通过提出一种基于成对约束的判别型半监督聚类分析方法来同时解决上述问题.该方法有效地利用了监督信息集成数据降维和聚类,即在投影空间中使用基于成对约束的K均值算法对数据聚类,再利用聚类结果选择投影空间.同时,该算法降低了基于约束的半监督聚类算法的计算复杂度,并解决了聚类过程中成对约束的违反问题.在一组真实数据集上的实验结果表明,与现有相关半监督聚类算法相比,新方法不仅能够处理高维数据,还有效地提高了聚类性能.  相似文献   

15.
This paper proposes a graph‐based approach for semisupervised clustering based on pairwise relations among instances. In our approach, the entire data set is represented as an edge‐weighted graph by mapping each data element (instance) as a vertex and connecting the instances by edges with their similarities. In order to reflect pairwise constraints on the clustering process, the graph is modified by contraction as it is known from general graph theory and the graph Laplacian in spectral graph theory. The graph representation enables us to deal with pairwise constraints as well as pairwise similarities over the same unified representation. By exploiting the constraints as well as similarities among instances, the entire data set is projected onto a subspace via the modified graph, and data clustering is conducted over the projected representation. The proposed approach is evaluated over several real‐world data sets. The results are encouraging and show that it is worthwhile to pursue the proposed approach.  相似文献   

16.
Recent feature selection scores using pairwise constraints (must-link and cannot-link) have shown better performances than the unsupervised methods and comparable to the supervised ones. However, these scores use only the pairwise constraints and ignore the available information brought by the unlabeled data. Moreover, these constraint scores strongly depend on the given must-link and cannot-link subsets built by the user. In this paper, we address these problems and propose a new semi-supervised constraint score that uses both pairwise constraints and local properties of the unlabeled data. Experiments using Kendall’s coefficient and accuracy rates, show that this new score is less sensitive to the given constraints than the previous scores while providing similar performances.  相似文献   

17.
pSCAN算法的聚类结果受密度约束参数和相似度阈值参数的影响,如果用户提供的聚类参数得到的聚类结果无法满足需求,那么用户可以通过实例簇表达自己的聚类需求。针对实例簇表达聚类查询需求的问题,提出一种实例簇驱动的图结构聚类参数计算算法PART及其改进算法ImPART。首先,分析两个聚类参数对聚类结果的影响,并提取实例簇的相关子图;其次,对相关子图进行分析得到密度约束参数的可行区间,并根据当前密度约束参数和节点之间的结构相似度将实例簇内节点划分为核心节点和非核心节点;最后,依据节点划分结果计算出当前密度约束参数对应的最优相似度阈值参数,并在相关子图上对得到的参数进行验证和优化,直到得到满足实例簇需求的聚类参数。在真实数据集上的实验结果表明,所提算法能够为用户实例簇返回一组有效参数,且所提改进算法ImPART的运行时间比PART缩短了20%以上,能够快速有效地为用户返回满足实例簇要求的最优聚类参数。  相似文献   

18.
Semisupervised clustering algorithms partition a given data set using limited supervision from the user. The success of these algorithms depends on the type of supervision and also on the kind of dissimilarity measure used while creating partitions of the space. This paper proposes a clustering algorithm that uses supervision in terms of relative comparisons, viz., x is closer to y than to z. The proposed clustering algorithm simultaneously learns the underlying dissimilarity measure while finding compact clusters in the given data set using relative comparisons. Through our experimental studies on high-dimensional textual data sets, we demonstrate that the proposed algorithm achieves higher accuracy and is more robust than similar algorithms using pairwise constraints for supervision.  相似文献   

19.
Unsupervised feature selection is an important problem, especially for high‐dimensional data. However, until now, it has been scarcely studied and the existing algorithms cannot provide satisfying performance. Thus, in this paper, we propose a new unsupervised feature selection algorithm using similarity‐based feature clustering, Feature Selection‐based Feature Clustering (FSFC). FSFC removes redundant features according to the results of feature clustering based on feature similarity. First, it clusters the features according to their similarity. A new feature clustering algorithm is proposed, which overcomes the shortcomings of K‐means. Second, it selects a representative feature from each cluster, which contains most interesting information of features in the cluster. The efficiency and effectiveness of FSFC are tested upon real‐world data sets and compared with two representative unsupervised feature selection algorithms, Feature Selection Using Similarity (FSUS) and Multi‐Cluster‐based Feature Selection (MCFS) in terms of runtime, feature compression ratio, and the clustering results of K‐means. The results show that FSFC can not only reduce the feature space in less time, but also significantly improve the clustering performance of K‐means.  相似文献   

20.
一个基于关联规则的多层文档聚类算法   总被引:3,自引:0,他引:3  
提出了一种新的基于关联规则的多层文档聚类算法,该算法利用新的文档特征抽取方法构造了文档的主题和关键字特征向量。首先在主题特征向量空间中利用频集快速算法对文档进行初始聚类,然后在基于主题关键字的新的特征向量空间中利用类间距和连接度对初始文档类进行求精,从而得到最终聚类。由于使用了两层聚类方法,使算法的效率和精度都大大提高;使用新的文档特征抽取方法还解决了由于文档关键字过多而导致文档特征向量的维数过高的问题。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号