首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 187 毫秒
1.
王欣艺 《福建电脑》2013,29(3):129-131,139
当查询比较模糊,检索到的结果文档中表达了对查询的不同解释时,就要根据用户的相关反馈对返回结果进行聚类,本章首先介绍了一种著名的基于划分的聚类方法 K-均值算法。这种算法虽然效果显著,却无法处理类别属性的聚类任务。因此,本文基于层次分类方法,设计了一种针对类别属性分类的聚类算法,使其聚类后的返回结果具有高正确率的特点。  相似文献   

2.
Deep Web数据源聚类与分类   总被引:1,自引:0,他引:1  
随着Internet信息的迅速增长,许多Web信息已经被各种各样的可搜索在线数据库所深化,并被隐藏在Web查询接口下面.传统的搜索引擎由于技术原因不能索引这些信息--Deep Web信息.本文分析了Deep Web查询接口的各种类型,研究了基于查询接口特征的数据源聚类方法和基于聚类结果的数据源分类方法,讨论了从基于规则与线性文档分类器中抽取查询探测集的规则抽取算法和Web文档数据库分类的查询探测算法.  相似文献   

3.
本文研究了p2p网络中基于内容的节点聚类。基于文件名关键词精确匹配的查询没有考虑文本语义及内容相似性。如果能够根据节点发布内容的相似性,建立节点聚类,信息查询在类内进行,必将提高查询效率。本文提出了一种基于增量学习的节点聚类方法,通过兴趣爬虫代理计算节点得分,据此判断一个节点是否可以加入节点簇。实验表明,节点簇的建立可以有效地提高 p2p 网络的查询效率。  相似文献   

4.
为了解决Web数据库多查询结果问题,提出了一种基于改进决策树算法的Web数据库查询结果自动分类方法.该方法在离线阶段分析系统中所有用户的查询历史并聚合语义上相似的查询,根据聚合的查询将原始数据划分成多个元组聚类,每个元组聚类对应一种类型的用户偏好.当查询到来时,基于离线阶段划分的元组聚类,利用改进的决策树算法在查询结果集上自动构建一个带标签的分层分类树,使得用户能够通过检查标签的方式快速选择和定位其所需信息.实验结果表明,提出的分类方法具有较低的搜索代价和较好的分类效果,能够有效地满足不同类型用户的个性化查询需求.  相似文献   

5.
类别型数据聚类被广泛应用于现实世界的不同领域中,如医学科学、计算机科学等。通常的类别型数据聚类,是在基于相异度量上进行研究,针对不同特点的数据集,聚类结果会受到数据集自身特点和噪音信息的影响。此外,基于表示学习的类别型数据聚类,实现复杂,聚类结果受到表示结果的影响较大。本文以共现矩阵为基础,提出一种可以直接考虑类别型数据原始信息关联关系的聚类方法——基于从共现矩阵提取关联的类别型数据聚类方法(CDCBCM)。共现矩阵可被看作是一种对原始数据空间中信息关联情况的汇总。本文通过计算不同对象在各个属性子空间下的共现频率值来构建共现矩阵,并从共现矩阵中去除一些噪音信息,再使用归一化切割来得到聚类结果。本文方法在16个不同领域的公开数据集中进行测试,与8种现有方法进行比较,并采用F1-score指标进行检测。实验结果表明,本文方法在7个数据集上效果最好,平均排名最高,能更好地完成对类别型数据的聚类任务。  相似文献   

6.
传统的聚类方法大都是基于空间划分的方法,一般都假设数据符合混合高斯模型。这在实际应用中往往是不成立的。在大部分模式分类的问题中,常见的参数形式不适合实际遇到的概率密度,特别是所有经典的参数密度都是单峰的,而一般遥感图像都是包含多峰的密度,因此分类结果往往不够精确。用于模式分类的非参数方法正是解决这类问题的一个重要途径,可以从本质上克服这一缺陷,而且可以发现任意形状的聚类。均值漂移方法是基于密度估计的非参数聚类方法,遥感图像的聚类分析可以通过均值漂移方法来实现,而且均值漂移过程不需要预先给出地物的类别数目,在聚类过程中自动确定类别数,这对于图像中类别数目不易确定的情况,给非监督遥感图像聚类带来方便。  相似文献   

7.
为了解决搜索引擎检索结果中的主题混杂现象,帮助用户快速准确地定位到有价值的信息,提出基于主题短语的搜索引擎结果聚类方法。首先从检索结果中提取查询词并与相邻词语组成主题短语,建立包含高频独立词语及主题短语的混合向量空间模型,同时引入同义词词林对特征项进行语义扩充,最后采用改进的k-means聚类算法对搜索结果进行聚类,并为各个类别提取类别标签。实验结果表明,该算法能有效提高聚类结果的准确率。  相似文献   

8.
历史客运量与客运需求存在差距,基于余票查询数据的起讫点(OD)客流特征分析可以较为实时地反映客运需求。对于一些客流特征的挖掘目前主要的方法是利用聚类算法进行群体划分,进而发现每个类别的特征。针对余票查询数据维度高,直接使用聚类算法鲁棒性较差的问题,提出了一种基于随机距离预测的高层特征抽取模型RDP与K-means结合的OD客流聚类分析方法。以京沪高速铁路预售期内余票查询量数据为原始数据,将乘车日期作为预分类条件,运用RDP算法提取预分类后数据的重构特征,然后通过K-means算法对重构特征进行聚类。实验结果表明,RDP K-means算法在Calinski-Harabaz指数、轮廓系数、戴维森堡丁指数三种内部聚类评价指标下效果均优于传统的K-means、PCA K-means、层次聚类、DBSCAN等算法,证明了RDP K-means算法在基于余票查询数据的OD客流特征分析研究中的有效性,能够更好地进行OD类别划分、客流出行特征分析、热门OD挖掘,为改善运力调整等相关业务提供一定的参考依据。  相似文献   

9.
现有方法没有有效利用查询文本特征、点击行为和session信息来挖掘用户的搜索意图,获取的查询特征对于多意图查询在不同意图下的区分度不足,对于多意图查询的相关查询聚类效果不佳。针对以上问题,该文提出了基于查询图信息的GPLSI模型,并利用该模型学习所得的查询特征进行查询聚类。基于查询图信息的GPLSI模型利用查询的词语、点击和session共现现象,从查询的文本特征、点击行为和session信息等多个方面来模拟查询意图的产生和表现,学习查询在不同搜索意图上的概率分布。最后,实验结果验证了基于查询图信息的PLSI模型用于查询相似度计算和多意图查询聚类中的有效性。  相似文献   

10.
同义词和近义词现象以及强关联语义信息加大了文本向量的特征维数,对文本分类的效率和精度都会带来极大影响.为了有效降低文本向量的特征维数,提出一种基于混合并行遗传聚类的文本特征抽取方法.该方法首先使用K-means聚类算法进行特征词粗粒度聚类,然后采用混合并行遗传算法对各类特征词进行细粒度聚类,最后对各聚类中的特征词进行分析并压缩,得到最终能反映文本类别特征和语义信息的文本特征词集合.实验证明,该方法是一种有效的文本特征抽取方法,能切实提高文本分类的效率和精度.  相似文献   

11.
Huge numbers of documents are being generated on the Web, especially for news articles and social media. How to effectively organize these evolving documents so that readers can easily browse or search is a challenging task. Existing methods include classification, clustering, and chronological or geographical ordering, which only provides a partial view of the relations among news articles. To better utilize cross‐document relations in organizing news articles, in this paper, we propose a novel approach to organize news archives by exploiting their near‐duplicate relations. First, we use a sentence‐level statistics‐based approach to near‐duplicate copy detection, which is language independent, simple but effective. Since content‐based approaches are usually time consuming and not robust to term substitutions, near‐duplicate detection approach can be used. Second, by extracting the cross‐document relations in a block‐sharing graph, we can derive a near‐duplicate clustering by cross‐document relations in which users can easily browse and find out unnecessary repetitions among documents. From the experimental results, we observed high efficiency and good accuracy of the proposed approach in detecting and clustering near‐duplicate documents in news archives.  相似文献   

12.
基于遗传算法及聚类的基因表达数据特征选择   总被引:1,自引:0,他引:1  
特征选择是模式识别及数据挖掘等领域的重要问题之一。针对高维数据对象(如基因表达数据)的特征选择,一方面可以提高分类及聚类的精度和效率,另一方面可以找出富含信息的特征子集,如发现与疾病密切相关的重要基因。针对此问题,本文提出了一种新的面向基因表达数据的特征选择方法,在特征子集搜索上采用遗传算法进行随机搜索,在特征子集评价上采用聚类算法及聚类错误率作为学习算法及评价指标。实验结果表明,该算法可有效地找出具有较好可分离性的特征子集,从而实现降维并提高聚类及分类精度。  相似文献   

13.
The majority of the algorithms in the software clustering literature utilize structural information to decompose large software systems. Approaches using other attributes, such as file names or ownership information, have also demonstrated merit. At the same time, existing algorithms commonly deem all attributes of the software artifacts being clustered as equally important, a rather simplistic assumption. Moreover, no method that can assess the usefulness of a particular attribute for clustering purposes has been presented in the literature. In this paper, we present an approach that applies information theoretic techniques in the context of software clustering. Our approach allows for weighting schemes that reflect the importance of various attributes to be applied. We introduce LIMBO, a scalable hierarchical clustering algorithm based on the minimization of information loss when clustering a software system. We also present a method that can assess the usefulness of any nonstructural attribute in a software clustering context. We applied LIMBO to three large software systems in a number of experiments. The results indicate that this approach produces clusterings that come close to decompositions prepared by system experts. Experimental results were also used to validate our usefulness assessment method. Finally, we experimented with well-established weighting schemes from information retrieval, Web search, and data clustering. We report results as to which weighting schemes show merit in the decomposition of software systems.  相似文献   

14.
In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.  相似文献   

15.
Algorithms for clustering Web search results have to be efficient and robust. Furthermore they must be able to cluster a data set without using any kind of a priori information, such as the required number of clusters. Clustering algorithms inspired by the behavior of real ants generally meet these requirements. In this article we propose a novel approach to ant‐based clustering, based on fuzzy logic. We show that it improves existing approaches and illustrates how our algorithm can be applied to the problem of Web search results clustering. © 2007 Wiley Periodicals, Inc. Int J Int Syst 22: 455–474, 2007.  相似文献   

16.
Grouping problems aim to partition a set of items into multiple mutually disjoint subsets according to some specific criterion and constraints. Grouping problems cover a large class of computational problems including clustering and classification that frequently appear in expert and intelligent systems as well as many real applications. This paper focuses on developing a general-purpose solution approach for grouping problems, i.e., reinforcement learning based local search (RLS), which combines reinforcement learning techniques with local search. This paper makes the following contributions: we show that (1) reinforcement learning can help obtain useful information from discovered local optimum solutions; (2) the learned information can be advantageously used to guide the search algorithm towards promising regions. To the best of our knowledge, this is the first attempt to propose a formal model that combines reinforcement learning and local search for solving grouping problems. The proposed approach is verified on a well-known representative grouping problem (graph coloring). The generality of the approach makes it applicable to other grouping problems.  相似文献   

17.
18.
一种基于聚类技术的个性化信息检索方法   总被引:7,自引:2,他引:5       下载免费PDF全文
实践证明聚类技术是改进搜索结果显示方式的一种有效手段。然而,目前的聚类方法没有考虑到用户兴趣,对于相同的查询,返回给所有用户同样的聚类结果。由此提出一种个性化聚类检索方法。该方法改进了k-means算法,利用该算法对传统搜索引擎返回的结果结合用户兴趣进行聚类,返回针对特定用户的网页簇。实验证明该方法能够提供个性化服务,改善了聚类的效果,提高了用户的检索效率。  相似文献   

19.
目前,搜索结果聚类方法大多数采用基于文档的方法,不能生成有意义的聚类标签。为了解决这个问题,提出一种基于关键名词短语聚类的中文搜索结果聚类方法,该方法将名词短语、相关搜索词作为候选聚类标签,利用C-Value算法、IDF值筛选标签,然后使用Chameleon算法将标签聚类,最后将搜索结果划分到最相关的聚类簇。实验证明,该方法把关键名词短语和相关搜索词作为聚类标签,有效地提高了标签的描述性,降低了聚类算法的时间复杂度。  相似文献   

20.
In this paper, we develop a genetic algorithm method based on a latent semantic model (GAL) for text clustering. The main difficulty in the application of genetic algorithms (GAs) for document clustering is thousands or even tens of thousands of dimensions in feature space which is typical for textual data. Because the most straightforward and popular approach represents texts with the vector space model (VSM), that is, each unique term in the vocabulary represents one dimension. Latent semantic indexing (LSI) is a successful technology in information retrieval which attempts to explore the latent semantics implied by a query or a document through representing them in a dimension-reduced space. Meanwhile, LSI takes into account the effects of synonymy and polysemy, which constructs a semantic structure in textual data. GA belongs to search techniques that can efficiently evolve the optimal solution in the reduced space. We propose a variable string length genetic algorithm which has been exploited for automatically evolving the proper number of clusters as well as providing near optimal data set clustering. GA can be used in conjunction with the reduced latent semantic structure and improve clustering efficiency and accuracy. The superiority of GAL approach over conventional GA applied in VSM model is demonstrated by providing good Reuter document clustering results.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号