Similar literature
 Found 20 similar documents (search time: 140 ms)
1.
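Image Annotation and Retrieval Based on Bayesian Theory   Total citations: 2, self-citations: 1, citations by others: 1
Automatic semantic image annotation is an important and challenging task in content-based image retrieval. This paper proposes a semantically constrained clustering method for clustering the regions of segmented images. In the annotation stage, a greedy selection and joining (GSJ) algorithm finds an independent subset of the clustered regions, and Bayesian theory is then used to perform semantic annotation. Once images are annotated, retrieval is carried out on the annotated keywords. Experiments on a library of 500 images show that the proposed method achieves good retrieval performance.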
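The Bayesian annotation step described above can be illustrated with a minimal sketch. It is not the authors' method: the GSJ subset selection is omitted, region clusters are assumed to be conditionally independent given a keyword, and the names `train` and `annotate` as well as the toy data layout are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(annotated_images):
    """annotated_images: list of (region_cluster_ids, keywords) pairs (assumed layout)."""
    keyword_count = Counter()        # number of training images carrying each keyword
    cooccur = defaultdict(Counter)   # cooccur[w][c]: how often cluster c appears with keyword w
    clusters = set()
    for region_clusters, keywords in annotated_images:
        clusters.update(region_clusters)
        for w in keywords:
            keyword_count[w] += 1
            cooccur[w].update(region_clusters)
    return keyword_count, cooccur, clusters

def annotate(region_clusters, model, top_k=2):
    """Rank keywords by log P(w) + sum over regions of log P(c | w), Laplace-smoothed."""
    keyword_count, cooccur, clusters = model
    total = sum(keyword_count.values())
    scores = {}
    for w, n_w in keyword_count.items():
        log_post = math.log(n_w / total)
        denom = sum(cooccur[w].values()) + len(clusters)
        for c in region_clusters:
            log_post += math.log((cooccur[w][c] + 1) / denom)
        scores[w] = log_post
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Retrieval then reduces to matching a text query against the keywords returned by `annotate`, which is the step the abstract describes as searching with the annotated keywords.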

2.
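Application of a Separability Criterion to Chinese Web Page Classification   Total citations: 3, self-citations: 0, citations by others: 3
An improved statistics-based classification algorithm for Chinese web pages is proposed. After studying traditional similarity-based text classification and Bayesian-model text classification, the authors improve the Bayesian classifier with a separability criterion based on probability distributions: the likelihood ratio of the class density functions is used to increase the separability information carried by feature words. Comparative experiments on the similarity method, the Bayesian method, and the improved Bayesian method show that the improved algorithm maximizes the margin between classes and therefore achieves higher classification precision and recall.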
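The abstract does not give the criterion's formula. As a hedged illustration, the sketch below scores each feature word with the classical divergence criterion, (p_a - p_b) * ln(p_a / p_b), which is built from the likelihood ratio of two class-conditional distributions. The choice of this particular criterion, the Laplace smoothing, and the function names are assumptions.

```python
import math
from collections import Counter

def term_probs(docs, vocab):
    """Laplace-smoothed P(term | class) estimated from tokenized documents."""
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts[t] for t in vocab) + len(vocab)
    return {t: (counts[t] + 1) / total for t in vocab}

def separability(docs_a, docs_b):
    """Per-term divergence between two classes; larger means more discriminative."""
    vocab = {t for doc in docs_a + docs_b for t in doc}
    p_a = term_probs(docs_a, vocab)
    p_b = term_probs(docs_b, vocab)
    return {t: (p_a[t] - p_b[t]) * math.log(p_a[t] / p_b[t]) for t in vocab}
```

A word that is equally likely in both classes scores zero, while a word concentrated in one class scores high, so the scores could serve as feature weights in a Bayesian classifier.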

3.
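Chinese semantic annotation is widely used in natural language processing; its goal is to discover and label the multiple senses of polysemous Chinese nouns. A novel semantic annotation algorithm is proposed. A URL classifier is built from an online URL directory and used to classify the search-engine results for a polysemous noun (page URLs and snippets), which yields an initial sense classification. The initial classes are then clustered by page snippet, and cluster features are extracted to produce the semantic annotation of the polysemous word. Because it relies on URL-based web page classification, the algorithm can annotate polysemous Chinese nouns online. Experiments show that it achieves 70% precision and 80% recall and is suitable for annotating trending Internet words.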

4.
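When medical web pages are represented with the global and local models of latent semantic indexing, the inter-class inclusion degree of the fuzzy clustering results is very large. This paper proposes a new latent semantic difference model. Text is extracted from medical web pages and represented with the global, local, and difference models in turn; the FCM algorithm is used for clustering, and the inter-class inclusion degree is computed. Experiments on 5 given categories of medical web pages show that the inter-class inclusion degree under the difference model averages about 85% of that of the global model and 80% of that of the local model.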
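For readers unfamiliar with FCM (fuzzy c-means), the standard algorithm can be sketched as follows. This is a generic illustration on plain coordinate tuples, not the paper's pipeline: the latent semantic representations and the inter-class inclusion degree are not reproduced, and the fuzzifier `m = 2.0`, the fixed iteration count, and the caller-supplied initial centers are assumptions.

```python
import math

def memberships(points, centers, m):
    """u[i][k]: degree to which point i belongs to cluster k (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    u = []
    for x in points:
        # Clamp distances so a point lying exactly on a center does not divide by zero.
        d = [max(math.dist(x, c), 1e-12) for c in centers]
        u.append([1.0 / sum((dk / dj) ** power for dj in d) for dk in d])
    return u

def fcm(points, centers, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            w = [row[k] ** m for row in u]   # fuzzified weights of cluster k
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * x[dim] for wi, x in zip(w, points)) / total
                for dim in range(len(points[0]))))
        centers = new_centers
    return memberships(points, centers, m), centers
```

Because every point keeps a nonzero membership in every cluster, overlap between clusters can be measured directly from the membership matrix, which is the kind of quantity an inclusion degree between classes would be computed from.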

5.
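Automatic image annotation essentially analyzes the visual features of an image to extract high-level semantic keywords that express its meaning. This turns image retrieval into text retrieval, for which the technology is already mature, and partly bridges the semantic gap in content-based image retrieval. A t mixture model is used to compute the joint probability distribution of image region classes and keywords on an annotated training set. For previously unseen test images, the resulting model then performs automatic annotation according to the Bayesian minimum error probability criterion. Experimental results show that the method effectively improves annotation results.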

6.
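Because existing Web logs lack explicit semantics, a semantic Web log model, SWLM, is proposed, together with algorithms based on it for clustering web pages and users. Pages and users are clustered by quantitatively computing the semantic distance between log concepts, which lays a foundation for personalized Web services. Performance tests show that the model performs well overall and clusters web pages and users effectively.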

7.
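Xu Guang, Guo Hong. 《福建电脑》 (Fujian Computer), 2006, (8): 80-81
A video semantic annotation algorithm based on a visual ontology is proposed. Using Bayesian statistical learning and decision theory, the algorithm computes the visual similarity between the main regions of video key frames and the concepts in the visual ontology, and dynamically performs semi-automatic semantic annotation of video objects. Experimental results show that the algorithm annotates well and performs stably.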

8.
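To make effective use of both low-level visual features and semantic features in image annotation, an improved AP (Affinity Propagation) clustering annotation model is proposed. First, a semi-supervised distance metric learning algorithm incorporates image semantic information to train a new distance metric. Next, each image class is clustered with AP under the new metric to generate cluster centers for that class, and the class of an image to be annotated is determined from its average distance to the cluster centers of each class. Finally, the distances from the image to the individual cluster centers within that class determine its within-class category, and the annotation words of the images in that category are collected as the annotation of the new image. Experiments on the Corel5K and NUS-WIDE datasets confirm that the method effectively improves annotation accuracy.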

9.
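Annotating unstructured Web pages with event semantics enriches their structured information and deepens the understanding of their content. News-type Web pages are selected, and the unannotated pages are annotated with event semantics according to an event semantic annotation specification. The annotated corpus instances are abstracted into event semantic structure patterns, and a hierarchical clustering algorithm is applied to these patterns to obtain event semantic patterns of different categories. Experimental results show that analyzing event-annotated corpus instances with clustering to obtain the various categories of event semantic patterns is highly valuable for analyzing and understanding Web page content.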

10.
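Semi-supervised clustering algorithms usually use labeled data to optimize class description parameters (such as class centers) and then assign unlabeled data to classes according to those parameters, without considering the direct influence of labeled data on the class assignment of nearby unlabeled data. This paper proposes a bidirectional selection adjustment strategy: after data are assigned to classes according to the class description parameters, labeled data are used to adjust the class labels of the unlabeled data around them, which improves assignment accuracy. The method determines the adjustment range dynamically from the data density around each labeled point and uses a new similarity measure to improve the accuracy of the adjusted data. The strategy is applied to improve a semi-supervised clustering algorithm based on the multinomial model and a semi-supervised fuzzy clustering algorithm, and experiments are run on several standard datasets. The results show that the improved algorithms effectively raise the accuracy of semi-supervised clustering.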

11.
The results of traditional clustering methods are usually unreliable because there is no guidance from data labels, whereas semisupervised learning can predict class labels more reliably when the labels of part of the data are given. In this paper, we propose an active self-training clustering method, in which samples are actively selected as the training set to minimize an estimated Bayes error, and semisupervised learning is then used to perform clustering. Traditional graph-based semisupervised learning methods are not convenient for estimating the Bayes error; we develop a specific regularization framework on graphs to perform semisupervised learning, in which the Bayes error can be effectively estimated. In addition, the proposed clustering algorithm can be readily applied in a semisupervised setting with partial class labels. Experimental results on toy data and real-world data sets demonstrate the effectiveness of the proposed clustering method in both the unsupervised and the semisupervised settings. It is worth noting that the proposed clustering method is free of initialization, while traditional clustering methods usually depend on initialization.

12.
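Web Document Classification Based on Rough Set Latent Semantic Indexing   Total citations: 5, self-citations: 0, citations by others: 5
Rough set theory is a mathematical tool for handling uncertain or vague knowledge. A Web document classification method that combines rough set theory with latent semantic indexing is proposed. Web document information is first represented with the vector space model, and singular value decomposition of the matrix is then used for information filtering and latent semantic indexing. An attribute reduction algorithm generates the classification rules, and documents are finally classified using multiple knowledge bases. Comparative experiments show that the method classifies well.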

13.
Blog mining addresses the problem of mining information from blog data. Although blogs share many similarities with Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibits dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data.

14.
We develop a new algorithm for clustering search results. Unlike many other clustering systems that have recently been proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy, called dynamic SVD clustering, to discover the optimal number of singular values to be used for clustering purposes. Moreover, the SVD computation step performs well in practice, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster the results of a search engine to make them easier for users to browse. The algorithm has been integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents.

15.
With the rapid growth of text documents, document clustering is emerging as a technique for efficient document retrieval and better document browsing. Recently, some methods have been proposed to resolve the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that the proposed approach indeed provides more accurate clustering results than prior influential clustering methods presented in recent literature.

16.
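This paper addresses the problem of classifying unlabeled text documents when no training set is available. Class-related words are words or phrases that are associated with the subject of a class and reflect it. The prior information provided by class-related words is used to form prior probabilities for document classification. A naive Bayes classifier is then combined with the EM iterative algorithm, and classification constraints are added to the semi-supervised learning process, so that the class-related words supervise the construction of a classifier for completely unlabeled documents. Experimental results show that the method classifies text with high accuracy without a training set, and that accuracy under the class-related word constraints is higher than without them.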
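The combination of class-related words, naive Bayes, and EM can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the paper's algorithm: class-related words are used only to initialize soft labels (the paper's constraints inside the EM loop are not reproduced), the model is a multinomial naive Bayes with Laplace smoothing, and the function name and toy inputs are hypothetical.

```python
import math
from collections import Counter

def classify_with_seed_words(docs, seed_words, iters=10):
    """docs: list of token lists; seed_words: {class_name: set of class-related words}."""
    classes = sorted(seed_words)
    vocab = sorted({t for d in docs for t in d})
    # Initial soft labels from seed-word hits; documents with no hits start near uniform.
    post = []
    for d in docs:
        hits = [sum(t in seed_words[c] for t in d) + 1e-3 for c in classes]
        z = sum(hits)
        post.append([h / z for h in hits])
    for _ in range(iters):
        # M-step: re-estimate class priors and word probabilities from the soft labels.
        prior = [(sum(p[k] for p in post) + 1) / (len(docs) + len(classes))
                 for k in range(len(classes))]
        word_prob = []
        for k in range(len(classes)):
            counts = Counter()
            for d, p in zip(docs, post):
                for t in d:
                    counts[t] += p[k]
            total = sum(counts.values()) + len(vocab)
            word_prob.append({t: (counts[t] + 1) / total for t in vocab})
        # E-step: recompute class posteriors for every document in log space.
        post = []
        for d in docs:
            logp = [math.log(prior[k]) + sum(math.log(word_prob[k][t]) for t in d)
                    for k in range(len(classes))]
            mx = max(logp)
            w = [math.exp(v - mx) for v in logp]
            z = sum(w)
            post.append([x / z for x in w])
    return [classes[max(range(len(classes)), key=p.__getitem__)] for p in post]
```

Documents that contain no class-related word are still classified, because they share ordinary vocabulary with documents that do. That sharing is what lets a handful of seed words stand in for a labeled training set.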

17.
18.
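A Web Search Result Clustering Method Based on Tolerance Rough Sets   Total citations: 1, self-citations: 0, citations by others: 1
Some Web clustering methods treat classes as strictly mutually exclusive, which gives unsatisfactory clustering results. A k-means clustering method based on tolerance rough sets addresses this problem. Web document information is first represented with the vector model, and the set of text feature words is obtained by conventional means. The co-occurrence of certain feature words is then exploited to construct a tolerance relation over feature words, which extends their descriptive power. Finally, the tolerance classes of feature words are used to describe the similarity between documents, which yields the clustering of Web search results, and a simple, intuitive T model for measuring clustering accuracy is proposed. Experimental results show that the class labels produced by tolerance-relation clustering are descriptive and easy to understand, and that the method clearly outperforms the ordinary k-means algorithm.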
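The tolerance relation over feature words can be illustrated with the usual tolerance rough set construction: two terms are tolerant of each other when they co-occur in at least `theta` documents, and a document is enriched with every term whose tolerance class overlaps it. The threshold `theta`, the function names, and the use of this standard construction are assumptions; the abstract does not spell out its exact definition.

```python
from collections import Counter
from itertools import combinations

def tolerance_classes(docs, theta):
    """Map each term to the set of terms co-occurring with it in >= theta documents."""
    cooccur = Counter()
    for d in docs:
        for a, b in combinations(sorted(set(d)), 2):
            cooccur[(a, b)] += 1
    classes = {t: {t} for d in docs for t in d}   # every term tolerates itself
    for (a, b), n in cooccur.items():
        if n >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes

def upper_approximation(doc, classes):
    """All terms whose tolerance class shares at least one term with the document."""
    doc_terms = set(doc)
    return {t for t, cls in classes.items() if cls & doc_terms}
```

Running k-means on documents represented by their upper approximations lets two search results be similar even when they share no literal term, which is how the tolerance relation extends the descriptive power of the feature words.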

19.
Probabilistic latent semantic analysis (PLSA) is a double-structure mixture model that is widely applied in text and web mining. The method can establish hidden semantic relations among the observed features using a number of latent variables. In this approach, selecting the correct number of latent variables is critical. In most previous research, the number of latent topics was selected based on the number of invoked classes. This paper presents a method, based on a backward elimination approach, which is capable of unsupervised order selection in PLSA. The method starts with a model that has more components than needed and then prunes the mixtures to reach their optimum size. During the elimination process, the choice of which latent variables to delete is the central problem, and it directly affects the final performance of the pruned model. To treat this problem, we introduce a new combined pruning method which selects the best candidates for removal while keeping the computational cost low. We conducted experiments on two datasets from the Reuters-21578 corpus. The results show that the algorithm leads to an optimized number of latent variables and in turn achieves better clustering performance than conventional model selection methods. It also outperforms a PLSA model with a fixed number of latent variables equal to the real number of clusters.

20.
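In text classification, obtaining class-labeled training samples is very costly. To address this problem, the traditional fuzzy clustering method is improved and a fuzzy partition clustering method, FPCM, is proposed. It combines the unsupervised nature of clustering with prior knowledge about the samples: related texts are clustered by a similarity measure to obtain fairly objective clusters and a small number of labeled texts, which gives supervised learning a basis for classification, and the classifier is then trained with naive Bayes incremental learning. The paper further balances the selection of candidate samples by estimating the classification error loss, which improves classification accuracy and yields a more widely applicable model for learning to classify unlabeled text.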
