共查询到20条相似文献,搜索用时 140 毫秒
1.
2.
3.
中文语义标注在自然语言处理领域有广泛的应用,其目的在于挖掘并标注出中文多语义名词的多个语义。提出一种新颖的语义标注算法,通过在线URL分类目录,构建得到URL分类器。借助于URL分类器,对搜索引擎返回的多语义名词的搜索结果(包括网页URL及摘要)进行分类,得到多语义名词的初始语义分类结果。对初始语义分类结果按其网页摘要聚类,提取聚类特征后得到多语义词的语义标注结果。该算法利用基于URL的网页分类方法,能在线对中文多语义名词进行语义标注。实验结果证明,该语义标注算法可以取得70%的准确率及80%的召回率,适用于网络热词语义标注。 相似文献
4.
5.
图像自动标注的实质是通过对图像视觉特征的分析来提取高层语义关键词用于表示图像的含义,从而使得现有图像检索问题转化为技术已经相当成熟的文本检索问题,在一定程度上解决了基于内容图像检索中存在的语义鸿沟问题.采用t混合模型在已标注好的训练图像集上计算图像区域类与关键字的联合概率分布,在此基础上,对未曾观察过的测试图像集,利用生成的模型根据贝叶斯最小错误概率准则实现自动图像标注.实验结果表明,该方法能有效改善标注结果. 相似文献
6.
由于现有的Web日志缺少明显语义,提出一种语义Web日志模型--SWLM,并给出基于该模型的网页和用户聚类算法.通过日志概念的语义距离定量计算来聚类网页和用户,奠定了Web个性化服务的基础.性能测试实验证明,该模型具有较好的整体性能,能有效地进行网页和用户聚类. 相似文献
7.
提出一个基于视觉本体的视频语义标注算法。该算法利用贝叶斯统计学习和决策理论,通过计算视频关键帧的主要区域与视觉本体中概念的视觉相似性.动态地实现对视频对象的半自动语义标注。实验结果表明,利用该算法进行语义标注效果良好.并具有稳定的性能。 相似文献
8.
《计算机工程与应用》2017,(23)
针对有效利用图像底层视觉特征和图像语义特征进行图像标注,提出一种改进的AP(Affinity Propagation)聚类标注模型。首先采用半监督距离测度学习算法,融合图像语义信息,训练得到新的距离测度。然后使用新的距离测度对每一类图像进行AP聚类,生成各类图像的聚类中心,计算待标注图像到各类图像聚类中心的平均距离,确定待标注图像类别。最后计算待标注图像到类内各个聚类中心的距离,确定待标注图像类内类别,统计该类别下图像的标注词,作为待标注图像的标注词。在Corel5K和NUS-WIDE数据集上进行了实验,经验证,该方法有效提高了标注精度。 相似文献
9.
10.
半监督聚类算法通常利用标注数据优化类别描述参数(如类的中心),然后通过类别描述参数划分无标注数据的类别,但是没有考虑标注数据对其周围无标注数据的类别划分的直接作用。文中提出一种双向选择调整策略,在根据类别描述参数对数据进行类别划分之后,利用标注数据调整其周围未标注数据的类别标签,从而提高类别划分的准确度。该方法根据标注数据周围的数据密度来动态确定数据调整范围,并采用新的相似度计算方法提高被调整的数据准确度。文中利用双向选择调整策略改进了基于多项式模型的半监督聚类算法和半监督模糊聚类算法,并使用多个标准数据集进行实验。实验结果表明改进的算法有效提高了半监督聚类的准确性。 相似文献
11.
Nie F Xu D Li X 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2012,42(1):17-27
The results of traditional clustering methods are usually unreliable as there is not any guidance from the data labels, while the class labels can be predicted more reliable by the semisupervised learning if the labels of partial data are given. In this paper, we propose an actively self-training clustering method, in which the samples are actively selected as training set to minimize an estimated Bayes error, and then explore semisupervised learning to perform clustering. Traditional graph-based semisupervised learning methods are not convenient to estimate the Bayes error; we develop a specific regularization framework on graph to perform semisupervised learning, in which the Bayes error can be effectively estimated. In addition, the proposed clustering algorithm can be readily applied in a semisupervised setting with partial class labels. Experimental results on toy data and real-world data sets demonstrate the effectiveness of the proposed clustering method on the unsupervised and the semisupervised setting. It is worthy noting that the proposed clustering method is free of initialization, while traditional clustering methods are usually dependent on initialization. 相似文献
12.
13.
Flora S. Tsai 《Expert systems with applications》2011,38(5):5330-5335
Blog mining addresses the problem of mining information from blog data. Although mining blogs may share many similarities to Web and text documents, existing techniques need to be reevaluated and adapted for the multidimensional representation of blog data, which exhibit dimensions not present in traditional documents, such as tags. Blog tags are semantic annotations in blogs which can be valuable sources of additional labels for the myriad of blog documents. In this paper, we present a tag-topic model for blog mining, which is based on the Author-Topic model and Latent Dirichlet Allocation. The tag-topic model determines the most likely tags and words for a given topic in a collection of blog posts. The model has been successfully implemented and evaluated on real-world blog data. 相似文献
14.
We develop a new algorithm for clustering search results. Differently from many other clustering systems that have been recently proposed as a post-processing step for Web search engines, our system is not based on phrase analysis inside snippets, but instead uses latent semantic indexing on the whole document content. A main contribution of the paper is a novel strategy – called dynamic SVD clustering – to discover the optimal number of singular values to be used for clustering purposes. Moreover, the algorithm is such that the SVD computation step has in practice good performance, which makes it feasible to perform clustering when term vectors are available. We show that the algorithm has very good classification performance, and that it can be effectively used to cluster results of a search engine to make them easier to browse by users. The algorithm has being integrated into the Noodles search engine, a tool for searching and clustering Web and desktop documents. 相似文献
15.
With the rapid growth of text documents, document clustering technique is emerging for efficient document retrieval and better
document browsing. Recently, some methods had been proposed to resolve the problems of high dimensionality, scalability, accuracy,
and meaningful cluster labels by using frequent itemsets derived from association rule mining for clustering documents. In
order to improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document
Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy
generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents.
We have conducted experiments to evaluate our approach on Classic4, Re0, R8, and WebKB datasets. Our experimental results
show that our proposed approach indeed provide more accurate clustering results than prior influential clustering methods
presented in recent literature. 相似文献
16.
提出了一种没有训练集情况下实现对未标注类别文本文档进行分类的问题。类关联词是与类主体相关、能反映类主体的单词或短语。利用类关联词提供的先验信息,形成文档分类的先验概率,然后组合利用朴素贝叶斯分类器和EM迭代算法,在半监督学习过程中加入分类约束条件,用类关联词来监督构造一个分类器,实现了对完全未标注类别文档的分类。实验结果证明,此方法能够以较高的准确率实现没有训练集情况下的文本分类问题,在类关联词约束下的分类准确率要高于没有约束情况下的分类准确率。 相似文献
17.
18.
一种基于容错粗糙集的Web搜索结果聚类方法 总被引:1,自引:0,他引:1
一些Web聚类方法把类严格作为互斥的关系,聚类效果不理想.一种基于容错粗糙集的k均值的聚类解决了这一问题.首先运用向量模型表示Web文档信息,采用常规方法得到文本特征词集,然后利用某些特征词协同出现的价值,构造特征词客错关系,扩充特征词的描述能力,最后用特征词容错类描述文档之间的相似关系,实现了Web搜索结果聚类,并提出了简单直观的衡量聚类精度的T模型.实验结果表明,利用容错关系聚类的类标记描述性强、容易理解、明显优于普通k均值算法. 相似文献
19.
Probabilistic latent semantic analysis (PLSA) is a double structure mixture model which has got a wide application in text and web mining. This method is capable of establishing hidden semantic relations among the observed features, using a number of latent variables. In this approach, the selection of the correct number of latent variables is critical. In the most of the previous researches, the number of latent topics was selected based on the number of invoked classes. This paper presents a method, based on backward elimination approach, which is capable of unsupervised order selection in PLSA. This method starts with a model having a number of components more than the needed value, and then prunes the mixtures to reach their optimum size. During the elimination process, proper selection of some latent variables which must be deleted is the most essential problem, and its relation to the final performance of the pruned model is straightforward. To treat this problem, we introduce a new combined pruning method which selects the best options for removal, while keeping a low computational cost, at all. We conducted some experiments on two datasets from Reuters-21578 corpus. The obtained results show that this algorithm leads to an optimized number of latent variables and in turn achieves better clustering performance compared to the conventional model selection methods. It also shows superiority over the case in which a PLSA model with a fixed number of latent variables, equal to the real number of clusters, is exploited. 相似文献