1.
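Word segmentation for specialized domains in Chinese natural language processing is far harder than for the general domain. Segmentation ambiguity in specialized domains, in particular, has long lacked an effective solution. To address this problem, a segmentation disambiguation method for specialized domains based on unsupervised learning is proposed. It takes string frequency, mutual information, and boundary entropy computed from the test corpus itself as the criteria for evaluating segmentation ambiguity, and applies these three kinds of information both independently and in combination to resolve it. Experimental results show that the method effectively resolves segmentation ambiguity in specialized domains and clearly improves segmentation performance.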
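The three statistics named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the probability normalization, and the rule in `resolve_overlap` (prefer the split whose first pair has the higher mutual information) are all assumptions made for the example.

```python
import math
from collections import Counter

def ngram_counts(text, max_n):
    """Count every substring of length 1..max_n (string frequency)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def pmi(counts, total_chars, left, right):
    """Pointwise mutual information of two adjacent strings.
    Probabilities are approximated as count / total_chars (an assumption)."""
    p_joint = counts[left + right] / total_chars
    p_left = counts[left] / total_chars
    p_right = counts[right] / total_chars
    if p_joint == 0 or p_left == 0 or p_right == 0:
        return float("-inf")
    return math.log2(p_joint / (p_left * p_right))

def right_boundary_entropy(text, word):
    """Entropy of the characters that directly follow `word` in `text`."""
    followers = Counter()
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        if end < len(text):
            followers[text[end]] += 1
        start = text.find(word, start + 1)
    total = sum(followers.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in followers.values())

def resolve_overlap(counts, total_chars, a, b, c):
    """Hypothetical rule for an overlapping ambiguity a+b+c:
    choose ab|c when PMI(a, b) >= PMI(b, c), otherwise a|bc."""
    if pmi(counts, total_chars, a, b) >= pmi(counts, total_chars, b, c):
        return (a + b, c)
    return (a, b + c)
```

The counts come from the text being segmented, which matches the abstract's point that no external training corpus is needed. `max_n` must be at least the length of the longest candidate string passed to `pmi`.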
2.
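Taking online food safety information as the object of study, this work aims to propose a disambiguation algorithm that resolves unclear references of proper nouns in the food safety domain. The algorithm builds on an improved TF-IDF feature selection algorithm and combines it with a hidden Markov model (HMM) and an SVM classifier to disambiguate proper nouns. A feature extraction algorithm, LN-TF-IDF, is proposed, which adds two weighting factors to TF-IDF. With the F1 score (the harmonic mean of precision and recall) measured on 202,831 texts as the evaluation criterion, experiments show that the disambiguation algorithm based on the improved TF-IDF outperforms the one based on traditional TF-IDF by 7.31% on average, and that its performance remains relatively stable across experimental datasets crawled at different times.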
3.
4.
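An Adaptive Chinese Word Segmentation System for Information Retrieval   Total citations: 16, self-citations: 0, citations by others: 16
New word identification and ambiguity resolution are important factors affecting the accuracy of information retrieval systems. An adaptive Chinese word segmentation algorithm for information retrieval, based on a statistical model, is proposed. Based on this algorithm, a brand-new segmentation system, BUAASEISEG, was designed and implemented. It can identify new words of all kinds in any domain, resolve ambiguity, and segment words of any reasonable length. It uses an iterative binary segmentation method, gathers online word-frequency statistics on the target document, and uses an offline word-frequency dictionary or a search engine's inverted index to filter candidate words and resolve ambiguity. On top of the statistical model, post-processing with a surname list, a measure-word list, and a stop-word list further improves accuracy. A comparative evaluation against the well-known ICTCLAS segmentation system on news and academic papers shows that BUAASEISEG has a clear advantage in new word identification and ambiguity resolution.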
5.
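Combinational ambiguity resolution is one of the key problems in word segmentation and directly affects segmentation accuracy. To address the effect of Vietnamese combinational ambiguity on segmentation, an ensemble-learning-based method for resolving Vietnamese combinational ambiguity is proposed, drawing on the characteristics of Vietnamese compound words. First, Vietnamese combinational ambiguous words are selected manually to build a library of Vietnamese combinational ambiguity fields, and the Vietnamese corpus is matched against a Vietnamese compound-word dictionary to extract combinational ambiguity fields. Second, three classifiers incorporating Vietnamese word-frequency features and context information are used to build three classifier-based disambiguation models, yielding three sets of disambiguation results. Finally, a weight is computed for each classifier, and the final classification of each Vietnamese combinational ambiguity is made by thresholding. Experiments show that the proposed method reaches an accuracy of 83.32%, which is 5.81% higher than the best single classifier.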
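The final step described here (per-classifier weights followed by a threshold) can be sketched as a weighted vote. The abstract does not say which classifiers are used or how the weights are computed, so the accuracy-proportional weights and the 0.5 threshold below are assumptions for illustration only.

```python
def accuracy_weights(accuracies):
    """Hypothetical weighting: each classifier's weight is proportional
    to its accuracy on held-out data."""
    total = sum(accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return [a / total for a in accuracies]

def weighted_vote(predictions, weights, threshold=0.5):
    """Combine binary predictions (1 = keep the field as one compound word,
    0 = split it) into a final label by thresholding the weighted score."""
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * p for w, p in zip(weights, predictions)) / total
    return 1 if score >= threshold else 0
```

With weights `[0.6, 0.2, 0.2]`, a single strong classifier voting 1 outweighs two weaker classifiers voting 0; with weights `[0.4, 0.3, 0.3]` it does not.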
6.
7.
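Resolving Syntactic Structure Ambiguity Based on HNC Theory   Total citations: 3, self-citations: 0, citations by others: 3
Ambiguity resolution is the core problem faced by natural language understanding and processing. Disambiguation based on word groups and phrases cannot guarantee correct results; successful disambiguation rests on a correct understanding of the context. HNC theory adopts strategies of conceptual primitivization, hierarchization, networking, and formalization, and builds on them a system of sentence categories and sentence patterns, which offers the greatest potential for resolving ambiguity in natural language. The overall principle of HNC-based disambiguation is to take the sentence as the basis, make full use of the sentence-category knowledge provided by the sentence context, and combine macro-level and micro-level disambiguation. For the classic syntactically ambiguous structure V+NP1+的+NP2, this paper describes its threefold ambiguity and proposes three criteria and ten corollaries for resolving it.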
8.
9.
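Disambiguation Strategies for Semantic Analysis in Chinese Spoken Dialogue Systems   Total citations: 1, self-citations: 0, citations by others: 1
Frame-based semantic analysis is a semantic parsing method commonly used in current Chinese spoken dialogue systems. This paper analyzes two typical kinds of ambiguity that readily arise during semantic analysis: structural ambiguity and semantic relation ambiguity. For these two kinds of ambiguity, it proposes a structural disambiguation strategy based on a semantic PCFG model and a semantic relation disambiguation strategy based on a semantic expectation model (EM), and gives effective disambiguation algorithms. Experimental results show that, with the proposed strategies applied together, the sentence-level semantic analysis accuracy of the baseline system's understanding module rises substantially, from 75.17% to 91.15%, and the three indicators of semantic unit understanding (accuracy, recall, and precision) also improve by 10% on average.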
10.
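Named entity recognition and disambiguation are important research topics in natural language understanding. For the task of named entity recognition and disambiguation when an entity knowledge base is provided, this paper proposes a method based on multi-step clustering. First, two rounds of clustering link named entities to entity definitions in the knowledge base; next, hierarchical agglomerative clustering groups the entities that do not appear in the knowledge base; finally, common words are identified and the results are adjusted using K-Means clustering. Experiments on the CLP-2012 Chinese named entity recognition and disambiguation evaluation data show that the method performs well: its F-score on the test set reaches 86.68%, 6.46% above the best result among the participating teams.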
11.
Name ambiguity refers to the problem that different people might be referenced with an identical name. This problem has become critical in many applications, particularly in online bibliography systems such as DBLP and CiteSeer. Although much work has been conducted to address this problem, many challenges remain. In this paper, a general framework of constraint-based topic modeling is proposed, which can make use of user-defined constraints to enhance the performance of name disambiguation. A Gibbs sampling algorithm that integrates the constraints is proposed to perform inference on the topic model. Experimental results on a real-world dataset show that the proposed approach yields significant improvements.
13.
Personal name disambiguation is an important task in social network extraction, evaluation and integration of ontologies, information retrieval, cross-document coreference resolution and word sense disambiguation. We propose an unsupervised method to automatically annotate people with ambiguous names on the Web using automatically extracted keywords. Given an ambiguous personal name, first, we download text snippets for the given name from a Web search engine. We then represent each instance of the ambiguous name by a term-entity model (TEM), a model that we propose to represent the Web appearance of an individual. A TEM of a person captures named entities and attribute values that are useful to disambiguate that person from his or her namesakes (i.e., different people who share the same name). We then use group average agglomerative clustering to identify the instances of an ambiguous name that belong to the same person. Ideally, each cluster must represent a different namesake. However, in practice it is not possible to know the number of namesakes for a given ambiguous personal name in advance. To circumvent this problem, we propose a novel normalized cuts-based cluster stopping criterion to determine the different people on the Web for a given ambiguous name. Finally, we annotate each person with an ambiguous name using keywords selected from the clusters. We evaluate the proposed method on a data set of over 2500 documents covering 200 different people for 20 ambiguous names. Experimental results show that the proposed method outperforms numerous baselines and previously proposed name disambiguation methods. Moreover, the extracted keywords reduce ambiguity of a name in an information retrieval task, which underscores the usefulness of the proposed method in real-world scenarios.
14.
15.
Hsin-Tsung Peng, Cheng-Yu Lu, William Hsu, Jan-Ming Ho. Expert Systems with Applications, 2012, 39(12): 10521-10532
Members of the academic community have increasingly turned to digital libraries to search for the latest work of their peers. Given their role in the academic community, it is very important that these digital libraries collect citations in a consistent, accurate, and up-to-date manner, yet they fail to compile citations correctly for many authors for various reasons, including authors with the same name, a problem known as the “name ambiguity problem.” This problem occurs when multiple authors share the same name, particularly when names are abbreviated to the first initial and the last name. This paper proposes a reliable and accurate pair-wise similarities approach to disambiguate names using supervised classification on Web correlations and authorship correlations. The approach makes use of Web correlations among citations, assuming that citations that co-occur on publication lists on the Web should refer to the same author. It also makes use of authorship correlations, assuming that citations with the same rare author name refer to the same author and, furthermore, that citations with the same full author names or e-mail addresses likely refer to the same author. These two types of correlations are measured using pair-wise similarity metrics. In addition, a binary classifier, as part of supervised classification, is applied to label matching pairs of citations using the pair-wise similarity metrics, and these labels are then used to group citations into clusters such that each cluster represents an individual author. Results show that the approach greatly improves on the name disambiguation accuracy and performance of other proposed approaches, especially in name clusters with a high degree of ambiguity.
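The last step described here, turning pair-wise "same author" labels into clusters, can be sketched with a union-find structure. The abstract does not state how the labels are grouped, so merging by transitive closure is an assumption, and `cluster_from_pairs` is a hypothetical name.

```python
def cluster_from_pairs(n_items, matched_pairs):
    """Group citation indices 0..n_items-1 into clusters, given the pairs
    that a binary classifier labeled as referring to the same author.
    Pairs are merged transitively (an assumption for this sketch)."""
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

For example, with five citations and matches `(0, 1)`, `(1, 2)`, and `(3, 4)`, citations 0 to 2 form one author cluster and citations 3 and 4 another. Transitive merging is simple but lets a single false-positive pair join two authors, which is one reason a real system might group more conservatively.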
16.
17.
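Searching for and collecting personal information on social networks makes it possible to build an information profile covering a target's resume, life, hobbies, friends, and other attributes, but different social networks contain many users with the same name. To resolve this same-name ambiguity, the similarity between users' information is computed to judge whether two users are the same person. Because descriptive information that appears in a different order within a document can lead to misjudgment by the computer, this paper fuses the Levenshtein method with the term-frequency-related string frequency (TFRSF) method, computing both term frequency and edit distance, to judge whether attribute values are the same. Experimental results show that the proposed text similarity method improves on several evaluation metrics, with precision, recall, and F1 measure all above 87%.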
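The idea of pairing edit distance with a frequency-based measure can be sketched as below. The abstract does not define TFRSF, so this example swaps in plain cosine similarity over token counts as a stand-in; the linear fusion with weight `alpha` is likewise an assumption, not the paper's formula.

```python
import math
from collections import Counter

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance normalized to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def token_cosine(a, b):
    """Order-insensitive cosine similarity over whitespace-token counts
    (a stand-in for TFRSF, which the abstract does not specify)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
        math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def fused_similarity(a, b, alpha=0.5):
    """Hypothetical fusion: weighted average of the two similarities."""
    return alpha * edit_similarity(a, b) + (1 - alpha) * token_cosine(a, b)
```

For two attribute values that contain the same words in a different order, such as "Beijing University" and "University Beijing", the edit similarity is low while the token-based similarity is close to 1, so the fused score penalizes reordering less than edit distance alone.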
18.
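Personal name ambiguity is a phenomenon of uncertain identity: identical name strings in text refer to different real-world people. Name disambiguation has long been a challenging problem; this paper focuses on name disambiguation in web pages. Because the classic K-means algorithm converges to a local optimum when poorly chosen random initial cluster centers are used, the paper proposes an improved K-means algorithm based on the max-min principle for name disambiguation. The WePS training data is used as the experimental corpus. Experimental results show that the improved method performs better than hierarchical clustering.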
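One common reading of the max-min principle for K-means initialization is sketched below: each new center is the point farthest from its nearest already-chosen center. The abstract gives no details, so starting from the first point and using Euclidean distance are assumptions, and `max_min_centers` is a hypothetical name.

```python
import math

def max_min_centers(points, k):
    """Pick k initial cluster centers by the max-min (farthest-first) rule.
    The first center is points[0] (an assumption); each later center is the
    point whose distance to its nearest chosen center is largest."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [points[0]]
    # nearest[i] holds the distance from points[i] to its closest center.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers
```

The returned centers would then seed the usual K-means assignment and update loop. Because the choice is deterministic and spreads the centers apart, it avoids the case where several random initial centers fall inside the same cluster.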