Similar Documents (20 results found)
1.
In this research, we introduce a new approach to Arabic word sense disambiguation that uses Wikipedia as the lexical resource. The nearest sense for an ambiguous word is selected using a Vector Space Model representation, with the cosine similarity between the word's context and the senses retrieved from Wikipedia as the measure. Three experiments were conducted to evaluate the proposed approach: two use the first sentence retrieved from Wikipedia for each sense but differ in their Vector Space Model representations, while the third uses the first paragraph of the retrieved sense. The experiments show that using the first paragraph is better than using the first sentence, and that TF-IDF weighting outperforms absolute term frequency in the VSM. The approach was also tested on English words, where it gives better results with the first sentence retrieved from Wikipedia for each sense.
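As a rough illustration of the kind of sense selection this abstract describes (not the authors' implementation), the sketch below builds TF-IDF vectors for an ambiguous word's context and a set of candidate sense glosses, and picks the gloss with the highest cosine similarity; the example glosses are invented placeholders.

```python
# Minimal sketch: choose the nearest sense by cosine similarity in a TF-IDF space.
# The glosses stand in for the first sentence/paragraph retrieved from Wikipedia.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nearest_sense(context: str, sense_glosses: list[str]) -> int:
    """Return the index of the gloss most similar to the context."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([context] + sense_glosses)
    sims = cosine_similarity(matrix[0], matrix[1:])[0]
    return int(sims.argmax())

glosses = [
    "A bank is a financial institution that accepts deposits.",
    "A bank is the land alongside a river or lake.",
]
print(nearest_sense("the river overflowed its bank after the rain", glosses))  # -> 1
```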

2.
A new type of document, the "wiki page", is taking over the Internet. This is reflected not only in the growing number of web pages of this type, but also in the popularity of wiki projects (in particular, Wikipedia); the problem of parsing wiki texts is therefore becoming increasingly topical. A new method for indexing Wikipedia texts in three languages (Russian, English, and German) is proposed and implemented. The architecture of the indexing system, including the software components GATE and Lemmatizer, is considered, and the rules for converting wiki markup into natural-language text are described. Index bases are constructed for the Russian Wikipedia and the Simple English Wikipedia, and the validity of Zipf's laws is tested on both.
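A minimal sketch of the kind of Zipf's-law check mentioned in the abstract, assuming word frequencies taken from an index: fit a least-squares line to log-frequency versus log-rank; a slope near -1 is consistent with Zipf's law. The token list is a toy placeholder, not the Wikipedia index data.

```python
from collections import Counter
import math

def zipf_slope(tokens: list[str]) -> float:
    # Sort frequencies by rank, then fit log(frequency) against log(rank).
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # least-squares slope; roughly -1 for Zipf-like distributions
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)

print(zipf_slope("the cat sat on the mat and the dog sat too".split()))
```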

3.
Word relatedness computation is the foundation of semantic computing. Wikipedia, currently the largest and most rapidly updated open online encyclopedia, covers a broad range of concepts, explains them in detail, and encodes a large number of relations between concepts, providing rich background knowledge for semantic computation. However, the Chinese Wikipedia suffers from severe data sparsity, which reduces the effectiveness of Chinese word relatedness measures. To address this problem, this paper applies machine learning and proposes a new word relatedness learning algorithm based on multiple Wikipedia resources. Experimental results on three standard datasets confirm the effectiveness of the new algorithm, which improves on the best previously reported results by 20%-40%.
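The abstract does not spell out the learning setup, so the following is only a hedged sketch of the general pattern: each word pair is described by several Wikipedia-derived relatedness signals, and a regressor is fitted against human judgements. The feature names and numbers are illustrative placeholders, not the paper's actual resources.

```python
from sklearn.linear_model import Ridge

# rows: [link_similarity, category_similarity, text_similarity] per word pair (toy values)
X_train = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.6, 0.5, 0.4]]
y_train = [9.1, 1.5, 5.8]          # human relatedness scores for the same pairs

model = Ridge(alpha=1.0).fit(X_train, y_train)
print(model.predict([[0.7, 0.6, 0.5]]))   # predicted relatedness for a new pair
```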

4.
The paper addresses the problem of automatic dictionary translation. The proposed method translates a dictionary by mining repositories in the source and target languages, without any directly given relationships connecting the two languages. It consists of two stages: (1) translation by lexical similarity, where words are compared graphically, and (2) translation by semantic similarity, where contexts are compared. In the experiments, the Polish and English versions of Wikipedia were used as text corpora. The method and its phases are analyzed in detail. The results allow this method to be implemented in human-in-the-middle systems.
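A small sketch of stage (1), translation by lexical (graphical) similarity, using an edit-distance ratio to score candidate target-language words for a source word; stage (2) would then compare the words' contexts. The word lists are toy placeholders, not the Wikipedia corpora used in the paper.

```python
from difflib import SequenceMatcher

def graphical_similarity(a: str, b: str) -> float:
    # Edit-distance-style ratio in [0, 1]; higher means more graphically similar.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

source_word = "informatyka"        # Polish
candidates = ["informatics", "information", "mathematics"]
for word in candidates:
    print(word, round(graphical_similarity(source_word, word), 3))
```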

5.
Semantic relatedness computation based on Wikipedia
刘军, 姚天昉. 《计算机工程》 (Computer Engineering), 2010, 36(19): 42-43
To compute semantic relatedness for domain-specific knowledge in opinion mining, a Wikipedia-based semantic relatedness method is proposed. On top of a constructed Wikipedia category tree, each Wikipedia term is represented as a Wikipedia category vector, yielding a Wikipedia dictionary that covers knowledge from a variety of domains; this dictionary is then used to compute semantic relatedness. Experimental results show that the method reaches a Spearman rank correlation coefficient of 0.77.
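A hedged sketch of the evaluation idea implied by the abstract: words represented as vectors over Wikipedia categories, relatedness computed as the cosine of those vectors, and the method scored against human judgements with Spearman's rank correlation. The category assignments and gold scores below are invented placeholders.

```python
import math
from scipy.stats import spearmanr

def cosine(u: dict, v: dict) -> float:
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

# toy category vectors standing in for Wikipedia category memberships
category_vectors = {
    "tiger": {"Felines": 1, "Predators": 1, "Asia": 1},
    "cat":   {"Felines": 1, "Pets": 1},
    "car":   {"Vehicles": 1, "Engineering": 1},
}
pairs = [("tiger", "cat"), ("tiger", "car"), ("cat", "car")]
predicted = [cosine(category_vectors[a], category_vectors[b]) for a, b in pairs]
gold = [7.35, 4.77, 3.5]   # hypothetical human relatedness judgements
print(spearmanr(predicted, gold).correlation)
```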

6.
Computing the semantic relatedness of natural language words requires large amounts of background knowledge. Wikipedia, the largest encyclopedia available today, is not only an enormous corpus but also a knowledge base containing a wealth of human background knowledge and semantic relations, and research has shown it to be an ideal resource for semantic computation. This paper proposes an algorithm that combines Wikipedia's link structure and category system to compute the semantic relatedness of Chinese words. The algorithm uses only the link structure and category system, requires no complex text processing, and has low computational cost. Experimental results on several human-annotated datasets show that it outperforms algorithms that use the link structure or the category system alone, improving the Spearman correlation coefficient by 30.96% in the best case.
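The abstract does not give the exact formula, so the sketch below shows one common link-based relatedness measure of the kind such algorithms build on: a Normalized-Google-Distance-style score over the sets of articles that link to each concept. The paper additionally combines this kind of signal with the category system; the inlink sets here are toy data.

```python
import math

def link_relatedness(inlinks_a: set, inlinks_b: set, total_articles: int) -> float:
    # Relatedness derived from the overlap of the two concepts' inlink sets.
    overlap = len(inlinks_a & inlinks_b)
    if overlap == 0:
        return 0.0
    a, b = len(inlinks_a), len(inlinks_b)
    ngd = (math.log(max(a, b)) - math.log(overlap)) / (math.log(total_articles) - math.log(min(a, b)))
    return max(0.0, 1.0 - ngd)

print(link_relatedness({"p1", "p2", "p3"}, {"p2", "p3", "p4"}, total_articles=1_000_000))
```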

7.
Most existing aspect sentiment triplet extraction methods do not adequately consider syntactic structure and semantic relatedness. This paper proposes an aspect sentiment triplet extraction model that combines syntactic structure and semantic information. First, a dependency parser is used to obtain the probability matrix of all dependency arcs, from which a syntax graph is built to capture rich syntactic structure. Second, a self-attention mechanism is used to build a semantic graph that represents the semantic relatedness between words, reducing the interference of noise words. Finally, a mutual biaffine transformation layer is designed so that the model can better exchange features between the syntax graph and the semantic graph, improving triplet extraction performance. Experiments on several public datasets show that, compared with existing sentiment triplet extraction models, precision (P), recall (R), and F1 all improve overall, confirming the effectiveness of combining syntactic structure and semantic information for aspect sentiment triplet extraction.
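A hedged PyTorch sketch of what a mutual biaffine interaction between the two sentence views could look like: syntax-graph and semantic-graph node states attend to each other through learned bilinear maps. The dimensions and exact wiring are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MutualBiaffine(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.W1 = nn.Parameter(torch.randn(dim, dim) * 0.01)
        self.W2 = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, syn: torch.Tensor, sem: torch.Tensor):
        # syn, sem: (batch, seq_len, dim) node states from the syntax and semantic graphs
        att_s2m = torch.softmax(syn @ self.W1 @ sem.transpose(1, 2), dim=-1)
        att_m2s = torch.softmax(sem @ self.W2 @ syn.transpose(1, 2), dim=-1)
        return att_s2m @ sem, att_m2s @ syn   # each view enriched with features from the other

syn = torch.randn(2, 5, 64)
sem = torch.randn(2, 5, 64)
out_syn, out_sem = MutualBiaffine(64)(syn, sem)
print(out_syn.shape, out_sem.shape)   # torch.Size([2, 5, 64]) twice
```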

8.
With the development of mobile technology, users' browsing habits have gradually shifted from pure information retrieval to active recommendation. Mapping users' interests to web content for classification has become more and more difficult as the volume and variety of web pages grow. Large news portals and social media companies hire more editors to label new concepts and words, and use computing servers with larger memory to handle massive document classification based on traditional supervised or semi-supervised machine learning methods. This paper provides an optimized classification algorithm for massive web page classification using semantic networks such as Wikipedia and WordNet. We use the Wikipedia data set and initialize a few category entity words as class words. A weight estimation algorithm based on the depth and breadth of the Wikipedia network is used to calculate the class weight of all Wikipedia entity words. A kinship-relation association based on the content similarity of entities is then proposed to mitigate the imbalance that arises when a category node inherits probability from multiple parents. Keywords are extracted from the title and main text of each web page using N-grams matched against Wikipedia entity words, and a Bayesian classifier is used to estimate the page's class probability. Experimental results show that the proposed method achieves good scalability, robustness, and reliability on massive web pages.
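A minimal sketch, under assumed inputs, of the final step described above: page keywords matched against a Wikipedia-derived entity vocabulary are fed to a Bayesian classifier that estimates the page's class. The vocabulary, training pages, and labels are toy placeholders and omit the paper's entity-word weighting scheme.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical Wikipedia-derived entity vocabulary and toy labelled pages
entity_vocabulary = ["football", "league", "election", "parliament", "goal", "vote"]
train_pages = ["football league goal goal", "election parliament vote", "goal league football"]
train_labels = ["sports", "politics", "sports"]

clf = make_pipeline(CountVectorizer(vocabulary=entity_vocabulary), MultinomialNB())
clf.fit(train_pages, train_labels)
print(clf.predict(["parliament vote vote"]))          # -> ['politics']
print(clf.predict_proba(["parliament vote vote"]))    # class probabilities
```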

9.
Wikipedia has become one of the largest online repositories of encyclopedic knowledge. Wikipedia editions are available for more than 200 languages, with entries varying from a few pages to more than 1 million articles per language. Embedded in each Wikipedia article is an abundance of links connecting the most important words or phrases in the text to other pages, thereby letting users quickly access additional information. An automatic text-annotation system combines keyword extraction and word-sense disambiguation to identify relevant links to Wikipedia pages.

10.
The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English, and German Wikipedia articles with types in the DBpedia namespace. The types are extracted from the first sentences of Wikipedia articles using Hearst pattern matching over part-of-speech annotated text and are disambiguated to DBpedia concepts. The dataset covers 1.3 million RDF type triples from the English Wikipedia, out of which 1 million were found not to overlap with DBpedia, and 0.4 million with YAGO2s. About 770 thousand German and 650 thousand Dutch Wikipedia entities are assigned a novel type, which exceeds the number of entities in the localized DBpedia for the respective language. RDF type triples from the German dataset have been incorporated into the German DBpedia. Quality assessment was performed on the basis of 16,500 human ratings and annotations in total. The average accuracy is 0.86 for the English dataset, 0.77 for German, and 0.88 for Dutch; the accuracy of the raw plain-text hypernyms exceeds 0.90 for all languages. The LHD release described and evaluated in this article targets DBpedia 3.8; an LHD version for DBpedia 3.9, containing approximately 4.5 million RDF type triples, is also available.
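A rough stand-in for the first-sentence hypernym extraction described above. The real LHD pipeline matches Hearst patterns over POS-annotated text and then disambiguates the hypernym to a DBpedia concept; the plain-text regex below, and its crude last-token head guess, are only illustrative.

```python
import re

# naive "<entity> is a/an <noun phrase>" pattern; stops at punctuation or a few clause markers
PATTERN = re.compile(r"\bis an? ([^,.]+?)(?:,|\.| that | which | created )")

def plain_hypernym(first_sentence: str) -> str | None:
    match = PATTERN.search(first_sentence)
    if not match:
        return None
    return match.group(1).split()[-1]   # crude head-noun guess: last token of the matched phrase

print(plain_hypernym("Python is a high-level programming language created by Guido van Rossum."))
# -> language
```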

11.
Word identification in source-language analysis for Chinese-English machine translation
The first problem encountered in source-language analysis for Chinese-English machine translation is word identification. There is no clear definition of a "word" in Chinese, and there are no clear boundaries between morphemes and words, words and phrases, or phrases and sentences. Under the approach of segmenting words first and parsing afterwards, word-formation issues and syntactic issues become intertwined during segmentation. The authors argue that the character can serve as the starting point of source-language syntactic analysis, so that the identification of words and phrases proceeds simultaneously with parsing. This paper describes this view and its implementation, and illustrates the basic identification method using separable words (离合词) as an example.

12.
In this work we study how people navigate the information network of Wikipedia, investigating (i) free-form navigation, by studying all clicks within the English Wikipedia over an entire month, and (ii) goal-directed navigation, by analyzing wikigames, in which users are challenged to reach target articles by following links. To study how the organization of Wikipedia articles in terms of layout and links affects navigation behavior, we first investigate the characteristics of the structural organization and of hyperlinks in Wikipedia, and then evaluate link selection models based on article structure and other potential influences on navigation, such as the generality of an article's topic. In free-form Wikipedia navigation, covering all Wikipedia usage scenarios, we find that click choices are best modeled by a bias towards article structure, such as a tendency to click links located in the lead section. For the goal-directed navigation of wikigames, our findings confirm the zoom-out and homing-in phases identified by previous work, in which users are guided by generality at first and by textual similarity to the target later. However, our interpretation of the link selection models emphasizes that article structure is the best explanation for the navigation paths in all but these initial and final stages. Overall, we find evidence that users more frequently click links located close to the top of an article. The structure of Wikipedia articles, which places links to more general concepts near the top, supports navigation by allowing users to quickly find the better-connected articles that facilitate navigation. Our results highlight the importance of article structure and link position in Wikipedia navigation and suggest that better organization of information can help make information networks more navigable.

13.
This paper proposes a method for extracting aligned sentences from Wikipedia's comparable corpora. After obtaining and processing the Chinese and English Wikipedia database dumps, a local Wikipedia corpus database is reconstructed. On this basis, lexical statistics are collected, a named entity dictionary is built, and bilingual comparable texts are obtained through Wikipedia's own alignment mechanism. The characteristics of the Wikipedia corpus are then analyzed during annotation, guiding the design of a set of features and a three-class scheme of "aligned", "partially aligned", and "not aligned". Finally, an SVM classifier is used for sentence alignment experiments on the Wikipedia corpus and on a third-party parallel corpus. Experiments show that the classifier reaches a classification accuracy of 82% for aligned sentences on relatively well-formed comparable corpora and 92% on the parallel corpus, demonstrating that the method is feasible and effective.
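A hedged sketch of the final classification step: each candidate sentence pair is turned into a small feature vector and an SVM assigns one of the three labels. The features shown (length ratio, translation overlap, shared named entities) are common stand-ins rather than the paper's actual feature set, and the numbers are toy data.

```python
from sklearn.svm import SVC

# features per candidate pair: [length_ratio, translation_overlap, shared_named_entities]
X_train = [
    [0.95, 0.80, 3], [1.02, 0.75, 2],    # aligned
    [0.60, 0.40, 1], [1.50, 0.35, 1],    # partially aligned
    [0.30, 0.05, 0], [2.10, 0.02, 0],    # not aligned
]
y_train = ["aligned", "aligned", "partially aligned", "partially aligned",
           "not aligned", "not aligned"]

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict([[0.90, 0.70, 2]]))    # most similar to the 'aligned' training pairs
```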

14.
Machine translation is a speech act in a bilingual setting and the computer processing of bilingual knowledge. The performance of a machine translation system depends on how deeply and broadly its grammar describes the relevant linguistic knowledge, especially syntax and semantics (although the influence of the algorithm on the system cannot be ignored either). This paper briefly introduces the basic content of the grammar used in the 《天语》 English-Chinese machine translation system (TYECT).

15.
Field Association (FA) terms, words or phrases that serve to identify document fields, are effective in document classification, similar-file retrieval, and passage retrieval. The problem lies in the lack of an effective method to extract and select relevant FA terms to build a comprehensive FA-term dictionary. This paper presents a new method to extract, select, and rank FA terms from domain-specific corpora using part-of-speech (POS) pattern rules, corpora comparison, and modified tf-idf weighting. Experimental evaluation on 21 fields, using 306 MB of domain-specific corpora obtained from English Wikipedia dumps, selected up to 2,517 FA terms (single and compound) per field at a precision of 74-97% and a recall of 65-98%, better than traditional methods. The FA-term dictionary constructed with this method achieved an average accuracy of 97.6% in identifying the fields of 10,077 test documents collected from Wikipedia, the Reuters RCV1 corpus, and the 20 Newsgroups data set.
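A small sketch of the corpora-comparison idea behind FA-term selection: a term is scored by how much more probable it is in the domain corpus than in a general reference corpus. The paper's actual ranking also uses POS pattern rules and a modified tf-idf; the counts here are toy values.

```python
import math

def domain_specificity(term: str, domain_counts: dict, general_counts: dict) -> float:
    # Pointwise contribution of the term to the divergence between domain and general corpora.
    d_total = sum(domain_counts.values())
    g_total = sum(general_counts.values())
    p_domain = domain_counts.get(term, 0) / d_total
    p_general = (general_counts.get(term, 0) + 1) / (g_total + len(general_counts))  # smoothed
    return p_domain * math.log(p_domain / p_general) if p_domain > 0 else 0.0

domain_counts = {"penalty kick": 12, "goalkeeper": 9, "the": 300}
general_counts = {"the": 50000, "goalkeeper": 3, "penalty kick": 1}
for term in domain_counts:
    print(term, round(domain_specificity(term, domain_counts, general_counts), 4))
```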

16.
With the 2022 Beijing Winter Olympics approaching, it is necessary to build a vertical-domain knowledge graph related to the Winter Olympics, but no reasonably complete set of Winter Olympics terms is available on the web, so set expansion is needed to supplement the Winter Olympics term set. In recent years, set expansion methods have mainly been built on Word2Vec, but they perform poorly when expanding the Chinese Winter Olympics domain, where average term frequency is low. This paper proposes a bilingual Chinese-English iterative expansion method, exploiting the quantity...
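A hedged sketch of the basic Word2Vec set-expansion step that the paper starts from (and then improves with bilingual Chinese-English iteration): seed terms are combined in embedding space and their nearest neighbours are proposed as new set members. The tiny corpus and seeds are placeholders for illustration only.

```python
from gensim.models import Word2Vec

# toy tokenized corpus standing in for a large domain corpus
corpus = [
    ["biathlon", "is", "a", "winter", "sport"],
    ["curling", "is", "a", "winter", "sport"],
    ["luge", "is", "a", "winter", "sport"],
    ["tennis", "is", "a", "summer", "sport"],
]
model = Word2Vec(corpus, vector_size=32, window=3, min_count=1, seed=1, epochs=200)

seeds = ["biathlon", "curling"]
candidates = model.wv.most_similar(positive=seeds, topn=5)
print(candidates)   # nearest neighbours of the seed set; in practice filtered and re-seeded iteratively
```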

17.
Building on the basic hidden Markov model, this paper uses syntactic knowledge to improve word alignment, combining distances in English phrase-structure trees with the basic HMM. Compared with the basic HMM, this model reduces the word alignment error rate and raises the BLEU score of a statistical machine translation system, thereby improving translation quality.
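A hedged sketch of the core modification suggested by the abstract: in an HMM-based aligner, the score of a jump between target positions depends not only on surface distance but also on the distance between the corresponding nodes in the phrase-structure tree. The interpolation below and the toy distances are assumptions, not the paper's model.

```python
import math

def jump_score(surface_dist: int, tree_dist: int, lam: float = 0.5) -> float:
    # smaller distances -> larger (unnormalised) transition score
    return lam * math.exp(-abs(surface_dist)) + (1 - lam) * math.exp(-tree_dist)

prev_pos = 3
candidates = [4, 7]
tree_dists = {4: 1, 7: 4}   # hypothetical phrase-structure tree distances from word 3
scores = {j: jump_score(j - prev_pos, tree_dists[j]) for j in candidates}
total = sum(scores.values())
print({j: round(s / total, 3) for j, s in scores.items()})   # normalised jump probabilities
```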

18.
As a Web 2.0 knowledge base system characterized by openness and collaborative user editing, Wikipedia offers broad knowledge coverage, a high degree of structure, and rapid information updates. However, Wikipedia officially provides only semi-structured data files, and much of the useful structured information and data cannot be obtained and used directly. This paper therefore first extracts and organizes several kinds of structured information from these data files. It then builds an object model for the various kinds of information in Wikipedia and provides an open set of application programming interfaces, greatly reducing the difficulty of using Wikipedia information. Finally, using the information obtained from Wikipedia, it proposes a method for computing word semantic relatedness based on the categories of the topic pages that links point to.

19.
This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia, we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalism and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss various characteristics of the extracted dataset, which we make available for further study.

20.
We study the problem of entity salience by proposing the design and implementation of Swat, a system that identifies the salient Wikipedia entities occurring in an input document. Swat consists of several modules that detect and classify Wikipedia entities on the fly as salient or not, based on a large number of syntactic, semantic, and latent features extracted via a supervised process trained over millions of examples drawn from the New York Times corpus. The validation is performed through a large experimental assessment, showing that Swat improves on known solutions over all publicly available datasets. We release Swat via an API, which we describe and comment on in the paper, to ease its use in other software.
