Similar articles
1.
International crime and terrorism have drawn increasing attention in recent years. Retrieving relevant information from criminal records and suspect communications is important in combating international crime and terrorism. However, most of this information is written in languages other than English and is stored in various locations. Information sharing between countries therefore presents the challenge of cross-lingual semantic interoperability. In this work, we propose a new approach – the associate constraint network – to generate a cross-lingual concept space from a parallel corpus, and benchmark it with a previously developed technique, the Hopfield network. The associate constraint network is a constraint programming based algorithm, and the problem of generating the cross-lingual concept space is formulated as a constraint satisfaction problem. Nodes and arcs in an associate constraint network represent extracted terms from parallel corpora and their associations. Constraints are defined for the nodes in the associate constraint network, and node consistency and network satisfaction are also defined. Backmarking is developed to search for a feasible solution. Our experimental results show that the associate constraint network outperforms the Hopfield network in precision, recall and efficiency. The cross-lingual concept space that is generated with this method can assist crime analysts to determine the relevance of criminals, crimes, locations and activities in multiple languages, which is information that is not available in traditional thesauri and dictionaries.

2.
Concepts and relations in ontologies and in other knowledge organisation systems are usually annotated with natural language labels. Most ontology matchers rely on such labels in element-level matching techniques. State-of-the-art approaches, however, tend to make implicit assumptions about the language used in labels (usually English) and are either domain-agnostic or are built for a specific domain. When faced with labels in different languages, most approaches resort to general-purpose machine translation services to reduce the problem to monolingual English-only matching. We investigate a thoroughly different and highly extensible solution based on semantic matching where labels are parsed by multilingual natural language processing and then matched using language-independent and domain-aware background knowledge acting as an interlingua. The method is implemented in NuSM, the language- and domain-aware evolution of the SMATCH semantic matcher, and is evaluated against a translation-based approach. We also design and evaluate a fusion matcher that combines the outputs of the two techniques in order to boost precision or recall beyond the results produced by either technique alone.

3.
4.
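Cross-lingual document clustering organizes documents written in different languages into clusters according to their content or topic. This paper extends the monolingual Generalized Vector Space Model (GVSM) to cross-lingual document representation by means of cross-lingual word similarity computation, yielding the Cross-Lingual Generalized Vector Space Model (CLGVSM), and compares the performance of different similarity measures in document clustering. A feature selection algorithm suited to GVSM is also proposed. Experiments show that when the GVSM is built with the SOCPMI word similarity measure, cross-lingual document clustering outperforms LSA.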
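
The GVSM idea of comparing documents through a term-term similarity function, rather than through exact term overlap, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the cosine-style normalisation, the function names, the toy documents, and the hand-written similarity table (standing in for a measure such as SOCPMI) are all assumptions.

```python
import math

# Hypothetical cross-lingual term similarities; a real system would derive
# these from a word similarity measure rather than a hand-written table.
TOY_SIM = {("bank", "银行"): 0.9, ("money", "钱"): 0.8, ("river", "河"): 0.9}


def toy_term_sim(term_a, term_b):
    """Return 1.0 for identical terms, a table value for known pairs, else 0.0."""
    if term_a == term_b:
        return 1.0
    return TOY_SIM.get((term_a, term_b), TOY_SIM.get((term_b, term_a), 0.0))


def gvsm_similarity(doc_a, doc_b, term_sim):
    """Cosine-style GVSM similarity: (a^T S b) / sqrt((a^T S a) * (b^T S b))."""
    def bilinear(x, y):
        # Sum weight products over all term pairs, scaled by term similarity.
        return sum(wx * wy * term_sim(tx, ty)
                   for tx, wx in x.items()
                   for ty, wy in y.items())

    denom = math.sqrt(bilinear(doc_a, doc_a) * bilinear(doc_b, doc_b))
    return bilinear(doc_a, doc_b) / denom if denom else 0.0


# Documents are bags of weighted terms (term -> weight).
english_doc = {"bank": 2.0, "money": 1.0}
chinese_finance_doc = {"银行": 1.0, "钱": 1.0}
chinese_river_doc = {"河": 2.0}

finance_score = gvsm_similarity(english_doc, chinese_finance_doc, toy_term_sim)
river_score = gvsm_similarity(english_doc, chinese_river_doc, toy_term_sim)
```

The English document shares no surface term with either Chinese document, so a plain vector space model would score both pairs as zero; the term similarity function is what lets the finance pair score higher than the river pair.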

5.
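As demand for multilingual information on the Internet grows, cross-lingual word embeddings have become an important basic tool and have been applied successfully in natural language processing areas such as machine translation, information retrieval, and text sentiment analysis. Cross-lingual word embeddings are a natural extension of monolingual word embeddings: by mapping different languages into a shared low-dimensional vector space, cross-lingual word representations transfer knowledge between languages and thereby capture word meaning accurately in multilingual settings. Research on cross-lingual word embedding models has been productive in recent years, and researchers have proposed many methods for generating such embeddings. Through a review of the existing literature, this paper surveys the recent development of cross-lingual word embedding models, methods, and techniques. According to how the embeddings are trained, the methods are divided into three classes: supervised, unsupervised, and semi-supervised learning. The principles and representative studies of each class are summarized and compared in detail. Finally, the paper outlines the evaluation and applications of cross-lingual word embeddings and analyzes the challenges they face and future directions.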
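
To make the idea of mapping two languages into one shared space concrete, here is a toy sketch of one supervised approach: learn a rotation from a small seed dictionary, then translate by nearest neighbour. It is restricted to 2-D vectors so the least-squares rotation has a closed form; the embeddings, the seed pairs, and every function name are invented for illustration and are not taken from the survey.

```python
import math


def fit_rotation(pairs):
    """Angle of the 2-D rotation that best maps source vectors onto targets
    in the least-squares sense (closed form, valid for 2-D only)."""
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in pairs)
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in pairs)
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    x, y = vec
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))


def cosine(u, v):
    """Cosine similarity of two 2-D vectors (0.0 if either is the zero vector)."""
    norm = math.hypot(*u) * math.hypot(*v)
    return (u[0] * v[0] + u[1] * v[1]) / norm if norm else 0.0


def translate(word, src_emb, tgt_emb, theta):
    """Map a source word into the target space and return its nearest neighbour."""
    mapped = rotate(src_emb[word], theta)
    return max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))


# Toy monolingual embeddings (hypothetical values).
SRC = {"dog": (1.0, 0.0), "cat": (0.8, 0.6), "car": (0.0, 1.0)}
TGT = {"狗": (0.0, 1.0), "猫": (-0.6, 0.8), "车": (-1.0, 0.0)}

# Seed dictionary: the supervision signal. "cat" is deliberately left out.
SEED = [(SRC["dog"], TGT["狗"]), (SRC["car"], TGT["车"])]

THETA = fit_rotation(SEED)
CAT_TRANSLATION = translate("cat", SRC, TGT, THETA)
```

The word "cat" never appears in the seed dictionary, yet the learned mapping places it next to its counterpart in the target space. That transfer from seen to unseen words is the property the mapping-based methods rely on; real systems work in hundreds of dimensions and solve for a full matrix rather than a single angle.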

6.
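Objective: Because paired data between images and the target language domain are lacking, existing cross-lingual image captioning methods all work by converting a pivot (source) language into the target language. Semantic noise introduced during this conversion leaves the generated sentences insufficiently fluent and only weakly related to the visual content of the image. This paper therefore proposes a cross-lingual image captioning model that introduces semantic matching and language evaluation. Method: First, an encoder-decoder image captioning network is chosen as the baseline framework. Second, a source-domain semantic matching module is built to take into account the semantic knowledge contained in both the image and its pivot-language description, and a target-language-domain evaluation module is built to learn the language habits of the target domain. Based on these two modules, the captioning model is subjected to semantic matching constraints and language guidance: 1) the image and pivot-language semantic matching module measures the semantic consistency of the feature representations of each modality by mapping the image, the pivot-language description, and the target-language description into a common embedding space; 2) the target-language-domain evaluation module scores the generated sentences according to the style of the target language. Results: For the cross-lingual English image captioning task, the method was tested on the MS COCO (Microsoft Common Objects in Context) dataset. Compared with the better-performing methods, the proposed method on BLEU (bilingual evaluation understudy)-2, BLEU-3, BLEU-4 and METE...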

7.
Modern Web search engines still have many limitations: search terms are not disambiguated, search terms in one query cannot be in different languages, the retrieved media items have to be in the same language as the search terms and search results are not integrated across a live stream of different media channels, including TV, online news and social media. The system described in this paper enables all of this by combining a media stream processing architecture with cross-lingual and cross-modal semantic annotation, search and recommendation. All those components were developed in the xLiMe project.

8.
Ontologies are widely considered as the building blocks of the semantic web, and with them comes the data interoperability issue. As ontologies are not necessarily always labelled in the same natural language, one way to achieve semantic interoperability is by means of cross-lingual ontology mapping. Translation techniques are often used as an intermediate step to translate the conceptual labels within an ontology. This approach essentially removes the natural language barrier in the mapping environment and enables the application of monolingual ontology mapping tools. This paper shows that the key to this translation-based approach to cross-lingual ontology mapping lies with selecting appropriate ontology label translations in a given mapping context. Appropriateness of the translations in the context of cross-lingual ontology mapping differs from the ontology localisation point of view, as the former aims to generate correct mappings whereas the latter aims to adapt specifications of conceptualisations to target communities. This paper further demonstrates that the mapping outcome using the translation-based cross-lingual ontology mapping approach is conditioned on the translations selected for the intermediate label translation step. In particular, this paper presents the design, implementation and evaluation of a novel cross-lingual ontology mapping system: SOCOM++. SOCOM++ provides configurable properties that can be manipulated by a user in the process of selecting label translations in an effort to adjust the subsequent mapping outcome. It is shown through the evaluation that for the same pair of ontologies, the mappings between them can be adjusted by tuning the translations for the ontology labels. This finding has not been shown in previous research.

9.
Lexical databases following the wordnet paradigm capture information about words, word senses, and their relationships. A large number of existing tools and datasets are based on the original WordNet, so extending the landscape of resources aligned with WordNet leads to great potential for interoperability and to substantial synergies. Wordnets are being compiled for a considerable number of languages; however, most have yet to reach a comparable level of coverage. We propose a method for automatically producing such resources for new languages based on WordNet, and analyse the implications of this approach both from a linguistic perspective and by considering natural language processing tasks. Our approach takes advantage of the original WordNet in conjunction with translation dictionaries. A small set of training associations is used to learn a statistical model for predicting associations between terms and senses. The associations are represented using a variety of scores that take into account structural properties as well as semantic relatedness and corpus frequency information. Although the resulting wordnets are imperfect in terms of their quality and coverage of language-specific phenomena, we show that they constitute a cheap and suitable alternative for many applications, both for monolingual tasks and for cross-lingual interoperability. Apart from analysing the resources directly, we conducted tests on semantic relatedness assessment and cross-lingual text classification with very promising results.

10.
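Because automatic scoring of subjective questions is currently lacking in multilingual teaching, an automatic scoring system for subjective questions based on a Siamese network and the BERT model is proposed. The question text and answer text of a subjective question are passed through the BERT natural language preprocessing model to obtain sentence vectors for the text. The BERT model has been trained on large-scale corpora in many languages, so the resulting text vectors carry rich contextual semantic information and can handle multiple languages. Then...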

11.
The creation and deployment of knowledge repositories for managing, sharing, and reusing tacit knowledge within an organization has emerged as a prevalent approach in current knowledge management practices. A knowledge repository typically contains vast amounts of formal knowledge elements, which generally are available as documents. To facilitate users' navigation of documents within a knowledge repository, knowledge maps, often created by document clustering techniques, represent an appealing and promising approach. Various document clustering techniques have been proposed in the literature, but most deal with monolingual documents (i.e., written in the same language). However, as a result of increased globalization and advances in Internet technology, an organization often maintains documents in different languages in its knowledge repositories, which necessitates multilingual document clustering (MLDC) to create organizational knowledge maps. Motivated by the significance of this demand, this study designs a Latent Semantic Indexing (LSI)-based MLDC technique capable of generating knowledge maps (i.e., document clusters) from multilingual documents. The empirical evaluation results show that the proposed LSI-based MLDC technique achieves satisfactory clustering effectiveness, measured by both cluster recall and cluster precision, and is capable of maintaining a good balance between monolingual and cross-lingual clustering effectiveness when clustering a multilingual document corpus.

12.
Berkeley FrameNet is a lexico-semantic resource for English based on the theory of frame semantics. It has been exploited in a range of natural language processing applications and has inspired the development of framenets for many languages. We present a methodological approach to the extraction and generation of a computational multilingual FrameNet-based grammar and lexicon. The approach leverages FrameNet-annotated corpora to automatically extract a set of cross-lingual semantico-syntactic valence patterns. Based on data from Berkeley FrameNet and Swedish FrameNet, the proposed approach has been implemented in Grammatical Framework (GF), a categorial grammar formalism specialized for multilingual grammars. The implementation of the grammar and lexicon is supported by the design of FrameNet, providing a frame semantic abstraction layer, an interlingual semantic application programming interface (API), over the interlingual syntactic API already provided by GF Resource Grammar Library. The evaluation of the acquired grammar and lexicon shows the feasibility of the approach. Additionally, we illustrate how the FrameNet-based grammar and lexicon are exploited in two distinct multilingual controlled natural language applications. The produced resources are available under an open source license.

13.
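At present, the mature annotated datasets for aspect-level sentiment classification are all in English, and annotated Chinese datasets are scarce. To make better use of the large Chinese-language datasets that lack mature annotations, this work studies the cross-lingual aspect-level sentiment classification task. A core problem in this task is how to establish connections between texts in different languages. To address it, building on traditional monolingual sentiment classification models, a graph neural network is used to model cross-lingual word-word and word-sentence relations, thereby capturing the connections between the datasets of the two languages effectively. By constructing monolingual word-sentence links and bilingual word-sentence links, texts in different languages are connected and modeled with a graph neural network, yielding a cross-lingual neural network model that uses English datasets to make predictions on Chinese datasets. Experimental results show that, compared with other baseline models, the proposed model achieves a considerable improvement in F1 score, indicating that a model built with graph neural networks can be applied effectively to cross-lingual application scenarios.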

14.
A nonlinear semantic mapping procedure is proposed for cross-language document retrieval. The method relies on a nonlinear space reduction technique for constructing semantic embeddings of multilingual document collections. In the proposed method, an independent embedding is constructed for each language in the multilingual collection and the similarities among the resulting semantic representations are used for cross-language document retrieval. Two variants of the proposed method are implemented and compared with a standard cross-language information retrieval technique. It is shown that the proposed method outperforms the conventional one.

15.
Recent research on English word sense subjectivity has shown that the subjective aspect of an entity is a characteristic that is better delineated at the sense level, instead of the traditional word level. In this paper, we seek to explore whether senses aligned across languages exhibit this trait consistently, and if this is the case, we investigate how this property can be leveraged in an automatic fashion. We first conduct a manual annotation study to gauge whether the subjectivity trait of a sense can be robustly transferred across language boundaries. An automatic framework is then introduced that is able to predict subjectivity labeling for unseen senses using either cross-lingual or multilingual training enhanced with bootstrapping. We show that the multilingual model consistently outperforms the cross-lingual one, with an accuracy of over 73% across all iterations.

16.
We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which is able to deal with such comparable data, and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA, and show its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmarking datasets for three language pairs. The proposed model outperforms the baseline LDA model, as well as the standard BiLDA model and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.

17.
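Cross-lingual sentence semantic similarity computation aims to measure the degree of semantic similarity between sentences in different languages. In recent years, neural-network-based cross-lingual sentence similarity models have been proposed; most of them use convolutional neural networks to capture the local semantic information of a text and fail to capture semantic relations between distant words in a sentence. This paper proposes a neural network architecture that combines gated convolutional neural networks with a self-attention mechanism to capture both local and global semantic relations in cross-lingual sentences, yielding a comprehensive semantic representation of the text. Experiments on several SemEval-2017 datasets show that the proposed model captures semantic similarity between sentences from multiple aspects and outperforms the purely neural models among the baseline methods.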
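
A short sketch may help show why self-attention reaches distant words where a convolution window does not. The code below is plain scaled dot-product self-attention with the queries, keys, and values all set to the raw token vectors; the learned projection matrices are omitted, and the gated convolutional part of the paper's architecture is not modelled. The function names and the toy token vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens.

    Returns (outputs, weights); weights[i][j] is how strongly position i
    attends to position j, regardless of how far apart they are.
    """
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights.append(softmax(scores))
    # Each output is the attention-weighted average of all token vectors.
    outputs = [[sum(w * value[d] for w, value in zip(row, tokens))
                for d in range(dim)]
               for row in weights]
    return outputs, weights


# Positions 0 and 2 hold the same vector; position 1 sits between them.
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
OUTPUTS, WEIGHTS = self_attention(TOKENS)
```

In this toy input, position 0 attends to position 2 more strongly than to its immediate neighbour at position 1, because the weights depend on vector similarity and not on distance.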

18.
Symbolic connectionism in natural language disambiguation   Total citations: 1, self-citations: 0, citations by others: 1
Natural language understanding involves the simultaneous consideration of a large number of different sources of information. Traditional methods employed in language analysis have focused on developing powerful formalisms to represent syntactic or semantic structures along with rules for transforming language into these formalisms. However, they make use of only small subsets of knowledge. This article describes how to use the whole range of information through a neurosymbolic architecture which is a hybridization of a symbolic network and subsymbol vectors generated from a connectionist network. Besides initializing the symbolic network with prior knowledge, the subsymbol vectors are used to enhance the system's capability in disambiguation and provide flexibility in sentence understanding. The model captures a diversity of information including word associations, syntactic restrictions, case-role expectations, semantic rules and context. It attains highly interactive processing by representing knowledge in an associative network on which actual semantic inferences are performed. An integrated use of previously analyzed sentences in understanding is another important feature of our model. The model dynamically selects one hypothesis among multiple hypotheses. This notion is supported by three simulations which show that the degree of disambiguation depends on both the number of linguistic rules and the semantic-associative information available to support the inference processes in natural language understanding. Unlike many similar systems, our hybrid system is more sophisticated in tackling language disambiguation problems by using linguistic clues from disparate sources as well as modeling context effects into the sentence analysis. It is potentially more powerful than any system relying on one processing paradigm.

19.
HYBRID ALGORITHMS FOR THE CONSTRAINT SATISFACTION PROBLEM   Total citations: 11, self-citations: 0, citations by others: 11
It might be said that there are five basic tree search algorithms for the constraint satisfaction problem (CSP), namely, naive backtracking (BT), backjumping (BJ), conflict-directed backjumping (CBJ), backmarking (BM), and forward checking (FC). In broad terms, BT, BJ, and CBJ describe different styles of backward move (backtracking), whereas BT, BM, and FC describe different styles of forward move (labeling of variables). This paper presents an approach that allows base algorithms to be combined, giving us new hybrids. The base algorithms are described explicitly, in terms of a forward move and a backward move. It is then shown that the forward move of one algorithm may be combined with the backward move of another, giving a new hybrid. In total, four hybrids are presented: backmarking with backjumping (BMJ), backmarking with conflict-directed backjumping (BM-CBJ), forward checking with backjumping (FC-BJ), and forward checking with conflict-directed backjumping (FC-CBJ). The performances of the nine algorithms (BT, BJ, CBJ, BM, BMJ, BM-CBJ, FC, FC-BJ, FC-CBJ) are compared empirically, using 450 instances of the ZEBRA problem, and it is shown that FC-CBJ is by far the best of the algorithms examined.
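
The difference between two of the forward moves named above, naive backtracking (BT) and forward checking (FC), can be sketched on a small problem. This is an illustrative sketch only: it uses n-queens as a stand-in for the ZEBRA problem, counts search nodes where the paper may use another measure, and does not implement backjumping, backmarking, or any of the hybrids. All function names are assumptions.

```python
def consistent(row_a, col_a, row_b, col_b):
    """Two queens are compatible if they share no column and no diagonal."""
    return col_a != col_b and abs(col_a - col_b) != abs(row_a - row_b)


def backtrack(n, assignment=(), stats=None):
    """Naive backtracking (BT): check each new value against past assignments."""
    stats = stats if stats is not None else {"nodes": 0}
    row = len(assignment)
    if row == n:
        return list(assignment), stats
    for col in range(n):
        stats["nodes"] += 1
        if all(consistent(r, c, row, col) for r, c in enumerate(assignment)):
            solution, _ = backtrack(n, assignment + (col,), stats)
            if solution is not None:
                return solution, stats
    return None, stats


def forward_check(n, assignment=(), domains=None, stats=None):
    """Forward checking (FC): after each assignment, prune future domains
    and abandon the value at once if any future domain becomes empty."""
    stats = stats if stats is not None else {"nodes": 0}
    domains = domains if domains is not None else [set(range(n)) for _ in range(n)]
    row = len(assignment)
    if row == n:
        return list(assignment), stats
    for col in sorted(domains[row]):
        stats["nodes"] += 1
        pruned = list(domains)
        wiped_out = False
        for future in range(row + 1, n):
            pruned[future] = {c for c in domains[future]
                              if consistent(row, col, future, c)}
            if not pruned[future]:
                wiped_out = True
                break
        if not wiped_out:
            solution, _ = forward_check(n, assignment + (col,), pruned, stats)
            if solution is not None:
                return solution, stats
    return None, stats


BT_SOLUTION, BT_STATS = backtrack(6)
FC_SOLUTION, FC_STATS = forward_check(6)
```

Both searches try values in the same order and so return the same solution; the forward-checking version never tries a value that earlier assignments have already ruled out, so it visits no more nodes than naive backtracking.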

20.

This paper proposes a multilingual audio information management system based on semantic knowledge in complex environments. The complex environment is defined by the limited resources (financial, material, human, and audio resources); the poor quality of the audio signal taken from an internet radio channel; the multilingual context (Spanish, French, and Basque, which is under-resourced in some areas); and the regular appearance of cross-lingual elements between the three languages. In addition, the system is constrained by the requirements of the local multilingual industrial sector. We present the first evolutionary system based on a scalable architecture that is able to fulfill these specifications with automatic adaptation based on automatic semantic speech recognition, folksonomies, automatic configuration selection, machine learning, neural computing methodologies, and collaborative networks. As a result, the initial goals have been accomplished and the usability of the final application has been tested successfully, even with inexperienced users.



Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号