Similar literature
20 similar documents found
1.
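Automatic Extraction and Alignment of Multiword Expressions in English-Chinese Comparable Corpora   Total citations: 3 (self-citations: 1, citations by others: 2)
Multiword expressions (MWEs) are used not only to improve the quality of current machine translation systems but also in other natural language processing areas such as cross-language retrieval and data mining. To this end, a method combining semantic templates with statistical tools is proposed for automatically extracting native English MWEs from a triplet comparable corpus. Word similarity is computed with lexicon-based and distributional methods to broaden MWE coverage. The GIZA++ alignment algorithm is used to extract the corresponding Chinese MWEs; translation probabilities are estimated statistically, and the best English-Chinese MWE translation pairs are selected according to those probabilities. Experimental results show that the method effectively improves the accuracy of MWE extraction and alignment.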
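The final selection step described above (estimate translation probabilities, then keep the most probable pair) can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the aligner's output has already been reduced to a list of (English MWE, Chinese MWE) pairs, and it uses a plain relative-frequency estimate of P(zh | en). The function name and the toy pairs are hypothetical.

```python
from collections import Counter, defaultdict


def best_translation_pairs(aligned_pairs):
    """For each English MWE, keep the Chinese candidate with the highest
    relative-frequency estimate of P(zh | en)."""
    counts = defaultdict(Counter)
    for en, zh in aligned_pairs:
        counts[en][zh] += 1

    best = {}
    for en, zh_counts in counts.items():
        total = sum(zh_counts.values())
        zh, n = max(zh_counts.items(), key=lambda item: item[1])
        best[en] = (zh, n / total)
    return best


# Hypothetical toy alignments, not data from the paper.
pairs = [
    ("kick the bucket", "去世"),
    ("kick the bucket", "去世"),
    ("kick the bucket", "踢桶"),
    ("take off", "起飞"),
]
result = best_translation_pairs(pairs)
```

A real pipeline would read the probabilities from the alignment model's lexical translation table instead of recounting pairs; the argmax step stays the same.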

2.
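Bilingual dictionaries are an important resource for natural language processing applications such as cross-language information retrieval and machine translation. Existing algorithms for extracting bilingual dictionaries from comparable corpora are not yet mature, their extraction quality leaves room for improvement, and most studies concentrate on domain-specific terminology. To address these shortcomings, an algorithm for bilingual dictionary extraction based on word vectors and comparable corpora is proposed. The paper first states the algorithm's basic assumptions and the related research methods, then sets out the concrete steps for extracting a bilingual dictionary from a comparable corpus using word vectors and an inter-word relation matrix, and finally compares the method with the classic vector space model, analysing experimentally how context window size, seed dictionary size, word frequency, and other factors affect the extraction quality of the two models. Experiments show that the algorithm clearly outperforms the vector-space-model approach, with the largest gain in accuracy for high-frequency words.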
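For orientation, the classic vector-space baseline that the abstract compares against can be sketched as below. This is not the paper's word-vector algorithm. It is a minimal illustration of the usual context-vector idea, assuming tokenized corpora and a small seed dictionary; every name and all toy data are hypothetical. The `window` argument corresponds to the context window size the abstract mentions.

```python
import math
from collections import Counter


def context_vector(corpus, target, window, seed_words):
    """Count seed words that occur within `window` tokens of `target`."""
    vec = Counter()
    for sentence in corpus:
        for i, tok in enumerate(sentence):
            if tok != target:
                continue
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i and sentence[j] in seed_words:
                    vec[sentence[j]] += 1
    return vec


def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(src_corpus, tgt_corpus, src_word, candidates, seed, window=2):
    """Rank target-language candidates for `src_word` by context similarity."""
    src_vec = context_vector(src_corpus, src_word, window, set(seed))
    # Project the source context vector into the target vocabulary
    # through the seed dictionary.
    projected = Counter()
    for word, count in src_vec.items():
        projected[seed[word]] += count
    tgt_seed = set(seed.values())
    scores = {
        cand: cosine(projected, context_vector(tgt_corpus, cand, window, tgt_seed))
        for cand in candidates
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy corpora and seed dictionary.
src_corpus = [
    ["the", "doctor", "works", "in", "the", "hospital"],
    ["a", "nurse", "helps", "the", "doctor"],
]
tgt_corpus = [
    ["医生", "在", "医院", "工作"],
    ["护士", "帮助", "医生"],
    ["司机", "开", "汽车"],
]
seed = {"hospital": "医院", "nurse": "护士", "car": "汽车"}
ranked = rank_candidates(src_corpus, tgt_corpus, "doctor", ["医生", "司机"], seed, window=4)
```

In this toy setup "医生" shares both seed contexts with "doctor" and ranks first, while "司机" shares none and scores zero.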

3.
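Beyond machine translation, parallel corpora play an important role in research areas such as information retrieval, information extraction, and knowledge acquisition. Traditional parallel corpora, however, are aligned only at the sentence level, which limits their usefulness for cross-language natural language processing research. For this reason, a high-quality Chinese-English parallel corpus for information extraction was built on the OntoNotes Chinese-English parallel corpus by combining automatic extraction, automatic mapping, and manual annotation. The corpus not only contains Chinese and English entities and the relations between them but also aligns the two languages at the entity and relation levels. It should therefore support contrastive studies of Chinese and English information extraction, reveal differences in how the two languages express meaning, and provide a valuable platform for research on cross-language information extraction.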

4.
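As basic linguistic databases and knowledge bases, corpora underpin all kinds of natural language processing methods. With the widespread use of statistical methods in natural language processing, corpus construction has become an important research topic. Automatic word segmentation is an indispensable foundation for syntactic analysis, and its performance directly affects parsing. Based on a statistical analysis of 850,000 bytes of Tibetan corpus text and a study of the distribution and grammatical functions of Tibetan words, this paper introduces the model of a dictionary-based automatic Tibetan word segmentation system and gives the structure of the segmentation dictionary, the case-marker chunking algorithm, and the restoration algorithm. The system lays a foundation for research on Tibetan input methods, Tibetan electronic dictionaries, Tibetan character and word frequency statistics, search engine design and implementation, machine translation system development, network information security, Tibetan corpus construction, and Tibetan semantic analysis.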
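To make the notion of dictionary-based segmentation concrete, here is a generic forward maximum matching sketch. It is a swapped-in textbook technique, not the paper's case-marker chunking or restoration algorithm, whose details the abstract does not give. The sketch assumes the text has already been split into syllables, represented here in Wylie transliteration; the lexicon and the function name are hypothetical.

```python
def forward_max_match(syllables, lexicon, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    syllable sequence found in the lexicon, falling back to one syllable."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy lexicon of syllable tuples (Wylie transliteration).
lexicon = {("bkra", "shis"), ("bde", "legs"), ("bkra",)}
segmented = forward_max_match(["bkra", "shis", "bde", "legs", "yin"], lexicon)
```

Unknown single syllables such as "yin" fall through as one-syllable words, which is the usual fallback for out-of-vocabulary material in this kind of segmenter.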

5.
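Research on Automatic Summarization Based on a Corpus and a Hierarchical Dictionary   Total citations: 2 (self-citations: 1, citations by others: 1)
宋今, 赵东岩. 《软件学报》 (Journal of Software), 2000, 11(3): 308-314
As an important and practical branch of natural language processing research, automatic summarization is becoming one of the key research topics in application areas such as Internet information retrieval. The corpus-based summarization proposed in this paper tries to combine the strengths of traditional summarization methods based on linguistic analysis with those of statistics-based methods. In essence, corpus-based summarization trades analysis cost outside the system for algorithmic efficiency inside it. The algorithm described implements keyword extraction based on a hierarchical dictionary and automatic summarization based on a corpus.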

6.
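Bilingual dictionaries are a basic resource for natural language processing applications such as cross-language information retrieval and machine translation. To obtain a Chinese-English dictionary from a bilingual corpus, three common co-occurrence-based models for computing phrase translation correspondence were studied; taking the logarithmic similarity model as the basis, an iterative strategy was then used to acquire the translation dictionary. Experiments show that the method effectively improves the accuracy of dictionary acquisition and the efficiency of corpus-based Chinese-English dictionary compilation.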

7.
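The paper reviews corpus classification and the history of research on extracting translation equivalents from comparable corpora. Extraction rests on one basic assumption: when a word in one language is mapped to another language, its co-occurrence and collocation relations with surrounding words are preserved. On that basis, three techniques are used to raise extraction accuracy: computing equivalent pairs in both directions and taking the intersection, computing the term weighting factor TF(iw)*IDF(i), and exploiting the part-of-speech information of context words. The experimental procedure is described and the results are briefly analysed. The results show that these techniques effectively improve the accuracy of the computed translation equivalents. Finally, questions requiring further study are raised.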
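The bidirectional step (compute candidates in both directions, then intersect) can be sketched as follows. This is a minimal illustration under the assumption that similarity scores in each direction are already available as nested dictionaries; how those scores are produced (TF*IDF weighting, part-of-speech filtering) is left out, and all names and numbers are hypothetical.

```python
def top_k(scores, k):
    """Keep the k best-scoring candidates for every entry."""
    return {
        word: [c for c, _ in sorted(cands.items(), key=lambda kv: kv[1], reverse=True)[:k]]
        for word, cands in scores.items()
    }


def bidirectional_pairs(src2tgt, tgt2src, k=1):
    """Return (source, target) pairs that appear among each other's top-k
    candidates in both translation directions."""
    forward = top_k(src2tgt, k)
    backward = top_k(tgt2src, k)
    return {
        (s, t)
        for s, targets in forward.items()
        for t in targets
        if s in backward.get(t, [])
    }


# Hypothetical similarity scores in each direction.
src2tgt = {
    "bank": {"银行": 0.8, "河岸": 0.3},
    "river": {"河岸": 0.6, "河流": 0.5},
}
tgt2src = {
    "银行": {"bank": 0.9},
    "河岸": {"bank": 0.7, "river": 0.4},
    "河流": {"river": 0.8},
}
pairs_k1 = bidirectional_pairs(src2tgt, tgt2src, k=1)
```

With `k=1` only ("bank", "银行") survives: "river" prefers "河岸", but "河岸" prefers "bank", so that pair is discarded. The intersection trades recall for precision, which matches the accuracy goal stated in the abstract.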

8.
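王志娟, 李福现. 《计算机科学》 (Computer Science), 2017, 44(Z6): 14-18, 28
Cross-language named entities matter greatly for machine translation and cross-language information extraction. The paper summarizes the state of research on extracting cross-language named entity translation pairs from three angles: transliteration of named entities, cross-language named entity alignment based on parallel or comparable corpora, and extraction of translation pairs by web mining. Transliteration is one of the core parts of the task, and transliteration models based on deep learning will be a focus of future research. At present, the acquisition and annotation of cross-language parallel or comparable corpora directly constrain deeper research on corpus-based named entity alignment. Extraction based on information retrieval and Wikipedia is expected to be the trend in this area.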

9.
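An English-Chinese clause-aligned corpus serves research on, and applications of, the correspondence between the grammatical structures of English and Chinese clauses, and is significant for linguistic theory and for translation, both human and machine. Earlier grammatical theories and related corpus work did not adequately study how to delimit clause complexes and clauses; they are theoretically deficient and can hardly support natural language processing applications. This paper first lays the theoretical groundwork for building such a corpus. Starting from the recently proposed theory of Chinese clause complexes, it defines the concept of component sharing and delimits English clauses and clause complexes on the basis of naming sharing and quotation sharing, so that clauses and clause complexes are functionally complete and unitary. On this basis, it designs an annotation scheme for English-Chinese clause alignment, comprising English NT-clause annotation and the generation and combination of Chinese translations. Annotation of the corpus shows that, at the clause-complex level, the components involved in the structural transformations of English-Chinese translation can be restricted to English clauses, namings, and tellings, without touching the internal structure of namings and tellings. The resulting corpus provides structured annotated samples for linguistic research, English-Chinese contrastive studies, and applications such as English-Chinese machine translation.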

10.
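Product review text is an important object of study in sentiment analysis. Most existing product review corpora are coarse and do not fully annotate the three elements (target, attribute, polarity), which limits the scenarios where automatic analysis can be applied. The paper therefore builds a fine-grained review corpus of 9,343 short car-review sentences. The specific words for the three elements are annotated manually and also mapped onto ontology trees of products and attributes. In addition, implicit expressions without sentiment words and special texts (such as suggestions and comparative sentences) are annotated with the corresponding triples and given special labels. Corpus statistics show that target and attribute co-occur in 77.54% of cases, confirming the need for a scheme that annotates all three elements; an automatic three-element annotation experiment on the corpus reaches an F1 of 70.82%, confirming the computability of the fine-grained annotation scheme as well as the consistency and practical value of the corpus. The corpus can serve as base data for fine-grained sentiment analysis research.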

11.
Most research on the automatic assessment of free-text answers written by students addresses English. This paper handles the assessment task in Arabic. The research focuses on applying multiple similarity measures separately and in combination. Many aspects are introduced that depend on translation to overcome the lack of text processing resources for Arabic, such as extracting model answers automatically from an already built database and applying K-means clustering to scale the obtained similarity values. Additionally, this research presents the first benchmark Arabic data set, which contains 610 students' short answers together with their English translations.
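One plausible reading of "applying K-means clustering to scale the obtained similarity values" is that raw similarity scores are clustered and each cluster is mapped to a discrete mark. The sketch below follows that reading; it is an assumption, not the paper's documented procedure. It is a small one-dimensional K-means with deterministic initialization that needs `k >= 2`, and the scores are made up for illustration.

```python
def kmeans_1d(values, k, iters=100):
    """Plain 1-D K-means (k >= 2) with deterministic, spread-out initialization."""
    ordered = sorted(values)
    centroids = [ordered[(len(ordered) - 1) * i // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Empty clusters keep their previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def to_grade(value, centroids):
    """Map a similarity value to the rank (0 = lowest) of its nearest centroid."""
    ordered = sorted(centroids)
    nearest = min(ordered, key=lambda c: abs(value - c))
    return ordered.index(nearest)


# Hypothetical similarity scores between student answers and a model answer.
sims = [0.05, 0.10, 0.12, 0.48, 0.52, 0.55, 0.90, 0.93, 0.97]
centroids = kmeans_1d(sims, k=3)
grades = [to_grade(s, centroids) for s in sims]
```

Here the nine scores fall into three groups, so each answer receives a grade of 0, 1, or 2; the number of clusters would be set to the number of marks on the grading scale.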

12.
This essay discusses the claims made by Hofstadter (1997) that machine translation research, as a result of its commercial aims, is based on a ‘pure content’ view of language, which is unrealistic even for specialized domains. The paper argues that this simplified view of language and translation cannot be blamed solely on industrial research; it is a tenet of basic MT research as well, and one that MT shares with much of AI. The question of whether there is a fundamental difference between narrow-domain and general translation is discussed in the light of Melby's (1995) notion of a ‘wall’ between terms and general vocabulary.

13.
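An Automatic Translation Quality Evaluation Method for Spoken Language   Total citations: 1 (self-citations: 1, citations by others: 0)
Automatic evaluation of translation quality is of great importance to machine translation research. Existing methods, however, mainly target written-language translation and do not take the characteristics of spoken-language translation into account. This paper therefore proposes a new automatic evaluation method for spoken language. By defining information segments, annotating weights, and designing several matching strategies, it brings automatic evaluation results closer to human scores and makes the evaluation process more adaptable to different output translations. Experiments show that the algorithm is highly sensitive to changes in translation quality and produces evaluations close to manual judgments.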

14.
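An automatic scoring algorithm for Chinese essays is designed from the perspective of deep language perception. It abstracts a computational model of language-sense features to simulate human criteria for evaluating natural language, remedying the shortcomings of the mechanical, statistics-driven natural language processing techniques used in early automatic essay scoring. An Aho-Corasick (AC) automaton is used to quickly analyse the supporting elements of language sense, namely the essay author's personal language material. Techniques such as word segmentation and backbone extraction are used to produce sentence-level evaluations such as sentence fluency, and the contextual structure of the essay under assessment is compared for similarity with a standard essay framework, so that the scoring system builds a comprehensive evaluation of the author's ability to use language. Experimental results show that the algorithm makes the language-ability assessment in automatic scoring more reasonable and brings it closer to expert-calibrated human scoring samples.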
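The AC automaton mentioned above finds every occurrence of many phrases in a single pass over the text, which is why it suits scanning an essay for a large list of stock expressions. The sketch below is a standard textbook Aho-Corasick implementation, not the paper's code; the phrase list and the sample text are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build the goto, failure, and output tables of an Aho-Corasick automaton."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)

    # Breadth-first traversal: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every match, in order of match end."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits


# Hypothetical phrase list standing in for an author's language material.
automaton = build_automaton(["春风", "风雨", "春风化雨"])
hits = find_all("春风化雨,风雨同舟", automaton)
```

Matching takes time proportional to the text length plus the number of matches, regardless of how many phrases are in the list.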

15.

Over the years, different approaches to identifying temporal and spatial conflicts in hypermedia applications have been proposed. Most of them are based on formal verification techniques and require designers to follow a formal model or language to ensure the application's functional correctness. Furthermore, the error diagnosis is hard for a non-specialist in this domain to interpret. In this paper, we present an approach that supports formal verification for documents written in markup languages. We propose a method and build a verification toolchain that helps designers verify temporal and spatial constraints in hypermedia applications. The input language is the designer's language. Its translation into the input of the toolchain is automatic and transparent to the application designer. The error scenarios provided by the verification tool are presented as a timeline that the designer can easily understand. The method and toolchain support different markup languages, which are translated into the same intermediate language in order to facilitate the use of different verification tools in the same environment.


16.
This paper describes a domain-limited system for speech understanding as well as for speech translation. An integrated semantic decoder directly converts the preprocessed speech signal into its semantic representation by a maximum a posteriori classification. By combining probabilistic knowledge on the acoustic, phonetic, syntactic, and semantic levels, the semantic decoder extracts the most probable meaning of the utterance. No separate speech recognition stage is needed because the Viterbi algorithm (calculating acoustic probabilities with hidden Markov models) is integrated with a probabilistic chart parser (calculating semantic and syntactic probabilities with special models). The semantic structure is introduced as a representation of an utterance's meaning. It can be used as an intermediate level for a succeeding intention decoder (within a speech understanding system for controlling a running application by spoken input) as well as an interlingua level for a succeeding language production unit (within an automatic speech translation system for creating spoken output in another language). Following these principles and using the respective algorithms, speech understanding and speech translation front-ends were successfully realised for the domains ‘graphic editor’, ‘service robot’, ‘medical image visualisation’ and ‘scheduling dialogues’.

17.
This paper outlines the first Asian network-based speech-to-speech translation system developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages, i.e., eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues of developing spoken language technologies for Asian languages and discuss the difficulties involved in connecting heterogeneous spoken language translation systems through Web servers. This paper also presents speech-translation results, including a subjective evaluation, from the first A-STAR field test, which was carried out in July 2009.

18.
Plant recognition is closely related to people's lives. The traditional plant identification method is complicated to carry out, which hinders its wider adoption. The rapid development of computer image processing and pattern recognition technology makes automatic recognition of plant species from images possible, and in recent years more and more researchers have turned their attention to automatic identification based on plant images. Against this background, we survey and analyse recent image-processing-based plant identification methods. First, the significance and history of research on plant recognition technologies are introduced; second, the main technologies and steps of plant recognition are reviewed; third, more than 30 leaf features (including 16 shape features, 11 texture features, and four color features) are introduced, SVM is used to evaluate these features and their fusion features, and 8 commonly used classifiers are described in detail. Finally, the paper concludes with the shortcomings of current plant identification technologies and a prediction of future development.

19.
Text normalization is a critical step in a variety of tasks involving speech and language technologies. It is one of the vital components of natural language processing, text-to-speech synthesis, and automatic speech recognition. Convolutional neural networks (CNNs) have outperformed recurrent architectures in various application scenarios, such as neural machine translation; however, their potential for text normalization has not yet been exploited. In this paper we investigate and propose a novel CNN-based text normalization method. Training and inference times, accuracy, precision, recall, and F1-score were evaluated on an open-source dataset. The performance of the CNNs is compared with that of a variety of long short-term memory (LSTM) and Bi-LSTM architectures on the same dataset.

20.
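A Technical Review of Automatic Spoken Language Translation Systems   Total citations: 1 (self-citations: 1, citations by others: 0)
In recent years, with the development of information technology, automatic spoken language translation has become a new research hotspot. Well-known universities, research institutes, and even companies around the world have joined this high-technology competition, and China has also carried out fruitful research on the related technologies. This paper comprehensively surveys and analyses the current state of the technology in automatic spoken language translation research and examines several specific problems in depth. The authors hope that the analysis and the problems discussed here can offer a useful reference for automatic spoken language translation research in China.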
