首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
基于双语语料的单个源语词汇和目标语多词单元的对齐   总被引:4,自引:0,他引:4  
多词单元包括固定搭配、多词习语和多词术语等。本文提供了一个基于双语口语语料库的自动对齐单个源语词汇和目标语多词单元的算法,算法一方面通过计算对应于同一个源语词汇,多个目标语词汇之间的互信息和t值的归一化差值的大小来衡量目标语多个词语之间的关联程度以提取多词单元,另一方面通过计算互信息和t值的平均值作为多词单元和单个源语词汇之间互为相互翻译的衡量程度,用局部最优、首尾禁用词过滤以及长词优先等策略很好地解决了这个问题。另外,对短语翻译词典的分级,有效地减少了高级别词典中非正确翻译项的数目,使得翻译词典具有更好的实用性。  相似文献   

2.
针对为汉藏辅助翻译系统建立汉藏多词单元翻译词典这一任务,该文提出了CMWEPM模型。该模型首先依据关联度和结合度来确定汉语语料中多词单元的边界,然后根据词对齐信息分别抽取严格和约束多词单元等价对,从而形成汉藏多词单元等价对。CMWEPM模型根据不同长度和频次对多词单元进行分类,并为不同类型设定不同阈值,最终提高了汉藏多词单元等价对的召回率,从而能够间接地提高汉藏辅助翻译系统的翻译质量。  相似文献   

3.
In this work, we present an extension of n-gram-based translation models based on factored language models (FLMs). Translation units employed in the n-gram-based approach to statistical machine translation (SMT) are based on mappings of sequences of raw words, while translation model probabilities are estimated through standard language modeling of such bilingual units. Therefore, similar to other translation model approaches (phrase-based or hierarchical), the sparseness problem of the units being modeled leads to unreliable probability estimates, even under conditions where large bilingual corpora are available. In order to tackle this problem, we extend the n-gram-based approach to SMT by tightly integrating more general word representations, such as lemmas and morphological classes, and we use the flexible framework of FLMs to apply a number of different back-off techniques. In this work, we show that FLMs can also be successfully applied to translation modeling, yielding more robust probability estimates that integrate larger bilingual contexts during the translation process.  相似文献   

4.
One of the deficiencies of mutual information is its poor capacity to measure association of words with unsymmetrical co-occurrence, which has large amounts for multi-word expression in texts. Moreover, threshold setting, which is decisive for success of practical implementation of mutual information for multi-word extraction, brings about many parameters to be predefined manually in the process of extracting multiword expressions with different number of individual words. In this paper, we propose a new method as EMICO (Enhanced Mutual Information and Collocation Optimization) to extract substantival multiword expression from text. Specifically, enhanced mutual information is proposed to measure the association of words and collocation optimization is proposed to automatically determine the number of individual words contained in a multiword expression when the multiword expression occurs in a candidate set. Our experiments showed that EMICO significantly improves the performance of substantival multiword expression extraction in comparison with a classic extraction method based on mutual information.  相似文献   

5.
本文基于语义选择与信息特征设计了英语自动化机器翻译系统。通过语义信息特征制定了机器翻译流程,以GIZA++为载体进行翻译,利用伯克利对准器对齐词语,基于反向转换语法,详细阐述汉语语言模式与英语翻译语言模式的结构关联特性,以语句动静配置,实现自动化机器翻译。最后通过系统测试,结果表明,与传统机器翻译系统相比,准确率显著提高,这就表明基于语义选择与信息特征的英语自动化机器翻译系统的翻译准确率较高,可为英汉机器翻译奠定坚实的基础支持。  相似文献   

6.
在机器译文自动评价中,匹配具有相同语义、不同表达方式的词或短语是其中一个很大的挑战。许多研究工作提出从双语平行语料或可比语料中抽取复述来增强机器译文和人工译文的匹配。然而双语平行语料或可比语料不仅构建成本高,而且对少数语言对难以大量获取。我们提出通过构建词的Markov网络,从目标语言的单语文本中抽取复述的方法,并利用该复述提高机器译文自动评价方法与人工评价方法的相关性。在WMT14 Metrics task上的实验结果表明,我们从单语文本中提取复述方法的性能与从双语平行语料中提取复述方法的性能具有很强的可比性。因此,该文提出的方法可在保证复述质量的同时,降低复述抽取的成本。
  相似文献   

7.
在以国际标准编码存储的传统蒙古文电子文本中,拼写错误十分普遍。人工校对这些错误不仅速度慢而且成本高。该文提出了一种基于统计翻译框架的传统蒙古文自动拼写校对方法,将拼写校对看作是从错误词到正确词的翻译。该文使用改进的基于短语的统计机器翻译模型来构建拼写校对模型,然后对测试文本进行校对。实验结果表明,该方法可以快速、有效地校对拼写错误,而且不依赖于特定语言的语法知识。使用该方法对包含1 026个正确词、1 102个错误词的测试集进行拼写校对,校对后文本中的正确词所占比例最高可达97.55%。  相似文献   

8.
近年来,端到端的神经机器翻译方法由于翻译准确率高,模型结构简单等优点已经成为机器翻译研究的重点,但其依然存在一个主要的缺点,该模型倾向于反复翻译某些源词,而错误地忽略掉部分词。针对这种情况,采用在端到端模型的基础上添加重构器的方法。首先利用Word2vec技术对蒙汉双语数据集进行向量化表示,然后预训练端到端的蒙汉神经机器翻译模型,最后对基于编码器-解码器重构框架的蒙汉神经机器翻译模型进行训练。将基于注意力机制的蒙汉神经机器翻译模型作为基线系统。实验结果表明,该框架显著提高了蒙汉机器翻译的充分性,比传统的基于注意力机制的蒙汉机器翻译模型具有更好的翻译效果。  相似文献   

9.
国内外基于语料库的翻译研究主要集中在翻译共性、翻译规范、译者风格和翻译培训等涉及翻译理论和翻译实践方面的研究;提出的基于三元组可比语料库的自动语言剖析技术扩大了该研究领域的内涵,使其包括面向自然语言处理的应用研究。从工程可实现性考虑,创新性地提出了建造三元组可比语料库,利用n-元词串、关键词簇和语义多词表达等自动抽取技术,通过对比中式英语表达,发掘英语本族语言模型,实现改进和发展机器翻译、跨语言信息检索等自然语言处理应用的目标。  相似文献   

10.
Machine translation is traditionally formulated as the transduction of strings of words from the source to the target language. As a result, additional lexical processing steps such as morphological analysis, transliteration, and tokenization are required to process the internal structure of words to help cope with data-sparsity issues that occur when simply dividing words according to white spaces. In this paper, we take a different approach: not dividing lexical processing and translation into two steps, but simply viewing translation as a single transduction between character strings in the source and target languages. In particular, we demonstrate that the key to achieving accuracies on a par with word-based translation in the character-based framework is the use of a many-to-many alignment strategy that can accurately capture correspondences between arbitrary substrings. We build on the alignment method proposed in Neubig et al. (Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. Portland, Oregon, pp. 632–641, 2011), improving its efficiency and accuracy with a focus on character-based translation. Using a many-to-many aligner imbued with these improvements, we demonstrate that the traditional framework of phrase-based machine translation sees large gains in accuracy over character-based translation with more naive alignment methods, and achieves comparable results to word-based translation for two distant language pairs.  相似文献   

11.
基于多策略分析的复杂长句翻译处理算法   总被引:2,自引:1,他引:2  
在实用机器翻译系统的研究开发中,复杂长句的翻译处理是其面临的一个主要难题。本文提出一种多语种通用的基于多策略分析的复杂长句翻译处理算法,该算法通过基于实例模式匹配和规则分析相结合的方法,综合利用源语言句子中多种相关的语言特征,包括语法语义特征、句子长度、标点符号、功能词以及上下文语境条件等对复杂长句进行切分简化处理和译文的复合生成。另一方面,通过对不同语种设计相同的知识表示形式,实现该算法对不同语种翻译系统的通用性。  相似文献   

12.
在神经机器翻译过程中,低频词是影响翻译模型性能的一个关键因素。由于低频词在数据集中出现次数较少,训练经常难以获得准确的低频词表示,该问题在低资源翻译中的影响更为突出。该文提出了一种低频词表示增强的低资源神经机器翻译方法。该方法的核心思想是利用单语数据上下文信息来学习低频词的概率分布,并根据该分布重新计算低频词的词嵌入,然后在所得词嵌入的基础上重新训练Transformer模型,从而有效缓解低频词表示不准确问题。该文分别在汉越和汉蒙两个语言对四个方向上分别进行实验,实验结果表明,该文提出的方法相对于基线模型均有显著的性能提升。  相似文献   

13.
EBMT系统中的多词单元翻译词典获取研究   总被引:2,自引:0,他引:2  
EBMT系统是一种基于语料库的机器翻译方法,其主要思想是通过类比原理进行翻译。如何从语料库中提取出一个实用的翻译词典进行系统的辅助翻译已经越来越多的引起关注。本文探讨了如何结合阈值和关联度提取的方法获取多词单元翻译词典,在这两种方法中,阈值提取受主观影响太大,关联值提取效率太低,都不能很好的满足翻译词典提取的要求。本文提出的算法利用阈值提取出备选多词单元,其中提出了四点规则弱化主观影响且保证全面覆盖所有多词单元,降低了阈值本身所带来的不精确度的影响,然后对计算结果进行三层过滤,进一步提高了准确率;该算法还合并了单词译成多词单元和多词单元互译两部分词典的提取,提高了工作效率。  相似文献   

14.
藏汉机器翻译技术跟汉英机器翻译技术有所不同,其中,很重要的一个方面,藏语更依赖于格助词等虚词在句子中的作用,格助词种类繁多,用法差异很大。针对藏语格助词进行分析,在藏语短语句法树库的基础上,加入了藏语本体特征的语义信息,形成融合藏语语义信息的藏汉机器翻译方法。通过对比基于短语和句法的实验分析,该方法可以很好地应用于藏汉机器翻译系统。  相似文献   

15.
神经机器翻译在语料丰富的语种上取得了良好的翻译效果,但是在汉语-越南语这类双语资源稀缺的语种上性能不佳,通过对现有小规模双语语料进行词级替换生成伪平行句对可以较好地缓解此类问题。考虑到汉越词级替换中易存在一词多译问题,该文对基于更大粒度的替换进行了研究,提出了一种基于短语替换的汉越伪平行句对生成方法。利用小规模双语语料进行短语抽取构建短语对齐表,并通过在维基百科中抽取的实体词组对其进行扩充,在对双语数据的汉语和越南语分别进行短语识别后,利用短语对齐表中与识别出的短语相似性较高的短语对进行替换,以此实现短语级的数据增强,并将生成的伪平行句对与原始数据一起训练最终的神经机器翻译模型。在汉-越翻译任务上的实验结果表明,通过短语替换生成的伪平行句对可以有效提高汉-越神经机器翻译的性能。  相似文献   

16.
Unknown words are one of the key factors that greatly affect the translation quality.Traditionally, nearly all the related researches focus on obtaining the translation of the unknown words.However, these approaches have two disadvantages.On the one hand, they usually rely on many additional resources such as bilingual web data;on the other hand, they cannot guarantee good reordering and lexical selection of surrounding words.This paper gives a new perspective on handling unknown words in statistical machine translation (SMT).Instead of making great efforts to find the translation of unknown words, we focus on determining the semantic function of the unknown word in the test sentence and keeping the semantic function unchanged in the translation process.In this way, unknown words can help the phrase reordering and lexical selection of their surrounding words even though they still remain untranslated.In order to determine the semantic function of an unknown word, we employ the distributional semantic model and the bidirectional language model.Extensive experiments on both phrase-based and linguistically syntax-based SMT models in Chinese-to-English translation show that our method can substantially improve the translation quality.  相似文献   

17.
Short texts are typically composed of small number of words, most of which are abbreviations, typos and other kinds of noise. This makes the noise to signal ratio relatively high for this specific category of text. A high proportion of noise in the data is undesirable for analysis procedures as well as machine learning applications. Text normalization techniques are used to reduce the noise and improve the quality of text for processing and analysis purposes. In this work, we propose a combination of statistical and rule-based techniques to normalize short texts. More specifically, we focus our attention on SMS messages. We base our normalization approach on a statistical machine translation system which translates from noisy data to clean data. This system is trained on a small manually annotated set. Then, we study several automatic methods to extract more general rules from the normalizations generated with the statistical machine translation system. We illustrate the proposed methodology by conducting some experiments with a SMS Haitian-Créole data collection. In order to evaluate the performance of our methodology we use several Haitian-Créole dictionaries, the well-known perplexity criteria and the achieved reduction of vocabulary.  相似文献   

18.
面向机器辅助翻译的汉语语块自动抽取研究   总被引:1,自引:1,他引:1  
本文提出了一种统计和规则相结合的语块抽取方法。本文使用Nagao串频统计算法进行基于词语的串频统计,进一步分别利用统计方法、语块边界过滤规则对2-gram到10-gram语块进行过滤,得到候选语块,取得了令人满意的结果。通过实验发现,在统计方法中互信息和信息熵相结合的方法较单一的互信息方法好;在语块边界规则过滤方法中语块左右边界规则和停用词对语块抽取的结果有较大影响。实验结果表明统计和过滤规则相结合的方法要优于纯粹的统计方法。应用本文方法,再辅以人工校对,可以方便地获取重复出现的多词语块。在机器辅助翻译系统中,使用现有的语块抽取方法抽取重复的语言单位,就可以方便地建设翻译记忆库,提高翻译的工作效率。  相似文献   

19.
In this paper, we address the problem of the exploitation of text phrases in a multilingual context. We propose a technique to benefit from multi-word units in adhoc document retrieval, whatever the language of the document collection. We present principles to optimize the performance improvement obtained through this approach. The work is validated through retrieval experiments conducted on Chinese, Japanese, Korean and English.  相似文献   

20.
目前,大多数机器翻译自动评测方法都没有考虑在未匹配的词语中可能包含被忽略的信息。本文提出一种在参考译文和待评测译文之间自动搜索模糊匹配词对的方法,并给出相似度的计算方法。模糊匹配和计算相似度的整个过程将通过一个例子进行说明。实验表明,我们的方法能够较好地找到被忽略的、有意义的词对。更重要的是,通过引入模糊匹配,BLEU 的性能得到显著的提高。模糊匹配可以用来提高其他机器翻译自动评测方法的性能。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号