1.
This paper presents an extended, harmonised account of our previous work on combining subsentential alignments from phrase-based
statistical machine translation (SMT) and example-based MT (EBMT) systems to create novel hybrid data-driven systems capable
of outperforming the baseline SMT and EBMT systems from which they were derived. In previous work, we demonstrated that while
an EBMT system is capable of outperforming a phrase-based SMT (PBSMT) system constructed from freely available resources,
a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT subsentential alignments is capable of outperforming
both baseline translation models for French–English translation. In this paper, we show that similar gains are to be had from
constructing a hybrid ‘statistical’ EBMT system. Unlike the previous research, here we use the Europarl training and test
sets, which are fast becoming the standard data in the field. On these data sets, while all hybrid ‘statistical’ EBMT variants
still fall short of the quality achieved by the baseline PBSMT system, we show that adding the marker chunks to create a hybrid
‘example-based’ SMT system outperforms the two baseline systems from which it is derived. Furthermore, we provide further
evidence in favour of hybrid systems by adding an SMT target-language model to the EBMT system, and demonstrate that this
too has a positive effect on translation quality. We also show that many of the subsentential alignments derived from the
Europarl corpus are created by either the PBSMT or the EBMT system, but not by both. In sum, therefore, despite the obvious
convergence of the two paradigms, the crucial differences between SMT and EBMT contribute positively to the overall translation
quality. The central thesis of this paper is that any researcher who continues to develop an MT system using either of these
approaches will benefit further from integrating the advantages of the other model; dogged adherence to one approach will
lead to inferior systems being developed.
2.
We describe a novel approach to MT that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and
EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with
conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show
that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results
from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two
distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.
3.
Aaron B. Phillips 《Machine Translation》2011,25(2):161-177
The Cunei machine translation platform is an open-source system for data-driven machine translation. Our platform is a synthesis
of the traditional example-based MT (EBMT) and statistical MT (SMT) paradigms. What makes Cunei unique is that it measures
the relevance of each translation instance with a distance function. This distance function, represented as a log-linear model,
operates over one translation instance at a time and enables us to score the translation instance relative to the specified
input and/or the current target hypothesis. We describe how our system, Cunei, scores features individually for each translation
instance and how it efficiently performs parameter tuning over the entire feature space. We also compare Cunei with three
other open-source MT systems (Moses, CMU-EBMT, and Marclator). In our experiments involving Korean–English and Czech–English
translation, Cunei clearly outperforms the traditional EBMT and SMT systems.
4.
John Hutchins 《Machine Translation》2005,19(3-4):197-211
In the last decade the dominant models of MT have been data-driven or corpus-based. Of the two main trends, statistical machine
translation and example-based machine translation (EBMT), the latter is much less clearly defined. In a review of the recently
published collection edited by Michael Carl and Andy Way, this essay surveys the basic processes, methods, main problems and
tasks of EBMT, and attempts to provide a definition of the essence of EBMT in comparison with statistical MT and traditional
rule-based MT.
Recent Advances in Example-based Machine Translation. Edited by Michael Carl and Andy Way. Dordrecht: Kluwer Academic Publishers, 2003. xxxi, 482pp. (Text, Speech and Language
Technology, vol. 21) ISBN: 1-4020-1400-7 (hardback), 1-4020-1401-5 (paperback).
5.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation
(SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred
across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying
the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is
enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features
rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated
on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores.
Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon
are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously,
the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically
significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE
SMT system.
6.
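Research on Core Technologies in a Multi-Strategy Chinese–Japanese Machine Translation System (total citations: 1; self-citations: 0; citations by others: 1)
Multi-strategy machine translation is one direction in which current MT systems are developing. This paper discusses the core techniques and algorithms used by each core translation subsystem of a multi-strategy Chinese–Japanese MT system: a Chinese analysis subsystem that uses lexical analysis, syntactic parsing, and semantic role labelling; a translation-memory-based MT subsystem that uses a double-indexing technique; an example-pattern-based MT subsystem that uses syntactic tree fragments as templates; and an MT subsystem that combines valency patterns with segment analysis. Tests of the translation memory subsystem show that it is highly efficient. The example-pattern subsystem reaches 99% accuracy in a closed test on 1,559 sentences and 85% accuracy in an open test on 1,500 sentences. The valency-pattern subsystem reaches 89% accuracy in a test on 3,059 sentences.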
7.
D. Ortiz-Martínez I. García-Varea F. Casacuberta 《Pattern recognition letters》2008,29(8):1145-PRintPerclntel
Statistical machine translation (SMT) has proven to be an interesting pattern recognition framework for automatically building machine translation systems from available parallel corpora. In the last few years, research in SMT has been characterized by two significant advances. First, the popularization of the so-called phrase-based statistical translation models, which allow local contextual information to be incorporated into the translation models. Second, the availability of larger and larger parallel corpora, which are composed of millions of sentence pairs and tens of millions of running words. Since phrase-based models basically consist of statistical dictionaries of phrase pairs, their estimation from very large corpora is a very costly task that yields a huge number of parameters which must be stored in memory. The handling of millions of model parameters and a similar number of training samples has become a bottleneck in the field of SMT, as well as in other well-known pattern recognition tasks such as speech recognition or handwriting recognition, to name just a few. In this paper, we propose a general framework that deals with the scaling problem in SMT without introducing significant time overhead by combining different scaling techniques. This new framework is based on the use of counts instead of probabilities, and on the concept of cache memory.
8.
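A Method for Computing the Structural Similarity of Chinese Sentences Based on Part-of-Speech Strings (total citations: 9; self-citations: 1; citations by others: 9)
Measuring sentence similarity is the most important part of example-based machine translation research. For example-based Chinese–English MT, the accuracy of Chinese sentence similarity measurement directly affects the final translation output. This paper proposes a method for computing the structural similarity of Chinese sentences. The method compares the part-of-speech strings of two sentences, finds an optimal matching, and produces a structural similarity value. Preliminary experiments on a small sentence set show that the method is feasible and effective, and agrees with human intuition.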
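As a rough illustration of the idea in this abstract, the sketch below compares two part-of-speech tag strings with a longest-common-subsequence alignment and normalises the result to the range [0, 1]. The abstract does not say which optimal-matching procedure or normalisation the authors use, so the LCS alignment, the Dice-style normalisation, the function names, and the tag labels are all assumptions made for illustration.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two tag sequences."""
    prev = [0] * (len(b) + 1)
    for tag_a in a:
        curr = [0]
        for j, tag_b in enumerate(b, start=1):
            if tag_a == tag_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pos_structure_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Hypothetical structural similarity: 2 * LCS / (len(a) + len(b))."""
    if not a and not b:
        return 1.0  # two empty tag strings are treated as identical
    return 2.0 * lcs_length(a, b) / (len(a) + len(b))


# Illustrative tag strings (n = noun, v = verb, adv = adverb).
score = pos_structure_similarity(["n", "v", "n"], ["n", "adv", "v", "n"])
```

Here the two tag strings share the subsequence `n v n`, so the score is 2 × 3 / 7. A real system would obtain the tags from a part-of-speech tagger rather than from hand-written lists.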
9.
This paper describes an example-based machine translation (EBMT) method based on tree–string correspondence (TSC) and statistical
generation. In this method, the translation example is represented as a TSC, which is a triple consisting of a parse tree
in the source language, a string in the target language, and the correspondence between the leaf node of the source-language
tree and the substring of the target-language string. For an input sentence to be translated, it is first parsed into a tree.
Then the TSC forest which best matches the input tree is searched for. Finally the translation is generated using a statistical
generation model to combine the target-language strings of the TSCs. The generation model consists of three features: the
semantic similarity between the tree in the TSC and the input tree, the translation probability of translating the source
word into the target word, and the language-model probability for the target-language string. Based on the above method, we
build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with
phrase-based statistical MT systems.
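The three features named in this abstract could be combined in several ways. The sketch below assumes a weighted log-linear combination, which is a common choice in statistical MT but is not confirmed by the abstract. The function names, the weights, and the candidate values are hypothetical.

```python
import math
from typing import Iterable, Tuple

Features = Tuple[float, float, float]


def generation_score(semantic_similarity: float,
                     translation_prob: float,
                     lm_prob: float,
                     weights: Features = (1.0, 1.0, 1.0)) -> float:
    """Assumed log-linear score over the three features from the abstract."""
    features = (semantic_similarity, translation_prob, lm_prob)
    if any(f <= 0.0 for f in features):
        return float("-inf")  # a zero-valued feature rules the candidate out
    return sum(w * math.log(f) for w, f in zip(weights, features))


def best_candidate(candidates: Iterable[Tuple[str, Features]]) -> str:
    """Return the candidate string with the highest assumed score."""
    return max(candidates, key=lambda c: generation_score(*c[1]))[0]


# Made-up feature values for two candidate target strings.
choice = best_candidate([
    ("candidate A", (0.9, 0.5, 0.2)),
    ("candidate B", (0.6, 0.4, 0.1)),
])
```

Candidate A dominates candidate B on every feature, so it is selected whatever positive weights are used. In practice the weights would be tuned on held-out data.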
10.
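Example-based machine translation (EBMT) uses preprocessed bilingual example sentences as its main translation resource and generates a translation by editing translation examples that match the sentence to be translated. In an EBMT system, translation example selection and translation selection have a large effect on system performance. This paper proposes using a statistical collocation model to strengthen both translation example selection and translation selection in EBMT and so improve translation quality. First, a statistical collocation model is trained from a monolingual corpus using monolingual statistical word alignment. The model is then used to improve EBMT performance in three ways: (1) it estimates the degree of match between the sentence to be translated and a translation example, which strengthens translation example selection; (2) it introduces an estimate of the collocation strength between a candidate translation and its context, which improves translation selection; (3) it detects the collocates of the replaced word in a translation example and corrects those collocates according to the new replacement word and its context, which further improves the translation quality of the EBMT system. To validate the proposed method, English–Chinese translation quality was evaluated on a word-based EBMT system. Compared with the baseline system, the proposed method raises the BLEU score of the translations by 4.73 to 6.48 percentage points. The collocation-model-based translation selection method was further tested on a semi-structured EBMT system, where the experimental results show that it raises the BLEU score by 1.82 percentage points. In addition, human evaluation shows that the translations of the improved semi-structured EBMT system convey most of the information in the source text and are highly fluent.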
11.
12.
In statistical machine translation (SMT), re-ranking of a huge number of randomly generated translation hypotheses is one of the essential components in determining the quality of the translation result. In this work, a novel re-ranking modelling framework called cascaded re-ranking modelling (CRM) is proposed, which cascades a classification model and a regression model. The proposed CRM effectively and efficiently selects the good but rare hypotheses in order to alleviate the issues of translation quality and computational cost simultaneously. CRM can be partnered with any classifier, such as support vector machines (SVM) and extreme learning machine (ELM). Compared to other state-of-the-art methods, experimental results show that CRM partnered with ELM (CRM-ELM) can raise translation quality by up to 11.6% on the popular benchmark Chinese–English corpus (IWSLT 2014) and French–English parallel corpus (WMT 2015), with extremely fast training time on huge corpora.
13.
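Chinese word segmentation is an important step in building statistical machine translation systems from Chinese into other languages. Traditional segmentation models trained on monolingual corpora are not necessarily well suited to machine translation [1]. This paper proposes a word segmentation method adapted to SMT systems that is based on both monolingual and bilingual knowledge. First, using the notion of alignment confidence, a set of reliable alignments is extracted from a bilingual character-aligned corpus, and the Chinese side of the bilingual corpus is re-segmented according to that set. The re-segmentation result is then merged with the output of a monolingual segmentation tool to obtain a new segmentation, which is used as training data for a conditional random field model, yielding a segmenter that combines monolingual and bilingual knowledge. The authors use this tool to segment the training, development, and test sets needed for machine translation and run experiments on a phrase-based SMT system. The experimental results show that the proposed method improves system performance.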
14.
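An Example-Based Chinese–English Machine Translation Strategy (total citations: 3; self-citations: 0; citations by others: 3)
This paper introduces an example-based Chinese–English machine translation strategy, focusing on the design of the Chinese–English bilingual corpus and the Chinese sentence matching algorithm built on that corpus. When matching Chinese sentences, the method matches Chinese characters directly, in keeping with the characteristics of Chinese, and does not segment the sentences into words. Determining the boundaries of matched fragments is another difficulty in example-based MT, and a corresponding solution is adopted for it as well. The joining and assembly of translated sentences is not studied in more depth, because this strategy is intended for a multi-engine translation system, where it is used together with other translation strategies to improve the accuracy of the translation results. Example-based MT needs a large bilingual corpus as the basis for translation, and building a large corpus by hand is time-consuming and laborious, so the authors attempt to build the Chinese–English bilingual corpus automatically by computer, including document-level alignment and word-level alignment.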
15.
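The morphological differences between Chinese and Mongolian and the small size of the parallel corpus limit improvements in Chinese–Mongolian SMT performance. This paper introduces Mongolian morphological information into Chinese–Mongolian SMT. Mongolian is segmented into morphemes, and mappings are constructed between Chinese words and Mongolian morphemes and between Mongolian morphemes and Mongolian, which compensates for the asymmetry in morphological structure between the two languages. With morphemes as a pivot language, a Chinese-to-Mongolian-morpheme SMT system and a Mongolian-morpheme-to-Mongolian SMT system are trained to build a new phrase translation table and reordering model, which are integrated into Chinese–Mongolian SMT through multi-path decoding and multiple features. The experimental results show that introducing the morpheme-pivot phrase translation table and reordering model into an existing SMT method gives a clear improvement in BLEU over the baseline system and, to some extent, reduces the effect of data sparsity and morphological differences on Chinese–Mongolian SMT. The method is general: by combining information at the morpheme and phrase levels it makes the two languages symmetric in morphological structure, and it applies not only to Chinese–Mongolian SMT but also to other morphologically asymmetric, low-resource language pairs.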
16.
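Example-based machine translation is an important MT technique, and measuring sentence similarity is the most important part of EBMT research. For example-based Uyghur machine translation, the accuracy of Uyghur sentence similarity measurement directly affects the final translation output. This paper proposes a method for computing Uyghur sentence similarity. A coarse selection algorithm based on word-form features, together with a hashed inverted index over words, effectively speeds up lookup and quickly filters a set of candidate sentences from the corpus. A multi-strategy fine selection algorithm, which uses a word discrimination algorithm based on Uyghur word frequency and a contiguous word sequence extraction algorithm, effectively measures how similar two Uyghur sentences are. The experimental results show that the algorithm is effective.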
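The coarse selection step in this abstract relies on a hashed inverted index over words. The sketch below shows the general data structure only, assuming whitespace tokenisation and ranking by the number of shared word types. The abstract does not specify the word-form features or the ranking rule, so those choices, the function names, and the placeholder corpus are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def build_inverted_index(corpus: List[str]) -> Dict[str, Set[int]]:
    """Map each word to the set of ids of the sentences that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sentence_id, sentence in enumerate(corpus):
        for word in sentence.split():
            index[word].add(sentence_id)
    return index


def coarse_candidates(query: str,
                      index: Dict[str, Set[int]],
                      top_k: int = 3) -> List[int]:
    """Return up to top_k sentence ids sharing the most word types with query."""
    hits: Counter = Counter()
    for word in set(query.split()):
        for sentence_id in index.get(word, ()):
            hits[sentence_id] += 1
    # Sort by shared-word count (descending), then by id for determinism.
    ranked = sorted(hits, key=lambda sid: (-hits[sid], sid))
    return ranked[:top_k]


# Placeholder tokens stand in for real corpus sentences.
example_index = build_inverted_index(["a b c", "a d e", "f g h"])
candidates = coarse_candidates("a b c", example_index)
```

Only sentences that share at least one word with the query are ever touched, which is what makes the lookup fast. A finer similarity measure would then be applied to the short candidate list.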
17.
18.
In this paper, we present a hybrid architecture for developing a system combination model that works in three layers to achieve better translated outputs. In the first layer, we have various machine translation models (i.e. Neural Machine Translation (NMT), Statistical Machine Translation (SMT), etc.). In the second layer, the outputs of these models are combined to leverage the advantages of both systems (i.e. SMT and NMT systems) by using a statistical approach and a neural-based approach. However, each approach has some advantages and limitations. So, instead of selecting an individual combined system’s output as the final one, we apply these outputs in the final layer to produce the target output by assigning appropriate preferences to the SMT-based and neural-based combinations. Although some techniques for system combination exist, no approach uses preferences from various system combination models (statistical and neural) for the purpose of better assembling. Empirical results show improved performance in terms of translation accuracy. Our experiments on two benchmark datasets of English–Hindi and Hindi–English pairs show that the proposed model performs significantly better than the participating models. The efficacy of the proposed model is also significantly better than the state-of-the-art machine translation combination systems (6.10 and 4.69 BLEU points for English-to-Hindi and Hindi-to-English, respectively).
19.
20.
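In current statistical translation methods, the size of the bilingual corpus and the accuracy of word alignment strongly affect the performance of the translation system. A large corpus can improve word alignment accuracy and system performance, but at the cost of a heavier system load, so current research on SMT, while building on large corpora, also looks for other ways to improve system performance. To address these problems, this paper proposes a method for applying a bilingual dictionary in SMT, which not only improves word alignment accuracy but also produces higher-quality translations and, to some extent, alleviates the data sparsity problem.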