Found 20 similar documents; search took 19 ms
1.
This paper presents an extended, harmonised account of our previous work on combining subsentential alignments from phrase-based
statistical machine translation (SMT) and example-based MT (EBMT) systems to create novel hybrid data-driven systems capable
of outperforming the baseline SMT and EBMT systems from which they were derived. In previous work, we demonstrated that while
an EBMT system is capable of outperforming a phrase-based SMT (PBSMT) system constructed from freely available resources,
a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT subsentential alignments is capable of outperforming
both baseline translation models for French–English translation. In this paper, we show that similar gains are to be had from
constructing a hybrid ‘statistical’ EBMT system. Unlike the previous research, here we use the Europarl training and test
sets, which are fast becoming the standard data in the field. On these data sets, while all hybrid ‘statistical’ EBMT variants
still fall short of the quality achieved by the baseline PBSMT system, we show that adding the marker chunks to create a hybrid
‘example-based’ SMT system outperforms the two baseline systems from which it is derived. Furthermore, we provide further
evidence in favour of hybrid systems by adding an SMT target-language model to the EBMT system, and demonstrate that this
too has a positive effect on translation quality. We also show that many of the subsentential alignments derived from the
Europarl corpus are created by either the PBSMT or the EBMT system, but not by both. In sum, therefore, despite the obvious
convergence of the two paradigms, the crucial differences between SMT and EBMT contribute positively to the overall translation
quality. The central thesis of this paper is that any researcher who continues to develop an MT system using either of these
approaches will benefit further from integrating the advantages of the other model; dogged adherence to one approach will
lead to inferior systems being developed.
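As a rough illustration of the hybrid idea in this abstract, the sketch below pools two sets of subsentential alignments and counts how many phrase pairs come from only one system. The phrase pairs and the function name `merge_alignments` are invented for illustration; the paper's actual data and merging procedure are not described here.

```python
def merge_alignments(smt_pairs, ebmt_pairs):
    """Union two sets of (source, target) phrase pairs and tag each pair's origin."""
    merged = {}
    for pair in smt_pairs | ebmt_pairs:
        in_smt = pair in smt_pairs
        in_ebmt = pair in ebmt_pairs
        if in_smt and in_ebmt:
            merged[pair] = "both"
        else:
            merged[pair] = "smt" if in_smt else "ebmt"
    return merged


# Toy French-English phrase pairs (hypothetical, not taken from Europarl).
smt = {("la maison", "the house"), ("maison bleue", "blue house")}
ebmt = {("la maison", "the house"), ("dans la maison", "in the house")}

merged = merge_alignments(smt, ebmt)
# Pairs proposed by exactly one of the two systems.
only_one = sum(1 for origin in merged.values() if origin != "both")
```

Keeping the origin tag alongside each pair makes it easy to check how complementary the two alignment sources are before building a hybrid phrase table.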
2.
This paper describes an example-based machine translation (EBMT) method based on tree–string correspondence (TSC) and statistical
generation. In this method, the translation example is represented as a TSC, which is a triple consisting of a parse tree
in the source language, a string in the target language, and the correspondence between the leaf node of the source-language
tree and the substring of the target-language string. For an input sentence to be translated, it is first parsed into a tree.
Then the TSC forest which best matches the input tree is searched for. Finally the translation is generated using a statistical
generation model to combine the target-language strings of the TSCs. The generation model consists of three features: the
semantic similarity between the tree in the TSC and the input tree, the translation probability of translating the source
word into the target word, and the language-model probability for the target-language string. Based on the above method, we
build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with
phrase-based statistical MT systems.
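The TSC triple and the three-feature generation model described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the class layout, the log-linear combination, the uniform weights and the toy example are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TSC:
    """Tree-string correspondence: source parse tree, target string, alignment."""
    source_tree: tuple   # nested tuples, e.g. ("NP", ("DT", "the"), ("NN", "book"))
    target_string: list  # target-language tokens
    alignment: dict      # source leaf index -> (start, end) span in target_string


def generation_score(similarity, translation_prob, lm_prob, weights=(1.0, 1.0, 1.0)):
    """Combine the three features log-linearly (the weights are hypothetical)."""
    features = (similarity, translation_prob, lm_prob)
    return sum(w * math.log(f) for w, f in zip(weights, features))


# Toy English-to-Chinese example with invented feature values.
tsc = TSC(
    source_tree=("NP", ("DT", "the"), ("NN", "book")),
    target_string=["这", "本", "书"],
    alignment={0: (0, 2), 1: (2, 3)},
)
score = generation_score(similarity=0.8, translation_prob=0.5, lm_prob=0.1)
```

With uniform weights the score is simply the log of the product of the three feature values; in a real system the weights would be tuned on held-out data.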
3.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation
(SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred
across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying
the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is
enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features
rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated
on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores.
Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon
are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously,
the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically
significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE
SMT system.
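The abstract does not spell out the adaptation formula. The sketch below assumes the commonly used form of marginal adaptation, in which each background probability p_bg(w|h) is rescaled by (p_topic(w) / p_bg(w))^beta and then renormalised over the vocabulary. The function name, the value of beta and the toy distributions are assumptions for illustration only.

```python
def marginal_adapt(bg_cond, bg_unigram, topic_unigram, beta=0.5):
    """Rescale p_bg(w|h) by (p_topic(w) / p_bg(w)) ** beta, then renormalise over w.

    bg_cond:       background conditional distribution p_bg(w|h) for one history h
    bg_unigram:    background unigram distribution p_bg(w)
    topic_unigram: topic-adapted unigram distribution p_topic(w)
    """
    scaled = {
        w: p * (topic_unigram[w] / bg_unigram[w]) ** beta
        for w, p in bg_cond.items()
    }
    z = sum(scaled.values())  # normalisation constant for this history
    return {w: s / z for w, s in scaled.items()}


# Toy two-word vocabulary: the inferred topic favours "bank" over "river".
bg_cond = {"bank": 0.5, "river": 0.5}
bg_unigram = {"bank": 0.5, "river": 0.5}
topic_unigram = {"bank": 0.8, "river": 0.2}

adapted = marginal_adapt(bg_cond, bg_unigram, topic_unigram, beta=0.5)
```

The exponent beta controls how strongly the topic-adapted marginals override the background model; beta = 0 leaves the background distribution unchanged.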
4.
Aaron B. Phillips 《Machine Translation》2011,25(2):161-177
The Cunei machine translation platform is an open-source system for data-driven machine translation. Our platform is a synthesis
of the traditional example-based MT (EBMT) and statistical MT (SMT) paradigms. What makes Cunei unique is that it measures
the relevance of each translation instance with a distance function. This distance function, represented as a log-linear model,
operates over one translation instance at a time and enables us to score the translation instance relative to the specified
input and/or the current target hypothesis. We describe how our system, Cunei, scores features individually for each translation
instance and how it efficiently performs parameter tuning over the entire feature space. We also compare Cunei with three
other open-source MT systems (Moses, CMU-EBMT, and Marclator). In our experiments involving Korean–English and Czech–English
translation, Cunei clearly outperforms the traditional EBMT and SMT systems.
5.
The string-to-tree model is one of the most successful syntax-based statistical machine translation (SMT) models. It models the grammaticality of the output via target-side syntax. However, it does not use any semantic information and tends to produce translations containing semantic role confusions and error chunk sequences. In this paper, we propose two methods to use semantic roles to improve the performance of the string-to-tree translation model: (1) adding role labels in the syntax tree; (2) constructing a semantic role tree, and then incorporating the syntax information into it. We then perform string-to-tree machine translation using the newly generated trees. Our methods enable the system to train and choose better translation rules using semantic information. Our experiments showed significant improvements over the state-of-the-art string-to-tree translation system on both spoken and news corpora, and the two proposed methods surpass the phrase-based system on large-scale training data.
6.
Gideon Kotzé Vincent Vandeghinste Scott Martens Jörg Tiedemann 《Language Resources and Evaluation》2017,51(2):249-282
We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the non-terminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare the scores with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.
7.
Stefanie Geldbach 《Machine Translation》1999,14(3-4):217-230
Anaphora resolution in machine translation involves two aspects: (1) the identification of the antecedent, i.e., the determination of co-reference relations between anaphor and antecedent; and (2) the translation of the anaphor, i.e., the selection of the appropriate target-language equivalent. The identification of the antecedent is essentially a monolingual, language-pair independent problem which is usually solved during analysis. The selection of the target-language equivalent, on the other hand, can be regarded as a language-pair dependent task which has to be tackled during transfer and generation. In this paper, the problems of anaphora translation are discussed for the language pair Russian–German. Although in most cases source-language anaphoric pronouns correspond to target-language anaphoric pronouns, in some cases this straightforward equation does not hold. Two cases of such translation discrepancies are treated here: zero anaphora and pronominal PPs. The differences in the distribution of zero anaphora and pronominal PPs in Russian and German are described, and solutions to these translation problems based on the Russian–German MT system T1 are presented.
8.
Phrase-based decoding is conceptually simple and straightforward to implement, at the cost of drastically oversimplified reordering
models. Syntactically aware models make it possible to capture linguistically relevant relationships in order to improve word
order, but they can be more complex to implement and optimise. In this paper, we explore a new middle ground between phrase-based
and syntactically informed statistical MT, in the form of a model that supplements conventional, non-hierarchical phrase-based
techniques with linguistically informed reordering based on syntactic dependency trees. The key idea is to exploit linguistically-informed
hierarchical structures only for those dependencies that cannot be captured within a single flat phrase. For very local dependencies
we leverage the success of conventional phrase-based approaches, which provide a sequence of target-language words appropriately
ordered and ready-made with any agreement morphology. Working with dependency trees rather than constituency trees allows
us to take advantage of the flexibility of phrase-based systems to treat non-constituent fragments as phrases. We do impose
a requirement—that the fragment be a novel sort of “dependency constituent”—on what can be translated as a phrase, but this
is much weaker than the requirement that phrases be traditional linguistic constituents, which has often proven too restrictive
in MT systems.
9.
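Research on Disambiguation and Target-Text Generation in Sentence-Pattern Transformation
Disambiguation in sentence-pattern transformation and target-text generation are two important stages of hybrid Chinese–English machine translation. The main contributions of this paper are as follows: first, in view of the ambiguity that pervades natural language at every level, we analyse the problem of linguistic ambiguity and discuss several concrete disambiguation methods; second, we establish tense-conversion rules and related matching rules for Chinese–English machine translation, and explore the handling of target-text generation.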
10.
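The phrase table is a core component of a phrase-based statistical machine translation system. Phrase tables extracted with heuristic methods are severely affected by word-alignment errors and unaligned words, and the extracted phrases are not phrases in the syntactic sense. This paper proposes a bilingual syntactic phrase extraction method based on the EM (expectation-maximization) algorithm, which iterates repeatedly so that the parameter values reach their optimum. The extracted bilingual syntactic phrases are combined with phrase-based statistical machine translation in three different ways (adding the bilingual syntactic phrases, adding new features, and retraining) in order to improve system performance. The results show that all three methods improve the BLEU (BiLingual Evaluation Understudy) score of the translations to varying degrees, with the new-feature method yielding a gain of 0.64 points.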
11.
Michael Carl 《Machine Translation》2005,19(3-4):229-249
According to the system theory of von Bertalanffy (1968), a “system” is an entity that can be distinguished from
its environment and that consists of several parts. System theory investigates the role of the parts, their interaction and
the relation of the whole with its environment. System theory of the second order examines how an observer relates to the
system. This paper traces some of the recent discussion of example-based machine translation (EBMT) and compares a number
of EBMT and statistical MT systems. It is found that translation examples are linguistic systems themselves that consist of
words, phrases and other constituents. Two properties of Luhmann’s (2002) system theory are discussed in this context: EBMT
has focussed on the properties of structures suited for translation and the design of their reentry points, and SMT develops connectivity operators which select the most likely continuations of structures. While technically the SMT and EBMT approaches complement
each other, the principal distinguishing characteristic results from different sets of values which SMT and EBMT followers
prefer.
12.
Gorka Labaka Cristina España-Bonet Lluís Màrquez Kepa Sarasola 《Machine Translation》2014,28(2):91-125
This article presents a hybrid architecture which combines rule-based machine translation (RBMT) with phrase-based statistical machine translation (SMT). The hybrid translation system is guided by the rule-based engine. Before the transfer step, a varied set of partial candidate translations is calculated with the SMT system and used to enrich the tree-based representation with more translation alternatives. The final translation is constructed by choosing the most probable combination among the available fragments using monotone statistical decoding following the order provided by the rule-based system. We apply the hybrid model to a pair of distantly related languages, Spanish and Basque, and perform extensive experimentation on two different corpora. According to our empirical evaluation, the hybrid approach outperforms the best individual system across a varied set of automatic translation evaluation metrics. Following some output analysis to better understand the behaviour of the hybrid system, we explore the possibility of adding alternative parse trees and extra features to the hybrid decoder. Finally, we present a twofold manual evaluation of the translation systems studied in this paper, consisting of (i) a pairwise output comparison and (ii) an individual task-oriented evaluation using HTER. Interestingly, the manual evaluation shows some contradictory results with respect to the automatic evaluation; humans tend to prefer the translations from the RBMT system over the statistical and hybrid translations.
13.
Although some progress has been made on the quality of Machine Translation in recent years, there is still a significant potential for quality improvement. There has also been a shift in paradigm of machine translation, from “classical” rule-based systems like METAL or LMT towards example-based or statistical MT. It seems to be time now to evaluate the progress and compare the results of these efforts, and draw conclusions for further improvements of MT quality. The paper starts with a comparison between statistical MT (henceforth: SMT) and rule-based MT (henceforth: RMT) systems, and describes the set-up and the evaluation results; the second section analyses the strengths and weaknesses of the respective approaches, and the third one discusses models of an architecture for a hybrid system.
14.
In this work, we present an extension of n-gram-based translation models based on factored language models (FLMs). Translation units employed in the n-gram-based approach to statistical machine translation (SMT) are based on mappings of sequences of raw words, while translation
model probabilities are estimated through standard language modeling of such bilingual units. Therefore, similar to other
translation model approaches (phrase-based or hierarchical), the sparseness problem of the units being modeled leads to unreliable
probability estimates, even under conditions where large bilingual corpora are available. In order to tackle this problem,
we extend the n-gram-based approach to SMT by tightly integrating more general word representations, such as lemmas and morphological classes,
and we use the flexible framework of FLMs to apply a number of different back-off techniques. In this work, we show that FLMs
can also be successfully applied to translation modeling, yielding more robust probability estimates that integrate larger
bilingual contexts during the translation process.
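The back-off idea in this abstract, falling back from sparse surface-word contexts to more general representations such as lemmas and morphological classes, can be illustrated with a deliberately simplified sketch. It is not a factored language model: the function name, the fixed penalty and the toy tables are assumptions, and the returned scores are not normalised probabilities.

```python
def backoff_score(unit, context_views, tables, penalty=0.4):
    """Score `unit` by trying context views from most specific to most general.

    context_views: contexts in back-off order, e.g. [surface words, lemmas, classes]
    tables:        one dict per view, mapping (context, unit) -> relative frequency
    Each back-off step multiplies the score by `penalty` (a hypothetical constant).
    """
    factor = 1.0
    for view, table in zip(context_views, tables):
        if (view, unit) in table:
            return factor * table[(view, unit)]
        factor *= penalty  # context unseen at this level: back off and discount
    return 0.0


# Toy data: the surface context is unseen, but the lemma context is known.
views = [("casas",), ("casa",), ("NOUN",)]
tables = [
    {},                              # surface-word table: no evidence
    {(("casa",), "house"): 0.6},     # lemma table
    {(("NOUN",), "house"): 0.2},     # morphological-class table
]
score = backoff_score("house", views, tables)
```

Here the estimate comes from the lemma level after one discounted back-off step, which is the kind of robustness to sparse surface forms that the abstract attributes to more general word representations.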
15.
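To address the problem of semantic irrelevance in Chinese–Uyghur statistical machine translation systems, a bilingual relatedness optimisation model based on neural machine translation is proposed. The model uses an attention mechanism to capture word-alignment information, introduces the semantic relatedness between bilingual phrases and their internal lexical matching degree, and predicts the generation probability of each bilingual phrase, which is then used as the bilingual relatedness score to optimise the phrase translation scores in the statistical translation model. Experiments on the public Chinese–Uyghur machine translation data set of the 11th China Workshop on Machine Translation (CWMT 2015) show that, compared with the baseline system and with relatively small training data and vocabulary, the proposed method effectively improves performance on both phrase-level and sentence-level machine translation tasks, with maximum BLEU gains of 2.49 and 0.59 respectively.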
16.
Information on subcategorization and selectional restrictions in a valency dictionary is important for natural language processing
tasks such as monolingual parsing, accurate rule-based machine translation and automatic summarization. In this paper we present
an efficient method of assigning valency information and selectional restrictions to entries in a bilingual dictionary, based
on information in an existing valency dictionary. The method is based on two assumptions: words with similar meaning have
similar subcategorization frames and selectional restrictions; and words with the same translations have similar meanings.
Based on these assumptions, new valency entries are constructed for words in a plain bilingual dictionary, using entries with
similar source-language meaning and the same target-language translations. We evaluate the effects of various measures of
semantic similarity.
17.
Guy De Pauw Peter Waiganjo Wagacha Gilles-Maurice de Schryver 《Language Resources and Evaluation》2011,45(3):331-344
Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned
parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word English–Swahili parallel corpus. We describe the data collection phase and zero in on the difficulties
of finding appropriate and easily accessible data for this language pair. In the data annotation phase, the corpus was semi-automatically
sentence- and word-aligned and morphosyntactic information was added to both the English and Swahili portions of the corpus.
The annotated parallel corpus allows us to investigate two possible uses. We describe experiments with the projection of part-of-speech
tagging annotation from English onto Swahili, as well as the development of a basic statistical machine translation system
for this language pair, using the parallel corpus and a consolidated database of existing English–Swahili translation dictionaries.
We particularly focus on the difficulties of translating English into the morphologically more complex Bantu language of Swahili.
18.
19.
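In developing domain-specific Chinese–English machine translation systems, the appearance of large numbers of new words lowers Chinese word segmentation accuracy, while the lack of annotated corpora in the domain makes it hard to improve the performance of supervised learning techniques. This directly introduces many errors into the extracted translation knowledge and seriously degrades translation quality. To solve this problem, this paper implements a domain-adaptive segmentation model based on raw corpora and a bilingually guided Chinese word segmenter, and proposes a method for combining multiple segmentation results, which builds a lattice and uses a dynamic programming algorithm to obtain the best Chinese segmentation. To validate the proposed method, we conducted evaluation experiments on the NTCIR-10 Chinese–English data set. The results show that the proposed combination method improves both the segmentation F-score and the BLEU score of statistical machine translation.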
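The lattice-plus-dynamic-programming step can be sketched as follows. The abstract does not say how lattice edges are scored, so this sketch assumes a simple scheme: each candidate word is weighted by the number of segmenters that proposed it, multiplied by its length in characters. The function name and the toy input are likewise assumptions.

```python
def best_segmentation(text, segmentations):
    """Merge several segmentations of `text` into a lattice and pick the best path."""
    n = len(text)
    votes = {}  # lattice edge (start, end) -> number of segmenters proposing it
    for seg in segmentations:
        pos = 0
        for word in seg:
            span = (pos, pos + len(word))
            votes[span] = votes.get(span, 0) + 1
            pos += len(word)

    # best[i] is the best score of any lattice path covering text[:i].
    best = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for (s, e), v in votes.items():
            candidate = best[s] + v * (e - s)  # assumed edge score: votes x length
            if e == end and candidate > best[end]:
                best[end] = candidate
                back[end] = s

    # Follow the back-pointers from the end of the text to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


# Three hypothetical segmenters disagree on where the word boundary falls.
candidates = [["研究", "生命"], ["研究生", "命"], ["研究", "生命"]]
result = best_segmentation("研究生命", candidates)
```

Weighting each edge by its length means every path is scored per character covered, so the search does not favour paths merely because they contain more (shorter) words.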