1.
This paper presents an extended, harmonised account of our previous work on combining subsentential alignments from phrase-based
statistical machine translation (SMT) and example-based MT (EBMT) systems to create novel hybrid data-driven systems capable
of outperforming the baseline SMT and EBMT systems from which they were derived. In previous work, we demonstrated that while
an EBMT system is capable of outperforming a phrase-based SMT (PBSMT) system constructed from freely available resources,
a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT subsentential alignments is capable of outperforming
both baseline translation models for French–English translation. In this paper, we show that similar gains are to be had from
constructing a hybrid ‘statistical’ EBMT system. Unlike the previous research, here we use the Europarl training and test
sets, which are fast becoming the standard data in the field. On these data sets, while all hybrid ‘statistical’ EBMT variants
still fall short of the quality achieved by the baseline PBSMT system, we show that adding the marker chunks to create a hybrid
‘example-based’ SMT system outperforms the two baseline systems from which it is derived. Furthermore, we provide further
evidence in favour of hybrid systems by adding an SMT target-language model to the EBMT system, and demonstrate that this
too has a positive effect on translation quality. We also show that many of the subsentential alignments derived from the
Europarl corpus are created by either the PBSMT or the EBMT system, but not by both. In sum, therefore, despite the obvious
convergence of the two paradigms, the crucial differences between SMT and EBMT contribute positively to the overall translation
quality. The central thesis of this paper is that any researcher who continues to develop an MT system using either of these
approaches will benefit further from integrating the advantages of the other model; dogged adherence to one approach will
lead to inferior systems being developed.
2.
Francis Bond Stephan Oepen Eric Nichols Dan Flickinger Erik Velldal Petter Haugereid 《Machine Translation》2011,25(2):87-105
This paper summarizes ongoing efforts to provide software infrastructure (and methodology) for open-source machine translation
that combines a deep semantic transfer approach with advanced stochastic models. The resulting infrastructure combines precise
grammars for parsing and generation, a semantic-transfer based translation engine and stochastic controllers. We provide both
a qualitative and quantitative experience report from instantiating our general architecture for Japanese–English MT using
only open-source components, including HPSG-based grammars of English and Japanese.
3.
Michael Carl 《Machine Translation》2005,19(3-4):229-249
According to the system theory of von Bertalanffy (1968), a “system” is an entity that can be distinguished from
its environment and that consists of several parts. System theory investigates the role of the parts, their interaction and
the relation of the whole with its environment. System theory of the second order examines how an observer relates to the
system. This paper traces some of the recent discussion of example-based machine translation (EBMT) and compares a number
of EBMT and statistical MT systems. It is found that translation examples are linguistic systems themselves that consist of
words, phrases and other constituents. Two properties of Luhmann’s (2002) system theory are discussed in this context: EBMT
has focussed on the properties of structures suited for translation and the design of their reentry points, and SMT develops connectivity operators which select the most likely continuations of structures. While technically the SMT and EBMT approaches complement
each other, the principal distinguishing characteristic results from different sets of values which SMT and EBMT followers
prefer.
4.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation
(SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred
across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework,
model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying
the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is
enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features
rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated
on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores.
Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon
are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously,
the gain is additive. Relative to the unadapted baseline system, the gain in both scores is statistically significant at the 95% confidence level with the medium-scale SMT system, while the gain in the NIST score is statistically significant with the GALE SMT system.
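The marginal adaptation step described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: a background unigram distribution is scaled toward a topic-inferred distribution and renormalised, and the scaling exponent `beta` and all probabilities are invented values.

```python
# Minimal sketch of marginal (MDI-style) language-model adaptation:
# scale each background probability by (topic / background) ** beta,
# then renormalise so the adapted distribution sums to one.
# All numbers and the beta value below are illustrative assumptions.

def marginal_adapt(background, topic, beta=0.5):
    """Shift a background unigram distribution toward a topic distribution."""
    scaled = {w: p * (topic[w] / p) ** beta for w, p in background.items()}
    z = sum(scaled.values())  # renormalisation constant
    return {w: p / z for w, p in scaled.items()}

background = {"bank": 0.2, "river": 0.3, "loan": 0.5}
topic      = {"bank": 0.5, "river": 0.1, "loan": 0.4}  # inferred topic posterior

adapted = marginal_adapt(background, topic)
```

Words favoured by the inferred topic ("bank") gain probability mass at the expense of words the topic disfavours ("river"), while the distribution remains properly normalised.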
5.
John Hutchins 《Machine Translation》2005,19(3-4):197-211
In the last decade the dominant models of MT have been data-driven or corpus-based. Of the two main trends, statistical machine
translation and example-based machine translation (EBMT), the latter is much less clearly defined. In a review of the recently
published collection edited by Michael Carl and Andy Way, this essay surveys the basic processes, methods, main problems and
tasks of EBMT, and attempts to provide a definition of the essence of EBMT in comparison with statistical MT and traditional
rule-based MT.
Recent Advances in Example-based Machine Translation. Edited by Michael Carl and Andy Way. Dordrecht: Kluwer Academic Publishers, 2003. xxxi, 482pp. (Text, Speech and Language
Technology, vol. 21) ISBN: 1-4020-1400-7 (hardback), 1-4020-1401-5 (paperback).
6.
This paper describes an example-based machine translation (EBMT) method based on tree–string correspondence (TSC) and statistical
generation. In this method, the translation example is represented as a TSC, which is a triple consisting of a parse tree
in the source language, a string in the target language, and the correspondence between the leaf node of the source-language
tree and the substring of the target-language string. For an input sentence to be translated, it is first parsed into a tree.
Then the TSC forest which best matches the input tree is searched for. Finally the translation is generated using a statistical
generation model to combine the target-language strings of the TSCs. The generation model consists of three features: the
semantic similarity between the tree in the TSC and the input tree, the translation probability of translating the source
word into the target word, and the language-model probability for the target-language string. Based on the above method, we
build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with
phrase-based statistical MT systems.
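The TSC triple described above can be illustrated with a small data structure. The class name, the nested-list tree encoding, and the example sentence are assumptions made for this sketch, not the paper's actual representation:

```python
# A tree-string correspondence (TSC) as a triple: a source-language parse
# tree, a target-language string, and links from source leaves to target
# substrings. Trees are nested lists [label, child, ...]; a bare string
# is a leaf. All names and the example are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class TSC:
    source_tree: list   # nested [label, child, ...] parse tree
    target: list        # target-language tokens
    links: dict         # source leaf index -> (start, end) token span in target

    def leaves(self):
        """Collect the source tree's leaf tokens in left-to-right order."""
        def walk(node):
            if isinstance(node, str):
                yield node
                return
            for child in node[1:]:
                yield from walk(child)
        return list(walk(self.source_tree))

example = TSC(
    source_tree=["S", ["NP", "I"], ["VP", ["V", "eat"], ["NP", "apples"]]],
    target=["我", "吃", "苹果"],
    links={0: (0, 1), 1: (1, 2), 2: (2, 3)},
)
```

Matching an input tree against a forest of such TSCs and scoring the combinations with the three-feature generation model is where the real system's work lies; this sketch only fixes the shape of the data.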
7.
Andy Way 《Machine Translation》2010,24(3-4):177-208
A very useful service to the example-based machine translation (EBMT) community was provided by Harold Somers in his summary article which appeared in 1999, and was extended in our 2003 book Recent advances in example-based machine translation. As well as providing a comprehensive review of the paradigm, Somers gives a categorisation of the different instantiations of the basic model. In this paper, we provide a complementary view to that of Somers. Today’s EBMT systems learn by analogy. Perhaps even more so than statistical models of translation, one might view these systems as being incapable of forgetting. We researchers and system developers, on the other hand, often forget or are ignorant of techniques and models presented in prior research. The primary aim of this paper is to try to ensure that golden nuggets from past (now quite distantly so) EBMT research papers are gathered together and presented here for a new generation of researchers keen to operate in the paradigm, especially given the spate of recent open-source releases of EBMT systems. We revisit the findings of the previous main research papers, relate them to some of the major research efforts which have taken place since then, and examine especially the prophecies given in the older pieces of work to see the extent to which they have been borne out in the newer research. Given the strong convergence between the leading corpus-based approaches to MT, especially since the introduction of phrase-based statistical MT, a further hope is that these findings may also prove useful to researchers and developers in other areas of MT.
8.
The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based
symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic
knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical
bilingual components to a symbolic system. Our base system is Generation-heavy machine translation (GHMT), a primarily symbolic
asymmetrical approach that addresses the issue of interlingual MT resource poverty in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources.
GHMT’s statistical components are limited to target-language models, which arguably makes it a simple form of a hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting
it to Arabic–English MT. The morphological richness of Arabic brings several challenges to the hybridization task. We conduct
an extensive evaluation of multiple system variants. Our evaluation shows that this new variant of GHMT—a primarily symbolic
system extended with monolingual and bilingual statistical components—has a higher degree of grammaticality than a phrase-based
statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency
translation.
9.
The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the lack of reliability of SMT outputs, the quality of which varies to a great extent. It would therefore be valuable for MT systems to assess their own output translations with automatically derived quality measures. Predicting such quality measures was indeed the goal of a shared task at the Workshop on SMT in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. We then reexamine the shared task data and protocol, show that several factors actually contributed to the difficulty of the task, and discuss alternative evaluation designs.
10.
Statistical machine translation systems are usually trained on large amounts of bilingual text (used to learn a translation
model), and also large amounts of monolingual text in the target language (used to train a language model). In this article
we explore the use of semi-supervised model adaptation methods for the effective use of monolingual data from the source language
in order to improve translation quality. We propose several algorithms with this aim, and present the strengths and weaknesses
of each one. We present detailed experimental evaluations on the French–English EuroParl data set and on data from the NIST
Chinese–English large-data track. We show a significant improvement in translation quality on both tasks.
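One family of semi-supervised adaptation algorithms in the spirit of the work above is self-training on monolingual source data: translate the monolingual sentences with the current model, keep only the most confident outputs as pseudo-parallel data, and retrain. The following toy sketch uses an invented word-for-word "model" and a coverage-based confidence filter; both are assumptions of the sketch, not the paper's algorithms:

```python
# Toy self-training loop for semi-supervised adaptation. The dictionary
# "model", word-for-word translation, and coverage-based confidence
# score are illustrative stand-ins for a real SMT system.

def translate(model, sentence):
    """Word-for-word translation; unknown words pass through unchanged."""
    return [model.get(w, w) for w in sentence.split()]

def confidence(model, sentence):
    """Fraction of source words the model can translate."""
    words = sentence.split()
    return sum(w in model for w in words) / len(words)

def self_train(model, monolingual, threshold=0.6):
    """Keep confident translations as pseudo-parallel sentence pairs."""
    pseudo = []
    for src in monolingual:
        if confidence(model, src) >= threshold:
            pseudo.append((src, " ".join(translate(model, src))))
    return pseudo  # would be added to the bitext before retraining

model = {"bonjour": "hello", "monde": "world"}
pairs = self_train(model, ["bonjour monde", "texte inconnu ici"])
```

Only the fully covered sentence survives the confidence filter; in a real system the loop would iterate, retraining the translation model on the enlarged corpus each round.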
11.
We describe a novel approach to MT that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and
EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with
conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show
that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results
from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two
distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.
12.
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required
amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available
comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus and the translations
are used as queries to conduct information retrieval from the target-language side of the comparable corpus. Simple filters
are then used to score the SMT output and the IR-returned sentence with the filter score defining the degree of similarity
between the two. Using the SMT system output also offers the chance to correct one of the common errors, through sentence-tail removal. The approach was applied to Arabic–English and French–English systems using comparable news corpora and considerable
improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used
to make the queries, strengthening the claim of applicability of the approach for languages and domains with limited parallel
corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is
easier to implement and gives equally good improvements.
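The mining pipeline described above can be sketched end-to-end. The Jaccard-overlap filter, the toy lexicon, and the threshold below are illustrative assumptions standing in for the paper's SMT system, IR engine, and filters:

```python
# Toy sketch of parallel-sentence mining from comparable corpora:
# translate the source side, use the translation as a query against the
# target side, and keep the best match if a simple similarity filter
# passes. The lexicon, filter, and threshold are invented for the sketch.

def overlap(a, b):
    """Jaccard word overlap, used here as the similarity filter."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def mine_pairs(source_sents, target_sents, translate, threshold=0.4):
    """Return (source, target) pairs whose translated query matches well."""
    pairs = []
    for src in source_sents:
        query = translate(src)                              # SMT step
        best = max(target_sents, key=lambda t: overlap(query, t))  # IR step
        if overlap(query, best) >= threshold:               # filter step
            pairs.append((src, best))
    return pairs

lexicon = {"le": "the", "chat": "cat", "dort": "sleeps"}
def toy_translate(sent):
    return " ".join(lexicon.get(w, w) for w in sent.split())

mined = mine_pairs(["le chat dort"],
                   ["the cat sleeps now", "stock markets fell"],
                   toy_translate)
```

The key property the abstract emphasises survives even in this toy form: the query only needs to be good enough for retrieval and filtering, not a polished translation.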
13.
In this paper, we present a hybrid architecture for a system combination model that works in three layers to achieve better translated outputs. In the first layer, we have various machine translation models (e.g. Neural Machine Translation (NMT) and Statistical Machine Translation (SMT)). In the second layer, the outputs of these models are combined using both a statistical approach and a neural-based approach, to leverage the advantages of both kinds of system. Since each combination approach has its own strengths and limitations, instead of selecting an individual combined system’s output as the final one, the final layer produces the target output by assigning appropriate preferences to the SMT-based and neural-based combinations. While several system combination techniques exist, none uses preferences from multiple system combination models (statistical and neural) for better assembly. Empirical results show improved performance in terms of translation accuracy. Our experiments on two benchmark datasets of English–Hindi and Hindi–English pairs show that the proposed model performs significantly better than the participating models, and outperforms state-of-the-art machine translation combination systems by 6.10 and 4.69 BLEU points for English-to-Hindi and Hindi-to-English, respectively.
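The preference-weighted final layer described above might be sketched as follows; the system names, confidence scores, and preference weights are all invented for illustration:

```python
# Toy sketch of a preference-weighted final combination layer: each
# second-layer combiner proposes a translation with a confidence score,
# and a preference weight per combiner decides which output wins.
# Names, scores, and weights below are illustrative assumptions.

def combine(candidates, preferences):
    """candidates: {system: (translation, confidence)}.
    Pick the translation with the highest preference-weighted confidence."""
    best_system = max(candidates, key=lambda s: preferences[s] * candidates[s][1])
    return candidates[best_system][0]

output = combine(
    {"smt_comb": ("vah ghar jaata hai", 0.7),
     "nmt_comb": ("vah ghar jaa raha hai", 0.6)},
    {"smt_comb": 0.4, "nmt_comb": 0.8},
)
```

In the paper's setting the preferences would themselves be learned rather than fixed; this sketch only shows how preferences arbitrate between the statistical and neural combinations.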
14.
Research on core technologies in a multi-strategy Chinese–Japanese machine translation system
Multi-strategy machine translation is a current direction in the development of MT systems. This paper describes the core technologies and algorithms used by the translation subsystems of a multi-strategy Chinese–Japanese machine translation system, including a Chinese analysis subsystem that uses lexical analysis, syntactic parsing and semantic role labelling; a translation-memory-based MT subsystem that employs a double-indexing technique; an example-based MT subsystem that uses syntactic tree fragments as templates; and an MT subsystem that integrates valency patterns and segment analysis. Test results show that the translation-memory subsystem is highly efficient; the example-based subsystem achieves 99% accuracy in a closed test of 1,559 sentences and 85% accuracy in an open test of 1,500 sentences; and the valency-pattern subsystem achieves 89% accuracy on a test of 3,059 sentences.
15.
Mikel L. Forcada Mireia Ginestí-Rosell Jacob Nordfalk Jim O’Regan Sergio Ortiz-Rojas Juan Antonio Pérez-Ortiz Felipe Sánchez-Martínez Gema Ramírez-Sánchez Francis M. Tyers 《Machine Translation》2011,25(2):127-144
Apertium is a free/open-source platform for rule-based machine translation. It is being widely used to build machine translation
systems for a variety of language pairs, especially in those cases (mainly with related-language pairs) where shallow transfer
suffices to produce good quality translations, although it has also proven useful in assimilation scenarios with more distant
pairs involved. This article summarises the Apertium platform: the translation engine, the encoding of linguistic data, and
the tools developed around the platform. The present limitations of the platform and the challenges posed for the coming years
are also discussed. Finally, evaluation results for some of the most active language pairs are presented. An appendix describes
Apertium as a free/open-source project.
16.
Marta R. Costa-jussà José A. R. Fonollosa Enric Monte 《Language Resources and Evaluation》2011,45(2):165-179
Statistical machine translation (SMT) is based on alignment models which learn from bilingual corpora the word correspondences
between source and target language. These models are assumed to be capable of learning reorderings. However, the difference
in word order between two languages is one of the most important sources of errors in SMT. In this paper, we show that SMT
can take advantage of inductive learning in order to solve reordering problems. Given a word alignment, we identify those
pairs of consecutive source blocks (sequences of words) whose translation is swapped, i.e. those blocks which, if swapped,
generate a correct monotonic translation. Afterwards, we classify these pairs into groups, following recursively a co-occurrence
block criterion, in order to infer reorderings. Inside the same group, we allow new internal combinations in order to generalize the reordering to unseen pairs of blocks. Then, we identify the pairs of blocks in the source corpora (both training and test)
which belong to the same group. We swap them and we use the modified source training corpora to realign and to build the final
translation system. We have evaluated our reordering approach both in alignment and translation quality. In addition, we have
used two state-of-the-art SMT systems: a phrase-based and an N-gram-based one. Experiments are reported on the EuroParl task, showing improvements of almost 1 point in the standard MT evaluation metrics (mWER and BLEU).
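The swap-detection step described above can be illustrated on a word alignment. Treating each source word as its own block is a simplifying assumption of this sketch; the paper works with multi-word blocks and a co-occurrence grouping step on top:

```python
# Sketch of swap detection from a word alignment: find pairs of
# consecutive source positions whose translations appear in reversed
# order - the candidates which, if swapped on the source side, would
# yield a monotonic translation. Single-word "blocks" are a toy
# simplification of the block-based method described above.

def swapped_pairs(alignment):
    """alignment[i] = target position of source word i."""
    return [(i, i + 1)
            for i in range(len(alignment) - 1)
            if alignment[i] > alignment[i + 1]]

# e.g. English -> Spanish "red house" -> "casa roja":
# the adjective and noun cross, so positions 1 and 2 are swap candidates.
alignment = [0, 2, 1]
candidates = swapped_pairs(alignment)
```

A monotone alignment produces no candidates, which matches the intuition that reordering only needs to act where the alignment crosses.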
17.
Example-Based Machine Translation (EBMT) is a corpus based approach to Machine Translation (MT), that utilizes the translation
by analogy concept. In our EBMT system, translation templates are extracted automatically from bilingual aligned corpora by
substituting the similarities and differences in pairs of translation examples with variables. In the earlier versions of
the discussed system, the translation results were solely ranked using confidence factors of the translation templates. In
this study, we introduce an improved ranking mechanism that dynamically learns from user feedback. When a user, such as a
professional human translator, submits his evaluation of the generated translation results, the system learns “context-dependent
co-occurrence rules” from this feedback. The newly learned rules are later consulted, while ranking the results of the subsequent
translations. Through successive translation-evaluation cycles, we expect that the output of the ranking mechanism complies
better with user expectations, listing the more preferred results in higher ranks. We also present the evaluation of our ranking
method which uses the precision values at top results and the BLEU metric.
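The template-extraction step described above (replacing the differences between a pair of translation examples with variables) can be sketched for the simple case of a single contiguous difference; the variable symbol and the example sentences are invented for illustration:

```python
# Toy template extraction from a pair of examples: keep the shared
# context (common prefix and suffix) and replace the single differing
# segment with a variable. Real systems do this on both sides of the
# bilingual pair and handle multiple differences; this sketch does not.

def generalise(sent_a, sent_b, var="X"):
    """Generalise two sentences that differ in one contiguous segment."""
    a, b = sent_a.split(), sent_b.split()
    # longest common prefix
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    # longest common suffix that does not overlap the prefix
    j = 0
    while j < min(len(a), len(b)) - i and a[-1 - j] == b[-1 - j]:
        j += 1
    template = a[:i] + [var] + (a[len(a) - j:] if j else [])
    return " ".join(template)

template = generalise("I read the book", "I read the letter")
```

Run on the bilingual pair as well, the matched variables on the two sides yield a translation template plus a lexical correspondence for the substituted segment.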
18.
The CMU-EBMT machine translation system
Ralf D. Brown 《Machine Translation》2011,25(2):179-195
19.
In this article, the first public release of GREAT as an open-source, statistical machine translation (SMT) software toolkit
is described. GREAT is based on a bilingual language modelling approach for SMT, which is so far implemented for n-gram models based on the framework of stochastic finite-state transducers. The use of finite-state models is motivated by
their simplicity, their versatility, and the fact that they present a lower computational cost, if compared with other more
expressive models. Moreover, if translation is assumed to be a subsequential process, finite-state models are enough for modelling
the existing relations between a source and a target language. GREAT includes some characteristics usually present in state-of-the-art
SMT, such as phrase-based translation models or a log-linear framework for local features. Experimental results on a well-known
corpus such as Europarl are reported in order to validate this software. A competitive translation quality is achieved, yet
using both a lower number of model parameters and a lower response time than the widely-used, state-of-the-art SMT system
Moses.
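The bilingual n-gram modelling idea described above treats aligned source–target tuples as single tokens of a joint language model. A toy count-based sketch (bigrams only, with invented data; a real system would smooth the counts and compile them into finite-state transducers):

```python
# Toy bilingual bigram counts: each aligned (source_word, target_word)
# tuple is treated as one token, and an n-gram model is estimated over
# tuple sequences. Data and the sentence-start marker are illustrative.

from collections import Counter

def tuple_bigrams(bitext):
    """bitext: list of sentences, each a list of (src, tgt) tuples."""
    grams = Counter()
    for sent in bitext:
        padded = [("<s>", "<s>")] + sent  # sentence-start padding
        for prev, cur in zip(padded, padded[1:]):
            grams[(prev, cur)] += 1
    return grams

grams = tuple_bigrams([[("la", "the"), ("casa", "house")]])
```

Because the tuples jointly encode source and target, decoding with such a model is a search over tuple sequences whose source sides spell out the input sentence, which is what makes the finite-state formulation natural.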