Similar literature
 20 similar documents found; search took 500 ms
1.
This paper presents an extended, harmonised account of our previous work on combining subsentential alignments from phrase-based statistical machine translation (SMT) and example-based MT (EBMT) systems to create novel hybrid data-driven systems capable of outperforming the baseline SMT and EBMT systems from which they were derived. In previous work, we demonstrated that while an EBMT system is capable of outperforming a phrase-based SMT (PBSMT) system constructed from freely available resources, a hybrid ‘example-based’ SMT system incorporating marker chunks and SMT subsentential alignments is capable of outperforming both baseline translation models for French–English translation. In this paper, we show that similar gains are to be had from constructing a hybrid ‘statistical’ EBMT system. Unlike the previous research, here we use the Europarl training and test sets, which are fast becoming the standard data in the field. On these data sets, while all hybrid ‘statistical’ EBMT variants still fall short of the quality achieved by the baseline PBSMT system, we show that adding the marker chunks to create a hybrid ‘example-based’ SMT system outperforms the two baseline systems from which it is derived. Furthermore, we provide further evidence in favour of hybrid systems by adding an SMT target-language model to the EBMT system, and demonstrate that this too has a positive effect on translation quality. We also show that many of the subsentential alignments derived from the Europarl corpus are created by either the PBSMT or the EBMT system, but not by both. In sum, therefore, despite the obvious convergence of the two paradigms, the crucial differences between SMT and EBMT contribute positively to the overall translation quality. 
The central thesis of this paper is that any researcher who continues to develop an MT system using either of these approaches will benefit further from integrating the advantages of the other model; dogged adherence to one approach will lead to inferior systems being developed.

2.
This paper summarizes ongoing efforts to provide software infrastructure (and methodology) for open-source machine translation that combines a deep semantic transfer approach with advanced stochastic models. The resulting infrastructure combines precise grammars for parsing and generation, a semantic-transfer based translation engine and stochastic controllers. We provide both a qualitative and quantitative experience report from instantiating our general architecture for Japanese–English MT using only open-source components, including HPSG-based grammars of English and Japanese.

3.
According to the system theory of von Bertalanffy (1968), a “system” is an entity that can be distinguished from its environment and that consists of several parts. System theory investigates the role of the parts, their interaction and the relation of the whole with its environment. System theory of the second order examines how an observer relates to the system. This paper traces some of the recent discussion of example-based machine translation (EBMT) and compares a number of EBMT and statistical MT systems. It is found that translation examples are linguistic systems themselves that consist of words, phrases and other constituents. Two properties of Luhmann’s (2002) system theory are discussed in this context: EBMT has focussed on the properties of structures suited for translation and the design of their reentry points, and SMT develops connectivity operators which select the most likely continuations of structures. While technically the SMT and EBMT approaches complement each other, the principal distinguishing characteristic results from different sets of values which SMT and EBMT followers prefer.

4.
We propose a novel approach to cross-lingual language model and translation lexicon adaptation for statistical machine translation (SMT) based on bilingual latent semantic analysis. Bilingual LSA enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bilingual LSA framework, model adaptation can be performed by, first, inferring the topic posterior distribution of the source text and then applying the inferred distribution to an n-gram language model of the target language and translation lexicon via marginal adaptation. The background phrase table is enhanced with the additional phrase scores computed using the adapted translation lexicon. The proposed framework also features rapid bootstrapping of LSA models for new languages based on a source LSA model of another language. Our approach is evaluated on the Chinese–English MT06 test set using the medium-scale SMT system and the GALE SMT system measured in BLEU and NIST scores. Improvement in both scores is observed on both systems when the adapted language model and the adapted translation lexicon are applied individually. When the adapted language model and the adapted translation lexicon are applied simultaneously, the gain is additive. At the 95% confidence interval of the unadapted baseline system, the gain in both scores is statistically significant using the medium-scale SMT system, while the gain in the NIST score is statistically significant using the GALE SMT system.  相似文献   

5.
In the last decade the dominant models of MT have been data-driven or corpus-based. Of the two main trends, statistical machine translation and example-based machine translation (EBMT), the latter is much less clearly defined. In a review of the recently published collection edited by Michael Carl and Andy Way, this essay surveys the basic processes, methods, main problems and tasks of EBMT, and attempts to provide a definition of the essence of EBMT in comparison with statistical MT and traditional rule-based MT. Recent Advances in Example-based Machine Translation. Edited by Michael Carl and Andy Way. Dordrecht: Kluwer Academic Publishers, 2003. xxxi, 482pp. (Text, Speech and Language Technology, vol. 21) ISBN: 1-4020-1400-7 (hardback), 1-4020-1401-5 (paperback).  相似文献   

6.
This paper describes an example-based machine translation (EBMT) method based on tree–string correspondence (TSC) and statistical generation. In this method, the translation example is represented as a TSC, which is a triple consisting of a parse tree in the source language, a string in the target language, and the correspondence between the leaf node of the source-language tree and the substring of the target-language string. For an input sentence to be translated, it is first parsed into a tree. Then the TSC forest which best matches the input tree is searched for. Finally the translation is generated using a statistical generation model to combine the target-language strings of the TSCs. The generation model consists of three features: the semantic similarity between the tree in the TSC and the input tree, the translation probability of translating the source word into the target word, and the language-model probability for the target-language string. Based on the above method, we build an English-to-Chinese MT system. Experimental results indicate that the performance of our system is comparable with phrase-based statistical MT systems.  相似文献   
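As a rough illustration of the data structure described in this abstract, the triple and the three-feature generation model could be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the class name `TreeStringCorrespondence`, the field names, the nested-tuple tree encoding, and the weighted log-linear combination in `generation_score` are all hypothetical choices made for illustration. The abstract names the three features but does not say how they are combined.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TreeStringCorrespondence:
    """One translation example: (source parse tree, target string, correspondence)."""
    # Assumed encoding: a parse tree as nested tuples, e.g. ("S", ("NP", "he"), ("VP", "runs")).
    source_tree: tuple
    # Target-language string, stored as a tuple of tokens.
    target: Tuple[str, ...]
    # Maps a source leaf index to a half-open [start, end) token span in the target.
    leaf_to_span: Dict[int, Tuple[int, int]]

    def target_substring(self, leaf_index: int) -> Tuple[str, ...]:
        """Return the target tokens that correspond to one source leaf."""
        start, end = self.leaf_to_span[leaf_index]
        return self.target[start:end]


def generation_score(similarity: float,
                     translation_prob: float,
                     lm_prob: float,
                     weights: Tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
    """Combine the three features named in the abstract.

    Assumption: a weighted sum of log feature values (a log-linear model).
    Any feature value of zero or below yields negative infinity.
    """
    features = (similarity, translation_prob, lm_prob)
    if min(features) <= 0.0:
        return float("-inf")
    return sum(w * math.log(f) for w, f in zip(weights, features))


# Hypothetical usage with a toy English-to-Chinese example.
example = TreeStringCorrespondence(
    source_tree=("S", ("NP", "he"), ("VP", "runs")),
    target=("他", "跑步"),
    leaf_to_span={0: (0, 1), 1: (1, 2)},
)
print(example.target_substring(1))
print(generation_score(0.8, 0.6, 0.1))
```

In a full system the candidate with the highest score among the matching TSC combinations would be chosen; the search over the TSC forest itself is not shown here.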

7.
Andy Way. Machine Translation, 2010, 24(3-4): 177-208
A very useful service to the example-based machine translation (EBMT) community was provided by Harold Somers in his summary article which appeared in 1999, and was extended in our 2003 book Recent advances in example-based machine translation. As well as providing a comprehensive review of the paradigm, Somers gives a categorisation of the different instantiations of the basic model. In this paper, we provide a complementary view to that of Somers. Today’s EBMT systems learn by analogy. Perhaps even more so than statistical models of translation, one might view these systems as being incapable of forgetting. We researchers and system developers, on the other hand, often forget or are ignorant of techniques and models presented in prior research. The primary aim of this paper is to try to ensure that golden nuggets from past (now quite distantly so) EBMT research papers are gathered together and presented here for a new generation of researchers keen to operate in the paradigm, especially given the spate of recent open-source releases of EBMT systems. We revisit the findings of the previous main research papers, relate them to some of the major research efforts which have taken place since then, and examine especially the prophecies given in the older pieces of work to see the extent to which they have been borne out in the newer research. Given the strong convergence between the leading corpus-based approaches to MT, especially since the introduction of phrase-based statistical MT, a further hope is that these findings may also prove useful to researchers and developers in other areas of MT.  相似文献   

8.
The last few years have witnessed an increasing interest in hybridizing surface-based statistical approaches and rule-based symbolic approaches to machine translation (MT). Much of that work is focused on extending statistical MT systems with symbolic knowledge and components. In the brand of hybridization discussed here, we go in the opposite direction: adding statistical bilingual components to a symbolic system. Our base system is Generation-heavy machine translation (GHMT), a primarily symbolic asymmetrical approach that addresses the issue of Interlingual MT resource poverty in source-poor/target-rich language pairs by exploiting symbolic and statistical target-language resources. GHMT’s statistical components are limited to target-language models, which arguably makes it a simple form of a hybrid system. We extend the hybrid nature of GHMT by adding statistical bilingual components. We also describe the details of retargeting it to Arabic–English MT. The morphological richness of Arabic brings several challenges to the hybridization task. We conduct an extensive evaluation of multiple system variants. Our evaluation shows that this new variant of GHMT—a primarily symbolic system extended with monolingual and bilingual statistical components—has a higher degree of grammaticality than a phrase-based statistical MT system, where grammaticality is measured in terms of correct verb-argument realization and long-distance dependency translation.  相似文献   

9.
The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the lack of reliability of SMT outputs, the quality of which varies to a great extent. A critical piece of information would be for MT systems to automatically assess their output translations with automatically derived quality measures. Predicting quality measures was indeed the goal of a shared task at the Workshop on SMT in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the latter part, we reexamine the shared task data and protocol and show that several factors actually contributed to the difficulty of the task, and discuss alternative evaluation designs.

10.
Statistical machine translation systems are usually trained on large amounts of bilingual text (used to learn a translation model), and also large amounts of monolingual text in the target language (used to train a language model). In this article we explore the use of semi-supervised model adaptation methods for the effective use of monolingual data from the source language in order to improve translation quality. We propose several algorithms with this aim, and present the strengths and weaknesses of each one. We present detailed experimental evaluations on the French–English EuroParl data set and on data from the NIST Chinese–English large-data track. We show a significant improvement in translation quality on both tasks.

11.
We describe a novel approach to MT that combines the strengths of the two leading corpus-based approaches: Phrasal SMT and EBMT. We use a syntactically informed decoder and reordering model based on the source dependency tree, in combination with conventional SMT models to incorporate the power of phrasal SMT with the linguistic generality available in a parser. We show that this approach significantly outperforms a leading string-based Phrasal SMT decoder and an EBMT system. We present results from two radically different language pairs, and investigate the sensitivity of this approach to parse quality by using two distinct parsers and oracle experiments. We also validate our automated BLEU scores with a small human evaluation.

12.
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus and the translations are used as queries to conduct information retrieval from the target-language side of the comparable corpus. Simple filters are then used to score the SMT output and the IR-returned sentence with the filter score defining the degree of similarity between the two. Using SMT system output gives the benefit of trying to correct one of the common errors by sentence tail removal. The approach was applied to Arabic–English and French–English systems using comparable news corpora and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim of applicability of the approach for languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is easier to implement and gives equally good improvements.  相似文献   

13.
In this paper, we present a hybrid architecture for developing a system combination model that works in three layers to achieve better translated outputs. In the first layer, we have various machine translation models (e.g. Neural Machine Translation (NMT) and Statistical Machine Translation (SMT)). In the second layer, the outputs of these models are combined, using both a statistical approach and a neural approach, to leverage the advantages of both kinds of system (i.e. SMT and NMT). Each approach, however, has its own advantages and limitations. So, instead of selecting an individual combined system’s output as the final one, we pass these outputs to the final layer, which produces the target output by assigning appropriate preferences to the SMT-based and neural-based combinations. Although some techniques for system combination exist, no existing approach uses preferences from different system combination models (statistical and neural) to assemble a better output. Empirical results show improved performance in terms of translation accuracy. Our experiments on two benchmark datasets of English–Hindi and Hindi–English pairs show that the proposed model performs significantly better than the participating models. The proposed model is also significantly better than state-of-the-art machine translation combination systems (6.10 and 4.69 BLEU points for English-to-Hindi and Hindi-to-English, respectively).

14.
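Research on core technologies in a multi-strategy Chinese–Japanese machine translation system   Cited by: 1 (self-citations: 0, citations by others: 1)
Multi-strategy machine translation is one of the directions in which today's machine translation systems are developing. This paper discusses the core technologies and algorithms used in the core translation subsystems of a multi-strategy Chinese–Japanese machine translation system. These comprise a Chinese analysis subsystem that uses lexical analysis, syntactic parsing and semantic role labelling; a translation-memory-based machine translation subsystem that uses a double-indexing technique; an example-pattern-based machine translation subsystem that uses syntax-tree fragments as templates; and a machine translation subsystem that combines valency patterns with segmentation analysis. Tests of the translation memory subsystem show that it is highly efficient; the example-pattern subsystem reaches 99% accuracy in a closed test of 1,559 sentences and 85% accuracy in an open test of 1,500 sentences; and the valency-pattern subsystem reaches 89% accuracy in a test of 3,059 sentences.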

15.
Apertium: a free/open-source platform for rule-based machine translation   Cited by: 1 (self-citations: 1, citations by others: 0)
Apertium is a free/open-source platform for rule-based machine translation. It is being widely used to build machine translation systems for a variety of language pairs, especially in those cases (mainly with related-language pairs) where shallow transfer suffices to produce good quality translations, although it has also proven useful in assimilation scenarios with more distant pairs involved. This article summarises the Apertium platform: the translation engine, the encoding of linguistic data, and the tools developed around the platform. The present limitations of the platform and the challenges posed for the coming years are also discussed. Finally, evaluation results for some of the most active language pairs are presented. An appendix describes Apertium as a free/open-source project.

16.
Statistical machine translation (SMT) is based on alignment models which learn from bilingual corpora the word correspondences between source and target language. These models are assumed to be capable of learning reorderings. However, the difference in word order between two languages is one of the most important sources of errors in SMT. In this paper, we show that SMT can take advantage of inductive learning in order to solve reordering problems. Given a word alignment, we identify those pairs of consecutive source blocks (sequences of words) whose translation is swapped, i.e. those blocks which, if swapped, generate a correct monotonic translation. Afterwards, we classify these pairs into groups, recursively following a block co-occurrence criterion, in order to infer reorderings. Inside the same group, we allow new internal combinations in order to generalize the reordering to unseen pairs of blocks. Then, we identify the pairs of blocks in the source corpora (both training and test) which belong to the same group. We swap them and use the modified source training corpora to realign and to build the final translation system. We have evaluated our reordering approach in terms of both alignment and translation quality. In addition, we have used two state-of-the-art SMT systems: a phrase-based and an N-gram-based one. Experiments are reported on the EuroParl task, showing improvements of roughly 1 point in the standard MT evaluation metrics (mWER and BLEU).
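The first step described in this abstract, finding pairs of consecutive source blocks whose translations are swapped, can be illustrated with a small Python sketch. It rests on simplifying assumptions that are not in the abstract: the word alignment is one-to-one and given as a list in which `alignment[i]` is the target position of source word `i`, and a block is a maximal run of source words whose target positions are consecutive. The function names are hypothetical, and the later grouping and generalization steps are not shown.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # half-open [start, end) range of source positions


def monotone_blocks(alignment: List[int]) -> List[Block]:
    """Split the source into maximal runs whose target positions increase by one."""
    blocks = []
    start = 0
    for i in range(1, len(alignment) + 1):
        # Close the current block at the end of the sentence or where the run breaks.
        if i == len(alignment) or alignment[i] != alignment[i - 1] + 1:
            blocks.append((start, i))
            start = i
    return blocks


def swapped_pairs(alignment: List[int]) -> List[Tuple[Block, Block]]:
    """Return adjacent source blocks (A, B) where B's translation directly precedes A's."""
    blocks = monotone_blocks(alignment)
    pairs = []
    for (a_start, a_end), (b_start, b_end) in zip(blocks, blocks[1:]):
        # B's last target position is immediately followed by A's first target position.
        if alignment[b_end - 1] + 1 == alignment[a_start]:
            pairs.append(((a_start, a_end), (b_start, b_end)))
    return pairs


def apply_swap(tokens: List[str], pair: Tuple[Block, Block]) -> List[str]:
    """Swap two adjacent source blocks so that the translation becomes monotonic."""
    (a_start, a_end), (b_start, b_end) = pair
    return tokens[:a_start] + tokens[b_start:b_end] + tokens[a_start:a_end] + tokens[b_end:]


# Hypothetical usage: source words 1-2 and source word 3 are translated in swapped order.
alignment = [0, 2, 3, 1, 4]
source = ["w0", "w1", "w2", "w3", "w4"]
for pair in swapped_pairs(alignment):
    print(pair, apply_swap(source, pair))
```

Real alignments are many-to-many and noisy, so a practical version would need to handle unaligned words and overlapping target spans.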

17.
Example-Based Machine Translation (EBMT) is a corpus-based approach to machine translation (MT) that uses the concept of translation by analogy. In our EBMT system, translation templates are extracted automatically from bilingual aligned corpora by substituting the similarities and differences in pairs of translation examples with variables. In earlier versions of the system, the translation results were ranked solely by the confidence factors of the translation templates. In this study, we introduce an improved ranking mechanism that dynamically learns from user feedback. When a user, such as a professional human translator, submits an evaluation of the generated translation results, the system learns “context-dependent co-occurrence rules” from this feedback. The newly learned rules are later consulted while ranking the results of subsequent translations. Through successive translation-evaluation cycles, we expect the output of the ranking mechanism to comply better with user expectations, listing the more preferred results in higher ranks. We also present an evaluation of our ranking method, which uses the precision values at top results and the BLEU metric.
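The idea of extracting a translation template by replacing the differing parts of two translation examples with variables can be sketched as follows. This is only an illustration under assumptions, not the system's actual learning algorithm: it uses Python's standard `difflib.SequenceMatcher` to find the shared and differing token spans, replaces only the differences with variables named `X1`, `X2`, and so on, and pairs source and target variables by position. The function names and the English–French example are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def generalize(tokens_a: List[str], tokens_b: List[str]) -> Tuple[List[str], list]:
    """Replace the spans where two token sequences differ with variables X1, X2, ..."""
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    template = []
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared tokens stay in the template as they are.
            template.extend(tokens_a[i1:i2])
        else:
            # A differing span becomes one variable; remember what it stood for.
            differences.append((tokens_a[i1:i2], tokens_b[j1:j2]))
            template.append(f"X{len(differences)}")
    return template, differences


def extract_template(example_1: Tuple[str, str],
                     example_2: Tuple[str, str]) -> Optional[Tuple[List[str], List[str]]]:
    """Build a (source template, target template) pair from two translation examples.

    Assumption: variables are paired by order of appearance, so a template is only
    returned when both sides contain the same number of differing spans.
    """
    src_template, src_diffs = generalize(example_1[0].split(), example_2[0].split())
    tgt_template, tgt_diffs = generalize(example_1[1].split(), example_2[1].split())
    if len(src_diffs) != len(tgt_diffs):
        return None
    return src_template, tgt_template


# Hypothetical usage with two toy English-French examples.
print(extract_template(("I bought a book", "j'ai acheté un livre"),
                       ("I bought a pen", "j'ai acheté un stylo")))
```

Pairing variables by position is the weakest assumption here; languages with different word order would need an explicit correspondence between source and target differences.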

18.
The CMU-EBMT machine translation system   Cited by: 1 (self-citations: 1, citations by others: 0)

19.
In this article, the first public release of GREAT as an open-source, statistical machine translation (SMT) software toolkit is described. GREAT is based on a bilingual language modelling approach for SMT, which is so far implemented for n-gram models based on the framework of stochastic finite-state transducers. The use of finite-state models is motivated by their simplicity, their versatility, and the fact that they present a lower computational cost, if compared with other more expressive models. Moreover, if translation is assumed to be a subsequential process, finite-state models are enough for modelling the existing relations between a source and a target language. GREAT includes some characteristics usually present in state-of-the-art SMT, such as phrase-based translation models or a log-linear framework for local features. Experimental results on a well-known corpus such as Europarl are reported in order to validate this software. A competitive translation quality is achieved, yet using both a lower number of model parameters and a lower response time than the widely-used, state-of-the-art SMT system Moses.  相似文献   

20.