Survey of data-selection methods in statistical machine translation |
| |
Authors: | Sauleh Eetemadi, William Lewis, Kristina Toutanova, Hayder Radha |
| |
Affiliation: | 1. Michigan State University, East Lansing, USA; 2. Microsoft Research, Redmond, USA |
| |
Abstract: | Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e. data-to-models takes an inordinate amount of time). Moreover, the training data has a wide quality spectrum. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions and search method. |
| |
This article is indexed in SpringerLink and other databases. |
|