Survey of data-selection methods in statistical machine translation

Authors: Sauleh Eetemadi, William Lewis, Kristina Toutanova, Hayder Radha

Affiliation: 1. Michigan State University, East Lansing, USA; 2. Microsoft Research, Redmond, USA

Abstract: Statistical machine translation has seen significant improvements in quality over the past several years. The single biggest factor in this improvement has been the accumulation of ever larger stores of data. We now find ourselves, however, the victims of our own success, in that it has become increasingly difficult to train on such large sets of data, due to limitations in memory, processing power, and ultimately, speed (i.e., going from data to models takes an inordinate amount of time). Moreover, the training data varies widely in quality. A variety of methods for data cleaning and data selection have been developed to address these issues. Each of these methods employs a search or filtering algorithm to select a subset of the data, given a defined set of feature functions. In this paper we provide a comparative overview of research in this area based on application scenario, feature functions, and search method.

Keywords:
This article is indexed in SpringerLink and other databases.