Data augmentation for low-resource languages NMT guided by constrained sampling
Authors: Mieradilijiang Maimaiti, Yang Liu, Huanbo Luan, Maosong Sun
Abstract: Data augmentation (DA) is widely used in text generation tasks. In machine translation, and especially in low-resource language scenarios, many DA methods have been proposed. The most common ones build a pseudo-corpus by randomly sampling, omitting, or replacing some words in the text. However, these approaches can hardly guarantee the quality of the augmented data. In this study, we augment the corpus with a constrained sampling method. We also build an evaluation framework to select higher-quality data after augmentation; specifically, a discriminator submodel is used to mitigate syntactic and semantic errors to some extent. Experimental results show that our augmentation method consistently outperforms all previous state-of-the-art methods on both small and large-scale corpora, in eight language pairs from four corpora, by 2.38–4.18 BLEU (bilingual evaluation understudy) points.
Keywords: artificial intelligence, constrained sampling, data augmentation, low-resource languages, natural language processing, neural machine translation
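The baseline operations the abstract refers to (randomly omitting or replacing words to build a pseudo-corpus) can be illustrated with a minimal sketch. This is not the authors' constrained sampling method or their discriminator; it only shows the kind of unconstrained noise that the abstract says cannot guarantee data quality. The function names (`drop_words`, `replace_words`, `augment`), the probability `p`, and the choice to perturb only the source side are assumptions made for illustration.

```python
import random


def drop_words(tokens, p, rng):
    """Omit each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def replace_words(tokens, p, vocab, rng):
    """Replace each token with a random vocabulary word with probability p."""
    return [rng.choice(vocab) if rng.random() < p else t for t in tokens]


def augment(pairs, p=0.1, seed=0):
    """Build a pseudo-corpus from (source, target) sentence pairs.

    Hypothetical helper: each pair yields one word-dropped and one
    word-replaced source sentence, both paired with the original target.
    """
    rng = random.Random(seed)
    # Replacement candidates are drawn from the source-side vocabulary.
    vocab = sorted({t for src, _ in pairs for t in src.split()})
    out = []
    for src, tgt in pairs:
        toks = src.split()
        out.append((" ".join(drop_words(toks, p, rng)), tgt))
        out.append((" ".join(replace_words(toks, p, vocab, rng)), tgt))
    return out
```

Because nothing in this sketch checks fluency or meaning, a replaced or dropped word can easily break the sentence, which is the weakness that motivates constraining the sampling and filtering the output afterwards.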