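A Chinese-Vietnamese neural machine translation method based on synonym data augmentation
Cite this article: 尤丛丛, 高盛祥, 余正涛, 毛存礼, 潘润海. 基于同义词数据增强的汉越神经机器翻译方法[J]. 计算机工程与科学, 2021, 43(8): 1497-1502. DOI: 10.3969/j.issn.1007-130X.2021.08.019
Authors: 尤丛丛, 高盛祥, 余正涛, 毛存礼, 潘润海
Affiliation: (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan)
Funding: National Key Research and Development Program of China (2019QY1801, 2019QY1802, 2019QY1800); National Natural Science Foundation of China (61761026, 61972186, 61732005, 61672271, 61762056); Yunnan High-Tech Industry Special Project (201606); Yunnan Provincial Natural Science Foundation (2018FB104); Kunming University of Science and Technology provincial talent training project (KKSY201703005)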
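Abstract (translated from the Chinese): The scarcity of Chinese-Vietnamese parallel corpora greatly limits the quality of Chinese-Vietnamese machine translation. Data augmentation is an effective way to improve it, and lexical-replacement augmentation based on bilingual dictionaries is currently a popular approach. Because Chinese-Vietnamese is a low-resource language pair, bilingual dictionaries are hard to obtain, whereas synonyms of low-frequency words are comparatively easy to obtain from monolingual word vectors. A data augmentation method based on synonym replacement of low-frequency words is therefore proposed. The method uses a small-scale parallel...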

Keywords: Chinese-Vietnamese; data augmentation; synonym replacement; neural machine translation
Received: 2020-02-18
Revised: 2020-07-12

A Chinese-Vietnamese neural machine translation method based on synonym data augmentation
YOU Cong-cong, GAO Sheng-xiang, YU Zheng-tao, MAO Cun-li, PAN Run-hai. A Chinese-Vietnamese neural machine translation method based on synonym data augmentation[J]. Computer Engineering & Science, 2021, 43(8): 1497-1502. DOI: 10.3969/j.issn.1007-130X.2021.08.019
Authors: YOU Cong-cong, GAO Sheng-xiang, YU Zheng-tao, MAO Cun-li, PAN Run-hai
Affiliation: (1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500; 2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China)
Abstract: The scarcity of Chinese-Vietnamese parallel corpora greatly limits the quality of Chinese-Vietnamese machine translation. Data augmentation is an effective way to improve Chinese-Vietnamese machine translation, and vocabulary replacement based on bilingual dictionaries is currently a popular augmentation method. Because Chinese-Vietnamese is a low-resource language pair, bilingual dictionaries are difficult to obtain, whereas synonyms of low-frequency words are comparatively easy to obtain from monolingual word vectors. We therefore propose a data augmentation method based on synonym replacement of low-frequency words. The method starts from a small-scale parallel corpus. First, monolingual word vectors are learned and used to build a synonym list for the low-frequency words on one side of the corpus. Second, the low-frequency words are replaced with their synonyms. Third, a language model is used to filter the replaced sentences. Finally, each retained sentence is paired with the corresponding sentence in the other language to obtain an extended parallel corpus. Chinese-Vietnamese translation experiments show that the proposed method improves the BLEU score by 1.8 and 1.1 over the baseline and the back-translation method, respectively.
Keywords:Chinese-Vietnamese  data augmentation  synonym substitution  neural machine translation  
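
The four steps described in the abstract (synonym list from monolingual word vectors, replacement of low-frequency words, language-model filtering, re-pairing with the other side) can be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the authors' implementation: the function names, the frequency cutoff `max_count`, the similarity cutoff `min_sim`, the hand-written two-dimensional vectors, the bigram-coverage stand-in for a language model, and the placeholder target tokens are all hypothetical.

```python
from collections import Counter
import math


def low_frequency_words(sentences, max_count=1):
    """Words occurring at most `max_count` times in the tokenized corpus."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c <= max_count}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonyms(word, vectors, top_k=1, min_sim=0.9):
    """Nearest neighbours of `word` in the monolingual embedding space."""
    if word not in vectors:
        return []
    scored = sorted(
        ((cosine(vectors[word], vec), other)
         for other, vec in vectors.items() if other != word),
        reverse=True,
    )
    return [w for sim, w in scored[:top_k] if sim >= min_sim]


def augment(pairs, vectors, lm_score, threshold, max_count=1):
    """Replace low-frequency source words with synonyms, keep candidates the
    language model scores at or above `threshold`, and pair each kept
    sentence with the unchanged target sentence."""
    rare = low_frequency_words([src for src, _ in pairs], max_count)
    new_pairs = []
    for src, tgt in pairs:
        for i, w in enumerate(src):
            if w not in rare:
                continue
            for syn in synonyms(w, vectors):
                candidate = src[:i] + [syn] + src[i + 1:]
                if lm_score(candidate) >= threshold:
                    new_pairs.append((candidate, tgt))
    return new_pairs


# Toy data (hypothetical). Target sides are placeholders, not real Vietnamese.
pairs = [
    (["we", "like", "films"], ["TGT_1"]),
    (["we", "like", "music"], ["TGT_2"]),
    (["we", "like", "music"], ["TGT_2"]),
]
vectors = {
    "films": [1.0, 0.0],
    "movies": [0.98, 0.1],
    "music": [0.0, 1.0],
}
KNOWN_BIGRAMS = {("we", "like"), ("like", "movies"), ("like", "music")}


def toy_lm_score(sentence):
    """Stand-in for a language model: share of bigrams seen in KNOWN_BIGRAMS."""
    bigrams = list(zip(sentence, sentence[1:]))
    if not bigrams:
        return 0.0
    return sum(b in KNOWN_BIGRAMS for b in bigrams) / len(bigrams)


augmented = augment(pairs, vectors, toy_lm_score, threshold=0.5)
print(augmented)
```

In this toy run only "films" is below the frequency cutoff, its nearest neighbour "movies" passes the similarity cutoff, and the candidate sentence passes the filter, so one new pair is added. A real setup would presumably replace the hand-written vectors with embeddings trained on monolingual text and the bigram check with a trained language model.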