Exploring a Low-Resource Iterative Paraphrase Generation Enhancement Method

Citation: ZHANG Lin, LIU Mingtong, ZHANG Yujie, XU Jin'an, CHEN Yufeng. Exploring a low-resource iterative paraphrase generation enhancement method[J]. CAAI Transactions on Intelligent Systems, 2022, 17(4): 680-687. DOI: 10.11992/tis.202106032

Authors: ZHANG Lin  LIU Mingtong  ZHANG Yujie  XU Jin'an  CHEN Yufeng

Affiliation: School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Abstract: Paraphrase generation aims to rewrite a given sentence, within the same language, into sentences that are semantically consistent but differently expressed. The success of current paraphrase generation models based on deep neural networks depends on large-scale parallel paraphrase corpora; when such a model faces a new language or a new domain, its performance drops sharply. To address this problem, we propose a low-resource iterative paraphrase generation enhancement method that makes maximal use of monolingual corpora and a small-scale parallel paraphrase corpus: the paraphrase generation model is trained iteratively and used to generate pseudo paraphrase data, which in turn strengthens the model. In addition, we propose a pseudo-data screening algorithm designed around three criteria (sentence fluency, semantic similarity, and expression diversity) to select high-quality pseudo paraphrase pairs for each round of iterative training. Experimental results on the public Quora dataset show that, using only 30% of the paraphrase corpus, the proposed method surpasses the baseline model on both semantic and diversity metrics, verifying its effectiveness.

Keywords: low-resource   iterative   paraphrase generation   data augmentation   screening algorithm   neural network model   encoder–decoder framework   attention mechanism
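The screening step described in the abstract can be sketched as a three-way threshold filter over candidate (source, paraphrase) pairs. The sketch below is illustrative only: the paper's actual scorers (e.g., a language model for fluency, sentence embeddings for semantic similarity) are abstracted as callables, and the toy demonstration substitutes placeholder metrics (word-overlap Jaccard for similarity, a constant fluency score). Threshold values are likewise hypothetical.

```python
from typing import Callable, List, Tuple

def filter_pseudo_data(
    pairs: List[Tuple[str, str]],
    fluency_fn: Callable[[str], float],          # higher = more fluent (e.g., negative LM perplexity)
    similarity_fn: Callable[[str, str], float],  # semantic closeness of source and paraphrase
    diversity_fn: Callable[[str, str], float],   # surface difference between source and paraphrase
    min_fluency: float,
    min_similarity: float,
    min_diversity: float,
) -> List[Tuple[str, str]]:
    """Keep only pseudo paraphrase pairs that pass all three quality thresholds."""
    return [
        (src, para)
        for src, para in pairs
        if fluency_fn(para) >= min_fluency
        and similarity_fn(src, para) >= min_similarity
        and diversity_fn(src, para) >= min_diversity
    ]

# Toy demonstration with placeholder scorers: word-overlap Jaccard stands in
# for embedding-based semantic similarity, and fluency is a constant.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

pairs = [
    ("how do i learn python", "what is the best way to learn python"),
    ("how do i learn python", "how do i learn python"),  # identical: no diversity
]
kept = filter_pseudo_data(
    pairs,
    fluency_fn=lambda s: 1.0,
    similarity_fn=jaccard,
    diversity_fn=lambda s, p: 1.0 - jaccard(s, p),
    min_fluency=0.5,
    min_similarity=0.1,
    min_diversity=0.3,
)
# The identical pair is rejected for lacking diversity; only the first survives.
assert kept == [pairs[0]]
```

In each training round, the surviving pairs would be appended to the parallel corpus before the model is retrained, so the thresholds directly control the quality/quantity trade-off of the pseudo data.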