Fund: Key Program of the National Natural Science Foundation of China (U1836221)
Received: 2020-12-29
Revised: 2021-03-13

End-to-end Speech Translation by Integrating Cross-modal Information
Citation: LIU Yu-Chen, ZONG Cheng-Qing. End-to-end Speech Translation by Integrating Cross-modal Information [J]. Journal of Software, 2023, 34(4): 1837-1849.
Authors: LIU Yu-Chen  ZONG Cheng-Qing
Affiliation: National Laboratory of Pattern Recognition (Institute of Automation, Chinese Academy of Sciences), Beijing 100190, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Abstract: Speech translation aims to translate speech in one language into speech or text in another language. Compared with cascaded pipeline systems, an end-to-end speech translation model offers lower latency, less error propagation, and a smaller storage footprint, and has therefore attracted increasing attention. However, an end-to-end model must not only process long speech sequences and extract acoustic information from them, but also learn the alignment between source-language speech and target-language text, which makes modeling difficult and degrades performance. This study proposes an end-to-end speech translation model with cross-modal information fusion, which deeply combines a text-based machine translation model with the speech translation model. To address the length mismatch between speech and text, a redundancy filter removes redundant acoustic information, making the length of the filtered acoustic state sequence close to that of the corresponding text. To ease learning of the alignment, parameter sharing is used to embed the whole machine translation model into the speech translation model, and the alignment between source speech and target text is learned through multi-task training. Experimental results on public speech translation datasets show that the proposed method significantly improves translation performance.
Keywords: speech translation  neural machine translation  end-to-end model  multi-modal learning
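The redundancy-filtering idea in the abstract — compressing the frame-level acoustic sequence toward the length of the corresponding text — can be illustrated with a minimal Python sketch. This is an assumption-laden illustration, not the paper's method: here a CTC-style greedy label per frame decides which frames are redundant (the abstract does not give the actual filter design), and the helper names `shrink` and `mean_vec` are hypothetical.

```python
# Hypothetical sketch of the "redundancy filter" described in the
# abstract: drop blank (redundant) frames and mean-pool consecutive
# frames that carry the same greedy label, so the output state
# sequence approaches the length of the token sequence.

def mean_vec(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(xs) / n for xs in zip(*vectors)]

def shrink(frames, labels, blank="<b>"):
    """Filter redundant acoustic frames.

    frames: list of frame-level feature vectors
    labels: per-frame greedy labels (same length as frames)
    Returns a shorter state sequence, one state per run of
    identical non-blank labels (as in CTC collapsing).
    """
    states, group, prev = [], [], None
    for vec, lab in zip(frames, labels):
        if lab == blank:              # blank frame: flush and discard
            if group:
                states.append(mean_vec(group))
                group = []
            prev = blank
            continue
        if lab != prev and group:     # token boundary: flush the group
            states.append(mean_vec(group))
            group = []
        group.append(vec)
        prev = lab
    if group:
        states.append(mean_vec(group))
    return states

# Five acoustic frames collapse to two states, matching a 2-token text.
frames = [[0.0], [1.0], [3.0], [0.0], [5.0]]
labels = ["<b>", "h", "h", "<b>", "i"]
print(shrink(frames, labels))  # [[2.0], [5.0]]
```

Once the acoustic states are roughly text-length, the abstract's second idea — embedding the machine translation model via parameter sharing and training both tasks jointly — becomes natural, since the shrunken acoustic sequence can be fed to the same decoder that consumes text encoder states.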