Key Technologies of Speech Intelligibility Based on CycleGAN

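Cite this article:	XIAO Jing, LIU Jia-Qi, LI Deng-Shi, ZHAO Lan-Xin, WANG Qian-Rui. Key Technologies of Speech Intelligibility Based on CycleGAN[J]. Computer Systems & Applications, 2022, 31(6): 1-9.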

Authors:	XIAO Jing, LIU Jia-Qi, LI Deng-Shi, ZHAO Lan-Xin, WANG Qian-Rui

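Affiliation:	National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China; School of Artificial Intelligence, Jianghan University, Wuhan 430056, China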

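Abstract:	Speech intelligibility enhancement is a perceptual enhancement technique for reproducing clear speech in noisy environments. Many studies enhance speech intelligibility through speaking style conversion (SSC). This approach relies only on the Lombard effect, so it performs poorly under strong noise interference. SSC also models the conversion of the fundamental frequency (F0) with a simple linear transform and maps only a few dimensions of the Mel-cepstral coefficients (MCEPs). Because F0 and MCEPs are two important features of speech, modeling them adequately is essential. This paper therefore takes a novel approach: it decomposes F0 into 10 dimensions with the continuous wavelet transform (CWT) to describe speech at different time scales, achieving effective F0 conversion, and it represents MCEPs with 20 dimensions for MCEP conversion. In addition, an iMetricGAN network is used to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed non-parallel speech style conversion method based on CycleGAN with CWT and iMetricGAN (NS-CiC) significantly improves speech intelligibility in strong noise environments in both objective and subjective evaluations.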
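The abstract does not say which linear transform the baseline SSC methods apply to F0. A common choice in voice conversion is log-Gaussian normalization, which shifts and rescales log-F0 so that its mean and variance match the target style. The sketch below assumes that choice; the function names `log_stats` and `convert_f0_linear` and the convention that unvoiced frames carry F0 = 0 are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def log_stats(f0):
    """Mean and standard deviation of log-F0 over voiced frames (f0 > 0)."""
    f0 = np.asarray(f0, dtype=float)
    log_f0 = np.log(f0[f0 > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0_linear(f0, src_stats, tgt_stats):
    """Assumed baseline: log-Gaussian normalized F0 conversion.

    Voiced frames are mapped linearly in the log domain so their
    statistics match the target style; unvoiced frames stay at 0.
    """
    f0 = np.asarray(f0, dtype=float)
    mu_src, sd_src = src_stats
    mu_tgt, sd_tgt = tgt_stats
    out = np.zeros_like(f0)
    voiced = f0 > 0
    out[voiced] = np.exp((np.log(f0[voiced]) - mu_src) / sd_src * sd_tgt + mu_tgt)
    return out

# Toy F0 tracks in Hz (made-up values for illustration only).
normal = np.array([100.0, 0.0, 110.0, 120.0, 0.0, 130.0])
lombard = np.array([150.0, 160.0, 0.0, 170.0, 180.0])
converted = convert_f0_linear(normal, log_stats(normal), log_stats(lombard))
```

Because a transform of this kind uses a single global mean and variance, it cannot change the shape of the contour at any particular time scale, which is the limitation the abstract attributes to the linear approach.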
Keywords:	deep learning; intelligibility enhancement; continuous wavelet transform; iMetricGAN; CycleGAN
Received:	2021/9/14
Revised:	2021/10/14
Key Technologies of Speech Intelligibility Based on CycleGAN

XIAO Jing, LIU Jia-Qi, LI Deng-Shi, ZHAO Lan-Xin, WANG Qian-Rui. Key Technologies of Speech Intelligibility Based on CycleGAN[J]. Computer Systems & Applications, 2022, 31(6): 1-9.

Authors:	XIAO Jing, LIU Jia-Qi, LI Deng-Shi, ZHAO Lan-Xin, WANG Qian-Rui

Affiliation:	National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China; Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China; School of Artificial Intelligence, Jianghan University, Wuhan 430056, China

Abstract:	Speech intelligibility enhancement is a perceptual enhancement technique that reproduces clear speech in noisy environments. Many studies enhance intelligibility through speaking style conversion (SSC), which relies solely on the Lombard effect and therefore performs poorly under strong noise interference. In addition, SSC models the conversion of fundamental frequency (F0) with a straightforward linear transform and maps only a few dimensions of the Mel-cepstral coefficients (MCEPs). As F0 and MCEPs are two critical features of speech, adequate modeling of both is essential. We therefore use the continuous wavelet transform (CWT) to decompose F0 into ten dimensions that describe speech at different time scales, enabling effective F0 conversion, and represent MCEPs with 20 dimensions for MCEP conversion. Furthermore, we use iMetricGAN to optimize speech intelligibility metrics in strong noise. Experimental results show that the proposed non-parallel speech style conversion method based on CycleGAN with CWT and iMetricGAN (NS-CiC) significantly improves speech intelligibility in strong noise environments in both objective and subjective evaluations.
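To make the ten-dimensional F0 representation more concrete, the following is a minimal sketch of a multi-scale CWT decomposition of a log-F0 contour. The abstract states only that CWT splits F0 into ten dimensions covering different time scales. The Mexican hat mother wavelet, the dyadic scales of 2^(i+1) frames, the linear interpolation over unvoiced frames, and the function names `prepare_log_f0` and `cwt_decompose` are all assumptions made for illustration and may differ from the authors' implementation.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet with unit L2 norm."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def prepare_log_f0(f0):
    """Interpolate log-F0 across unvoiced frames (f0 == 0) and normalize it.

    Assumes at least one voiced frame. CWT needs a continuous contour,
    so gaps are filled by linear interpolation in the log domain.
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    idx = np.arange(len(f0))
    log_f0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))
    return (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)

def cwt_decompose(signal, num_scales=10):
    """Return a (num_scales, len(signal)) array of wavelet components.

    Row i uses an assumed scale of 2**(i + 1) frames, so early rows follow
    fast, phone-level movement and later rows follow slow, phrase-level trends.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    components = np.empty((num_scales, n))
    for i in range(num_scales):
        scale = 2.0 ** (i + 1)
        # Truncate the wavelet support so the kernel never exceeds the signal.
        half = min(int(5 * scale), (n - 1) // 2)
        t = np.arange(-half, half + 1) / scale
        kernel = mexican_hat(t) / np.sqrt(scale)
        components[i] = np.convolve(signal, kernel, mode="same")
    return components

# Synthetic F0 track in Hz with an unvoiced gap (illustrative values only).
frames = np.arange(400)
f0 = 120.0 + 20.0 * np.sin(2 * np.pi * frames / 100.0)
f0[50:70] = 0.0
components = cwt_decompose(prepare_log_f0(f0))  # shape (10, 400)
```

Each row of `components` could then serve as one dimension of the F0 feature that a conversion network maps from normal to Lombard style. For utterances shorter than the largest scales, the truncated wavelet in the top rows behaves more like a smoothing window than a true band-pass filter.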

Keywords:	deep learning; intelligibility enhancement; continuous wavelet transform (CWT); iMetricGAN; CycleGAN
This article is indexed in Wanfang Data and other databases.