
Voice Conversion Based on i-vector With Variational Autoencoding Relativistic Standard Generative Adversarial Network
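Cite this article: Li Yanping, Cao Pan, Zuo Yutao, Zhang Yan, Qian Bo. Voice Conversion Based on i-vector With Variational Autoencoding Relativistic Standard Generative Adversarial Network[J]. Acta Automatica Sinica, 2022, 48(7): 1824-1833.
Authors: Li Yanping  Cao Pan  Zuo Yutao  Zhang Yan  Qian Bo
Affiliation: 1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003
Funding: Young Scientists Fund of the National Natural Science Foundation of China (61401227); National Natural Science Foundation of China (61872199, 61872424)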
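Abstract: This paper proposes a voice conversion method based on i-vectors and a variational autoencoding relativistic generative adversarial network, achieving high-quality many-to-many voice conversion under non-parallel text conditions. A well-performing voice conversion system must preserve the naturalness of the reconstructed speech while also ensuring that the converted speech accurately carries the target speaker's individual characteristics. First, to improve the naturalness of the synthesized speech, a relativistic generative adversarial network with better generative performance replaces the Wasserstein generative adversarial network in the model based on the variational autoencoding generative adversarial network. By constructing a relativistic discriminator, the discriminator's output is made to depend on the relative value between real and generated samples, which overcomes the unstable performance and slow convergence of the Wasserstein generative adversarial network. Second, to improve the speaker similarity of the converted speech, i-vectors, which are rich in speaker-specific information, are introduced at the decoding stage so that the speaker's individual characteristics are fully learned. Objective and subjective experiments show that, relative to the baseline model, the average mel-cepstral distortion of the converted speech decreases by 4.80%, the mean opinion score increases by 5.12%, and the ABX score increases by 8.60%, confirming that the method significantly improves both speech naturalness and speaker similarity and achieves high-quality voice conversion.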

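Keywords: voice conversion; relativistic generative adversarial network; i-vector; non-parallel text; variational autoencoder; many-to-many
Received: 2019-10-23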
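
The abstract states that the relativistic discriminator's output depends on the relative value between real and generated samples. A minimal sketch of that idea follows, assuming the commonly used relativistic standard GAN formulation in which each loss is a log-sigmoid of the difference between paired critic scores. The names `log_sigmoid`, `rsgan_losses`, `critic_real`, and `critic_fake` are hypothetical, and the paper's full training objective may contain further terms that are not shown here.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def rsgan_losses(critic_real, critic_fake):
    """Return (discriminator_loss, generator_loss) for paired critic scores.

    critic_real / critic_fake are raw (pre-sigmoid) critic outputs for a
    batch of real and generated samples. Assumption: the standard
    relativistic formulation, not necessarily the paper's exact objective.
    """
    pairs = list(zip(critic_real, critic_fake))
    if not pairs:
        raise ValueError("need at least one real/fake score pair")
    # Discriminator: push real scores above the paired fake scores.
    d_loss = -sum(log_sigmoid(r - f) for r, f in pairs) / len(pairs)
    # Generator: push fake scores above the paired real scores.
    g_loss = -sum(log_sigmoid(f - r) for r, f in pairs) / len(pairs)
    return d_loss, g_loss
```

Because only the difference of scores enters each loss, both losses equal log 2 whenever the critic cannot separate real from generated samples, regardless of the absolute score values.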

Voice Conversion Based on i-vector With Variational Autoencoding Relativistic Standard Generative Adversarial Network
Affiliation: 1. School of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003; 2. Jinling Institute of Technology, Nanjing 211169; 3. Nanjing Institute of Electronic Technology, Nanjing 210039
Abstract: This paper proposes a voice conversion method based on i-vectors and a variational autoencoding relativistic standard generative adversarial network, which achieves high-quality many-to-many voice conversion for non-parallel corpora. A high-performance voice conversion method should not only ensure speech naturalness but also preserve the speaker similarity of the converted speech. First, to improve speech naturalness, the Wasserstein generative adversarial network in the voice conversion model based on the variational autoencoding generative adversarial network is replaced by the relativistic standard generative adversarial network. By constructing a relativistic standard discriminator, the output of the discriminator is made to depend on the relative value between real and generated samples, which overcomes the unstable performance and slow convergence of the Wasserstein generative adversarial network. Furthermore, an i-vector representing speaker characteristics is adopted as the speaker representation for many-to-many voice conversion, in addition to the traditional one-hot vector, to improve the speaker similarity of the converted speech. Objective and subjective experiments show that, compared with the baseline variational autoencoding Wasserstein generative adversarial network method, the average mel-cepstral distortion decreases by 4.80%, the mean opinion score increases by 5.12%, and the ABX score increases by 8.60%, which demonstrates that the proposed method substantially improves both speech naturalness and speaker similarity.
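
The objective metric reported above, mel-cepstral distortion, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: it uses the widely used definition (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames, and assumes the two sequences are already time-aligned and that the caller has chosen which coefficients to include. The function name `mel_cepstral_distortion` and its arguments are hypothetical.

```python
import math


def mel_cepstral_distortion(target_frames, converted_frames):
    """Average mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Assumptions:
    frames are already aligned one-to-one, and the caller decides which
    coefficients (for example, whether to drop the energy term) to pass.
    """
    if not target_frames or len(target_frames) != len(converted_frames):
        raise ValueError("need two non-empty sequences of equal length")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for target, converted in zip(target_frames, converted_frames):
        if len(target) != len(converted):
            raise ValueError("frames must have the same number of coefficients")
        squared_error = sum((t - c) ** 2 for t, c in zip(target, converted))
        total += scale * math.sqrt(squared_error)
    return total / len(target_frames)
```

A lower value means the converted spectrum is closer to the target, which is why the abstract reports a decrease in this metric as an improvement.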