
A Multiscale Feature Extraction Method for Text-independent Speaker Recognition
Citation: Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG. A Multiscale Feature Extraction Method for Text-independent Speaker Recognition[J]. Journal of Electronics & Information Technology, 2021, 43(11): 3266-3271.
Authors: Zhigao CHEN, Peng LI, Runqiu XIAO, Ta LI, Wenchao WANG
Affiliation: 1. Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China; 2. University of Chinese Academy of Sciences, Beijing 100049, China; 3. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China
Funding: National Natural Science Foundation of China (11590772, 11590774, 11590770)
Abstract: In recent years, various model structures based on Convolutional Neural Networks (CNNs) have shown increasingly strong multiscale feature representation abilities and achieved consistent performance gains across speaker recognition tasks. However, most existing methods can only improve performance by making the network deeper and wider. This paper introduces Res2Net, a more efficient multiscale framework for speaker feature extraction, and modifies its block structure. By working at a finer granularity, the architecture combines multiple receptive fields and thus yields feature representations at multiple combined scales. Experiments show that, with almost no increase in the number of parameters, the method reduces the Equal Error Rate (EER) by 20% relative to ResNet, and that it delivers stable gains across recording environments and recognition tasks such as VoxCeleb and Speakers In The Wild (SITW), demonstrating its efficiency and robustness. The modified fully connected block makes fuller use of the training information and brings clear performance gains when training data are plentiful and the task is complex. The code is available at https://github.com/czg0326/Res2Net-Speaker-Recognition.

Keywords: speaker recognition; multiscale features; robustness; efficiency
Received: 2020-10-26
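The core idea in the abstract is the finer-grained splitting inside each Res2Net block: the channels of a feature map are divided into groups, and each group's convolution also receives the previous group's output, so the block mixes several effective receptive fields at almost no extra parameter cost. The code below is a minimal sketch of that kind of block, assuming a PyTorch setup; the class name `Res2NetBlock`, the channel count, and the `scale` value are illustrative assumptions and are not taken from the authors' repository or paper.

```python
# Minimal, illustrative Res2Net-style block (not the authors' implementation).
import torch
import torch.nn as nn


class Res2NetBlock(nn.Module):
    """Splits the channels into `scale` groups and filters them hierarchically,
    so the concatenated output mixes several effective receptive fields."""

    def __init__(self, channels: int, scale: int = 4):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.scale = scale
        width = channels // scale
        # One 3x3 conv per group except the first, which is passed through.
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False)
            for _ in range(scale - 1)
        )
        self.bns = nn.ModuleList(nn.BatchNorm2d(width) for _ in range(scale - 1))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split channels into `scale` equal groups.
        groups = torch.split(x, x.size(1) // self.scale, dim=1)
        outputs = [groups[0]]  # first group: identity, no extra conv
        prev = None
        for i, (conv, bn) in enumerate(zip(self.convs, self.bns), start=1):
            # Each group also receives the previous group's output, which is
            # what enlarges the receptive field step by step.
            inp = groups[i] if prev is None else groups[i] + prev
            prev = self.relu(bn(conv(inp)))
            outputs.append(prev)
        # Concatenate the multiscale outputs back to the original width.
        return torch.cat(outputs, dim=1)


# Hypothetical usage on a batch of 2 spectrogram-like feature maps, 64 channels.
if __name__ == "__main__":
    block = Res2NetBlock(channels=64, scale=4)
    y = block(torch.randn(2, 64, 40, 100))
    print(y.shape)  # torch.Size([2, 64, 40, 100])
```

In a full speaker-recognition system, such a block would replace the plain 3x3 convolution inside each residual unit of a ResNet-style front end, which is why the parameter count stays essentially unchanged while the scale diversity of the features increases.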
