Multi-Head Attention Time Domain Audiovisual Speech Separation Based on Dual-Path Recurrent Network and Conv-TasNet
Cite this article: LAN Chaofeng, JIANG Pengwei, CHEN Huan, ZHAO Shilong, GUO Xiaoxia, HAN Yulan, HAN Chuang. Multi-Head Attention Time Domain Audiovisual Speech Separation Based on Dual-Path Recurrent Network and Conv-TasNet[J]. Journal of Electronics & Information Technology, 2024, 46(3): 1005-1012. doi: 10.11999/JEIT230260
Authors: LAN Chaofeng  JIANG Pengwei  CHEN Huan  ZHAO Shilong  GUO Xiaoxia  HAN Yulan  HAN Chuang
Affiliation: 1. School of Measurement and Control Technology and Communication Engineering, Harbin University of Science and Technology, Harbin 150080, China; 2. Harbin Institute of Technology Satellite Technology Co., Ltd., Harbin 150023, China; 3. China Ship Design and Research Center, Wuhan 430064, China
Funding: National Natural Science Foundation of China (11804068); Natural Science Foundation of Heilongjiang Province (LH2020F033)
Abstract: Most current audiovisual speech separation models simply concatenate video and audio features without fully modeling the relationship between the modalities, so visual information is underused and separation quality suffers. This paper explicitly models the interaction between visual and audio features with a multi-head attention mechanism and, by combining the Convolutional Time-domain audio separation Network (Conv-TasNet) with the Dual-Path Recurrent Neural Network (DPRNN), proposes the Multi-Head Attention Time Domain AudioVisual Speech Separation (MHATD-AVSS) model. An audio encoder and a visual encoder extract the audio features and the lip features of the video; multi-head attention then fuses the two streams across modalities, and the fused audiovisual features are passed through a DPRNN separation network to recover the speech of the individual speakers. Experiments on the VoxCeleb2 dataset are evaluated with Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and Signal-to-Noise Ratio (SNR). When separating mixtures of two, three, or four speakers, the proposed method improves SDR by at least 1.87 dB and by up to 2.29 dB over traditional separation networks. The results show that the method accounts for the phase information of the audio signal, better exploits the correlation between visual and audio information, extracts more accurate audiovisual features, and achieves better separation performance.

Keywords: Speech separation; Audiovisual fusion; Cross-modal attention; Dual-path recurrent network; Conv-TasNet
Received: 2023-04-12
Revised: 2023-09-05
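The cross-modal fusion step described in the abstract (audio-encoder features combined with lip-derived visual features through multi-head attention before the DPRNN separator) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration only, not the authors' MHATD-AVSS implementation: the feature dimensions, the use of audio features as queries and visual features as keys/values, and the nearest-neighbor upsampling of the video stream to the audio frame rate are hypothetical choices.

```python
# Minimal sketch of cross-modal fusion with multi-head attention (PyTorch).
# Hypothetical dimensions and layout; not the authors' MHATD-AVSS code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalFusion(nn.Module):
    """Fuse audio frames with lip-region visual features via multi-head attention."""

    def __init__(self, audio_dim=256, visual_dim=512, num_heads=8):
        super().__init__()
        # Project the visual stream into the audio feature space.
        self.visual_proj = nn.Linear(visual_dim, audio_dim)
        self.attn = nn.MultiheadAttention(embed_dim=audio_dim,
                                          num_heads=num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(audio_dim)

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (batch, T_audio, audio_dim)  from the audio encoder
        # visual_feat: (batch, T_video, visual_dim) from the lip encoder
        v = self.visual_proj(visual_feat)
        # Upsample the video stream to the audio frame rate (assumed mismatch).
        v = F.interpolate(v.transpose(1, 2), size=audio_feat.shape[1],
                          mode="nearest").transpose(1, 2)
        # Audio frames attend to the visual stream (query=audio, key/value=visual).
        fused, _ = self.attn(query=audio_feat, key=v, value=v)
        # Residual connection keeps the original audio information.
        return self.norm(audio_feat + fused)


if __name__ == "__main__":
    fusion = CrossModalFusion()
    audio = torch.randn(2, 400, 256)   # e.g. 400 audio frames
    video = torch.randn(2, 100, 512)   # e.g. 100 video frames (25 fps)
    print(fusion(audio, video).shape)  # torch.Size([2, 400, 256])
```

Using the audio stream as the query lets every audio frame attend to the temporally aligned visual context; whether the paper uses this exact query/key assignment, a symmetric two-direction attention, or several stacked fusion layers cannot be inferred from the abstract alone.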

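The abstract reports quality with PESQ, STOI, and SNR, and quotes SDR gains of 1.87 dB to 2.29 dB. As a generic reference for how such dB figures are computed, the sketch below evaluates a plain SNR and its scale-invariant variant (SI-SNR), which is commonly paired with Conv-TasNet and DPRNN in the literature. It is not the paper's evaluation code, and PESQ and STOI require dedicated toolkits that are not shown here.

```python
# Generic SNR / SI-SNR computation in NumPy; illustrative only, not the
# paper's evaluation pipeline.
import numpy as np


def snr_db(reference, estimate):
    """10*log10(signal power / error power), in dB."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))


def si_snr_db(reference, estimate):
    """Scale-invariant SNR: project the estimate onto the reference first."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = scale * reference          # component of the estimate along the reference
    noise = estimate - target           # everything else counts as error
    return 10 * np.log10(np.sum(target ** 2) / (np.sum(noise ** 2) + 1e-12))


if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 16000)
    clean = np.sin(2 * np.pi * 220 * t)                 # stand-in for a clean source
    separated = clean + 0.05 * np.random.randn(t.size)  # stand-in for a separated output
    print(f"SNR:    {snr_db(clean, separated):.2f} dB")
    print(f"SI-SNR: {si_snr_db(clean, separated):.2f} dB")
```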