Method Research on Multimodal Emotion Recognition Based on Audio and Video
Cite this article: LIN Shurui, ZHANG Xiaohui, GUO Min, ZHANG Weiqiang, WANG Guijin. Method Research on Multimodal Emotion Recognition Based on Audio and Video[J]. Journal of Signal Processing, 2021, 37(10): 1889-1898.
Authors: LIN Shurui, ZHANG Xiaohui, GUO Min, ZHANG Weiqiang, WANG Guijin
Affiliation: Department of Electronic Engineering, Tsinghua University; Beijing National Research Center for Information Science and Technology
Funding: Key Program of the NSFC - General Technology Basic Research Joint Fund (U1836219)
Abstract: In recent years, affective computing has gradually become key to breakthroughs in human-computer interaction, and emotion recognition, as an important part of affective computing, has received extensive attention. This paper implements a facial expression recognition system based on ResNet18 and a speech emotion recognition model based on the HGFM architecture; by tuning their parameters, models with good performance are trained. On this basis, a multimodal emotion recognition system covering video and audio signals is built with two multimodal fusion strategies, feature-level fusion and decision-level fusion, demonstrating the performance advantage of multimodal emotion recognition. Under both fusion strategies, the audio-visual model improves accuracy over the video-only and audio-only models, supporting the conclusion that a multimodal model usually outperforms the best single-modal model. The fused audio-visual bimodal model reaches an accuracy of 76.84%, a 3.50% improvement over the best existing model, giving it a performance advantage over existing audio-visual emotion recognition models.
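As a rough illustration of the video branch described above, here is a minimal sketch of a ResNet18-based facial expression classifier. It assumes PyTorch/torchvision, a 4-class emotion label set, and 224x224 face crops; none of these details are stated in the abstract, so treat them as placeholders.

import torch
import torch.nn as nn
from torchvision.models import resnet18

NUM_EMOTIONS = 4  # assumed label set size; the abstract does not specify it

class FaceExpressionNet(nn.Module):
    """ResNet18 backbone with its ImageNet head swapped for an emotion classifier."""
    def __init__(self, num_classes: int = NUM_EMOTIONS):
        super().__init__()
        self.backbone = resnet18(weights=None)  # pretrained weights could be loaded here
        # Replace the 1000-way ImageNet head with an emotion classification head.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 3, 224, 224) face crops -> (batch, num_classes) logits
        return self.backbone(x)

model = FaceExpressionNet()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 4)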

Keywords: emotion recognition; deep learning; multimodal fusion; residual network; hierarchical grained and feature model
Received: 2021-08-23

Method Research on Multimodal Emotion Recognition Based on Audio and Video
Affiliation: Beijing National Research Center for Information Science and Technology, Department of Electronic Engineering, Tsinghua University; Shenzhen International Graduate School, Tsinghua University
Abstract: In recent years, affective computing has gradually become one of the keys to the development of human-computer interaction, and emotion recognition, as an important part of affective computing, has received extensive attention. The residual network is one of the most widely used networks, and HGFM offers good accuracy and robustness. This paper implements a facial expression recognition system based on ResNet18 and a speech emotion recognition model based on HGFM; by adjusting the parameters, models with good performance are trained. On this basis, we build a multimodal system covering video and audio through two multimodal fusion strategies, namely feature-level fusion and decision-level fusion, which demonstrates the superior performance of the multimodal emotion recognition system. Feature-level fusion splices the visual and audio features into one large feature vector, which is then sent to the classifier for recognition. In decision-level fusion, after the prediction probabilities of the visual and audio modalities are obtained from their classifiers, the weight of each modality and the fusion strategy are determined according to the reliability of each modality, and the classification result is obtained after fusion. Both audio-visual emotion recognition models, one per fusion strategy, improve accuracy over the video-only and audio-only models, verifying the conclusion that a multimodal model outperforms the optimal single-modal model. The accuracy of the fused audio-visual bimodal model reaches 76.84%, which is 3.50% higher than the best existing model. The models implemented in this paper therefore achieve strong emotion recognition performance and compare favorably with existing audio-visual emotion recognition models.
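To make the two fusion strategies concrete, the sketch below shows feature-level fusion (concatenating per-modality feature vectors before a joint classifier) and decision-level fusion (weighting per-modality softmax probabilities). The feature dimensions, the MLP classifier shape, and the 0.6/0.4 modality weights are illustrative assumptions, not the paper's configuration; in practice the weights would be chosen from each modality's reliability, as the abstract describes.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 4                  # assumed emotion label set
VIDEO_DIM, AUDIO_DIM = 512, 256  # assumed per-modality feature sizes

# Feature-level fusion: concatenate both feature vectors, classify jointly.
class FeatureLevelFusion(nn.Module):
    def __init__(self):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(VIDEO_DIM + AUDIO_DIM, 128),
            nn.ReLU(),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([video_feat, audio_feat], dim=-1)  # one large feature vector
        return self.classifier(fused)

# Decision-level fusion: weighted sum of each modality's class probabilities.
def decision_level_fusion(video_probs, audio_probs, video_weight=0.6):
    # video_weight would be set from modality reliability (e.g., validation
    # accuracy); 0.6 vs. 0.4 here is only a placeholder.
    return video_weight * video_probs + (1.0 - video_weight) * audio_probs

video_feat, audio_feat = torch.randn(8, VIDEO_DIM), torch.randn(8, AUDIO_DIM)
logits = FeatureLevelFusion()(video_feat, audio_feat)             # (8, 4)

video_probs = F.softmax(torch.randn(8, NUM_CLASSES), dim=-1)
audio_probs = F.softmax(torch.randn(8, NUM_CLASSES), dim=-1)
pred = decision_level_fusion(video_probs, audio_probs).argmax(dim=-1)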
Keywords: emotion recognition; deep learning; multimodal fusion; residual network; hierarchical grained and feature model