Self-supervised Multimodal Emotion Recognition Combining Temporal Attention Mechanism and Unimodal Label Automatic Generation Strategy
Citation: SUN Qiang, WANG Shuyu. Self-supervised Multimodal Emotion Recognition Combining Temporal Attention Mechanism and Unimodal Label Automatic Generation Strategy[J]. Journal of Electronics & Information Technology, 2024, 46(2): 588-601. doi: 10.11999/JEIT231107
Authors: SUN Qiang, WANG Shuyu
Affiliation: 1. Department of Communication Engineering, School of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048, China; 2. Xi’an Key Laboratory of Wireless Optical Communication and Network Research, Xi’an 710048, China
Funding: Xi’an Science and Technology Plan Project (22GXFW0086)
Abstract: Most multimodal emotion recognition methods aim to find an effective fusion mechanism for constructing features from heterogeneous modalities, so as to learn feature representations with consistent emotional semantics. However, these methods usually ignore the differences in emotional semantics between modalities. To solve this problem, a multi-task learning framework is proposed that jointly trains one multimodal task and three unimodal tasks, learning, respectively, the emotionally consistent semantic information shared among the multimodal features and the emotionally distinct semantic information carried by each individual modality. First, to learn the consistent semantic information, a Temporal Attention Mechanism (TAM) based on a multi-layer recurrent neural network is proposed, which describes the contribution of the emotional features by assigning different weights to the time-series feature vectors. For multimodal fusion, fine-grained feature fusion is then performed per semantic dimension in the semantic space. Second, to learn effectively the distinct emotional semantics of each modality, a self-supervised Unimodal Label Automatic Generation (ULAG) strategy based on inter-modal feature-vector similarity is proposed. Extensive experiments on the CMU-MOSI, CMU-MOSEI, and CH-SIMS datasets confirm that the proposed TAM-ULAG model is highly competitive: it improves on the baseline models in both the classification metrics ($\mathrm{Acc}_2$, $F_1$) and the regression metrics (MAE, Corr). For binary classification, the recognition accuracy reaches 87.2% and 85.8% on CMU-MOSI and CMU-MOSEI, respectively, and 81.47% on CH-SIMS. These results indicate that simultaneously learning the consistent emotional semantics across modalities and the distinct emotional semantics of each modality helps to improve the performance of self-supervised multimodal emotion recognition methods.
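
The abstract describes the TAM only at a high level: a multi-layer recurrent network encodes each modality's time series, and attention weights express how much each time step contributes to the emotional representation. The following is a minimal, hypothetical PyTorch sketch of that idea; the bidirectional LSTM encoder, the layer sizes, and the single-linear-layer scoring network are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalAttentionPooling(nn.Module):
    """Scores each time step of an RNN-encoded sequence and pools over time."""
    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        # Multi-layer recurrent encoder over one modality's time series (assumed sizes).
        self.rnn = nn.LSTM(input_dim, hidden_dim, num_layers=2,
                           batch_first=True, bidirectional=True)
        # One scalar score per time step, normalized into attention weights.
        self.score = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, input_dim)
        h, _ = self.rnn(x)                        # (batch, time, 2*hidden_dim)
        w = torch.softmax(self.score(h), dim=1)   # (batch, time, 1), weights over time
        return (w * h).sum(dim=1)                 # weighted sum: (batch, 2*hidden_dim)

if __name__ == "__main__":
    frames = torch.randn(8, 50, 74)               # e.g. 50 frames of acoustic features
    pooled = TemporalAttentionPooling(74)(frames)
    print(pooled.shape)                           # torch.Size([8, 256])

The pooled vector from each modality could then be fused per semantic dimension and fed to the multimodal and unimodal prediction heads.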

Keywords: Multimodal emotion recognition; Self-supervised label generation; Multi-task learning; Temporal attention mechanism; Multimodal fusion
Received: 2023-10-11
Revised: 2024-01-30
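
The ULAG strategy is likewise only outlined in the abstract above: unimodal pseudo-labels are generated, in a self-supervised way, from inter-modal feature-vector similarity. The exact comparison and update rule are not given here, so the sketch below assumes one possible reading purely for illustration: each modality is compared against the fused representation by cosine similarity, and the multimodal label is offset accordingly. The function name, the scale factor, and the offset formula are all hypothetical, not the paper's actual ULAG rule.

import torch
import torch.nn.functional as F

def generate_unimodal_labels(fused, unimodal, multimodal_label, scale=0.5):
    """Derive a pseudo-label per modality from its similarity to the fused feature (assumed rule)."""
    labels = {}
    for name, feat in unimodal.items():
        sim = F.cosine_similarity(feat, fused, dim=-1)        # (batch,), values in [-1, 1]
        # Assumption: the less a modality agrees with the fused representation,
        # the further its pseudo-label drifts from the multimodal label.
        labels[name] = multimodal_label + scale * (sim - 1.0)
    return labels

if __name__ == "__main__":
    fused = torch.randn(8, 256)                               # fused multimodal features
    uni = {"text": torch.randn(8, 256),
           "audio": torch.randn(8, 256),
           "video": torch.randn(8, 256)}
    y = torch.randn(8)                                        # multimodal sentiment targets
    pseudo = generate_unimodal_labels(fused, uni, y)
    print({k: tuple(v.shape) for k, v in pseudo.items()})

Such generated labels would supervise the three unimodal tasks, while the original annotations supervise the multimodal task in the joint training described above.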
