Multi-modal emotion recognition in conversation based on a listening and speaking knowledge fusion network
Cite this article: LIU Qin, XIE Jun, HU Yong, HAO Shu-feng, HAO Ya-hui. Multi-modal emotion recognition in conversation based on a listening and speaking knowledge fusion network[J]. Control and Decision, 2024, 39(6): 2031-2040.
Authors: LIU Qin  XIE Jun  HU Yong  HAO Shu-feng  HAO Ya-hui
Affiliations: College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China; College of New Media Art and Design, Beihang University, Beijing 100191, China; College of Data Science, Taiyuan University of Technology, Taiyuan 030024, China
Funding: Open Fund of the State Key Laboratory of Virtual Reality Technology and Systems (Beihang University) (VRLAB2022C11); Youth Science Research Project of the Shanxi Basic Research Program (20210302124168); Shanxi Program for Selected Science and Technology Activities of Returned Overseas Scholars (20220009); Key R&D Program of Shanxi Province (202102020101004).
Abstract: Multi-modal emotion recognition in conversation aims to identify the emotion category expressed by a target utterance from the multi-modal conversational context, and is a fundamental task in building empathetic dialogue systems. Most existing methods consider only the information of the multi-modal conversation itself and ignore knowledge related to the listener and the speaker, which limits the capture of the emotional features of the target utterance. To address this problem, a listening and speaking knowledge fusion network (LSKFN) for multi-modal emotion recognition in conversation is proposed, which introduces external commonsense knowledge related to the listener and the speaker and organically fuses multi-modal contextual information with knowledge information. LSKFN consists of four stages, namely multi-modal context perception, listening and speaking knowledge fusion, emotion information summarization, and emotion decision, which respectively extract multi-modal contextual features, integrate listening and speaking knowledge features, eliminate redundant features, and predict the emotion distribution. Experimental results on two public datasets show that, compared with other benchmark models, LSKFN extracts richer emotional features for the target utterance and achieves better performance in conversational emotion recognition.
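The four-stage pipeline summarized above can be illustrated with a minimal sketch. The following Python/PyTorch code is a hypothetical reconstruction under stated assumptions (pre-extracted utterance-level text, audio, visual, and listener/speaker commonsense-knowledge features; a bidirectional GRU for context modelling; attention for knowledge fusion; a sigmoid gate for redundancy filtering). It is not the authors' implementation, and all layer choices, names, and dimensions are illustrative.

# Hypothetical sketch of a four-stage LSKFN-style pipeline; not the paper's actual model.
import torch
import torch.nn as nn

class LSKFNSketch(nn.Module):
    def __init__(self, text_dim=100, audio_dim=100, visual_dim=100,
                 knowledge_dim=100, hidden_dim=128, num_emotions=6):
        super().__init__()
        # Stage 1: multi-modal context perception -- fuse modalities, then model context.
        self.modal_proj = nn.Linear(text_dim + audio_dim + visual_dim, hidden_dim)
        self.context_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
        # Stage 2: listening and speaking knowledge fusion -- attend over external
        # commonsense features for the listener and the speaker (assumed given).
        self.know_proj = nn.Linear(knowledge_dim, 2 * hidden_dim)
        self.listen_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                                 batch_first=True)
        self.speak_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4,
                                                batch_first=True)
        # Stage 3: emotion information summarization -- a gate to drop redundant features.
        self.gate = nn.Linear(6 * hidden_dim, 2 * hidden_dim)
        # Stage 4: emotion decision.
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, text, audio, visual, listen_know, speak_know):
        # text/audio/visual: (batch, utterances, dim); *_know: (batch, utterances, knowledge_dim)
        x = self.modal_proj(torch.cat([text, audio, visual], dim=-1))
        ctx, _ = self.context_rnn(x)                      # multi-modal context features
        lk = self.know_proj(listen_know)
        sk = self.know_proj(speak_know)
        l_feat, _ = self.listen_attn(ctx, lk, lk)         # listener-knowledge view
        s_feat, _ = self.speak_attn(ctx, sk, sk)          # speaker-knowledge view
        fused = torch.cat([ctx, l_feat, s_feat], dim=-1)
        g = torch.sigmoid(self.gate(fused))               # filter redundant information
        summary = g * ctx + (1 - g) * (l_feat + s_feat)
        return self.classifier(summary)                   # per-utterance emotion logits

Calling the module on tensors of shape (batch, utterances, feature_dim) for each input yields per-utterance emotion logits, mirroring the extract-fuse-summarize-decide flow described in the abstract.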

Keywords: affective computing  emotion recognition in conversation  multi-modal features  external commonsense knowledge  contextual semantics  knowledge feature fusion

Listening and speaking knowledge fusion network for multi-modal emotion recognition in conversation
LIU Qin, XIE Jun, HU Yong, HAO Shu-feng, HAO Ya-hui. Listening and speaking knowledge fusion network for multi-modal emotion recognition in conversation[J]. Control and Decision, 2024, 39(6): 2031-2040.
Authors: LIU Qin  XIE Jun  HU Yong  HAO Shu-feng  HAO Ya-hui
Affiliation: College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China; College of New Media Art and Design, Beihang University, Beijing 100191, China; College of Data Science, Taiyuan University of Technology, Taiyuan 030024, China
Abstract: Multi-modal emotion recognition in conversation aims to identify the emotion of the target utterance according to the multi-modal conversation context, and is a fundamental task in building empathetic dialogue systems (EDS). Most existing works consider only the multi-modal conversation itself and ignore knowledge about the listener and the speaker, which limits the capture of the emotional features of the target utterance. To solve this problem, a listening and speaking knowledge fusion network (LSKFN) is proposed, which introduces external commonsense knowledge related to the listener and the speaker and organically fuses it with the multi-modal context. The proposed LSKFN consists of four stages, namely multi-modal context perception, listening and speaking knowledge fusion, emotion information summarization, and emotion decision, which are used to extract multi-modal context features, integrate listening and speaking knowledge features, eliminate redundant features, and predict the emotion probability distribution, respectively. Experimental results on two public multi-modal conversation datasets demonstrate that LSKFN extracts richer emotional features for the target utterance and achieves better emotion recognition performance than other benchmark models.
Keywords: affective computing  emotion recognition in conversation  multi-modal features  external commonsense knowledge  contextual semantics  knowledge feature fusion