Research on Multi-modal Emotion Recognition Based on Voice and Video Images
Cite this article: WANG Chuanyu, LI Weixiang, CHEN Zhenhuan. Research on Multi-modal Emotion Recognition Based on Voice and Video Images[J]. Computer Engineering and Applications, 2021, 57(23): 163-170. DOI: 10.3778/j.issn.1002-8331.2104-0306
Authors: WANG Chuanyu  LI Weixiang  CHEN Zhenhuan
Affiliation: College of Electrical Engineering and Control Science, Nanjing Tech University, Nanjing 211816, China
Abstract: Emotion recognition, which infers emotion categories from physiological signals and behavioral characteristics, is one of the important research fields of artificial intelligence. To improve the accuracy and real-time performance of emotion recognition, a multi-modal emotion recognition method based on voice and video images is proposed. The video image modality is realized with the Local Binary Patterns Histograms (LBPH) method, a Sparse Auto-Encoder (SAE), and an improved Convolutional Neural Network (CNN); the voice modality is realized with an improved Deep Restricted Boltzmann Machine (DBM) and an improved Long Short-Term Memory (LSTM) network. The SAE extracts finer-grained image features, and the DBM obtains a deep representation of acoustic features. Back Propagation (BP) is used to optimize the nonlinear mapping capability of the DBM and the LSTM, and Global Average Pooling (GAP) is used to speed up the response of the CNN and the LSTM and to prevent overfitting. After single-modality recognition, the results of the two modalities are fused at the decision level according to a weight criterion, yielding the predicted emotion class and its probability. Experimental results show that the fusion strategy improves recognition accuracy over single-modality recognition, achieving a recognition rate of 74.9% on the test set of the Chinese Natural Audio-Visual Emotion Database (CHEAVD) 2.0, and the method can analyze a user's emotion in real time.
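A quick illustration of the Global Average Pooling (GAP) step mentioned above: each feature map is collapsed to its channel-wise mean, so the classifier that follows sees one scalar per channel rather than a flattened map, which reduces parameters and the risk of overfitting. A minimal numpy sketch (the tensor shape is made up for illustration, not taken from the paper):

```python
import numpy as np

# Global average pooling: collapse each feature map to its spatial mean.
# The downstream classifier then sees one value per channel instead of a
# fully flattened feature map, which cuts parameters, speeds up inference,
# and helps prevent overfitting.
feature_maps = np.random.rand(64, 7, 7)      # (channels, height, width); illustrative shape
gap_vector = feature_maps.mean(axis=(1, 2))  # -> shape (64,)
print(gap_vector.shape)                      # (64,)
```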

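The decision-level fusion can be sketched in the same spirit. The abstract only states that the two single-modality results are combined under a weight criterion and that the predicted emotion class and its probability are returned; the emotion label set and the fixed weight below are assumptions for illustration, not the paper's actual criterion:

```python
import numpy as np

# Hypothetical decision-level fusion under a weight criterion. EMOTIONS and
# w_video are illustrative assumptions; the abstract does not give the paper's
# exact weighting rule or label set.
EMOTIONS = ["neutral", "angry", "happy", "sad", "worried", "anxious", "surprise", "disgust"]

def fuse(p_video, p_audio, w_video=0.6):
    """Weighted sum of the per-class probabilities from the two modalities."""
    p = w_video * np.asarray(p_video) + (1.0 - w_video) * np.asarray(p_audio)
    p = p / p.sum()                # renormalize to a probability distribution
    k = int(np.argmax(p))
    return EMOTIONS[k], float(p[k])

p_v = np.array([0.05, 0.10, 0.55, 0.05, 0.05, 0.05, 0.10, 0.05])  # video classifier output
p_a = np.array([0.10, 0.05, 0.40, 0.10, 0.10, 0.10, 0.10, 0.05])  # audio classifier output
print(fuse(p_v, p_a))             # -> ('happy', ~0.49)
```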
Keywords: feature fusion  multimodal fusion  facial expression recognition  speech emotion recognition  deep learning
