Funding: National Natural Science Foundation of China (61902064, 81971282); Fundamental Research Funds for the Central Universities (2242018K3DN01)
Received: 2020-03-31; Revised: 2020-07-02

Speech emotion recognition algorithm based on 3D and 1D multi-feature fusion
XU Huanan, ZHOU Xiaoyan, JIANG Wan, LI Dapeng. Speech emotion recognition algorithm based on 3D and 1D multi-feature fusion[J]. Technical Acoustics, 2021, 40(4): 496-502.
Authors:XU Huanan  ZHOU Xiaoyan  JIANG Wan  LI Dapeng
Affiliation:College of Electronic and Information Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, Jiangsu, China
Abstract: To address the problems of single feature extraction and low classification accuracy in speech emotion recognition, a 3D and 1D multi-feature fusion method for emotion recognition is proposed in this paper to improve the feature extraction algorithm. In the 3D network, both spatial feature learning and temporal dependency modeling are considered: a bilinear convolutional neural network (BCNN) extracts spatial features, while a long short-term memory (LSTM) network with an attention mechanism extracts salient time-dependent features. To reduce the influence of speaker differences, the log-Mel features of the speech signal together with their first-order and second-order difference features are computed to synthesize the 3D Log-Mel feature set. In the 1D network, a framework of one-dimensional convolution and LSTM is used. Finally, the 3D and 1D features are fused to obtain discriminative emotional features, and emotions are classified with the softmax function. On the IEMOCAP and EMO-DB databases, the average recognition rates are 61.22% and 85.69%, respectively, and the multi-feature fusion algorithm shows better recognition performance than the single-feature 3D and 1D algorithms.
Keywords:speech emotion recognition  bilinear convolutional neural network (BCNN)  long short-term memory (LSTM)  attention mechanism  multi-feature fusion
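The 3D Log-Mel feature set described in the abstract (a log-Mel spectrogram stacked with its first-order and second-order difference features as three channels) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact delta formula is not specified in the abstract, so a two-frame central difference with edge padding is assumed here, and the frame/mel-bin dimensions are illustrative.

```python
import numpy as np

def delta(feat):
    # First-order difference along the time axis of a (frames, mel_bins)
    # matrix, using a central difference with edge padding (an assumption;
    # the paper does not state its delta formula).
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def make_3d_logmel(log_mel):
    # Stack log-Mel, delta, and delta-delta into a (frames, mel_bins, 3)
    # tensor, forming the 3-channel "3D Log-Mel" input for the 3D network.
    d1 = delta(log_mel)
    d2 = delta(d1)
    return np.stack([log_mel, d1, d2], axis=-1)

# Dummy log-Mel spectrogram: 100 frames, 40 mel bins (illustrative sizes).
log_mel = np.random.randn(100, 40)
feat_3d = make_3d_logmel(log_mel)
print(feat_3d.shape)  # → (100, 40, 3)
```

The resulting three-channel tensor is what an image-style network such as the BCNN branch would consume, with the 1D branch operating on the raw one-dimensional feature sequence instead.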