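End-to-end speech emotion recognition based on multi-head attention
Cite this article:YANG Lei, ZHAO Hongdong, YU Kuaikuai. End-to-end speech emotion recognition based on multi-head attention[J]. Journal of Computer Applications, 2022, 42(6): 1869-1875.
Authors:YANG Lei  ZHAO Hongdong  YU Kuaikuai
Affiliation:School of Electronics and Information Engineering, Hebei University of Technology, Tianjin 300401
Science and Technology on Electro-Optical Information Security Control Laboratory, Tianjin 300308
Funding:Fund of the Science and Technology on Electro-Optical Information Security Control Laboratory (614210701041705)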
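Abstract:Speech emotion datasets are small and high-dimensional. Traditional recurrent neural networks (RNNs) lose long-range dependencies, and convolutional neural networks (CNNs) focus on local information, so the latent relationships among the frames of an input sequence are not fully exploited. To address these problems, MHA-SVM, a neural network based on multi-head attention (MHA) and a support vector machine (SVM), is proposed for speech emotion recognition (SER). First, the raw audio data are fed into the MHA network to train its parameters and obtain the MHA classification results. Then, the raw audio data are fed again into the pre-trained MHA to extract features. Finally, after a fully connected layer, an SVM classifies the extracted features to give the MHA-SVM classification results. After a thorough evaluation of how the number of heads and the number of layers in the MHA module affect the results, MHA-SVM is found to reach a highest recognition accuracy of 69.6% on the IEMOCAP dataset. The experimental results show that, compared with RNN- and CNN-based models, the end-to-end model based on the MHA mechanism is better suited to SER tasks.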

Keywords:speech emotion recognition  multi-head attention  convolutional neural network  support vector machine  end-to-end
Received:2021-04-14
Revised:2021-07-19

End-to-end speech emotion recognition based on multi-head attention
Lei YANG,Hongdong ZHAO,Kuaikuai YU.End-to-end speech emotion recognition based on multi-head attention[J].Journal of Computer Applications,2022,42(6):1869-1875.
Authors:Lei YANG  Hongdong ZHAO  Kuaikuai YU
Affiliation:School of Electronics and Information Engineering,Hebei University of Technology,Tianjin 300401,China
Science and Technology on Electro-Optical Information Security Control Laboratory,Tianjin 300308,China
Abstract:Speech emotion datasets are small and high-dimensional. In a traditional Recurrent Neural Network (RNN), long-range dependencies vanish, and a Convolutional Neural Network (CNN) focuses on local information, so the potential relationships between frames within the input sequence are not fully mined. To solve these problems, a new neural network, MHA-SVM, based on Multi-Head Attention (MHA) and Support Vector Machine (SVM), was proposed for Speech Emotion Recognition (SER). First, the original audio data were input into the MHA network to train the parameters of MHA and obtain the classification results of MHA. Then, the same original audio data were input into the pre-trained MHA again for feature extraction. Finally, after a fully connected layer, the extracted features were fed into the SVM to obtain the classification results of MHA-SVM. After a full evaluation of the effect of the number of heads and the number of layers in the MHA module on the experimental results, MHA-SVM was found to achieve a highest recognition accuracy of 69.6% on the IEMOCAP dataset. Experimental results indicate that, compared with models based on RNN and CNN, the end-to-end model based on the MHA mechanism is more suitable for SER tasks.
Keywords:Speech Emotion Recognition (SER)  Multi-Head Attention (MHA)  Convolutional Neural Network (CNN)  Support Vector Machine (SVM)  end-to-end  
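
The abstract's central idea, that multi-head attention relates every frame of the input sequence to every other frame, can be illustrated with a minimal pure-Python sketch. It is not the authors' network: the random weights stand in for trained parameters, the toy sizes (6 frames, 8 features, 2 heads) are arbitrary, the mean pooling step is an assumption made here to obtain a fixed-length vector, and all function names are hypothetical. In the pipeline described above, features taken from the pre-trained MHA would pass through a fully connected layer and then go to an SVM, which this sketch does not implement.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(frames, num_heads, rng):
    """Return (output sequence, attention weights per head) for a frame sequence."""
    d_model = len(frames[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        q, k, v = matmul(frames, w_q), matmul(frames, w_k), matmul(frames, w_v)
        # Scaled dot-product scores: every frame is compared with every frame.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads frame by frame, then apply the output projection.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(frames))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), head_weights


def mean_pool(sequence):
    """Average over time to get one fixed-length utterance vector (assumption)."""
    n = len(sequence)
    return [sum(col) / n for col in zip(*sequence)]


rng = random.Random(0)
frames = random_matrix(6, 8, rng)  # toy input: 6 frames, 8 features each
output, weights = multi_head_self_attention(frames, num_heads=2, rng=rng)
utterance_vector = mean_pool(output)
print(len(output), len(output[0]), len(utterance_vector))
```

Each row of `weights[h]` is a probability distribution over all frames, so a frame can draw on distant frames as directly as on adjacent ones. This is the property the abstract contrasts with the long-range dependency loss of RNNs and the local focus of CNNs.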