Using speech and text features fusion to improve speech emotion recognition
Cite this article: Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei. Using speech and text features fusion to improve speech emotion recognition[J]. Journal of Data Acquisition & Processing, 2019, 34(4): 625-631
Authors: Feng Yaqin  Shen Lingjie  Hu Tingting  Wang Wei
Affiliation: School of Education Science, Nanjing Normal University, Nanjing 210097, China
Foundation item: Supported by the National Social Science Fund of China (BCA150054).
Abstract: Emotion recognition is of great importance in human-computer interaction. To improve recognition accuracy, speech and text features are fused. The speech features consist of acoustic and prosodic features; the text features are bag-of-words (BoW) features based on an emotion lexicon and an N-gram model. Speech and text features are combined by feature-level fusion and by decision-level fusion, and the two schemes are compared on four-class emotion recognition on the IEMOCAP corpus. Experiments show that the fusion of speech and text features outperforms either single feature type, and that decision-level fusion outperforms feature-level fusion. Moreover, with a convolutional neural network (CNN) classifier, decision-level fusion of speech and text features reaches an unweighted average recall (UAR) of 68.98%, surpassing the previous best result on the IEMOCAP dataset.
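As a minimal sketch of the two fusion schemes named in the abstract (not the authors' implementation: the feature dimensions, the logistic-regression classifiers, and the simple posterior-averaging rule are assumptions made only for illustration), feature-level fusion concatenates the speech and text feature vectors before a single classifier, whereas decision-level fusion trains one classifier per modality and combines their class posteriors:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-utterance features; dimensions are placeholders, not values from the paper.
rng = np.random.default_rng(0)
X_speech = rng.normal(size=(200, 40))    # acoustic + prosodic features
X_text = rng.normal(size=(200, 300))     # emotion-lexicon BoW / N-gram features
y = rng.integers(0, 4, size=200)         # four emotion classes

# Feature-level fusion: concatenate the two feature vectors and train one classifier.
clf_feat = LogisticRegression(max_iter=1000).fit(np.hstack([X_speech, X_text]), y)
pred_feat = clf_feat.predict(np.hstack([X_speech, X_text]))

# Decision-level fusion: train one classifier per modality and average their posteriors.
clf_speech = LogisticRegression(max_iter=1000).fit(X_speech, y)
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
fused_proba = (clf_speech.predict_proba(X_speech) + clf_text.predict_proba(X_text)) / 2
pred_decision = fused_proba.argmax(axis=1)

The paper reports its best results with a CNN classifier; logistic regression is used above only to keep the sketch short and self-contained.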

Keywords: emotion recognition  acoustic features  prosodic features  text features  feature fusion
Received: 2018-01-21
Revised: 2018-04-04

Using speech and text features fusion to improve speech emotion recognition
Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei. Using speech and text features fusion to improve speech emotion recognition[J]. Journal of Data Acquisition & Processing, 2019, 34(4): 625-631
Authors: Feng Yaqin  Shen Lingjie  Hu Tingting  Wang Wei
Affiliation: School of Education Science, Nanjing Normal University, Nanjing 210097, China
Abstract: Emotion recognition is of great importance in human-computer interaction. This study aims to improve the accuracy of emotion recognition by fusing speech and text features. The speech features are acoustic and prosodic features, and the text features are traditional bag-of-words (BoW) features based on an emotion lexicon together with an N-gram model. We use these features for emotion recognition and compare their performance on the IEMOCAP dataset. We also compare different fusion methods, including feature-layer fusion and decision-layer fusion. Experimental results show that the fusion of speech and text features performs better than single features, and that decision-layer fusion of speech and text features performs better than feature-layer fusion. Moreover, with a convolutional neural network (CNN) classifier, the unweighted average recall (UAR) of decision-layer fusion of the three feature types reaches 68.98%, surpassing the previous best results on the IEMOCAP dataset.
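The reported metric, unweighted average recall (UAR), is the mean of the per-class recalls, so every emotion class contributes equally regardless of how often it occurs. A minimal sketch of the computation (the label arrays are made-up examples, not results from the paper):

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # hypothetical gold emotion labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 2])  # hypothetical fused predictions

# UAR = macro-averaged recall: the recall of each class is averaged with equal weight.
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.4f}")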
Keywords: emotion recognition  acoustic features  prosodic features  text features  feature fusion