Using speech and text features fusion to improve speech emotion recognition
Cite this article: Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei. Using speech and text features fusion to improve speech emotion recognition[J]. Journal of Data Acquisition & Processing, 2019, 34(4): 625-631
Authors: Feng Yaqin  Shen Lingjie  Hu Tingting  Wang Wei
Affiliation: School of Education Science, Nanjing Normal University, Nanjing 210097, China
Foundation item: Supported by the National Social Science Fund of China (BCA150054).
Abstract: Emotion recognition is of great importance in human-computer interaction. To improve recognition accuracy, speech and text features are fused. The speech features consist of acoustic and prosodic features; the text features are bag-of-words (BoW) features based on an emotion lexicon and an N-gram model. Speech and text features are combined by feature-level fusion and by decision-level fusion, and the two schemes are compared on four-class emotion recognition on the IEMOCAP corpus. Experiments show that the fusion of speech and text features outperforms either single feature type, and that decision-level fusion outperforms feature-level fusion. Moreover, with a convolutional neural network (CNN) classifier, decision-level fusion of speech and text features reaches an unweighted average recall (UAR) of 68.98%, surpassing the previous best result on the IEMOCAP dataset.
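As a minimal sketch of the two fusion schemes named in the abstract (not the authors' implementation: the feature dimensions, the logistic-regression classifiers, and the simple posterior-averaging rule are assumptions made only for illustration), feature-level fusion concatenates the speech and text feature vectors before a single classifier, whereas decision-level fusion trains one classifier per modality and combines their class posteriors:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-utterance features; dimensions are placeholders, not values from the paper.
rng = np.random.default_rng(0)
X_speech = rng.normal(size=(200, 40))    # acoustic + prosodic features
X_text = rng.normal(size=(200, 300))     # emotion-lexicon BoW / N-gram features
y = rng.integers(0, 4, size=200)         # four emotion classes

# Feature-level fusion: concatenate the two feature vectors and train one classifier.
clf_feat = LogisticRegression(max_iter=1000).fit(np.hstack([X_speech, X_text]), y)
pred_feat = clf_feat.predict(np.hstack([X_speech, X_text]))

# Decision-level fusion: train one classifier per modality and average their posteriors.
clf_speech = LogisticRegression(max_iter=1000).fit(X_speech, y)
clf_text = LogisticRegression(max_iter=1000).fit(X_text, y)
fused_proba = (clf_speech.predict_proba(X_speech) + clf_text.predict_proba(X_text)) / 2
pred_decision = fused_proba.argmax(axis=1)

The paper reports its best results with a CNN classifier; logistic regression is used above only to keep the sketch short and self-contained.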

Keywords: emotion recognition  acoustic features  prosodic features  text features  feature fusion
Received: 2018-01-21
Revised: 2018-04-04

Using speech and text features fusion to improve speech emotion recognition
Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei. Using speech and text features fusion to improve speech emotion recognition[J]. Journal of Data Acquisition & Processing, 2019, 34(4): 625-631
Authors: Feng Yaqin  Shen Lingjie  Hu Tingting  Wang Wei
Affiliation: School of Education Science, Nanjing Normal University, Nanjing 210097, China
Abstract: Emotion recognition is of great importance in human-computer interaction. This study aims to improve the accuracy of emotion recognition by fusing speech and text features. The speech features are acoustic and prosodic features, and the text features are traditional bag-of-words (BoW) features based on an emotion lexicon together with an N-gram model. We use these features for emotion recognition and compare their performance on the IEMOCAP dataset. We also compare different fusion methods, including feature-layer fusion and decision-layer fusion. Experimental results show that the fusion of speech and text features performs better than single features, and that decision-layer fusion of speech and text features performs better than feature-layer fusion. Moreover, with a convolutional neural network (CNN) classifier, the unweighted average recall (UAR) of decision-layer fusion of the three feature types reaches 68.98%, surpassing the previous best results on the IEMOCAP dataset.
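The reported metric, unweighted average recall (UAR), is the mean of the per-class recalls, so every emotion class contributes equally regardless of how often it occurs. A minimal sketch of the computation (the label arrays are made-up examples, not results from the paper):

import numpy as np
from sklearn.metrics import recall_score

y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])  # hypothetical gold emotion labels
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 2])  # hypothetical fused predictions

# UAR = macro-averaged recall: the recall of each class is averaged with equal weight.
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.4f}")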
Keywords: emotion recognition  acoustic features  prosodic features  text features  feature fusion