Similar Documents
20 similar documents found.
1.

Speech emotion recognition (SER) systems identify emotions from the human voice in areas such as smart healthcare, driving, call centers, automatic translation, and human-machine interaction. In the classical SER process, discriminative acoustic feature extraction is the most important and challenging step, because discriminative features drive classifier performance and reduce computation time. Nonetheless, current handcrafted acoustic features suffer from limited capability and accuracy when constructing an SER system for real-time use. Therefore, to overcome the limitations of handcrafted features, a variety of deep learning techniques has been proposed in recent years for automatic feature extraction in emotion prediction from speech signals. However, to the best of our knowledge, no in-depth review is available that critically appraises and summarizes the existing deep learning techniques, with their strengths and weaknesses, for SER. Hence, this study presents a comprehensive review of deep learning techniques for SER, covering their uniqueness, benefits, and limitations. The review also presents speech processing techniques, performance measures, and publicly available emotional speech databases; discusses the significance of the findings of the primary studies; and closes with open research issues and challenges that call for significant further effort in the field of SER systems.


2.
A Survey of Research Progress in Speech Emotion Recognition
This paper summarizes the current state and progress of speech emotion recognition research and looks ahead to future technical trends. The survey unfolds from five angles: emotion description models, representative emotional speech databases, speech emotion feature extraction, speech emotion recognition algorithms, and applications of speech emotion recognition technology, aiming to introduce and analyze the technology as comprehensively and thoroughly as possible and to provide a valuable academic reference for researchers in the field. Finally, grounded in this analysis of the state of the art, the challenges and development trends currently facing the field are discussed. Throughout, the emphasis is on summarizing, comparing, and analyzing the mainstream methods and frontier advances in speech emotion recognition research.

3.
Extracting features that characterize speech emotion and building an acoustic model with strong robustness and generalization are the core of a speech emotion recognition system. This paper constructs AHPCL, an attention-based heterogeneous parallel convolutional neural network model for speech emotion recognition. A long short-term memory network extracts the temporal features of speech emotion while convolution operations extract spatial spectral features, and the two are combined to jointly characterize the emotion and improve prediction accuracy. An attention mechanism assigns weights according to how much each temporal feature contributes to the emotion, selecting from the large pool of feature information the time series that best characterize it. Low-level descriptors such as pitch, zero-crossing rate, and Mel-frequency cepstral coefficients are extracted from the CASIA, EMODB, and SAVEE speech emotion databases, and high-level statistical functions of these descriptors are computed, yielding 219-dimensional features as the experimental input. The results show that AHPCL achieves unweighted average recall of 86.02%, 84.03%, and 64.06% on the three databases respectively, exhibiting stronger robustness and generalization than the LeNet, DNN-ELM, and TSFFCNN baselines.
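Below is a minimal PyTorch sketch of a heterogeneous parallel architecture in the spirit of AHPCL: an LSTM branch for temporal cues, a convolutional branch for spectral cues, and a simple additive attention over the LSTM time steps. The class name, layer sizes, frame-level input shape, and exact attention form are illustrative assumptions, not the authors' configuration.

```python
# Sketch of a heterogeneous parallel model with attention over time steps.
import torch
import torch.nn as nn

class ParallelAttentionSER(nn.Module):
    def __init__(self, n_features=39, n_classes=6, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each time step
        self.conv = nn.Sequential(
            nn.Conv1d(n_features, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                   # pool over time
        )
        self.head = nn.Linear(2 * hidden + 64, n_classes)

    def forward(self, x):                              # x: (batch, time, features)
        h, _ = self.lstm(x)                            # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights over time
        temporal = (w * h).sum(dim=1)                  # weighted temporal summary
        spatial = self.conv(x.transpose(1, 2)).squeeze(-1)
        return self.head(torch.cat([temporal, spatial], dim=-1))

logits = ParallelAttentionSER()(torch.randn(4, 100, 39))  # 4 utterances, 100 frames
print(logits.shape)                                       # torch.Size([4, 6])
```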

4.
This paper surveys the research framework of speech emotion recognition, covering four aspects: emotion description models, emotional speech databases, feature extraction and dimensionality reduction, and emotion classification and regression algorithms. It summarizes discrete emotion models, dimensional emotion models, and one-way mappings between the two; distills the criteria for selecting an emotional speech database; refines the taxonomy of speech emotion features and lists commonly used feature extraction tools; and finally condenses the characteristics of common algorithms for feature extraction and for emotion classification and regression, reviews progress in deep learning, raises new problems the field needs to solve, and forecasts development trends.

5.
Speech signals play an essential role in communication and provide an efficient way to exchange information between humans and machines. Speech Emotion Recognition (SER) is a critical source of cues for evaluating human state and is applicable in many real-world settings such as healthcare, call centers, robotics, safety, and virtual reality. This work develops a novel TCN-based emotion recognition system that uses a spatial-temporal convolutional network over speech signals to recognize the speaker's emotional state. The authors design a Temporal Convolutional Network (TCN) core block that captures long-term dependencies in speech signals and feeds these temporal cues to a dense network, which fuses the spatial features and captures global information for final classification. The network automatically extracts valid sequential cues from speech signals and outperforms state-of-the-art (SOTA) and traditional machine learning algorithms. Final unweighted accuracies of 80.84% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and 92.31% on the Berlin Emotional Database (EMO-DB) indicate the robustness and efficiency of the designed model.
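The following is a minimal sketch of a TCN-style residual block with dilated causal convolutions, the general pattern such a core block builds on; channel counts, kernel size, and dilation schedule are illustrative assumptions rather than the authors' exact design.

```python
# Sketch of a dilated causal residual block; stacking with doubling dilations
# grows the receptive field exponentially, capturing long-term dependencies.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation        # left-pad for causality
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def _causal(self, x, conv):
        return conv(nn.functional.pad(x, (self.pad, 0)))  # pad only the past side

    def forward(self, x):                              # x: (batch, channels, time)
        h = self.relu(self._causal(x, self.conv1))
        h = self.relu(self._causal(h, self.conv2))
        return self.relu(h + x)                        # residual connection

tcn = nn.Sequential(*[TCNBlock(64, dilation=2 ** i) for i in range(4)])
out = tcn(torch.randn(2, 64, 200))                     # shape preserved: (2, 64, 200)
```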

6.
Speech is a medium through which people convey information while simultaneously expressing emotional attitudes, and speech emotion recognition is an important component of human-computer interaction. Starting from the concept of speech emotion recognition and its historical development, this paper surveys the research framework from six angles: it analyzes commonly used emotion description models, summarizes popular emotional speech databases and the characteristics of different database types, and examines techniques for extracting speech emotion features. By comparing the extensive work of many researchers on three classes of speech emotion recognition methods, it outlines the application scenarios in which these methods can be expected to work, and it looks ahead to the challenges and development trends of speech emotion recognition technology.

7.
Emotion recognition from speech has emerged as an important research area in the recent past. In this regard, a review of existing work on emotional speech processing is useful for carrying out further research. In this paper, the recent literature on speech emotion recognition is presented, considering issues related to emotional speech corpora, the different types of speech features, and the models used to recognize emotions from speech. Thirty-two representative speech databases are reviewed from the point of view of their language, number of speakers, number of emotions, and purpose of collection. Issues related to the emotional speech databases used in emotion recognition are also briefly discussed, as is the literature on the different features used in the task. The importance of choosing different classification models is discussed along with the review. Important issues for further emotion recognition research, in general and specifically in the Indian context, are highlighted wherever necessary.

8.
Research on multi-domain speech emotion recognition differs across corpora in annotation method, recording scenario, and interaction style, which makes building a multi-domain system complex. This paper designs a multi-domain speech emotion recognition model based on a multi-operation network: by combining the three single-domain databases CASIA, EMODB, and SAVEE, four multi-domain emotional speech databases (Hybrid-CE, Hybrid-ES, Hybrid-CS, and Hybrid-CES) and a hierarchical multi-operation network (HMN) are constructed. HMN consists of two heterogeneous parallel branches: the left branch contains two homogeneous parallel one-dimensional convolutional layers, each with 128 neurons, and the right branch contains a parallel Bi-GRU layer and Bi-LSTM layer, each with 64 memory units. The raw data are projected into different transform spaces for computation, representing the emotional information in speech more accurately. The different features extracted by the two branches are fused multiple times through hierarchical Concate, Add, and Multiply operations. On this basis, high-level statistical functions of low-level descriptors such as Mel-frequency cepstral coefficients, chromagram, and spectral contrast are computed, yielding 219-dimensional features as the input of HMN. Experimental results show that the model achieves F1-scores of 82.22%, 65.02%, 70.59%, and 73.47% on the four multi-domain databases, demonstrating good robustness and generalization.
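Here is a minimal PyTorch sketch of the hierarchical multi-operation fusion idea (Concate, Add, Multiply) applied to two branch outputs; the final merge layer and dimensions are illustrative assumptions rather than the exact HMN design.

```python
# Sketch of multi-operation fusion: combine two branch embeddings by
# concatenation, element-wise addition, and element-wise multiplication.
import torch
import torch.nn as nn

class MultiOpFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.merge = nn.Linear(4 * dim, dim)  # concat (2*dim) + add (dim) + mul (dim)

    def forward(self, left, right):           # left, right: (batch, dim)
        cat = torch.cat([left, right], dim=-1)
        add = left + right
        mul = left * right
        return self.merge(torch.cat([cat, add, mul], dim=-1))

fused = MultiOpFusion(64)(torch.randn(8, 64), torch.randn(8, 64))
print(fused.shape)                            # torch.Size([8, 64])
```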

9.
This paper proposes a speech recognition method based on an improved pulse-coupled neural network (IPCNN). The IPCNN is first used to rapidly extract image features from the speech spectrogram, and a probabilistic neural network (PNN) then assists in recognizing the speech. A speech recognition library is built by training on speech samples, and an integrated recognition system is constructed. Experimental results show that, compared with using PCNN or PNN alone, the recognition rate of this method improves by 22.7% and 39.4% respectively, reaching 92%.
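Below is a minimal NumPy sketch of the probabilistic neural network (PNN) classifier used here: a PNN scores each class by summed Gaussian (Parzen) kernels around its training samples. The smoothing width sigma and the toy data are assumptions.

```python
# Sketch of a PNN: per-class Parzen density estimate, predict the densest class.
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=1.0):
    preds = []
    for x in X_test:
        scores = {}
        for c in np.unique(y_train):
            d2 = np.sum((X_train[y_train == c] - x) ** 2, axis=1)
            scores[c] = np.mean(np.exp(-d2 / (2 * sigma ** 2)))  # kernel average
        preds.append(max(scores, key=scores.get))
    return np.array(preds)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(3, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
print(pnn_predict(X, y, X[:5] + 0.1))         # [0 0 0 0 0]
```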

10.
Speech emotion recognition is a challenging research topic with broad application prospects in the speech processing field. This paper explores one of its key problems: generating effective feature representations for emotion recognition. Emotion feature representations are generated from four angles: (1) low-level acoustic features, including energy, pitch, voice quality, and spectrum-related features, plus statistical features derived from them; (2) features obtained by distance transformation of cepstral acoustic features against emotion-dependent Gaussian mixture models; (3) features obtained by transforming acoustic features with an acoustic dictionary; and (4) features obtained by transforming acoustic features into GMM supervectors. Experiments compare the individual performance of each feature type on emotion recognition, attempt fusing the different features, and finally compare the effect of the different acoustic features on emotion datasets in several languages (the IEMOCAP English corpus, the CASIA Chinese corpus, and the Berlin German corpus). On IEMOCAP the system reaches a recognition accuracy of 71.9%, surpassing the best previously reported result on this dataset.
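The following is a minimal sklearn sketch of angle (4): adapting a universal background GMM's means toward one utterance and stacking them into a fixed-length supervector. The relevance factor, component count, and random stand-in features are assumptions; real systems run MAP adaptation over MFCC frames.

```python
# Sketch of GMM-supervector extraction via mean-only MAP adaptation.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
background = rng.normal(size=(2000, 13))        # pooled frames from many speakers
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
ubm.fit(background)

def supervector(frames, ubm, relevance=16.0):
    post = ubm.predict_proba(frames)            # (T, K) responsibilities
    n_k = post.sum(axis=0)                      # soft counts per component
    f_k = post.T @ frames                       # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]  # MAP adaptation weight
    means = alpha * (f_k / np.maximum(n_k[:, None], 1e-8)) + (1 - alpha) * ubm.means_
    return means.ravel()                        # fixed-length 8*13 vector

utt = rng.normal(size=(300, 13))                # frames of one utterance
print(supervector(utt, ubm).shape)              # (104,)
```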

11.
Speech emotion recognition plays an extremely important role in human-computer interaction and has attracted much attention in recent years. At present, most speech emotion recognition methods are trained and tested on a single emotion corpus. In practical applications, however, the training and test sets may come from different corpora, and because the distributions of different corpora differ greatly, most methods achieve unsatisfactory cross-corpus recognition performance. Consequently, many researchers have begun to focus on cross-corpus speech emotion recognition. This paper systematically surveys the state of the art and recent progress in cross-corpus methods, with particular emphasis on the application of newly developed deep learning techniques. It first introduces the emotion corpora commonly used in speech emotion recognition; then, from the perspectives of supervised, unsupervised, and semi-supervised learning, it summarizes and compares progress on existing cross-corpus methods based on handcrafted and deep features; and finally it discusses the challenges and opportunities in the field.

12.
Context-Independent Multilingual Emotion Recognition from Speech Signals
This paper presents and discusses an analysis of multilingual emotion recognition from speech with database-specific emotional features. Recognition was performed on the English, Slovenian, Spanish, and French InterFace emotional speech databases, which include several neutral speaking styles and six emotions: disgust, surprise, joy, fear, anger, and sadness. Speech features for emotion recognition were determined in two steps: in the first, low-level features were defined, and in the second, high-level features were calculated from them. Low-level features are composed of pitch, the derivative of pitch, energy, the derivative of energy, and the duration of speech segments. High-level features are statistical representations of the low-level features. Database-specific emotional features were selected from the high-level features that contain the most information about emotions in speech. Speaker-dependent and monolingual emotion recognisers were defined, as well as multilingual recognisers. Emotion recognition was performed using artificial neural networks. The achieved recognition accuracy was highest for speaker-dependent emotion recognition, smaller for monolingual recognition, and smallest for multilingual recognition. The database-specific emotional features are most convenient for multilingual emotion recognition: among the speaker-dependent, monolingual, and multilingual settings, the difference between recognition with all high-level features and recognition with database-specific features is smallest for the multilingual case, at 3.84%.
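Here is a minimal sketch of the two-step feature scheme described above: frame-level low-level features (pitch, energy, and their derivatives) summarized into high-level statistics. The particular statistics, librosa calls, and the synthetic stand-in signal are assumptions, not the paper's exact recipe.

```python
# Sketch: low-level tracks -> high-level statistical features.
import numpy as np
import librosa

y = librosa.tone(220, sr=16000, duration=2.0)   # synthetic stand-in utterance
sr = 16000
f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)   # pitch track
energy = librosa.feature.rms(y=y)[0]            # frame energy

def high_level(lld):
    d = np.diff(lld)                            # derivative of the low-level track
    return np.array([lld.mean(), lld.std(), lld.min(), lld.max(),
                     d.mean(), d.std()])

features = np.concatenate([high_level(f0), high_level(energy)])
print(features.shape)                           # (12,)
```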

13.
14.
Automatic emotion recognition from speech signals is one of the important research areas that adds value to machine intelligence. Pitch, duration, energy, and Mel-frequency cepstral coefficients (MFCC) are the widely used features in the field of speech emotion recognition, and a single classifier or a combination of classifiers is used to recognize emotions from them. The present work investigates the performance of autoregressive (AR) parameter features, which include gain and reflection coefficients in addition to the traditional linear prediction coefficients (LPC), for recognizing emotions from speech signals. The classification performance of the AR parameter features is studied using discriminant, k-nearest neighbor (KNN), Gaussian mixture model (GMM), back-propagation artificial neural network (ANN), and support vector machine (SVM) classifiers, and we find that reflection-coefficient features recognize emotions better than LPC. To improve recognition accuracy, we propose a class-specific multiple-classifier scheme built from multiple parallel classifiers, each optimized for one class. Each classifier for an emotional class is built from a feature selected from a pool of features and a classifier selected from a pool of classifiers, chosen to optimize recognition of that particular emotion. The outputs of the classifiers are combined by a decision-level fusion technique. The experimental results show that the proposed scheme improves emotion recognition accuracy, and further improvement is obtained when MFCC features are included in the feature pool.
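Below is a minimal NumPy sketch of extracting the AR parameters discussed above: the Levinson-Durbin recursion on a frame's autocorrelation yields the LPC, the reflection coefficients, and the prediction gain. The frame, model order, and random stand-in data are illustrative assumptions.

```python
# Sketch: Levinson-Durbin recursion -> LPC, reflection coefficients, gain.
import numpy as np

def levinson_durbin(frame, order=10):
    # Autocorrelation lags 0..order of the analysis frame.
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / err   # i-th reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]          # update LPC in place
        a[i] = k
        reflection.append(k)
        err *= 1.0 - k ** 2                          # prediction-error (gain) update
    return a, np.array(reflection), err

rng = np.random.default_rng(0)
frame = rng.normal(size=400)                         # stand-in speech frame
lpc, refl, gain = levinson_durbin(frame)
print(refl.round(3), round(gain, 2))
```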

15.
Speech recognition is an important means of enabling human-computer interaction and a foundational link in natural language processing. With the development of artificial intelligence, many application scenarios, such as human-computer interaction, require streaming speech recognition, defined as producing output while speech is still being input, which greatly reduces recognition latency during interaction. End-to-end speech recognition has achieved rich results in academic research, but end-to-end streaming recognition still faces challenges in both research and industrial application, so over the past two years it has gradually become a research focus in the speech field. This paper comprehensively surveys and analyzes recent work on end-to-end streaming recognition models and their performance optimization, covering: (1) the various methods and models for end-to-end streaming recognition, including CTC and RNN-T models that support streaming directly, and approaches such as monotonic attention that modify the attention mechanism to enable streaming; (2) methods for improving recognition accuracy and reducing latency, chiefly minimum word error rate training and knowledge distillation for accuracy, and alignment and regularization for latency; (3) commonly used open-source Chinese and English datasets for streaming recognition, and the performance evaluation criteria for streaming models; (4) the future development and prospects of end-to-end streaming speech recognition models.
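As a concrete illustration of point (1), here is a minimal PyTorch sketch of CTC training, one of the two model families the survey identifies as directly supporting streaming recognition. Shapes, vocabulary size, and the random inputs are illustrative assumptions.

```python
# Sketch: CTC loss marginalizes over all alignments, so no frame-level
# alignment is needed and a streaming encoder can emit tokens as it goes.
import torch
import torch.nn as nn

T, B, V = 50, 4, 28                       # time steps, batch, vocab (incl. blank=0)
log_probs = torch.randn(T, B, V).log_softmax(dim=-1)
targets = torch.randint(1, V, (B, 12))    # label sequences (no blanks)
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 12)

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```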

16.
The performance of isolated-word speech recognition systems has steadily improved over time as we learn more about how to represent the significant events in speech and how to capture these events via appropriate analysis procedures and training algorithms. In particular, algorithms based on both template matching (via dynamic time warping (DTW) procedures) and hidden Markov models (HMMs) have been developed that yield high accuracy on several standard vocabularies, including the 10 digits (zero to nine) and the set of 26 letters of the English alphabet (A-Z). Results are given showing currently attainable performance of a laboratory system for both template-based (DTW) and HMM-based recognizers, operating in both speaker-trained and speaker-independent modes, on the digit and alphabet vocabularies using telephone recordings. We show that the average error rates of these systems on standard vocabularies are significantly lower than those reported several years back on the exact same databases, reflecting the progress that has been made in all aspects of the speech recognition process.
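The following is a minimal NumPy sketch of dynamic time warping (DTW), the template-matching procedure referenced above: the cost of aligning two feature sequences under the usual match/insertion/deletion step pattern. The feature sequences here are random stand-ins for per-frame acoustic vectors.

```python
# Sketch of DTW alignment cost between a stored template and an utterance.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])     # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(0)
template = rng.normal(size=(40, 13))       # stored word template (40 frames)
utterance = rng.normal(size=(55, 13))      # incoming word, spoken more slowly
print(round(dtw_distance(template, utterance), 2))
```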

17.
A vocal-effort-related robust speech recognition algorithm based on articulatory features
晁浩  宋成  彭维平 《计算机应用》2015,35(1):257-261
To address the robustness of speech recognition under varying vocal effort (VE), a recognition algorithm based on a multi-model framework is proposed. First, the acoustic characteristics of speech under different vocal-effort modes and the impact of vocal-effort variation on recognition accuracy are analyzed. Then a Gaussian mixture model (GMM)-based vocal-effort mode detection method is proposed. Finally, according to the detection result, a dedicated acoustic model is trained for whispered speech recognition, while articulatory features are used alongside traditional spectral features for the other four vocal-effort modes. Isolated-word recognition experiments show clear accuracy gains: compared with the baseline system, the average word error rate across the five VE modes falls by 26.69%; compared with acoustic models trained on pooled data, by 14.51%; and compared with maximum likelihood linear regression (MLLR) adaptation, by 15.30%. The results indicate that articulatory features are more robust to vocal-effort variation than traditional spectral features, and that the multi-model framework is an effective way to address VE-related robustness.
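Here is a minimal sklearn sketch of the GMM-based vocal-effort detection step: one GMM is fit per mode, and a test utterance is assigned to the mode whose GMM gives the highest log-likelihood. The feature data are random stand-ins; the five mode names follow the five VE modes above but are labeled here as assumptions.

```python
# Sketch: per-mode GMMs, pick the mode with the highest average log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
modes = ['whisper', 'soft', 'normal', 'loud', 'shouted']
gmms = {}
for k, mode in enumerate(modes):                 # one GMM per vocal-effort mode
    frames = rng.normal(loc=k, size=(500, 13))   # stand-in MFCC frames
    gmms[mode] = GaussianMixture(n_components=4, random_state=0).fit(frames)

test = rng.normal(loc=2, size=(200, 13))         # utterance to classify
detected = max(modes, key=lambda m: gmms[m].score(test))
print(detected)                                  # 'normal' for this toy data
```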

18.
There is fast-growing research on designing energy-efficient computational devices and the applications that run on them. As one of the most compelling applications for mobile devices, automatic speech recognition (ASR) requires new methods that use fewer computational and memory resources while still achieving a high level of accuracy. One way to achieve this is through parameter quantization. In this work, we compare a variety of novel sub-vector clustering procedures for ASR system parameter quantization. Specifically, we look at systematic data-driven sub-vector selection techniques, most based on entropy minimization and others on recognition-accuracy maximization on a development set. We compare performance on two speech databases: phonebook, an isolated-word recognition task, and timit, a phonetically diverse connected-word corpus. While the optimal entropy-minimizing or accuracy-driven quantization methods are intractable, several simple schemes, including scalar quantization with separate codebooks per parameter and joint scalar quantization with normalization, perform well at approximating the optimal clustering.
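Below is a minimal NumPy sketch of one of the simple schemes the paper finds effective: scalar quantization with a separate codebook per parameter dimension, each codebook trained by 1-D Lloyd (k-means) iterations. The helper name scalar_codebook, codebook size, and toy data are illustrative assumptions.

```python
# Sketch: per-dimension scalar quantization of model parameters.
import numpy as np

def scalar_codebook(values, bits=3, iters=20):
    codebook = np.quantile(values, np.linspace(0, 1, 2 ** bits))  # init by quantiles
    for _ in range(iters):                       # 1-D Lloyd (k-means) updates
        idx = np.abs(values[:, None] - codebook[None, :]).argmin(axis=1)
        for c in range(len(codebook)):
            if np.any(idx == c):
                codebook[c] = values[idx == c].mean()
    return codebook

rng = np.random.default_rng(0)
params = rng.normal(size=(10000, 4))             # e.g., Gaussian means, 4 dims
books = [scalar_codebook(params[:, d]) for d in range(params.shape[1])]
quantized = np.stack([b[np.abs(params[:, d, None] - b).argmin(axis=1)]
                      for d, b in enumerate(books)], axis=1)
print(np.mean((params - quantized) ** 2))        # distortion after quantization
```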

19.
Cross-corpus speech emotion recognition (SER) via domain adaptation has gained wide acknowledgment as a way to develop robust emotion recognition systems across different corpora or datasets. However, the cross-lingual case remains a challenge in SER and needs more attention, since training and testing may involve different languages. In this paper, we propose a triple attentive asymmetric convolutional neural network to recognize emotions in cross-lingual and cross-corpus speech in an unsupervised setting. The method adopts joint supervision by softmax loss and center loss to learn highly discriminative feature representations for the target domain through high-quality pseudo-labels. The model uses three attentive convolutional neural networks asymmetrically: two networks, trained on labeled source samples, pseudo-label the unlabeled target samples, and the third extracts salient discriminative features from the pseudo-labeled target samples. We evaluate the method on datasets in three languages (English, German, and Italian). Experimental results indicate that it achieves higher prediction accuracy than other state-of-the-art methods.
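The following is a minimal PyTorch sketch of the joint softmax-plus-center-loss supervision used to sharpen class-discriminative features: center loss pulls each embedding toward a learnable per-class center. The weight 0.1 and all dimensions are illustrative assumptions.

```python
# Sketch: joint cross-entropy + center loss on embedding features.
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, features, labels):         # features: (batch, feat_dim)
        return ((features - self.centers[labels]) ** 2).sum(dim=1).mean()

feats = torch.randn(8, 128)
labels = torch.randint(0, 4, (8,))
logits = nn.Linear(128, 4)(feats)
center = CenterLoss(4, 128)
loss = nn.CrossEntropyLoss()(logits, labels) + 0.1 * center(feats, labels)
print(loss.item())                               # joint supervision objective
```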

20.
Functional paralanguage includes considerable emotion information and is insensitive to speaker changes. To improve emotion recognition accuracy under speaker-independent conditions, a fusion method combining functional paralanguage features with accompanying paralanguage features is proposed for speaker-independent speech emotion recognition. Using this method, functional paralanguage, such as laughter, crying, and sighing, is used to assist speech emotion recognition. The contributions of our work are threefold. First, an emotional speech database including six kinds of functional paralanguage and six typical emotions was recorded by our research group. Second, functional paralanguage is put forward for recognizing speech emotions in combination with the accompanying paralanguage features. Third, a fusion algorithm based on confidences and probabilities is proposed to combine the two feature types for speech emotion recognition. We evaluate the usefulness of the functional paralanguage features and the fusion algorithm in terms of precision, recall, and F1-measure on the emotional speech database recorded by our research group. The overall recognition accuracy achieved for six emotions is over 67% in the speaker-independent condition using the functional paralanguage features.
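Here is a minimal NumPy sketch of decision-level fusion by confidence-weighted probabilities, in the spirit of the confidence-and-probability fusion above: each classifier's posterior is weighted by its confidence (taken here as its peak posterior) before combination. The weighting rule is an assumption, not the authors' exact algorithm.

```python
# Sketch: confidence-weighted fusion of two classifiers' posteriors.
import numpy as np

def fuse(p_emotion, p_paralanguage):
    conf_e = p_emotion.max(axis=1, keepdims=True)        # confidence per sample
    conf_p = p_paralanguage.max(axis=1, keepdims=True)
    fused = conf_e * p_emotion + conf_p * p_paralanguage
    return fused / fused.sum(axis=1, keepdims=True)      # renormalize

p1 = np.array([[0.6, 0.3, 0.1]])     # posterior from accompanying-feature model
p2 = np.array([[0.2, 0.7, 0.1]])     # posterior from functional-paralanguage model
print(fuse(p1, p2).round(3))         # fused posterior over the emotion classes
```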
