Similar Documents
20 similar documents retrieved.
1.
Speech is an important medium for conveying and exchanging information, and people routinely communicate through it. The acoustic speech signal carries abundant speaker information, semantic information, and rich emotional information, which has given rise to three distinct lines of research: speaker recognition (SR), automatic speech recognition (ASR), and speech emotion recognition (SER). Each task uses different techniques and task-specific methods for information extraction and model design. This paper first surveys the early development of the three tasks in China and abroad, dividing the history of speech tasks into four stages, and summarizes the common acoustic features used for feature extraction across the three tasks, explaining the emphasis of each feature type. With the broad adoption of deep learning in recent years, speech tasks have also advanced considerably; the paper therefore analyzes how popular deep learning models are applied to acoustic modeling, summarizes, for both supervised and unsupervised settings, the acoustic feature extraction schemes and technical routes of the three tasks, and also covers multi-channel models with fused attention mechanisms for speech feature extraction. To perform speech recognition, speaker recognition, and emotion recognition simultaneously, a multi-task Tandem model is proposed for the personalized characteristics of the acoustic signal; in addition, a multi-channel collaborative network model is proposed, a design that improves the accuracy of multi-task feature extraction.
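A minimal sketch of what such a multi-task setup could look like: a shared acoustic encoder feeding three task heads for speaker, speech, and emotion recognition. This is an assumed architecture illustrating the multi-task Tandem idea, not the paper's exact model; all sizes are placeholders.

```python
import torch
import torch.nn as nn

class MultiTaskTandem(nn.Module):
    def __init__(self, feat_dim=40, hidden=256,
                 n_speakers=100, n_phones=40, n_emotions=7):
        super().__init__()
        # Shared encoder over frame-level acoustic features (e.g. log-Mels)
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # One head per task
        self.speaker_head = nn.Linear(2 * hidden, n_speakers)
        self.phone_head = nn.Linear(2 * hidden, n_phones)      # frame-level ASR targets
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                     # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)                # (batch, time, 2*hidden)
        pooled = h.mean(dim=1)                # utterance-level summary
        return (self.speaker_head(pooled),    # utterance-level speaker logits
                self.phone_head(h),           # frame-level phone logits
                self.emotion_head(pooled))    # utterance-level emotion logits

model = MultiTaskTandem()
logits_spk, logits_phn, logits_emo = model(torch.randn(8, 200, 40))
```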

2.
A Survey of Research Progress in Speech Emotion Recognition (cited by 8: 2 self-citations, 6 external)
This paper reviews and summarizes the current state and progress of speech emotion recognition research. The survey proceeds from five angles: emotion description models, representative emotional speech corpora, speech emotion feature extraction, speech emotion recognition algorithms, and applications of the technology, aiming to introduce and analyze speech emotion recognition as comprehensively as possible and to provide a useful academic reference for researchers. Finally, building on this analysis of the state of the art, it discusses the challenges and development trends facing the field, with an emphasis on summarizing, comparing, and analyzing the mainstream methods and frontier advances in speech emotion recognition research.

3.

Nowadays, automatic speech emotion recognition has numerous applications, and one of the important steps in these systems is feature selection. Because it is not known which acoustic features of a person's speech are related to its emotional content, much effort has gone into introducing a variety of acoustic features. However, since employing all of these features lowers the learning efficiency of classifiers, a subset must be selected; moreover, when there are several speakers, speaker-independent features are required. The present paper therefore attempts to select features that are not only related to the emotion of speech but are also speaker-independent. To this end, it proposes a multi-task approach that selects suitable speaker-independent features for each pair of classes. The selected features are then given to the classifier, and the outputs of the pairwise classifiers are combined to produce the output of the multi-class problem. Simulation results reveal that the proposed approach outperforms other methods in both detection accuracy and runtime.
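A hedged sketch of the pairwise (one-vs-one) idea: select a separate feature subset for each pair of emotion classes, train a binary classifier per pair, and combine the pairwise decisions by voting. The ANOVA-based selector and SVM classifier are assumptions; the paper's speaker-independence criterion is not reproduced here.

```python
import numpy as np
from itertools import combinations
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

def fit_pairwise(X, y, k=20):
    """Fit one (feature selector, classifier) pair per class pair."""
    models = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        sel = SelectKBest(f_classif, k=k).fit(X[mask], y[mask])
        clf = SVC().fit(sel.transform(X[mask]), y[mask])
        models[(a, b)] = (sel, clf)
    return models

def predict_voting(models, X, classes):
    """Combine pairwise decisions into a multi-class prediction by voting."""
    votes = np.zeros((len(X), len(classes)), dtype=int)
    index = {c: i for i, c in enumerate(classes)}
    for (a, b), (sel, clf) in models.items():
        for i, pred in enumerate(clf.predict(sel.transform(X))):
            votes[i, index[pred]] += 1
    return np.asarray(classes)[votes.argmax(axis=1)]
```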


4.
In automatic speech recognition (ASR) systems, hidden Markov models (HMMs) have been widely used for modeling the temporal speech signal. As discussed in Part I, the conventional acoustic models used for ASR have many drawbacks, such as weak duration modeling and poor discrimination. This paper (Part II) reviews the techniques proposed in the literature to refine standard HMM methods and cope with these limitations, and outlines current advances on the topic. The approaches emphasized in this part of the review are the connectionist approach, explicit duration modeling, discriminative training, and margin-based estimation methods. Further, various challenges and performance issues, such as environmental variability, tied mixture modeling, and the handling of distant speech signals, are analyzed along with directions for future research.

5.

This paper presents a learning mechanism based on the hybridization of static and dynamic learning. Recognizing the detection performance offered by state-of-the-art deep learning techniques and the competitive performance of conventional static learning techniques, we propose exploiting the concatenated (parallel) hybridization of static and dynamic learning-based feature spaces. This is contrary to the cascaded (series) hybridization topology, in which the initial feature space (provided by a conventional, static, handcrafted feature extraction technique) is explored using a deep, dynamic, automated learning technique; there, characteristics already suppressed by the conventional representation cannot be recovered by the dynamic learning technique. Instead, the proposed technique combines the conventional static and the deep dynamic representations in a concatenated (parallel) topology to generate an information-rich hybrid feature space. This hybrid space may thus aggregate the good characteristics of both conventional and deep representations, which are then explored using an appropriate classification technique. We also hypothesize that ensemble classification may better exploit this parallel hybrid view of the feature spaces. For this purpose, pyramid histogram of oriented gradients-based static learning is combined with convolutional neural network-based deep learning to produce the concatenated hybrid feature space, which is then explored with various state-of-the-art ensemble classification techniques. We use the publicly available INRIA person and Caltech pedestrian image datasets to assess the performance of the proposed hybrid learning system, and McNemar's test to statistically validate its advantage over various contemporary techniques. The validated experimental results show that the proposed hybrid representation yields effective detection performance (an AUC of 0.9996 on the INRIA person and 0.9985 on the Caltech pedestrian dataset) compared with the individual static and dynamic representations.
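A sketch of the parallel (concatenated) hybridization: a handcrafted HOG descriptor and a small CNN embedding are computed on the same image and stacked into one hybrid vector, which an ensemble classifier then explores. The paper uses a pyramid HOG and a trained CNN; both are simplified stand-ins here, and all sizes and the random-forest choice are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import hog
from sklearn.ensemble import RandomForestClassifier

cnn = nn.Sequential(                      # stand-in deep (dynamic) extractor
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
    nn.Flatten())                         # -> 16*4*4 = 256-dim embedding

def hybrid_features(images):              # images: (N, 64, 64) grayscale, float
    static = np.stack([hog(im, pixels_per_cell=(8, 8)) for im in images])
    with torch.no_grad():
        deep = cnn(torch.tensor(images, dtype=torch.float32)
                   .unsqueeze(1)).numpy()
    return np.hstack([static, deep])      # concatenated hybrid feature space

X = hybrid_features(np.random.rand(20, 64, 64))
y = np.random.randint(0, 2, 20)           # toy labels: person / non-person
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```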


6.
Speech emotion recognition plays a vital role in human-computer interaction and has attracted much attention in recent years. Most current speech emotion recognition methods are trained and tested on a single emotion corpus, yet in practice the training and test sets may come from different corpora. Because the distributions of different emotion corpora differ substantially, most methods achieve unsatisfactory cross-corpus recognition performance, and in recent years many researchers have therefore turned to cross-corpus speech emotion recognition. This paper systematically surveys the recent state and progress of cross-corpus speech emotion recognition methods, with particular analysis of how newly developed deep learning techniques are applied to the problem. It first introduces the emotion corpora commonly used in speech emotion recognition; then, from the perspectives of supervised, unsupervised, and semi-supervised learning combined with deep learning techniques, it summarizes and compares the progress of existing cross-corpus methods based on handcrafted and deep features; finally, it discusses the challenges and opportunities facing the field.

7.
Machine hearing is an emerging research field, analogous to machine vision, that aims to equip computers with the ability to hear and recognise a variety of sounds. It is a key enabler of natural human–computer speech interfacing, as well as of areas such as automated security surveillance, environmental monitoring, and smart homes/buildings/cities. Recent advances in machine learning allow current systems to accurately recognise a diverse range of sounds under controlled conditions; doing so in real-world noisy conditions, however, remains challenging. Several front-end feature extraction methods have been used for machine hearing, employing speech recognition features such as MFCC and PLP as well as image-like features such as AIM and SIF, and the best choice of feature is found to depend on the noise environment and the machine learning techniques used. Machine learning methods such as deep neural networks have been shown capable of inferring discriminative classification rules from less structured front-end features in related domains, and in the machine hearing field spectrogram image features have recently shown good performance for noise-corrupted classification with deep neural networks. However, there are many ways to extract features from spectrograms. This paper explores a novel data-driven feature extraction method that uses variance-based criteria to define the spectral pooling of features from spectrograms. The proposed method, based on maximising the pooled spectral variance of foreground and background sound models, is shown to achieve very good performance for robust classification.
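A minimal sketch of variance-driven spectral pooling: frequency rows of the spectrogram are ranked by their variance over time, and only the most variable bands are pooled into the feature vector. The paper's foreground/background sound models are not reproduced; this keeps only the core variance-based selection idea, with assumed parameters.

```python
import numpy as np
from scipy.signal import spectrogram

def variance_pooled_features(wave, sr=16000, n_bands=20):
    f, t, S = spectrogram(wave, fs=sr, nperseg=512)
    logS = np.log(S + 1e-10)                  # (freq_bins, frames)
    band_var = logS.var(axis=1)               # variance of each bin over time
    keep = np.argsort(band_var)[-n_bands:]    # most variable = most informative
    return logS[np.sort(keep)].mean(axis=1)   # pooled per-band energies

feat = variance_pooled_features(np.random.randn(16000))
print(feat.shape)                             # (20,)
```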

8.
To address the imbalanced sample distributions of speech emotion datasets, this paper proposes a speech emotion recognition method that combines data balancing and an attention mechanism with a convolutional neural network (CNN) and long short-term memory (LSTM) units. The method first extracts log-Mel spectrograms from the speech samples and segments them according to the sample distribution so as to balance the data; a pretrained CNN is then fine-tuned on the segmented Mel-spectrogram dataset to learn high-level segment-level speech features. Next, considering that different segments of an utterance contribute differently to emotion recognition, the learned segment-level CNN features are fed into an LSTM with an attention mechanism to learn discriminative features, and the LSTM output is combined with a softmax layer to classify the emotion. Experimental results on the BAUM-1s and CHEAVD2.0 datasets show that the proposed method effectively improves speech emotion recognition performance.
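A hedged sketch of the pipeline: per-segment log-Mel patches go through a CNN, the segment embeddings through an LSTM, and a learned attention weights the segments before the softmax emotion classifier. Shapes and layer sizes are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CnnLstmAttention(nn.Module):
    def __init__(self, n_emotions=6, emb=128):
        super().__init__()
        self.cnn = nn.Sequential(            # stand-in for the fine-tuned CNN
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(2),
            nn.Flatten(), nn.Linear(32 * 4, emb))
        self.lstm = nn.LSTM(emb, emb, batch_first=True)
        self.att = nn.Linear(emb, 1)         # scores each segment
        self.out = nn.Linear(emb, n_emotions)

    def forward(self, segs):                 # segs: (batch, n_seg, mels, frames)
        b, n, m, t = segs.shape
        e = self.cnn(segs.reshape(b * n, 1, m, t)).reshape(b, n, -1)
        h, _ = self.lstm(e)                  # (batch, n_seg, emb)
        a = F.softmax(self.att(h), dim=1)    # attention over segments
        context = (a * h).sum(dim=1)         # weighted sum = utterance vector
        return self.out(context)             # softmax applied in the loss

logits = CnnLstmAttention()(torch.randn(4, 10, 64, 32))
```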

9.

Emotion recognition from speech signals is an interesting research area with several applications, such as smart healthcare, autonomous voice response systems, assessing situational seriousness through caller affective state analysis in emergency centers, and other smart affective services. In this paper, we present a study of speech emotion recognition based on features extracted from spectrograms using a deep convolutional neural network (CNN) with rectangular kernels. Typically, CNNs have square kernels and pooling operators at the various layers, which suit 2D image data. In spectrograms, however, the information is encoded somewhat differently: time is represented along the x-axis, the y-axis shows the frequency of the speech signal, and amplitude is indicated by the intensity value at a particular position. To analyze speech through spectrograms, we propose rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods, to extract discriminative features. The proposed scheme effectively learns discriminative features from speech spectrograms and performs better than many state-of-the-art techniques when evaluated on the Emo-DB and a Korean speech dataset.
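A sketch of the rectangular-kernel idea: because the spectrogram axes carry different information (x = time, y = frequency), the convolution and pooling windows are made non-square. The exact kernel shapes below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

rect_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=(12, 4)),   # tall kernel: wide in frequency
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=(2, 4)),        # rectangular pooling neighborhood
    nn.Conv2d(32, 64, kernel_size=(4, 12)),  # wide kernel: long in time
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 7))                        # 7 emotion classes (Emo-DB)

logits = rect_cnn(torch.randn(2, 1, 128, 300))   # (batch, 1, freq, time)
```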


10.
Extracting features that characterize speech emotion and building an acoustic model with strong robustness and generalization are at the core of a speech emotion recognition system. This paper builds AHPCL, an attention-based heterogeneous parallel convolutional neural network model for speech emotion recognition: a long short-term memory network extracts the temporal-sequence features of emotional speech, convolution operations extract spatial spectral features, and the temporal and spatial information are combined to represent emotion jointly, improving prediction accuracy. An attention mechanism assigns weights according to how much each temporal-sequence feature contributes to the emotion, selecting from the large pool of features those temporal sequences that best characterize it. On the CASIA, EMODB, and SAVEE speech emotion databases, low-level descriptors such as pitch, zero-crossing rate, and Mel-frequency cepstral coefficients are extracted, and high-level statistical functions over these descriptors yield 219-dimensional feature vectors used as input for the experiments. The results show that AHPCL achieves unweighted average recall of 86.02%, 84.03%, and 64.06% on the three databases respectively, exhibiting stronger robustness and generalization than the LeNet, DNN-ELM, and TSFFCNN baseline models.
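A hedged sketch of a heterogeneous parallel design in the spirit of AHPCL: an LSTM branch models the temporal sequence of low-level descriptors while a convolutional branch extracts spatial patterns from the same input; attention weights the LSTM time steps, and the two branch outputs are fused for classification. The 219-dim input follows the abstract; everything else is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelAttnNet(nn.Module):
    def __init__(self, feat_dim=219, hidden=128, n_emotions=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.att = nn.Linear(hidden, 1)
        self.conv = nn.Sequential(nn.Conv1d(feat_dim, hidden, 5, padding=2),
                                  nn.ReLU(), nn.AdaptiveAvgPool1d(1))
        self.out = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                    # x: (batch, time, 219)
        h, _ = self.lstm(x)
        a = F.softmax(self.att(h), dim=1)    # attention over time steps
        temporal = (a * h).sum(dim=1)        # (batch, hidden)
        spatial = self.conv(x.transpose(1, 2)).squeeze(-1)
        return self.out(torch.cat([temporal, spatial], dim=1))

logits = ParallelAttnNet()(torch.randn(4, 50, 219))
```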

11.
Activity detection and classification using different sensor modalities have emerged as a revolutionary technology for real-time, autonomous monitoring in behaviour analysis, ambient assisted living, activities of daily living (ADL), elderly care, rehabilitation, entertainment and surveillance in smart home environments. Wearable devices, smartphones and ambient environment devices are equipped with a variety of sensors, such as accelerometers, gyroscopes, magnetometers, heart rate, pressure and wearable cameras, for activity detection and monitoring. Data from these sensors are pre-processed, and different feature sets, such as time-domain, frequency-domain and wavelet-transform features, are extracted and transformed by machine learning algorithms for human activity classification and monitoring. Recently, deep learning algorithms for automatic feature representation have also been proposed to lessen the reliance on handcrafted features and to increase performance accuracy. Initially, a single set of sensor data, features or classifiers was used for activity recognition applications; a newer trend is to implement fusion strategies that combine sensor data, features and classifiers to provide diversity, offer higher generalization, and tackle challenging issues. For instance, combinations of inertial sensors provide a mechanism to differentiate activities with similar patterns and to identify posture accurately, while other multimodal sensor data are used for energy expenditure estimation, object localization in smart homes and health status monitoring. Hence, the focus of this review is to provide an in-depth and comprehensive analysis of data fusion and multiple classifier system techniques for human activity recognition, with emphasis on mobile and wearable devices. First, data fusion methods and modalities are presented, and feature fusion, including deep learning fusion, for human activity recognition is critically analysed, with applications, strengths and issues identified. The review then presents the different multiple classifier system designs and fusion methods recently proposed in the literature. Finally, open research problems that require further research and improvement are identified and discussed.

12.
Automatic recognition of children's speech is a challenging topic in computer-based speech recognition systems. The conventional feature extraction method, the Mel-frequency cepstral coefficient (MFCC), is not efficient for children's speech recognition. This paper proposes a novel fuzzy-based discriminative feature representation to address the recognition of Malay vowels uttered by children. Because acoustic speech parameters vary with age, the performance of automatic speech recognition (ASR) systems degrades on children's speech. To solve this problem, this study addresses the representation of relevant and discriminative features for children's speech recognition. The method extracts MFCCs with a narrower filter bank, followed by a fuzzy-based feature selection method that provides relevant, discriminative and complementary features. For this purpose, conflicting objective functions measuring the goodness of the features have to be fulfilled; a fuzzy formulation of the problem and fuzzy aggregation of the objectives are used to address the uncertainties involved. The proposed method can reduce dimensionality without compromising the speech recognition rate. To assess its capability, the study analyzed six Malay vowels from recordings of 360 children aged 7 to 12. Upon extracting the features, two well-known classification methods, MLP and HMM, were employed for the speech recognition task, with optimal parameter adjustment performed for each classifier. The experiments were conducted in a speaker-independent manner. The proposed method outperformed conventional MFCC and a number of conventional feature selection methods in the children's speech recognition task: the fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative ability, enhancing the difference between the vowel classes.
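A minimal sketch of the front end only: MFCCs computed over a narrower mel filter bank, restricted to a band assumed to carry most information for children's vowels. The band limits and filter count here are assumptions; the fuzzy feature-selection stage is not reproduced.

```python
import numpy as np
import librosa

def narrowband_mfcc(wave, sr=16000):
    return librosa.feature.mfcc(
        y=wave, sr=sr, n_mfcc=13,
        n_mels=24,                 # fewer, hence narrower, mel filters
        fmin=100.0, fmax=4000.0)   # band limits (assumed) for child speech

mfcc = narrowband_mfcc(np.random.randn(16000).astype(np.float32))
print(mfcc.shape)                  # (13, n_frames)
```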

13.
In pronunciation quality assessment, acoustic models have traditionally been trained only on standard-pronunciation data, which cannot describe the non-standard pronunciations encountered in actual testing, so a mismatch between training and testing is unavoidable. To address this problem, this paper proposes an algorithm that optimizes the acoustic model on data covering a range of pronunciations, under the criterion of minimizing the mean squared error between machine scores and human scores. Experiments were conducted on 3,685 recordings from live Putonghua proficiency test sessions (498 for testing, 3,187 for training) and show that the assessment-oriented acoustic model obtained with the proposed optimization significantly outperforms an acoustic model built in the conventional way.
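A small sketch of the optimization criterion only: a pronunciation scorer is tuned so that the mean squared error between machine scores and human rater scores is minimized. The scorer here is a toy regressor over assumed utterance-level features, not the paper's acoustic model.

```python
import torch
import torch.nn as nn

scorer = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)
mse = nn.MSELoss()

features = torch.randn(3187, 64)        # 3,187 training utterances (toy features)
human_scores = torch.rand(3187, 1) * 100

for epoch in range(10):
    opt.zero_grad()
    loss = mse(scorer(features), human_scores)  # machine-vs-human MSE
    loss.backward()
    opt.step()
```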

14.
The application of cross-corpus domain adaptation methods in speech emotion recognition (SER) has gained wide recognition as a way to build robust emotion recognition systems from different corpora or datasets. However, the cross-lingual setting remains a challenge in SER, and more attention is needed to resolve the scenario in which different language types appear in training and testing. In this paper, we propose a triple attentive asymmetric convolutional neural network to recognize emotions in cross-lingual and cross-corpus speech in an unsupervised approach. The proposed method adopts the joint supervision of softmax loss and center loss to learn highly discriminative feature representations for the target domain via high-quality pseudo-labels. The model uses three attentive convolutional neural networks asymmetrically: two of the networks artificially label unlabeled target samples based on predictions from training on labeled source samples, and the third obtains salient, discriminative target features from the pseudo-labeled target samples. We evaluate the proposed method on datasets in three different language types (English, German, and Italian). The experimental results indicate that our proposed method achieves higher prediction accuracy than other state-of-the-art methods.
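A hedged sketch of the joint supervision: softmax (cross-entropy) loss pulls classes apart while a center loss pulls each embedding toward its class center, yielding more discriminative representations. This is the standard center-loss formulation in simplified form; the weighting lambda and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    def __init__(self, n_classes, feat_dim):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_classes, feat_dim))

    def forward(self, feats, labels):
        # Mean squared distance of each feature to its class center
        return ((feats - self.centers[labels]) ** 2).sum(dim=1).mean()

# Joint objective: L = L_softmax + lambda * L_center
feats = torch.randn(16, 128, requires_grad=True)   # embeddings from the CNN
logits = nn.Linear(128, 4)(feats)                  # 4 emotion classes (toy)
labels = torch.randint(0, 4, (16,))
loss = nn.CrossEntropyLoss()(logits, labels) \
     + 0.01 * CenterLoss(4, 128)(feats, labels)    # lambda = 0.01 (assumed)
loss.backward()
```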

15.
In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions of HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research work related to HMM-based ASR carried out during the last three decades. We present these approaches under three categories: conventional methods, refinements, and advancements of the HMM. The review appears in two parts (papers): (i) an overview of conventional methods for acoustic-phonetic modeling, and (ii) refinements and advancements of acoustic models. Part I explores the architecture and working of the standard HMM along with its limitations, and also covers the different modeling units, language models and decoders. Part II reviews the advances and refinements of conventional HMM techniques along with the current challenges and performance issues related to ASR.

16.
Construction procedural constraints are critical for effective construction procedure checking in practice and for various inspection systems. At present, the manual extraction of construction procedural constraints is costly and time-consuming, and automatically extracting construction procedural constraint knowledge (e.g., knowledge entities and the relationships between them) from regulatory documents is a key challenge. Traditionally, natural language processing is implemented using either rule-based or machine learning approaches; the limited efforts on rule-based extraction of construction regulations often rely on pre-defined vocabularies and involve heavy feature engineering. Based on how construction procedural constraint knowledge is expressed in Chinese regulations, this paper explores a hybrid deep neural network, combining a bidirectional long short-term memory (Bi-LSTM) network and a conditional random field (CRF), for the automatic extraction of qualitative construction procedural constraints. The proposed network recognizes and extracts named entities and the relations between them. Unlike existing rule-based information extraction efforts, the proposed hybrid deep learning approach can be applied without complex handcrafted feature engineering; moreover, long-distance dependency relationships between entities in the regulations are taken into account. The implementation results demonstrate the good performance of the end-to-end deep neural network in extracting construction procedural constraints. This study can be considered one of the early explorations of knowledge extraction from construction regulations.
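A minimal Bi-LSTM + CRF sketch for entity tagging. The CRF layer is taken from the third-party pytorch-crf package (an assumption; any linear-chain CRF implementation works), and the vocabulary, embedding, and tag-set sizes are illustrative.

```python
import torch
import torch.nn as nn
from torchcrf import CRF          # pip install pytorch-crf (assumed dependency)

class BiLstmCrf(nn.Module):
    def __init__(self, vocab=5000, emb=100, hidden=128, n_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * hidden, n_tags)   # per-token tag scores
        self.crf = CRF(n_tags, batch_first=True)

    def loss(self, tokens, tags):
        h, _ = self.lstm(self.embed(tokens))
        return -self.crf(self.emit(h), tags)        # negative log-likelihood

    def decode(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.crf.decode(self.emit(h))        # best tag sequence

model = BiLstmCrf()
tokens = torch.randint(0, 5000, (2, 30))            # two toy sentences
print(model.decode(tokens))
```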

17.
In recent years, the use of Multi-Layer Perceptron (MLP) derived acoustic features has become increasingly popular in automatic speech recognition systems. These features are typically used in combination with standard short-term spectral features and have been found to yield consistent performance improvements. However, a number of design decisions and issues are associated with the use of MLP features in state-of-the-art speech recognition systems. Two modifications to the standard training/adaptation procedures are described in this work. First, the paper examines how MLP features, and the associated acoustic models, can be trained efficiently on large corpora using discriminative training techniques. An approach that combines multiple individual MLPs is proposed, which reduces the time needed to train MLPs on large amounts of data; to further speed up discriminative training, a lattice re-use method is also proposed. The paper then examines how systems with MLP features can be adapted to particular speakers or acoustic environments. In contrast to previous work, where standard HMM adaptation schemes were used, linear input network adaptation is investigated. System performance is studied within a multi-pass adaptation/combination framework, which allows the gains of individual techniques to be evaluated at various stages, as well as their impact in combination with other sub-systems. All the approaches considered in this paper are evaluated on an Arabic large-vocabulary speech recognition task that includes both Broadcast News and Broadcast Conversation test data.
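A sketch of linear input network (LIN) adaptation: a small trainable linear transform, initialized at the identity, is prepended to the frozen speaker-independent MLP, so only the input mapping is re-estimated on a target speaker's data. All sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_dim, n_targets = 39, 1000
mlp = nn.Sequential(nn.Linear(feat_dim, 512), nn.Sigmoid(),
                    nn.Linear(512, n_targets))       # pretrained MLP (frozen)
for p in mlp.parameters():
    p.requires_grad = False

lin = nn.Linear(feat_dim, feat_dim)                  # the only adapted layer
with torch.no_grad():
    lin.weight.copy_(torch.eye(feat_dim))            # start at identity
    lin.bias.zero_()

adapted = nn.Sequential(lin, mlp)                    # LIN feeds the frozen MLP
opt = torch.optim.SGD(lin.parameters(), lr=0.01)     # adapt on speaker data
```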

18.
This paper proposes an end-to-end monaural speech separation algorithm based on deep acoustic features. Traditional acoustic feature extraction requires operations such as the Fourier transform and the discrete cosine transform, which lose speech energy and introduce long latency. To mitigate these problems, the raw speech waveform is used as the input to a deep neural network, which learns deeper acoustic representations of the signal and performs separation end to end. Objective evaluations show that the proposed algorithm not only improves separation performance but also reduces the latency of speech separation.
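A hedged sketch of the end-to-end idea: the raw waveform, rather than a Fourier-based representation, is fed to the network; a learned 1-D convolutional encoder replaces the STFT, per-source masks separate the mixture, and a transposed convolution reconstructs waveforms. This is a generic mask-based separator, not the paper's model; sizes are illustrative.

```python
import torch
import torch.nn as nn

class WaveformSeparator(nn.Module):
    def __init__(self, n_filters=256, win=16, n_src=2):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, win, stride=win // 2)
        self.mask = nn.Sequential(nn.Conv1d(n_filters, n_filters * n_src, 1),
                                  nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, win, stride=win // 2)
        self.n_src = n_src

    def forward(self, wave):                       # wave: (batch, 1, samples)
        z = torch.relu(self.encoder(wave))         # learned "spectrogram"
        masks = self.mask(z).chunk(self.n_src, 1)  # one mask per speaker
        return [self.decoder(m * z) for m in masks]

est1, est2 = WaveformSeparator()(torch.randn(4, 1, 16000))
```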

19.
Recently, deep learning methodologies using hierarchical architectures have become popular for analysing physiological signals in multiple modalities for human emotion recognition. Most state-of-the-art systems use deep learning for emotion classification; deep learning, however, is most effective for deep feature extraction. In this research, we therefore apply an unsupervised deep belief network (DBN) to extract depth-level features from the fused observations of Electro-Dermal Activity (EDA), Photoplethysmogram (PPG) and Zygomaticus Electromyography (zEMG) sensor signals. The DBN-produced features are then combined with statistical features of EDA, PPG and zEMG to prepare a feature-fusion vector, which is used to classify five basic emotions: Happy, Relaxed, Disgust, Sad and Neutral. Because the emotion classes are not linearly separable in the feature-fusion vector, a Fine Gaussian Support Vector Machine (FGSVM) with a radial basis function kernel is used for non-linear classification of human emotions. Our experiments on a public multimodal physiological signal dataset show that the DBN- and FGSVM-based model significantly increases the emotion recognition rate compared with existing state-of-the-art emotion classification techniques.
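A hedged sketch of the fusion-and-classify stage: unsupervised depth features (approximated here by stacked RBM transforms standing in for the DBN) are concatenated with handcrafted statistical features, then classified by an RBF-kernel SVM with a small gamma (the "Fine Gaussian SVM" setting). All data, dimensions, and the RBM stand-in are assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

raw = np.random.rand(200, 60)            # fused EDA+PPG+zEMG windows (toy)
stats = np.random.rand(200, 12)          # handcrafted statistical features
y = np.random.randint(0, 5, 200)         # 5 emotions

# Two stacked RBMs as a rough stand-in for the unsupervised DBN
dbn = make_pipeline(BernoulliRBM(n_components=32, n_iter=20),
                    BernoulliRBM(n_components=16, n_iter=20)).fit(raw)
fused = np.hstack([dbn.transform(raw), stats])   # feature-fusion vector

fgsvm = SVC(kernel="rbf", gamma=1.0 / fused.shape[1])  # fine (small) gamma
fgsvm.fit(fused, y)
```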

20.
This article analyses research in speech emotion recognition (SER) from 2006 to 2017 in order to identify the current focus of research and the areas in which research is lacking. The objective is to examine what is being done in this field. Searching on selected keywords, we extracted and analysed 260 articles from well-known online databases. The analysis indicates that SER is an active field, with dozens of articles published each year in journals and conference proceedings. The majority of articles concentrate on three critical aspects of SER: (1) databases, (2) suitable speech features, and (3) classification techniques for maximizing the recognition accuracy of SER systems. An association analysis of these critical aspects and of how they influence SER performance shows that certain combinations of databases, speech features and classifiers affect the recognition accuracy of the SER system. Based on our review, we also suggest aspects of SER that could be taken into consideration in future work.
