20 similar documents retrieved (search time: 15 ms)
1.
2.
K. Sreenivasa Rao Shashidhar G. Koolagudi Ramu Reddy Vempada 《International Journal of Speech Technology》2013,16(2):143-160
In this paper, global and local prosodic features extracted at the sentence, word, and syllable levels are proposed for speech emotion (affect) recognition. Duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours. Local prosodic features represent the temporal dynamics of the prosody. Global and local prosodic features are analyzed separately and in combination at different levels for the recognition of emotions. Words and syllables at different positions (initial, middle, and final) are also explored separately to analyze their contribution to emotion recognition. All studies are carried out on the simulated Telugu emotion speech corpus (IITKGP-SESC), and the results are compared with those on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that the recognition performance of local prosodic features is better than that of global prosodic features. Words in the final position of a sentence and syllables in the final position of a word exhibit more emotion-discriminative information than words and syllables in other positions.
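As an illustration of the global statistics mentioned in the abstract, the sketch below computes the mean, minimum, maximum, standard deviation, and slope of a prosodic contour with NumPy. It is a minimal example under assumed conventions, not the authors' implementation; the function name and the toy pitch values are placeholders.

```python
import numpy as np

def global_prosodic_stats(contour):
    """Gross statistics of a prosodic contour (e.g. a pitch track in Hz):
    mean, minimum, maximum, standard deviation, and least-squares slope."""
    contour = np.asarray(contour, dtype=float)
    t = np.arange(len(contour))
    slope = np.polyfit(t, contour, deg=1)[0]   # slope of a first-order fit
    return {
        "mean": contour.mean(),
        "min": contour.min(),
        "max": contour.max(),
        "std": contour.std(),
        "slope": slope,
    }

# Toy example: a falling pitch contour over one word (values are made up).
f0 = [210, 205, 198, 190, 181, 174, 170]
print(global_prosodic_stats(f0))
```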
3.
Gajic B. Paliwal K.K. 《IEEE transactions on audio, speech, and language processing》2006,14(2):600-608
We investigate how dominant-frequency information can be used in speech feature extraction to increase the robustness of automatic speech recognition against additive background noise. First, we review several previously proposed auditory-based feature extraction methods and argue that the use of dominant-frequency information might be one of the major reasons for their improved noise robustness. Furthermore, we propose a new feature extraction method, which combines subband power information with dominant subband frequency information in a simple and computationally efficient way. The proposed features are shown to be considerably more robust against additive background noise than standard mel-frequency cepstrum coefficients on two different recognition tasks. The performance improvement increased as we moved from a small-vocabulary isolated-word task to a medium-vocabulary continuous-speech task, where the proposed features also outperformed a computationally expensive auditory-based method. The greatest improvement was obtained for noise types characterized by a relatively flat spectral density.
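A rough sketch of the kind of feature described here, per-subband log power paired with the dominant frequency inside each subband, is given below. The uniform band edges, window choice, and function name are assumptions made for illustration and do not reproduce the authors' exact filterbank.

```python
import numpy as np

def subband_power_and_dominant_freq(frame, fs, n_bands=8):
    """For one windowed frame, return per-subband log power together with
    the frequency of the strongest FFT bin in each subband."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[lo:hi]
        log_power = np.log(band.sum() + 1e-10)
        dominant = freqs[lo + np.argmax(band)]      # dominant frequency in the band
        feats.extend([log_power, dominant])
    return np.array(feats)

# Toy usage on a synthetic 25 ms frame of a 440 Hz tone plus noise.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(len(t))
print(subband_power_and_dominant_freq(frame, fs))
```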
4.
5.
Shashidhar G. Koolagudi K. Sreenivasa Rao 《International Journal of Speech Technology》2012,15(2):265-289
In this work, source, system, and prosodic features of speech are explored for characterizing and classifying the underlying emotions. Owing to their complementary nature, different speech features contribute in different ways to expressing the emotions. Linear prediction residual samples chosen around glottal closure regions, together with glottal pulse parameters, are used to represent the excitation source information. Linear prediction cepstral coefficients extracted through simple block processing and through pitch-synchronous analysis represent the vocal tract information. Global and local prosodic features extracted from the gross statistics and the temporal dynamics of the sequences of duration, pitch, and energy values represent the prosodic information. Emotion recognition models are developed using the above features separately and in combination. The simulated Telugu emotion database (IITKGP-SESC) is used to evaluate the proposed features, and the emotion recognition results obtained on IITKGP-SESC are compared with the results on the internationally known Berlin emotion speech database (Emo-DB). Autoassociative neural networks, Gaussian mixture models, and support vector machines are used to develop the emotion recognition systems with source, system, and prosodic features, respectively. A weighted combination of evidence is used when combining the outputs of the systems developed with the different features. The results show that each of the proposed speech features contributes toward emotion recognition, and that combining the features improves recognition performance, indicating their complementary nature.
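The weighted combination of evidence mentioned above can be sketched as simple score-level fusion. The weights, per-emotion scores, and class labels below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def fuse_scores(score_sets, weights):
    """Weighted combination of per-class evidence from several subsystems.
    Each entry of score_sets is a (n_classes,) array of normalized scores."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    combined = sum(w * np.asarray(s, dtype=float) for w, s in zip(weights, score_sets))
    return combined, int(np.argmax(combined))

# Illustrative per-emotion scores (anger, happiness, neutral, sadness) from
# source, system, and prosodic subsystems; the weights are placeholders.
source  = [0.10, 0.30, 0.20, 0.40]
system  = [0.05, 0.50, 0.25, 0.20]
prosody = [0.15, 0.45, 0.20, 0.20]
scores, label = fuse_scores([source, system, prosody], weights=[0.2, 0.5, 0.3])
print(scores, "predicted class index:", label)
```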
6.
Traditional speech recognition systems are data-driven and rely on a language model to choose the optimal decoding path, which in some scenarios produces decoding results with the right pronunciation but the wrong characters. To address this problem, an end-to-end speech recognition method assisted by prosodic features is proposed, which uses the prosodic information in speech to reinforce the probability of the correct Chinese character combinations in the language model. Building on an attention-based encoder-decoder speech recognition framework, prosodic features such as articulation intervals and articulation energy are first extracted from the distribution of the attention coefficients; these prosodic features are then combined with the decoder, significantly improving recognition accuracy in cases of identical or similar pronunciations and semantic ambiguity. Experimental results show that, on speech recognition tasks at the 1,000-hour and 10,000-hour scales, the proposed method achieves relative accuracy improvements of 5.2% and 5.0%, respectively, over the end-to-end baseline, further improving the intelligibility of the recognition results.
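One plausible, purely illustrative way to derive duration- and pause-like cues from the attention coefficients of an encoder-decoder model is sketched below; it is not the paper's definition of the prosodic features, and the attention matrix is a made-up toy example.

```python
import numpy as np

def attention_prosody_cues(attn, frame_shift=0.01):
    """Given an attention matrix of shape (output_tokens, encoder_frames),
    derive rough per-token durations and inter-token gaps (in seconds)."""
    attn = np.asarray(attn, dtype=float)
    owner = attn.argmax(axis=0)                 # which token each frame attends to most
    durations = np.array([(owner == k).sum() for k in range(attn.shape[0])]) * frame_shift
    peaks = attn.argmax(axis=1)                 # frame index where each token peaks
    gaps = np.diff(peaks) * frame_shift         # spacing between successive token peaks
    return durations, gaps

# Toy attention matrix: 3 output tokens over 12 encoder frames (values made up).
attn = np.array([
    [.8, .7, .6, .1, .0, .0, .0, .0, .0, .0, .0, .0],
    [.1, .2, .3, .8, .9, .7, .2, .1, .0, .0, .0, .0],
    [.1, .1, .1, .1, .1, .3, .8, .9, 1., .9, .8, .7],
])
print(attention_prosody_cues(attn))
```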
7.
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. Using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: a linear discriminant classifier, K-nearest neighbors, a C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods.
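For context, a plain sparse-representation classifier can be sketched as below: the test vector is coded over a dictionary of training samples and classified by per-class reconstruction residual. Orthogonal matching pursuit is used here only as a stand-in solver; the paper's weighted, maximum-likelihood-based formulation is not reproduced, and all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(train_X, train_y, test_x, n_nonzero=10):
    """Sparse-representation classification: code the test vector over the
    training dictionary, then pick the class whose atoms reconstruct it best."""
    D = train_X.T                                   # dictionary: columns are training samples
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(D, test_x)
    coef = omp.coef_
    residuals = {}
    for c in np.unique(train_y):
        mask = (train_y == c)
        recon = D[:, mask] @ coef[mask]             # reconstruction from class-c atoms only
        residuals[c] = np.linalg.norm(test_x - recon)
    return min(residuals, key=residuals.get), residuals

# Toy data: 40 training vectors of dimension 20 in two classes.
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 1, (20, 20)), rng.normal(3, 1, (20, 20))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = rng.normal(3, 1, 20)                       # should be assigned to class 1
print(src_predict(train_X, train_y, test_x))
```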
8.
9.
This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target signal source in order to recognize spoken commands uttered in reverberant and noisy environments and acquired by a microphone array. The architecture of the BSE system is based on multiple stages: (a) TDOA estimation, (b) mixing-system identification for the target source, (c) on-line semi-blind source separation, and (d) source extraction. All stages are effectively combined, allowing estimation of the target signal with limited distortion. While a generalization of the BSE framework is described, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e., binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE, and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients. Moreover, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and improve the overall performance. A detailed comparison between different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system gives a significant reduction of the error rate.
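Stage (a), TDOA estimation, is commonly implemented with GCC-PHAT; a minimal sketch is given below. It is a generic implementation, not the system described in the paper, and the toy delay is synthetic.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Estimate the time difference of arrival between two channels with
    GCC-PHAT: whiten the cross-spectrum, then pick the correlation peak."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy check: channel 2 is channel 1 delayed by 5 samples (~0.31 ms at 16 kHz).
fs = 16000
rng = np.random.default_rng(1)
x1 = rng.normal(size=2048)
x2 = np.concatenate((np.zeros(5), x1[:-5]))
print(gcc_phat_tdoa(x1, x2, fs))   # expected to be close to -5/fs
```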
11.
In this paper, we propose a method for robust detection of vowel onset points (VOPs) from noisy speech. The proposed VOP detection method exploits the spectral energy at the formant frequencies of the speech segments present in the glottal closure region. In this work, formants are extracted using the group delay function, and glottal closure instants are extracted using a zero-frequency-filter-based method. The performance of the proposed VOP detection method is compared with an existing method, which combines evidence from the excitation source, spectral peak energy, and the modulation spectrum. Speech data from the TIMIT database and noise samples from the NOISEX database are used to analyze the performance of the VOP detection methods. A significant improvement in VOP detection performance is observed for the proposed method compared to the existing method.
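A rough sketch of zero-frequency filtering for locating glottal closure instants, one ingredient of the method described above, is shown below. The trend-removal window, the per-stage mean subtraction, and the toy excitation signal are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter

def zff_gci(speech, fs, mean_window=0.01):
    """Rough glottal closure instant (GCI) detection with zero-frequency
    filtering: integrate at 0 Hz, remove the local mean, and take the
    positive-going zero crossings of the resulting filtered signal."""
    x = np.diff(speech, prepend=speech[0])          # remove any DC offset
    win = int(mean_window * fs) | 1                 # odd-length trend-removal window
    y = x
    for _ in range(2):                              # two zero-frequency resonator stages
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)     # double pole at z = 1 (0 Hz)
        trend = np.convolve(y, np.ones(win) / win, mode="same")
        y = y - trend                               # subtract local mean to remove the trend
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]  # positive zero crossings
    return gci / fs

# Toy usage: a 100 Hz impulse train smoothed into a crude "voiced" signal.
fs = 8000
excitation = np.zeros(fs // 2)
excitation[::fs // 100] = 1.0
speech = lfilter([1.0], [1.0, -0.9], excitation)    # very rough vocal-tract coloring
print(zff_gci(speech, fs)[:5])
```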
12.
Noise estimation and detection algorithms must adapt to a changing environment quickly, so they use a least mean square (LMS) filter. However, there is a downside: the performance of an LMS filter is low, and it consequently lowers speech recognition rates. To overcome this weakness, we propose a method for establishing a robust speech recognition clustering model for noisy environments. Since the proposed method cancels noise with an average estimator least mean square (AELMS) filter in a noisy environment, a robust speech recognition clustering model can be established. With the AELMS filter, which can preserve the source features of speech and reduce the degradation of speech information, the noise in a contaminated speech signal is canceled, and a Gaussian state model is clustered to make recognition more robust to noise. Recognition performance was evaluated by composing a Gaussian clustering model, i.e., a robust speech recognition clustering model, in a noisy environment. The study shows that the signal-to-noise ratio of speech, improved by canceling the continuously changing environmental noise, was enhanced by 2.8 dB on average, and the recognition rate improved by 4.1%.
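For reference, a plain LMS adaptive noise canceller (not the AELMS variant proposed in the paper) can be sketched as follows; the filter order, step size, and toy signals are illustrative assumptions.

```python
import numpy as np

def lms_noise_cancel(primary, reference, order=16, mu=0.01):
    """Plain LMS adaptive noise cancellation: adapt an FIR filter so that the
    filtered reference noise matches the noise in the primary input, then
    subtract it.  Returns the cleaned (error) signal."""
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]    # current and past reference samples
        y = w @ x                                   # filtered noise estimate
        e = primary[n] - y                          # error = speech estimate
        w = w + mu * e * x                          # LMS weight update
        out[n] = e
    return out

# Toy usage: clean "speech" (a tone) corrupted by causally filtered white noise,
# with the raw noise available as the reference channel.
rng = np.random.default_rng(2)
n = 4000
speech = np.sin(2 * np.pi * 5 * np.arange(n) / 200)
noise = rng.normal(size=n)
corrupting = np.convolve(noise, [0.6, 0.3, 0.1])[:n]
cleaned = lms_noise_cancel(speech + corrupting, noise)
print("noisy power:", np.mean(corrupting ** 2),
      "residual power:", np.mean((cleaned - speech)[200:] ** 2))
```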
13.
A text-to-speech (TTS) system, also known as a speech synthesizer, has become one of the important technologies of recent years due to its expanding field of applications. Several works on speech synthesis have been carried out for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. The area of Arabic speech synthesis has not seen sufficient progress; it is still at an early stage, with low speech quality. In fact, speech synthesis systems face several problems (e.g., speech quality, articulatory effects, etc.). Different methods have been proposed to solve these issues, such as the use of large and varied unit sizes. This method is mainly implemented with the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on a statistical parametric approach and non-uniform-unit speech synthesis. Our system includes a diacritization engine. Modern Arabic text is written without the vowels, also called diacritic marks; unfortunately, these marks are very important for defining the correct pronunciation of the text, which explains the incorporation of the diacritization engine into our system. In this work, we propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system allows the generation of fully diacritized text with high precision and that our synthesis system produces high-quality speech.
14.
Chiu-Chuan Tu, Chia-Feng Juang 《Expert systems with applications》2012,39(3):2479-2488
This paper proposes a new method to detect the boundaries of speech in noisy environments. The detection method uses Haar wavelet energy and entropy (HWEE) as detection features. The Haar wavelet energy (HWE) is derived from the robust band that shows the most significant difference between speech and nonspeech segments at different noise levels. Similarly, the wavelet energy entropy (WEE) is computed by selecting the two wavelet energy bands whose entropy shows the most significant speech/nonspeech difference. The HWEE features are fed as inputs to a recurrent self-evolving interval type-2 fuzzy neural network (RSEIT2FNN) for classification. The RSEIT2FNN is used because it employs type-2 fuzzy sets, which are more robust to noise than type-1 fuzzy sets. The recurrent structure in the RSEIT2FNN helps to remember the context information of a test frame. The RSEIT2FNN outputs are compared with a threshold to determine whether each frame belongs to a speech or nonspeech period. The HWEE-based RSEIT2FNN detection was applied to speech detection in different noisy environments with different noise levels. Comparisons with other detection methods verified the advantage of the proposed use of HWEE.
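A minimal sketch of Haar wavelet band energies and the entropy of their normalized distribution is given below; it implements the Haar decomposition by averaging and differencing, and it does not reproduce the paper's band selection or the RSEIT2FNN classifier. The frame below is synthetic.

```python
import numpy as np

def haar_band_energies(frame, levels=4):
    """Haar wavelet decomposition by averaging/differencing; returns the
    energy of each detail band plus the final approximation band."""
    a = np.asarray(frame, dtype=float)
    if len(a) % (2 ** levels):
        a = a[: len(a) - len(a) % (2 ** levels)]    # trim so halving is exact
    energies = []
    for _ in range(levels):
        approx = (a[0::2] + a[1::2]) / np.sqrt(2)
        detail = (a[0::2] - a[1::2]) / np.sqrt(2)
        energies.append(np.sum(detail ** 2))
        a = approx
    energies.append(np.sum(a ** 2))
    return np.array(energies)

def wavelet_energy_entropy(energies):
    """Shannon entropy of the normalized wavelet band energies."""
    p = energies / (energies.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

# Toy usage on a 32 ms frame of a noisy 800 Hz tone at 8 kHz (synthetic data).
fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 800 * t) + 0.05 * np.random.randn(len(t))
e = haar_band_energies(frame)
print(e, wavelet_energy_entropy(e))
```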
15.
16.
Mustafa K. Bruce I.C. 《IEEE transactions on audio, speech, and language processing》2006,14(2):435-444
Several algorithms have been developed for tracking formant frequency trajectories of speech signals; however, most of these algorithms are either not robust in real-life noise environments or not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech, for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates for the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing has shown that the algorithm is robust over a wide range of signal-to-noise ratios for various types of background noise.
17.
Histogram equalization (HEQ) is one of the most efficient and effective techniques for reducing the mismatch between training and test acoustic conditions. However, most current HEQ methods are performed merely in a dimension-wise manner and do not allow for the contextual relationships between consecutive speech frames. In this paper, we present several novel HEQ approaches that exploit spatial-temporal feature distribution characteristics for speech feature normalization. The automatic speech recognition (ASR) experiments were carried out on the Aurora-2 standard noise-robust ASR task. The performance of the presented approaches was thoroughly tested and verified through comparisons with other popular HEQ methods. The experimental results show that, for clean-condition training, our approaches yield a significant word error rate reduction over the baseline system and give competitive performance relative to the other HEQ methods compared in this paper.
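A minimal dimension-wise HEQ sketch, mapping each feature dimension's empirical CDF onto a standard normal reference by quantile mapping, is shown below; it illustrates only the baseline idea, not the spatial-temporal extensions proposed in the paper, and the feature matrix is synthetic.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features):
    """Dimension-wise histogram equalization: map each dimension's empirical
    CDF onto a standard normal reference distribution (quantile mapping)."""
    features = np.asarray(features, dtype=float)
    n_frames, n_dims = features.shape
    equalized = np.empty_like(features)
    for d in range(n_dims):
        ranks = np.argsort(np.argsort(features[:, d]))         # 0 .. n_frames-1
        cdf = (ranks + 0.5) / n_frames                          # empirical CDF in (0, 1)
        equalized[:, d] = norm.ppf(cdf)                         # reference quantiles
    return equalized

# Toy usage: 500 frames of 13-dimensional "cepstral" features with an offset
# and scaling mismatch in each dimension (synthetic data).
rng = np.random.default_rng(3)
feats = rng.normal(loc=2.0, scale=3.0, size=(500, 13))
eq = histogram_equalize(feats)
print(eq.mean(axis=0)[:3], eq.std(axis=0)[:3])   # roughly zero mean, unit variance
```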
18.
Yeh-Huann Goh Yann-Ling Goh Yoon-Ket Lee Ying-Hao Ko 《International Journal of Speech Technology》2017,20(3):455-463
A speech signal processing system using a multi-parameter-model bidirectional Kalman filter is proposed in this paper. A conventional unidirectional Kalman filter estimates the current speech state by processing a time-varying autoregressive model of the speech signal from past time states only. A bidirectional Kalman filter utilizes both past and future measurements to estimate the current state of a speech signal, minimizing the mean squared error through efficient recursive means. The matrices involved in the difference equations and the measurement equations of the bidirectional Kalman filter algorithm are kept constant throughout the process. With a multi-parameter model, the proposed bidirectional Kalman filter relates more measurements from future and past time states to the current time state. The proposed multi-parameter bidirectional Kalman filter has been implemented in a speech recognition system and its performance compared with other conventional speech processing algorithms. Compared to the single-parameter-model bidirectional Kalman filter, the multi-parameter bidirectional Kalman filter improves the accuracy of the state prediction, reduces the speech information lost during filtering, and achieves a better word error rate in high-SNR regions (clean, 20, 15, and 10 dB).
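As a stand-in illustration of how past and future measurements both contribute to the current state estimate, the sketch below runs a scalar forward Kalman filter followed by a backward (Rauch-Tung-Striebel) pass. The scalar local-level model and all parameter values are assumptions; this is not the paper's multi-parameter bidirectional Kalman filter.

```python
import numpy as np

def rts_smooth(obs, a=1.0, q=0.05, r=1.0):
    """Scalar forward Kalman filter followed by a backward (RTS) pass, so the
    smoothed estimate at time t uses measurements both before and after t."""
    n = len(obs)
    xf, pf = np.zeros(n), np.zeros(n)        # filtered mean / variance
    xp, pp = np.zeros(n), np.zeros(n)        # predicted mean / variance
    x, p = 0.0, 1.0
    for t in range(n):
        x, p = a * x, a * a * p + q          # predict from the past
        xp[t], pp[t] = x, p
        k = p / (p + r)                      # Kalman gain
        x, p = x + k * (obs[t] - x), (1 - k) * p
        xf[t], pf[t] = x, p
    xs = xf.copy()
    for t in range(n - 2, -1, -1):           # backward pass pulls in future data
        g = pf[t] * a / pp[t + 1]
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
    return xs

# Toy usage: a slowly varying signal observed in heavy noise.
rng = np.random.default_rng(4)
truth = np.cumsum(rng.normal(scale=0.1, size=200))
noisy = truth + rng.normal(scale=1.0, size=200)
smoothed = rts_smooth(noisy)
print("noisy MSE:", np.mean((noisy - truth) ** 2),
      "smoothed MSE:", np.mean((smoothed - truth) ** 2))
```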
19.
Leena Mary K. K. Anish Babu Aju Joseph 《International Journal of Speech Technology》2012,15(3):407-417
This paper describes work aimed at understanding the art of mimicking by professional mimicry artists while imitating the speech characteristics of known persons, and it also explores the possibility of detecting a given speech sample as genuine or impostor. This includes a systematic approach to collecting three categories of speech data, namely the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of the chosen celebrities, in order to analyze the variations in prosodic features. A method is described for the automatic extraction of relevant prosodic features in order to model speaker characteristics. Speech is automatically segmented into intonation phrases using speech/nonspeech classification, and further segmentation is done using valleys in the energy contour. Intonation, duration, and energy features are extracted for each of these segments, and the intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration, and change in energy. These prosodic features, extracted from the original speech of the celebrities and the mimicry artists, are used for creating speaker models. A Support Vector Machine (SVM) is used for creating the speaker models, and detection of a given speech sample as genuine or impostor is attempted using a speaker verification framework of SVM models.
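The Legendre-polynomial approximation of the intonation curve mentioned above can be sketched with NumPy's Legendre utilities; the polynomial order and the toy pitch contour below are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_intonation_coeffs(f0_segment, order=4):
    """Approximate a pitch contour segment with a low-order Legendre
    polynomial; the coefficients serve as a compact intonation feature."""
    f0_segment = np.asarray(f0_segment, dtype=float)
    t = np.linspace(-1.0, 1.0, len(f0_segment))          # Legendre domain
    coeffs = legendre.legfit(t, f0_segment, deg=order)
    approx = legendre.legval(t, coeffs)
    return coeffs, approx

# Toy usage: a rise-fall pitch contour over one intonation phrase (synthetic).
f0 = np.array([180, 190, 205, 220, 226, 224, 210, 195, 185, 178], dtype=float)
coeffs, approx = legendre_intonation_coeffs(f0, order=3)
print("coefficients:", np.round(coeffs, 2))
print("rms error:", np.sqrt(np.mean((f0 - approx) ** 2)))
```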
20.
Hidden Markov Models are used in an experiment to investigate how state occupancy corresponds to prosodic parameters and spectral balance. In order to define separate sub-classes in the data using a maximum likelihood approach, modelling was performed with a single model in which individual states correspond to different categories, without assuming the structure of the data, rather than manually segmenting the data and modelling each predefined category separately. The results indicate a significant content of segmental information in the prosodic parameters, but the results based on the time alignment of the model states with the feature vectors are in a form that is not directly usable in a recognition environment. The classification of the various phonetic categories is particularly consistent for vowels and nasals and is generally better for voiced than for unvoiced speech. The classification is also robust to the influence of segmental effects on the data, with consistent alignments with segments regardless of the type of neighbouring phonemes.
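A minimal sketch in the spirit of this experiment, fitting a single Gaussian HMM to feature vectors without predefined categories and then reading off the state alignment, is given below. It assumes the hmmlearn package; the feature dimensions and the synthetic data are illustrative only.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic stand-in for prosodic + spectral-balance feature vectors
# (e.g. F0, energy, spectral tilt per 10 ms frame); values are made up.
rng = np.random.default_rng(5)
frames = np.vstack([
    rng.normal([120, 0.8, -6.0], 0.5, size=(300, 3)),   # a "voiced-like" region
    rng.normal([  0, 0.1,  2.0], 0.5, size=(200, 3)),   # an "unvoiced-like" region
])

# One model, several states, no predefined categories: the states are free to
# specialize on whatever sub-classes maximize the likelihood of the data.
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50, random_state=0)
model.fit(frames)

# Time alignment of states with the feature vectors (Viterbi decoding).
states = model.predict(frames)
print("state occupancy counts:", np.bincount(states, minlength=4))
```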