20 similar documents retrieved (search time: 15 ms)
1.
2.
K. Sreenivasa Rao Shashidhar G. Koolagudi Ramu Reddy Vempada 《International Journal of Speech Technology》2013,16(2):143-160
In this paper, global and local prosodic features extracted at the sentence, word, and syllable levels are proposed for speech emotion (affect) recognition. Duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours. Local prosodic features represent the temporal dynamics of the prosody. Global and local prosodic features are analyzed separately and in combination at different levels for the recognition of emotions. Words and syllables at different positions (initial, middle, and final) are also explored separately to analyze their contribution to emotion recognition. All studies are carried out on the simulated Telugu emotion speech corpus (IITKGP-SESC), and the results are compared with those on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that the recognition performance of local prosodic features is better than that of global prosodic features. Words in the final position of a sentence and syllables in the final position of a word exhibit more emotion-discriminative information than words and syllables in other positions.
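As an illustration of the global statistics mentioned in the abstract, the sketch below computes the mean, minimum, maximum, standard deviation, and slope of a prosodic contour with NumPy. It is a minimal example under assumed conventions, not the authors' implementation; the function name and the toy pitch values are placeholders.

```python
import numpy as np

def global_prosodic_stats(contour):
    """Gross statistics of a prosodic contour (e.g. a pitch track in Hz):
    mean, minimum, maximum, standard deviation, and least-squares slope."""
    contour = np.asarray(contour, dtype=float)
    t = np.arange(len(contour))
    slope = np.polyfit(t, contour, deg=1)[0]   # slope of a first-order fit
    return {
        "mean": contour.mean(),
        "min": contour.min(),
        "max": contour.max(),
        "std": contour.std(),
        "slope": slope,
    }

# Toy example: a falling pitch contour over one word (values are made up).
f0 = [210, 205, 198, 190, 181, 174, 170]
print(global_prosodic_stats(f0))
```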
3.
Gajic B. Paliwal K.K. 《IEEE transactions on audio, speech, and language processing》2006,14(2):600-608
We investigate how dominant-frequency information can be used in speech feature extraction to increase the robustness of automatic speech recognition against additive background noise. First, we review several previously proposed auditory-based feature extraction methods and argue that the use of dominant-frequency information might be one of the major reasons for their improved noise robustness. Furthermore, we propose a new feature extraction method, which combines subband power information with dominant subband frequency information in a simple and computationally efficient way. The proposed features are shown to be considerably more robust against additive background noise than standard mel-frequency cepstrum coefficients on two different recognition tasks. The performance improvement increased as we moved from a small-vocabulary isolated-word task to a medium-vocabulary continuous-speech task, where the proposed features also outperformed a computationally expensive auditory-based method. The greatest improvement was obtained for noise types characterized by a relatively flat spectral density.
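A rough sketch of the kind of feature described here, per-subband log power paired with the dominant frequency inside each subband, is given below. The uniform band edges, window choice, and function name are assumptions made for illustration and do not reproduce the authors' exact filterbank.

```python
import numpy as np

def subband_power_and_dominant_freq(frame, fs, n_bands=8):
    """For one windowed frame, return per-subband log power together with
    the frequency of the strongest FFT bin in each subband."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    edges = np.linspace(0, len(spectrum), n_bands + 1, dtype=int)
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = spectrum[lo:hi]
        log_power = np.log(band.sum() + 1e-10)
        dominant = freqs[lo + np.argmax(band)]      # dominant frequency in the band
        feats.extend([log_power, dominant])
    return np.array(feats)

# Toy usage on a synthetic 25 ms frame of a 440 Hz tone plus noise.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(len(t))
print(subband_power_and_dominant_freq(frame, fs))
```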
4.
5.
Shashidhar G. Koolagudi K. Sreenivasa Rao 《International Journal of Speech Technology》2012,15(2):265-289
In this work, source, system, and prosodic features of speech are explored for characterizing and classifying the underlying emotions. Owing to their complementary nature, different speech features contribute in different ways to expressing the emotions. Linear prediction residual samples chosen around glottal closure regions, together with glottal pulse parameters, are used to represent the excitation source information. Linear prediction cepstral coefficients extracted through simple block processing and through pitch-synchronous analysis represent the vocal tract information. Global and local prosodic features extracted from the gross statistics and the temporal dynamics of the sequences of duration, pitch, and energy values represent the prosodic information. Emotion recognition models are developed using the above features separately and in combination. The simulated Telugu emotion database (IITKGP-SESC) is used to evaluate the proposed features, and the emotion recognition results obtained on IITKGP-SESC are compared with the results on the internationally known Berlin emotion speech database (Emo-DB). Autoassociative neural networks, Gaussian mixture models, and support vector machines are used to develop the emotion recognition systems with source, system, and prosodic features, respectively. A weighted combination of evidence is used when combining the outputs of the systems developed with the different features. The results show that each of the proposed speech features contributes toward emotion recognition, and that combining the features improves recognition performance, indicating their complementary nature.
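The weighted combination of evidence mentioned above can be sketched as simple score-level fusion. The weights, per-emotion scores, and class labels below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def fuse_scores(score_sets, weights):
    """Weighted combination of per-class evidence from several subsystems.
    Each entry of score_sets is a (n_classes,) array of normalized scores."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    combined = sum(w * np.asarray(s, dtype=float) for w, s in zip(weights, score_sets))
    return combined, int(np.argmax(combined))

# Illustrative per-emotion scores (anger, happiness, neutral, sadness) from
# source, system, and prosodic subsystems; the weights are placeholders.
source  = [0.10, 0.30, 0.20, 0.40]
system  = [0.05, 0.50, 0.25, 0.20]
prosody = [0.15, 0.45, 0.20, 0.20]
scores, label = fuse_scores([source, system, prosody], weights=[0.2, 0.5, 0.3])
print(scores, "predicted class index:", label)
```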
6.
Traditional speech recognition systems are data-driven and rely on a language model to choose the optimal decoding path, which in some scenarios produces decoding results with the right pronunciation but the wrong characters. To address this problem, an end-to-end speech recognition method assisted by prosodic features is proposed, which uses the prosodic information in speech to reinforce the probability of the correct Chinese character combinations in the language model. Building on an attention-based encoder-decoder speech recognition framework, prosodic features such as articulation intervals and articulation energy are first extracted from the distribution of the attention coefficients; these prosodic features are then combined with the decoder, significantly improving recognition accuracy in cases of identical or similar pronunciations and semantic ambiguity. Experimental results show that, on speech recognition tasks at the 1,000-hour and 10,000-hour scales, the proposed method achieves relative accuracy improvements of 5.2% and 5.0%, respectively, over the end-to-end baseline, further improving the intelligibility of the recognition results.
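One plausible, purely illustrative way to derive duration- and pause-like cues from the attention coefficients of an encoder-decoder model is sketched below; it is not the paper's definition of the prosodic features, and the attention matrix is a made-up toy example.

```python
import numpy as np

def attention_prosody_cues(attn, frame_shift=0.01):
    """Given an attention matrix of shape (output_tokens, encoder_frames),
    derive rough per-token durations and inter-token gaps (in seconds)."""
    attn = np.asarray(attn, dtype=float)
    owner = attn.argmax(axis=0)                 # which token each frame attends to most
    durations = np.array([(owner == k).sum() for k in range(attn.shape[0])]) * frame_shift
    peaks = attn.argmax(axis=1)                 # frame index where each token peaks
    gaps = np.diff(peaks) * frame_shift         # spacing between successive token peaks
    return durations, gaps

# Toy attention matrix: 3 output tokens over 12 encoder frames (values made up).
attn = np.array([
    [.8, .7, .6, .1, .0, .0, .0, .0, .0, .0, .0, .0],
    [.1, .2, .3, .8, .9, .7, .2, .1, .0, .0, .0, .0],
    [.1, .1, .1, .1, .1, .3, .8, .9, 1., .9, .8, .7],
])
print(attention_prosody_cues(attn))
```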
7.
Emotion recognition in speech signals is currently a very active research topic and has attracted much attention within the engineering application area. This paper presents a new approach to robust emotion recognition in speech signals in noisy environments. Using a weighted sparse representation model based on maximum likelihood estimation, an enhanced sparse representation classifier is proposed for robust emotion recognition in noisy speech. The effectiveness and robustness of the proposed method are investigated on clean and noisy emotional speech. The proposed method is compared with six typical classifiers: a linear discriminant classifier, K-nearest neighbors, a C4.5 decision tree, radial basis function neural networks, support vector machines, and the sparse representation classifier. Experimental results on two publicly available emotional speech databases, the Berlin database and the Polish database, demonstrate the promising performance of the proposed method on the task of robust emotion recognition in noisy speech, outperforming the other methods.
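For context, a plain sparse-representation classifier can be sketched as below: the test vector is coded over a dictionary of training samples and classified by per-class reconstruction residual. Orthogonal matching pursuit is used here only as a stand-in solver; the paper's weighted, maximum-likelihood-based formulation is not reproduced, and all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(train_X, train_y, test_x, n_nonzero=10):
    """Sparse-representation classification: code the test vector over the
    training dictionary, then pick the class whose atoms reconstruct it best."""
    D = train_X.T                                   # dictionary: columns are training samples
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
    omp.fit(D, test_x)
    coef = omp.coef_
    residuals = {}
    for c in np.unique(train_y):
        mask = (train_y == c)
        recon = D[:, mask] @ coef[mask]             # reconstruction from class-c atoms only
        residuals[c] = np.linalg.norm(test_x - recon)
    return min(residuals, key=residuals.get), residuals

# Toy data: 40 training vectors of dimension 20 in two classes.
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 1, (20, 20)), rng.normal(3, 1, (20, 20))])
train_y = np.array([0] * 20 + [1] * 20)
test_x = rng.normal(3, 1, 20)                       # should be assigned to class 1
print(src_predict(train_X, train_y, test_x))
```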
8.
9.
This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target signal source in order to recognize spoken commands uttered in reverberant and noisy environments and acquired by a microphone array. The architecture of the BSE system is based on multiple stages: (a) TDOA estimation, (b) mixing-system identification for the target source, (c) on-line semi-blind source separation, and (d) source extraction. All stages are effectively combined, allowing estimation of the target signal with limited distortion. While a generalization of the BSE framework is described, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e., binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE, and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients. Moreover, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and improve the overall performance. A detailed comparison between different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system gives a significant reduction of the error rate.
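Stage (a), TDOA estimation, is commonly implemented with GCC-PHAT; a minimal sketch is given below. It is a generic implementation, not the system described in the paper, and the toy delay is synthetic.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Estimate the time difference of arrival between two channels with
    GCC-PHAT: whiten the cross-spectrum, then pick the correlation peak."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy check: channel 2 is channel 1 delayed by 5 samples (~0.31 ms at 16 kHz).
fs = 16000
rng = np.random.default_rng(1)
x1 = rng.normal(size=2048)
x2 = np.concatenate((np.zeros(5), x1[:-5]))
print(gcc_phat_tdoa(x1, x2, fs))   # expected to be close to -5/fs
```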
11.
In this paper, we propose a method for robust detection of vowel onset points (VOPs) from noisy speech. The proposed VOP detection method exploits the spectral energy at the formant frequencies of the speech segments present in the glottal closure region. In this work, formants are extracted using the group delay function, and glottal closure instants are extracted using a zero-frequency-filter-based method. The performance of the proposed VOP detection method is compared with an existing method, which combines evidence from the excitation source, spectral peak energy, and the modulation spectrum. Speech data from the TIMIT database and noise samples from the NOISEX database are used to analyze the performance of the VOP detection methods. A significant improvement in VOP detection performance is observed for the proposed method compared to the existing method.
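A rough sketch of zero-frequency filtering for locating glottal closure instants, one ingredient of the method described above, is shown below. The trend-removal window, the per-stage mean subtraction, and the toy excitation signal are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np
from scipy.signal import lfilter

def zff_gci(speech, fs, mean_window=0.01):
    """Rough glottal closure instant (GCI) detection with zero-frequency
    filtering: integrate at 0 Hz, remove the local mean, and take the
    positive-going zero crossings of the resulting filtered signal."""
    x = np.diff(speech, prepend=speech[0])          # remove any DC offset
    win = int(mean_window * fs) | 1                 # odd-length trend-removal window
    y = x
    for _ in range(2):                              # two zero-frequency resonator stages
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)     # double pole at z = 1 (0 Hz)
        trend = np.convolve(y, np.ones(win) / win, mode="same")
        y = y - trend                               # subtract local mean to remove the trend
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]  # positive zero crossings
    return gci / fs

# Toy usage: a 100 Hz impulse train smoothed into a crude "voiced" signal.
fs = 8000
excitation = np.zeros(fs // 2)
excitation[::fs // 100] = 1.0
speech = lfilter([1.0], [1.0, -0.9], excitation)    # very rough vocal-tract coloring
print(zff_gci(speech, fs)[:5])
```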
12.
Noise estimation and detection algorithms must adapt to a changing environment quickly, so they use a least mean square (LMS) filter. However, there is a downside: the performance of an LMS filter is low, and it consequently lowers speech recognition rates. To overcome this weakness, we propose a method for establishing a robust speech recognition clustering model for noisy environments. Since the proposed method cancels noise with an average estimator least mean square (AELMS) filter in a noisy environment, a robust speech recognition clustering model can be established. With the AELMS filter, which can preserve the source features of speech and reduce the degradation of speech information, the noise in a contaminated speech signal is canceled, and a Gaussian state model is clustered to make recognition more robust to noise. Recognition performance was evaluated by composing a Gaussian clustering model, i.e., a robust speech recognition clustering model, in a noisy environment. The study shows that the signal-to-noise ratio of speech, improved by canceling the continuously changing environmental noise, was enhanced by 2.8 dB on average, and the recognition rate improved by 4.1%.
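For reference, a plain LMS adaptive noise canceller (not the AELMS variant proposed in the paper) can be sketched as follows; the filter order, step size, and toy signals are illustrative assumptions.

```python
import numpy as np

def lms_noise_cancel(primary, reference, order=16, mu=0.01):
    """Plain LMS adaptive noise cancellation: adapt an FIR filter so that the
    filtered reference noise matches the noise in the primary input, then
    subtract it.  Returns the cleaned (error) signal."""
    w = np.zeros(order)
    out = np.zeros(len(primary))
    for n in range(order, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]    # current and past reference samples
        y = w @ x                                   # filtered noise estimate
        e = primary[n] - y                          # error = speech estimate
        w = w + mu * e * x                          # LMS weight update
        out[n] = e
    return out

# Toy usage: clean "speech" (a tone) corrupted by causally filtered white noise,
# with the raw noise available as the reference channel.
rng = np.random.default_rng(2)
n = 4000
speech = np.sin(2 * np.pi * 5 * np.arange(n) / 200)
noise = rng.normal(size=n)
corrupting = np.convolve(noise, [0.6, 0.3, 0.1])[:n]
cleaned = lms_noise_cancel(speech + corrupting, noise)
print("noisy power:", np.mean(corrupting ** 2),
      "residual power:", np.mean((cleaned - speech)[200:] ** 2))
```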
13.
A text-to-speech (TTS) system, also known as a speech synthesizer, has become one of the important technologies of recent years due to its expanding field of applications. Several works on speech synthesis have been carried out for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. The area of Arabic speech synthesis has not seen sufficient progress; it is still at an early stage, with low speech quality. In fact, speech synthesis systems face several problems (e.g., speech quality, articulatory effects, etc.). Different methods have been proposed to solve these issues, such as the use of large and varied unit sizes. This method is mainly implemented with the concatenative approach to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on a statistical parametric approach and non-uniform-unit speech synthesis. Our system includes a diacritization engine. Modern Arabic text is written without the vowels, also called diacritic marks; unfortunately, these marks are very important for defining the correct pronunciation of the text, which explains the incorporation of the diacritization engine into our system. In this work, we propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system allows the generation of fully diacritized text with high precision and that our synthesis system produces high-quality speech.
14.
Chiu-Chuan Tu, Chia-Feng Juang 《Expert systems with applications》2012,39(3):2479-2488
This paper proposes a new method to detect the boundaries of speech in noisy environments. The detection method uses Haar wavelet energy and entropy (HWEE) as detection features. The Haar wavelet energy (HWE) is derived from the robust band that shows the most significant difference between speech and nonspeech segments at different noise levels. Similarly, the wavelet energy entropy (WEE) is computed by selecting the two wavelet energy bands whose entropy shows the most significant speech/nonspeech difference. The HWEE features are fed as inputs to a recurrent self-evolving interval type-2 fuzzy neural network (RSEIT2FNN) for classification. The RSEIT2FNN is used because it employs type-2 fuzzy sets, which are more robust to noise than type-1 fuzzy sets. The recurrent structure in the RSEIT2FNN helps to remember the context information of a test frame. The RSEIT2FNN outputs are compared with a threshold to determine whether each frame belongs to a speech or nonspeech period. The HWEE-based RSEIT2FNN detection was applied to speech detection in different noisy environments with different noise levels. Comparisons with other detection methods verified the advantage of the proposed use of HWEE.
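A minimal sketch of Haar wavelet band energies and the entropy of their normalized distribution is given below; it implements the Haar decomposition by averaging and differencing, and it does not reproduce the paper's band selection or the RSEIT2FNN classifier. The frame below is synthetic.

```python
import numpy as np

def haar_band_energies(frame, levels=4):
    """Haar wavelet decomposition by averaging/differencing; returns the
    energy of each detail band plus the final approximation band."""
    a = np.asarray(frame, dtype=float)
    if len(a) % (2 ** levels):
        a = a[: len(a) - len(a) % (2 ** levels)]    # trim so halving is exact
    energies = []
    for _ in range(levels):
        approx = (a[0::2] + a[1::2]) / np.sqrt(2)
        detail = (a[0::2] - a[1::2]) / np.sqrt(2)
        energies.append(np.sum(detail ** 2))
        a = approx
    energies.append(np.sum(a ** 2))
    return np.array(energies)

def wavelet_energy_entropy(energies):
    """Shannon entropy of the normalized wavelet band energies."""
    p = energies / (energies.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

# Toy usage on a 32 ms frame of a noisy 800 Hz tone at 8 kHz (synthetic data).
fs = 8000
t = np.arange(256) / fs
frame = np.sin(2 * np.pi * 800 * t) + 0.05 * np.random.randn(len(t))
e = haar_band_energies(frame)
print(e, wavelet_energy_entropy(e))
```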
15.
16.
Mustafa K. Bruce I.C. 《IEEE transactions on audio, speech, and language processing》2006,14(2):435-444
Several algorithms have been developed for tracking formant frequency trajectories of speech signals; however, most of these algorithms are either not robust in real-life noise environments or not suitable for real-time implementation. The algorithm presented in this paper obtains formant frequency estimates from voiced segments of continuous speech by using a time-varying adaptive filterbank to track individual formant frequencies. The formant tracker incorporates an adaptive voicing detector and a gender detector for formant extraction from continuous speech, for both male and female speakers. The algorithm has a low signal delay and provides smooth and accurate estimates for the first four formant frequencies at moderate and high signal-to-noise ratios. Thorough testing has shown that the algorithm is robust over a wide range of signal-to-noise ratios for various types of background noise.
17.
Histogram equalization (HEQ) is one of the most efficient and effective techniques for reducing the mismatch between training and test acoustic conditions. However, most current HEQ methods are performed merely in a dimension-wise manner and do not allow for the contextual relationships between consecutive speech frames. In this paper, we present several novel HEQ approaches that exploit spatial-temporal feature distribution characteristics for speech feature normalization. The automatic speech recognition (ASR) experiments were carried out on the Aurora-2 standard noise-robust ASR task. The performance of the presented approaches was thoroughly tested and verified through comparisons with other popular HEQ methods. The experimental results show that, for clean-condition training, our approaches yield a significant word error rate reduction over the baseline system and give competitive performance relative to the other HEQ methods compared in this paper.
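A minimal dimension-wise HEQ sketch, mapping each feature dimension's empirical CDF onto a standard normal reference by quantile mapping, is shown below; it illustrates only the baseline idea, not the spatial-temporal extensions proposed in the paper, and the feature matrix is synthetic.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(features):
    """Dimension-wise histogram equalization: map each dimension's empirical
    CDF onto a standard normal reference distribution (quantile mapping)."""
    features = np.asarray(features, dtype=float)
    n_frames, n_dims = features.shape
    equalized = np.empty_like(features)
    for d in range(n_dims):
        ranks = np.argsort(np.argsort(features[:, d]))         # 0 .. n_frames-1
        cdf = (ranks + 0.5) / n_frames                          # empirical CDF in (0, 1)
        equalized[:, d] = norm.ppf(cdf)                         # reference quantiles
    return equalized

# Toy usage: 500 frames of 13-dimensional "cepstral" features with an offset
# and scaling mismatch in each dimension (synthetic data).
rng = np.random.default_rng(3)
feats = rng.normal(loc=2.0, scale=3.0, size=(500, 13))
eq = histogram_equalize(feats)
print(eq.mean(axis=0)[:3], eq.std(axis=0)[:3])   # roughly zero mean, unit variance
```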
18.
Yeh-Huann Goh Yann-Ling Goh Yoon-Ket Lee Ying-Hao Ko 《International Journal of Speech Technology》2017,20(3):455-463
A speech signal processing system using a multi-parameter-model bidirectional Kalman filter is proposed in this paper. A conventional unidirectional Kalman filter estimates the current speech state by processing a time-varying autoregressive model of the speech signal from past time states only. A bidirectional Kalman filter utilizes both past and future measurements to estimate the current state of a speech signal, minimizing the mean squared error through efficient recursive means. The matrices involved in the difference equations and the measurement equations of the bidirectional Kalman filter algorithm are kept constant throughout the process. With a multi-parameter model, the proposed bidirectional Kalman filter relates more measurements from future and past time states to the current time state. The proposed multi-parameter bidirectional Kalman filter has been implemented in a speech recognition system and its performance compared with other conventional speech processing algorithms. Compared to the single-parameter-model bidirectional Kalman filter, the multi-parameter bidirectional Kalman filter improves the accuracy of the state prediction, reduces the speech information lost during filtering, and achieves a better word error rate in high-SNR regions (clean, 20, 15, and 10 dB).
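As a stand-in illustration of how past and future measurements both contribute to the current state estimate, the sketch below runs a scalar forward Kalman filter followed by a backward (Rauch-Tung-Striebel) pass. The scalar local-level model and all parameter values are assumptions; this is not the paper's multi-parameter bidirectional Kalman filter.

```python
import numpy as np

def rts_smooth(obs, a=1.0, q=0.05, r=1.0):
    """Scalar forward Kalman filter followed by a backward (RTS) pass, so the
    smoothed estimate at time t uses measurements both before and after t."""
    n = len(obs)
    xf, pf = np.zeros(n), np.zeros(n)        # filtered mean / variance
    xp, pp = np.zeros(n), np.zeros(n)        # predicted mean / variance
    x, p = 0.0, 1.0
    for t in range(n):
        x, p = a * x, a * a * p + q          # predict from the past
        xp[t], pp[t] = x, p
        k = p / (p + r)                      # Kalman gain
        x, p = x + k * (obs[t] - x), (1 - k) * p
        xf[t], pf[t] = x, p
    xs = xf.copy()
    for t in range(n - 2, -1, -1):           # backward pass pulls in future data
        g = pf[t] * a / pp[t + 1]
        xs[t] = xf[t] + g * (xs[t + 1] - xp[t + 1])
    return xs

# Toy usage: a slowly varying signal observed in heavy noise.
rng = np.random.default_rng(4)
truth = np.cumsum(rng.normal(scale=0.1, size=200))
noisy = truth + rng.normal(scale=1.0, size=200)
smoothed = rts_smooth(noisy)
print("noisy MSE:", np.mean((noisy - truth) ** 2),
      "smoothed MSE:", np.mean((smoothed - truth) ** 2))
```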
19.
Leena Mary K. K. Anish Babu Aju Joseph 《International Journal of Speech Technology》2012,15(3):407-417
This paper describes work aimed at understanding the art of mimicking by professional mimicry artists while imitating the speech characteristics of known persons, and it also explores the possibility of detecting a given speech sample as genuine or impostor. This includes a systematic approach to collecting three categories of speech data, namely the original speech of the mimicry artists, their speech while mimicking chosen celebrities, and the original speech of the chosen celebrities, in order to analyze the variations in prosodic features. A method is described for the automatic extraction of relevant prosodic features in order to model speaker characteristics. Speech is automatically segmented into intonation phrases using speech/nonspeech classification, and further segmentation is done using valleys in the energy contour. Intonation, duration, and energy features are extracted for each of these segments, and the intonation curve is approximated using Legendre polynomials. Other useful prosodic features include average jitter, average shimmer, total duration, voiced duration, and change in energy. These prosodic features, extracted from the original speech of the celebrities and the mimicry artists, are used for creating speaker models. A Support Vector Machine (SVM) is used for creating the speaker models, and detection of a given speech sample as genuine or impostor is attempted using a speaker verification framework of SVM models.
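The Legendre-polynomial approximation of the intonation curve mentioned above can be sketched with NumPy's Legendre utilities; the polynomial order and the toy pitch contour below are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_intonation_coeffs(f0_segment, order=4):
    """Approximate a pitch contour segment with a low-order Legendre
    polynomial; the coefficients serve as a compact intonation feature."""
    f0_segment = np.asarray(f0_segment, dtype=float)
    t = np.linspace(-1.0, 1.0, len(f0_segment))          # Legendre domain
    coeffs = legendre.legfit(t, f0_segment, deg=order)
    approx = legendre.legval(t, coeffs)
    return coeffs, approx

# Toy usage: a rise-fall pitch contour over one intonation phrase (synthetic).
f0 = np.array([180, 190, 205, 220, 226, 224, 210, 195, 185, 178], dtype=float)
coeffs, approx = legendre_intonation_coeffs(f0, order=3)
print("coefficients:", np.round(coeffs, 2))
print("rms error:", np.sqrt(np.mean((f0 - approx) ** 2)))
```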
20.
Hidden Markov Models are used in an experiment to investigate how state occupancy corresponds to prosodic parameters and spectral balance. In order to define separate sub-classes in the data using a maximum likelihood approach, modelling was performed with a single model in which individual states correspond to different categories, without assuming the structure of the data, rather than manually segmenting the data and modelling each predefined category separately. The results indicate a significant content of segmental information in the prosodic parameters, but the results based on the time alignment of the model states with the feature vectors are in a form that is not directly usable in a recognition environment. The classification of the various phonetic categories is particularly consistent for vowels and nasals and is generally better for voiced than for unvoiced speech. The classification is also robust to the influence of segmental effects on the data, with consistent alignments with segments regardless of the type of neighbouring phonemes.
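A minimal sketch in the spirit of this experiment, fitting a single Gaussian HMM to feature vectors without predefined categories and then reading off the state alignment, is given below. It assumes the hmmlearn package; the feature dimensions and the synthetic data are illustrative only.

```python
import numpy as np
from hmmlearn import hmm

# Synthetic stand-in for prosodic + spectral-balance feature vectors
# (e.g. F0, energy, spectral tilt per 10 ms frame); values are made up.
rng = np.random.default_rng(5)
frames = np.vstack([
    rng.normal([120, 0.8, -6.0], 0.5, size=(300, 3)),   # a "voiced-like" region
    rng.normal([  0, 0.1,  2.0], 0.5, size=(200, 3)),   # an "unvoiced-like" region
])

# One model, several states, no predefined categories: the states are free to
# specialize on whatever sub-classes maximize the likelihood of the data.
model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50, random_state=0)
model.fit(frames)

# Time alignment of states with the feature vectors (Viterbi decoding).
states = model.predict(frames)
print("state occupancy counts:", np.bincount(states, minlength=4))
```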