Similar Literature
20 similar documents found.
1.
Channel distortion is one of the major factors that degrade the performance of automatic speech recognition (ASR) systems. Current compensation methods are generally based on the assumption that the channel distortion is a constant or slowly varying bias, within an utterance or globally. However, this assumption does not hold in more complex circumstances, where the speech being recognized comes from many different unknown channels and has parts of its spectrum completely removed (e.g. band-limited speech). On the one hand, different channels may cause different distortions; on the other, the distortion caused by a given channel varies over the speech frames when parts of the speech spectrum are removed completely. As a result, the performance of current methods is limited in complex environments. To solve this problem, we propose a unified framework in which the channel distortion is first divided into two subproblems, namely spectrum missing and magnitude changing. The two types of distortion are then compensated with different techniques in two steps. In the first step, the speech bandwidth is detected for each utterance and the acoustic models are synthesized from clean models to compensate for the missing spectrum. In the second step, the constant term of the distortion is estimated via the expectation-maximization (EM) algorithm and subtracted from the means of the synthesized models to further compensate for the magnitude change. Several databases are chosen to evaluate the proposed framework. The speech in these databases is recorded over different channels, including various microphones and band-limited channels. Moreover, to simulate more types of spectrum missing, various low-pass and band-pass filters are used to process the speech from the chosen databases. Although these databases and their filtered versions make the channel conditions more challenging for recognition, experimental results show that the proposed framework can substantially improve the performance of ASR systems in complex channel environments.
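A minimal sketch of the second step, assuming clean cepstral features are modeled by a diagonal-covariance GMM: the constant channel term b is estimated by EM, alternating mixture posteriors with a precision-weighted average of the residuals, and is then subtracted from the model means. All function and variable names here are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_channel_bias(Y, weights, means, variances, n_iter=10):
    """EM estimate of a constant cepstral bias b, assuming y_t = x_t + b with
    clean speech x modeled by a diagonal-covariance GMM (illustrative sketch)."""
    b = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        # E-step: mixture posteriors for the bias-compensated frames Y - b
        log_p = np.stack([np.log(w) + multivariate_normal.logpdf(Y - b, m, np.diag(v))
                          for w, m, v in zip(weights, means, variances)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        post = np.exp(log_p)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: precision-weighted average residual maximizes the EM auxiliary
        num = np.zeros_like(b)
        den = np.zeros_like(b)
        for k in range(len(weights)):
            prec = 1.0 / variances[k]
            num += prec * (post[:, k:k + 1] * (Y - means[k])).sum(axis=0)
            den += prec * post[:, k].sum()
        b = num / den
    return b  # subtract from the synthesized model means (or from the features)
```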

2.
A Robust Speech Recognition Method Based on Noise Cancellation and Cepstral Mean Subtraction
A noise-robust speech recognition method based on speech enhancement is proposed. In the preprocessing stage of recognition, noise cancellation is applied to suppress noise and raise the signal-to-noise ratio. Mel-frequency cepstral features are then extracted from the enhanced speech, and cepstral mean subtraction is applied in the cepstral domain to compensate for the distortion components and residual noise left by enhancement. Experimental results show that under low-SNR conditions (-12 to 0 dB) the method achieves good recognition rates on digit recognition, clearly outperforming a baseline Mel-frequency cepstral recognizer, conventional spectral subtraction, and noise-cancellation enhancement alone.
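The cepstral-domain compensation step is standard cepstral mean subtraction; a minimal sketch (names are illustrative):

```python
import numpy as np

def cepstral_mean_subtraction(mfcc):
    """Remove the per-utterance mean of each cepstral coefficient; a constant
    convolutional channel (and residual enhancement bias) appears as an
    additive offset in the cepstral domain, so subtracting the mean cancels it."""
    return mfcc - mfcc.mean(axis=0, keepdims=True)  # mfcc: (frames, coeffs)
```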

3.
A Survey of Noise-Robust Speech Recognition Research
Addressing speech recognition in noisy environments, this survey discusses existing noise-robust speech recognition techniques and outlines the main research problems. Following the structure of a speech recognition system, it classifies the techniques into signal-space, feature-space, and model-space methods, and analyzes the characteristics, implementation, and application of each class. Directions for further research are discussed at the end.

4.
This paper investigates the potential of exploiting the redundancy implicit in multiple-resolution analysis for automatic speech recognition systems. The analysis is performed by a binary tree of elements, each of which consists of a half-band filter followed by a downsampler that discards odd samples. Filter design and feature computation from the samples are discussed, and recognition performance with different choices is presented. A paradigm consisting of redundant feature extraction, followed by feature normalization, followed by dimensionality reduction is proposed. Feature normalization is performed by denoising algorithms; two are considered and evaluated, namely signal-to-noise-ratio-dependent spectral subtraction and soft thresholding. Dimensionality reduction is performed with principal component analysis. Experiments using telephone corpora and the Aurora3 corpus are reported. They indicate that on clean speech the proposed paradigm achieves recognition performance, measured in word error rate, marginally superior to that obtained with perceptual linear prediction coefficients. Its performance is significantly superior on noisy data when the same denoising algorithm is applied to all the analysis methods compared.
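Of the two denoising algorithms, soft thresholding has a one-line core; a hedged sketch, assuming the per-subband noise deviation sigma has already been estimated (the threshold rule is illustrative, not the paper's exact setting):

```python
import numpy as np

def soft_threshold(coeffs, sigma, k=3.0):
    """Shrink subband coefficients toward zero by a noise-dependent threshold,
    zeroing anything whose magnitude falls below it."""
    thr = k * np.asarray(sigma)  # sigma: estimated noise std per subband
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thr, 0.0)
```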

5.
Automatic speech recognition (ASR) systems suffer from variation in the acoustic quality of the input speech. Speech may be produced in noisy environments, different speakers have their own speaking styles, and variation can be observed even within the same utterance or from the same speaker in different moods. All these uncertainties and variations should be normalized to obtain a robust ASR system. In this paper, we apply and evaluate different approaches to acoustic quality normalization within an utterance for robust ASR. Several HMM (hidden Markov model)-based systems using utterance-level, word-level, and monophone-level normalization are evaluated against an HMM-SM (subspace method)-based system using monophone-level normalization. The SM can represent variation in the fine structure of sub-words as a set of eigenvectors, and so performs better at the monophone level than the HMM. Experimental results show that word accuracy is significantly improved by the HMM-SM-based system with monophone-level normalization compared to the typical HMM-based system with utterance-level normalization, in both clean and noisy conditions. The results also suggest that monophone-level normalization using the SM performs better than that using the HMM.

6.
This paper presents a combination approach to robust speech recognition using two-stage model-based feature compensation. Gaussian mixture model (GMM)-based and hidden Markov model (HMM)-based compensation approaches are combined and conducted sequentially in a multiple-decoding recognition system. The clean speech is first modeled as a GMM in the initial pass, and then as an HMM generated from the initial pass in the following passes. Environment parameter estimation for both modeling strategies is formulated under the maximum a posteriori (MAP) criterion. Experimental results show a significant improvement over the European Telecommunications Standards Institute (ETSI) advanced compensation approach, the GMM-based feature compensation approach, the HMM-based feature compensation approach, and the acoustic model compensation approach.

7.
8.
Robustness is one of the most important topics for automatic speech recognition (ASR) in practical applications. Monaural speech separation based on computational auditory scene analysis (CASA) offers a solution to this problem. In this paper, a novel system is presented to separate the monaural speech of two talkers. Gaussian mixture models (GMMs) and vector quantizers (VQs) are used to learn the grouping cues on isolated clean data for each speaker. Given an utterance, speaker identification is first performed to identify the two speakers present in the utterance, the factorial-max vector quantization model (MAXVQ) is then used to infer the mask signals, and finally the utterance of the target speaker is resynthesized in the CASA framework. Recognition results on the 2006 speech separation challenge corpus show that the proposed system can significantly improve the robustness of ASR.

9.
This paper investigates a noise-robust technique for automatic speech recognition that exploits hidden Markov modeling of stereo speech features from clean and noisy channels. The HMM trained this way, referred to as a stereo HMM, has in each state a Gaussian mixture model (GMM) with a joint distribution over both clean and noisy speech features. Given noisy speech input, the stereo HMM gives rise to a two-pass compensation and decoding process in which MMSE denoising based on N-best hypotheses is performed first, followed by decoding of the denoised speech in a reduced lattice search space. Compared to feature-space GMM-based denoising approaches, the stereo HMM is advantageous because it provides finer-grained noise compensation and uses information from the whole noisy feature sequence to predict each individual clean feature. Experiments on large-vocabulary spontaneous speech from speech-to-speech translation applications show that the proposed technique outperforms its feature-space counterpart in noisy conditions while still maintaining decent performance in clean conditions.
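Under each state component the joint clean-noisy distribution is Gaussian, so the per-component clean estimate is the standard conditional mean of a joint Gaussian, combined with component posteriors; a sketch with illustrative names:

```python
import numpy as np

def mmse_clean_estimate(y, comps, post):
    """MMSE clean-feature estimate E[x | y] under a per-component joint
    Gaussian; comps[k] = (mu_x, mu_y, Sigma_xy, Sigma_yy), post[k] = p(k | y)."""
    x_hat = np.zeros_like(y)
    for k, (mu_x, mu_y, S_xy, S_yy) in enumerate(comps):
        # conditional mean of x given y for a jointly Gaussian (x, y)
        x_hat += post[k] * (mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y))
    return x_hat
```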

10.
This paper proposes a new discrete speech recognition method that investigates the capability of graphical models based on tree distributions, which are widely used in many optimization areas. A novel spanning tree structure that exploits the temporal nature of the speech signal is proposed. The proposed tree structure significantly reduces complexity in that it captures a few essential relationships rather than all possible tree structures. The application of this model is illustrated on several isolated-word databases. Experimentally, the proposed approaches yield error rates reduced by 2.54%–12% compared to the conventional discrete hidden Markov model (DHMM) and improve recognition speed at least 3-fold. In addition, an impressive gain in learning time is observed. The overall recognition accuracy was 93.09%–95.34%, confirming the effectiveness of the proposed methods.

11.
Linear discriminant analysis (LDA) has long been used to derive data-driven temporal filters that improve the robustness of speech features used in speech recognition. In this paper, we propose the use of two new optimization criteria, principal component analysis (PCA) and the minimum classification error (MCE), for constructing the temporal filters. A detailed comparative performance analysis of the features obtained with the three criteria, LDA, PCA, and MCE, is presented for various types of noise and a wide range of SNR values. The new criteria lead to performance superior to that of the original MFCC features, just as LDA-derived filters do, and the newly proposed MCE-derived filters often outperform the LDA-derived filters. Further performance improvements are achievable if any of these LDA/PCA/MCE-derived filters are integrated with the conventional approach of cepstral mean and variance normalization (CMVN). The improvements obtained in recognition experiments are further supported by analyses conducted using two different distance measures.
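The PCA criterion selects the temporal filter that maximizes the variance of its output over length-L windows of each cepstral trajectory; a minimal sketch (the window length and all names are assumptions):

```python
import numpy as np

def pca_temporal_filter(traj, L=11):
    """Derive a temporal filter as the leading principal component of
    length-L windows of one cepstral-coefficient trajectory, then apply it."""
    windows = np.lib.stride_tricks.sliding_window_view(traj, L)
    centered = windows - windows.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    w = Vt[0]  # unit-norm filter with maximum output variance
    return np.convolve(traj, w[::-1], mode="same")  # correlate trajectory with w
```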

12.
刘金刚, 周翊, 马永保, 刘宏清. 《计算机应用》 (Journal of Computer Applications), 2016, 36(12): 3369-3373
To address the poor robustness of speech recognition systems in noisy environments, a switching speech power spectrum estimation algorithm is proposed. Assuming that the speech amplitude spectrum follows a Chi distribution, an improved minimum mean square error (MMSE) speech power spectrum estimator is derived and combined with the speech presence probability (SPP) to obtain an improved SPP-based MMSE estimator. This improved MMSE estimator is then combined with the conventional Wiener filter: when noise interference is strong, the improved MMSE estimator is used to estimate the clean-speech power spectrum; when noise interference is weak, the conventional Wiener filter is used instead to reduce computation. The result is a switching power spectrum estimation algorithm for the recognition system. Experimental results show that, compared with the conventional MMSE estimator under a Rayleigh distribution, the proposed algorithm improves the recognition rate by about 8 percentage points on average across various noise conditions, removing noise interference and improving recognizer robustness while reducing the power consumption of the speech recognition system.
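A sketch of the switching logic per time-frequency bin, using the classical Ephraim-Malah MMSE short-time spectral amplitude gain as a stand-in for the paper's Chi-prior estimator (the switching threshold and all names are assumptions):

```python
import numpy as np
from scipy.special import i0e, i1e

def switching_gain(xi, gamma, switch_db=10.0):
    """Per-bin gain: a cheap Wiener filter when noise is light, an MMSE
    amplitude estimator when noise is heavy. xi: a priori SNR; gamma: a
    posteriori SNR (both linear)."""
    wiener = xi / (1.0 + xi)
    v = np.maximum(wiener * gamma, 1e-8)
    # i0e/i1e are exponentially scaled Bessel functions, exp(-x) * I(x),
    # so exp(-v/2) * I0(v/2) == i0e(v/2), keeping the gain numerically stable
    mmse = (np.sqrt(np.pi * v) / (2.0 * gamma)) * ((1.0 + v) * i0e(v / 2.0)
                                                   + v * i1e(v / 2.0))
    heavy_noise = 10.0 * np.log10(np.maximum(xi, 1e-12)) < switch_db
    return np.where(heavy_noise, mmse, wiener)
```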

13.
A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, while human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time–frequency (T–F) mask which retains the mixture in a local T–F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T–F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that our system yields substantial improvement over the baseline performance.
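The ideal binary mask itself is a one-line comparison per T–F unit; a sketch with a local criterion of 0 dB, i.e. keep the unit iff the target is stronger than the interference:

```python
import numpy as np

def ideal_binary_mask(target_power, interference_power, lc_db=0.0):
    """Binary T-F mask: 1 where the local target-to-interference ratio
    exceeds the local criterion (LC), 0 elsewhere."""
    snr_db = 10.0 * np.log10(target_power / np.maximum(interference_power, 1e-12))
    return (snr_db > lc_db).astype(np.float32)
```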

14.
This paper proposes and describes a complete system for Blind Source Extraction (BSE). The goal is to extract a target signal source in order to recognize spoken commands uttered in reverberant and noisy environments and acquired by a microphone array. The architecture of the BSE system is based on multiple stages: (a) TDOA estimation, (b) mixing-system identification for the target source, (c) on-line semi-blind source separation, and (d) source extraction. All the stages are effectively combined, allowing estimation of the target signal with limited distortion. While a generalization of the BSE framework is described, the proposed system is evaluated here on the data provided for the CHiME Pascal 2011 competition, i.e. binaural recordings made in a real-world domestic environment. The CHiME mixtures are processed with the BSE, and the recovered target signal is fed to a recognizer that uses noise-robust features based on Gammatone Frequency Cepstral Coefficients. Moreover, acoustic model adaptation is applied to further reduce the mismatch between training and testing data and improve overall performance. A detailed comparison between different models and algorithmic settings is reported, showing that the approach is promising and that the resulting system gives a significant reduction in error rate.
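Stage (a), TDOA estimation, is commonly done with a GCC-PHAT cross-correlation between microphone pairs; a generic sketch (the paper's exact estimator may differ):

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau=None):
    """Estimate the time difference of arrival between two microphone signals
    from the peak of the phase-transform-weighted cross-correlation."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    return (np.argmax(np.abs(cc)) - max_shift) / fs  # delay in seconds
```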

15.
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel-based classifiers such as support vector machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise by adapting the kernel rather than the SVM decision boundary. Generative kernels, defined using generative models, are one type of kernel that allows SVMs to handle sequence data. By compensating the parameters of the generative models for each noise condition, noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. The noise-specific kernels used in this paper are based on Vector Taylor Series (VTS) model-based compensation. VTS allows all the model parameters to be compensated and the background noise to be estimated in a maximum likelihood fashion. A brief discussion of VTS, and of the optimisation of the mismatch function representing the impact of noise on the clean speech, is also included. Experiments using these VTS-based test-set noise kernels were run on the AURORA 2 continuous digit task. The proposed SVM rescoring scheme yields large gains in performance over the VTS-compensated models.
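The mismatch function at the heart of VTS has a compact closed form in the log-spectral domain; a zeroth-order sketch for compensating a clean mean (cepstral-domain VTS additionally maps through the DCT, omitted here):

```python
import numpy as np

def vts_compensate_mean(mu_x, mu_h, mu_n):
    """Zeroth-order VTS compensation of a log-spectral clean mean mu_x for a
    channel mu_h and additive noise mu_n, using the standard mismatch
    function y = x + h + log(1 + exp(n - x - h))."""
    return mu_x + mu_h + np.log1p(np.exp(mu_n - mu_x - mu_h))
```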

16.
Speech signals are produced by articulatory movements with a modulation structure constrained by the regular phonetic sequence. This modulation structure encodes most of the speech intelligibility information that can be used to discriminate speech from noise. In this study, we propose a noise reduction algorithm based on this modulation property of speech. Two steps are involved: temporal modulation contrast normalization and modulation-event-preserving smoothing. The purpose of this processing is to normalize the modulation contrast of clean and noisy speech to the same level, and to smooth out the modulation artifacts caused by noise interference. Since the proposed method can be used independently for noise reduction, it can also be combined with traditional noise reduction methods to further reduce the effect of noise. We tested the proposed method as a front-end for robust speech recognition on the AURORA-2J corpus. Two advanced noise reduction methods, the ETSI advanced front-end (AFE) and particle filtering (PF) with minimum mean square error (MMSE) estimation, are used for comparison and combination. Experimental results show that, as an independent front-end processor, the proposed method outperforms the advanced methods and, when combined with them, consistently improves performance further over each method used alone.
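A hedged sketch of the contrast-normalization idea: rescale the standard deviation of each subband's temporal envelope trajectory to a shared reference level (the paper's exact formulation may differ, and the names here are assumptions):

```python
import numpy as np

def normalize_modulation_contrast(env, target_std):
    """Scale the temporal modulation contrast (std of each subband's
    log-envelope over time) to a reference level; env: (frames, bands)."""
    mu = env.mean(axis=0, keepdims=True)
    sigma = env.std(axis=0, keepdims=True) + 1e-10
    return mu + (env - mu) * (target_std / sigma)
```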

17.
An analysis-based non-linear feature extraction approach is proposed, inspired by a model of how speech amplitude spectra are affected by additive noise. Acoustic features are extracted from the noise-robust parts of speech spectra without losing discriminative information. Two non-linear processing methods, harmonic demodulation and spectral peak-to-valley ratio locking, are designed to minimize the mismatch between clean and noisy speech features. A previously studied method, peak isolation [IEEE Transactions on Speech and Audio Processing 5 (1997) 451], is also discussed in light of this model. These methods do not require noise estimation and are effective against both stationary and non-stationary noise. In the presence of additive noise, ASR experiments show that using these techniques in the computation of MFCCs greatly improves recognition performance. For the TI46 isolated-digits database, the average recognition rate across several SNRs improves from 60% (using unmodified MFCCs) to 95% (using the proposed techniques) with additive speech-shaped noise. For the Aurora 2 connected-digit-string database, the average recognition rate across different noise types, including non-stationary noise backgrounds, and SNRs improves from 58% to 80%.

18.
The noise robustness of automatic speech recognition systems can be improved by reducing any mismatch between the training and test data distributions during feature extraction. Based on the quantiles of these distributions, the parameters of transformation functions can be reliably estimated from small amounts of data. This paper gives a detailed review of quantile equalization applied to the Mel-scaled filter bank, including considerations for application in online systems and improvements through a second transformation step that combines neighboring filter channels. The recognition tests show that previous experimental observations on small-vocabulary recognition tasks are confirmed on the larger-vocabulary Aurora 4 noisy Wall Street Journal database. The word error rate was reduced from 45.7% to 25.5% (clean training) and from 19.5% to 17.0% (multicondition training).
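A nonparametric sketch of the quantile idea: map each filter channel of the test data onto the training distribution by interpolating between matched quantiles (the paper instead fits a small parametric transformation to the quantiles; the names here are illustrative):

```python
import numpy as np

def quantile_equalize(test_feat, train_quantiles, probs):
    """Map each filter-bank channel of the test features onto the training
    distribution via a monotone, piecewise-linear quantile-to-quantile map.
    train_quantiles: (len(probs), channels) quantiles of the training data."""
    out = np.empty_like(test_feat)
    for ch in range(test_feat.shape[1]):
        test_q = np.quantile(test_feat[:, ch], probs)
        out[:, ch] = np.interp(test_feat[:, ch], test_q, train_quantiles[:, ch])
    return out
```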

19.
Traditionally, the FFT (fast Fourier transform) has been utilized in speech recognition algorithms. Other discrete transforms such as the Walsh-Hadamard transform (WHT) and the rapid transform (RT) can play equally important roles in the recognition process, as they have advantages in implementation and hardware realization. The capability of these transforms to recognize phonemes based on training matrices and various matching criteria is investigated. The speech database consists of ten sentences spoken by ten different speakers (all male). For recognition purposes the speech is sectioned into 10 ms intervals and sampled at 20 kHz. Training matrices for all three transforms are developed. Test matrices in the transform domain are compared with the prototypes under matching criteria that drive the decision process. The WHT and RT appear to offer promise compared to the FFT, as they are easier to implement and yield comparable recognition results. Other distance measures and recognition schemes are proposed for improving classification accuracy.
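The implementation advantage of the WHT comes from its fast butterfly, which needs only additions and subtractions; a sketch for power-of-two lengths:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform of a length-2^k signal using the
    standard butterfly; only additions and subtractions are required."""
    a = np.asarray(x, dtype=float).copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            u, v = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = u + v, u - v
        h *= 2
    return a / np.sqrt(len(a))  # orthonormal scaling
```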

20.