Similar Literature (20 results)
1.
Feature normalization is a key objective in speech-related applications. In this paper, we study the effects of the Mean subtraction, Variance normalization, and Autoregressive Moving Average (ARMA) filtering (MVA) normalization method on the ETSI Advanced Front-End (AFE) features. A series of experiments on the Aurora-2 task was conducted to show the impact of MVA normalization on different subsets of AFE feature components. Compared to the AFE baseline system, recognition results show a performance improvement when only the logarithmic energy coefficient is normalized. However, performance degrades when the remaining AFE coefficients are normalized. To investigate this degradation, further experiments were performed by removing the blind equalization post-processing block implemented in the AFE. They showed that part of this degradation can plausibly be interpreted as over-normalization caused by applying MVA post-processing to the original AFE features. Furthermore, by analyzing the statistical distributions of AFE features, we found that the effectiveness of MVA could also be affected by the high intra-frame variability of AFE features.
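As a rough illustration of the MVA chain studied above, the following Python sketch applies per-utterance mean subtraction and variance normalization followed by ARMA smoothing along the time axis; the filter order and the boundary handling are assumptions for illustration, not the exact settings used in the paper.

```python
# A minimal sketch of MVA post-processing (mean subtraction, variance
# normalization, ARMA filtering) applied to a feature matrix.
import numpy as np

def mva(features, arma_order=2):
    """features: (num_frames, num_coeffs) array of e.g. AFE or MFCC features."""
    # Mean subtraction and variance normalization per utterance, per coefficient.
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
    # ARMA smoothing along time:
    # y[t] = (y[t-M] + ... + y[t-1] + z[t] + ... + z[t+M]) / (2M + 1)
    M = arma_order
    y = np.copy(z)
    for t in range(M, len(z) - M):
        y[t] = (y[t - M:t].sum(axis=0) + z[t:t + M + 1].sum(axis=0)) / (2 * M + 1)
    return y
```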

2.
Experimental analysis of the probability distributions of feature parameters shows that, under noise, the features typically exhibit a bimodal distribution. Based on this observation, this paper proposes a new feature normalization method built on a two-component Gaussian mixture model (GMM) to improve the robustness of speech recognition systems. The method uses the finer two-Gaussian model to represent the cumulative distribution function (CDF) of the feature parameters and, according to the estimated CDF, transforms the features so that their distributions in both training and recognition are warped to a standard Gaussian, thereby improving recognition accuracy. Experimental results on the Aurora 2 and Aurora 3 databases show that the proposed method clearly outperforms conventional cepstral mean normalization (CMN) and cepstral mean and variance normalization (CMVN), and performs roughly on par with the non-parametric histogram equalization approach.
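A hedged sketch of the CDF-based normalization idea: a two-component Gaussian mixture is fitted per feature dimension, its CDF is evaluated at each sample, and the result is mapped through the inverse standard-normal CDF. The library choices (scikit-learn, SciPy) and the tail clipping are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

def gmm_cdf_normalize(x):
    """x: (num_frames,) values of one cepstral coefficient over an utterance."""
    gmm = GaussianMixture(n_components=2).fit(x.reshape(-1, 1))
    w = gmm.weights_
    mu = gmm.means_.ravel()
    sigma = np.sqrt(gmm.covariances_).ravel()
    # Mixture CDF at each sample, then the inverse standard-normal CDF so the
    # output follows N(0, 1).
    cdf = w[0] * norm.cdf(x, mu[0], sigma[0]) + w[1] * norm.cdf(x, mu[1], sigma[1])
    cdf = np.clip(cdf, 1e-6, 1 - 1e-6)   # avoid infinities at the tails
    return norm.ppf(cdf)
```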

3.
In this paper, a set of features derived by filtering and spectral peak extraction in the autocorrelation domain is proposed. We focus on the effect of additive noise on speech recognition. Assuming that the channel characteristics and additive noises are stationary, these new features improve the robustness of speech recognition in noisy conditions. In this approach, the autocorrelation sequence of a speech signal frame is computed first. Filtering of the autocorrelation of the speech signal is carried out in the second step, and then the short-time power spectrum of speech is obtained through the fast Fourier transform. The power spectrum peaks are then located by differentiating the power spectrum with respect to frequency. The magnitudes of these peaks are projected onto the mel scale and passed through the filter bank. Finally, a set of cepstral coefficients is derived from the outputs of the filter bank. The effectiveness of the new features for speech recognition in noisy conditions is shown through a number of speech recognition experiments. A multi-speaker isolated-word recognition task and a multi-speaker continuous speech recognition task with various artificially added noises, such as factory, babble, car and F16, were used in these experiments. A set of experiments was also carried out on the Aurora 2 task. Experimental results show significant improvements under noisy conditions in comparison with traditional feature extraction methods. We also report the results obtained by applying cepstral mean normalization to the methods to obtain features robust against both additive noise and channel distortion.
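The following simplified sketch follows the front-end steps described above up to the peak spectrum (frame autocorrelation, removal of low lags as a stand-in for the filtering step, power spectrum, and peak picking by differentiation); the lag cut-off and FFT size are assumptions, and the mel filter bank plus DCT that follow are the same as in standard MFCC extraction.

```python
import numpy as np

def autocorr_peak_spectrum(frame, n_fft=512, drop_lags=2):
    # Biased autocorrelation sequence of one windowed speech frame.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:] / len(frame)
    r[:drop_lags] = 0.0                      # simple "filtering" of low lags
    power = np.abs(np.fft.rfft(r, n_fft))    # short-time power spectrum estimate
    d = np.diff(power)                       # differentiate w.r.t. frequency
    peaks = np.zeros_like(power)
    idx = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1   # local maxima
    peaks[idx] = power[idx]                  # keep only the peak magnitudes
    return peaks                             # then: mel filter bank + log + DCT
```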

4.
Recently, several algorithms have been proposed to enhance noisy speech by estimating a binary mask that can be used to select those time-frequency regions of a noisy speech signal that contain more speech energy than noise energy. This binary mask encodes the uncertainty associated with enhanced speech in the linear spectral domain. The use of the cepstral transformation smears the information from the noise dominant time-frequency regions across all the cepstral features. We propose a supervised approach using regression trees to learn the nonlinear transformation of the uncertainty from the linear spectral domain to the cepstral domain. This uncertainty is used by a decoder that exploits the variance associated with the enhanced cepstral features to improve robust speech recognition. Systematic evaluations on a subset of the Aurora4 task using the estimated uncertainty show substantial improvement over the baseline performance across various noise conditions.
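A hedged sketch of the supervised mapping: a regression tree learns to predict cepstral-domain variances from spectral-domain uncertainty features. The scikit-learn estimator and the form of the training targets are assumptions made for illustration, not the paper's exact recipe.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_uncertainty_mapper(spectral_uncertainty, cepstral_variance, max_depth=8):
    """spectral_uncertainty: (frames, freq_bins) binary-mask-derived uncertainty.
    cepstral_variance: (frames, cepstral_dims) target variances for the decoder."""
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(spectral_uncertainty, cepstral_variance)
    return tree   # at test time: tree.predict(new_spectral_uncertainty)
```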

5.
Data-driven temporal filtering approaches based on a specific optimization technique have been shown to be capable of enhancing the discrimination and robustness of speech features in speech recognition. The filters in these approaches are often obtained from the statistics of the features in the temporal domain. In this paper, we derive new data-driven temporal filters that employ the statistics of the modulation spectra of the speech features. Three new temporal filtering approaches are proposed, based on constrained versions of linear discriminant analysis (LDA), principal component analysis (PCA), and minimum class distance (MCD), respectively. It is shown that these proposed temporal filters can effectively improve speech recognition accuracy in various noise-corrupted environments. In experiments conducted on Test Set A of the Aurora-2 noisy digits database, these new temporal filters, together with cepstral mean and variance normalization (CMVN), provide average relative error reduction rates of over 40% and 27% when compared with baseline Mel-frequency cepstral coefficient (MFCC) processing and CMVN alone, respectively.
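A rough sketch of the combination evaluated above: a learned temporal FIR filter is applied to each cepstral trajectory, followed by per-utterance CMVN. The filter taps are placeholders obtained elsewhere (e.g., from constrained PCA/LDA on training modulation spectra), not coefficients from the paper.

```python
import numpy as np

def temporal_filter_then_cmvn(features, taps):
    """features: (num_frames, num_coeffs); taps: 1-D FIR filter learned offline."""
    # Filter each coefficient trajectory along the time axis.
    filtered = np.apply_along_axis(lambda c: np.convolve(c, taps, mode="same"),
                                   axis=0, arr=features)
    # Per-utterance cepstral mean and variance normalization.
    return (filtered - filtered.mean(axis=0)) / (filtered.std(axis=0) + 1e-8)
```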

6.
邓蕾  高勇 《计算机系统应用》2017,26(12):227-232
To address the sharp drop in speaker recognition performance in noisy environments, a robust feature extraction method for speaker recognition is proposed. Warped filter banks (WFBS) are used to model the auditory characteristics of the human ear, and cube-root compression, RelAtive SpecTrAl filtering (RASTA) and cepstral mean and variance normalization (CMVN) are incorporated into the robust feature extraction. Simulations with a Gaussian mixture model (GMM) back-end show that the extracted features outperform both MFCC and CFCC features in robustness and recognition performance.
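For reference, a hedged sketch of the RASTA filtering step mentioned above, applied along the time axis of a feature matrix; the coefficients shown are the commonly cited ones and should be treated as assumptions rather than this paper's exact values.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(features, pole=0.98):
    """features: (num_frames, num_coeffs); filtering runs along the time axis."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # derivative-like FIR part
    a = np.array([1.0, -pole])                        # integrating pole
    return lfilter(b, a, features, axis=0)
```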

7.
The aim of this investigation is to determine to what extent automatic speech recognition may be enhanced if, in addition to the linear compensation accomplished by mean and variance normalisation, a non-linear mismatch reduction technique is applied to the cepstral and energy features, respectively. An additional goal is to determine whether the degree of mismatch between the feature distributions of the training and test data that is associated with acoustic mismatch differs for the cepstral and energy features. Towards these aims, two non-linear mismatch reduction techniques – time domain noise reduction and histogram normalisation – were evaluated on the Aurora2 digit recognition task as well as on a continuous speech recognition task with noisy test conditions similar to those in the Aurora2 experiments. The experimental results show that recognition performance is enhanced by the application of both non-linear mismatch reduction techniques. The best results are obtained when the two techniques are applied simultaneously. The results also reveal that the mismatch in the energy features is quantitatively and qualitatively much larger than the corresponding mismatch associated with the cepstral coefficients. The most substantial gains in average recognition rate are therefore accomplished by reducing training-test mismatch for the energy features.

8.
Cepstral normalization has widely been used as a powerful approach to produce robust features for speech recognition. Good examples of this approach include cepstral mean subtraction, and cepstral mean and variance normalization, in which either the first or both the first and the second moments of the Mel-frequency cepstral coefficients (MFCCs) are normalized. In this paper, we propose the family of higher order cepstral moment normalization, in which the MFCC parameters are normalized with respect to a few moments of orders higher than 1 or 2. The basic idea is that the higher order moments are more dominated by samples with larger values, which are very likely the primary sources of the asymmetry and abnormal flatness or tail size of the parameter distributions. Normalization with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric with more reasonable flatness and tail size. The fundamental principles behind this approach are also analyzed and discussed based on the statistical properties of the distributions of the MFCC parameters. Experimental results based on the AURORA 2, AURORA 3, AURORA 4, and Resource Management (RM) testing environments show that with the proposed approach, recognition accuracy can be significantly and consistently improved for all types of noise and all SNR conditions.
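A simplified, hedged sketch of the idea of normalizing a higher-order moment: after mean subtraction, each coefficient trajectory is rescaled so that its N-th order absolute central moment becomes one. The choice N = 3 and the unit target are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def higher_order_moment_normalize(features, order=3):
    """features: (num_frames, num_coeffs) MFCC matrix for one utterance."""
    centered = features - features.mean(axis=0)
    moment = np.mean(np.abs(centered) ** order, axis=0)   # N-th absolute moment
    scale = moment ** (1.0 / order) + 1e-8                # makes the moment ~1
    return centered / scale
```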

9.
Humans are quite adept at communicating in the presence of noise. However, most speech processing systems, like automatic speech and speaker recognition systems, suffer from a significant drop in performance when speech signals are corrupted with unseen background distortions. The proposed work explores the use of a biologically motivated multi-resolution spectral analysis for speech representation. This approach focuses on the information-rich spectral attributes of speech and presents an intricate yet computationally efficient analysis of the speech signal through a careful choice of model parameters. Further, the approach takes advantage of an information-theoretic analysis of the message-dominant and speaker-dominant regions in the speech signal, and defines feature representations to address two diverse tasks, speech recognition and speaker recognition. The proposed analysis surpasses standard Mel-frequency cepstral coefficients (MFCC) and their enhanced variants (via mean subtraction, variance normalization and time-sequence filtering), and yields significant improvements over a state-of-the-art noise-robust feature scheme on both speech and speaker recognition tasks.

10.
To address speech enhancement in a wideband noise background, short-time speech is treated as a non-stationary or wide-sense stationary signal, and an FIR-type adaptive filtering algorithm (SSLMS) is proposed based on spectral subtraction and the least mean square (LMS) algorithm of adaptive filtering. Spectral subtraction is used to estimate the desired signal from the short-time noisy observation, and this estimate serves as the reference signal for the filter output; the difference between the filter output and the reference signal is taken as the error signal, and the LMS algorithm is used to compute the weight correction and update the filter. In the steepest-descent adjustment of the weights, normalized LMS, sign LMS and block LMS techniques are adopted to simplify the choice of a step size that guarantees convergence and to reduce the computation of the weight updates, thus increasing the adaptation speed. Simulation experiments on different speech signals at various signal-to-noise ratios, compared with an improved spectral subtraction method, show that the enhancement achieved by the proposed method is better than that of spectral subtraction, and that the enhancement remains satisfactory at an SNR of 3 dB.
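A rough sketch of the adaptive part of such a scheme: a normalized LMS (NLMS) FIR filter driven by the noisy speech, with the spectral-subtraction estimate acting as the reference signal. Filter length, step size and the regularization constant are illustrative assumptions.

```python
import numpy as np

def nlms_enhance(noisy, reference, num_taps=32, mu=0.5, eps=1e-6):
    """noisy: observed speech; reference: spectral-subtraction estimate of the
    desired signal (same length). Returns the adaptive filter output."""
    w = np.zeros(num_taps)
    out = np.zeros_like(noisy, dtype=float)
    for n in range(num_taps, len(noisy)):
        x = noisy[n - num_taps:n][::-1]          # most recent samples first
        y = np.dot(w, x)                         # filter output
        e = reference[n] - y                     # error vs. reference signal
        w += mu * e * x / (eps + np.dot(x, x))   # normalized LMS weight update
        out[n] = y
    return out
```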

11.
To address the drop in the recognition rate of mel-frequency cepstral coefficient (MFCC) features in noisy environments, an improved feature extraction method based on cochlear filter cepstral coefficients (CFCC) is proposed. CFCC features, which reflect auditory characteristics, are extracted; an improved linear discriminant analysis (LDA) algorithm is then applied to linearly transform them, yielding more discriminative features and the diagonal covariance matrices required by hidden Markov models (HMM); finally, mean and variance normalization is performed to obtain the final features. Experimental results show that the proposed method effectively improves the recognition rate and robustness of speech recognition systems in noisy environments.

12.
The mel-frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) feature extraction typically used for automatic speech recognition (ASR) employs several principles which have known counterparts in the cochlea and auditory nerve: frequency decomposition, mel- or bark-warping of the frequency axis, and compression of amplitudes. It seems natural to ask if one can profitably employ a counterpart of the next physiological processing step, synaptic adaptation. We therefore incorporated a simplified model of short-term adaptation into MFCC feature extraction. We evaluated the resulting ASR performance on the AURORA 2 and AURORA 3 tasks, in comparison to ordinary MFCCs, MFCCs processed by RASTA, and MFCCs processed by cepstral mean subtraction (CMS), and both in comparison to and in combination with Wiener filtering. The results suggest that our approach offers a simple, causal robustness strategy which is competitive with RASTA, CMS, and Wiener filtering, and performs well in combination with Wiener filtering. Compared to the structurally related RASTA, our adaptation model provides superior performance on AURORA 2 and, if Wiener filtering is used prior to both approaches, on AURORA 3 as well.

13.
This paper proposes an adaptive Wiener filtering method for speech enhancement. The method adapts the filter transfer function from sample to sample based on the speech signal statistics: the local mean and the local variance. It is implemented in the time domain rather than in the frequency domain to accommodate the time-varying nature of speech signals. The proposed method is compared to the traditional frequency-domain Wiener filtering, spectral subtraction and wavelet denoising methods using different speech quality metrics. The simulation results reveal the superiority of the proposed Wiener filtering method in the case of Additive White Gaussian Noise (AWGN) as well as colored noise.
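A minimal sketch of a sample-by-sample time-domain Wiener rule based on local statistics, in the spirit of the method described above; the window length and the way the noise variance is estimated are assumptions for illustration.

```python
import numpy as np

def adaptive_wiener(noisy, win=33, noise_var=None):
    """noisy: 1-D noisy speech signal; win: length of the local analysis window."""
    kernel = np.ones(win) / win
    local_mean = np.convolve(noisy, kernel, mode="same")
    local_var = np.convolve(noisy ** 2, kernel, mode="same") - local_mean ** 2
    if noise_var is None:
        noise_var = np.mean(local_var)            # crude noise-floor estimate
    # Wiener gain from local variance and the noise variance estimate.
    gain = np.maximum(local_var - noise_var, 0.0) / np.maximum(local_var, 1e-10)
    return local_mean + gain * (noisy - local_mean)
```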

14.
This study proposes an unsupervised noise reduction scheme that improves the performance of voice-based information retrieval tasks in mobile environments. Various types of noise can interfere with speech processing tasks, and noise reduction has become an essential technique in this field. In particular, noise reduction needs to be handled carefully in mobile environments because of the speech coding system and the client-server architecture. In this study, we propose an effective noise reduction scheme that employs the adaptive comb filtering technique. A way of directly using several codec parameters during the filtering process is also investigated; in particular, we modify the conventional comb filter using line spectral pair parameters. To verify the efficiency of the proposed noise reduction approach, we conducted speech recognition experiments using the Aurora2 database. Our approach provided superior recognition performance under various noise conditions compared to the conventional techniques.
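A rough sketch of an FIR comb filter that reinforces harmonics at multiples of the pitch period; in the proposed scheme the period and the adaptation would be driven by codec parameters such as the line spectral pairs, which is only hinted at here by the pitch_period argument.

```python
import numpy as np

def comb_filter(noisy, pitch_period, num_taps=3, alpha=0.8):
    """noisy: 1-D speech signal; pitch_period: estimated period in samples."""
    out = np.copy(noisy).astype(float)
    norm_factor = 1.0 + sum(alpha ** k for k in range(1, num_taps + 1))
    for k in range(1, num_taps + 1):
        delay = k * pitch_period
        out[delay:] += (alpha ** k) * noisy[:-delay]   # add weighted delayed copies
    return out / norm_factor
```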

15.
Single-Channel Speech Separation Using Soft Mask Filtering
We present an approach for separating two speech signals when only one single recording of their linear mixture is available. For this purpose, we derive a filter, which we call the soft mask filter, using minimum mean square error (MMSE) estimation of the log spectral vectors of sources given the mixture's log spectral vectors. The soft mask filter's parameters are estimated using the mean and variance of the underlying sources which are modeled using the Gaussian composite source modeling (CSM) approach. It is also shown that the binary mask filter which has been empirically and extensively used in single-channel speech separation techniques is, in fact, a simplified form of the soft mask filter. The soft mask filtering technique is compared with the binary mask and Wiener filtering approaches when the input consists of male+male, female+female, and male+female mixtures. The experimental results in terms of signal-to-noise ratio (SNR) and segmental SNR show that soft mask filtering outperforms binary mask and Wiener filtering.
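A hedged sketch contrasting the soft mask with the binary mask in the spectral domain; the soft mask is written here in its Wiener-like ratio form using per-source power estimates, which is a simplification of the MMSE log-spectral derivation in the paper.

```python
import numpy as np

def soft_mask(power_src1, power_src2):
    """power_src1/2: estimated power spectra of the two sources (same shape)."""
    return power_src1 / (power_src1 + power_src2 + 1e-10)

def binary_mask(power_src1, power_src2):
    return (power_src1 > power_src2).astype(float)

# Applying either mask to the mixture magnitude spectrum recovers source 1:
# est_src1 = mask * mixture_magnitude
```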

16.
Feature statistics normalization in the cepstral domain is one of the best-performing approaches for robust automatic speech and speaker recognition in noisy acoustic scenarios: feature coefficients are normalized using suitable linear or nonlinear transformations in order to match the noisy speech statistics to the clean speech ones. Histogram equalization (HEQ) belongs to this category of algorithms, has proved effective for this purpose, and is therefore taken here as the reference. In this paper, the presence of multiple acoustic channels is used to enhance the statistics-modeling capabilities of the HEQ algorithm by exploiting the availability of multiple noisy speech occurrences, with the aim of maximizing the effectiveness of the cepstral normalization process. Computer simulations based on the Aurora 2 database in speech and speaker recognition scenarios have shown that a significant recognition improvement can be achieved with respect to the single-channel counterpart and other multi-channel techniques, confirming the effectiveness of the idea. The proposed algorithmic configuration has also been combined with a kernel estimation technique in order to further improve speech recognition performance.
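A compact sketch of histogram equalization of a single cepstral coefficient: empirical ranks are mapped through the inverse CDF of a standard-normal reference. The rank-based CDF estimator is a common choice assumed here; a multi-channel variant would simply pool the occurrences from all channels before estimating the ranks.

```python
import numpy as np
from scipy.stats import norm

def histogram_equalize(x):
    """x: (num_frames,) values of one cepstral coefficient over an utterance."""
    ranks = np.argsort(np.argsort(x)) + 0.5    # 0.5 offset keeps the CDF in (0, 1)
    empirical_cdf = ranks / len(x)
    return norm.ppf(empirical_cdf)             # match the distribution to N(0, 1)
```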

17.
Spectral Envelope Autocorrelation Technique in Speech Recognition
A linear prediction analysis method for speech recognition is proposed: the spectral envelope is obtained by spectral autocorrelation and frequency sampling, i.e., the envelope is estimated from the normalized frequencies and specified on the Mel frequency scale; the sampled autocorrelation is then estimated from the spectral envelope of the speech signal, with the sampled autocorrelation estimate extracted via the IDFT. From the sampled autocorrelation, spectral envelope cepstral coefficients are finally obtained. HMM recognition experiments show that, compared with other algorithms, the spectral envelope cepstral coefficients improve the recognition rate by more than 10% at low signal-to-noise ratios, clearly improving recognition performance and achieving good results in noisy environments.

18.
At present, automatic speech recognition systems often suffer from a mismatch between the training and testing environments caused by complex environmental factors, which greatly degrades system performance and severely limits the range of applications of speech recognition technology. In recent years, many robust speech recognition techniques have been successfully proposed, all sharing the same goal: to improve system robustness and thereby the recognition rate. Among them, feature-based normalization techniques are simple and effective and are often the first choice for robust speech recognition; they compensate for the effect of environment mismatch mainly by normalizing the statistical moments, the cumulative density functions, or the power spectra of the feature vectors. This paper surveys the current mainstream normalization methods, including cepstral moment normalization, histogram equalization, and modulation spectrum normalization.

19.

In the current scenario, speaker recognition under noisy conditions is a major challenge in the area of speech processing, since a noisy environment causes a significant degradation in system performance. The main aim of the proposed work is to identify speakers under both clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based mel-frequency cepstral coefficient (MFCC) and power-normalized cepstral coefficient (PNCC) techniques with fusion strategies. MFCC and PNCC with different multitapers are used to extract the desired features from the speech samples. Cepstral mean and variance normalization (CMVN) and feature warping (FW) are then applied to normalize the features obtained from both techniques. Furthermore, a low-dimensional i-vector model is used as the system model, and different fusion-score strategies, namely mean, maximum, weighted sum, cumulative and concatenated fusion, are utilized. Finally, an extreme learning machine (ELM), a single-hidden-layer feedforward neural network with lower complexity and training time than other neural networks, is used for classification in order to increase the system identification accuracy (SIA). Two different databases, TIMIT and SITW 2016, are used to evaluate the proposed system under limited-data conditions. Both clean and noisy background conditions are used to check the SIA.
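A hedged sketch of feature warping over a sliding window, one of the two normalizations compared above: within each window, the centre frame's value is replaced by the standard-normal quantile of its rank. The window length is an assumption (roughly three seconds of frames is a common choice).

```python
import numpy as np
from scipy.stats import norm

def feature_warp(x, win=301):
    """x: (num_frames,) trajectory of one cepstral coefficient."""
    half = win // 2
    out = np.copy(x)
    for t in range(half, len(x) - half):
        window = x[t - half:t + half + 1]
        rank = np.sum(window < x[t]) + 0.5          # rank of the centre frame
        out[t] = norm.ppf(rank / win)               # map rank to N(0, 1) quantile
    return out
```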


20.
This paper investigates the potential of exploiting the redundancy implicit in multiple-resolution analysis for automatic speech recognition systems. The analysis is performed by a binary tree of elements, each of which consists of a half-band filter followed by a down-sampler that discards odd samples. Filter design and feature computation from the samples are discussed, and recognition performance with different choices is presented. A paradigm is proposed that consists of redundant feature extraction, followed by feature normalization, followed by dimensionality reduction. Feature normalization is performed by denoising algorithms; two of them are considered and evaluated, namely signal-to-noise-ratio-dependent spectral subtraction and soft thresholding. Dimensionality reduction is performed with principal component analysis. Experiments using telephone corpora and the Aurora 3 corpus are reported. They indicate that the proposed paradigm leads to clean-speech recognition performance, measured in word error rate, marginally superior to that obtained with perceptual linear prediction coefficients. Nevertheless, the performance of the proposed analysis paradigm is significantly superior on noisy data when the same denoising algorithm is applied to all the analysis methods compared.
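A minimal sketch of the soft-thresholding rule used as one of the denoising-based normalization steps in the paradigm above; the threshold value, or how it would be derived from a noise estimate, is left as an assumption.

```python
import numpy as np

def soft_threshold(coeffs, threshold):
    """Shrink coefficients toward zero: sign(c) * max(|c| - threshold, 0)."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - threshold, 0.0)
```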
