Similar Articles
20 similar articles found.
1.
Automatic recognition of children’s speech using acoustic models trained by adults results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what is needed by common speaker adaptation techniques to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and the linearity of their back-end transformations is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children’s speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians, Proceedings of International Conference on Spoken Language Processing, 1229–1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques.
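Item 1's analysis hinges on how MFCC are computed from Mel-warped triangular filter banks. A minimal numpy sketch of that front end (illustrative parameter choices such as 26 filters and 13 cepstra, not the paper's exact configuration) is:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced uniformly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frame, sr, n_filters=26, n_ceps=13):
    # Power spectrum -> Mel filter-bank energies -> log -> DCT-II.
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2
    log_e = np.log(mel_filterbank(n_filters, n_fft, sr) @ spec + 1e-10)
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_e
```

Because the final step is a fixed linear (DCT) projection of log filter-bank energies, warping the filter-bank frequencies acts, under the paper's approximations, as a linear map in this cepstral space.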

2.

Speaker recognition under noisy conditions is a major challenge in speech processing, since environmental noise significantly degrades system performance. The aim of the proposed work is to identify speakers under both clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based Mel frequency cepstral coefficient (MFCC) and power-normalized cepstral coefficient (PNCC) techniques with fusion strategies. MFCC and PNCC features are extracted from the speech samples using different multitapers, and the obtained features are normalized with cepstral mean and variance normalization (CMVN) and feature warping (FW). A low-dimensional i-vector model is used as the system model, and several fusion score strategies are utilized: mean, maximum, weighted-sum, cumulative, and concatenated fusion. Finally, an extreme learning machine (ELM), a single-hidden-layer feedforward neural network with lower complexity and training time than other neural networks, is used for classification to increase the system identification accuracy (SIA). The proposed system is evaluated on limited data from two databases, TIMIT and SITW 2016, under both clean and noisy background conditions.


3.
This paper describes and discusses the "STBU" speaker recognition system, which performed well in the NIST Speaker Recognition Evaluation 2006 (SRE). STBU is a consortium of four partners: Spescom DataVoice (Stellenbosch, South Africa), TNO (Soesterberg, The Netherlands), BUT (Brno, Czech Republic), and the University of Stellenbosch (Stellenbosch, South Africa). The STBU system was a combination of three main kinds of subsystems: 1) GMM, with short-time Mel frequency cepstral coefficient (MFCC) or perceptual linear prediction (PLP) features, 2) Gaussian mixture model-support vector machine (GMM-SVM), using GMM mean supervectors as input to an SVM, and 3) maximum-likelihood linear regression-support vector machine (MLLR-SVM), using MLLR speaker adaptation coefficients derived from an English large vocabulary continuous speech recognition (LVCSR) system. All subsystems made use of supervector subspace channel compensation methods: either eigenchannel adaptation or nuisance attribute projection. We document the design and performance of all subsystems, as well as their fusion and calibration via logistic regression. Finally, we also present a cross-site fusion that was done with several additional systems from other NIST SRE-2006 participants.
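Item 3's GMM-SVM subsystem feeds GMM mean supervectors to an SVM. A toy numpy sketch of relevance-MAP mean adaptation followed by supervector stacking (unit covariances and a hypothetical relevance factor assumed, purely for illustration):

```python
import numpy as np

def map_adapt_supervector(ubm_means, ubm_weights, feats, r=16.0):
    """MAP-adapt UBM means toward a speaker's features, then stack the
    adapted means into a single supervector (the SVM input in GMM-SVM).

    ubm_means: (K, D), ubm_weights: (K,), feats: (T, D).
    Unit-variance Gaussians are assumed to keep the sketch short."""
    # Posterior responsibilities of each mixture for each frame.
    d2 = ((feats[:, None, :] - ubm_means[None]) ** 2).sum(-1)
    logp = np.log(ubm_weights) - 0.5 * d2
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    # Zeroth- and first-order statistics.
    n_k = post.sum(axis=0)
    f_k = post.T @ feats
    # Relevance-MAP interpolation between data mean and UBM mean.
    alpha = n_k / (n_k + r)
    new_means = (alpha[:, None] * f_k / np.maximum(n_k, 1e-10)[:, None]
                 + (1.0 - alpha)[:, None] * ubm_means)
    return new_means.reshape(-1)  # (K * D,) supervector
```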

4.
In this work, we combine the decisions of two classifiers as an alternative means of improving the performance of a speaker recognition system in adverse environments. The difference between these classifiers lies in their feature sets. One system is based on the popular mel-frequency cepstral coefficients (MFCC) and the other on the new parametric feature-sets (PFS) algorithm. Both feature vectors use mel-scale spectral warping and are computed in the cepstral domain, but the feature sets differ in their use of spectral filters and compressions. The two classifiers achieve similar recognition rates, but they are complementary. This shows that there is information not captured by the popular MFCC, and that PFS is able to add further information for improved performance. Several ways of combining these classifiers give significant improvements in a speaker identification task on the very large, telephone-degraded NTIMIT database.
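One common way to combine complementary classifiers like those above is weighted-sum fusion of their per-speaker scores after z-normalisation; the abstract does not specify its combination rules, so this is a generic sketch, not the paper's method:

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w=0.5):
    """Weighted-sum fusion of two classifiers' scores.

    Each score vector is z-normalised first so the two systems contribute
    on a comparable scale; `w` weights system A against system B."""
    za = (scores_a - scores_a.mean()) / (scores_a.std() + 1e-10)
    zb = (scores_b - scores_b.mean()) / (scores_b.std() + 1e-10)
    return w * za + (1.0 - w) * zb
```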

5.
In this paper, we study the effect of filter bank smoothing on the recognition performance of children's speech. Filter bank smoothing of spectra is done during the computation of the Mel filter bank cepstral coefficients (MFCCs). We study the effect of smoothing both with and without vocal-tract length normalization (VTLN). The results from our experiments indicate that, unlike conventional VTLN implementations, it is better not to scale the bandwidths of the filters during VTLN; only the filter center frequencies need be scaled. Our interpretation of this result is that while the formant center frequencies may approximately scale between speakers, the formant bandwidths do not change significantly. Therefore, the scaling of filter bandwidths by a warp factor during conventional VTLN results in differences in spectral smoothing, leading to degradation in recognition performance. Similarly, our experiments indicate that for telephone-based speech without normalization it is better to use uniform-bandwidth filters instead of the constant-Q-like filters used in the computation of conventional MFCC. Our interpretation is that with constant-Q filters there is excessive spectral smoothing at higher frequencies, which degrades performance for children's speech. However, the use of constant-Q filters during VTLN does not create any additional performance degradation. As we will show, during VTLN it is only important that the filter bandwidths are not scaled, irrespective of whether we use constant-Q or uniform-bandwidth filters. With our proposed changes in the filter bank implementation we get comparable performance for adults and about 6% improvement for children, both with and without VTLN, on a telephone-based digit recognition task.
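The key recommendation of item 5, warping only the filter centre frequencies while keeping each triangle's bandwidth fixed, can be sketched directly (filter centres and bandwidths here are illustrative inputs, not the paper's telephone-band configuration):

```python
import numpy as np

def vtln_centers_only(centers_hz, bandwidths_hz, alpha):
    """Warp only the filter centre frequencies by warp factor `alpha`,
    keeping each triangular filter's bandwidth fixed, instead of scaling
    centres and bandwidths together as conventional VTLN does.

    Returns the (left edge, centre, right edge) of each warped filter."""
    centers = alpha * np.asarray(centers_hz, dtype=float)
    half = np.asarray(bandwidths_hz, dtype=float) / 2.0
    return centers - half, centers, centers + half
```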

6.
Automatic speech recognition (ASR) systems follow a well-established pattern recognition approach: signal-processing-based feature extraction at the front-end and likelihood evaluation of feature vectors at the back-end. Mel-frequency cepstral coefficients (MFCCs) are the features widely used in state-of-the-art ASR systems; they are derived from the logarithmic spectral energies of the speech signal using a Mel-scale filterbank. In the filterbank analysis of MFCC there is no consensus on the spacing and number of filters to use across noise conditions and applications. In this paper, we propose a novel approach that uses particle swarm optimization (PSO) and a genetic algorithm (GA) to optimize the parameters of the MFCC filterbank, such as the central and side frequencies. The experimental results show that the new front-end outperforms the conventional MFCC technique. All investigations are conducted with two separate classifiers, HMM and MLP, for Hindi vowel recognition in a typical field condition as well as in a noisy environment.
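A generic PSO loop of the kind item 6 applies to filterbank parameters can be sketched as follows; the objective here would be recognition error as a function of the frequency parameters, but any function works, and the swarm hyperparameters (w, c1, c2) are conventional defaults, not the paper's values:

```python
import numpy as np

def pso(objective, dim, n_particles=20, iters=50, lo=0.0, hi=1.0, seed=0):
    """Minimal particle swarm optimizer: each particle tracks its personal
    best, the swarm tracks a global best, and velocities blend inertia
    with attraction to both."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(lo, hi, (n_particles, dim))
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_f = np.array([objective(p) for p in x])
    gbest = pbest[pbest_f.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia, cognitive, social weights
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)  # keep parameters in a valid range
        f = np.array([objective(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        gbest = pbest[pbest_f.argmin()].copy()
    return gbest, pbest_f.min()
```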

7.
Speaker Recognition Based on FMFCC and HMM
张永亮, 张先庭, 鲁宇明. Computer Simulation (计算机仿真), 2010, 27(5): 352-354, 358
Mel frequency cepstral coefficients (MFCC) are commonly used feature parameters in speaker recognition, but speech is a non-stationary signal and MFCC do not capture its time-frequency characteristics well. To address this shortcoming and improve the speaker recognition rate, MFCC are generalized to a fractional form by means of the fractional Fourier transform (FRFT), a recent time-frequency analysis tool, yielding fractional Mel frequency cepstral coefficients (FMFCC) to characterize the speech signal. The effectiveness of the feature parameters is verified with a separability measure. An FMFCC feature library of 20 different speakers is built, and hidden Markov models (HMM) are used for speaker recognition in simulation. The results show that, at a suitable transform order, the average speaker recognition rate can exceed 93%.

8.
Application of an Improved MFCC Feature Algorithm in Speech Recognition
This paper presents an improved algorithm for Mel frequency cepstral coefficient features. The algorithm extracts the residual phase from the speech signal by linear prediction, combines it with conventional MFCC, and applies the combination in a speech recognition system. The improved algorithm achieves a better recognition rate than conventional MFCC.
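The linear-prediction residual that item 8 builds on is obtained by fitting an AR model and inverse-filtering the signal. A numpy sketch using the autocorrelation method with the Levinson-Durbin recursion (the residual-phase step itself is omitted):

```python
import numpy as np

def lp_coeffs(x, order):
    """LP coefficients [1, a1, ..., ap] via autocorrelation + Levinson-Durbin."""
    r = np.correlate(x, x, "full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]  # prediction error energy
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1 : 0 : -1])
        k = -acc / e  # reflection coefficient
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        e *= (1.0 - k * k)
    return a

def lp_residual(x, order=10):
    """Inverse-filter the signal with A(z) to obtain the LP residual."""
    return np.convolve(x, lp_coeffs(x, order))[: len(x)]
```

For a strongly predictable signal the residual has much less energy than the signal itself, which is what makes the residual a distinct information source to combine with MFCC.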

9.
10.
This paper presents a fuzzy control mechanism for conventional maximum likelihood linear regression (MLLR) speaker adaptation, called FLC-MLLR, by which the effect of MLLR adaptation is regulated according to the availability of adaptation data: the advantage of MLLR adaptation is fully exploited when the training data are sufficient, and the consequences of poor MLLR adaptation are restrained otherwise. The robustness of MLLR adaptation against data scarcity is thus ensured. The proposed mechanism is conceptually simple, computationally inexpensive, and effective; recognition-rate experiments show that FLC-MLLR outperforms standard MLLR, especially when encountering data insufficiency, and performs better than MAPLR at much less computing cost.
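The abstract does not specify the fuzzy controller itself, so the following is only a hypothetical sketch of the general idea: blend the MLLR-transformed mean with the original mean by a membership weight that grows with the amount of adaptation data. The piecewise-linear membership and the `low`/`high` frame thresholds are invented for illustration.

```python
import numpy as np

def regulated_adapt(mean, A, b, n_frames, low=200.0, high=2000.0):
    """Regulate MLLR adaptation strength by data availability.

    mean: original Gaussian mean; A, b: MLLR transform; n_frames: amount
    of adaptation data. w ramps from 0 (too little data, keep the prior
    mean) to 1 (ample data, trust the MLLR transform fully)."""
    w = float(np.clip((n_frames - low) / (high - low), 0.0, 1.0))
    return w * (A @ mean + b) + (1.0 - w) * mean
```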

11.
In this paper we introduce a robust feature extractor, dubbed robust compressive gammachirp filterbank cepstral coefficients (RCGCC), based on an asymmetric and level-dependent compressive gammachirp filterbank and a sigmoid-shaped weighting rule for the enhancement of speech spectra in the auditory domain. The goal of this work is to improve the robustness of speech recognition systems in additive noise and real-time reverberant environments. As a post-processing scheme we employ a short-time feature normalization technique called short-time cepstral mean and scale normalization (STCMSN), which, by adjusting the scale and mean of cepstral features, reduces the difference of cepstra between the training and test environments. For performance evaluation of the proposed feature extractor in the context of speech recognition, we use the standard noisy AURORA-2 connected digit corpus, the meeting recorder digits (MRDs) subset of the AURORA-5 corpus, and the AURORA-4 LVCSR corpus, which represent additive noise, reverberant acoustic conditions, and additive noise combined with different microphone channel conditions, respectively. The ETSI advanced front-end (ETSI-AFE), the recently proposed power normalized cepstral coefficients (PNCC), conventional MFCC, and PLP features are used for comparison purposes. Experimental speech recognition results demonstrate that the proposed method is robust against both additive and reverberant environments. The proposed method provides results comparable to those of the ETSI-AFE and PNCC on the AURORA-2 and AURORA-4 corpora, and provides considerable improvements with respect to the other feature extractors on the AURORA-5 corpus.

12.
Feature extraction is an essential and important step for speaker recognition systems. In this paper, we propose to improve these systems by exploiting both conventional features, such as mel frequency cepstral coding (MFCC) and linear predictive cepstral coding (LPCC), and non-conventional ones. The method exploits information present in the linear predictive (LP) residual signal. The features extracted from the LP residual are then combined with the MFCC or the LPCC. We investigate two approaches, termed temporal and frequential representations. The first consists of an auto-regressive (AR) modelling of the signal followed by a cepstral transformation, in a similar way to the LPC-LPCC transformation. In order to take into account the non-linear nature of speech signals, we use two estimation methods based on second- and third-order statistics, termed, respectively, R-SOS-LPCC (residual plus second-order-statistics-based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (higher order). Concerning the frequential approach, we exploit a filter bank method called the power difference of spectra in sub-band (PDSS), which measures the spectral flatness over the sub-bands. The resulting features are named R-PDSS. These proposed schemes are analysed on a speaker identification problem with two different databases. The first is the Gaudi database, containing 49 speakers; its main interest lies in the controlled acquisition conditions: mismatch between the microphones and the inter-session intervals. The second database is the well-known NTIMIT corpus with 630 speakers, over which the performance of the features is confirmed. In addition, we propose to compare traditional features and residual ones by the fusion of recognizers (feature extractor + classifier). The results show that residual features carry speaker-dependent information, and their combination with the LPCC or the MFCC yields global improvements in terms of robustness under different mismatches. A comparison between the residual features under the opinion-fusion framework gives useful information about the potential of both the temporal and frequential representations.
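The PDSS measure described in item 12 is a per-band spectral-flatness statistic: one minus the ratio of the geometric to the arithmetic mean of the power spectrum within each sub-band. A short numpy sketch (band edges given as spectral bin indices, chosen here only for illustration):

```python
import numpy as np

def pdss(power_spectrum, band_edges):
    """Power Difference of Spectra in Sub-band.

    For each sub-band, returns 1 - geometric_mean / arithmetic_mean of
    the power spectrum: near 0 for flat (noise-like) bands, approaching
    1 for peaky (harmonic) bands."""
    feats = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = power_spectrum[lo:hi] + 1e-12  # avoid log(0)
        gm = np.exp(np.mean(np.log(band)))
        am = np.mean(band)
        feats.append(1.0 - gm / am)
    return np.array(feats)
```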

13.
Exploiting the capabilities offered by the plethora of existing wavelets, together with the powerful set of orthonormal bases provided by wavelet packets, we construct a novel wavelet packet-based set of speech features that is optimized for the task of speaker verification. Our approach differs from previous wavelet-based work, primarily in the wavelet-packet tree design that follows the concept of critical bands, as well as in the particular wavelet basis function that has been used. In comparative experiments, we investigate several alternative speech parameterizations with respect to their usefulness for differentiating among human voices. The experimental results confirm that the proposed speech features outperform Mel-Frequency Cepstral Coefficients (MFCC) and previously used wavelet features on the task of speaker verification. A relative reduction of the equal error rate by 15%, 15% and 8% was observed for the proposed speech features, when compared to the wavelet packet features introduced by Farooq and Datta, the MFCC of Slaney, and the subband based cepstral coefficients of Sarikaya et al., respectively.

14.
陈晓, 曾昭优. Measurement & Control Technology (测控技术), 2024, 43(6): 21-25
To improve the accuracy of birdsong recognition with a low parameter count, a new birdsong recognition method is proposed, comprising feature optimization of the birdsong signal and classification by a support vector machine (SVM) tuned with the crow search algorithm. The method first applies principal component analysis to select among the Mel frequency cepstrum coefficients (MFCC) and inverted Mel frequency cepstrum coefficients extracted from the birdsong, and the optimized acoustic feature parameters are used as input to the recognition algorithm. The crow search algorithm is then used to tune the kernel parameter and penalty value of the SVM, and the improved SVM network is used for birdsong classification. Experimental results show that the method achieves 92.2% recognition accuracy over five bird species, with the best recognition obtained at a feature dimension of 16. The method provides a feasible approach to automatic birdsong recognition in the wild.
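The PCA-based feature selection step in item 14 projects the combined cepstral features onto their leading principal components. A numpy-only sketch of that reduction (the SVM and crow-search stages are not reproduced here):

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project feature vectors onto the top principal components.

    X: (n_samples, n_features). Centres the data, eigendecomposes the
    covariance matrix, and keeps the directions of largest variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # take the largest
    return Xc @ eigvecs[:, order]
```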

15.
In this paper, Type-2 Information Set (T2IS) features and Hanman Transform (HT) features are proposed as Higher Order Information Set (HOIS) based features for text-independent speaker recognition. The speech signals of different speakers, represented by Mel Frequency Cepstral Coefficients (MFCC), are converted into T2IS features and HT features by taking into account the cepstral and temporal possibilistic uncertainties. The features are classified by the Improved Hanman Classifier (IHC), Support Vector Machine (SVM), and k-Nearest Neighbours (kNN). The performance of the proposed approaches is tested in terms of speed, computational complexity, memory requirement, and accuracy on three datasets, namely NIST-2003, the VoxForge 2014 speech corpus, and the VCTK speech corpus, and compared with that of baseline features like MFCC, ΔMFCC, ΔΔMFCC, and GFCC under white Gaussian noise at different signal-to-noise ratios. The proposed features have reduced feature size, computational time, and complexity, and their performance does not degrade under the noisy environment.
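The ΔMFCC and ΔΔMFCC baselines above are temporal derivatives of the cepstra, conventionally computed with a regression over a few neighbouring frames; ΔΔ is simply Δ applied twice. A short sketch of the standard delta formula:

```python
import numpy as np

def delta(features, n=2):
    """First-order delta of cepstral features.

    features: (n_frames, n_coeffs). Uses the standard regression formula
    over +/- n frames, with edge frames replicated at the boundaries."""
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, n + 1))
    T = len(features)
    return sum(i * (padded[n + i : n + i + T] - padded[n - i : n - i + T])
               for i in range(1, n + 1)) / denom
```

Applying `delta` to the output of `delta` gives the ΔΔ (acceleration) features.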

16.
寇占奎, 徐江峰. Computer Engineering and Design (计算机工程与设计), 2012, 33(9): 3323-3326, 3341
A semi-fragile audio watermarking scheme in the DWT domain based on speech features and mean quantization is proposed. The scheme extracts Mel frequency cepstral coefficients (MFCC) from the host speech, applies turbo error-correction coding to them, and embeds the coded result as a watermark into the host speech by mean quantization. Experimental analysis shows that the scheme is robust to common audio operations while remaining highly sensitive to tampering; watermark extraction requires no additional watermark signal and achieves high accuracy.

17.
This paper explores the significance of stereo-based stochastic feature compensation (SFC) methods for robust speaker verification (SV) in mismatched training and test environments. Gaussian Mixture Model (GMM)-based SFC methods developed in the past have been restricted to speech recognition tasks. This paper proposes the application of these algorithms in an SV framework for background noise compensation. A priori knowledge about the test environment and the availability of stereo training data are assumed. During the training phase, Mel frequency cepstral coefficient (MFCC) features extracted from a speaker's noisy and clean speech utterances (stereo data) are used to build front-end GMMs. During the evaluation phase, noisy test utterances are transformed on the basis of a minimum mean squared error (MMSE) or maximum likelihood (MLE) estimate, using the target speaker GMMs. Experiments conducted on the NIST-2003-SRE database, with clean speech utterances artificially degraded with different types of additive noise, reveal that the proposed SV systems strictly outperform baseline SV systems in mismatched conditions across all noisy background environments.
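An MMSE-style stereo compensation of the kind item 17 describes can be sketched as a posterior-weighted lookup: posteriors are computed from a front-end GMM on the noisy features, and the clean-feature estimate is the corresponding mixture of paired clean means. Unit covariances and the array shapes here are simplifying assumptions for the sketch, not the paper's exact estimator.

```python
import numpy as np

def mmse_compensate(noisy, noisy_means, clean_means, weights):
    """MMSE-style stereo feature compensation sketch.

    noisy: (T, D) noisy features; noisy_means/clean_means: (K, D) paired
    means learned from stereo data; weights: (K,) mixture weights.
    Returns the posterior-weighted clean-feature estimates, (T, D)."""
    d2 = ((noisy[:, None, :] - noisy_means[None]) ** 2).sum(-1)
    logp = np.log(weights) - 0.5 * d2
    logp -= logp.max(axis=1, keepdims=True)
    post = np.exp(logp)
    post /= post.sum(axis=1, keepdims=True)
    return post @ clean_means
```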

18.
19.
20.
An Improved Algorithm for Computing Speech MFCC Features
An improved algorithm for computing Mel frequency cepstral coefficient (MFCC) features is proposed. The algorithm uses the warped discrete Fourier transform (WDFT) to increase the spectral resolution of the low-frequency part of the speech signal, better matching the characteristics of the human auditory system, and applies weighted filter bank analysis (WFBA) to improve the robustness of the MFCC. Phoneme recognition results on the DR1 set of the TIMIT continuous speech database show that the proposed algorithm achieves a better recognition rate than conventional MFCC.
