Similar Documents
1.
This paper presents a new hybrid method for Automatic Speaker Recognition (ASR) from speech signals, based on an Artificial Neural Network (ANN). Improving ASR performance remains a foremost challenge, and this work focuses on resolving ASR problems and improving the accuracy of speaker prediction. Mel Frequency Cepstral Coefficients (MFCC) are used for signal feature extraction. Input samples are built from these extracted features, and their dimensionality is reduced with a Self-Organizing Feature Map (SOFM). Finally, recognition is performed on the reduced samples using a Multilayer Perceptron (MLP) with Bayesian regularization. The network was trained and verified on real speech data from the Multivariability speaker recognition database for 10 speakers. The proposed method is validated by performance estimates and classification accuracies against other models, achieving a better recognition rate with 93.33% accuracy.
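A minimal sketch of the kind of pipeline this abstract describes, assuming librosa and scikit-learn. PCA stands in for the paper's SOFM dimensionality reduction, and scikit-learn's L2-penalized MLP stands in for Bayesian regularization (neither substitute is the paper's exact method); `train_paths` and `train_labels` are hypothetical inputs.

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

def utterance_vector(path, sr=16000, n_mfcc=20):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                # mean-pool over frames

def train_speaker_mlp(train_paths, train_labels, n_dims=10):
    X = np.stack([utterance_vector(p) for p in train_paths])
    reducer = PCA(n_components=n_dims).fit(X)               # SOFM stand-in
    clf = MLPClassifier(hidden_layer_sizes=(32,), alpha=1e-2,  # alpha = L2 penalty
                        max_iter=2000)
    clf.fit(reducer.transform(X), train_labels)
    return reducer, clf
```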

2.
Speaker recognition systems currently reach recognition rates above 90% under ideal conditions, but accuracy drops sharply over real communication channels. This paper studies robust speaker recognition under channel mismatch. A speaker recognition system based on Gaussian mixture models (GMM) is first built; then, after measuring and analyzing real communication channels, two improvements are proposed. First, a generic channel model is built from the measured data, and clean speech filtered through this model is used as the training speech for the speaker models. Second, by comparing the characteristics of the measured channels, an ideal low-pass channel, and the speech Mel-frequency cepstral coefficients (MFCC), the first and second feature dimensions are reasonably discarded. Experiments show that after this processing the system's recognition rate over communication channels improves by about 20%, and by 9%-12% compared with conventional cepstral mean subtraction (CMS).
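A small numpy sketch of the two front-end operations mentioned here, assuming MFCC matrices shaped (n_coefficients, n_frames); this is an illustration, not the paper's exact pipeline.

```python
import numpy as np

def drop_channel_sensitive_dims(mfcc):
    """Discard the first two cepstral dimensions, mirroring the paper's
    heuristic for the most channel-sensitive components."""
    return mfcc[2:, :]

def cms(mfcc):
    """Cepstral mean subtraction, the conventional baseline compared against;
    subtracting the per-coefficient mean removes a stationary convolutive channel."""
    return mfcc - mfcc.mean(axis=1, keepdims=True)
```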

3.
To address the susceptibility of speaker recognition to environmental noise, a new voiceprint feature extraction method is proposed, drawing on the spectro-temporal filtering mechanism of the spectro-temporal receptive fields (STRF) of neurons in the biological auditory cortex. In this method, secondary features are extracted from the STRF-derived auditory scale-rate maps and combined with conventional Mel-frequency cepstral coefficients (MFCC), yielding voiceprint features highly tolerant of environmental noise. With a support vector machine (SVM) as the classifier, tests on speech at different signal-to-noise ratios (SNR) show that the STRF-based features are generally more noise-robust than MFCC but give lower recognition accuracy, while the combined features improve recognition accuracy and retain good robustness to environmental noise. These results indicate the method is effective for speaker recognition in strongly noisy environments.
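A sketch of the feature-level fusion and SVM back-end described here, assuming scikit-learn; the STRF scale-rate features themselves are assumed precomputed elsewhere.

```python
import numpy as np
from sklearn.svm import SVC

def train_combined(strf_feats, mfcc_feats, labels):
    # strf_feats: (n_samples, d1), mfcc_feats: (n_samples, d2);
    # simple feature-level fusion by concatenation, then an RBF SVM
    X = np.hstack([strf_feats, mfcc_feats])
    return SVC(kernel='rbf').fit(X, labels)
```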

4.
In this paper, a novel Automatic Speaker Recognition (ASR) system is presented, with new feature extraction and vector classification steps based on distributed Discrete Cosine Transform (DCT-II) Mel Frequency Cepstral Coefficients (MFCC) and Fuzzy Vector Quantization (FVQ). The ASR algorithm uses an MFCC-based approach to identify the dynamic features used for Speaker Recognition (SR). A series of experiments was performed with three feature extraction methods: (1) conventional MFCC; (2) Delta-Delta MFCC (DDMFCC); and (3) DCT-II based DDMFCC. The experiments were then expanded to four classifiers: (1) FVQ; (2) K-means Vector Quantization (VQ); (3) Linde-Buzo-Gray VQ; and (4) Gaussian Mixture Model (GMM). The combination of DCT-II based MFCC, DMFCC, and DDMFCC with FVQ was found to have the lowest Equal Error Rate among the VQ-based classifiers. The results improve on previously reported non-GMM methods and approach those of the computationally expensive GMM-based method. Speaker verification tests highlighted the overall performance improvement of the new ASR system. The National Institute of Standards and Technology Speaker Recognition Evaluation corpora were used to provide speaker source data for the experiments.
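A sketch of the delta/delta-delta feature stack, assuming librosa (whose MFCC uses a DCT-II basis by default, in line with the transform discussed here); the FVQ classifier is not sketched.

```python
import numpy as np
import librosa

def mfcc_with_deltas(y, sr, n_mfcc=13):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # DCT-II based by default
    d1 = librosa.feature.delta(mfcc)                        # DMFCC: velocity
    d2 = librosa.feature.delta(mfcc, order=2)               # DDMFCC: acceleration
    return np.vstack([mfcc, d1, d2])                        # (3 * n_mfcc, n_frames)
```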

5.
杜晓青, 于凤芹. 《计算机工程》 (Computer Engineering), 2013, (11): 197-199, 204
The fusion of Mel-frequency cepstral coefficients (MFCC) with linear prediction cepstral coefficients (LPCC) reflects only the static characteristics of speech, and LPCC describes low-frequency local features of the signal poorly. To remedy this, Hilbert-Huang Transform (HHT) cepstral coefficients are fused with RASTA perceptual linear prediction cepstral coefficients (RASTA-PLPCC), yielding a speaker recognition algorithm that reflects both the speech production mechanism and human auditory perception. HHT cepstral coefficients embody the production mechanism, capture the dynamic characteristics of speech, and better describe low-frequency local features, remedying the shortcomings of LPCC; PLPCC embodies auditory perception and outperforms MFCC in recognition. Three fusion schemes were used to combine the two feature sets, and the fused features were fed to a Gaussian mixture model for speaker recognition. Simulation results show the fusion algorithm improves the recognition rate by 8.0% over the existing MFCC-LPCC fusion.
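A sketch of the GMM back-end over fused features, assuming scikit-learn; the HHT and RASTA-PLP front-ends are not reproduced, and `features_per_speaker` is a hypothetical dict of fused feature matrices.

```python
from sklearn.mixture import GaussianMixture

def train_gmms(features_per_speaker, n_components=16):
    # features_per_speaker: {speaker_id: (n_frames, n_dims) fused features}
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type='diag').fit(F)
            for spk, F in features_per_speaker.items()}

def identify(gmms, F):
    # pick the speaker model with the highest average frame log-likelihood
    return max(gmms, key=lambda spk: gmms[spk].score(F))
```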

6.
Speech coding compresses speech without perceptual loss, but it can eliminate or degrade the speech- and speaker-specific features used in a wide range of applications such as automatic speaker and speech recognition, biometric authentication, and prosody evaluation. The present work investigates the effect of speech coding on the quality of features extracted from codec-compressed speech, including Mel Frequency Cepstral Coefficients, Gammatone Frequency Cepstral Coefficients, Power-Normalized Cepstral Coefficients, Perceptual Linear Prediction Cepstral Coefficients, RASTA-Perceptual Linear Prediction Cepstral Coefficients, Residue Cepstrum Coefficients, and Linear Predictive Coding-derived cepstral coefficients. The codecs selected for this study are G.711, G.729, G.722.2, Enhanced Voice Services, Mixed Excitation Linear Prediction, and three codecs based on the compressive sensing framework. The analysis also covers how feature quality varies with the bit rates supported by Enhanced Voice Services, G.722.2, and the compressive sensing codecs, together with a quality analysis of the epochs, fundamental frequency, and formants estimated from codec-compressed speech. Among the selected codecs, the Mixed Excitation Linear Prediction codec introduces the least feature variation, owing to its unique representation of excitation. For the compressive sensing codecs, feature quality improves drastically as the bit rate increases, due to their waveform-type coding. For the popular Code Excited Linear Prediction codec, built on the Analysis-by-Synthesis paradigm, the impact of the Linear Predictive Coding order on feature extraction is investigated: feature quality improves with prediction order, with optimum performance for orders between 20 and 30, varying with gender and the statistical characteristics of the speech. Although the basic motive of a codec is to compress a single voice source, codec performance in the multi-speaker environment common to most speech processing applications is also studied. In a two-speaker environment, the quality of the individual speech signals improves as the diversity of the mixtures passed through the codecs increases. The perceptual quality of the individual speech signals extracted from the compressed speech is almost the same for the Mixed Excitation Linear Prediction and Enhanced Voice Services codecs, but for feature preservation the Mixed Excitation Linear Prediction codec shows superior performance.
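A sketch of LPC-derived cepstral coefficients via the standard LPC-to-cepstrum recursion, assuming librosa's Burg-method LPC; the codec experiments themselves are not reproduced, and the order/length values are illustrative.

```python
import numpy as np
import librosa

def lpcc(frame, order=20, n_ceps=20):
    A = librosa.lpc(frame, order=order)  # [1, a1, ..., ap], error-filter coeffs
    alpha = -A[1:]                       # predictor coefficients
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = alpha[n - 1] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * c[k] * alpha[n - k - 1]
        c[n] = acc                       # c_n = a_n + sum_{k<n} (k/n) c_k a_{n-k}
    return c[1:]
```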

7.
To address the problems of effective feature extraction and noise in speaker recognition systems, a new speech feature is proposed: S-transform-based Mel cepstral coefficients (SMFCC). Building on conventional Mel cepstral coefficients (MFCC), the method exploits the two-dimensional time-frequency multiresolution property of the S-transform and the effective denoising of two-dimensional time-frequency matrices by singular value decomposition (SVD), combined with statistical analysis, to obtain the final speech features. In comparative experiments on the TIMIT speech database, the equal error rate (EER) and minimum detection cost (MinDCF) of the SMFCC features are lower than those of linear prediction cepstral coefficients (LPCC), MFCC, and their combination LMFCC; relative to MFCC, EER and MinDCF drop by 3.6% and 17.9%, respectively. The results show the method effectively removes noise from the speech signal and improves local resolution.
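A numpy sketch of the SVD denoising step applied to a time-frequency magnitude matrix; the S-transform front-end is assumed computed elsewhere, and the retained rank is illustrative.

```python
import numpy as np

def svd_denoise(tf_matrix, rank=10):
    # keep only the largest singular components; the small ones mostly carry
    # broadband noise, which is the property SMFCC exploits
    U, s, Vt = np.linalg.svd(tf_matrix, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vt
```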

8.
Speaker Recognition from VoIP Compressed Bitstreams
This paper studies a speaker recognition algorithm for VoIP compressed bitstreams based on a micro-clustering algorithm. Recognition parameters are extracted directly from the bitstreams of G.729, G.723.1 (6.3 kb/s), and G.723.1 (5.3 kb/s) compressed speech, with the micro-clustering algorithm as the recognition back-end. Experiments show that, compared with a GMM using the same recognition parameters from the compressed bitstream, the micro-clustering algorithm greatly improves both recognition accuracy and efficiency.

9.
Gender (male/female) classification plays a vital role in developing robust Automatic Tamil Speech Recognition (ASR) applications, owing to the diversity of speakers' vocal tracts. Various features, including formants (F1, F2, F3, F4), zero crossings, and Mel-Frequency Cepstral Coefficients (MFCCs), have appeared in the literature for speech/signal classification and recognition. Dalal et al. proposed the Histogram of Oriented Gradients (HOG) feature for efficient detection and classification of objects in images. We extend and apply HOG to the spectrogram of the speech signal, yielding Spectral Histograms of Oriented Gradients (SHOGs). For Tamil male/female speaker classification, the SHOG features show a clear improvement in classification rate over the other features, and combinations of various features with SHOGs are also promising.
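A minimal reading of the SHOG idea, assuming librosa and scikit-image; the HOG parameter values here are illustrative, not the paper's.

```python
import numpy as np
import librosa
from skimage.feature import hog

def shog(y, sr):
    # treat the log-magnitude spectrogram as an image and extract
    # histograms of oriented gradients from it
    S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    return hog(S, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
```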

10.
The characterization of speech signals using non-linear dynamical features has lately been the focus of intense research. This work reports results obtained with time-dependent largest Lyapunov exponents (TDLEs) in a text-dependent speaker verification task. The baseline system used Gaussian mixture models (GMMs), obtained by adaptation of a universal background model (UBM), as the speaker voice models. Sixteen cepstral and 16 delta-cepstral features were used in the experiments, and it is shown how the addition of TDLEs can improve the system's accuracy. Cepstral mean subtraction was applied to all features for channel equalization, and silence frames were discarded. The corpus used, a subset of the Center for Spoken Language Understanding (CSLU) Speaker Recognition corpus, consisted of telephone speech from 91 different speakers.
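A sketch of the GMM-UBM mean adaptation underlying the baseline, assuming scikit-learn; the TDLE computation itself is not reproduced here, and the relevance factor is a typical illustrative value.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(ubm, X, relevance=16.0):
    # classical relevance-MAP adaptation of UBM means to one speaker's frames
    gamma = ubm.predict_proba(X)                     # (n_frames, n_components)
    n = gamma.sum(axis=0)                            # soft counts per component
    xbar = (gamma.T @ X) / np.maximum(n, 1e-8)[:, None]
    alpha = (n / (n + relevance))[:, None]           # data-dependent weight
    return alpha * xbar + (1.0 - alpha) * ubm.means_
```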

11.
Extraction of robust features from noisy speech signals is one of the challenging problems in speaker recognition. Because the bispectrum, and all higher-order spectra, of a Gaussian process are identically zero, bispectral analysis removes additive white Gaussian noise while preserving the magnitude and phase information of the original signal, so the spectrum of the original signal can be recovered from its noisy version. Robust Mel Frequency Cepstral Coefficients (MFCC) are extracted from the estimated spectral magnitude, denoted Bispectral MFCC (BMFCC). The effectiveness of BMFCC was tested on the TIMIT and SGGS databases in noisy environments. The proposed BMFCC features yield speaker recognition rates of 95.30%, 97.26%, and 94.22% on the TIMIT, SGGS, and SGGS2 databases, respectively, at 20 dB SNR, versus 45.84%, 50.79%, and 44.98% at 0 dB SNR. The experimental results show the superiority of the proposed technique over conventional methods on all databases.
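A direct averaged bispectrum estimate in numpy, illustrating the property the abstract relies on (the bispectrum of zero-mean Gaussian noise tends to zero under averaging); the frame layout and FFT size are illustrative.

```python
import numpy as np

def bispectrum(frames, nfft=128):
    # B(f1, f2) = E[X(f1) X(f2) conj(X(f1 + f2))], averaged over frames;
    # frames: (n_frames, frame_len) array of windowable signal segments
    X = np.fft.rfft(frames * np.hanning(frames.shape[1]), n=nfft, axis=1)
    nb = X.shape[1]
    B = np.zeros((nb, nb), dtype=complex)
    for f1 in range(nb):
        for f2 in range(nb - f1):                 # keep f1 + f2 in range
            B[f1, f2] = np.mean(X[:, f1] * X[:, f2] * np.conj(X[:, f1 + f2]))
    return B
```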

12.
Based on the NIST speaker recognition evaluation task, a fast speaker recognition framework is proposed. In this framework, low-level acoustic feature vectors are first converted into high-dimensional, high-level feature vectors (supervectors) through Gaussian mixture modeling and a nonlinear mapping; a discriminative support vector machine classifier then classifies the supervectors; and the final decision is made according to hypothesis-testing theory. Compared with traditional speaker recognition systems, classifying supervectors is more robust than classifying the acoustic features directly. A fast speaker recognition system was implemented and tested on a telephone-speech dataset with good results.
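A sketch of the supervector back-end, assuming scikit-learn and MAP-adapted GMM component means (for example from a routine like the `map_adapt_means` sketch above); the linear SVM is one common discriminative choice, not necessarily the paper's exact classifier.

```python
import numpy as np
from sklearn.svm import LinearSVC

def to_supervector(adapted_means):
    # stack the adapted GMM component means (n_components, n_dims) into
    # one high-dimensional vector per utterance
    return np.ravel(adapted_means)

def train_backend(supervectors, labels):
    # discriminative classification of supervectors, as the framework describes
    return LinearSVC(C=1.0).fit(supervectors, labels)
```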

13.
Robustness is one of the most important topics for automatic speech recognition (ASR) in practical applications. Monaural speech separation based on computational auditory scene analysis (CASA) offers a solution to this problem. In this paper, a novel system is presented to separate the monaural speech of two talkers. Gaussian mixture models (GMMs) and vector quantizers (VQs) are used to learn the grouping cues on isolated clean data for each speaker. Given an utterance, speaker identification is first performed to identify the two speakers present in the utterance, then the factorial-max vector quantization model (MAXVQ) is used to infer the mask signals, and finally the utterance of the target speaker is resynthesized in the CASA framework. Recognition results on the 2006 speech separation challenge corpus show that the proposed system significantly improves the robustness of ASR.
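A simplified mask-and-resynthesize step, assuming librosa; a plain binary mask stands in for MAXVQ inference, and the estimated magnitude matrices are assumed to match the mixture STFT's shape.

```python
import numpy as np
import librosa

def mask_and_resynthesize(y_mix, target_mag, interferer_mag, n_fft=1024):
    Y = librosa.stft(y_mix, n_fft=n_fft)
    # keep time-frequency cells where the target model dominates
    mask = (target_mag > interferer_mag).astype(float)
    return librosa.istft(Y * mask, length=len(y_mix))
```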

14.
This paper examines the performance of a Distributed Speech Recognition (DSR) system in the presence of both background noise and packet loss. Recognition performance is examined for feature vectors extracted from speech using a physiologically based auditory model, as an alternative to the more commonly used Mel Frequency Cepstral Coefficient (MFCC) front-end. The feature vectors produced by the auditory model are vector quantised and combined in pairs for transmission over a statistically modelled channel subject to packet burst loss. To improve recognition performance in the presence of noise, the speech is enhanced prior to feature extraction using Wiener filtering. Packet loss mitigation is also used to compensate for missing features and further improve performance. Speech recognition results show the benefit of combining speech enhancement and packet loss mitigation to compensate for channel and environmental degradations.
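A minimal Wiener-filter enhancement sketch, assuming librosa and that the leading frames are speech-free noise; the FFT size and frame count are illustrative.

```python
import numpy as np
import librosa

def wiener_enhance(y, n_fft=512, noise_frames=10):
    Y = librosa.stft(y, n_fft=n_fft)
    P = np.abs(Y) ** 2
    Pn = P[:, :noise_frames].mean(axis=1, keepdims=True)  # noise PSD estimate
    H = np.maximum(P - Pn, 0.0) / np.maximum(P, 1e-10)    # Wiener gain per bin
    return librosa.istft(H * Y, length=len(y))
```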

15.
An Improved Mel Filter for Speaker Recognition
项要杰, 杨俊安, 李晋徽, 陆俊. 《计算机工程》 (Computer Engineering), 2013, (11): 214-217, 222
Mel cepstral coefficients (MFCC) emphasize the low-frequency information of the speech signal, describe the spectral distribution insufficiently, and cannot effectively distinguish speaker-specific information. By analyzing how speaker-specific information differs across frequency bands, and combining the complementary behavior of the Mel filter bank and the inverted Mel filter bank at low and high frequencies, an improved Mel filter suited to speaker recognition is proposed. Experiments show that the new features extracted with the improved Mel filter achieve better recognition than conventional Mel cepstral coefficients and inverted Mel cepstral coefficients (IMFCC), with essentially no increase in the training and recognition time of the speaker recognition system.
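One common construction of an inverted-Mel filter bank, assuming librosa; whether this matches the paper's exact improved filter is an assumption.

```python
import librosa

def inverted_mel_fb(sr=16000, n_fft=512, n_mels=26):
    # flipping the Mel bank along both axes yields filters dense at high
    # frequencies and sparse at low ones, the usual IMFCC-style bank
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_fb[::-1, ::-1]
```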

16.
Speaker verification techniques neglect short-time variation in the feature space even though it carries speaker-related attributes. We propose a simple method to capture and characterize this spectral variation through the eigenstructure of the sample covariance matrix, computed over a sliding window of spectral features. The newly formulated feature vectors representing local spectral variations are used with classical and state-of-the-art speaker recognition systems. Results on multiple speaker recognition evaluation corpora reveal that eigenvectors weighted with their normalized singular values usefully represent local covariance information. We also show that local variability features can be extracted from mel frequency cepstral coefficients (MFCCs) as well as from three recently developed features: frequency domain linear prediction (FDLP), mean Hilbert envelope coefficients (MHECs), and power-normalized cepstral coefficients (PNCCs). Since the information conveyed by the proposed feature is complementary to standard short-term features, we apply different fusion techniques and observe considerable relative improvements in speaker verification accuracy in combined mode on text-independent (NIST SRE) and text-dependent (RSR2015) speech corpora: up to 12.28% relative improvement in speaker recognition accuracy on text-independent corpora, and up to 40% relative reduction in EER on text-dependent corpora. In sum, combining local covariance information with traditional cepstral features holds promise as an additional speaker cue in both text-independent and text-dependent recognition.
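A numpy sketch of the local covariance eigenstructure feature; the window and hop sizes are illustrative.

```python
import numpy as np

def local_covariance_features(feats, win=50, hop=25):
    # feats: (n_coeff, n_frames), e.g. MFCC/FDLP/MHEC/PNCC frames
    out = []
    for start in range(0, feats.shape[1] - win + 1, hop):
        C = np.cov(feats[:, start:start + win])     # sample covariance
        w, V = np.linalg.eigh(C)                    # eigen-decomposition
        out.append((V * (w / w.sum())).ravel())     # weight by normalized values
    return np.array(out)
```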

17.
Replayed speech attacks on speaker recognition systems threaten the rights of legitimate users. To counter them, a replayed-speech detection algorithm based on a convolutional neural network (CNN) is proposed. First, spectrograms of original and replayed speech are extracted and fed into the CNN for feature extraction and classification. Then, a network architecture suited to detecting replayed speech is built, and the effect of spectrograms computed with different window shifts on the detection rate is analyzed. Finally, cross tests were run on speech captured and replayed with different recording and playback devices, and the method was compared with existing classical algorithms. Experimental results show that the method accurately judges whether test speech has been replayed, reaching a recognition rate of 99.26%, about 26, 21, and 0.35 percentage points higher than the silent-segment Mel-frequency cepstral coefficient (MFCC) algorithm, the channel-pattern-noise algorithm, and the long-window scale-factor algorithm, respectively.
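A minimal PyTorch stand-in for the kind of spectrogram CNN described; the architecture and the 128x128 patch size are assumptions, not the paper's network.

```python
import torch
import torch.nn as nn

class ReplayCNN(nn.Module):
    """Binary genuine-vs-replayed classifier over spectrogram patches."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(32 * 32 * 32, 2),   # sized for (1, 128, 128) inputs
        )

    def forward(self, x):                 # x: (batch, 1, 128, 128)
        return self.net(x)
```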

18.
Investigating speaker verification in real-world noisy environments, a novel feature extraction process suited to suppressing time-varying noise is compared with a fine-tuned spectral subtraction method. The proposed feature extraction process approximates the clean speech and noise spectral magnitudes with a mixture of Gaussian probability density functions (pdfs) using the Expectation-Maximization (EM) algorithm. The Bayesian inference framework is then applied to the degraded spectral coefficients, and Minimum Mean Square Error (MMSE) estimation yields a closed-form solution for the spectral magnitude estimation task. The estimated spectral magnitude is finally incorporated into the Mel-Frequency Cepstral Coefficient (MFCC) front-end of a baseline text-independent speaker verification system, based on Probabilistic Neural Networks, which participated successfully in the 2002 NIST (National Institute of Standards and Technology of USA) Speaker Recognition Evaluation. A comparative study of the proposed technique on real-world noise types demonstrates a significant performance gain over both the baseline speech features and the spectral subtraction enhancement method: in a passing-by-aircraft scenario, absolute speaker verification performance improved by more than 27% at 0 dB signal-to-noise ratio (SNR) compared with the MFCCs, and by more than 13% at -5 dB SNR compared with the spectral subtraction version.
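A sketch of the spectral subtraction baseline the paper compares against, assuming librosa and leading noise-only frames; the floor factor `beta` is an illustrative value.

```python
import numpy as np
import librosa

def spectral_subtract(y, n_fft=512, noise_frames=10, beta=0.02):
    Y = librosa.stft(y, n_fft=n_fft)
    mag, phase = np.abs(Y), np.angle(Y)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    clean = np.maximum(mag - noise, beta * noise)   # floor against musical noise
    return librosa.istft(clean * np.exp(1j * phase), length=len(y))
```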

19.
Conventional Hidden Markov Model (HMM) based Automatic Speech Recognition (ASR) systems generally use cepstral features as acoustic observations and phonemes as basic linguistic units; Mel-Frequency Cepstral Coefficients (MFCCs) are among the most powerful features in current use. Speech recognition is inherently complicated by the variability of the speech signal, both within and across speakers, which leads to several kinds of mismatch between acoustic features and acoustic models and degrades system performance. The sensitivity of MFCCs to this variability motivates the investigation of new speech feature parameters that make the acoustic models more robust, and the combination of diverse acoustic feature sets has great potential to enhance ASR performance. This paper, part of ongoing research toward an accurate Arabic ASR system for teaching and learning purposes, addresses the integration of complementary features into standard HMMs to make them more robust and improve their recognition accuracy. The complementary features investigated are voiced formants and pitch, in combination with conventional MFCC features. A series of experiments under various combination strategies was performed to determine which integrated features significantly improve system performance. The Cambridge HTK tools were used as the development environment, and experimental results showed that the error rate was successfully decreased; the achieved results seem very promising, even without using language models.
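A sketch of appending a pitch track to MFCC frames, assuming librosa; the hop size and pitch range are illustrative, and the frame counts are trimmed to stay aligned.

```python
import numpy as np
import librosa

def mfcc_plus_pitch(y, sr, n_mfcc=13, hop=512):
    # shared hop length keeps the MFCC and pitch frames roughly in step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)
    n = min(mfcc.shape[1], len(f0))                # guard against off-by-one
    return np.vstack([mfcc[:, :n], f0[None, :n]])  # (n_mfcc + 1, n) features
```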
