Similar Documents
20 similar documents found (search time: 102 ms).
1.

In voice recognition, the two main problems are speech recognition (what was said) and speaker recognition (who was speaking). The usual method for speaker recognition is to postulate a model in which the speaker identity corresponds to the model parameters, whose estimation can be time-consuming when the number of candidate speakers is large. In this paper, we model the speaker as a high-dimensional point cloud of entropy-based features extracted from the speech signal. The method allows indexing, and hence it can manage large databases. We experimentally assessed the quality of identification on a publicly available database formed by extracting audio from a collection of YouTube videos of 1,000 different speakers. With 20-second audio excerpts, we were able to identify a speaker with 97% accuracy when the recording environment is not controlled, and with 99% accuracy for controlled recording environments.
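As a minimal sketch of how entropy-based features can be extracted from a speech signal (the paper's actual high-dimensional point cloud is more elaborate; the frame length and histogram bin count below are illustrative assumptions):

```python
import numpy as np

def frame_entropies(signal, frame_len=512, bins=32):
    """Shannon entropy of the amplitude histogram of each frame.

    A toy stand-in for entropy-based feature extraction: one scalar per
    frame; a real system would stack many such entropies per speaker.
    """
    n_frames = len(signal) // frame_len
    feats = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        hist, _ = np.histogram(frame, bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]                      # 0 * log(0) is taken as 0
        feats.append(-np.sum(p * np.log2(p)))
    return np.array(feats)
```

A noise-like frame spreads its amplitudes over many histogram bins and thus has high entropy, while a flat frame concentrates in one bin and scores zero.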


2.
Speaker localization is a technique to locate and track an active speaker among multiple acoustic sources using a microphone array. Microphone arrays are used to improve the quality of speech recorded in meeting rooms and other venues. In this work, the time delay between the source and each microphone is estimated using a localization method based on time differences of arrival (TDOA). TDOA localization consists of two steps, namely (a) a time delay estimator and (b) a localization estimator. For time delay estimation, the generalized cross-correlation with phase transform, the generalized cross-correlation with maximum likelihood, the linear prediction (LP) residual and the Hilbert envelope of the LP residual are chosen for estimating the location of a person. A new speaker localization algorithm based on the group search optimization (GSO) algorithm is proposed. Its performance is analyzed and compared with the Gauss–Newton nonlinear least squares method and a genetic algorithm. Experimental results show that the proposed GSO method outperforms the other methods in terms of mean square error, root mean square error, mean absolute error, mean absolute percentage error, Euclidean distance and mean absolute relative error.
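The generalized cross-correlation with phase transform (GCC-PHAT) mentioned above can be sketched in a few lines of numpy: whiten the cross-spectrum so only phase remains, then pick the lag of the correlation peak. The zero-padding length and the tiny regularization constant are implementation choices, not taken from the paper:

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay (in seconds) of y relative to x via GCC-PHAT."""
    n = len(x) + len(y)                      # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = Y * np.conj(X)
    R /= np.abs(R) + 1e-12                   # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```

With two microphones, the estimated delay directly gives the TDOA that the localization estimator then converts to a position.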

3.
In this study, an expert system is presented for speaker identification using Turkish speech signals. A discrete wavelet adaptive network-based fuzzy inference system (DWANFIS) model is used for this purpose. The model consists of two layers: a discrete wavelet layer and an adaptive network-based fuzzy inference system. The discrete wavelet layer performs adaptive feature extraction in the time–frequency domain and is composed of discrete wavelet decomposition and discrete wavelet entropy. The performance of the system is evaluated using repeated speech signals. The test results show the effectiveness of the developed intelligent system; the rate of correct classification is about 90.55% for the sample speakers.
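The discrete-wavelet-entropy idea can be illustrated with a hand-rolled Haar decomposition: split the signal into subbands, then take the Shannon entropy of the relative subband energies. The Haar basis and the three-level depth are assumptions for the sketch; the paper's wavelet family and depth are not specified here:

```python
import numpy as np

def haar_dwt(x, levels=3):
    """Multi-level Haar DWT (length must be divisible by 2**levels).

    Returns the detail subbands plus the final approximation.
    """
    subbands = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = (a[::2] + a[1::2]) / np.sqrt(2), (a[::2] - a[1::2]) / np.sqrt(2)
        subbands.append(d)
    subbands.append(a)
    return subbands

def wavelet_entropy(x, levels=3):
    """Shannon entropy of the relative subband energies (one scalar feature)."""
    energies = np.array([np.sum(b ** 2) for b in haar_dwt(x, levels)])
    p = energies / energies.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))
```

Because the Haar transform is orthonormal, the subband energies sum exactly to the signal energy, so the relative energies form a valid probability distribution.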

4.
In text-independent speaker identification, to improve system robustness under telephone-speech conditions, score normalization — a technique commonly used in speaker verification — is applied to speaker identification: the scores of the test utterance against the different speaker models are each normalized separately, and the closest speaker model is selected as the system output, which effectively improves system performance. Speaker identification experiments on the NIST'03 1spk database demonstrate the effectiveness of score normalization for speaker identification.  相似文献
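The abstract does not say which normalization variant is used; as a hypothetical sketch, a T-norm-style rule (scale each model's raw score by the statistics of that model's cohort scores before taking the argmax) looks like this, with illustrative numbers:

```python
import numpy as np

def tnorm(raw_score, cohort_scores):
    """Test-normalization: center and scale a trial score by cohort statistics."""
    mu = np.mean(cohort_scores)
    sigma = np.std(cohort_scores) + 1e-12
    return (raw_score - mu) / sigma

def identify(scores_per_model, cohorts):
    """Pick the model whose *normalized* score is highest."""
    normed = [tnorm(s, cohorts[i]) for i, s in enumerate(scores_per_model)]
    return int(np.argmax(normed))
```

The point of normalizing per model is that a raw score which merely matches a model's typical impostor level no longer wins over a score that stands far above its own cohort.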

5.
We present a multimodal open-set speaker identification system that integrates information coming from the audio, face and lip-motion modalities. For fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure that genuinely fits the open-set speaker identification problem is also proposed to assess the accept/reject decisions of a classifier. A formal framework based on the probability of correct decision is developed for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in assessing classifier decisions. Experimental results that support this assertion are provided.

6.
7.
A multi-point conference is an efficient and cost-effective substitute for a face-to-face meeting. It involves three or more participants in separate locations, each with a single microphone and camera. The routing and processing of the audiovisual information is very demanding on the network, which raises the need to reduce the amount of information flowing through the system. One solution is to identify the dominant speaker and partially discard information originating from non-active participants. We propose a novel method for dominant speaker identification using speech activity information from time intervals of different lengths. The proposed method processes the audio signal of each participant independently and computes speech activity scores for immediate, medium and long time intervals. These scores are compared and the dominant speaker is identified. In comparison to other speaker selection methods, experimental results demonstrate a reduction in the number of false speaker switches and improved robustness to transient audio interferences.
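A minimal sketch of the multi-window idea: score each channel's trailing speech activity over an immediate, a medium and a long window, then pick the channel with the highest combined score. The window lengths, weights and the simple weighted sum below are illustrative assumptions, not the paper's actual combination rule:

```python
import numpy as np

def dominant_speaker(activity, windows=(4, 16, 64), weights=(1.0, 1.0, 1.0)):
    """Pick the dominant channel from per-frame speech-activity indicators.

    activity: (n_channels, n_frames) array of per-frame VAD scores in [0, 1].
    Each channel's score combines mean activity over the trailing
    immediate / medium / long windows.
    """
    scores = []
    for ch in activity:
        s = sum(w * ch[-win:].mean() for w, win in zip(weights, windows))
        scores.append(s)
    return int(np.argmax(scores))
```

Because the long window dominates a short transient burst, a channel that was only briefly active does not trigger a false speaker switch.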

8.
Hierarchical speaker identification in noisy environments
Wavelet transform is combined with Wiener filtering to denoise the speech. To improve robustness and the identification rate, hierarchical speaker identification is adopted, and the likelihood of the GMM classifier is weighted by the Gaussian probability density of the pitch period; the resulting likelihood is used for speaker identification. Experimental results show that both the robustness and the identification rate of the proposed system are improved.  相似文献

9.
This paper presents a new feature extraction technique for speaker recognition using the Radon transform (RT) and the discrete cosine transform (DCT). The spectrogram is compact, efficient in representation and carries information about acoustic features in the form of patterns. In the proposed method, speaker-specific features are extracted by applying image processing techniques to the patterns in the spectrogram. The Radon transform is used to derive effective acoustic features from the speech spectrogram: it adds up the pixel values of an image along a straight line in a particular direction at a specific displacement. The proposed technique computes Radon projections for seven orientations to capture the acoustic characteristics of the spectrogram, and applying the DCT to the Radon projections yields a low-dimensional feature vector. The technique is computationally efficient, text-independent, robust to session variations and insensitive to additive noise. Its performance has been evaluated on the Texas Instruments and Massachusetts Institute of Technology (TIMIT) database and our own Shri Guru Gobind Singhji (SGGS) database. The recognition rate is 96.69% on the TIMIT database (630 speakers) and 98.41% on the SGGS database (151 speakers). These results highlight the superiority of the proposed method over some existing algorithms.
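A rough sketch of the Radon-plus-DCT pipeline: project the spectrogram image onto a line for each orientation (summing pixel values along perpendicular lines), then keep the first few DCT-II coefficients of each projection. The nearest-bin accumulation and the 8-coefficient truncation are simplifications assumed for illustration:

```python
import numpy as np

def radon_projection(img, theta):
    """Sum pixel values along lines perpendicular to direction theta (radians)."""
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    # signed displacement of each pixel centre along the projection axis
    s = (xx - w / 2) * np.cos(theta) + (yy - h / 2) * np.sin(theta)
    bins = np.round(s - s.min()).astype(int)
    proj = np.zeros(bins.max() + 1)
    np.add.at(proj, bins.ravel(), img.ravel())
    return proj

def dct_coeffs(v, k):
    """First k DCT-II coefficients of a vector (the low-dimensional feature)."""
    n = len(v)
    basis = np.cos(np.pi * np.arange(k)[:, None] * (np.arange(n) + 0.5)[None, :] / n)
    return basis @ v
```

Concatenating the truncated DCTs of the seven projections yields one compact feature vector per spectrogram.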

10.
In classification tasks, the error rate is proportional to the commonality among classes. In the conventional GMM-based modeling technique, since the model parameters of a class are estimated without considering the other classes in the system, features that are common across various classes may be captured along with the unique ones. This paper proposes to use the unique characteristics of a class at the feature level and at the phoneme level, separately, to improve classification accuracy. At the feature level, the performance of a classifier is analyzed by capturing the unique features during modeling and removing common feature vectors during classification. Experiments were conducted on a speaker identification task, using speech data of 40 female speakers from the NTIMIT corpus, and on a language identification task, using speech data of two languages (English and French) from the OGI_MLTS corpus. At the phoneme level, the performance of a classifier is analyzed on a speaker identification task by identifying a subset of phonemes that are unique to a speaker with respect to his/her acoustically most similar speaker. In both cases (feature level and phoneme level), considerable improvement in classification accuracy is observed over conventional GMM-based classifiers on the above-mentioned tasks. Among the three experimental setups, the speaker identification task using unique phonemes shows up to a 9.56% performance improvement over the conventional GMM-based classifier.

11.
To address the tendency of the PSO algorithm to become trapped in local optima, an improved PSO algorithm (IPSO) is proposed. Based on the evolution speed of each particle, the algorithm adaptively perturbs the particle's personal best so that the particle escapes local optima in time and continues optimizing, thereby enlarging the search range. The improved PSO converges faster and strikes a good balance between the algorithm's global and local search abilities. A method for training support vector machines with IPSO is also given and applied to speaker identification. The improved PSO allows the SVM to reach an optimal classification surface with fewer support vectors, reducing SVM training cost and speeding up speaker identification.  相似文献
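An IPSO-style sketch on a toy objective: standard PSO velocity/position updates, plus a jitter applied to a particle's personal best when the particle stalls. The stall trigger (velocity norm threshold) and the jitter magnitude are illustrative stand-ins for the paper's evolution-speed-based adaptive rule:

```python
import numpy as np

def ipso(f, dim=2, n_particles=20, iters=200, seed=0):
    """PSO with perturbation of stalled personal bests (an IPSO-style sketch)."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5.0, 5.0, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([f(p) for p in pos])
    g_idx = int(np.argmin(pbest_val))
    gbest, gval = pbest[g_idx].copy(), pbest_val[g_idx]
    w, c1, c2 = 0.72, 1.49, 1.49              # common PSO weight settings
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        for i in range(n_particles):
            v = f(pos[i])
            if v < pbest_val[i]:
                pbest_val[i], pbest[i] = v, pos[i].copy()
            elif np.linalg.norm(vel[i]) < 1e-3:
                # stalled particle: jitter its personal best so it keeps searching
                pbest[i] = pbest[i] + rng.normal(0.0, 0.1, dim)
                pbest_val[i] = f(pbest[i])
        i_best = int(np.argmin(pbest_val))
        if pbest_val[i_best] < gval:          # global best only ever improves
            gbest, gval = pbest[i_best].copy(), pbest_val[i_best]
    return gbest, gval
```

In the paper the fitness function would be SVM cross-validation accuracy rather than the toy sphere function used to test the optimizer here.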

12.
In this paper, an intelligent system is presented for speaker identification using speech signals. The study combines adaptive feature extraction with classification using optimum wavelet entropy parameter values, which are obtained from Turkish speech waveforms measured with a speech experimental set. A genetic wavelet adaptive network-based fuzzy inference system (GWANFIS) model is developed, consisting of three layers: a genetic algorithm, a wavelet layer and an adaptive network-based fuzzy inference system (ANFIS). The genetic algorithm layer selects the feature extraction method and obtains the optimum wavelet entropy parameter values. One of eight feature extraction methods is selected by the genetic algorithm: wavelet decomposition alone, or wavelet decomposition combined with the short-time Fourier transform or with the Born–Jordan, Choi–Williams, Margenau–Hill, Wigner–Ville, Page or Zhao–Atlas–Marks time–frequency representation. The wavelet layer performs optimum feature extraction in the time–frequency domain and is composed of wavelet decomposition and wavelet entropies. The ANFIS layer evaluates the fitness function of the genetic algorithm and classifies speakers. The performance of the developed system has been evaluated using noisy Turkish speech signals. The test results show that the system is effective on real speech signals; the correct classification rate is about 91% for speaker classification.

13.
A new feature extraction method is proposed based on the linear prediction (LP) residual signal; the feature is closely related to an individual speaker's vocal tract. A new feature (HOCOR) is obtained by applying the Haar wavelet transform to the LP residual. To further improve robustness and the identification rate, hierarchical speaker identification is adopted, and the likelihood of the GMM classifier is weighted by the Gaussian probability density of the pitch period; the resulting likelihood is used for speaker identification. Experimental results show that both the robustness and the identification rate of the proposed system are improved.  相似文献

14.
In this work, we combine the decisions of two classifiers as an alternative means of improving the performance of a speaker recognition system in adverse environments. The difference between these classifiers lies in their feature sets. One system is based on the popular mel-frequency cepstral coefficients (MFCC) and the other on the new parametric feature sets (PFS) algorithm. Both feature vectors use mel-scale spectral warping and are computed in the cepstral domain, but the feature sets differ in their use of spectral filters and compressions. The two classifiers differ little in recognition rate, but they are complementary. This shows that there is information not captured by the MFCC, and that the PFS adds further information for improved performance. Several ways of combining these classifiers give significant improvements in a speaker identification task on the very large, telephone-degraded NTIMIT database.

15.
A novel approach for joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human-computer interface are achieved by the interaction of speaker identification, speech recognition and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval, a compact modeling of speech and speaker characteristics is presented. Applying speaker-specific profiles allows speech recognition to take individual speech characteristics into consideration and achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech-controlled system, and only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation, resulting in continuously improved speech recognition rates. Additionally, the detection of unknown speakers is examined with the objective of avoiding the need to train new speaker profiles explicitly. The speech-controlled system presented here is suitable for in-car applications on embedded devices, e.g. speech-controlled navigation, hands-free telephony or infotainment systems. Results are presented for a subset of the SPEECON database and validate the benefit of the speaker adaptation scheme and the unified modeling in terms of speaker identification and speech recognition rates.

16.
Text-independent speaker identification using a genetic algorithm
In the vector quantization (VQ) approach to speaker identification, K-means codebook design easily becomes trapped in local optima, and the choice of the initial codebook strongly affects the final codebook. To address this, a genetic algorithm (GA) is combined with nonparametric-model-based VQ, yielding a GA-K algorithm for VQ codebook design. The algorithm uses the global optimization ability of the GA to obtain an optimal VQ codebook, avoiding the tendency of the LBG algorithm to converge to local optima, and exploits the fast convergence of K-means, through the GA's own parameters, to search the training vector space for a globally optimal codebook. Experimental results show that the GA-K algorithm outperforms the LBG algorithm and balances convergence and identification rate well.  相似文献
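The local refinement each GA chromosome would undergo is ordinary K-means codebook training; a minimal sketch (the GA outer loop that seeds and evolves whole codebooks is omitted here, and the random initialization below is exactly the sensitivity the GA is meant to remove):

```python
import numpy as np

def kmeans_codebook(data, k, iters=20, seed=0):
    """K-means refinement of a VQ codebook (the 'K' step of GA-K).

    Returns the codebook and the per-iteration mean squared distortion.
    """
    rng = np.random.default_rng(seed)
    codebook = data[rng.choice(len(data), k, replace=False)]
    distortions = []
    for _ in range(iters):
        d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        distortions.append(np.mean(d[np.arange(len(data)), assign] ** 2))
        for j in range(k):                    # move each codeword to its cell mean
            pts = data[assign == j]
            if len(pts):
                codebook[j] = pts.mean(axis=0)
    return codebook, distortions
```

K-means distortion is guaranteed non-increasing, but only toward the nearest local optimum; the GA wrapper trades extra evaluations for a better basin.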

17.
Feature extraction is an essential and important step for speaker recognition systems. In this paper, we propose to improve these systems by exploiting both conventional features, such as mel-frequency cepstral coding (MFCC) and linear predictive cepstral coding (LPCC), and non-conventional ones. The method exploits information present in the linear predictive (LP) residual signal; the features extracted from the LP residual are then combined with the MFCC or the LPCC. We investigate two approaches, termed the temporal and frequential representations. The first consists of an auto-regressive (AR) modelling of the signal followed by a cepstral transformation, in a similar way to the LPC-LPCC transformation. In order to take into account the non-linear nature of speech signals, we used two estimation methods based on second- and third-order statistics, termed respectively R-SOS-LPCC (residual plus second-order-statistics-based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (higher-order statistics). Concerning the frequential approach, we exploit a filter bank method called the power difference of spectra in sub-bands (PDSS), which measures the spectral flatness over the sub-bands; the resulting features are named R-PDSS. The analysis of these proposed schemes is carried out on a speaker identification problem with two different databases. The first is the Gaudi database, containing 49 speakers; its main interest lies in the controlled acquisition conditions: mismatch between the microphones and the interval between sessions. The second database is the well-known NTIMIT corpus with 630 speakers, over which the performance of the features is confirmed. In addition, we propose to compare traditional features and residual ones by the fusion of recognizers (feature extractor + classifier).
The results show that residual features carry speaker-dependent information, and their combination with the LPCC or the MFCC yields global improvements in robustness under different mismatches. A comparison of the residual features under the opinion fusion framework gives useful information about the potential of both the temporal and frequential representations.  相似文献
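The starting point of the temporal (R-SOS-LPCC) route is a second-order-statistics AR fit followed by residual extraction. A minimal numpy sketch of LP analysis via the autocorrelation (Yule-Walker) method — real systems work frame-by-frame on windowed speech, and the model order below is illustrative:

```python
import numpy as np

def lp_residual(x, order=10):
    """LP analysis via the autocorrelation (Yule-Walker) method.

    Returns the predictor coefficients a and the residual e[t] = x[t] - x_hat[t],
    where x_hat[t] = sum_k a[k-1] * x[t-k] for k = 1..order.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    r = np.correlate(x, x, mode='full')[n - 1:n + order]      # r[0..order]
    R = r[np.abs(np.subtract.outer(np.arange(order), np.arange(order)))]
    a = np.linalg.solve(R, r[1:order + 1])                    # Toeplitz system
    x_hat = np.concatenate(([0.0], np.convolve(x, a)[:n - 1]))
    return a, x - x_hat
```

For a genuinely autoregressive signal the residual recovers the excitation, which is why the residual carries the source (speaker-dependent) information the vocal-tract cepstra miss.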

18.
In traditional speaker recognition, phase information is often ignored on the assumption that the human ear is insensitive to phase. To examine the effect of phase information on speaker identification, a method for extracting phase feature parameters is proposed. Under both clean-speech and noisy-speech conditions, using Gaussian mixture models, the phase feature parameters are combined with cochlear filter cepstral coefficients (CFCC) to study the effect of phase information on speaker identification performance. Experimental results show that phase information also plays an important role in speaker recognition; applying it to a speaker identification system clearly improves the system's identification rate and robustness.  相似文献
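The abstract does not detail the phase parameterization; as a hypothetical sketch, one simple choice is the unwrapped low-frequency FFT phase of each frame, taken relative to the first bin so the features are less sensitive to frame alignment. Frame length, hop and bin count below are illustrative assumptions:

```python
import numpy as np

def phase_features(signal, frame_len=256, hop=128, n_bins=20):
    """Per-frame relative phase of the first n_bins FFT bins (a toy sketch)."""
    frames = []
    win = np.hanning(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        spec = np.fft.rfft(signal[start:start + frame_len] * win)
        phase = np.angle(spec[1:n_bins + 1]) - np.angle(spec[1])  # relative phase
        frames.append(np.unwrap(phase))
    return np.array(frames)
```

These frame-level vectors would then be modeled by GMMs alongside the CFCC stream, with the two streams' scores combined.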

19.
Multi-level speaker identification based on PCA and a multiply-reduced SVM
A multi-level speaker identification method based on principal component analysis (PCA) and a multiply-reduced support vector machine (SVM) is proposed. PCA first makes a fast coarse decision over the enrolled speakers, and the multiply-reduced SVM then makes the final decision. The SVM involves two reduction steps: PCA and a sample selection algorithm reduce, respectively, the dimensionality and the number of the training data. Theoretical analysis and experimental results show that the method greatly reduces the system's storage and computation, shortens training and identification time, and offers good robustness.  相似文献
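A sketch of the coarse PCA screening stage: project enrolled speaker templates and the test vector into a low-dimensional principal subspace and keep only the nearest few candidates for the (omitted) SVM stage. The use of per-speaker mean vectors and plain Euclidean ranking is an illustrative assumption:

```python
import numpy as np

def pca_basis(X, n_comp):
    """Mean and top principal components of row-vector data X (n_samples x dim)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    return mu, vecs[:, ::-1][:, :n_comp]      # keep the n_comp largest

def coarse_candidates(test_vec, speaker_means, mu, basis, top=3):
    """Stage 1: rank enrolled speakers by distance in PCA space, keep top-N."""
    t = (test_vec - mu) @ basis
    proj = (speaker_means - mu) @ basis
    d = np.linalg.norm(proj - t, axis=1)
    return np.argsort(d)[:top]
```

Only the surviving candidates are passed to the reduced SVM, which is where the storage and computation savings come from.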

20.
In the present study, wavelet transform (WT) and neural network techniques were developed for speech-based, text-independent speaker identification. A feature extraction method was developed using the first five formants in conjunction with the Shannon entropy of a level-four wavelet packet (WP) decomposition. Thirty-five features were fed to feed-forward backpropagation neural networks (FFPBNN) for classification. Feature extraction and classification are performed by the wavelet packet and formants neural network (WPFNN) expert system. The results show that the proposed method provides an effective analysis, with an average identification rate of 91.09%. Two published methods were investigated for comparison, and the best recognition rate obtained was for WPFNN. The discrete wavelet transform (DWT) was studied to improve system robustness against noise at −2 dB.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号