Similar Literature
A total of 20 similar documents were found (search time: 171 ms).
1.
The issue of input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation or normalization of the input, so that the parameters of the input representation are adapted to those of a single speaker and the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods in which some effects of speaker changes on the speech recognition process are compensated. In all three methods, a feed-forward neural network is first trained to map the input to codes representing the phonetic classes and speakers. Then, among the 71 speakers used in training, the one showing the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted to the corresponding speech of the reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point of every decision region belonging to each phone of each speaker in the input space, for all phones and all speakers. The distances between these points and the corresponding points of the reference speaker are used to offset the speaker-change effects and adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm and keeping the reference speaker data as the desired output, we correct all speech signal frames, i.e., the training and test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely to map the phonetic classes and speaker information back to the input representation. The phonetic output retrieved from the direct network, along with the reference speaker data, is given to the inverse network, which then yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the final network results with the unadapted network based on the highest confidence level yields increases of 2.1%, 2.6% and 3% in phone recognition accuracy on clean speech for the three methods, respectively.
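A minimal sketch of the offset-based adaptation idea in the first method, assuming per-phone mean features stand in for the back-propagated decision-region optima described above (all names and shapes are hypothetical):

```python
import numpy as np

def phone_offsets(frames, phones, ref_frames, ref_phones, phone_set):
    """Offset from each speaker's phone region to the reference speaker's.

    Simplification: per-phone means replace the optimal decision-region
    points the paper finds with error back-propagation.
    frames: (n_frames, n_dims) features; phones: (n_frames,) phone labels.
    """
    return {p: ref_frames[ref_phones == p].mean(axis=0)
              - frames[phones == p].mean(axis=0)
            for p in phone_set}

def adapt_to_reference(frames, phones, offsets):
    """Shift every frame toward the reference speaker before recognition."""
    return np.stack([f + offsets[p] for f, p in zip(frames, phones)])
```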

2.

In voice recognition, the two main problems are speech recognition (what was said) and speaker recognition (who was speaking). The usual method for speaker recognition is to postulate a model in which the speaker identity corresponds to the model parameters, whose estimation can be time-consuming when the number of candidate speakers is large. In this paper, we model the speaker as a high-dimensional point cloud of entropy-based features extracted from the speech signal. The method allows indexing, and hence it can handle large databases. We experimentally assessed the quality of the identification with a publicly available database formed by extracting audio from a collection of YouTube videos of 1,000 different speakers. With 20-second audio excerpts, we were able to identify a speaker with 97% accuracy when the recording environment is not controlled, and with 99% accuracy for controlled recording environments.
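A sketch of the point-cloud idea, assuming a simple per-frame Shannon entropy as a stand-in for the paper's entropy-based features and a nearest-neighbor vote over an index for identification (the feature choice, frame sizes, and voting rule are assumptions):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def entropy_cloud(signal, frame_len=512, hop=256, bins=32):
    """One entropy value per frame; the paper's features are higher-dimensional."""
    ents = []
    for start in range(0, len(signal) - frame_len, hop):
        hist, _ = np.histogram(signal[start:start + frame_len], bins=bins)
        p = hist / max(hist.sum(), 1)
        p = p[p > 0]
        ents.append(-(p * np.log2(p)).sum())
    return np.array(ents).reshape(-1, 1)        # (n_frames, n_features)

def identify(query_cloud, clouds):
    """clouds: {speaker: ndarray}; index once, then vote over nearest neighbors,
    which is what makes large databases manageable."""
    labels, points = [], []
    for spk, c in clouds.items():
        labels += [spk] * len(c)
        points.append(c)
    nn = NearestNeighbors(n_neighbors=1).fit(np.vstack(points))
    _, idx = nn.kneighbors(query_cloud)
    votes = [labels[i] for i in idx[:, 0]]
    return max(set(votes), key=votes.count)
```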


3.
To improve speaker indexing accuracy, this paper improves the Bayesian Information Criterion (BIC) commonly used for speaker-change detection and uses gender information in speaker identification, proposing a gender-based speaker indexing algorithm. First, a penalty-distance formula is used to detect speaker changes, eliminating the need to continually tune the penalty factor when using BIC for speaker-change decisions. Second, on top of change detection, gender models determine the gender of each speaker. Finally, male and female speakers are treated separately, and a speaker-model bootstrapping method is used for identification. Experimental results show that, for speaker-change detection, the penalty-distance formula needs no parameter tuning (unlike BIC) and improves F1 by 2% over DISTBIC; for speaker identification, using gender information raises the speaker indexing accuracy (SIA) by 20.93% and the speaker number accuracy (SNA) by 3%.
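For reference, a sketch of the baseline ΔBIC test that the penalty-distance formula improves on (the paper's own formula is not given in the abstract, so only the standard form with a tunable penalty factor λ is shown):

```python
import numpy as np

def delta_bic(X, t, lam=1.0):
    """Standard BIC change test at candidate frame t; X is (n_frames, n_dims).

    Positive values suggest a speaker change. lam is exactly the penalty
    factor that the paper's penalty-distance formula avoids having to tune.
    """
    n, d = X.shape
    logdet = lambda Y: np.linalg.slogdet(np.cov(Y, rowvar=False))[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(X) - t * logdet(X[:t])
                  - (n - t) * logdet(X[t:])) - lam * penalty
```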

4.

The speaker recognition revolution has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition. In contrast, text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-Frequency Cepstral Coefficients (MFCCs), spectrum and log-spectrum are used for feature extraction from the speech signals. These features are processed with a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) as a classification tool to complete the speaker recognition task. The network learns to recognize speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs, and increases to 98.7% when using the spectrum or log-spectrum. However, the system struggles to recognize speakers across different recording environments. Hence, different speech enhancement techniques, such as spectral subtraction and wavelet denoising, are used to improve the recognition performance to some extent. The proposed approach shows superiority when compared to the algorithm of R. Togneri and D. Pullella (2011).
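A sketch of the MFCC-plus-LSTM pipeline under stated assumptions (layer sizes, sequence length, and library choices are ours; the paper does not specify its architecture):

```python
import librosa
import tensorflow as tf

def mfcc_sequence(wav, sr=16000, n_mfcc=13):
    """(n_frames, n_mfcc) MFCC sequence for one utterance."""
    return librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc).T

n_speakers, seq_len, n_mfcc = 10, 200, 13        # hypothetical sizes
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, input_shape=(seq_len, n_mfcc)),
    tf.keras.layers.Dense(n_speakers, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, ...)   # X_train: (n_utts, seq_len, n_mfcc)
```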


5.
Speaker recognition with likelihood-score fusion of multiple subsystems (Total citations: 1; self-citations: 0; citations by others: 1)
李恒杰, 《计算机应用》 (Journal of Computer Applications), 2008, 28(1): 116-119
To address insufficient speech data and telephone-channel mismatch in text-independent speaker recognition on short telephone speech, a likelihood-score fusion strategy over multiple subsystems based on speaker clustering is proposed. Target speakers are clustered by model similarity under the KLD and GLR measures, and within each speaker cluster a likelihood-score fusion system is built from three subsystems using different feature types: MFCC, LPCC and SSFE. The subsystems complement one another, combining the recognition accuracy of the MFCC and LPCC parameters with the good robustness of SSFE, and different speaker clusters use different score-fusion networks, improving overall system performance. Evaluated on the NIST SRE 05 data, the speaker-clustering fusion strategies under the KLD and GLR measures reduce the system equal error rate by 10.3% and 8.7%, respectively, compared with conventional unclustered multi-system score fusion.
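A sketch of the two pieces the abstract names: the KLD model-similarity measure used to cluster target speakers (shown here for single-Gaussian models) and a per-cluster weighted fusion of the three subsystems' likelihood scores (the fusion weights and network form are assumptions):

```python
import numpy as np

def sym_kl_gauss(m1, S1, m2, S2):
    """Symmetric KL divergence between two Gaussian speaker models."""
    iS1, iS2 = np.linalg.inv(S1), np.linalg.inv(S2)
    d, k = m1 - m2, len(m1)
    kl12 = 0.5 * (np.trace(iS2 @ S1) + d @ iS2 @ d - k
                  + np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1])
    kl21 = 0.5 * (np.trace(iS1 @ S2) + d @ iS1 @ d - k
                  + np.linalg.slogdet(S1)[1] - np.linalg.slogdet(S2)[1])
    return kl12 + kl21

def fuse_scores(mfcc_score, lpcc_score, ssfe_score, weights):
    """Weighted fusion of the three subsystems' log-likelihood scores;
    each speaker cluster gets its own weight vector."""
    return np.dot(weights, [mfcc_score, lpcc_score, ssfe_score])
```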

6.
武妍, 金明曦, 王守觉, 《计算机工程》 (Computer Engineering), 2006, 32(12): 184-186
Based on a novel biomimetic pattern recognition theory, a new method is proposed that uses neural networks to perform speaker recognition. The method uses the complex envelopes formed by high-order neural networks to construct coverage regions for different speakers in feature space, thereby achieving recognition. Experiments show that this new speaker recognition method reaches a higher recognition rate than traditional methods after training on only a small number of samples.

7.
This paper proposes a new method for speaker feature extraction based on formants, wavelet entropy and neural networks, denoted FWENN. In the first stage, five formants and seven Shannon wavelet-packet entropies are extracted from the speakers' signals as the speaker feature vector. In the second stage, these 12 feature coefficients are used as inputs to feed-forward neural networks. A probabilistic neural network is also used for comparison. In contrast to conventional speaker recognition methods that extract features from sentences (or words), the proposed method extracts the features from vowels. Advantages of using vowels include the ability to recognize speakers when only partially recorded words are available. This may be useful for deaf-mute persons or when the recordings are damaged. Experimental results show that the proposed method succeeds in the speaker verification and identification tasks with a high classification rate. This is accomplished with a minimum amount of information, using only 12 feature coefficients (i.e., vector length) and only one vowel signal, which is the major contribution of this work. The results are further compared to well-known classical algorithms for speaker recognition and are found to be superior.
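A sketch of the 12-coefficient FWENN feature vector, assuming LPC root-finding for the five formants and level-3 wavelet-packet Shannon entropies (the paper keeps seven entropies; which nodes it keeps is not stated, so the first seven are taken here):

```python
import numpy as np
import librosa
import pywt

def formants(x, sr, order=12, n_formants=5):
    """First formants from LPC roots (a common textbook method; the paper's
    formant tracker is not specified)."""
    a = librosa.lpc(x.astype(float), order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return freqs[:n_formants]

def wavelet_entropies(x, wavelet="db4", level=3, keep=7):
    """Shannon entropy of the first `keep` level-3 wavelet-packet nodes."""
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, maxlevel=level)
    ents = []
    for node in wp.get_level(level, order="natural")[:keep]:
        e = node.data ** 2
        p = e / max(e.sum(), 1e-12)
        p = p[p > 0]
        ents.append(float(-(p * np.log(p)).sum()))
    return ents

def fwenn_features(x, sr):
    # 5 formants + 7 entropies = 12 coefficients, fed to the neural network
    return np.array(formants(x, sr) + wavelet_entropies(x))
```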

8.
In continuous speech recognition systems, complex environments (including variability of speakers and of environmental noise) cause a mismatch between training and test data that lowers recognition accuracy. To address this, a speech recognition algorithm based on adaptive deep neural networks is proposed. It combines an improved regularized adaptation criterion with feature-space adaptive deep neural networks to improve data matching; it fuses speaker identity vectors (i-vectors) with noise-aware training to overcome problems caused by variability in speakers and environmental noise; and it improves the classification function of the traditional DNN output layer to ensure within-class compactness and between-class separation. Tests on the TIMIT English speech dataset and a Microsoft Chinese speech dataset with various superimposed background noises show that, compared with the currently popular GMM-HMM and conventional DNN acoustic models, the proposed algorithm reduces the word error rate by 5.151% and 3.113%, respectively, improving the model's generalization and robustness to some extent.
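A sketch of the speaker-aware, noise-aware input construction the abstract describes, i.e., appending an i-vector and a noise estimate to every frame; the noise estimator and all dimensions are assumptions:

```python
import numpy as np

def aware_input(frames, ivector, n_noise_frames=10):
    """Augment (n_frames, n_dims) features with the utterance i-vector and a
    crude noise estimate (mean of the first frames, assumed non-speech),
    so the DNN can normalize for speaker and noise internally."""
    noise = frames[:n_noise_frames].mean(axis=0)
    tail = np.concatenate([ivector, noise])
    return np.hstack([frames, np.tile(tail, (len(frames), 1))])
```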

9.
The model-based approach is one of the methods widely used for speaker identification, where a statistical model is used to characterize a specific speaker's voice but no interspeaker information is involved in its parameter estimation. It is observed that interspeaker information is very helpful in discriminating between different speakers. In this paper, we propose a novel method that uses interspeaker information to improve the performance of a model-based speaker identification system. A neural network is employed to capture the interspeaker information from the output space of those statistical models. In order to fully utilize interspeaker information, a rival penalized encoding rule is proposed to design supervised learning pairs. Moreover, for better generalization, a query-based learning algorithm is presented to actively select the input data of interest during training of the neural network. Comparative results on the KING speech corpus show that our method leads to a considerable improvement for a model-based speaker identification system.

10.
This paper presents a study of speaker identification for security systems based on the energy of speaker utterances. The proposed system consists of a combination of signal pre-processing, feature extraction using the wavelet packet transform (WPT), and speaker identification using an artificial neural network. In the signal pre-processing, the amplitudes of utterances of the same sentence are normalized to prevent estimation errors caused by changes in a speaker's volume. In the feature extraction, three conventional methods were considered in the experiments and compared with the irregular decomposition method in the proposed system. To verify the effectiveness of the proposed system for identification, a general regression neural network (GRNN) was used and compared in the experimental investigation. The experimental results demonstrate the effectiveness of the proposed speaker identification system in comparison with the discrete wavelet transform (DWT), the conventional WPT, and the WPT in Mel scale.
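A sketch of the pre-processing and feature steps, assuming simple peak normalization and uniform wavelet-packet energies (the abstract names the normalization and the WPT but not their exact forms, and its irregular decomposition is not reproduced here):

```python
import numpy as np
import pywt

def normalize_amplitude(x, eps=1e-12):
    """Peak-normalize an utterance so a speaker's change in volume between
    sessions does not skew the energy-based features."""
    return x / (np.max(np.abs(x)) + eps)

def wpt_energies(x, wavelet="db4", level=4):
    """Energy of each wavelet-packet node -- the general shape of the WPT
    features that feed the neural network."""
    wp = pywt.WaveletPacket(data=x, wavelet=wavelet, maxlevel=level)
    return np.array([(node.data ** 2).sum()
                     for node in wp.get_level(level, order="natural")])
```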

11.
12.
Voice conversion based on a genetic radial basis function neural network (Total citations: 4; self-citations: 1; citations by others: 4)
Voice conversion technology transforms one speaker's speech pattern into that of another speaker with different characteristics, so that the converted speech keeps the source speaker's original linguistic content while taking on the target speaker's voice characteristics. This paper studies using an RBF neural network trained by a genetic algorithm to capture the spectral-envelope mapping between speakers, in order to convert voice characteristics between different speakers. Experiments evaluated the converted speech quality for six Mandarin monophthong phonemes both objectively and subjectively; the results show that the neural-network approach achieves the desired converted-speech performance. The results also show that, compared with the k-means method, training the network with a genetic algorithm strengthens its global optimization capability and reduces the average spectral-distortion distance between the converted and target speech by about 10%.
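A sketch of the RBF spectral-envelope mapping, with k-means centers as the baseline the abstract compares against (the paper's contribution is training the network with a genetic algorithm instead; the widths, center count, and linear readout are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

class RBFEnvelopeMap:
    """Map source-speaker spectral envelopes to the target speaker's."""
    def __init__(self, n_centers=32, width=1.0):
        self.n_centers, self.width = n_centers, width

    def _phi(self, X):
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * self.width ** 2))

    def fit(self, X_src, Y_tgt):
        # k-means centers here; the paper evolves these with a GA instead.
        self.centers = KMeans(self.n_centers, n_init=10).fit(X_src).cluster_centers_
        self.W, *_ = np.linalg.lstsq(self._phi(X_src), Y_tgt, rcond=None)
        return self

    def predict(self, X_src):
        return self._phi(X_src) @ self.W
```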

13.
In this paper we propose a neural-network-based feature transformation framework for developing an emotion-independent speaker identification system. Most present speaker recognition systems may not perform well in emotional environments. In real life, humans extensively express emotions during conversations to convey messages effectively. Therefore, in this work we propose a speaker recognition system that is robust to variations in the emotional moods of speakers. Neural network models are explored to transform the speaker-specific spectral features from any specific emotion to neutral. In this work, we have considered eight emotions, namely Anger, Sad, Disgust, Fear, Happy, Neutral, Sarcastic and Surprise. The emotional databases developed in Hindi, Telugu and German are used in this work for analyzing the effect of the proposed feature transformation on the performance of the speaker identification system. Spectral features are represented by mel-frequency cepstral coefficients, and speaker models are developed using Gaussian mixture models. The performance of the speaker identification system is analyzed with various feature mapping techniques. Results demonstrate that the proposed neural-network-based feature transformation improves speaker identification performance by 20%. Feature transformation at the syllable level shows better performance than at the sentence level.
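A sketch of the framework under stated assumptions: an MLP maps emotional MFCC frames to neutral ones, and per-speaker GMMs score the transformed frames (layer sizes, GMM order, and the frame alignment are assumptions):

```python
from sklearn.neural_network import MLPRegressor
from sklearn.mixture import GaussianMixture

# Emotional -> neutral feature mapping; training needs time-aligned
# emotional/neutral MFCC frame pairs.
mapper = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
# mapper.fit(mfcc_emotional, mfcc_neutral)

def train_speaker_model(neutral_frames, n_components=16):
    """One GMM per speaker, trained on neutral-speech MFCCs."""
    return GaussianMixture(n_components=n_components).fit(neutral_frames)

def identify(frames, gmms):
    """Transform to neutral, then pick the speaker whose GMM scores highest."""
    x = mapper.predict(frames)
    return max(gmms, key=lambda spk: gmms[spk].score(x))
```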

14.
This paper presents two approaches for speaker role recognition in multiparty audio recordings. The experiments are performed over a corpus of 96 radio bulletins corresponding to roughly 19 h of material. Each recording involves, on average, 11 speakers, each playing one of six roles from a predefined set. Both proposed approaches start by automatically segmenting the recordings into single-speaker segments, but perform role recognition using different techniques. The first approach is based on Social Network Analysis; the second relies on the distribution of intervention durations across different speakers. The two approaches are used both separately and in combination, and the results show that around 85% of the recording time can be labeled correctly in terms of role.

15.
An intelligent system for text-dependent speaker recognition is proposed in this paper. The system consists of a wavelet-based module as the feature extractor of speech signals and a neural-network-based module as the signal classifier. The Daubechies wavelet is employed to filter and compress the speech signals. The fuzzy ARTMAP (FAM) neural network is used to classify the processed signals. A series of experiments on text-dependent gender and speaker recognition are conducted to assess the effectiveness of the proposed system using a collection of vowel signals from 100 speakers. A variety of operating strategies for improving the FAM performance are examined and compared. The experimental results are analyzed and discussed.
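A sketch of the wavelet front end, assuming a keep-the-largest-coefficients compression rule (the abstract names the Daubechies wavelet but not the compression criterion); the compressed signal would then feed the fuzzy ARTMAP classifier:

```python
import numpy as np
import pywt

def daubechies_compress(x, wavelet="db4", level=4, keep=0.1):
    """Filter and compress a speech signal: zero all but the largest
    `keep` fraction of wavelet coefficients, then reconstruct."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    flat = np.concatenate(coeffs)
    threshold = np.quantile(np.abs(flat), 1.0 - keep)
    kept = [np.where(np.abs(c) >= threshold, c, 0.0) for c in coeffs]
    return pywt.waverec(kept, wavelet)
```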

16.
A voice corpus is an essential element for automatic speaker recognition systems. For a corpus to be useful in recognition tasks, it must contain recordings from several speakers pronouncing phonetically balanced utterances, recorded over several sessions using different recording media. This work presents the methodology, development and evaluation of a Mexican Spanish corpus referred to as VoCMex, which aims to support research on speaker recognition. It contains telephone and microphone recordings of 20 male and 13 female speakers, obtained over three sessions. To validate the usefulness of the corpus, a speaker identification system was developed, and the recognition results were comparable to those obtained using a known voice corpus.

17.
Speaker discrimination is a vital aspect of speaker recognition applications such as speaker identification, verification, clustering, indexing and change-point detection. These tasks are usually performed using distance-based approaches that compare speaker models or features from homogeneous speaker segments in order to determine whether or not they belong to the same speaker. Several distance measures and features have been examined across these applications; however, no single distance or feature has been reported to perform optimally for all applications in all conditions. In this paper, a thorough analysis is made of the behavior of some frequently used distance measures, as well as features, in distinguishing speakers for different data lengths. The measures studied include the Mahalanobis distance, Kullback-Leibler (KL) distance, T² statistic, Hellinger distance, Bhattacharyya distance, Generalized Likelihood Ratio (GLR), Levenne distance, and the L2 and L∞ distances. Mel-Frequency Cepstral Coefficients (MFCC), Linear Predictive Cepstral Coefficients (LPCC), Line Spectral Pairs (LSP) and Log Area Ratios (LAR) comprise the features investigated. The usefulness of these measures is studied for different data lengths. Generally, a larger data size for each speaker results in better speaker-differentiating capability, as more information can be taken into account. However, in some applications such as segmentation of telephone data, speakers change frequently, making it impossible to obtain large speaker-consistent utterances (especially when speaker change points are unknown). A metric is defined for determining the probability of speaker discrimination error obtainable for each distance measure using each feature set, and the effect of data size on this probability is observed. Furthermore, simple distance-based speaker identification and clustering systems are developed, and the performance of each distance and feature for various data sizes is evaluated on these systems to illustrate the importance of choosing the appropriate distance and feature for each application. Results show that for tasks with no limitation on data length, such as speaker identification, the Kullback-Leibler distance with the MFCCs yields the highest speaker differentiation performance, comparable to results obtained with more complex state-of-the-art speaker identification systems. Results also indicate that the Hellinger and Bhattacharyya distances with the LSPs yield the best performance for small data sizes.
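A sketch of two of the studied measures for Gaussians fit to two segments' features: the Bhattacharyya and Hellinger distances, which the study found best for small data sizes when used with LSPs (single-Gaussian segment models are an assumption; the paper does not state its model form):

```python
import numpy as np

def fit_gaussian(X):                       # X: (n_frames, n_dims)
    return X.mean(axis=0), np.cov(X, rowvar=False)

def bhattacharyya_gauss(m1, S1, m2, S2):
    """Bhattacharyya distance between two Gaussian segment models."""
    S, d = (S1 + S2) / 2.0, m1 - m2
    return (d @ np.linalg.inv(S) @ d / 8.0
            + 0.5 * (np.linalg.slogdet(S)[1]
                     - 0.5 * (np.linalg.slogdet(S1)[1]
                              + np.linalg.slogdet(S2)[1])))

def hellinger_gauss(m1, S1, m2, S2):
    """Hellinger distance between two Gaussian segment models."""
    S, d = (S1 + S2) / 2.0, m1 - m2
    bc = (np.linalg.det(S1) ** 0.25 * np.linalg.det(S2) ** 0.25
          / np.sqrt(np.linalg.det(S))
          * np.exp(-d @ np.linalg.inv(S) @ d / 8.0))
    return np.sqrt(max(1.0 - bc, 0.0))
```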

18.
Audio segmentation of broadcast speech (Total citations: 1; self-citations: 2; citations by others: 1)
The broadcast news segmentation system in this paper consists of three parts: segmentation, classification and clustering. The segmentation part uses a proposed algorithm based on detecting the trend of entropy change to locate acoustic-feature change points in a continuous audio stream, thereby segmenting audio signals of different natures. Unlike traditional change-point detection methods that require a threshold, it detects acoustic change points by examining, for every possible split point within a window of a given length, the trend of the entropies of the two signal segments that the split produces, which avoids segmentation errors caused by a poorly chosen threshold. The classification part uses a conventional Gaussian mixture model (GMM) classifier, and the clustering part uses a vector quantization (VQ)-based speaker clustering algorithm. Applied to three 30-minute news broadcasts, the system successfully segmented the continuous audio signals, removed all background music, and grouped speech belonging to the same speaker into one class with high accuracy, laying a good foundation for the classification and recognition of broadcast speech.
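A sketch of the threshold-free entropy-trend test, assuming spectral entropy per segment and a simple contrast statistic over every candidate split in a window (the paper's exact trend statistic is not given in the abstract):

```python
import numpy as np

def segment_entropy(power_frames):
    """Mean spectral entropy of a block of (n_frames, n_bins) power spectra."""
    p = power_frames / (power_frames.sum(axis=1, keepdims=True) + 1e-12)
    return float(np.mean(-(p * np.log(p + 1e-12)).sum(axis=1)))

def best_change_point(window, margin=10):
    """Scan every candidate split in the window; the split maximizing the
    entropy contrast between the two sides is the change-point hypothesis.
    No threshold is tuned -- only the trend across candidates is used."""
    scores = [abs(segment_entropy(window[:t]) - segment_entropy(window[t:]))
              for t in range(margin, len(window) - margin)]
    return margin + int(np.argmax(scores)), max(scores)
```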

19.
20.
Extracting recording-device characteristics from audio signals is a frontier topic in forensic comparison research and audio forensics. Because recording-device identification is disturbed by factors such as environment, semantics and speaker, many difficulties remain to be overcome, and research both in China and abroad is still at an early stage. This paper therefore reviews the development, basic theory and architecture of recording-device research, and in particular introduces and analyzes the state of research on its components: non-speech-segment detection, feature parameters, recognition models and database construction. Finally, it analyzes the remaining shortcomings of recording-device identification and looks ahead to future research directions, pointing out that accelerating the construction of databases covering existing recording-device brands and models, various settings and various populations, together with applying deep learning to recording-device identification, are the priorities for the next stage of research.
