Similar Articles
20 similar articles found.
1.
This paper addresses the problem of recognising speech in the presence of a competing speaker. We review a speech fragment decoding technique that treats segregation and recognition as coupled problems. Data-driven techniques are used to segment a spectro-temporal representation into a set of fragments, such that each fragment is dominated by one or the other of the speech sources. A speech fragment decoder employing missing-data techniques and clean-speech models simultaneously searches for the set of fragments and the word sequence that best matches the target speaker model. The paper investigates the performance of the system on a recognition task employing artificially mixed target and masker speech utterances. The fragment decoder produces significantly lower error rates than a conventional recogniser, and mimics the pattern of human performance produced by the interplay between energetic and informational masking. However, at around 0 dB the performance is generally quite poor, and an analysis of the errors shows that a large number of target/masker confusions are being made. The paper presents a novel fragment-based speaker identification approach that allows the target speaker to be reliably identified across a wide range of SNRs. Combining this component with the recognition system produces significant improvements. When the target and masker utterances have the same gender, the recognition system matches human performance at 0 dB; in other conditions the error rate is roughly twice the human error rate.
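To make the missing-data idea concrete, here is a minimal numpy sketch of the marginalisation step at the heart of fragment decoding: spectro-temporal cells dominated by the masker are integrated out of a diagonal-Gaussian state likelihood. The function name and data are illustrative assumptions, not the authors' decoder.

```python
import numpy as np

def missing_data_loglik(x, mask, mean, var):
    """Log-likelihood of frame x under N(mean, diag(var)), marginalising
    the dimensions where mask == 0 (cells judged masker-dominated)."""
    present = mask.astype(bool)
    d = x[present] - mean[present]
    v = var[present]
    return -0.5 * np.sum(np.log(2 * np.pi * v) + d * d / v)

# Toy example: a 64-channel spectral frame with roughly half the cells masked.
rng = np.random.default_rng(0)
x = rng.normal(size=64)
mask = (rng.random(64) > 0.5).astype(int)
print(missing_data_loglik(x, mask, np.zeros(64), np.ones(64)))
```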

2.
Input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation or normalization of the input, in which the parameters of the input representation are adapted to those of a single speaker and the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods that compensate for some of the effects of speaker changes on the speech recognition process. In all three methods, a feed-forward neural network is first trained to map the input onto codes representing the phonetic classes and speakers. Then, among the 71 speakers used in training, the one with the highest phone recognition accuracy is selected as the reference speaker, and the representation parameters of the other speakers are converted to the corresponding speech of the reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point of every decision region in the input space, for each phone of each speaker. The distances between these points and the corresponding points of the reference speaker are used to offset the effects of speaker changes and to adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm with the reference speaker data as the desired output, all speech signal frames in both the training and test datasets are corrected so that they coincide with the corresponding speech of the reference speaker. In the third method, a second feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation. The phonetic output retrieved from the direct network, along with the reference speaker data, is given to the inverse network, which yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the final network results with the un-adapted network according to the highest confidence level yields increases of 2.1%, 2.6% and 3% in phone recognition accuracy on clean speech for the three methods, respectively.
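The first method's displacement idea can be caricatured with per-phone offsets, as in the hypothetical sketch below: estimate, for each phone class, the displacement from a speaker's feature space to the reference speaker's, then shift the frames by it. The paper derives the optimal points with error back-propagation; class means are used here only to keep the sketch self-contained.

```python
import numpy as np

def phone_offsets(spk_feats, spk_labels, ref_feats, ref_labels, n_phones):
    """Per-phone mean displacement from a speaker to the reference speaker."""
    offsets = np.zeros((n_phones, spk_feats.shape[1]))
    for p in range(n_phones):
        offsets[p] = (ref_feats[ref_labels == p].mean(axis=0)
                      - spk_feats[spk_labels == p].mean(axis=0))
    return offsets

def adapt(frames, labels, offsets):
    """Shift each frame toward the reference speaker's feature space."""
    return frames + offsets[labels]
```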

3.
4.
Feature statistics normalization in the cepstral domain is one of the best-performing approaches for robust automatic speech and speaker recognition in noisy acoustic scenarios: feature coefficients are normalized by suitable linear or nonlinear transformations so that the noisy speech statistics match the clean speech statistics. Histogram equalization (HEQ) belongs to this category of algorithms, has proved effective for this purpose, and is therefore taken here as the reference. In this paper the availability of multiple acoustic channels is used to enhance the statistics-modeling capabilities of the HEQ algorithm, exploiting multiple noisy occurrences of the speech in order to maximize the effectiveness of the cepstral normalization process. Computer simulations based on the Aurora 2 database in speech and speaker recognition scenarios show that a significant recognition improvement over the single-channel counterpart and other multi-channel techniques can be achieved, confirming the effectiveness of the idea. The proposed algorithmic configuration has also been combined with a kernel estimation technique to further improve speech recognition performance.
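A single-channel HEQ step reduces to quantile matching, as in this numpy-only sketch (variable names are illustrative): the noisy empirical CDF is mapped onto a reference distribution estimated from clean training data. The paper's multi-channel extension pools several noisy occurrences to estimate the noisy CDF more reliably.

```python
import numpy as np

def heq(noisy, reference):
    """Map 'noisy' so that its empirical CDF matches that of 'reference'."""
    ranks = np.argsort(np.argsort(noisy))      # rank 0..N-1 of each value
    cdf = (ranks + 0.5) / len(noisy)           # empirical CDF positions
    return np.quantile(reference, cdf)         # inverse reference CDF

rng = np.random.default_rng(1)
clean = rng.normal(0.0, 1.0, 5000)             # clean-speech coefficient track
noisy = 0.6 * rng.normal(0.5, 1.4, 5000)       # mismatched noisy statistics
equalised = heq(noisy, clean)
print(equalised.mean(), equalised.std())       # approaches clean mean/std
```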

5.
Monaural speech separation and recognition challenge
Robust speech recognition in everyday conditions requires the solution to a number of challenging problems, not least the ability to handle multiple sound sources. The specific case of speech recognition in the presence of a competing talker has been studied for several decades, resulting in a number of quite distinct algorithmic solutions whose focus ranges from modeling both target and competing speech to speech separation using auditory grouping principles. The purpose of the monaural speech separation and recognition challenge was to permit a large-scale comparison of techniques for the competing talker problem. The task was to identify keywords in sentences spoken by a target talker when mixed into a single channel with a background talker speaking similar sentences. Ten independent sets of results were contributed, alongside a baseline recognition system. Performance was evaluated using common training and test data and common metrics. Listeners' performance in the same task was also measured. This paper describes the challenge problem, compares the performance of the contributed algorithms, and discusses the factors which distinguish the systems. One highlight of the comparison was the finding that several systems achieved near-human performance in some conditions, and one outperformed listeners overall.

6.
Exploiting the capabilities offered by the plethora of existing wavelets, together with the powerful set of orthonormal bases provided by wavelet packets, we construct a novel wavelet packet-based set of speech features optimized for the task of speaker verification. Our approach differs from previous wavelet-based work primarily in the wavelet-packet tree design, which follows the concept of critical bands, as well as in the particular wavelet basis function used. In comparative experiments, we investigate several alternative speech parameterizations with respect to their usefulness for differentiating among human voices. The experimental results confirm that the proposed speech features outperform Mel-Frequency Cepstral Coefficients (MFCC) and previously used wavelet features on the task of speaker verification. Relative reductions of the equal error rate by 15%, 15% and 8% were observed for the proposed speech features when compared to the wavelet packet features introduced by Farooq and Datta, the MFCC of Slaney, and the subband-based cepstral coefficients of Sarikaya et al., respectively.
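For illustration only, the sketch below computes log-energy features from a full Haar wavelet-packet tree; the paper's tree instead follows critical bands and uses a different basis, so treat this as a generic skeleton rather than the proposed front end.

```python
import numpy as np

def haar_step(x):
    a = (x[0::2] + x[1::2]) / np.sqrt(2)   # approximation (low-pass) branch
    d = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail (high-pass) branch
    return a, d

def wavelet_packet(x, depth):
    """Full wavelet-packet decomposition: split every node at every level."""
    nodes = [x]
    for _ in range(depth):
        nodes = [half for n in nodes for half in haar_step(n)]
    return nodes                            # 2**depth sub-bands

frame = np.random.default_rng(2).normal(size=256)
subbands = wavelet_packet(frame, 3)
feats = np.log(np.array([np.sum(b * b) for b in subbands]) + 1e-12)
```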

7.
8.
A text-independent speaker recognition system based on multi-resolution singular value decomposition (MSVD) is proposed. The MSVD is applied to speaker data compression and feature extraction and, unlike eigendecomposition-based transforms, is not restricted to square matrices. Our results show that the MSVD yields better recognition rates than the Karhunen-Loeve transform.
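The abstract gives few details of the MSVD itself; as a loosely related and clearly substituted illustration, a plain SVD already compresses a non-square (frames x coefficients) feature matrix to rank k:

```python
import numpy as np

def svd_compress(X, k):
    """Rank-k compression of a (frames x coeffs) matrix; X need not be square."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]        # compressed features and basis
```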

9.
In this letter, a two-stage approach based on adaptive fuzzy C-means and wavelet transform clustering is proposed for efficient feature extraction in speaker recognition. In addition, the investigation includes the development of an objective function to be minimized under an unsupervised mode of training. Experimental results show a speaker recognition rate of 95%.
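As a reference point for the clustering stage, here is plain fuzzy C-means in numpy; the paper's adaptive variant and the wavelet-transform stage are not reproduced, and all parameter values are illustrative.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, seed=0):
    """Standard fuzzy C-means: alternate center and membership updates."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-12
        U = d ** (-2.0 / (m - 1.0))
        U /= U.sum(axis=1, keepdims=True)
    return centers, U
```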

10.
Speaker recognition faces many practical difficulties, among which signal inconsistency due to environmental and acquisition-channel factors is the most challenging. The noise imposed on the voice signal varies greatly, and an a priori noise model is usually unavailable. In this article, we propose a robust speaker recognition method that employs a novel adaptive wavelet shrinkage method for noise suppression. In our method, wavelet subband coefficient thresholds are computed automatically, in proportion to the noise contamination. In applying wavelet shrinkage for noise removal, a dual-threshold strategy is developed to suppress noise, preserve signal coefficients, and minimize the introduction of artifacts. Recognition is achieved using a modification of the Mel-frequency cepstral coefficients of overlapping voice-signal segments. The efficacy of our method is evaluated on voice signals from two publicly available speech databases and compared with state-of-the-art methods. Our proposed method is shown to be highly robust in various noise conditions, and the improvement is significant especially when noise dominates the underlying speech.
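The dual-threshold strategy can be pictured as soft shrinkage with a dead zone, as sketched below; how the paper computes the two thresholds from the estimated noise contamination is not reproduced here, so the thresholds are left as free parameters.

```python
import numpy as np

def dual_threshold_shrink(coeffs, t_low, t_high):
    """Zero coefficients below t_low, keep those above t_high unchanged,
    and soft-shrink the ones in between (illustrative shrinkage rule)."""
    c = coeffs.copy()
    a = np.abs(c)
    c[a <= t_low] = 0.0
    mid = (a > t_low) & (a < t_high)
    c[mid] = np.sign(c[mid]) * (a[mid] - t_low)
    return c
```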

11.
A new speaker feature extracted from wavelet eigenfunction estimation is described. The signal is decomposed by interpolating the scaling function, and wavelets offer a significant computational advantage by reducing the dimensionality of the eigenvalue problem. Our results show that this wavelet feature yields better recognition rates than the Karhunen-Loeve transform (KLT).
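For comparison, the KLT baseline the paper measures against amounts to projecting mean-centred features onto the leading covariance eigenvectors, as in this minimal sketch:

```python
import numpy as np

def klt(X, k):
    """Karhunen-Loeve transform: project onto the top-k eigenvectors
    of the feature covariance (X is frames x coefficients)."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = np.argsort(w)[::-1][:k]
    return Xc @ V[:, top]
```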

12.
Feature extraction is an essential step for speaker recognition systems. In this paper, we propose to improve these systems by exploiting both conventional features, such as mel-frequency cepstral coefficients (MFCC) and linear predictive cepstral coefficients (LPCC), and non-conventional ones. The method exploits information present in the linear predictive (LP) residual signal; the features extracted from the LP residue are then combined with the MFCC or the LPCC. We investigate two approaches, termed temporal and frequential representations. The first consists of an auto-regressive (AR) modelling of the signal followed by a cepstral transformation, in a similar way to the LPC-LPCC transformation. To take into account the non-linear nature of speech signals, we use two estimation methods based on second- and third-order statistics, termed R-SOS-LPCC (residual plus second-order-statistics-based estimation of the AR model plus cepstral transformation) and R-HOS-LPCC (higher-order statistics), respectively. For the frequential approach, we exploit a filter-bank method called the power difference of spectra in sub-band (PDSS), which measures the spectral flatness over the sub-bands; the resulting features are named R-PDSS. These proposed schemes are analysed on a speaker identification problem with two different databases. The first is the Gaudi database, containing 49 speakers; its main interest lies in the controlled acquisition conditions: mismatched microphones and inter-session intervals. The second database is the well-known NTIMIT corpus with 630 speakers, over which the performance of the features is confirmed. In addition, we compare traditional and residual features by fusing recognizers (feature extractor + classifier). The results show that residual features carry speaker-dependent information, and their combination with the LPCC or the MFCC yields global improvements in robustness under different mismatches. A comparison of the residual features within the opinion-fusion framework gives useful information about the potential of both the temporal and frequential representations.
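Assuming the usual definition of PDSS as one minus the ratio of geometric to arithmetic mean of the sub-band power spectrum (a per-band spectral-flatness measure), the R-PDSS computation over an LP-residual spectrum might look like this sketch; the band edges are hypothetical.

```python
import numpy as np

def pdss(power_spec, band_edges):
    """Power Difference of Spectra in Sub-band: 1 - GM/AM per sub-band."""
    feats = []
    for lo, hi in band_edges:
        band = power_spec[lo:hi] + 1e-12
        gm = np.exp(np.mean(np.log(band)))   # geometric mean
        am = np.mean(band)                   # arithmetic mean
        feats.append(1.0 - gm / am)          # 0 = flat, near 1 = peaky
    return np.array(feats)
```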

13.
This paper presents a fuzzy control mechanism for conventional maximum likelihood linear regression (MLLR) speaker adaptation, called FLC-MLLR, by which the effect of MLLR adaptation is regulated according to the availability of adaptation data: the advantage of MLLR adaptation is fully exploited when the training data are sufficient, and the consequences of poor MLLR adaptation are restrained otherwise. The robustness of MLLR adaptation against data scarcity is thus ensured. The proposed mechanism is conceptually simple, computationally inexpensive, and effective; recognition-rate experiments show that FLC-MLLR outperforms standard MLLR, especially when data are insufficient, and performs better than MAPLR at much lower computing cost.
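One way to picture the fuzzy regulation is a membership weight that grows with the amount of adaptation data and blends the MLLR-transformed mean with the original, as in the sketch below. The membership shape and breakpoints are invented for illustration; the paper's fuzzy logic controller is more elaborate.

```python
import numpy as np

def flc_weight(n_frames, lo=200, hi=2000):
    """Hypothetical fuzzy membership: 0 with scarce data, 1 with ample data."""
    return float(np.clip((n_frames - lo) / (hi - lo), 0.0, 1.0))

def regulated_mean(mu, W, b, n_frames):
    """Blend the MLLR-adapted mean (W @ mu + b) with the original mean
    according to data availability, in the spirit of FLC-MLLR."""
    a = flc_weight(n_frames)
    return a * (W @ mu + b) + (1.0 - a) * mu
```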

14.
This paper describes the study and implementation of improvements to a speaker recognition algorithm running on an SoC built around the μ'nSP core. An improved endpoint detection algorithm is adopted, raising the recognition rate, and randomized spoken prompts are used to counter recorded-playback cheating in identity verification. Good results are obtained.
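The paper's improved endpoint detector is not spelled out in the abstract; for orientation, a baseline short-time-energy detector looks like the following sketch (threshold rule and constants are assumptions).

```python
import numpy as np

def endpoints(signal, sr, frame_ms=20, k=3.0):
    """Toy endpoint detection: speech is where short-time energy exceeds
    a threshold set relative to the noise floor of the leading frames."""
    n = int(sr * frame_ms / 1000)
    frames = signal[:len(signal) // n * n].reshape(-1, n)
    e = (frames ** 2).sum(axis=1)
    thr = k * np.median(e[:5])               # noise floor estimate
    voiced = np.where(e > thr)[0]
    if voiced.size == 0:
        return None
    return voiced[0] * n, (voiced[-1] + 1) * n   # start/end sample indices
```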

15.
Automatic Speaker Recognition (ASR) refers to the task of identifying a person based on his or her voice with the help of machines. ASR finds potential applications in telephone-based financial transactions, credit-card purchases, and forensic science, as well as in social anthropology for the study of different cultures and languages. Results in ASR are highly dependent on the database: they are meaningless if the recording conditions are not known. In this paper, a methodology and a typical experimental setup for developing corpora for various text-independent speaker identification tasks in different Indian languages, viz., Marathi, Hindi, Urdu and Oriya, are described. Finally, an ASR system is presented to evaluate the corpora.

16.
This paper considers the separation and recognition of overlapped speech sentences from a single-channel observation. A system based on a combination of several techniques is proposed: a missing-feature approach for crosstalk/noise robustness, a Wiener filter for speech enhancement, hidden Markov models for speech reconstruction, and speaker-dependent/-independent modeling for speaker and speech recognition. We develop the system on the Speech Separation Challenge database, whose task is to separate and recognize two mixed sentences without advance knowledge of the speakers' identities or of the signal-to-noise ratio. The paper is an extended version of a previous conference paper submitted for the challenge.
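The enhancement stage reduces, in its simplest frequency-domain form, to a Wiener gain per frequency bin, as in this sketch (the paper's actual estimator and its coupling to the missing-feature stage are not reproduced):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    """Wiener gain G = SNR / (1 + SNR) from a crude single-frame SNR
    estimate, with a spectral floor to limit musical-noise artifacts."""
    snr = np.maximum(noisy_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    return np.maximum(snr / (1.0 + snr), floor)
```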

17.
In this paper, we propose a multi-environment model adaptation method based on vector Taylor series (VTS) for robust speech recognition. In the training phase, clean speech is contaminated with noise at different signal-to-noise ratio (SNR) levels to produce several types of noisy training speech, and each type is used to train a noisy hidden Markov model (HMM) set. In the recognition phase, the HMM set that best matches the testing environment is selected and further adjusted to reduce the environmental mismatch by the VTS-based model adaptation method. In the proposed method, the VTS approximation is based on the noisy training speech, and the testing noise parameters are estimated from the noisy testing speech using the expectation-maximization (EM) algorithm. The experimental results indicate that the proposed multi-environment model adaptation method significantly improves the performance of speech recognizers and outperforms both the traditional model adaptation method and the linear regression-based multi-environment method.
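In the log-spectral domain the VTS idea is compact enough to sketch: the mismatch function y = log(exp(x) + exp(n)) is linearised at the current means, giving adapted Gaussian parameters. Cepstral-domain VTS additionally conjugates by the DCT matrix; that step is omitted in this minimal sketch.

```python
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """First-order VTS adaptation of a log-spectral Gaussian under
    y = log(exp(x) + exp(n)), expanded at (mu_x, mu_n)."""
    g = 1.0 / (1.0 + np.exp(mu_n - mu_x))        # dy/dx at expansion point
    mu_y = mu_x + np.log1p(np.exp(mu_n - mu_x))  # adapted mean
    var_y = g ** 2 * var_x + (1.0 - g) ** 2 * var_n
    return mu_y, var_y
```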

18.
In this work, we combine the decisions of two classifiers as an alternative means of improving the performance of a speaker recognition system in adverse environments. The classifiers differ in their feature sets: one system is based on the popular mel-frequency cepstral coefficients (MFCC) and the other on the new parametric feature-sets (PFS) algorithm. Both feature vectors use mel-scale spectral warping and are computed in the cepstral domain, but the feature sets differ in their spectral filters and compressions. The two classifiers perform similarly in terms of recognition rate, but they are complementary: information not captured by the MFCC is added by the PFS, improving performance. Several ways of combining these classifiers give significant improvements on a speaker identification task using the very large, telephone-degraded NTIMIT database.
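Decision combination of this kind is often realised as a weighted sum of per-speaker log-scores; the sketch below shows that pattern, with the weight alpha left as a tunable assumption rather than the paper's actual rule.

```python
import numpy as np

def fuse_decision(scores_mfcc, scores_pfs, alpha=0.5):
    """Pick the speaker maximising a linear fusion of the two
    classifiers' per-speaker log-scores (alpha tuned on held-out data)."""
    fused = alpha * scores_mfcc + (1.0 - alpha) * scores_pfs
    return int(np.argmax(fused))
```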

19.
20.
Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and these LTs for FW are all shown to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike other previous LT approaches to VTLN with MFCC features, requires no modification of the standard MFCC feature extraction scheme. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front-end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with a feature-space LT denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.
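For context, conventional filterbank VTLN (the approach the proposed LT makes unnecessary) warps the Mel filterbank's center frequencies, commonly with a piecewise-linear function like the sketch below; the knee position f0 is an assumed convention, not taken from the paper.

```python
import numpy as np

def warp_freq(f, alpha, f_max, f0=0.875):
    """Piecewise-linear VTLN warping: slope alpha up to f0*f_max, then a
    linear segment chosen so that f_max maps onto f_max."""
    knee = f0 * f_max
    f = np.asarray(f, dtype=float)
    upper = alpha * knee + (f_max - alpha * knee) * (f - knee) / (f_max - knee)
    return np.where(f <= knee, alpha * f, upper)
```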
