Similar documents (20 results)
1.
Automatic speech recognition (ASR) systems suffer from variation in the acoustic quality of input speech. Speech may be produced in noisy environments, and each speaker has an individual speaking style; variation can be observed even within a single utterance, or for the same speaker in different moods. All of these uncertainties and variations should be normalized to obtain a robust ASR system. In this paper, we apply and evaluate different approaches to normalizing acoustic quality within an utterance for robust ASR. Several HMM (hidden Markov model)-based systems using utterance-level, word-level, and monophone-level normalization are evaluated against an HMM-SM (subspace method)-based system using monophone-level normalization for normalizing variations and uncertainties in an utterance. The SM can represent variation in the fine structure of sub-words as a set of eigenvectors, and so performs better at the monophone level than the HMM. Experimental results show that word accuracy is significantly improved by the HMM-SM-based system with monophone-level normalization compared with the typical HMM-based system with utterance-level normalization, in both clean and noisy conditions. The results also suggest that monophone-level normalization using the SM outperforms monophone-level normalization using the HMM.
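For reference, the utterance-level normalization used by baseline HMM systems of this kind is commonly realized as cepstral mean and variance normalization (CMVN). A minimal sketch of that standard technique follows (this is the generic CMVN baseline, not the paper's subspace method), assuming features arrive as a frames-by-coefficients NumPy array:

```python
import numpy as np

def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Utterance-level cepstral mean and variance normalization.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    Each coefficient track is shifted to zero mean and scaled to unit
    variance over the whole utterance.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Example: normalize a random 13-dimensional MFCC-like sequence.
mfcc = np.random.randn(200, 13) * 3.0 + 1.5
normalized = cmvn(mfcc)
print(normalized.mean(axis=0).round(6), normalized.std(axis=0).round(6))
```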

2.
Input variability resulting from speaker changes is one of the most crucial factors influencing the effectiveness of speech recognition systems. A solution to this problem is adaptation or normalization of the input, such that the parameters of the input representation are adapted to those of a single speaker and the input pattern is normalized against speaker changes before recognition. This paper proposes three such methods, in which some of the speaker-change effects influencing the speech recognition process are compensated. In all three methods, a feed-forward neural network is first trained to map the input onto codes representing the phonetic classes and speakers. Then, among the 71 speakers used in training, the one showing the highest phone recognition accuracy is selected as the reference speaker, so that the representation parameters of the other speakers can be converted to those of the corresponding speech uttered by the reference speaker. In the first method, the error back-propagation algorithm is used to find the optimal point of every decision region of each phone of each speaker in the input space. The distances between these points and the corresponding points of the reference speaker are used to offset the speaker-change effects and adapt the input signal to the reference speaker. In the second method, using the error back-propagation algorithm with the reference speaker data as the desired output, we correct all speech signal frames, in both the training and test datasets, so that they coincide with the corresponding speech of the reference speaker. In the third method, another feed-forward neural network is applied inversely, mapping the phonetic classes and speaker information back to the input representation. The phonetic output of the direct network, together with the reference speaker data, is given to the inverse network, which yields an estimate of the input representation adapted to the reference speaker. In all three methods, the final speech recognition model is trained on the adapted training data and tested on the adapted test data. Implementing these methods and combining the final network results with the un-adapted network based on the highest confidence level yields increases of 2.1%, 2.6%, and 3% in phone recognition accuracy on clean speech for the three methods, respectively.
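The first stage shared by all three methods, a feed-forward network mapping the input representation to phonetic-class and speaker codes, might be sketched as follows in PyTorch. The layer sizes and the 40-phone inventory are illustrative assumptions; only the 71-speaker count comes from the abstract:

```python
import torch
import torch.nn as nn

class PhoneSpeakerNet(nn.Module):
    """Feed-forward network with two output heads: one predicting the
    phonetic class of a frame, the other predicting the speaker identity."""

    def __init__(self, input_dim=39, hidden_dim=256, num_phones=40, num_speakers=71):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.phone_head = nn.Linear(hidden_dim, num_phones)
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)

    def forward(self, x):
        h = self.trunk(x)
        return self.phone_head(h), self.speaker_head(h)

# Joint training with error back-propagation on both objectives.
net = PhoneSpeakerNet()
frames = torch.randn(32, 39)                      # a batch of feature frames
phone_logits, speaker_logits = net(frames)
loss = (nn.functional.cross_entropy(phone_logits, torch.randint(40, (32,)))
        + nn.functional.cross_entropy(speaker_logits, torch.randint(71, (32,))))
loss.backward()
```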

3.
Feature-statistics normalization in the cepstral domain is one of the best-performing approaches to robust automatic speech and speaker recognition in noisy acoustic scenarios: feature coefficients are normalized using suitable linear or nonlinear transformations so that the noisy-speech statistics match the clean-speech statistics. Histogram equalization (HEQ) belongs to this category of algorithms, has proved effective for this purpose, and is therefore taken here as the reference. In this paper, the availability of multiple acoustic channels is used to enhance the statistics-modeling capabilities of the HEQ algorithm: multiple noisy speech occurrences are exploited with the aim of maximizing the effectiveness of the cepstral normalization process. Computer simulations based on the Aurora 2 database in speech and speaker recognition scenarios show that a significant recognition improvement can be achieved with respect to the single-channel counterpart and other multi-channel techniques, confirming the effectiveness of the idea. The proposed algorithmic configuration has also been combined with the kernel estimation technique to further improve speech recognition performance.
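A single-channel HEQ step can be sketched as a quantile mapping from the noisy empirical CDF to a clean reference CDF; the paper's multi-channel extension pools several noisy occurrences to improve the statistics, which this minimal sketch does not attempt:

```python
import numpy as np

def histogram_equalize(test: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Single-channel histogram equalization of one cepstral coefficient
    track: each test value is mapped through the empirical CDF of the
    test data and then through the inverse CDF (quantiles) of the clean
    reference data, so the transformed statistics match the reference."""
    ranks = np.argsort(np.argsort(test))          # rank of each frame
    cdf = (ranks + 0.5) / len(test)               # empirical CDF in (0, 1)
    return np.quantile(reference, cdf)            # inverse reference CDF

# Example: a shifted, compressed noisy track equalized toward a clean reference.
clean = np.random.randn(5000)             # stand-in for clean-speech statistics
noisy = 0.5 * np.random.randn(300) + 2.0  # distorted noisy statistics
equalized = histogram_equalize(noisy, clean)
print(noisy.mean(), equalized.mean())     # equalized mean is pulled toward 0
```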

4.
Peacocke, R.D.; Graf, D.H. Computer, 1990, 23(8): 26-33
Five approaches that can be used to control and simplify the speech recognition task are examined: the use of isolated words, speaker-dependent systems, limited vocabulary size, a tightly constrained grammar, and quiet, controlled environmental conditions. The five components of a speech recognition system are described: a speech capture device, a digital signal processing module, preprocessed signal storage, reference speech patterns, and a pattern-matching algorithm. Current speech recognition systems are reviewed and categorized, and speaker recognition approaches and systems are also discussed.

5.
A novel approach to joint speaker identification and speech recognition is presented in this article. Unsupervised speaker tracking and automatic adaptation of the human-computer interface are achieved through the interaction of speaker identification, speech recognition, and speaker adaptation for a limited number of recurring users. Together with a technique for efficient information retrieval, a compact model of speech and speaker characteristics is presented. Applying speaker-specific profiles allows speech recognition to take individual speech characteristics into consideration and achieve higher recognition rates. Speaker profiles are initialized and continuously adapted by a balanced strategy of short-term and long-term speaker adaptation combined with robust speaker identification. Different users can be tracked by the resulting self-learning speech-controlled system, and only a very short enrollment of each speaker is required. Subsequent utterances are used for unsupervised adaptation, resulting in continuously improving speech recognition rates. Additionally, the detection of unknown speakers is examined, with the objective of avoiding the need to train new speaker profiles explicitly. The speech-controlled system presented here is suitable for in-car applications on embedded devices, e.g. speech-controlled navigation, hands-free telephony, or infotainment systems. Results are presented for a subset of the SPEECON database and validate the benefit of the speaker adaptation scheme and the unified modeling in terms of speaker identification and speech recognition rates.

6.
7.
This paper presents a new architecture for automatic speech recognition systems, characterized by the division of the spectral domain of the speech signal into several independent frequency bands. The model is based on the psycho-acoustic work of Fletcher (1953), who proposed a similar principle for the human auditory system. Jont B. Allen published a paper in 1994 summarizing Fletcher's work and proposing to adapt the multi-band paradigm to automatic speech recognition (ASR) (Allen, 1994). Many researchers have since studied this principle and built such ASR systems. The goal of this paper is to analyse some of the most important issues in the design of a multi-band ASR system, in order to determine which architecture it should have in which environment. Two other major problems are then considered: how to train multi-band systems and how to use them for continuous ASR.
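The front-end band splitting such an architecture relies on can be sketched with an ordinary band-pass filter bank; the band edges below are illustrative, not taken from the paper:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_into_bands(signal, sample_rate, band_edges_hz):
    """Split a speech signal into independent frequency bands, the core
    operation of a multi-band ASR front end. Each band would then feed
    its own feature extractor and recognizer before stream fusion."""
    bands = []
    for low, high in band_edges_hz:
        sos = butter(4, [low, high], btype="bandpass", fs=sample_rate, output="sos")
        bands.append(sosfiltfilt(sos, signal))
    return bands

# Example: a 4-band split of 1 s of white noise sampled at 8 kHz.
fs = 8000
x = np.random.randn(fs)
bands = split_into_bands(x, fs, [(100, 800), (800, 1500), (1500, 2500), (2500, 3800)])
print(len(bands), bands[0].shape)
```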

8.
Automatic speech recognition (ASR) in reverberant environments remains a challenging task. In this study, we propose a robust feature-extraction method based on the normalization of sub-band temporal modulation envelopes (TMEs). The sub-band TMEs are extracted using a series of constant-bandwidth band-pass filters with Hilbert transforms followed by low-pass filtering. Based on these TMEs, the modulation spectra in both the clean and reverberant spaces are transformed to a reference space using modulation transfer functions (MTFs), where the MTFs are estimated as measures of the modulation transfer effect on the sub-band TMEs between the clean, reverberant, and reference spaces. Applying the MTFs to the modulation spectrum is expected to remove the differences in the modulation spectrum caused by differences between recording environments. From the normalized modulation spectrum, an inverse Fourier transform restores the sub-band TMEs while retaining their original phase information. We tested the proposed method in speech recognition experiments in a reverberant room with varying speaker-to-microphone distance (SMD). For comparison, the recognition performance of traditional Mel-frequency cepstral coefficients with mean and variance normalization was used as the baseline. Averaging the results for SMDs from 50 cm to 400 cm, we obtained a 44.96% relative improvement using sub-band TME processing alone, and a further 15.68% relative improvement by normalizing the modulation spectrum of the sub-band TMEs. In all, we obtained a 53.59% relative improvement, better than other temporal filtering and normalization methods.
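The TME extraction chain described above (band-pass filter, Hilbert envelope, low-pass smoothing) can be sketched directly with SciPy; the band edges and the 20 Hz envelope cutoff are assumptions for illustration:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def subband_tme(signal, sample_rate, low_hz, high_hz, env_cutoff_hz=20.0):
    """Extract the temporal modulation envelope (TME) of one sub-band:
    band-pass filter, take the magnitude of the analytic signal
    (Hilbert transform), then low-pass filter the envelope."""
    band = butter(4, [low_hz, high_hz], btype="bandpass",
                  fs=sample_rate, output="sos")
    sub = sosfiltfilt(band, signal)
    envelope = np.abs(hilbert(sub))                 # Hilbert envelope
    lp = butter(4, env_cutoff_hz, btype="lowpass",
                fs=sample_rate, output="sos")
    return sosfiltfilt(lp, envelope)                # smoothed sub-band TME

# Example: TME of the 300-600 Hz band of a noisy tone burst.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * (t < 0.5) + 0.1 * np.random.randn(fs)
tme = subband_tme(x, fs, 300.0, 600.0)
print(tme.shape, tme.max())
```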

9.
Speech and speaker recognition systems are rapidly being deployed in real-world applications. In this paper, we discuss the details of a system, and its components, for indexing and retrieving multimedia content derived from broadcast news sources. The audio analysis component performs real-time speech recognition to convert the audio to text, with concurrent speaker analysis consisting of segmentation of the audio into acoustically homogeneous sections followed by speaker identification. The output of these two simultaneous processes is used to build indexes for text-based and speaker-based retrieval automatically, without user intervention. The real power of multimedia document processing lies in the possibility of Boolean queries combining text-based and speaker-based user queries; retrieval for such queries entails combining the results of the individual text-based and speaker-based searches. The underlying techniques discussed here can easily be extended to other speech-centric applications and transactions.

10.
The speaker recognition revolution has led to the inclusion of speaker recognition modules in several commercial products. Most published algorithms for speaker recognition focus on text-dependent speaker recognition; in contrast, text-independent speaker recognition is more advantageous, as the client can talk freely to the system. In this paper, text-independent speaker recognition is considered in the presence of degradation effects such as noise and reverberation. Mel-frequency cepstral coefficients (MFCCs), the spectrum, and the log-spectrum are used for feature extraction from the speech signals. These features are processed with a long short-term memory recurrent neural network (LSTM-RNN) as the classification tool to complete the speaker recognition task. The network learns to recognize the speakers efficiently in a text-independent manner when the recording circumstances are the same. The recognition rate reaches 95.33% using MFCCs and increases to 98.7% using the spectrum or log-spectrum. However, the system has difficulty recognizing speakers across different recording environments; hence, speech enhancement techniques such as spectral subtraction and wavelet denoising are used to improve recognition performance to some extent. The proposed approach shows superiority when compared to the algorithm of R. Togneri and D. Pullella (2011).
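The classification stage can be sketched as a small LSTM over per-frame features. This is a generic stand-in, with feature dimensions and speaker counts chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SpeakerLSTM(nn.Module):
    """LSTM-RNN text-independent speaker classifier: consumes a sequence
    of per-frame features (e.g. MFCCs) and emits one speaker decision."""

    def __init__(self, feat_dim=13, hidden_dim=128, num_speakers=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_speakers)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        _, (h_last, _) = self.lstm(x)     # final hidden state summarizes the utterance
        return self.classifier(h_last[-1])

# Example: classify a batch of 4 utterances of 200 MFCC frames each.
model = SpeakerLSTM()
mfcc_batch = torch.randn(4, 200, 13)      # stand-in for real MFCC sequences
logits = model(mfcc_batch)
print(logits.argmax(dim=1))               # predicted speaker index per utterance
```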

11.
In recent years, several text-independent speaker recognition evaluation campaigns have taken place. This paper reports results of the NIST evaluation of 2004 and the NFI-TNO forensic speaker recognition evaluation held in 2003, and reflects on the history of these evaluation campaigns. The effects of speech duration, training handsets, transmission type, and gender mix show the expected behaviour on the DET curves. New results on the influence of language show an interesting dependence of the DET curves on the accent of the speakers. We also report on a number of statistical analysis techniques that have recently been introduced in the speaker recognition community, as well as a new application of analysis of deviance. These techniques are used to determine that the two evaluations held in 2003, by NIST and NFI-TNO, are of statistically different difficulty for the speaker recognition systems.
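A DET curve of the kind analysed in such evaluations is straightforward to compute from trial scores, e.g. with scikit-learn's det_curve; the scores below are synthetic stand-ins for genuine and impostor trials:

```python
import numpy as np
from sklearn.metrics import det_curve

# Hypothetical verification scores: higher means "same speaker".
rng = np.random.default_rng(0)
target_scores = rng.normal(1.0, 1.0, 1000)       # genuine trials
nontarget_scores = rng.normal(-1.0, 1.0, 1000)   # impostor trials

y_true = np.concatenate([np.ones(1000), np.zeros(1000)])
y_score = np.concatenate([target_scores, nontarget_scores])

# A DET curve plots the miss rate against the false-alarm rate
# over all decision thresholds.
fpr, fnr, thresholds = det_curve(y_true, y_score)

# The equal error rate (EER) is where the two error rates cross.
eer_index = np.argmin(np.abs(fpr - fnr))
print(f"EER ~ {(fpr[eer_index] + fnr[eer_index]) / 2:.3f}")
```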

12.
To address the deficiency that the traditional EM algorithm for training GMMs cannot fully exploit information about which Gaussian component the training data belong to, which degrades speaker recognition performance to some extent, the rival penalized EM (RPEM) algorithm is adopted to train the GMM. A batch RPEM algorithm is introduced to address the heavy computation and slow convergence of RPEM, and the variance update of both RPEM and batch RPEM is improved, yielding an improved batch RPEM algorithm. Experiments on the Chains speaker recognition corpus show that the improved batch RPEM algorithm achieves better performance than the traditional EM, RPEM, and batch RPEM algorithms, while greatly improving training efficiency and reducing computation, demonstrating the effectiveness of the proposed improved batch RPEM algorithm for speaker recognition.
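For orientation, the conventional EM-trained GMM baseline that the improved batch RPEM algorithm is measured against can be sketched with scikit-learn. This is plain EM, not RPEM; the data and model sizes are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One EM-trained GMM per enrolled speaker; identification picks the
# model with the highest average log-likelihood on the test frames.
rng = np.random.default_rng(1)
train = {
    "speaker_a": rng.normal(0.0, 1.0, (500, 13)),   # stand-in MFCC frames
    "speaker_b": rng.normal(1.5, 1.2, (500, 13)),
}
models = {name: GaussianMixture(n_components=8, covariance_type="diag",
                                max_iter=100).fit(feats)
          for name, feats in train.items()}

test_utterance = rng.normal(1.5, 1.2, (200, 13))    # frames from speaker_b
scores = {name: gmm.score(test_utterance) for name, gmm in models.items()}
print(max(scores, key=scores.get))                  # -> "speaker_b"
```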

13.
Recognizing people by gait promises to be useful for identifying individuals at a distance, and improved techniques are under development. In this paper, an improved method for gait recognition is proposed. The binarized silhouette of a moving object is first represented by four 1-D signals, the basic image features called distance vectors. The distance vectors are the differences between the bounding box and the silhouette, extracted using four projections of the silhouette. A Fourier transform is employed as a preprocessing step to achieve translation invariance for gait patterns accumulated from silhouette sequences extracted from subjects walking at different speeds and/or at different times. Eigenspace transformation is then applied to reduce the dimensionality of the input feature space, and support vector machine (SVM)-based pattern classification is performed in the lower-dimensional eigenspace for recognition. The input feature space is constructed in two alternative ways. In the first approach, the four projections (1-D signals) are classified independently, and a fusion step then produces the final decision. In the second approach, the four projections are concatenated into one vector, and pattern classification on that single vector is performed in the lower-dimensional eigenspace. The experiments are carried out on the best-known public gait databases: the CMU, USF, SOTON, and NLPR human gait databases. To characterize the performance of the algorithm, the experiments are executed and presented with increasing numbers of gait cycles per person available during training. Finally, the performance of the proposed algorithm is compared with published gait recognition approaches.
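The final two stages, eigenspace transformation and SVM classification, map naturally onto a PCA+SVM pipeline. The sketch below uses synthetic stand-ins for the concatenated distance-vector features of the second approach; all dimensions and subject counts are illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic gait features: one row per gait cycle, one class per subject.
rng = np.random.default_rng(2)
num_subjects, cycles_per_subject, feat_dim = 10, 8, 400
X = np.vstack([rng.normal(s, 1.0, (cycles_per_subject, feat_dim))
               for s in range(num_subjects)])
y = np.repeat(np.arange(num_subjects), cycles_per_subject)

# Eigenspace transformation (PCA) followed by SVM classification.
clf = make_pipeline(PCA(n_components=20), SVC(kernel="linear"))
clf.fit(X, y)
print(clf.score(X, y))    # training accuracy of the PCA+SVM recognizer
```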

14.
Improved speaker clustering initialization and GMM-based multi-speaker recognition
To address the poor precision of the linear initialization method for multi-speaker clustering, an improved clustering initialization method is proposed. The method introduces the Bayesian information criterion (BIC) to detect and split the initial clusters produced by linear initialization, effectively improving the purity of the initial speaker clusters. The method is then applied to a Gaussian mixture model (GMM) multi-speaker recognition system. Experimental results show that the proposed method improves the average cluster purity (ACP) of the speakers by 48.51% and reduces the system's misrecognition rate by 12.09% on average.
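The BIC-based detection step can be sketched with the standard delta-BIC test for a speaker change point; the full-covariance penalty form below is the usual one and an assumption about the paper's exact variant:

```python
import numpy as np

def delta_bic(window: np.ndarray, split: int, lam: float = 1.0) -> float:
    """Delta-BIC test for a speaker change at `split` inside `window`
    (frames x dims): positive values favor modeling the two halves with
    separate full-covariance Gaussians, i.e. a likely speaker boundary."""
    n, d = window.shape
    def logdet(x):
        _, val = np.linalg.slogdet(np.cov(x, rowvar=False))
        return val
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet(window)
            - 0.5 * split * logdet(window[:split])
            - 0.5 * (n - split) * logdet(window[split:])
            - lam * penalty)

# Example: two 100-frame segments from clearly different "speakers".
rng = np.random.default_rng(3)
seg = np.vstack([rng.normal(0, 1, (100, 12)), rng.normal(4, 1, (100, 12))])
print(delta_bic(seg, 100) > 0)   # True: delta-BIC detects the change point
```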

15.
In the speaker space, speech features vary across utterances and over time. This variation arises mainly from changes in the phonetic information and the speaker information contained in the speech data; separating these two kinds of information makes robust speaker recognition possible. Assuming that the space with large speaker variation is the "speech space" and the space with small speaker variation is the "speaker space", the phonetic and speaker information are separated by a subspace method, and speaker identification and speaker verification methods are proposed on this basis. Comparative experiments against conventional methods show that robust speaker models can be built from a small amount of training data.

16.
Improved gait recognition by gait dynamics normalization
Potential sources of gait biometrics derive from two aspects: gait shape and gait dynamics. We show that improved gait recognition can be achieved by normalizing dynamics and focusing on shape information. We normalize gait dynamics using a generic walking model, captured by a population hidden Markov model (pHMM) defined over a set of individuals. The states of this pHMM represent gait stances over one gait cycle, and the observations are the silhouettes of the corresponding gait stances. For each sequence, we first use Viterbi decoding of the gait dynamics to arrive at one dynamics-normalized, averaged gait cycle of fixed length. The distance between two sequences is the distance between the two corresponding dynamics-normalized gait cycles, quantified as the sum of the distances between corresponding gait stances. Distances between two silhouettes from the same generic gait stance are computed in the linear discriminant analysis space, so as to maximize discrimination between persons while minimizing variation of the same subject under different conditions. The distance computation is constructed to be invariant to dilations and erosions of the silhouettes, which helps handle variations in silhouette shape that can occur with changing imaging conditions. We present results on three different, publicly available data sets. First, we consider the HumanID Gait Challenge data set, the largest available gait benchmarking data set (122 subjects), exercising five different factors: viewpoint, shoe, surface, carrying condition, and time. We significantly improve performance on the hard experiments involving surface change and briefcase-carrying conditions. Second, we show improved performance on the UMD gait data set, which exercises time variation for 55 subjects. Third, on the CMU MoBo data set, we show results for matching across different walking speeds. It is worth noting that there was no separate training for the UMD and CMU data sets.
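The Viterbi decoding step is generic dynamic programming; a minimal sketch follows, with illustrative stance counts and random observation scores rather than the paper's pHMM:

```python
import numpy as np

def viterbi(log_trans, log_obs, log_init):
    """Generic Viterbi decoding: the most likely state sequence given
    log transition probabilities (S x S), per-frame log observation
    scores (T x S), and log initial probabilities (S,). In the gait
    setting the states are the pHMM gait stances and the observation
    scores come from comparing each silhouette to the stance exemplars."""
    T, S = log_obs.shape
    delta = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Example with 3 stances and 5 frames; all numbers are illustrative.
rng = np.random.default_rng(4)
print(viterbi(np.log(np.full((3, 3), 1 / 3)),
              rng.normal(0, 1, (5, 3)),
              np.log(np.full(3, 1 / 3))))
```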

17.
Multi-stream automatic speech recognition (MS-ASR) has been confirmed to boost recognition performance in noisy conditions. In such a system, the generation and fusion of the streams are the essential parts, and they need to be designed so as to reduce the effect of noise on the final decision. This paper shows how to improve the performance of MS-ASR by addressing two questions: (1) how many streams should be combined, and (2) how to combine them. First, we propose a novel approach based on stream reliability to select the number of streams to be fused. Second, a fusion method based on parallel hidden Markov models is introduced. Applying the method to two datasets (TIMIT and RATS) with different noises, we show an improvement of MS-ASR.
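One common way to realize reliability-based weighting is inverse-entropy weighting of per-stream posteriors, sketched below as an assumed stand-in for the paper's reliability measure:

```python
import numpy as np

def fuse_streams(stream_posteriors, eps=1e-12):
    """Reliability-weighted fusion of per-stream class posteriors: each
    stream is weighted by the inverse entropy of its posterior, so
    confident (low-entropy) streams dominate the combined decision."""
    weights = []
    for p in stream_posteriors:
        entropy = -np.sum(p * np.log(p + eps))
        weights.append(1.0 / (entropy + eps))
    weights = np.array(weights) / np.sum(weights)
    fused = sum(w * p for w, p in zip(weights, stream_posteriors))
    return fused / fused.sum()

# Example: a confident stream outvotes two uncertain ones.
streams = [np.array([0.90, 0.05, 0.05]),   # low entropy -> high weight
           np.array([0.40, 0.35, 0.25]),
           np.array([0.30, 0.40, 0.30])]
print(fuse_streams(streams).argmax())      # -> class 0
```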

18.
In text-independent speaker identification, to improve the robustness of the system under telephone-speech conditions, this paper proposes applying score normalization, a technique commonly used in speaker verification, to speaker identification: the scores of the test utterance against the different speaker models are normalized separately, and the speaker model closest to the test utterance is selected as the system output, effectively improving system performance. Speaker identification experiments on the NIST'03 1spk database demonstrate the effectiveness of score normalization for speaker identification.
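The score-normalization idea carries over directly from verification; a minimal Z-norm sketch follows, where the cohort statistics are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

def znorm(raw_score, impostor_scores):
    """Z-norm score normalization: a model's raw score for a test
    utterance is standardized by the mean and standard deviation of the
    scores that model assigns to impostor (cohort) utterances, making
    scores comparable across speaker models."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (raw_score - mu) / sigma

# Identification: pick the model with the best normalized score.
rng = np.random.default_rng(5)
cohorts = {"spk1": rng.normal(-2.0, 0.5, 200), "spk2": rng.normal(-1.0, 1.5, 200)}
raw = {"spk1": -0.8, "spk2": -0.5}
normalized = {m: znorm(raw[m], cohorts[m]) for m in raw}
print(max(normalized, key=normalized.get))   # spk1 wins despite lower raw score
```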

19.
Automatic speech recognition is central to natural person-to-machine interaction. Owing to the high disparity of speaking styles, speech recognition demands sophisticated methods to handle this irregularity. A speech recognition system can operate in numerous distinct modes: speaker-dependent or speaker-independent speech; isolated, continuous, or spontaneous speech recognition; and small to very large vocabularies. The Punjabi language is spoken by about 104 million people in India, Pakistan, and other countries with Punjabi migrants. Punjabi is written in the Gurmukhi script in Indian Punjab and in the Shahmukhi script in Pakistani Punjab. The objective of this paper is to build a speaker-independent automatic spontaneous speech recognition system for the Punjabi language; the system is also capable of recognizing spontaneous live Punjabi speech. So far, no work has been done on spontaneous speech recognition for Punjabi. The user interface for the Punjabi live-speech system is created using Java. To date, the system has been trained with 6012 Punjabi words and 1433 Punjabi sentences. Performance is measured in terms of recognition accuracy, which is 93.79% for Punjabi words and 90.8% for Punjabi sentences.

20.
Ergonomics, 2012, 55(11): 1543-1555
The optimal type and amount of secondary feedback for data entry with automatic speech recognition were investigated. Six feedback conditions, varying the information channel for feedback (visual or auditory), the delay before feedback, and the amount of feedback history, were compared with a no-feedback control. In addition, the presence of a dialogue requiring users to confirm a word choice when the speech recognizer could not distinguish between two words was studied. The word-confirmation dialogue increased recognition accuracy by about 5% with no significant increase in data entry time. The type of feedback affected both accuracy and data entry time. With no feedback, data entry time was minimal but there were many errors. Any type of feedback/error correction vastly improved accuracy, but auditory feedback provided after a string of data was spoken increased data entry time by a factor of three. Depending on task conditions, visual or auditory feedback following each word spoken is recommended.

