Similar documents
Found 20 similar documents (search time: 15 ms)
1.
The field of automatic speech recognition (ASR) is discussed from the viewpoint of pattern recognition (PR). This tutorial examines the problem area, its methods, successes and failures, focusing on the nature of the speech signal and techniques to accomplish useful data reduction. Comparison is made with other areas of PR. Suggestions are given for areas of future progress.

2.
Automatic speech recognition (ASR) systems suffer from variation in the acoustic quality of input speech. Speech may be produced in noisy environments, different speakers have their own speaking styles, and variation can be observed even within the same utterance, or from the same speaker in different moods. All these uncertainties and variations should be normalized to obtain a robust ASR system. In this paper, we apply and evaluate different approaches to normalizing acoustic quality within an utterance for robust ASR. Several HMM (hidden Markov model)-based systems using utterance-level, word-level, and monophone-level normalization are evaluated against an HMM-SM (subspace method)-based system using monophone-level normalization. SM can represent variations of fine structures in sub-words as a set of eigenvectors, and so performs better than HMM at the monophone level. Experimental results show that word accuracy is significantly improved by the HMM-SM-based system with monophone-level normalization compared with the typical HMM-based system with utterance-level normalization, in both clean and noisy conditions. The results also suggest that monophone-level normalization using SM outperforms that using HMM.
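The contrast between utterance-level and finer-grained (e.g. monophone-level) normalization discussed in item 2 can be illustrated with simple cepstral mean normalization. This is a minimal sketch, not the paper's HMM-SM method; the function names and segment boundaries are illustrative:

```python
import numpy as np

def utterance_cmn(feats):
    """Subtract one global mean over the whole utterance (utterance-level)."""
    return feats - feats.mean(axis=0, keepdims=True)

def segment_cmn(feats, bounds):
    """Subtract a separate mean within each segment (e.g. per monophone).

    bounds: list of (start, end) frame-index pairs covering the utterance.
    """
    out = feats.copy()
    for s, e in bounds:
        out[s:e] -= out[s:e].mean(axis=0, keepdims=True)
    return out

# toy cepstral features with a constant additive bias
feats = np.random.randn(100, 13) + 5.0
g = utterance_cmn(feats)
m = segment_cmn(feats, [(0, 50), (50, 100)])
```

Finer-grained normalization removes bias that varies across the utterance, at the cost of estimating each mean from fewer frames.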

3.
Ergonomics, 2012, 55(11), 1543-1555
The optimal type and amount of secondary feedback for data entry with automatic speech recognition were investigated. Six feedback conditions, varying the information channel (visual or auditory), the delay before feedback, and the amount of feedback history, were compared with a no-feedback control. In addition, the presence of a dialogue requiring users to confirm a word choice when the speech recognizer could not distinguish between two words was studied. The word confirmation dialogue increased recognition accuracy by about 5% with no significant increase in data entry time. Type of feedback affected both accuracy and data entry time. When no feedback was available, data entry time was minimal but there were many errors. Any type of feedback/error correction vastly improved accuracy, but auditory feedback provided after a string of data was spoken increased data entry time by a factor of three. Depending on task conditions, visual or auditory feedback following each spoken word is recommended.

4.
Ergonomics, 2012, 55(11), 1943-1957
Errors, whether created by the user, the recognizer, or inadequate system design, are an important consideration in the more widespread and successful use of automatic speech recognition (ASR). An experiment is described in which recognition errors are studied under different types of feedback. Subjects entered data verbally into a microcomputer under four experimental conditions: orthogonal combinations of spoken and visual feedback, presented concurrently or terminally after six items. Although no significant differences in error rates or speed of data entry were found across the conditions, analysis of the time penalty for error correction indicated that, as a general rule, there is a small timing advantage for terminal feedback when the error rate is low. Subjects did not monitor visual feedback with the same accuracy as spoken feedback, with a larger number of incorrect data entry strings being confirmed as correct. Further evidence is given for the use of ‘second best’ recognition data, since correct recognition on re-entry could be increased from 83.0% to 92.4% when the first-choice recognition was deleted from the second attempt. Finally, the implications for error correction protocols in system design are discussed.

5.
6.
Lectures can be digitally recorded and replayed to provide multimedia revision material for students who attended the class and a substitute learning experience for students unable to attend. Deaf and hard of hearing people can find it difficult to follow speech through hearing alone or to take notes while they are lip-reading or watching a sign-language interpreter. Synchronising the speech with text captions can ensure deaf students are not disadvantaged and assist all learners to search for relevant specific parts of the multimedia recording by means of the synchronised text. Automatic speech recognition has been used to provide real-time captioning directly from lecturers’ speech in classrooms but it has proved difficult to obtain accuracy comparable to stenography. This paper describes the development, testing and evaluation of a system that enables editors to correct errors in the captions as they are created by automatic speech recognition and makes suggestions for future possible improvements.

7.
Automatic recognition of children's speech is a challenging topic in computer-based speech recognition systems. The conventional feature extraction method, the Mel-frequency cepstral coefficient (MFCC), is not efficient for children's speech recognition. This paper proposes a novel fuzzy-based discriminative feature representation to address the recognition of Malay vowels uttered by children. Because of age-dependent variation in acoustical speech parameters, the performance of automatic speech recognition (ASR) systems degrades on children's speech. To solve this problem, this study addresses the representation of relevant and discriminative features for children's speech recognition. The methods include extraction of MFCCs with a narrower filter bank, followed by a fuzzy-based feature selection method. The proposed feature selection provides relevant, discriminative, and complementary features. Conflicting objective functions measuring the goodness of the features must be satisfied, so a fuzzy formulation of the problem and fuzzy aggregation of the objectives are used to handle the uncertainties involved. The proposed method can reduce dimensionality without compromising the recognition rate. To assess its capability, the study analyzed six Malay vowels recorded from 360 children aged 7 to 12. After feature extraction, two well-known classification methods, MLP and HMM, were employed for the recognition task, with optimal parameter adjustment performed for each classifier. The experiments were conducted in a speaker-independent manner. The proposed method outperformed conventional MFCCs and a number of conventional feature selection methods on the children's speech recognition task. The fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative ability, enhancing the separation between vowel classes.
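The "narrower filter bank" idea in item 7 can be sketched by building a triangular mel filterbank whose triangles are shrunk around their centre frequencies. The `width` parameter and the narrowing scheme below are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000, width=1.0):
    """Triangular mel filterbank; width < 1.0 narrows each triangle
    around its centre frequency (illustrative narrowing scheme)."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz = mel_to_hz(mels)
    centres = hz[1:-1]
    lo = centres + width * (hz[:-2] - centres)   # pull lower edge toward centre
    hi = centres + width * (hz[2:] - centres)    # pull upper edge toward centre
    bins = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    fb = np.zeros((n_filters, bins.size))
    for i in range(n_filters):
        up = (bins - lo[i]) / (centres[i] - lo[i])
        down = (hi[i] - bins) / (hi[i] - centres[i])
        fb[i] = np.clip(np.minimum(up, down), 0.0, None)
    return fb

fb_std = mel_filterbank(width=1.0)
fb_narrow = mel_filterbank(width=0.6)
```

A narrower bank reduces the overlap between adjacent filters, which the abstract suggests benefits the higher-pitched speech of children.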

8.
This communication discusses how automatic speech recognition (ASR) can support universal access to communication and learning through the cost-effective production of text synchronised with speech and describes achievements and planned developments of the Liberated Learning Consortium to: support preferred learning and teaching styles; assist those who for cognitive, physical or sensory reasons find notetaking difficult; assist learners to manage and search online digital multimedia resources; provide automatic captioning of speech for deaf learners or when speech is not available or suitable; assist blind, visually impaired or dyslexic people to read and search material; and, assist speakers to improve their communication skills.

9.
Speech is the most natural form of communication for human beings. However, in situations where audio speech is not available because of disability or adverse environmental condition, people may resort to alternative methods such as augmented speech, that is, audio speech supplemented or replaced by other modalities, such as audiovisual speech, or Cued Speech. This article introduces augmented speech communication based on Electro-Magnetic Articulography (EMA). Movements of the tongue, lips, and jaw are tracked by EMA and are used as features to create hidden Markov models (HMMs). In addition, automatic phoneme recognition experiments are conducted to examine the possibility of recognizing speech only from articulation, that is, without any audio information. The results obtained are promising, which confirm that phonetic features characterizing articulation are as discriminating as those characterizing acoustics (except for voicing). This article also describes experiments conducted in noisy environments using fused audio and EMA parameters. It has been observed that when EMA parameters are fused with noisy audio speech, the recognition rate increases significantly as compared with using noisy audio speech only.

10.
11.
We address the novel problem of jointly evaluating multiple speech patterns for automatic speech recognition and training. We propose solutions based on both the non-parametric dynamic time warping (DTW) algorithm and the parametric hidden Markov model (HMM), and show that a hybrid approach is quite effective for noisy speech recognition. We extend the concept to HMM training in which some patterns may be noisy or distorted. Utilizing the concept of a “virtual pattern” developed for joint evaluation, we propose selective iterative training of HMMs. When these algorithms are evaluated on burst/transient noisy speech and isolated-word recognition, significant improvements in recognition accuracy are obtained over algorithms that do not use the joint evaluation strategy.
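The non-parametric DTW component mentioned in item 11 is standard; a minimal sketch of the classic dynamic time warping distance (Euclidean local cost, three-way recursion) might look like this. The joint multi-pattern evaluation itself is not shown:

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between feature sequences a: (n, d) and b: (m, d),
    using Euclidean local cost and the classic 3-way recursion."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# y stretches the middle of x in time; DTW absorbs the stretch
x = np.array([[0.0], [1.0], [2.0]])
y = np.array([[0.0], [1.0], [1.0], [2.0]])
```

For example, `dtw_distance(x, y)` is 0 because the repeated frame in `y` is aligned to the same frame of `x` by the warping path.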

12.
In this paper, we present an on-line learning neural network model, Dynamic Recognition Neural Network (DRNN), for real-time speech recognition. The property of accumulative learning of the DRNN makes it very suitable for real-time speech recognition with on-line learning. A comparison between the DRNN and Hidden Markov Model (HMM) shows that the computational complexity of the former is lower than that of the latter in both training and recognition. Encouraging results are obtained when the DRNN is tested on a BUPT digit database (Mandarin) and on the on-line learning of twenty isolated English computer command words.

13.
First, a system framework incorporating prosodic information is presented. Then, addressing the characteristics of Mandarin Chinese, the problems of modelling-unit selection and model training in a prosody-dependent speech recognition system are solved, and a prosody-dependent recognition system is built within the multiple-space distribution hidden Markov model (MSD-HMM) framework. Finally, speech recognition experiments verify the effectiveness of the method: on the "863" test set, it achieves a tonal-syllable recognition accuracy of 76.18%.

14.
Robustness is one of the most important topics for automatic speech recognition (ASR) in practical applications. Monaural speech separation based on computational auditory scene analysis (CASA) offers a solution to this problem. In this paper, a novel system is presented to separate the monaural speech of two talkers. Gaussian mixture models (GMMs) and vector quantizers (VQs) are used to learn the grouping cues from isolated clean data for each speaker. Given an utterance, speaker identification is first performed to identify the two speakers present in the utterance; the factorial-max vector quantization model (MAXVQ) is then used to infer the mask signals, and finally the utterance of the target speaker is resynthesized in the CASA framework. Recognition results on the 2006 speech separation challenge corpus show that the proposed system can significantly improve the robustness of ASR.

15.
16.
A large-vocabulary continuous speech recognition system consists of multiple components, and recognition errors are influenced by many factors, so system developers need to analyse the different causes of errors. Based on the basic theory of speech recognition, this paper presents a principle for the classified analysis of errors, dividing recognition errors by cause into four categories: decoding errors, acoustic-model errors, language-model errors, and combined acoustic-and-language errors. A statistical analysis of the classified errors is then given. Experiments show that this classified error analysis provides a useful reference for improving the system.

17.
Channel distortion is one of the major factors which degrade the performances of automatic speech recognition (ASR) systems. Current compensation methods are generally based on the assumption that the channel distortion is a constant or slowly varying bias in an utterance or globally. However, this assumption is not sustained in a more complex circumstance, when the speech records being recognized are from many different unknown channels and have parts of the spectrum completely removed (e.g. band-limited speech). On the one hand, different channels may cause different distortions; on the other, the distortion caused by a given channel varies over the speech frames when parts of the speech spectrum are removed completely. As a result, the performance of the current methods is limited in complex environments. To solve this problem, we propose a unified framework in which the channel distortion is first divided into two subproblems, namely, spectrum missing and magnitude changing. Next, the two types of distortions are compensated with different techniques in two steps. In the first step, the speech bandwidth is detected for each utterance and the acoustic models are synthesized with clean models to compensate for spectrum missing. In the second step, the constant term of the distortion is estimated via the expectation-maximization (EM) algorithm and subtracted from the means of the synthesized model to further compensate for magnitude changing. Several databases are chosen to evaluate the proposed framework. The speech in these databases is recorded in different channels, including various microphones and band-limited channels. Moreover, to simulate more types of spectrum missing, various low-pass and band-pass filters are used to process the speech from the chosen databases. 
Although these databases and their filtered versions make the channel conditions more challenging for recognition, experimental results show that the proposed framework can substantially improve the performance of ASR systems in complex channel environments.
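The second compensation step in item 17, estimating a constant channel bias with EM and subtracting it, can be sketched for a deliberately simplified clean-speech model: a GMM with known means and a shared spherical variance. The real system estimates the bias against synthesized acoustic models, which is not reproduced here:

```python
import numpy as np

def estimate_channel_bias(x, means, var=1.0, n_iter=10):
    """EM estimate of a constant cepstral bias b such that x ~ clean + b,
    assuming a clean GMM with given means and shared spherical variance."""
    b = np.zeros(means.shape[1])
    for _ in range(n_iter):
        # E-step: posterior of each mixture for each bias-compensated frame
        diff = (x - b)[:, None, :] - means[None, :, :]     # (T, K, d)
        logp = -0.5 * (diff ** 2).sum(-1) / var            # (T, K)
        logp -= logp.max(axis=1, keepdims=True)            # numerical stability
        post = np.exp(logp)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: bias = mean residual between frames and expected clean mean
        b = (x - post @ means).mean(axis=0)
    return b

# synthetic check: two well-separated clean components plus a known bias
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [4.0, 4.0]])
clean = means[rng.integers(0, 2, 200)] + 0.1 * rng.standard_normal((200, 2))
true_b = np.array([1.5, -0.5])
b_hat = estimate_channel_bias(clean + true_b, means, var=0.01)
```

Subtracting `b_hat` from the model means (or from the features) then compensates the constant part of the channel distortion.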

18.
Based on a continuous-space model of emotion, an improved ranked-voting algorithm is proposed to fuse multiple emotion classifiers, achieving good emotion recognition results. First, three classifiers are designed on the basis of hidden Markov models (HMM) and artificial neural networks (ANN); the improved ranked-voting algorithm is then used to fuse the three classifiers. Experiments on a Mandarin emotional speech corpus and a German emotional speech corpus show that, compared with several conventional fusion algorithms, the improved ranked-voting method achieves better fusion and clearly outperforms any single classifier. The algorithm is not only simple but also portable, and can be used to fuse any number of other emotion classifiers.
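A plain ranked-voting (Borda count) fusion, the baseline that the improved algorithm in item 18 builds on, can be sketched as follows; the paper's improved weighting is not reproduced:

```python
def borda_fusion(rankings):
    """Fuse classifier decisions by Borda count: each classifier ranks the
    classes best-first, and a class at position p earns (n_classes - p) points.
    Returns the class with the highest total score."""
    classes = rankings[0]
    scores = {c: 0 for c in classes}
    n = len(classes)
    for ranking in rankings:
        for pos, c in enumerate(ranking):
            scores[c] += n - pos
    return max(scores, key=scores.get)

# three hypothetical emotion classifiers, each ranking four classes best-first
r1 = ["happy", "angry", "sad", "neutral"]
r2 = ["angry", "happy", "sad", "neutral"]
r3 = ["happy", "sad", "angry", "neutral"]
winner = borda_fusion([r1, r2, r3])
```

Because every rank contributes, this fusion uses more information than majority voting over top-1 decisions, which is the property the ranked-voting approach exploits.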

19.
An HTK-based specific-word speech recognition system (cited 1 time: 1 self-citation, 0 other)
After half a century of development, speech recognition technology has matured and is now widely applied in voice dialling, digital remote control, industrial control, and other fields. Because of the limitations of the acoustic and language models in common use, computers can recognize only a limited set of words or sentences, and recognition systems often produce erroneous results when the language changes. To address this problem, a Chinese-English specific-word speech recognition system was built on the principles of hidden Markov models using the HTK speech-processing toolkit. The entire build process is driven by code, so the system can quickly generate a corresponding recognition model after the training data and dictionary are replaced.

20.
A survey of noise-robust speech recognition (cited 3 times: 1 self-citation, 2 other)
Addressing the problem of speech recognition in noisy environments, this survey discusses existing noise-robust speech recognition techniques and outlines the main research problems. Following the structure of a speech recognition system, it classifies the techniques into signal-space, feature-space, and model-space methods, and analyses the characteristics, implementation, and application of each class of robust technique. Directions for further research are then outlined.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号