Similar documents
20 similar documents found (search time: 15 ms)
1.
Considerable progress has been made in speech-recognition technology, and nowhere has this progress been more evident than in the area of large-vocabulary recognition (LVR). Laboratory systems are capable of transcribing continuous speech from any speaker with average word-error rates between 5% and 10%. If speaker adaptation is allowed, then after 2 or 3 minutes of speech the error rate drops well below 5% for most speakers. LVR systems had previously been limited to dictation applications, since the systems were speaker dependent and required words to be spoken with a short pause between them. However, the capability to recognize natural continuous-speech input from any speaker opens up many more applications. This article discusses the principles and architecture of LVR systems and identifies the key issues affecting their future deployment. To illustrate the various points raised, the Cambridge University HTK system is described. This system is a modern design that gives state-of-the-art performance, and it is typical of the current generation of recognition systems.

2.
Compared with desktop environments, speech recognition accuracy over telephone networks remains relatively low, and improving it has become a pressing need for practical telephone speech-recognition applications. Previous studies have shown that the sharp drop in telephone speech-recognition accuracy is usually caused by data mismatch arising from different telephone channels in the test and training environments. This paper therefore proposes a statistical-model-based dynamic channel compensation algorithm (SMDC) to reduce the difference between them, using Bayesian estimation to dynamically track the time-varying characteristics of the telephone channel. Experimental results show that the character error rate (CER) of large-vocabulary continuous speech recognition is reduced by about 27% relative, and the word error rate (WER) of isolated words by about 30% relative. Meanwhile, the structural delay and computational complexity of the algorithm are small, with an average delay of about 200 ms, so it can readily be embedded in practical telephone speech-recognition applications.
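The channel-tracking idea can be illustrated as a recursive bias estimate in the cepstral domain. This is a deliberately simplified stand-in for the paper's Bayesian SMDC update; the function name and the forgetting factor `alpha` are assumptions for illustration:

```python
import numpy as np

def track_channel_bias(cepstra, alpha=0.995):
    """Track a slowly varying telephone-channel bias in the cepstral
    domain and subtract it frame by frame. A convolutive channel adds a
    near-constant offset to cepstra, so an exponentially weighted
    running mean serves as a simple dynamic channel estimate."""
    bias = np.zeros(cepstra.shape[1])
    compensated = np.empty_like(cepstra)
    for t, frame in enumerate(cepstra):
        # Recursive update: older evidence is discounted by alpha.
        bias = alpha * bias + (1.0 - alpha) * frame
        compensated[t] = frame - bias
    return compensated
```

With a small `1 - alpha`, the estimate follows slow channel drift while staying insensitive to fast phonetic variation, which is the trade-off any dynamic channel tracker must make.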

3.
4.
This paper discusses robust speech-section detection using audio and video modalities. Most of today's speech recognition systems require speech-section detection prior to any further analysis, and the accuracy of the detected speech sections affects speech recognition accuracy. Because audio modalities are intrinsically disturbed by acoustic noise, we have been researching video-modality speech-section detection that detects deformations in images of the speech organs. Video modalities are robust to acoustic noise, but their detected sections are longer than the audio speech sections, because deformations of the related organs begin before speech in preparation for articulating the first phoneme, and because the settling-down motion lasts longer than the speech. We have verified that the inaccurately detected sections caused by this excess length degrade the speech recognition rate, producing insertion errors. To reduce insertion errors and enhance the robustness of speech detection, we propose a method that takes advantage of both types of modalities. Our experiments confirm that the proposed method reduces the insertion error rate and increases the recognition rate in noisy environments.

5.
Thanks to the availability of various mobile applications, many users are shifting from desktop environments, e.g., PCs and laptops, to mobile devices, e.g., smartphones and tablets. However, some desktop applications still have no counterparts on mobile devices, such as certain integrated development environments (e.g., Eclipse) and industrial automatic-control systems. In this paper, we propose Modeap, a platform-independent mobile cloud service that can push desktop applications developed for various operating systems from cloud servers to mobile devices. Modeap follows a design principle of complete detachment and regeneration of the desktop user interface: the essential graphical primitives of the original desktop application are intercepted and translated into standard web-based graphical primitives, so that users can interact with remote cloud applications via mobile web browsers. In this way, all desktop applications built upon the same set of graphical primitives can be used on mobile devices with great flexibility, without installing any new software. We have developed a proof-of-concept prototype that delivers Windows applications from a cloud server to mobile web browsers. Extensive experiments show that the proposed framework achieves our design goals with low latency and bandwidth consumption.

6.
One task of a Chinese speech understanding system is to convert the Chinese monosyllables obtained by a speech recognition system into the correct Chinese characters and words, and further into Chinese phrases and sentences, forming, together with the speech recognition system, a complete speech-to-text conversion system. Using a closed-loop feedback scheme for Chinese speech recognition and understanding, this paper builds on word-level recognition and understanding to further recognize and understand structural Chinese phrases, obtaining the expected results. Finally, the experimental results and the feedback-based recognition and understanding scheme are discussed.

7.
于春雪 《电声技术》2012,36(1):55-59,73
An embedded system was built around the ARM processor S3C2440A, with the audio codec chip UDA1341TS used to encode and decode the speech signal, and speech-recognition technology applied to implement voice control. The design principles and operating mechanism of the system are introduced, the hardware and software design of the control menu and the principle of the recognition algorithm are described, and the test method is given. Experimental results show that the system can perform voice control of specific commands with high recognition accuracy and good real-time performance, and can adapt to complex working environments.

8.
The past decade (1990-2000) has witnessed substantial advances in speech recognition technology, which, combined with the increase in computational power and storage capacity, have resulted in a variety of commercial products already or soon to be on the market. The authors review the state of the art in core technology, large-vocabulary continuous speech recognition, with a view toward highlighting recent advances. We then highlight issues in moving toward applications, discussing system efficiency, portability across languages and tasks, and enhancing the system output by adding tags and nonlinguistic information. Current performance in speech recognition and outstanding challenges for three classes of applications (dictation, audio indexation, and spoken language dialogue systems) are discussed.

9.
In this paper, a model of speech recognition and understanding in noisy environments is put forward, based on the process by which humans perceive speech. For speech recognition, a two-level modular Extended Associative Memory Neural Network (EAMNN) is adopted, whose learning speed is nine times faster than that of a conventional BP network and which offers high adaptability, robustness, fault tolerance, and associative-memory ability for noisy speech signals. For speech understanding, a hierarchical analysis-and-error-checking structure combining statistical inference and syntactic rules is adopted: the statistical-inference base picks out candidates from the speech recognizer and predicts the next word, while the syntactic-rule base effectively reduces recognition errors and acoustic-level candidates. Sentence recognition is then achieved by comparing and correcting errors through information feedback that guides subsequent speech processing.

10.
张靖  俞一彪 《通信技术》2020,(3):618-624
When a speaker recognition system is deployed, its performance degrades sharply once the application environment differs from the training environment. Because environmental noise is highly variable, the noise encountered in actual use cannot be predicted at training time. This paper therefore introduces the idea of environment self-learning and adaptation: an improved Vector Taylor Series (VTS) is used to characterize the statistical relationship between the environmental-noise model and the speaker's speech model, yielding a robust speaker recognition algorithm with environment self-learning ability. Whenever the environment changes during operation, noise signals collected before speech input are used to iteratively update the parameters of the noise model; the speaker models are then adapted to the actual environment through the statistical relationship established by VTS, compensating for the environmental mismatch. Speaker identification experiments show that the proposed method significantly improves recognition performance for various noise types under low signal-to-noise-ratio conditions.
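The statistical relationship VTS establishes between clean and noisy models can be sketched with the standard first-order mismatch function in the log-mel domain. This mean-only sketch omits the channel term and the covariance update that the paper's improved VTS would include:

```python
import numpy as np

def vts_adapt_means(clean_means, noise_mean):
    """Compensate log-mel-spectral GMM mean vectors for additive noise
    using the VTS mismatch function y = x + log(1 + exp(n - x)),
    applied per dimension. clean_means: (M, D) mixture means of the
    speaker model; noise_mean: (D,) mean of the current noise model."""
    return clean_means + np.log1p(np.exp(noise_mean - clean_means))
```

The two limiting cases make the behavior clear: where speech dominates the noise, the adapted mean stays close to the clean mean; where noise dominates, it is pulled toward the noise mean.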

11.
The mismatch between training and test environments is the main cause of degraded speech recognition performance in practice. Building on a study of noisy environments for speech recognition and of the Mel-frequency cepstral coefficient (MFCC) pipeline, and based on the idea of cumulative-distribution-function matching, this paper presents three robust feature-extraction methods that improve a system's adaptability to different environments by reducing the mismatch between training and test conditions. Their theoretical foundations and basic algorithms are analyzed, and they are implemented and validated on the Aurora 2.0 database, providing a reference for choosing speech recognition systems in practical applications.
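Cumulative-distribution-function matching can be illustrated as per-dimension quantile mapping of test features onto the training distribution. This is a minimal generic sketch of the idea, not any of the article's three specific methods:

```python
import numpy as np

def histogram_equalize(test_feat, train_feat):
    """Map each test feature dimension through its empirical CDF onto
    the training CDF (quantile matching), so the transformed test
    features follow the training distribution dimension by dimension."""
    out = np.empty_like(test_feat)
    for d in range(test_feat.shape[1]):
        # Rank of each test frame -> empirical CDF value in (0, 1).
        ranks = np.argsort(np.argsort(test_feat[:, d]))
        u = (ranks + 0.5) / len(test_feat)
        # Inverse training CDF via empirical quantiles.
        out[:, d] = np.quantile(train_feat[:, d], u)
    return out
```

Unlike mean normalization alone, this matches all moments of the marginal distributions, which is why CDF matching can compensate nonlinear environmental distortions.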

12.
The fundamentals of speech recognition are reviewed. The dimensions of the speech recognition task, speech feature analysis, pattern classification using hidden Markov models, language processing, and the current accuracy of speech recognition systems are discussed. The applications of speech recognition in telecommunications, voice dictation, speech understanding for data retrieval, and consumer products are examined.
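The HMM pattern classification covered by the review centers on Viterbi decoding of the most likely state sequence. A minimal log-domain sketch (the state space, transitions, and emission scores below are illustrative, not from the article):

```python
import numpy as np

def viterbi(log_a, log_b, log_pi):
    """Most likely HMM state path for an observation sequence.
    log_a: (S, S) log transition matrix, log_b: (T, S) per-frame log
    emission scores, log_pi: (S,) log initial state probabilities."""
    T, S = log_b.shape
    delta = log_pi + log_b[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor per state
    for t in range(1, T):
        scores = delta[:, None] + log_a      # (from_state, to_state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

In a real recognizer the emission scores come from acoustic models (e.g., Gaussian mixtures) and the language model constrains the transitions, but the dynamic program is the same.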

13.
While several proactive acoustic feedback (Larsen-effect) cancellation schemes have been presented for speech applications with short acoustic feedback paths as encountered in hearing aids, these schemes fail with the long impulse responses inherent to, for instance, public address systems. We derive a new prediction error method (PEM)-based scheme (referred to as PEM-AFROW) which identifies both the acoustic feedback path and the nonstationary speech source model. A cascade of a short- and a long-term predictor removes the coloring and periodicity in voiced speech segments, which account for the unwanted correlation between the loudspeaker signal and the speech source signal. The predictors calculate row operations which are applied to prewhiten the speech source signal, resulting in a least squares system that is solved recursively by means of normalized least mean square or recursive least squares algorithms. Simulations show that this approach is indeed superior to earlier approaches whenever long acoustic channels are dealt with.
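The recursive identification step can be sketched with a plain normalized-LMS adaptive filter estimating the feedback path. PEM-AFROW additionally prewhitens both signals with the short- and long-term speech predictors; that step is omitted here, so this is only the core adaptation loop, with illustrative parameter values:

```python
import numpy as np

def nlms_identify(x, d, taps=8, mu=0.5, eps=1e-8):
    """Normalized LMS identification of an acoustic path from input x
    (loudspeaker signal) to observation d (microphone signal).
    Returns the estimated FIR feedback-path coefficients."""
    w = np.zeros(taps)
    buf = np.zeros(taps)  # most recent input samples, newest first
    for n in range(len(x)):
        buf = np.roll(buf, 1)
        buf[0] = x[n]
        e = d[n] - w @ buf                     # a priori error
        w += mu * e * buf / (buf @ buf + eps)  # normalized update
    return w
```

With colored, periodic speech as input, this plain update misconverges because loudspeaker and source signals are correlated; removing that correlation is precisely what the PEM prewhitening stage contributes.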

14.
There has been progress in improving speech recognition using a tightly coupled modality such as lip movement, and in using additional input interfaces to improve recognition of commands in multimodal human-computer interfaces such as speech-and-pen systems. However, there has been little work that attempts to improve the recognition of spontaneous, conversational speech by adding information from a loosely coupled modality. This study investigated that idea by integrating information from gaze into an automatic speech recognition (ASR) system. A probabilistic framework for multimodal recognition was formalized and applied to the specific case of integrating gaze and speech. Gaze-contingent ASR systems were developed from a baseline ASR system by redistributing language-model probability mass according to visual attention. These systems were tested on a corpus of matched eye movements and related spontaneous conversational British English speech segments (n = 1355) for a visual, goal-driven task. The best-performing systems had word error rates similar to the baseline ASR system and showed an increase in keyword-spotting accuracy. The core ideas of this work may be useful for developing robust speech-centric multimodal decoding functions.
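Redistributing language-model probability mass toward visually attended words can be sketched at the unigram level. The interpolation scheme, the `boost` parameter, and the function name are assumptions for illustration; the paper's framework is more general:

```python
def gaze_redistribute(lm_probs, attended, boost=0.3):
    """Shift a fraction `boost` of unigram language-model probability
    mass onto the words currently under visual attention, then
    renormalize so the result is still a distribution."""
    shifted = {w: p * (1.0 - boost) for w, p in lm_probs.items()}
    share = boost / len(attended)
    for w in attended:
        shifted[w] = shifted.get(w, 0.0) + share
    total = sum(shifted.values())
    return {w: p / total for w, p in shifted.items()}
```

Because the mass is redistributed rather than added, words outside the gaze region remain reachable, which keeps the decoder from being overcommitted when gaze and speech disagree.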

15.

The majority of automatic speech recognition (ASR) systems are trained on neutral speech, and their performance suffers when emotional content is present in the input. Recognizing these emotions in human speech is considered a crucial aspect of human-machine interaction. In the first stage, combined spectral and differenced prosody features are used for emotion recognition. Emotion recognition does not serve only to improve ASR performance: based on the emotion recognized from the input speech, the corresponding adapted emotive ASR model is selected for evaluation in the second stage. This adapted emotive ASR model is built from existing neutral speech and emotive speech generated synthetically by a prosody-modification method. This work studies the importance of a front-end emotion recognition block together with emotive-speech adaptation of the ASR models. Speech samples from the IIIT-H Telugu speech corpus were used to build the large-vocabulary ASR systems, and emotional speech samples from the IITKGP-SESC Telugu corpus were used for evaluation. The adapted emotive speech models yielded better performance than the existing neutral speech models.


16.
In the past decade, the performance of spoken language understanding systems has improved dramatically, including speech recognition, dialog systems, speech summarization, and text and speech translation. This has resulted in an increasingly widespread use of speech and language technologies in a wide variety of applications. With more than 6,900 languages in the world and the current trend of globalization, one of the most important challenges in spoken language technologies today is the need to support multiple input and output languages, especially if applications are intended for international markets, linguistically diverse user communities, and nonnative speakers. In many cases these applications have to support even multiple languages simultaneously to meet the needs of a multicultural society. Consequently, new algorithms and tools are required that support the simultaneous recognition of mixed-language input, the summarization of multilingual text and spoken documents, the generation of output in the appropriate language, or the accurate translation from one language to another. This article surveys significant ongoing research programs as well as trends, prognoses, and open research issues with a special emphasis on multilingual speech processing as described in detail in the work of Schultz and Hirschberg (2006) and multilingual language processing as presented in the work of Fung (2006).

17.
The authors develop a parallel structure for the time-delay neural network used in some speech recognition applications. The effectiveness of the design is illustrated by: (1) extracting a window computing model from the time-delay neural systems; (2) building its pipelined architecture with parallel or serial processing stages; and (3) applying this parallel window computing to some typical speech recognition systems. An analysis of the complexity of the proposed design shows a greatly reduced complexity while maintaining a high throughput rate.

18.
李聪  葛洪伟 《信号处理》2018,34(7):867-875
Speaker recognition performance degrades sharply in practice because of environmental noise. This paper proposes a robust speaker identification method based on a Gaussian mixture model-universal background model (GMM-UBM) and adaptive parallel model combination. Adaptive parallel model combination is a noise-robust feature-compensation algorithm that effectively reduces the mismatch between training and test environments, improving recognition accuracy and noise robustness. First, noise features are estimated from the test utterance, and a single Gaussian is fitted to these features to obtain the noise mean and covariance. Then, the mean vectors and covariance matrices of the trained Gaussian mixture models are adjusted according to the estimated noise mean and covariance so as to match the test environment as closely as possible. Experimental results show that the method accurately reconstructs the GMM parameters of clean speech and significantly improves speaker recognition accuracy, especially at low signal-to-noise ratios.
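The mean and covariance adjustment can be illustrated with the classic log-normal parallel-model-combination approximation, applied per dimension under a diagonal-covariance assumption. This is a textbook sketch of the combination step, not the paper's exact adaptive variant:

```python
import numpy as np

def pmc_lognormal(mu_x, var_x, mu_n, var_n):
    """Log-normal parallel model combination in the log-spectral
    domain: map clean-speech and noise Gaussians to the linear
    spectral domain, add them (speech and noise are assumed additive
    there), and map the sum back to a log-domain Gaussian."""
    # Log-normal moments in the linear domain.
    m_x = np.exp(mu_x + var_x / 2.0)
    v_x = m_x**2 * (np.exp(var_x) - 1.0)
    m_n = np.exp(mu_n + var_n / 2.0)
    v_n = m_n**2 * (np.exp(var_n) - 1.0)
    # Additive combination of independent speech and noise.
    m_y, v_y = m_x + m_n, v_x + v_n
    # Back to log-domain Gaussian parameters.
    var_y = np.log1p(v_y / m_y**2)
    mu_y = np.log(m_y) - var_y / 2.0
    return mu_y, var_y
```

When the noise energy is negligible, the combined parameters reduce to the clean-speech ones, so the adjustment only moves the model as far as the estimated noise warrants.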

19.
With the recent spread of speech technologies and the increasing availability of application program interfaces for speech synthesis and recognition, system designers are starting to consider whether to add speech functionality to their applications. The questions that ensue are by no means trivial. SMALTO, the tool described below, provides advice on the use of speech input and/or output modalities in combination with other modalities in the design of multimodal systems. SMALTO (Speech Modality Auxiliary Tool) implements a theory of modalities and incorporates structured data extracted from a corpus of claims about speech functionality found in recent literature on multimodality. The current version of the system, a hypertext system, aims mainly at supporting decisions at early design stages. However, further uses of SMALTO as part of a complete domain-oriented design environment are also envisaged.

20.
A parallel sub-band HMM maximum a posteriori adaptive nonlinear class estimation algorithm
Automatic speech recognition (ASR) systems currently achieve high accuracy under laboratory conditions, but in real environments their performance deteriorates sharply owing to background noise and the transmission channel. Based on auditory experiments, this paper proposes a new independent sub-band parallel maximum a posteriori nonlinear class estimation algorithm to improve the robustness of recognition systems. Exploiting the power-spectral differences between various noises and the recognized content, as well as the differing effects of noise on the HMM in different frequency bands, the algorithm uses a multilayer perceptron (MLP) to perform a nonlinear mapping of the maximum a posteriori probability in noisy environments, reducing the performance loss caused by environmental mismatch. Experiments show that the algorithm clearly outperforms maximum a posteriori linear regression and the sub-band speech recognition algorithm proposed by Sangita.


Copyright©北京勤云科技发展有限公司  京ICP备09084417号