Similar Literature
19 similar records found (search time: 343 ms)
1.
Vocal-effort-aware robust speech recognition based on articulatory features   (Cited: 1; self-citations: 0; other citations: 1)
晁浩, 宋成, 彭维平. Journal of Computer Applications (《计算机应用》), 2015, 35(1): 257-261
To address the robustness of speech recognition under vocal effort (VE) variation, a recognition algorithm based on a multi-model framework is proposed. First, the acoustic characteristics of speech under different VE modes and the impact of VE changes on recognition accuracy are analyzed. Then, a VE mode detection method based on Gaussian mixture models (GMM) is proposed. Finally, guided by the detection result, a dedicated acoustic model is trained for whispered speech, while articulatory features are used alongside conventional spectral features for the other four VE modes. Isolated-word recognition experiments show a clear improvement in accuracy: compared with the baseline system, the proposed method reduces the average character error rate over the five VE modes by 26.69%; compared with training the acoustic model on pooled multi-effort data, by 14.51%; and compared with maximum likelihood linear regression (MLLR) adaptation, by 15.30%. The results indicate that articulatory features are more robust to VE variation than conventional spectral features, and that the multi-model framework is an effective approach to VE-related robustness in speech recognition.
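As a concrete illustration of the GMM-based VE detection step, the sketch below trains one GMM per effort mode and picks the mode with the highest average frame log-likelihood. It is a minimal sketch, not the paper's implementation: the mode names, feature dimensionality, and mixture size are assumptions, and random arrays stand in for real MFCC features.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

VE_MODES = ["whisper", "soft", "normal", "loud", "shout"]  # five modes per the abstract

def train_ve_detectors(features_by_mode, n_components=8):
    """Fit one GMM per vocal effort mode on stacked frame features."""
    detectors = {}
    for mode, frames in features_by_mode.items():
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=0)
        gmm.fit(frames)                      # frames: (n_frames, n_dims)
        detectors[mode] = gmm
    return detectors

def detect_ve_mode(detectors, utterance_frames):
    """Pick the mode whose GMM gives the highest mean frame log-likelihood."""
    scores = {m: g.score(utterance_frames) for m, g in detectors.items()}
    return max(scores, key=scores.get)

# Toy usage with random stand-in features (39-dim, e.g., MFCC + deltas)
rng = np.random.default_rng(0)
data = {m: rng.normal(i, 1.0, size=(500, 39)) for i, m in enumerate(VE_MODES)}
dets = train_ve_detectors(data)
print(detect_ve_mode(dets, rng.normal(2, 1.0, size=(200, 39))))  # likely "normal"
```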

2.
To address VE-related robustness in speech recognition, a GMM-based vocal effort detector is built on an analysis of how sound intensity, duration, frame energy distribution, and spectral tilt behave as vocal effort changes. The effect of VE variation on recognition accuracy is also studied, and a recognition algorithm based on a multi-model framework is proposed. Mandarin isolated-word recognition experiments show that, apart from a slight drop for the normal mode, accuracy improves substantially for the other four VE modes. The results indicate that intensity, duration, frame energy distribution, and spectral tilt carry enough information to identify the VE mode, and that the multi-model framework is an effective approach to VE-related robustness in speech recognition.
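Two of the cues named above, frame energy and spectral tilt, are easy to compute per frame; the numpy sketch below shows one plausible recipe (frame length, hop size, and the line-fit definition of tilt are assumptions, not the paper's exact method).

```python
import numpy as np

def frame_energy_and_tilt(signal, sr=16000, frame_len=400, hop=160):
    """Per-frame log energy and spectral tilt (slope of a line fit to the
    log-magnitude spectrum), two of the vocal effort cues named above."""
    energies, tilts = [], []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        energies.append(np.log(np.sum(frame ** 2) + 1e-10))
        spec = np.abs(np.fft.rfft(frame)) + 1e-10
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
        # Slope of log|X(f)| vs. frequency: more negative = steeper tilt
        slope, _ = np.polyfit(freqs, 20 * np.log10(spec), 1)
        tilts.append(slope)
    return np.array(energies), np.array(tilts)

# Toy usage: whispered speech typically shows lower energy and flatter tilt
x = np.random.default_rng(0).normal(size=16000)
e, t = frame_energy_and_tilt(x)
print(e.mean(), t.mean())
```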

3.
To overcome the communication barrier between speech-impaired and unimpaired people, a neural-network-based method for converting sign language into emotional speech is proposed. First, gesture, facial expression, and emotional speech corpora are built. Deep convolutional neural networks then perform gesture recognition and facial expression recognition. With Mandarin initials and finals as the synthesis units, a speaker-adaptive deep neural network acoustic model and a speaker-adaptive hybrid long short-term memory network acoustic model for emotional speech are trained. Finally, the context-dependent labels of the gesture semantics and the emotion tag corresponding to the facial expression are fed into the emotional speech synthesis model to produce the matching emotional speech. Experiments show gesture and facial expression recognition rates of 95.86% and 92.42%, respectively, and an EMOS score of 4.15 for the synthesized speech, indicating a high degree of emotional expressiveness, sufficient for normal communication between speech-impaired and unimpaired people.

4.
This paper studies telephone-channel speech recognition based on data simulation and HMM (hidden Markov model) adaptation. The simulated data mimic the behavior of clean speech under different telephone channel conditions. Baseline HMMs are trained on clean speech and on simulated speech, respectively. Recognition experiments evaluate each baseline before and after unsupervised adaptation with the MLLR (maximum likelihood linear regression) algorithm. The experiments show that simulated speech generated from clean speech effectively reduces the acoustic mismatch between training and test data and substantially improves telephone speech recognition accuracy. The adaptation results show that adapting the simulated-data models outperforms adapting the clean-speech models by up to 9.8%, demonstrating a further improvement in telephone recognition performance and system robustness.
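The abstract does not spell out the simulation itself; a common stand-in for a telephone channel is band-limiting to the 300-3400 Hz passband and downsampling to 8 kHz, as in this hedged sketch (the filter order and cutoffs are assumptions, not the paper's settings).

```python
import numpy as np
from scipy.signal import butter, lfilter, resample_poly

def simulate_telephone_channel(clean, sr=16000):
    """Crude telephone-channel proxy: band-limit to 300-3400 Hz, then
    downsample to 8 kHz. Only an illustrative stand-in for the paper's
    (unspecified) channel simulation."""
    b, a = butter(4, [300 / (sr / 2), 3400 / (sr / 2)], btype="band")
    narrowband = lfilter(b, a, clean)
    return resample_poly(narrowband, up=1, down=sr // 8000)

x = np.random.default_rng(0).normal(size=16000)   # 1 s of stand-in "clean speech"
y = simulate_telephone_channel(x)
print(len(y))  # 8000 samples at 8 kHz
```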

5.
A speaker adaptation method for a stochastic segment model system is proposed: given the properties of the stochastic segment model, maximum likelihood linear regression is introduced into the system. Mandarin continuous speech recognition experiments on the "863 test" set show a clear reduction in character error rate after speaker adaptation at various decoding speeds. The results indicate that MLLR is equally effective in stochastic segment model systems.
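For reference, the MLLR mean transform being carried over to the segment-model system takes the standard textbook form (a sketch, not the paper's exact derivation):

```latex
% MLLR adapts Gaussian means with a shared affine transform
\hat{\mu}_m = A\,\mu_m + b = W\,\xi_m, \qquad \xi_m = [1,\ \mu_m^{\top}]^{\top}
% W = [b\ A] is chosen to maximize the likelihood of the adaptation data,
% accumulated over the frames aligned to each Gaussian m:
W^{\ast} = \arg\max_{W} \sum_{m}\sum_{t} \gamma_m(t)\,
  \log \mathcal{N}\!\left(o_t;\, W\xi_m,\, \Sigma_m\right)
```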

6.
To meet the intelligent error-correction needs of an English translation robot, a method for building its correction system is proposed based on linguistic features and transfer learning. A DNN-HMM acoustic model is used to build the robot's speech recognizer; with Mandarin speech recognition as the source task, the corresponding English recognition system is constructed through transfer learning. Experiments show that, compared with a baseline trained only on English data, transferring all shared hidden layers reduces the error rate by 24.38% with a 1 h training set and by 4.73% with a 20 h training set, significantly improving recognition accuracy and thereby the robot's error-correction performance.
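A minimal PyTorch sketch of the shared-hidden-layer transfer described above: all hidden layers are copied from the Mandarin source model and a fresh English output layer is attached. Layer sizes, output dimensions, and the freezing policy are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DNNAcousticModel(nn.Module):
    """DNN-HMM style net: shared hidden stack + language-specific softmax layer."""
    def __init__(self, n_in=440, n_hid=1024, n_layers=5, n_out=3000):
        super().__init__()
        layers, dim = [], n_in
        for _ in range(n_layers):
            layers += [nn.Linear(dim, n_hid), nn.Sigmoid()]
            dim = n_hid
        self.hidden = nn.Sequential(*layers)     # the part that gets transferred
        self.output = nn.Linear(n_hid, n_out)    # senone posteriors

    def forward(self, x):
        return self.output(self.hidden(x))

# Source task: Mandarin model (assume already trained)
mandarin = DNNAcousticModel(n_out=3000)

# Target task: copy all shared hidden layers, attach a new English output layer
english = DNNAcousticModel(n_out=2500)
english.hidden.load_state_dict(mandarin.hidden.state_dict())

# Optionally freeze the transferred layers when English data is scarce (e.g., 1 h)
for p in english.hidden.parameters():
    p.requires_grad = False
```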

7.
Speech is one of the main modes of human-computer interaction, and speech recognition is an important part of artificial intelligence. In recent years neural networks have advanced rapidly in speech recognition and have become the mainstream acoustic modeling technique. However, the target speaker's speech at test time can differ from the training data, leaving the model mismatched. Speaker adaptation (SA) methods address this speaker-induced mismatch, and their study has become an active direction in speech recognition research. Compared with adaptation in traditional recognition models, adaptation in neural-network-based systems must cope with very large parameter counts and relatively little adaptation data, which makes it a difficult research problem. This paper first reviews the development of speaker adaptation methods and the problems encountered in adapting neural-network-based recognizers, then divides the methods into feature-domain and model-domain approaches and describes their principles and refinements, and finally points out remaining problems and future directions for speaker adaptation in speech recognition.

8.
Deep-learning-based speech recognition: current status and outlook   (Cited: 1; self-citations: 0; other citations: 1)
This paper first gives a brief introduction to the history and concepts of deep learning, then reviews recent progress in deep-learning-based speech recognition under five headings: acoustic model training criteria, deep-learning acoustic model architectures, training-efficiency optimization for deep acoustic models, speaker adaptation of deep acoustic models, and deep-learning-based end-to-end speech recognition. It closes with an outlook on possible future research directions.

9.
In speech-recognition-based smart homes, the training corpus is incomplete and the application scenarios are complex, so the false acceptance rate of natural-language speech recognition is far higher than that of small-vocabulary recognition. While designing and implementing a natural-language speech recognition system for the smart home, the authors studied in depth how the MAP and MLLR algorithms act on the parameters of HMM acoustic models, proposed a combined adaptation method, and implemented the complete system on the open-source recognition toolkit CMU Sphinx. The results show that the proposed adaptation algorithm is feasible and effective and noticeably improves system performance across scenarios.
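For orientation, the MAP side of the combined adaptation typically updates each Gaussian mean by interpolating the prior mean with the adaptation-data statistics (the standard formulation; the paper's specific MAP+MLLR combination is not reproduced here):

```latex
% MAP update of Gaussian mean m, with occupation counts gamma_m(t)
\hat{\mu}_m \;=\; \frac{\tau\,\mu_m^{0} \;+\; \sum_{t}\gamma_m(t)\,o_t}
                      {\tau \;+\; \sum_{t}\gamma_m(t)}
% tau > 0 controls how strongly the prior mean mu_m^0 is trusted:
% with little data the update stays near the prior; with much data it
% approaches the ML estimate, which is why MAP pairs well with MLLR.
```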

10.
杨淑莹, 田迪, 郭杨杨, 赵敏. Computer Simulation (《计算机仿真》), 2022, 39(2): 278-282, 418
To make everyday social life easier for hearing-impaired people and improve their social integration, a simulated sign-language translation system based on the B/S architecture is designed and implemented. The system contains a speech recognition module, a text segmentation module, and a virtual-human control module. Captured speech is decomposed with Mel-scale wavelet packets to extract acoustic features, fast speech recognition yields the corresponding text, and jieba performs the text segmentation; in parallel, a simulated virtual-human model is built and given key-frame sign-language motions, using...
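A rough sketch of wavelet packet sub-band energies with PyWavelets; the Mel-scale packet tree named in the abstract is not reproduced, and the wavelet family, depth, and frame size here are assumptions.

```python
import numpy as np
import pywt

def wavelet_packet_energies(frame, wavelet="db4", level=4):
    """Sub-band log energies from a wavelet packet decomposition, a rough
    stand-in for the Mel-scale wavelet packet features named above."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, mode="symmetric",
                            maxlevel=level)
    nodes = wp.get_level(level, order="freq")       # 2**level sub-bands
    return np.array([np.log(np.sum(n.data ** 2) + 1e-10) for n in nodes])

frame = np.random.default_rng(0).normal(size=400)   # one 25 ms frame at 16 kHz
print(wavelet_packet_energies(frame).shape)         # (16,)
```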

11.
To mitigate the effect of vocal effort on speaker recognition performance, and assuming the training data contain only small amounts of whispered and shouted speech, a method combining maximum a posteriori (MAP) estimation with constrained maximum likelihood linear regression (CMLLR) is proposed to update the speaker models and to transform the speaker features. MAP adaptation updates the speaker models trained on normal speech, while CMLLR feature-space projection transforms the features of whispered and shouted test speech, reducing the mismatch between training and test data. Experiments show that the MAP+CMLLR method clearly lowers the equal error rate (EER) of the speaker recognition system: compared with the baseline system, MAP adaptation alone, MLLR model projection, and CMLLR feature-space projection alone, the average EER falls by 75.3%, 3.5%, 72%, and 70.9%, respectively. The results indicate that the proposed method weakens the influence of vocal effort on speaker discriminability and makes the system more robust to vocal effort variation.
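Unlike MLLR, CMLLR applies a single affine transform in feature space, which is why it can project whispered or shouted test features toward the normal-speech model. The standard form (a sketch, not the paper's estimation details):

```latex
% CMLLR (feature-space MLLR): one affine transform applied to the features
\hat{o}_t = A\,o_t + b,
% estimated by maximizing the likelihood of the adaptation data under the
% unchanged speaker model lambda, with a Jacobian term for the transform:
(A^{\ast}, b^{\ast}) = \arg\max_{A,b} \sum_{t}
  \Big[ \log \lvert \det A \rvert
      + \log p\!\left(A\,o_t + b \,\middle|\, \lambda\right) \Big]
```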

12.
Automatic recognition of children's speech using acoustic models trained on adult speech results in poor performance due to differences in speech acoustics. These acoustical differences are a consequence of children having shorter vocal tracts and smaller vocal cords than adults. Hence, speaker adaptation needs to be performed. However, in real-world applications, the amount of adaptation data available may be less than what common speaker adaptation techniques need to yield reasonable performance. In this paper, we first study, in the discrete frequency domain, the relationship between frequency warping in the front-end and corresponding transformations in the back-end. Three common feature extraction schemes are investigated and the linearity of their back-end transformations is discussed. In particular, we show that under certain approximations, frequency warping of MFCC features with Mel-warped triangular filter banks equals a linear transformation in the cepstral space. Based on that linear transformation, a formant-like peak alignment algorithm is proposed to adapt adult acoustic models to children's speech. The peaks are estimated by Gaussian mixtures using the Expectation-Maximization (EM) algorithm [Zolfaghari, P., Robinson, T., 1996. Formant analysis using mixtures of Gaussians. Proceedings of International Conference on Spoken Language Processing, 1229-1232]. For limited adaptation data, the algorithm outperforms traditional vocal tract length normalization (VTLN) and maximum likelihood linear regression (MLLR) techniques.
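The paper's key observation can be summarized as follows (a notational sketch of the stated result, not its derivation):

```latex
% Under the paper's approximations, VTLN frequency warping acts linearly
% on MFCCs: warping the spectrum with factor alpha maps the cepstrum as
\hat{c} = A_{\alpha}\, c,
% so adapting adult models to children's speech reduces to transforming
% the Gaussian means in the cepstral space:
\hat{\mu}_m = A_{\alpha}\,\mu_m .
```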

13.
In this paper, speaker adaptive acoustic modeling is investigated by using a novel method for speaker normalization and a well known vocal tract length normalization method. With the novel normalization method, acoustic observations of training and testing speakers are mapped into a normalized acoustic space through speaker-specific transformations with the aim of reducing inter-speaker acoustic variability. For each speaker, an affine transformation is estimated with the goal of reducing the mismatch between the acoustic data of the speaker and a set of target hidden Markov models. This transformation is estimated through constrained maximum likelihood linear regression and then applied to map the acoustic observations of the speaker into the normalized acoustic space.

Recognition experiments made use of two corpora, the first one consisting of adults' speech, the second one consisting of children's speech. Performing training and recognition with normalized data resulted in a consistent reduction of the word error rate with respect to the baseline systems trained on unnormalized data. In addition, the novel method always performed better than the reference vocal tract length normalization method adopted in this work.

When unsupervised static speaker adaptation was applied in combination with each of the two speaker normalization methods, a different behavior was observed on the two corpora: in one case performance became very similar while in the other case the difference remained significant.
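For contrast with the reference VTLN method, a commonly used piecewise-linear frequency warp can be sketched as below; the cutoff fraction and the warp factor shown are conventional assumptions rather than the papers' exact settings.

```python
import numpy as np

def piecewise_linear_warp(freqs, alpha, f_cut=0.85, sr=16000):
    """Common piecewise-linear VTLN warp: scale frequencies by alpha up to a
    cutoff, then interpolate linearly so the Nyquist frequency is preserved.
    (A generic formulation; the papers' exact warping functions may differ.)"""
    nyq = sr / 2.0
    knee = f_cut * nyq
    return np.where(
        freqs <= knee,
        alpha * freqs,
        alpha * knee + (nyq - alpha * knee) * (freqs - knee) / (nyq - knee),
    )

# Warp the mel filterbank centre frequencies for a short-vocal-tract speaker
centres = np.linspace(100, 7600, 24)
print(piecewise_linear_warp(centres, alpha=1.1)[:3])
```

In practice the per-speaker factor alpha is usually picked by a grid search (e.g., over 0.88-1.12) that maximizes the likelihood of the speaker's data under the current models.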

14.
To reduce the performance degradation of speaker-independent speech recognition caused by differences in vocal tract shape between speakers, two methods are studied: a frequency-warping speaker adaptation method based on maximum likelihood estimation, and new speech features based on the Mellin transform. Preliminary experiments on a speaker-independent isolated-word recognition system show that both methods improve the system's robustness to different speakers. Of the two, the Mellin-transform features perform better: they not only improve recognition accuracy across speakers but also greatly reduce the spread of error rates across speakers.
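The appeal of Mellin-based features lies in scale invariance: dilating a signal, roughly what a change of vocal tract length does to the spectral envelope, leaves the Mellin magnitude unchanged along the imaginary axis. A sketch of this standard property (not the paper's feature pipeline):

```latex
% Mellin transform and its behaviour under dilation f(t) -> f(at)
M_f(s) = \int_{0}^{\infty} f(t)\, t^{s-1}\, dt,
\qquad
M_{f(a\,\cdot)}(s) = a^{-s}\, M_f(s)
% On the line s = j*omega, |a^{-j\omega}| = 1, so the magnitude
% |M_f(j\omega)| is invariant to the scaling factor a.
```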

15.
Distant speech capture in lecture halls and auditoriums offers unique challenges in algorithm development for automatic speech recognition. In this study, a new adaptation strategy for distant noisy speech is created by means of phoneme classes. Unlike previous approaches which adapt the acoustic model to the features, the proposed phoneme-class based feature adaptation (PCBFA) strategy adapts the distant data features to the present acoustic model, which was previously trained on close-microphone speech. The essence of PCBFA is to create a transformation strategy which makes the distributions of phoneme classes of distant noisy speech similar to those of a close-talk microphone acoustic model in a multidimensional MFCC space. To achieve this task, phoneme classes of distant noisy speech are recognized via artificial neural networks. PCBFA is the adaptation of features rather than adaptation of acoustic models. The main idea behind PCBFA is illustrated via a conventional Gaussian mixture model-hidden Markov model (GMM-HMM), although it can be extended to new structures in automatic speech recognition (ASR). The new adapted features, together with the new and improved acoustic models produced by PCBFA, are shown to outperform those created only by acoustic model adaptations for ASR and keyword spotting. PCBFA offers a new, powerful understanding in acoustic modeling of distant speech.
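A heavily simplified sketch of the PCBFA idea: blend per-phoneme-class mean/variance matching transforms by the ANN class posteriors. Diagonal statistics, the blending rule, and all shapes are assumptions made for illustration; the paper's transformation strategy in the full multidimensional MFCC space is richer than this.

```python
import numpy as np

def pcbfa_transform(frames, posteriors, distant_stats, close_stats):
    """Map distant-speech MFCC frames toward close-talk class statistics.
    frames: (T, D); posteriors: (T, C) phoneme-class posteriors from an ANN;
    *_stats: per-class (mean, std) arrays of shape (C, D).
    Per-class mean/variance matching -- a strong simplification of the
    paper's strategy, shown only to fix the idea."""
    mu_d, sd_d = distant_stats
    mu_c, sd_c = close_stats
    out = np.zeros_like(frames)
    for t, (x, p) in enumerate(zip(frames, posteriors)):
        # Blend per-class transforms by the frame's class posteriors
        mapped = (x - mu_d) / sd_d * sd_c + mu_c          # (C, D)
        out[t] = p @ mapped
    return out

rng = np.random.default_rng(0)
T, D, C = 100, 13, 6
post = rng.dirichlet(np.ones(C), size=T)
stats = lambda: (rng.normal(size=(C, D)), rng.uniform(0.5, 2.0, size=(C, D)))
print(pcbfa_transform(rng.normal(size=(T, D)), post, stats(), stats()).shape)
```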

16.
As core speech recognition technology improves, opening up a wider range of applications, genericity and portability are becoming important issues. Most of today's recognition systems are still tuned to a particular task, and porting a system to a new task (or language) requires a substantial investment of time and money, as well as human expertise.

This paper addresses issues in speech recognizer portability and in the development of generic core speech recognition technology. First, the genericity of wide-domain models is assessed by evaluating their performance on several tasks of varied complexity. Then, techniques aimed at enhancing the genericity of these wide-domain models are investigated. Multi-source acoustic training is shown to reduce the performance gap between task-independent and task-dependent acoustic models, and for some tasks to outperform task-dependent acoustic models.

Transparent methods for porting generic models to a specific task are also explored. Transparent unsupervised acoustic model adaptation is contrasted with supervised adaptation, and incremental unsupervised adaptation of both the acoustic and linguistic models is investigated. Experimental results on a dialog task show that with the proposed scheme, a transparently adapted generic system can perform nearly as well (about a 1% absolute gap in word error rate) as a task-specific system trained on several tens of hours of manually transcribed data.

17.
The recently proposed time-delay deep neural network (TDNN) acoustic models trained with the lattice-free maximum mutual information (LF-MMI) criterion have been shown to give significant performance improvements over other deep neural network (DNN) models in a variety of speech recognition tasks. Meanwhile, Kullback-Leibler divergence (KLD) regularization has been validated as an effective adaptation method for DNN acoustic models. However, to the best of our knowledge, no work has been reported on whether the KLD-based method is also effective for LF-MMI based TDNN models, especially for domain adaptation. In this study, we generalized KLD-regularized model adaptation to train domain-specific TDNN acoustic models and obtained a few distinct and important observations. Experiments were performed on Cantonese-accent, in-car, and far-field noisy Mandarin speech recognition tasks. Results demonstrate that the proposed domain-adapted models achieve around 7-29% relative word error rate reduction on these tasks, even when only around 1 K adaptation utterances are available.
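The KLD-regularized criterion being generalized here is usually written as an interpolation of the adaptation targets with the unadapted model's posteriors (the standard formulation for DNN adaptation; the LF-MMI-specific treatment is the paper's contribution and is not reproduced):

```latex
% KLD-regularized adaptation: replace the hard targets with a blend of the
% reference label distribution and the source (speaker/domain-independent)
% model's posteriors
\hat{p}(s \mid o_t) = (1-\rho)\, p_{\text{ref}}(s \mid o_t)
                    + \rho\, p_{\text{SI}}(s \mid o_t)
% Training toward \hat{p} is equivalent to adding rho * KL(p_SI || p_adapt)
% to the original criterion; rho controls how far the adapted model may
% drift from the source model when adaptation data are scarce.
```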

18.
This paper addresses accent issues in large vocabulary continuous speech recognition. Cross-accent experiments show that accent is a dominant problem in speech recognition. Analysis based on multivariate statistical tools (principal component analysis and independent component analysis) confirms that accent is one of the key factors in speaker variability. Considering different applications, we proposed two methods for accent adaptation. When a certain amount of adaptation data was available, pronunciation dictionary modeling was adopted to reduce recognition errors caused by pronunciation mistakes. When a large corpus was collected for each accent type, accent-dependent models were trained and a Gaussian mixture model-based accent identification system was developed for model selection. We report experimental results for the two schemes and verify their efficiency in each situation.

19.
Research on noise robust speech recognition has mainly focused on dealing with relatively stationary noise that may differ from the noise conditions in most living environments. In this paper, we introduce a recognition system that can recognize speech in the presence of multiple rapidly time-varying noise sources, as found in a typical family living room. To deal with such severe noise conditions, our recognition system exploits all available information about speech and noise: spatial (directional), spectral, and temporal information. This is realized with a model-based speech enhancement pre-processor, which consists of two complementary elements: a multi-channel speech-noise separation method that exploits spatial and spectral information, followed by a single-channel enhancement algorithm that uses the long-term temporal characteristics of speech obtained from clean speech examples. Moreover, to compensate for any mismatch that may remain between the enhanced speech and the acoustic model, our system employs an adaptation technique that combines conventional maximum likelihood linear regression with dynamic adaptive compensation of the variances of the Gaussians in the acoustic model. Our proposed system approaches human performance levels by greatly improving the audible quality of speech and substantially improving the keyword recognition accuracy.
