Found 20 similar documents (search time: 15 ms)
1.
Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) for people diagnosed with the condition. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is therefore unlikely that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms, which make no attempt to address this extent of mismatch in population characteristics. In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested on a corpus of dysarthric speech acquired by our research group, comprising speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). The interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative over the standard MAP-adapted baseline.
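The prior-interpolation idea combined with MAP mean adaptation can be sketched in a few lines. Here `alpha` (interpolation weight) and `tau` (MAP prior weight) are illustrative tuning parameters, not values from the paper:

```python
def interpolated_prior(mu_si, mu_bg, alpha):
    # Interpolate the speaker-independent mean with the dysarthric 'background'
    # mean; alpha is a tuning weight (illustrative, not from the paper).
    return [alpha * b + (1 - alpha) * s for s, b in zip(mu_si, mu_bg)]

def map_adapt_mean(mu_prior, frames, tau):
    # Standard MAP update of a Gaussian mean: the prior mean (with weight tau)
    # is pulled toward the sample mean of the adaptation frames.
    n = len(frames)
    dim = len(mu_prior)
    sums = [sum(f[i] for f in frames) for i in range(dim)]
    return [(tau * mu_prior[i] + sums[i]) / (tau + n) for i in range(dim)]

prior = interpolated_prior([0.0], [2.0], 0.5)   # -> [1.0]
adapted = map_adapt_mean(prior, [[3.0], [5.0]], tau=2.0)
```

With more adaptation frames, the estimate moves further from the interpolated prior toward the talker's own statistics, which is exactly the behavior MAP adaptation provides.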
2.
This paper adapts the Clonal Selection Algorithm (CSA), usually used to explain the basic features of artificial immune systems, to the training of neural networks, as an alternative to back propagation. The CSA was first applied to a real-world problem (the IRIS database) and then compared with an artificial immune network. CSA performance was also contrasted with other evolutionary approaches such as Differential Evolution (DE) and Multiple Population Genetic Algorithms (MPGA). The applications tested in the simulation studies were IRIS (a botanical database) and TIMIT (a phonetic database). The results show that DE converged faster than the multiple population genetic algorithm and standard genetic algorithms, so DE appears to be a promising approach to engineering optimization problems. On the other hand, CSA performed well at pattern recognition, with recognition rates of 99.11% on the IRIS database and 76.11% on TIMIT. Finally, the MPGA generalized across all phonetic classes in a homogeneous way: 60% for the vowels, 63% for the fricatives, and 68% for the plosives.
3.
Yong Lü, Lin Zhou 《Pattern recognition》2010,43(9):3093-3099
In this paper, we propose a multi-environment model adaptation method based on vector Taylor series (VTS) for robust speech recognition. In the training phase, the clean speech is contaminated with noise at different signal-to-noise ratio (SNR) levels to produce several types of noisy training speech and each type is used to obtain a noisy hidden Markov model (HMM) set. In the recognition phase, the HMM set which best matches the testing environment is selected, and further adjusted to reduce the environmental mismatch by the VTS-based model adaptation method. In the proposed method, the VTS approximation based on noisy training speech is given and the testing noise parameters are estimated from the noisy testing speech using the expectation-maximization (EM) algorithm. The experimental results indicate that the proposed multi-environment model adaptation method can significantly improve the performance of speech recognizers and outperforms the traditional model adaptation method and the linear regression-based multi-environment method.
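The first-order VTS expansion at the heart of such methods can be shown for a single log-spectral dimension. This is a textbook scalar sketch of the standard mismatch function y = x + log(1 + exp(n − x)), ignoring the channel term and the paper's multi-environment selection step:

```python
import math

def vts_adapt_scalar(mu_x, var_x, mu_n, var_n):
    # First-order VTS for one log-spectral dimension: expand the mismatch
    # function y = x + log(1 + exp(n - x)) around the clean and noise means.
    mu_y = mu_x + math.log(1.0 + math.exp(mu_n - mu_x))
    # Jacobian dy/dx at the expansion point; dy/dn is (1 - G).
    G = 1.0 / (1.0 + math.exp(mu_n - mu_x))
    var_y = G * G * var_x + (1.0 - G) * (1.0 - G) * var_n
    return mu_y, var_y

mu_y, var_y = vts_adapt_scalar(0.0, 1.0, 0.0, 1.0)
```

At equal clean and noise means the Jacobian is 0.5, so the adapted variance averages the two sources equally; at high SNR (mu_x much larger than mu_n) the clean statistics dominate, as expected.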
4.
To address changes in the acoustic characteristics of speech caused by variations in voice effect, an acoustic-model-adaptation-based method is proposed. The performance of acoustic models trained on normal-mode speech in recognizing speech produced in other voice-effect modes is analyzed; based on the modeling properties of the stochastic segment model, maximum likelihood linear regression (MLLR) is introduced into the stochastic segment model system, and the adapted acoustic models are used to recognize speech of the corresponding voice-effect mode. Mandarin continuous speech recognition experiments on the "863-test" set show that when acoustic models trained in the normal mode recognize speech in the other four voice-effect modes, recognition accuracy drops considerably, whereas the adapted system shows clear accuracy improvements on speech of the corresponding mode. This demonstrates the effectiveness of acoustic model adaptation in handling voice-effect variation in speech recognition.
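The MLLR technique used above updates Gaussian means with an affine transform, mu' = A mu + b. A minimal sketch follows; in practice A and b are estimated by maximum likelihood from the adaptation data, so the values here are purely illustrative:

```python
def mllr_mean(A, b, mu):
    # MLLR mean update: mu' = A * mu + b. A is a dim x dim matrix (list of
    # rows), b a bias vector; both would be estimated from adaptation data.
    return [sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(b))]

adapted = mllr_mean([[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0], [1.0, 1.0])
```

Because one shared transform updates many Gaussians at once, MLLR needs far less adaptation data than re-estimating each mean separately, which is why it suits mode-mismatch scenarios like this one.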
5.
To address the problem of data-distribution differences between speech corpora, a cross-corpus speech emotion recognition algorithm based on deep-autoencoder subdomain adaptation is proposed. First, two deep autoencoders extract highly representative low-dimensional emotion features from the source and target domains, respectively. Then, a subdomain adaptation module based on LMMD (local maximum mean discrepancy) aligns the feature distributions of the source and target domains within each low-dimensional emotion-class subspace. Finally, the model is trained in a supervised manner on labeled source-domain data. With eNTERFACE as the source corpus and Berlin as the target, the proposed algorithm improves cross-corpus recognition accuracy by 5.26% to 19.73% over competing algorithms; with Berlin as the source and eNTERFACE as the target, it improves accuracy by 7.34% to 8.18%. The proposed method thus effectively extracts emotion features shared across corpora and improves cross-corpus speech emotion recognition performance.
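The subdomain-alignment idea can be illustrated with a linear-kernel MMD averaged per emotion class. This is a simplified sketch: real LMMD uses kernel embeddings and weights classes by predicted label probabilities, whereas uniform class weights and a linear kernel are assumptions made here:

```python
def mmd2_linear(src, tgt):
    # Squared MMD with a linear kernel: squared distance between sample means.
    dim = len(src[0])
    ms = [sum(x[i] for x in src) / len(src) for i in range(dim)]
    mt = [sum(x[i] for x in tgt) / len(tgt) for i in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(ms, mt))

def lmmd(src_by_class, tgt_by_class):
    # Local MMD: average the class-wise discrepancies so each emotion
    # subdomain is aligned separately (uniform class weights assumed).
    classes = sorted(src_by_class.keys() & tgt_by_class.keys())
    return sum(mmd2_linear(src_by_class[c], tgt_by_class[c])
               for c in classes) / len(classes)

src = {"anger": [[0.0], [2.0]], "joy": [[0.0]]}
tgt = {"anger": [[1.0], [1.0]], "joy": [[2.0]]}
loss = lmmd(src, tgt)
```

Minimizing this quantity during training pulls each class-conditional feature distribution of the target corpus toward its source-corpus counterpart, rather than aligning only the global marginals.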
6.
7.
Domain adaptation algorithms are widely used in cross-corpus speech emotion recognition; however, many of them sacrifice the discriminability of target-domain samples while reducing domain discrepancy, so that those samples cluster densely near the model's decision boundary and degrade performance. To address this, a cross-corpus speech emotion recognition method based on decision-boundary-optimized domain adaptation (DBODA) is proposed. Features are first processed by a convolutional neural network and then fed into a maximization-of-nuclear-norm and mean discrepancy (MNMD) module, which reduces inter-domain discrepancy while maximizing the nuclear norm of the target-domain emotion-prediction probability matrix, thereby improving target-sample discriminability and optimizing the decision boundary. In six cross-corpus experiments built on the Berlin, eNTERFACE, and CASIA corpora, the proposed method leads the competing algorithms by 1.68 to 11.01 percentage points in average recognition accuracy, showing that it effectively reduces sample density at the decision boundary and improves prediction accuracy.
8.
Hong-Kwang Jeff Kuo, Yuqing Gao 《IEEE transactions on audio, speech, and language processing》2006,14(3):873-881
Traditional statistical models for speech recognition have mostly been based on a Bayesian framework using generative models such as hidden Markov models (HMMs). This paper focuses on a new framework for speech recognition using maximum entropy direct modeling, where the probability of a state or word sequence given an observation sequence is computed directly from the model. In contrast to HMMs, features can be asynchronous and overlapping. This model therefore allows for the potential combination of many different types of features, which need not be statistically independent of each other. In this paper, a specific kind of direct model, the maximum entropy Markov model (MEMM), is studied. Even with conventional acoustic features, the approach already shows promising results for phone level decoding. The MEMM significantly outperforms traditional HMMs in word error rate when used as stand-alone acoustic models. Preliminary results combining the MEMM scores with HMM and language model scores show modest improvements over the best HMM speech recognizer.
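The MEMM's local distribution P(s' | s, o) is a normalized exponential of weighted feature functions, which is what allows overlapping, non-independent features. A minimal sketch (feature names, states, and weights below are hypothetical; real weights come from maximum entropy training):

```python
import math

def memm_transition(weights, feature_fn, states):
    # P(s' | s, o): exponentiate the weighted sum of active feature values
    # for each candidate next state, then normalize over all candidates.
    # feature_fn(s_next) -> {feature_name: value}; weights: name -> lambda.
    scores = {s: math.exp(sum(weights.get(f, 0.0) * v
                              for f, v in feature_fn(s).items()))
              for s in states}
    z = sum(scores.values())
    return {s: sc / z for s, sc in scores.items()}

w = {"is_vowel": math.log(3.0)}                      # hypothetical weight
feats = lambda s: {"is_vowel": 1.0} if s == "aa" else {}
p = memm_transition(w, feats, ["aa", "t"])
```

Because normalization is local to each transition, decoding can still use Viterbi-style dynamic programming, unlike globally normalized models.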
9.
Huang Zhengwei, Xue Wentao, Mao Qirong, Zhan Yongzhao 《Multimedia Tools and Applications》2017,76(5):6785-6799
Multimedia Tools and Applications - Research in emotion recognition seeks to develop insights into the variances of features of emotion in one common domain. However, automatic emotion recognition...
10.
《Computer Speech and Language》2014,28(6):1287-1297
This paper proposes an efficient speech data selection technique that can identify those data that will be well recognized. Conventional confidence measure techniques can also identify well-recognized speech data. However, those techniques require a lot of computation time for speech recognition processing to estimate confidence scores. Speech data with low confidence should not go through the time-consuming recognition process since they will yield erroneous spoken documents that will eventually be rejected. The proposed technique can select the speech data that will be acceptable for speech recognition applications. It rapidly selects speech data with high prior confidence based on acoustic likelihood values and using only speech and monophone models. Experiments show that the proposed confidence estimation technique is over 50 times faster than the conventional posterior confidence measure while providing equivalent data selection performance for speech recognition and spoken document retrieval.
11.
Seçkin Uluskan, Abhijeet Sangwan, John H. L. Hansen 《International Journal of Speech Technology》2017,20(4):799-811
Distant speech capture in lecture halls and auditoriums offers unique challenges for algorithm development in automatic speech recognition. In this study, a new adaptation strategy for distant noisy speech is created by means of phoneme classes. Unlike previous approaches which adapt the acoustic model to the features, the proposed phoneme-class based feature adaptation (PCBFA) strategy adapts the distant data features to the existing acoustic model, previously trained on close-microphone speech. The essence of PCBFA is to create a transformation that makes the distributions of phoneme classes of distant noisy speech similar to those of a close-talk microphone acoustic model in a multidimensional MFCC space. To achieve this, phoneme classes of distant noisy speech are recognized via artificial neural networks. PCBFA is the adaptation of features rather than of acoustic models. The main idea behind PCBFA is illustrated via a conventional Gaussian mixture model–hidden Markov model (GMM–HMM) system, although it can be extended to newer structures in automatic speech recognition (ASR). The adapted features, together with the new and improved acoustic models produced by PCBFA, are shown to outperform those created by acoustic model adaptation alone for ASR and keyword spotting. PCBFA offers a powerful new perspective on acoustic modeling of distant speech.
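A minimal per-class mean/variance alignment conveys the spirit of mapping distant-speech feature distributions onto close-talk statistics. This is a sketch in the same spirit, not the paper's learned transform; class labels would come from the neural-network phoneme classifier, and the statistics below are illustrative:

```python
def align_feature(x, cls, distant_stats, close_stats):
    # Shift and scale one feature vector so its phoneme-class statistics
    # (per-class mean, std) match those of the close-talk acoustic model.
    mu_d, sd_d = distant_stats[cls]
    mu_c, sd_c = close_stats[cls]
    return [(v - mu_d) / sd_d * sd_c + mu_c for v in x]

distant = {"vowel": (2.0, 1.0)}   # illustrative class statistics
close = {"vowel": (0.0, 2.0)}
aligned = align_feature([2.0, 3.0], "vowel", distant, close)
```

Conditioning the transform on the phoneme class is the key difference from a single global feature normalization: each class distribution is moved independently toward its close-talk counterpart.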
12.
Large margin hidden Markov models for speech recognition
Hui Jiang, Xinwei Li, Chaojun Liu 《IEEE transactions on audio, speech, and language processing》2006,14(5):1584-1595
In this paper, motivated by large margin classifiers in machine learning, we propose a novel method to estimate continuous-density hidden Markov model (CDHMM) for speech recognition according to the principle of maximizing the minimum multiclass separation margin. The approach is named large margin HMM. First, we show this type of large margin HMM estimation problem can be formulated as a constrained minimax optimization problem. Second, we propose to solve this constrained minimax optimization problem by using a penalized gradient descent algorithm, where the original objective function, i.e., minimum margin, is approximated by a differentiable function and the constraints are cast as penalty terms in the objective function. The new training method is evaluated in the speaker-independent isolated E-set recognition and the TIDIGITS connected digit string recognition tasks. Experimental results clearly show that the large margin HMMs consistently outperform the conventional HMM training methods. It has been consistently observed that the large margin training method yields significant recognition error rate reduction even on top of some popular discriminative training methods.
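One standard way to approximate a minimum margin by a differentiable function, as the abstract describes, is a log-sum-exp soft-min. The paper's exact penalty construction may differ; the smoothing constant `eta` here is an illustrative choice:

```python
import math

def soft_min_margin(score_true, competitor_scores, eta=10.0):
    # Differentiable log-sum-exp approximation of the minimum separation
    # margin min_i (score_true - score_i); larger eta gives a tighter
    # (but stiffer) approximation of the hard minimum.
    margins = [score_true - s for s in competitor_scores]
    return -math.log(sum(math.exp(-eta * m) for m in margins)) / eta

m = soft_min_margin(5.0, [1.0, 4.0])   # close to the true minimum margin, 1.0
```

The soft-min is a lower bound on the true minimum and converges to it as eta grows, so gradient ascent on it pushes up the worst-case (smallest) margin first.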
13.
14.
《Computer Speech and Language》2007,21(2):247-265
In this paper, an improved method of model complexity selection for nonnative speech recognition is proposed, using maximum a posteriori (MAP) estimation of bias distributions. An algorithm is described for estimating hyper-parameters of the priors of the bias distributions, and an automatic accent classification algorithm is also proposed for integration with dynamic model selection and adaptation. Experiments were performed on the WSJ1 task with American English speech, British-accented speech, and Mandarin Chinese-accented speech. Results show that the use of prior knowledge of accents enabled more reliable estimation of bias distributions with very small amounts of adaptation speech, or with none at all. Recognition results show that the new approach is superior to the previous maximum expected likelihood (MEL) method, especially when adaptation data are very limited.
15.
An improved tone modeling method based on articulatory features is proposed and applied to one-pass decoding with the stochastic segment model. Based on the articulatory properties of Mandarin, seven articulatory features distinguishing Chinese vowel and consonant information are defined; with these as targets, a hierarchical multilayer perceptron computes the posterior probabilities of a speech signal belonging to 35 articulatory-feature classes, and these posteriors are used together with conventional prosodic features for tone modeling. Following the decoding procedure of the stochastic segment model, tone-model probability scores are computed for the paths retained after two pruning stages and added, with a weight, to each path's total probability score. Experiments on the "863-test" set show that the tone model using the new articulatory feature set improves tone recognition accuracy by 3.11%, and that incorporating tone information reduces the character error rate of the stochastic segment model from 13.67% to 12.74%, demonstrating the feasibility of applying tone information to the stochastic segment model.
16.
《International journal of man-machine studies》1985,22(5):523-547
In this paper, the “continuous speech recognition” problem is given a clear mathematical formulation as the search for that sequence of basic speech units that best fits the input acoustic pattern. For this purpose spoken language models in the form of hierarchical transition networks are introduced, where lower level subnetworks describe the basic units as possible sequences of spectral states. The units adopted in this paper are either whole words or smaller subword elements, called diphones. The recognition problem thus becomes that of finding the best path through the network, a task carried out by the linguistic decoder. By using this approach, knowledge sources at different levels are strongly integrated. In this way, early decision making based on partial information (in particular any segmentation operation or the speech/silence distinction) is avoided; usually this is a significant source of errors. Instead, decisions are deferred to the linguistic decoder, which possesses all the necessary pieces of information. The properties that a linguistic decoder must possess in order to operate in real time are listed, and then a best-few algorithm with partial traceback of explored paths, satisfying the above requisites, is described. In particular, the amount of storage needed is almost constant for any sentence length, the computation is approximately linear with sentence length, and the interpretation of early words in a sentence may be possible long before the speaker has finished talking. Experimental results with two systems, one with words and the other with diphones as basic speech units, are reported.
Finally, the relative merits of words and diphones are discussed, taking into account aspects such as the storage and computing time requirements, their relative ability to deal with phonological variations and to discriminate between similar words, their speaker adaptation capability, and the ease with which it is possible to change the vocabulary and the language dependencies.
17.
Tao Ma, Sundararajan Srinivasan, Georgios Lazarou, Joseph Picone 《International Journal of Speech Technology》2014,17(1):11-16
Hidden Markov models (HMMs) with Gaussian mixture distributions rely on an assumption that speech features are temporally uncorrelated, and often assume a diagonal covariance matrix where correlations between feature vectors for adjacent frames are ignored. A Linear Dynamic Model (LDM) is a Markovian state-space model that also relies on hidden state modeling, but explicitly models the evolution of these hidden states using an autoregressive process. An LDM is capable of modeling higher order statistics and can exploit correlations of features in an efficient and parsimonious manner. In this paper, we present a hybrid LDM/HMM decoder architecture that postprocesses segmentations derived from the first pass of an HMM-based recognizer. This smoothed trajectory model is complementary to existing HMM systems. An Expectation-Maximization (EM) approach for parameter estimation is presented. We demonstrate a 13% relative WER reduction on the Aurora-4 clean evaluation set, and a 13% relative WER reduction on the babble noise condition.
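A scalar, noise-free rollout conveys the autoregressive state evolution the abstract describes. Real LDMs use vector-valued states with process and observation noise, and estimate F and H via EM; this sketch strips all of that away:

```python
def ldm_trajectory(F, H, x0, steps):
    # Noise-free rollout of a scalar Linear Dynamic Model: the hidden state
    # evolves autoregressively (x_t = F * x_{t-1}) and each observation is
    # a linear function of the current state (y_t = H * x_t).
    ys, x = [], x0
    for _ in range(steps):
        x = F * x
        ys.append(H * x)
    return ys

traj = ldm_trajectory(0.5, 2.0, 1.0, 3)
```

Because successive observations share the same decaying hidden state, adjacent frames are correlated by construction, which is exactly the inter-frame structure a diagonal-covariance GMM-HMM discards.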
18.
When acoustic models trained on standard Mandarin speech data are used to recognize non-native Mandarin spoken by Uyghur speakers from Xinjiang, recognition accuracy drops sharply because of large deviations in the speakers' Mandarin pronunciation. To address this problem, multi-pronunciation dictionary techniques are applied to Mandarin speech recognition for Uyghur speakers: the recognizer's errors are statistically analyzed to build a phoneme confusion matrix and obtain candidate pronunciations for each phoneme. Pruning strategies are then used to prune and merge the candidates, expanding an alternative dictionary that matches the Mandarin pronunciation patterns of Uyghur speakers. The recognition results of dictionaries produced by three pruning methods are compared. Experiments show that the dictionary produced by the relative-maximum pruning strategy significantly improves the system's recognition rate.
19.
《Computer Speech and Language》2001,15(2):127-148
The aim of this work is to show the ability of stochastic regular grammars to generate accurate language models which can be well integrated, allocated and handled in a continuous speech recognition system. For this purpose, a syntactic version of the well-known n-gram model, called k-testable language in the strict sense (k-TSS), is used. The complete definition of a k-TSS stochastic finite state automaton is provided in the paper. One of the difficulties arising in representing a language model through a stochastic finite state network is that the recursive schema involved in the smoothing procedure must be adopted in the finite state formalism to achieve an efficient implementation of the backing-off mechanism. The use of the syntactic back-off smoothing technique applied to k-TSS language modelling allowed us to obtain a self-contained smoothed model integrating several k-TSS automata in a unique smoothed and integrated model, which is also fully defined in the paper. The proposed formulation leads to a very compact representation of the model parameters learned at training time: probability distribution and model structure. The dynamic expansion of the structure at decoding time allows an efficient integration in a continuous speech recognition system using a one-step decoding procedure. An experimental evaluation of the proposed formulation was carried out on two Spanish corpora. These experiments showed that regular grammars generate accurate language models (k-TSS) that can be efficiently represented and managed in real speech recognition systems, even for high values of k, leading to very good system performance.
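The back-off recursion that the smoothing machinery implements can be sketched as follows. Back-off normalization weights are omitted for brevity, so this illustrates the recursion rather than a properly normalized model, and the discount value and floor probability are illustrative choices:

```python
def backoff_prob(counts, context, word, discount=0.5):
    # Use the discounted k-gram estimate when this history/word pair was
    # seen in training; otherwise back off to the shorter history, all the
    # way down to the unigram level. counts maps history tuples to
    # {word: count}; the empty tuple () holds the unigram counts.
    hist = counts.get(context)
    if hist and word in hist:
        return max(hist[word] - discount, 0.0) / sum(hist.values())
    if context:
        return backoff_prob(counts, context[1:], word, discount)
    return 1e-6  # floor for words unseen even as unigrams

counts = {("a",): {"b": 2}, (): {"b": 3, "c": 1}}
p_seen = backoff_prob(counts, ("a",), "b")     # discounted bigram estimate
p_back = backoff_prob(counts, ("a",), "c")     # falls back to the unigram
```

In the k-TSS finite-state view, each backing-off step corresponds to an epsilon-like transition from a k-gram state to its (k-1)-gram state, which is what lets one smoothed automaton embed all the k-TSS sub-automata.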
20.
To address the Conformer encoder's insufficient extraction of FBank speech information and the loss of channel feature information in its acoustic input network, an end-to-end speech recognition method named RepVGG-SE-Conformer is proposed. First, RepVGG's multi-branch structure strengthens the model's capacity to extract speech information, and at inference time the branches are fused into a single branch via structural re-parameterization to lower computational complexity and speed up inference. Then, a channel attention mechanism based on squeeze-and-excitation networks compensates for the missing channel feature information to improve recognition accuracy. Experiments on the public Aishell-1 dataset show that, compared with Conformer, the proposed method reduces the character error rate by 10.67%, confirming its effectiveness. Moreover, the RepVGG-SE acoustic input network effectively improves the overall performance of end-to-end speech recognition models built on several Transformer variants, showing good generalization ability.