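When an acoustic model trained on standard Mandarin speech is applied to non-native Mandarin speech from Uyghur speakers in Xinjiang, the recognition rate drops sharply because the speakers' Mandarin pronunciation deviates substantially from the standard. To address this problem, multi-pronunciation dictionary techniques are applied to Mandarin speech recognition for Uyghur speakers in Xinjiang. The recognizer's errors are analyzed statistically to build a phoneme confusion matrix, from which pronunciation candidates for each phoneme are obtained. A pruning strategy is then used to prune and merge these candidates, producing an alternative dictionary that reflects the Mandarin pronunciation patterns of Uyghur speakers. The recognition results of the dictionaries produced by three pruning methods are compared. Experiments show that the dictionary produced by the relative maximum pruning strategy significantly improves the system's recognition rate.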
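The confusion-matrix and pruning steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not define "relative maximum pruning", so the rule used here (keep a variant only if its count is at least a fixed fraction of the most frequent variant's count) is an assumption, as are the function names, the `ratio` parameter and the toy alignment data.

```python
from collections import Counter, defaultdict


def confusion_counts(aligned_pairs):
    """Count how often each canonical phoneme was recognized as each phoneme."""
    counts = defaultdict(Counter)
    for canonical, recognized in aligned_pairs:
        counts[canonical][recognized] += 1
    return counts


def relative_max_prune(counts, ratio=0.5):
    """Assumed pruning rule: keep a variant only if its count is at least
    `ratio` times the count of the most frequent variant in the same row."""
    kept = {}
    for canonical, row in counts.items():
        top = max(row.values())
        kept[canonical] = sorted(p for p, c in row.items() if c >= ratio * top)
    return kept


# Hypothetical (canonical, recognized) pairs from a forced alignment.
pairs = (
    [("zh", "zh")] * 6 + [("zh", "z")] * 4 + [("zh", "j")] * 1
    + [("r", "r")] * 5 + [("r", "l")] * 1
)
variants = relative_max_prune(confusion_counts(pairs), ratio=0.5)
print(variants)
```

With these toy counts, "zh" keeps the frequent substitution "z" as an alternative pronunciation, while the rare confusions "j" and "l" are pruned. The surviving variants are what would be expanded into extra dictionary entries.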

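To address the variability of speech signals caused by coarticulation in Mandarin speech recognition, a syllable-based acoustic modeling method is proposed. First, syllable-based acoustic models are built to handle the sound changes between initials and finals within a syllable, and intra-syllable diphone models are used to initialize the parameters of the syllable-based models to alleviate the sparsity of training data. Transition models between syllables are then introduced to handle inter-syllable coarticulation. Mandarin continuous speech recognition experiments on the "863-test" set show a relative reduction of 12.13% in character error rate, which demonstrates the effectiveness of combining syllable-based acoustic models with inter-syllable transition models for handling Mandarin coarticulation.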

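The Tandem feature extraction method, which combines the strengths of the Gaussian mixture model and artificial neural network frameworks commonly used in speech recognition, is applied to Uyghur acoustic model training. After a series of post-processing steps, the original MFCC features are converted into Tandem features, which serve as the input to a speech recognition system based on hidden Markov models. The acoustic model is trained with the minimum phone error discriminative training criterion, and recognition experiments are then run on the test set. The results show that Tandem discriminative training reduces the word error rate by 13% relative to the original system based on the maximum likelihood estimation criterion.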

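Direct acoustic modeling of fundamental frequency features in Mandarin speech recognition
A method is proposed that uses hidden conditional random fields (HCRFs) to model discontinuous fundamental frequency (F0) sequences directly. In Mandarin speech, F0 is continuous in voiced segments and undefined in unvoiced segments. The method exploits an important property that distinguishes HCRFs from hidden Markov models, namely that the observations need not all be modeled in a uniform way, and models the discontinuous F0 values directly together with the continuous spectral features. Tonal syllable classification experiments on a large-vocabulary Mandarin speech corpus show that direct modeling of the discontinuous F0 sequence under HCRFs gives a clearly higher recognition rate than using F0 features that are artificially smoothed across unvoiced segments. A comparison with hidden Markov acoustic models trained under different discriminative criteria is also given.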

The high error rate in spontaneous speech recognition is due in part to the poor modeling of pronunciation variations. An analysis of acoustic data reveals that pronunciation variations include both complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by an alternative phone, such as ‘b’ being pronounced as ‘p’. Partial changes are variations within the phoneme, such as nasalization, centralization, devoicing and voicing. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the hidden Markov state level. In this paper, we show that partial changes are far less clear-cut than previously assumed and cannot be modeled merely by alternate phones or by a concatenation of phone units. We propose modeling partial changes through acoustic model reconstruction. We first propose a partial change phone model (PCPM) to differentiate pronunciation variations. To improve model resolution without increasing the number of parameters too much, the PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for PCPM triphones, and each auxiliary decision tree can be used by only one standard decision tree. Acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus (1997 MBN), which contains different styles of speech. It gives a significant 2.39% absolute reduction in syllable error rate on spontaneous speech.

Audio-visual speech modeling for continuous speech recognition
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration, and hence models temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate them in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic features and visual features to a 2.5% error rate.

International Journal of Speech Technology - Speech emotion recognition is one of the fastest growing areas of interest in the field of affective computing. Emotion detection aids...

Distant speech capture in lecture halls and auditoriums poses unique challenges for algorithm development in automatic speech recognition. In this study, a new adaptation strategy for distant noisy speech is created by means of phoneme classes. Unlike previous approaches, which adapt the acoustic model to the features, the proposed phoneme-class based feature adaptation (PCBFA) strategy adapts the distant data features to the existing acoustic model, which was previously trained on close-microphone speech. The essence of PCBFA is a transformation strategy that makes the distributions of the phoneme classes of distant noisy speech similar to those of a close-talk microphone acoustic model in a multidimensional MFCC space. To achieve this, the phoneme classes of distant noisy speech are recognized via artificial neural networks. The main idea behind PCBFA is illustrated with a conventional Gaussian mixture model–Hidden Markov model (GMM–HMM), although it can be extended to newer structures in automatic speech recognition (ASR). The adapted features, together with the improved acoustic models produced by PCBFA, are shown to outperform those created by acoustic model adaptation alone for ASR and keyword spotting. PCBFA offers a powerful new perspective on the acoustic modeling of distant speech.

We describe the automatic determination of a large and complicated acoustic model for speech recognition by using variational Bayesian estimation and clustering (VBEC). We propose an efficient method for decision tree clustering based on a Gaussian mixture model (GMM) and an efficient model search algorithm for finding an appropriate acoustic model topology within the VBEC framework. GMM-based decision tree clustering for triphone HMM states features a novel approach designed to reduce the overly large number of computations to a practical level by utilizing the statistics of monophone hidden Markov model states. The model search algorithm also reduces the search space by utilizing the characteristics of the acoustic model. The experimental results confirmed that VBEC automatically and rapidly yielded an optimum model topology with the highest performance.

Language Resources and Evaluation - This paper introduces the first large vocabulary speech recognition system (LVSR) for the Central Kurdish language, named Jira. The Kurdish language is an...

In this paper we are concerned with the problem of adapting to non-native speech in a large-vocabulary speech recognition system for Modern Standard Arabic (MSA). A technique is presented that adapts Hidden Markov Models (HMMs) to foreign accents by using Genetic Algorithms (GAs) in unsupervised mode. The implementation requirements of the GAs, such as genetic operators and objective function, have been selected to give more reliability to a global linear transformation matrix. The Minimum Phone Error (MPE) criterion is used as the objective function. The West Point Language Data Consortium (LDC) Modern Standard Arabic database is used throughout our experiments. Results show that the evolutionary-based approach achieves a significant decrease in word error rate compared to conventional Maximum Likelihood Linear Regression (MLLR) and Maximum a Posteriori (MAP) techniques, and to adaptation combining MLLR and MPE-based training.

Feature fusion plays an important role in speech emotion recognition: it improves classification accuracy by combining the most popular acoustic features, such as energy, pitch and mel frequency cepstral coefficients. However, system performance is not optimal because of the computational complexity caused by the high-dimensional, correlated feature set that results from fusion. In this paper, a two-stage feature selection method is proposed. In the first stage, appropriate features are selected and fused together for speech emotion recognition. In the second stage, optimal feature subset selection techniques [sequential forward selection (SFS) and sequential floating forward selection (SFFS)] are used to mitigate the curse of dimensionality caused by the high-dimensional feature vector after fusion. Finally, the emotions are classified using several classifiers: Linear Discriminant Analysis (LDA), Regularized Discriminant Analysis (RDA), Support Vector Machine (SVM) and K Nearest Neighbor (KNN). The performance of the overall emotion recognition system is validated on the Berlin and Spanish databases in terms of classification rate. An optimal uncorrelated feature set is obtained by using SFS and SFFS individually. Results reveal that SFFS is the better choice for feature subset selection because SFS suffers from the nesting problem, i.e., it is difficult to discard a feature once it has been retained in the set. SFFS eliminates this nesting problem by not fixing the set at any stage but letting it float up and down during selection according to the objective function. Experimental results show that the efficiency of the classifier improves by 15–20% with the two-stage feature selection method compared with the classifier using feature fusion alone.
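The nesting problem mentioned in this abstract can be seen on a toy example. The sketch below is a simplified textbook-style version of SFS and SFFS, not the paper's code; the feature names, the `TABLE` of subset scores and the function names are all hypothetical, and the score lookup stands in for whatever objective function (for example, cross-validated classification rate) is actually used.

```python
def sfs(features, score, k):
    """Sequential forward selection: greedily add the best feature; never remove."""
    selected = []
    while len(selected) < k:
        best = max((f for f in features if f not in selected),
                   key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected


def sffs(features, score, k):
    """Sequential floating forward selection: after each forward step, drop
    features for as long as that beats the best subset seen at the smaller size."""
    selected, best_by_size = [], {}
    while len(selected) < k:
        add = max((f for f in features if f not in selected),
                  key=lambda f: score(selected + [f]))
        selected = selected + [add]
        n = len(selected)
        best_by_size[n] = max(best_by_size.get(n, float("-inf")), score(selected))
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            if score(reduced) > best_by_size.get(len(reduced), float("-inf")):
                selected = reduced
                best_by_size[len(reduced)] = score(reduced)
            else:
                break
    return selected


# Hypothetical subset scores: "a" is the best single feature, but the best
# triple {b, c, d} does not contain it.
TABLE = {
    "a": 0.60, "b": 0.50, "c": 0.40, "d": 0.30,
    "ab": 0.70, "ac": 0.65, "ad": 0.62, "bc": 0.90, "bd": 0.55, "cd": 0.50,
    "abc": 0.92, "abd": 0.75, "acd": 0.70, "bcd": 0.95,
}


def toy_score(subset):
    return TABLE["".join(sorted(subset))]


features = ["a", "b", "c", "d"]
sfs_result = sorted(sfs(features, toy_score, 3))
sffs_result = sorted(sffs(features, toy_score, 3))
print(sfs_result, toy_score(sfs_result))
print(sffs_result, toy_score(sffs_result))
```

On this table SFS is stuck with "a" once it has picked it, while SFFS backs out of {a, b, c} to {b, c} and then reaches {b, c, d}.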

Dysarthria is a motor speech disorder caused by neurological injury of the motor component of the motor-speech system. Because it affects respiration, phonation, and articulation, it leads to different types of impairments in intelligibility, audibility, and efficiency of vocal communication. Speech Assistive Technology (SAT) has been developed with different approaches for dysarthric speech and in this paper we focus on the approach that is based on modeling of pronunciation patterns. We present an approach that integrates multiple pronunciation patterns for enhancement of dysarthric speech recognition. This integration is performed by weighting the responses of an Automatic Speech Recognition (ASR) system when different language model restrictions are set. The weight for each response is estimated by a Genetic Algorithm (GA) that also optimizes the structure of the implementation technique (Metamodels) which is based on discrete Hidden Markov Models (HMMs). The GA makes use of dynamic uniform mutation/crossover to further diversify the candidate sets of weights and structures to improve the performance of the Metamodels. To test the approach with a larger vocabulary than in previous works, we orthographically and phonetically labeled extended acoustic resources from the Nemours database of dysarthric speech. ASR tests on these resources with the proposed approach showed recognition accuracies over those obtained with standard Metamodels and a well used speaker adaptation technique. These results were statistically significant.

This paper describes a set of modeling techniques for detecting a small vocabulary of keywords in running conversational speech. The techniques are applied in the context of a hidden Markov model (HMM) based continuous speech recognition (CSR) approach to keyword spotting. The word spotting task is derived from the Switchboard conversational speech corpus, and involves unconstrained conversational speech utterances spoken over the public switched telephone network. The utterances in this task contain many of the artifacts that are characteristic of unconstrained speech as it appears in many telecommunications based automatic speech recognition (ASR) applications. Results are presented for an experimental study that was performed on this task. Performance was measured by computing the percentage correct keyword detection over a range of false alarm rates evaluated over 2.2 h of speech for a 20 keyword vocabulary. The results of the study demonstrate the importance of several techniques. These techniques include the use of decision tree based allophone clustering for defining acoustic subword units, different representations for non-vocabulary words appearing in the input utterance, and the definition of simple language models for keyword detection. Decision tree based allophone clustering resulted in a significant increase in keyword detection performance over that obtained using tri-phone based subword units while at the same time reducing the size of the inventory of subword acoustic models by 40%. More complex representations of non-vocabulary speech were also found to significantly improve keyword detection performance; however, these representations also resulted in a significant increase in computational complexity.

This research explores the various indicators for non-verbal cues of speech and provides a method of building a paralinguistic profile of these speech characteristics which determines the emotional state of the speaker. Since a major part of human communication consists of vocalization, a robust approach that is capable of classifying and segmenting an audio stream into silent and voiced regions and developing a paralinguistic profile for the same is presented. The data consisting of disruptions is first segmented into frames and this data is analyzed by exploiting short term acoustic features, temporal characteristics of speech and measures of verbal productivity. A matrix is finally developed relating the paralinguistic properties of average pitch, energy, rate of speech, silence duration and loudness to their respective context. Happy and confident states possessed high values of energy and rate of speech and less silence duration whereas tense and sad states showed low values of energy and speech rate and high periods of silence. Paralanguage was found to be an important cue to decipher the implicit meaning in a speech sample.

Current machine translation systems are far from perfect. However, such systems can be used in computer-assisted translation to increase the productivity of the (human) translation process. The idea is to use a text-to-text translation system to produce portions of target language text that can be accepted or amended by a human translator using text or speech. These user-validated portions are then used by the text-to-text translation system to produce further, hopefully improved, suggestions. There are different ways of using speech in a computer-assisted translation system, from pure dictated translation to simply indicating acceptable partial translations by reading parts of the suggestions made by the system. In all cases, information from the text to be translated can be used to constrain the speech decoding search space. While pure dictation seems to be among the most attractive settings, perfect speech decoding unfortunately does not seem possible with current speech processing technology, and human error correction would still be required. Therefore, approaches that allow for higher speech recognition accuracy by using increasingly constrained models in the speech recognition process are explored here. All these approaches are presented within a statistical framework. Empirical results support the potential usefulness of speech within the computer-assisted translation paradigm.

Approximate maximum likelihood (ML) hidden Markov modeling using the most likely state sequence (MLSS) is examined and compared with the exact ML approach that considers all possible state sequences. It is shown that for any hidden Markov model (HMM), the difference between the approximate and the exact normalized likelihood functions cannot exceed the logarithm of the number of states divided by the dimension of the output vectors (frame length), which is negligible for typically used values of vector dimension (128–256) and number of states (2–30). Furthermore, for Gaussian HMMs and a given observation sequence, the MLSS is typically the sequence of nearest neighbor states in the Itakura-Saito sense, and the posterior probability of any state sequence which departs from the MLSS in a single time instant decays exponentially with the frame length. Hence, for a sufficiently large frame length the exact and approximate ML approaches provide similar model estimates and likelihood values. The results and their implications on speech recognition are demonstrated in a set of experiments.
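One elementary way to see why a bound of this form holds: the exact likelihood is a sum over all N^T state sequences, and the MLSS likelihood is the largest term of that sum, so the two log-likelihoods differ by at most T log N. Dividing by the T·K samples in T frames of dimension K gives log N / K. The sketch below checks the unnormalized inequality by brute force on a small discrete HMM. The parameters `pi`, `A`, `B` and the observation sequence are made up for illustration, and a discrete HMM is used only for simplicity (the abstract concerns vector-valued outputs).

```python
import math
from itertools import product


def exact_and_mlss_loglik(pi, A, B, obs):
    """Brute-force log-likelihoods: summed over all state sequences (exact ML)
    and for the single most likely state sequence (MLSS)."""
    n = len(pi)
    total, best = 0.0, 0.0
    for states in product(range(n), repeat=len(obs)):
        p = pi[states[0]] * B[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
        total += p
        best = max(best, p)
    return math.log(total), math.log(best)


# Hypothetical 3-state HMM with binary observations.
pi = [0.5, 0.3, 0.2]
A = [[0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1],
     [0.2, 0.3, 0.5]]
B = [[0.9, 0.1],
     [0.4, 0.6],
     [0.2, 0.8]]
obs = [0, 1, 1, 0, 1]

exact, mlss = exact_and_mlss_loglik(pi, A, B, obs)
gap = exact - mlss
bound = len(obs) * math.log(len(pi))
print(f"gap = {gap:.4f}, T*log(N) = {bound:.4f}")
```

Enumerating all sequences is only feasible for toy sizes; in practice the two quantities come from the forward algorithm and the Viterbi algorithm respectively.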

Speech production errors characteristic of dysarthria are chiefly responsible for the low accuracy of automatic speech recognition (ASR) when used by people diagnosed with it. A person with dysarthria produces speech in a rather reduced acoustic working space, causing typical measures of speech acoustics to have values in ranges very different from those characterizing unimpaired speech. It is unlikely then that models trained on unimpaired speech will be able to adjust to this mismatch when acted on by one of the currently well-studied adaptation algorithms (which make no attempt to address this extent of mismatch in population characteristics). In this work, we propose an interpolation-based technique for obtaining a prior acoustic model from one trained on unimpaired speech, before adapting it to the dysarthric talker. The method computes a ‘background’ model of the dysarthric talker's general speech characteristics and uses it to obtain a more suitable prior model for adaptation (compared to the speaker-independent model trained on unimpaired speech). The approach is tested with a corpus of dysarthric speech acquired by our research group, on speech of sixteen talkers with varying levels of dysarthria severity (as quantified by their intelligibility). This interpolation technique is tested in conjunction with the well-known maximum a posteriori (MAP) adaptation algorithm, and yields improvements of up to 8% absolute and up to 40% relative, over the standard MAP adapted baseline.
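The general idea of building a better prior and then applying MAP adaptation can be illustrated for a single Gaussian mean. The MAP mean update used below is the standard one (prior mean weighted by a relevance factor `tau`, data weighted by occupation counts). The abstract does not give the interpolation formula, so the linear interpolation with weight `lam` is purely an assumption for illustration, and all names and numbers are hypothetical.

```python
def interpolate_means(si_mean, background_mean, lam):
    """Assumed form of the prior: a linear interpolation between the
    speaker-independent mean and the talker's 'background' mean."""
    return [(1.0 - lam) * s + lam * b for s, b in zip(si_mean, background_mean)]


def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """Standard MAP update of a Gaussian mean:
    (tau * prior + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)."""
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted = [sum(g * x[d] for g, x in zip(posteriors, frames))
                for d in range(dim)]
    return [(tau * prior_mean[d] + weighted[d]) / (tau + occupancy)
            for d in range(dim)]


# Hypothetical 2-dimensional example.
si_mean = [0.0, 0.0]             # mean trained on unimpaired speech
background_mean = [2.0, 4.0]     # talker's general 'background' mean
prior = interpolate_means(si_mean, background_mean, lam=0.5)

frames = [[3.0, 2.0], [3.0, 2.0]]   # adaptation frames assigned to this Gaussian
posteriors = [1.0, 1.0]             # their occupation probabilities
adapted = map_adapt_mean(prior, frames, posteriors, tau=2.0)
print(prior, adapted)
```

With little adaptation data the adapted mean stays close to the prior, which is why starting from a prior nearer to the talker's acoustic space can matter more than it would with abundant data.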
