Similar Literature
20 similar documents found.
1.
During the last decade, the most significant advances in the field of continuous speech recognition (CSR) have arisen from the use of hidden Markov models (HMM) for acoustic modeling. These models address one of the major issues for CSR: simultaneous modeling of temporal and frequency distortions in the speech signal. In the HMM, the temporal dimension is managed through an oriented state graph, each state accounting for the local frequency distortions through a probability density function. In this study, improvement of the HMM performance is expected from the introduction of a very effective non-parametric probability density function estimate: the k-nearest neighbors (k-nn) estimate. First, experiments on a short-term speech spectrum identification task are performed to compare the k-nn estimate with the widespread estimate based on mixtures of Gaussian functions. Then the adaptations implied by the integration of the k-nn estimate into an HMM-based recognition system are developed. An optimal training protocol is obtained based on the introduction of membership coefficients into the HMM parameters. The membership coefficients measure the degree of association between a reference acoustic vector and an HMM state. The training procedure uses the expectation-maximization (EM) algorithm applied to the membership coefficient estimation, and its convergence is shown according to the maximum likelihood criterion. This study leads to the development of a baseline k-nn/HMM recognition system which is evaluated on the TIMIT speech database. Further improvements of the k-nn/HMM system are finally sought through the introduction of temporal information into the representation space (delta coefficients) and the adaptation of the references (mainly gender modeling and contextual modeling).
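As an illustration of the core idea, here is a minimal sketch of a per-state k-nearest-neighbour density estimate of the kind that could replace a Gaussian-mixture output density; the function names, the Euclidean metric, and k = 10 are assumptions for illustration, not details taken from the paper.

```python
import numpy as np
from scipy.special import gammaln

def knn_log_density(x, references, k=10):
    """k-nearest-neighbour density estimate p(x) ~ k / (N * V_k(x)),
    where V_k(x) is the volume of the d-dimensional ball whose radius is
    the distance from x to its k-th nearest reference vector."""
    refs = np.asarray(references, dtype=float)
    n, d = refs.shape
    dists = np.linalg.norm(refs - x, axis=1)
    r_k = np.partition(dists, k - 1)[k - 1]          # distance to k-th neighbour
    # log volume of a d-dimensional ball of radius r_k
    log_vol = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1) + d * np.log(r_k + 1e-12)
    return np.log(k) - np.log(n) - log_vol

# Toy usage: per-state reference sets would come from segmented training data.
rng = np.random.default_rng(0)
state_refs = rng.normal(size=(500, 12))              # e.g. 12-dim cepstral vectors
print(knn_log_density(np.zeros(12), state_refs, k=10))
```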

2.
LVCSR systems are usually based on continuous-density HMMs, which are typically implemented using Gaussian mixture distributions. Such statistical modeling systems tend to operate slower than real time, largely because of the heavy computational overhead of the likelihood evaluation. The objective of our research is to investigate approximate methods that can substantially reduce the computational cost of likelihood evaluation without noticeably degrading recognition accuracy. In this paper, the most common techniques for speeding up the likelihood computation are classified into three categories, namely machine optimization, model optimization, and algorithm optimization. Each category is surveyed and summarized by describing and analyzing the basic ideas of the corresponding techniques. The distribution of the numerical values of the Gaussian mixtures within a GMM is evaluated and analyzed to show that the computation of some Gaussians is unnecessary and can thus be eliminated. Two commonly used techniques for likelihood approximation, namely VQ-based Gaussian selection and partial distance elimination, are analyzed in detail. Based on these analyses, a fast likelihood computation approach called dynamic Gaussian selection (DGS) is proposed. DGS is a one-pass search technique which generates a dynamic shortlist of Gaussians for each state during the likelihood computation. In principle, DGS is an extension of both partial distance elimination and best mixture prediction, and it does not require additional memory for the storage of Gaussian shortlists. The DGS algorithm has been implemented by modifying the likelihood computation procedure in the HTK 3.4 system. Experimental results on the TIMIT and WSJ0 corpora indicate that this approach speeds up the likelihood computation significantly without introducing appreciable additional recognition error.
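To make the cost issue concrete, the following sketch shows partial distance elimination, one of the two approximation techniques analyzed in the paper: the per-dimension Mahalanobis terms of a diagonal-covariance Gaussian are accumulated only until the component can no longer beat the current best one. The function names and toy model are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def best_component_loglike_pde(x, means, inv_vars, log_consts):
    """Approximate a diagonal-covariance GMM state likelihood by its best
    component, using partial distance elimination: stop summing a component's
    per-dimension Mahalanobis terms as soon as it can no longer beat the
    current best score."""
    best = -np.inf
    for mu, iv, logc in zip(means, inv_vars, log_consts):
        # score = logc - 0.5 * sum_d iv[d] * (x[d]-mu[d])^2, built up dimension by dimension
        acc = 0.0
        bound = 2.0 * (logc - best)        # once 'acc' exceeds this, the component loses
        for d in range(x.shape[0]):
            acc += iv[d] * (x[d] - mu[d]) ** 2
            if acc > bound:
                break
        else:
            best = logc - 0.5 * acc
    return best

# Toy usage with a random 8-component, 12-dimensional state.
rng = np.random.default_rng(1)
M, D = 8, 12
means = rng.normal(size=(M, D))
variances = rng.uniform(0.5, 2.0, size=(M, D))
inv_vars = 1.0 / variances
log_consts = -0.5 * (D * np.log(2 * np.pi) + np.log(variances).sum(axis=1))
print(best_component_loglike_pde(rng.normal(size=D), means, inv_vars, log_consts))
```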

3.
The main recognition procedure in modern HMM-based continuous speech recognition systems is the Viterbi algorithm, which finds the best acoustic sequence for the input speech in the search space using dynamic programming. In this paper, dynamic programming is replaced by a search method based on particle swarm optimization. The main idea is to generate an initial population of particles representing speech segmentation vectors; the particles then try to reach the best segmentation through an update rule applied over the iterations. A new particle representation and recognition process is introduced that is consistent with the nature of continuous speech recognition. The idea was tested on bi-phone recognition and continuous speech recognition benchmarks, and the results show that the proposed search method approaches the performance of the Viterbi segmentation algorithm, with only a slight degradation in the accuracy rate.
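For reference, a minimal sketch of the standard particle-swarm update applied to segmentation boundary vectors is given below; the encoding (one boundary index per dimension), the scoring function, and all hyper-parameter values are assumptions for illustration and not the paper's exact formulation.

```python
import numpy as np

def pso_segment(loglike, n_segments, n_frames, n_particles=30, iters=100,
                w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle-swarm sketch for choosing segment boundaries.
    Each particle is a vector of boundary frame indices; 'loglike' is any
    user-supplied function scoring a boundary vector (higher is better)."""
    rng = np.random.default_rng(seed)
    dim = n_segments - 1
    pos = np.sort(rng.uniform(1, n_frames - 1, size=(n_particles, dim)), axis=1)
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_score = np.array([loglike(p) for p in pos])
    gbest = pbest[pbest_score.argmax()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(np.sort(pos + vel, axis=1), 1, n_frames - 1)
        scores = np.array([loglike(p) for p in pos])
        improved = scores > pbest_score
        pbest[improved], pbest_score[improved] = pos[improved], scores[improved]
        gbest = pbest[pbest_score.argmax()].copy()
    return np.rint(gbest).astype(int)

# Toy usage: favour boundaries near frames 30 and 70 of a 100-frame utterance.
target = np.array([30.0, 70.0])
print(pso_segment(lambda b: -np.sum((b - target) ** 2), n_segments=3, n_frames=100))
```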

4.
5.
Design and Implementation of an Automatic Mandarin Pronunciation Scoring System Based on Speech Recognition (cited 6 times: 0 self-citations, 6 by others)
Advances in speech recognition technology have made human-computer interaction possible. Addressing the shortcomings of current pronunciation instruction in teaching Chinese as a foreign language, and drawing on the relevant principles of speech recognition, this paper proposes the design of an automatic Mandarin pronunciation assessment system for teaching Chinese as a foreign language, and describes the system's structure, functions, and workflow in detail. The key techniques and steps of the implementation are introduced: the dynamic time warping algorithm, corpus construction, initial/final segmentation, and the grading criteria. A small-scale trial shows that the system provides a useful reference for assessing the Mandarin pronunciation of international students.
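Since the abstract names dynamic time warping as a key technique, here is a minimal, generic DTW sketch for aligning a learner utterance with a reference template; the local distance, the length normalization, and the feature dimensionality are assumptions for illustration, not the system's exact settings.

```python
import numpy as np

def dtw_distance(ref, test):
    """Plain dynamic-time-warping distance between two feature sequences
    (frames x dims), as used for template-based pronunciation scoring."""
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],            # insertion
                                 cost[i, j - 1],            # deletion
                                 cost[i - 1, j - 1])        # match
    return cost[n, m] / (n + m)                             # length-normalised score

# Toy usage with random 13-dimensional MFCC-like frames.
rng = np.random.default_rng(2)
print(dtw_distance(rng.normal(size=(40, 13)), rng.normal(size=(45, 13))))
```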

6.
An HMM-based threshold model approach for gesture recognition (cited 13 times: 0 self-citations, 13 by others)
A new method is developed using a hidden Markov model (HMM) based technique. To handle non-gesture patterns, we introduce the concept of a threshold model that calculates the likelihood threshold of an input pattern and provides a confirmation mechanism for the provisionally matched gesture patterns. The threshold model is a weak model for all trained gestures in the sense that its likelihood is smaller than that of the dedicated gesture model for a given gesture. Consequently, its likelihood can be used as an adaptive threshold for selecting the proper gesture model. However, because the threshold model is constructed by collecting the states of all gesture models in the system, it has a large number of states and needs to be reduced. To overcome this problem, states with similar probability distributions are merged using the relative entropy measure. Experimental results show that the proposed method can successfully extract trained gestures from continuous hand motion with 93.14% reliability.
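The confirmation step described above can be summarized in a small sketch: the best-scoring dedicated gesture model is accepted only if it beats the threshold model, otherwise the motion is rejected as a non-gesture. The labels and scores below are made up for illustration.

```python
def confirm_gesture(loglikes, threshold_loglike):
    """Threshold-model confirmation: accept the best-scoring gesture only if
    its log-likelihood beats that of the threshold (garbage) model.
    'loglikes' maps gesture labels to log-likelihoods for the observed segment."""
    best_label = max(loglikes, key=loglikes.get)
    if loglikes[best_label] > threshold_loglike:
        return best_label          # provisionally matched gesture confirmed
    return None                    # non-gesture / unknown movement rejected

# Toy usage.
print(confirm_gesture({"circle": -120.3, "wave": -118.7}, threshold_loglike=-119.5))
```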

7.
Over the period 1987-1991, a series of theoretical and experimental results suggested that multilayer perceptrons (MLP) are an effective family of algorithms for the smooth estimation of high-dimensional probability density functions that are useful in continuous speech recognition. The early form of this work focused on hidden Markov models (HMM) that are independent of phonetic context. More recently, the theory has been extended to context-dependent models. The authors review the basic principles of their hybrid HMM/MLP approach and describe a series of improvements that are analogous to the system modifications instituted for the leading conventional HMM systems over the last few years. Some of these methods directly trade off computational complexity for reduced requirements of memory and memory bandwidth. Results are presented on the widely used Resource Management speech database distributed by the US National Institute of Standards and Technology.
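In hybrid HMM/MLP systems of this kind, the network's state posteriors are commonly converted into scaled likelihoods by dividing by the state priors before HMM decoding. The short sketch below illustrates that standard conversion with made-up numbers; whether it matches the authors' exact scaling is an assumption.

```python
import numpy as np

def scaled_log_likelihoods(mlp_posteriors, state_priors, eps=1e-10):
    """Turn MLP outputs P(state | frame) into scaled likelihoods
    P(frame | state) proportional to P(state | frame) / P(state), the usual
    quantity plugged into the HMM in hybrid decoding.  Priors would normally
    be estimated from state occupation counts in training."""
    post = np.clip(np.asarray(mlp_posteriors, dtype=float), eps, 1.0)
    prior = np.clip(np.asarray(state_priors, dtype=float), eps, 1.0)
    return np.log(post) - np.log(prior)     # one row per frame, one column per state

# Toy usage: 2 frames, 3 states.
posteriors = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihoods(posteriors, priors))
```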

8.
This paper presents a new test to distinguish between meaningful and non-meaningful HMM-modeled activity patterns in human activity recognition systems. Operating as a hypothesis test, alternative models are generated from the available classes and the decision is based on a likelihood ratio test (LRT). The proposed test differs from traditional LRTs in two aspects. Firstly, the likelihood ratio, called the pairwise likelihood ratio (PLR), is computed for each pair of HMMs, so models for non-meaningful patterns are not required. Secondly, the distribution of the likelihood ratios, rather than a fixed threshold, is used as the measurement. Multiple measurements from multiple PLR tests are combined to improve the rejection accuracy. The advantage of the proposed test is that it relies only on the meaningful samples.
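A toy sketch of the pairwise ratios is shown below: for a candidate class, the log-likelihood ratio against every other trained HMM is computed; in the full method these ratios would be judged against their training-time distributions rather than a single fixed threshold. The labels and scores are illustrative.

```python
def pairwise_log_ratios(loglikes, candidate):
    """Pairwise likelihood-ratio sketch: for a candidate class, compute the
    log-likelihood ratio of its model against every other trained model."""
    return {other: loglikes[candidate] - ll
            for other, ll in loglikes.items() if other != candidate}

# Toy usage: log-likelihoods of one observed activity under three HMMs.
scores = {"walk": -210.4, "sit": -233.9, "fall": -228.1}
print(pairwise_log_ratios(scores, candidate="walk"))
```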

9.
Hypo- and hyperarticulation refer to the production of speech with, respectively, a reduction and an increase of articulatory effort compared to the neutral style. Produced consciously or not, these variations of articulatory effort depend on the surrounding environment, the communication context, and the motivation of the speaker with regard to the listener. The goal of this work is to integrate hypo- and hyperarticulation into speech synthesizers so that, like humans, they become more realistic by automatically adapting their way of speaking to the contextual situation. Based on our preliminary work, this paper provides a thorough and detailed study of the analysis and synthesis of hypo- and hyperarticulated speech. It is divided into three parts. In the first, we focus on both the acoustic and the phonetic modifications due to changes in articulatory effort. The second part aims at developing an HMM-based speech synthesizer allowing continuous control of the degree of articulation. This first requires tackling the issue of speaking-style adaptation to derive hypo- and hyperarticulated speech from the neutral synthesizer. Once this is done, interpolation and extrapolation of the resulting models make it possible to finely tune the voice so that it is generated with the desired articulatory effort. Finally, the third part presents a perceptual study of speech with a variable degree of articulation, analyzing how intelligibility and various other voice dimensions are affected.
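The interpolation/extrapolation step can be pictured with a small sketch that blends Gaussian mean vectors between the neutral and the hypo- or hyperarticulated models; real HMM-based synthesizers interpolate full parameter sets, so the mean-only version and the example values here are simplifying assumptions for illustration.

```python
import numpy as np

def interpolate_model_means(neutral_means, styled_means, alpha):
    """Control the degree of articulation by linear interpolation
    (0 < alpha < 1) or extrapolation (alpha > 1) between the neutral model's
    Gaussian means and those of the hypo- or hyperarticulated model."""
    neutral = np.asarray(neutral_means, dtype=float)
    styled = np.asarray(styled_means, dtype=float)
    return (1.0 - alpha) * neutral + alpha * styled

# Toy usage: alpha = 0.5 gives an intermediate style, alpha = 1.5 exaggerates it.
neutral = np.array([[0.0, 1.0], [2.0, -1.0]])
hyper = np.array([[0.4, 1.2], [2.5, -0.6]])
print(interpolate_model_means(neutral, hyper, alpha=1.5))
```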

10.
This paper describes a hidden Markov model-based approach designed to recognize off-line unconstrained handwritten words for large vocabularies. After preprocessing, a word image is segmented into letters or pseudo-letters and represented by two feature sequences of equal length, each consisting of an alternating sequence of shape symbols and segmentation symbols, both of which are explicitly modeled. The word model is made up of the concatenation of appropriate letter models consisting of elementary HMMs, and an HMM-based interpolation technique is used to optimally combine the two feature sets. Two rejection mechanisms are considered, depending on whether or not the word image is guaranteed to belong to the lexicon. Experiments carried out on real-life data show that the proposed approach can be successfully used for handwritten word recognition.

11.
12.
Emerging pen-based interaction techniques require efficient recognition of user gestures and adaptation to individual drawing styles. This paper builds a hidden Markov model for gesture recognition and, on that basis, proposes a midpoint-compensation method for the resampling stage and an optimized direction-coding method for the encoding stage. Experimental results show that the model can represent gestures with fewer sampling points while still giving good recognition results, reducing the computational cost of model training.
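The direction-coding stage can be illustrated with a generic chain-code sketch that quantizes the direction between consecutive resampled pen points into eight symbols, the kind of discrete observation sequence an HMM gesture recognizer consumes; the eight-direction choice and the binning are assumptions for illustration, not the paper's optimized coding.

```python
import numpy as np

def direction_codes(points, n_codes=8):
    """Quantize the direction between consecutive pen points into n_codes
    chain-code symbols (0 .. n_codes-1)."""
    pts = np.asarray(points, dtype=float)
    deltas = np.diff(pts, axis=0)
    angles = np.arctan2(deltas[:, 1], deltas[:, 0])          # in (-pi, pi]
    bin_width = 2 * np.pi / n_codes
    return ((angles + np.pi) // bin_width).astype(int) % n_codes

# Toy usage: a stroke moving right, then up.
stroke = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
print(direction_codes(stroke))
```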

13.
14.
Nowadays, several computational techniques for speech recognition have been proposed. These techniques represent an important improvement for real-time applications in which a speaker interacts with a speech recognition system. Although researchers have proposed many methods, none of them solves the high false-alarm problem that arises when far-field speakers interfere in a human-machine conversation. This paper presents a two-class (speech versus non-speech) decision-tree-based approach that combines new speech-pulse features in a VAD (voice activity detector) for rejecting far-field speech in speech recognition systems. This decision tree is applied to the speech pulses obtained by a baseline VAD composed of a frame feature extractor, an HMM-based (hidden Markov model) segmentation module, and a pulse detector. The paper also presents a detailed analysis of a large number of features for discriminating between close and far-field speech. The detection error obtained with the proposed VAD is lower than that of other well-known VADs.

15.
In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation, in which we employ an HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker-adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios, and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions, including the need for better evaluation metrics.

16.
Audio-visual speech modeling for continuous speech recognition (cited 3 times: 0 self-citations, 3 by others)
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate them in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
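At the state level, multistream HMMs typically combine the per-stream log-likelihoods with stream weights (exponents), as in the minimal sketch below; the specific weight value and the assumption of state-synchronous combination are illustrative, since the paper also models looser asynchrony between the two streams.

```python
import numpy as np

def multistream_log_likelihood(audio_loglike, visual_loglike, audio_weight=0.7):
    """State-level multistream combination: per-stream log-likelihoods are
    combined with stream weights, i.e. a weighted sum in the log domain."""
    visual_weight = 1.0 - audio_weight
    return audio_weight * audio_loglike + visual_weight * visual_loglike

# Toy usage: in heavy acoustic noise, a smaller audio weight would be used.
print(multistream_log_likelihood(np.array([-45.2, -51.8]), np.array([-30.1, -29.4]),
                                 audio_weight=0.4))
```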

17.
In this article we present an efficient approach to modeling the acoustic features for the task of recognizing various paralinguistic phenomena. Instead of the standard scheme of adapting the Universal Background Model (UBM), represented by a Gaussian Mixture Model (GMM) and normally used to model the frame-level acoustic features, we propose to represent the UBM by building a monophone-based Hidden Markov Model (HMM). We present two approaches: transforming the monophone-based segmented HMM–UBM into a GMM–UBM and proceeding with the standard adaptation scheme, or performing the adaptation directly on the HMM–UBM. Both approaches give better results than the standard adaptation scheme (GMM–UBM) in both the emotion recognition task and the alcohol detection task. Furthermore, with the proposed method we were able to achieve better results than the current state-of-the-art systems in both tasks.
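For context, the standard GMM–UBM scheme that the paper builds on adapts the UBM component means to the target recording by relevance MAP, roughly as sketched below; the relevance factor, mean-only adaptation, and the toy data are assumptions for illustration.

```python
import numpy as np

def map_adapt_means(ubm_means, frames, responsibilities, r=16.0):
    """Relevance-MAP adaptation of UBM component means from one recording's
    frames.  'responsibilities' are per-frame component posteriors
    (frames x components); r is the relevance factor."""
    resp = np.asarray(responsibilities, dtype=float)
    x = np.asarray(frames, dtype=float)
    n_k = resp.sum(axis=0)                                   # soft counts per component
    ex_k = resp.T @ x / np.maximum(n_k[:, None], 1e-10)      # per-component data means
    alpha = (n_k / (n_k + r))[:, None]                       # adaptation coefficients
    return alpha * ex_k + (1.0 - alpha) * np.asarray(ubm_means, dtype=float)

# Toy usage: 3-component, 2-dimensional UBM adapted on 5 frames.
rng = np.random.default_rng(3)
resp = rng.dirichlet(np.ones(3), size=5)
print(map_adapt_means(np.zeros((3, 2)), rng.normal(size=(5, 2)), resp))
```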

18.
Transformer-based end-to-end speech recognition systems have become widely popular, but the multi-head self-attention mechanism in the Transformer is insensitive to the positional information of the input sequence, and its flexible alignment generalizes poorly on noisy speech. To address these problems, a temporal convolutional network (TCN) is first introduced to strengthen the model's capture of positional information; on this basis, connectionist temporal classification (CTC) is then fused in, yielding the TCN-Transformer-CTC model. Without using any language model, experiments on the open-source Mandarin speech corpus AISHELL-1 show that TCN-Transformer-CTC reduces the character error rate by a relative 10.91% compared with the Transformer, reaching a final character error rate of 5.31%, which demonstrates the competitiveness of the proposed model.
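To illustrate the CTC output convention that the model fuses in, here is a generic greedy CTC decoding sketch (take the best symbol per frame, collapse repeats, drop blanks); the vocabulary size and probabilities are made up, and the actual system combines CTC with attention-based decoding rather than using greedy decoding alone.

```python
import numpy as np

def ctc_greedy_decode(log_probs, blank=0):
    """Greedy CTC decoding: best symbol per frame, collapse consecutive
    repeats, then remove blanks."""
    best = np.argmax(log_probs, axis=1)
    decoded, prev = [], blank
    for sym in best:
        if sym != prev and sym != blank:
            decoded.append(int(sym))
        prev = sym
    return decoded

# Toy usage: 6 frames, 4 output symbols (0 is the blank).
frames = np.log(np.array([[0.7, 0.1, 0.1, 0.1],
                          [0.1, 0.8, 0.05, 0.05],
                          [0.1, 0.8, 0.05, 0.05],
                          [0.6, 0.2, 0.1, 0.1],
                          [0.1, 0.1, 0.1, 0.7],
                          [0.7, 0.1, 0.1, 0.1]]))
print(ctc_greedy_decode(frames))   # -> [1, 3]
```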

19.
Neural networks with a fixed input length cannot train and test on variable-length data within a single network. This issue is crucial when neural networks need to deal with signals of variable length, such as speech. Though various segmentation and feature-extraction methods have been proposed to deal with variable-length data, the size of the input to the neural network still has to be fixed. A novel Self-Adjustable Neural Network (SANN) is presented in this paper, enabling the network to adjust itself to different input sizes. The proposed method is applied to the recognition of Malay vowels and TIMIT isolated words. SANN is benchmarked against the standard, state-of-the-art recogniser, the Hidden Markov Model (HMM). The results show that SANN was better than the HMM at recognising the Malay vowels, whereas the HMM outperformed SANN at recognising the TIMIT isolated words.

20.
One particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of all possible word sequences. For Finnish, Estonian, and the other Finno-Ugric languages, a particular problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications such as machine translation and information retrieval, and to some extent in other morphologically rich languages. In this paper we present methods and evaluations for four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
