Similar literature
20 similar documents found.
1.
In automatic speech recognition (ASR) systems, the speech signal is captured and parameterized at the front end and evaluated at the back end using the statistical framework of the hidden Markov model (HMM). The performance of these systems depends critically on both the type of models used and the methods adopted for signal analysis. Researchers have proposed a variety of modifications and extensions to HMM-based acoustic models to overcome their limitations. In this review, we summarize most of the research work related to HMM-based ASR carried out during the last three decades. We present these approaches under three categories: conventional methods, refinements, and advancements of HMMs. The review is presented in two parts (papers): (i) an overview of conventional methods for acoustic-phonetic modeling, and (ii) refinements and advancements of acoustic models. Part I explores the architecture and operation of the standard HMM along with its limitations. It also covers different modeling units, language models and decoders. Part II reviews the advances and refinements of conventional HMM techniques, along with the current challenges and performance issues related to ASR.

2.
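In recent years, speech recognition systems based on end-to-end models have been widely adopted because they are structurally simpler and easier to train than traditional hybrid models, and they have achieved notable results on high-resource languages such as Chinese and English. This paper combines a self-attention mechanism with the connectionist temporal classification (CTC) loss function and applies the resulting end-to-end model to Uyghur speech recognition. Because Uyghur is a typical agglutinative language whose rich word formation makes its vocabulary extremely large, the paper introduces the byte pair encoding (BPE) algorithm to generate modeling units, thereby obtaining suitable output units for end-to-end modeling. On the King-ASR450 Uyghur dataset, the proposed algorithm clearly outperforms both the classic hybrid system based on hidden Markov models and an end-to-end model based on bidirectional long short-term memory networks, reaching a final word recognition accuracy of 91.35%.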
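As a rough illustration of the byte pair encoding step this abstract mentions, the sketch below learns merge rules from a toy word list. It is a minimal sketch of the generic BPE procedure, not the authors' implementation: the function name `learn_bpe`, the tie-breaking rule, and the Latin-script toy words are all assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy sketch)."""
    # Each word starts as a tuple of single characters, counted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # (an arbitrary but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab


merges, vocab = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # learned merge rules, in order
```

The number of merges controls the size of the output unit inventory, which is presumably why sub-word units of this kind suit a language with a very large word vocabulary.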

3.
The performance of automatic speech recognition (ASR) systems degrades under mismatched training and testing conditions. One reason for this degradation is the presence of emotion in speech. The main objective of this work is to improve ASR performance under emotional conditions using prosody modification. The work exploits the influence of different emotions on prosody parameters. Emotion conversion methods are employed to generate word-level, non-uniform prosody-modified speech, using modification factors for prosodic components such as pitch, duration and energy. The prosody modification is done in two ways. First, emotion conversion is applied at the testing stage to generate neutral speech from emotional speech. Second, the ASR system is trained with emotional speech generated from neutral speech. In this work, the presence of emotions in speech is studied for Telugu ASR systems. A new database, the IIIT-H Telugu speech corpus, is collected to build a large-vocabulary neutral Telugu ASR system. Emotional speech samples from the IITKGP-SESC Telugu corpus are used to test it. The emotions of anger, happiness and compassion are considered during the evaluation. An improvement in ASR performance is observed with the prosody-modified speech.

4.
Parallel integration of automatic speech recognition (ASR) models and statistical machine translation (MT) models is a relatively unexplored research area compared with the large amount of work done on integrating them in series, i.e., speech-to-speech translation. Parallel integration of these models is possible when we have access to the speech of a target-language text and to its corresponding source-language text, as in a computer-assisted translation system. To our knowledge, only a few methods for integrating ASR models with MT models in parallel have been studied. In this paper, we systematically study a number of different translation models in the context of $N$-best list rescoring. As an alternative to $N$-best list rescoring, we use ASR word graphs in order to arrive at a tighter integration of ASR and MT models. The experiments are carried out on two tasks: English-to-German with an ASR vocabulary of 17 K words, and Spanish-to-English with an ASR vocabulary of 58 K words. For the best method, the MT models reduce the ASR word error rate by a relative 18% and 29% on the 17 K and the 58 K tasks, respectively.
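The idea of $N$-best list rescoring can be sketched in a few lines: each ASR hypothesis keeps its ASR log-score, a second model adds its own log-score, and the list is re-sorted by the combined value. The sketch below assumes a simple log-linear combination with a single weight. The names `rescore_nbest`, `mt_score` and `weight`, and the toy scores, are hypothetical and are not taken from the paper.

```python
def rescore_nbest(nbest, mt_score, weight=0.5):
    """Re-rank ASR hypotheses using an additional translation-model score.

    nbest    : list of (text, asr_log_prob) pairs from the recognizer
    mt_score : callable returning a log-score for a hypothesis text
               (a stand-in for a real MT model; hypothetical here)
    weight   : interpolation weight for the MT score (assumed value)
    """
    scored = [(asr_logp + weight * mt_score(text), text) for text, asr_logp in nbest]
    # Highest combined log-score first.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored]


# Toy example: the ASR model slightly prefers "a cap", but the stand-in
# MT score strongly prefers "a cat", so rescoring flips the order.
toy_nbest = [("a cap", -0.9), ("a cat", -1.0)]
toy_mt = lambda text: 0.0 if text == "a cat" else -2.0
print(rescore_nbest(toy_nbest, toy_mt))
```

Rescoring a fixed list can only recover hypotheses the recognizer already produced, which is one motivation for working on word graphs instead.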

5.
Recent developments in research on humanoid robots and interactive agents have highlighted the importance of, and expectations for, automatic speech recognition (ASR) as a means of endowing such agents with the ability to communicate via speech. This article describes some of the approaches pursued at NTT Communication Science Laboratories (NTT-CSL) for dealing with such challenges in ASR. In particular, we focus on methods for fast search through finite-state machines, Bayesian solutions for modeling and classification of speech, and a discriminative training approach for minimizing errors in large vocabulary continuous speech recognition.

6.
This paper presents our work on automatic speech recognition (ASR) for under-resourced languages, with application to Vietnamese. Different techniques for bootstrapping acoustic models are presented. First, we present the use of acoustic–phonetic unit distances and the potential of crosslingual acoustic modeling for under-resourced languages. Experimental results on Vietnamese showed that with only a few hours of target-language speech data, crosslingual context-independent modeling worked better than crosslingual context-dependent modeling. However, it was outperformed by the latter when more speech data were available. We concluded, therefore, that in both cases crosslingual systems are better than monolingual baseline systems. Grapheme-based acoustic modeling, which avoids building a phonetic dictionary, is also investigated in our work. Finally, since the use of sub-word units (morphemes, syllables, characters, etc.) can reduce the high out-of-vocabulary rate and mitigate the lack of text resources in statistical language modeling for under-resourced languages, we propose several methods to decompose, normalize and combine word and sub-word lattices generated from different ASR systems. The proposed lattice combination scheme results in a relative syllable error rate reduction of 6.6% over the sentence MAP baseline method for a Vietnamese ASR task.

7.
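Research on continuous speech recognition methods for languages with scarce corpus resources   Total citations: 2, self-citations: 0, citations by others: 2
Because minority languages have their own characteristics, existing continuous speech recognition methods cannot simply be applied to them unchanged. Taking Mongolian as an example, this paper discusses the construction of acoustic and language models and implements a Mongolian speech recognition system on the continuous speech recognizer of Japan's Advanced Telecommunications Research Institute International (ATR). The paper focuses on building the language model: based on the agglutinative nature of Mongolian, it proposes building a multi-class N-gram model by clustering similar words. Experimental results show that, with the proposed language model, recognition accuracy is 5.5% higher than with the conventional word N-gram approach.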
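To make the idea of a class-based N-gram concrete, the sketch below estimates a generic class bigram, P(w | w_prev) = P(c(w) | c(w_prev)) · P(w | c(w)), from raw counts. This is the textbook class-based formulation and may differ from the paper's multi-class model. The word-to-class map is written by hand here instead of being produced by clustering, and all names and toy data are assumptions.

```python
from collections import Counter


def train_class_bigram(sentences, word2class):
    """Return a function prob(w_prev, w) for a maximum-likelihood class bigram."""
    class_bigrams = Counter()   # counts of (previous class, class)
    class_context = Counter()   # counts of a class appearing as the left context
    word_counts = Counter()     # counts of each word
    class_counts = Counter()    # counts of each class
    for sent in sentences:
        classes = [word2class[w] for w in sent]
        for w, c in zip(sent, classes):
            word_counts[w] += 1
            class_counts[c] += 1
        for c_prev, c in zip(classes, classes[1:]):
            class_bigrams[(c_prev, c)] += 1
            class_context[c_prev] += 1

    def prob(w_prev, w):
        c_prev, c = word2class[w_prev], word2class[w]
        if class_context[c_prev] == 0:
            return 0.0  # no smoothing in this sketch
        p_class = class_bigrams[(c_prev, c)] / class_context[c_prev]
        p_word = word_counts[w] / class_counts[c]
        return p_class * p_word

    return prob


toy_sentences = [["red", "car"], ["blue", "car"], ["red", "bike"]]
toy_classes = {"red": "ADJ", "blue": "ADJ", "car": "NOUN", "bike": "NOUN"}
prob = train_class_bigram(toy_sentences, toy_classes)
# "blue bike" never occurs in the toy data, yet it gets a non-zero
# probability because its class sequence ADJ -> NOUN was observed.
print(prob("blue", "bike"))
```

Sharing statistics across words in the same class is what makes this family of models attractive when text data are scarce.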

8.
As mobile computing devices grow smaller and as in-car computing platforms become more common, we must augment traditional methods of human-computer interaction. Although speech interfaces have existed for years, the constrained system resources of pervasive devices, such as limited memory and processing capabilities, present new challenges. We provide an overview of embedded automatic speech recognition (ASR) on pervasive devices and discuss its ability to help us develop pervasive applications that meet today's marketplace needs. ASR recognizes spoken words and phrases. State-of-the-art ASR uses a phoneme-based approach for speech modeling: it gives each phoneme (or elementary speech sound) in the language under consideration a statistical representation expressing its acoustic properties.

9.
Visual speech information plays an important role in automatic speech recognition (ASR), especially when audio is corrupted or even inaccessible. Despite the success of audio-based ASR, the problem of visual speech decoding remains wide open. This paper provides a detailed review of recent advances in this research area. In comparison with the previous survey [97], which covers the whole ASR system that uses visual speech information, we focus on the important questions asked by researchers and summarize the recent studies that attempt to answer them. In particular, there are three questions related to the extraction of visual features, concerning speaker dependency, pose variation and temporal information, respectively. Another question is about audio-visual speech fusion, considering the dynamic changes of modality reliabilities encountered in practice. In addition, the state of the art in facial landmark localization is briefly introduced in this paper. These advanced techniques can be used to improve region-of-interest detection, but have been largely ignored when building visual-based ASR systems. We also provide details of audio-visual speech databases. Finally, we discuss the remaining challenges and offer our insights into future research on visual speech decoding.

10.
Histogram equalization (HEQ) is one of the most efficient and effective techniques used to reduce the mismatch between training and test acoustic conditions. However, most current HEQ methods are performed merely in a dimension-wise manner, without accounting for the contextual relationships between consecutive speech frames. In this paper, we present several novel HEQ approaches that exploit spatial-temporal feature distribution characteristics for speech feature normalization. The automatic speech recognition (ASR) experiments were carried out on the Aurora-2 standard noise-robust ASR task. The performance of the presented approaches was thoroughly tested and verified by comparison with other popular HEQ methods. The experimental results show that, for clean-condition training, our approaches yield a significant word error rate reduction over the baseline system and give competitive performance relative to the other HEQ methods compared in this paper.
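For orientation, the conventional dimension-wise HEQ that this abstract takes as its starting point can be sketched as follows: each feature dimension is ranked within an utterance, the rank is turned into an empirical CDF value, and that value is mapped through the inverse CDF of a reference distribution. The sketch assumes a standard normal reference and a rank-based CDF estimate. It is not the paper's spatial-temporal method, and the function names are hypothetical.

```python
from statistics import NormalDist


def heq_dimension(values):
    """Equalize one feature dimension to a standard normal reference."""
    n = len(values)
    # Indices of the frames sorted by feature value (ascending).
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()  # assumed reference: mean 0, std 1
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Empirical CDF estimate kept strictly inside (0, 1).
        out[i] = reference.inv_cdf((rank + 0.5) / n)
    return out


def heq_frames(frames):
    """Apply HEQ independently to every dimension of a list of feature frames."""
    columns = list(zip(*frames))
    equalized = [heq_dimension(list(col)) for col in columns]
    return [list(row) for row in zip(*equalized)]


toy_frames = [[5.0, 0.1], [1.0, 0.3], [3.0, 0.2]]
print(heq_frames(toy_frames))
```

Because every dimension is treated independently and frame order is ignored, this baseline cannot use the contextual relationships between consecutive frames that the abstract highlights.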

11.
The mismatch between system training and operating conditions can seriously degrade the performance of automatic speech recognition (ASR) systems. Various techniques have been proposed to solve this problem in a specified speech environment. Employing these techniques often involves modifying the ASR system structure. In this paper, we propose an environment-independent (EI) ASR model parameter adaptation approach based on Bayesian parametric representation (BPR), which is able to adapt ASR models to new environments without changing the structure of an ASR system. The parameter set of BPR is optimized by a maximum joint likelihood criterion, consistent with that of the hidden Markov model (HMM)-based ASR model, through an independent expectation-maximization (EM) procedure. Variations of the proposed approach are investigated in experiments designed for two different speech environments: the noisy environment provided by the AURORA 2 database and the network environment provided by the NTIMIT database. The performance of the proposed EI ASR model compensation approach is compared with that of the cepstral mean normalization (CMN) approach, which is one of the standard techniques for additive noise compensation. The experimental results show that the performance of ASR models in different speech environments is significantly improved after adaptation by the proposed BPR model compensation approach.

12.
Automatic speech recognition (ASR) systems rely almost exclusively on short-term segment-level features (MFCCs), while ignoring higher-level suprasegmental cues that are characteristic of human speech. Recent experiments have shown, however, that categorical representations of prosody, such as those based on the Tones and Break Indices (ToBI) annotation standard, can be used to enhance speech recognizers. Categorical prosody models are nevertheless severely limited in scope and coverage due to the lack of large corpora annotated with the relevant prosodic symbols (such as pitch accent, word prominence, and boundary tone labels). In this paper, we first present an architecture for augmenting a standard ASR system with symbolic prosody. We then discuss two novel, unsupervised adaptation techniques for improving, respectively, the quality of the linguistic and acoustic components of our categorical prosody models. Finally, we implement the augmented ASR system by enriching ASR lattices with the adapted categorical prosody models. Our experiments show that the proposed unsupervised adaptation techniques significantly improve the quality of the prosody models; the adapted prosodic language and acoustic models reduce binary pitch accent (presence versus absence) classification error rate by 13.8% and 4.3%, respectively (relative to the seed models) on the Boston University Radio News Corpus, while the prosody-enriched ASR system exhibits a 3.1% relative reduction in word error rate (WER) over the baseline system.
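Several abstracts in this list report word error rate (WER) and relative reductions of it. As a reminder of what those figures measure, the sketch below computes WER as the word-level edit distance divided by the reference length, plus the relative reduction between two error rates. The function names and the toy sentences and rates are illustrative assumptions, not data from any of the papers.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - improved) / baseline


# Toy example: one substitution and one deletion against a 5-word reference.
print(wer("the cat sat on mat", "the cat sit on"))
# Toy example: dropping from 20% to 19% WER is a 5% relative reduction.
print(relative_reduction(0.20, 0.19))
```

A relative reduction is always smaller in absolute terms than it sounds: a 3.1% relative reduction on a baseline WER of, say, 20% would be well under one absolute point.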

13.
14.
The design of Spoken Dialog Systems cannot be considered a simple combination of speech processing technologies. Indeed, speech-based interface design has long been an expert job. It requires good skills in speech technologies and low-level programming. Moreover, rapid development and reuse of previously designed systems remain difficult. This makes optimality and objective evaluation of a design very difficult. The design process is therefore a cyclic process composed of prototype releases, user satisfaction surveys, bug reports and refinements. It is well known that human intervention for testing is time-consuming and, above all, very expensive. This is one of the reasons for the recent interest in dialog simulation for evaluation as well as for design automation and optimization. In this paper we present a probabilistic framework for the realistic simulation of spoken dialogs, in which the major components of a dialog system are modeled and parameterized using independent data or expert knowledge. In particular, an Automatic Speech Recognition (ASR) system model and a User Model (UM) have been developed. The ASR model, based on articulatory similarities in language models, provides task-adaptive performance prediction and Confidence Level (CL) distribution estimation. The user model relies on the Bayesian Networks (BN) paradigm and is used both for user behavior modeling and Natural Language Understanding (NLU) modeling. The complete simulation framework has been used to train a reinforcement-learning agent on two different tasks. These experiments helped to point out several potentially problematic dialog scenarios.

15.
Despite the significant progress of automatic speech recognition (ASR) in the past three decades, it has not reached the level of human performance, particularly in adverse conditions. To improve the performance of ASR, various approaches have been studied, which differ in feature extraction method, classification method, and training algorithm. Different approaches often utilize complementary information; therefore, combining them can be a better option. In this paper, we propose a novel approach that uses the best characteristics of conventional, hybrid and segmental HMMs by integrating them with the ROVER system combination technique. In the proposed framework, three different recognizers are created and combined, each having its own feature set and classification technique. For the design and development of the complete system, three separate acoustic models are used with three different feature sets and two language models. Experimental results show that the word error rate (WER) can be reduced by about 4% using the proposed technique compared with conventional methods. Various modules are implemented and tested for Hindi ASR, in typical field conditions as well as in noisy environments.

16.
In this paper, a spoken query system is demonstrated that can be used to access the latest agricultural commodity prices and weather information in the Kannada language using a mobile phone. The spoken query system consists of Automatic Speech Recognition (ASR) models, an Interactive Voice Response System (IVRS) call flow, and the Agricultural Marketing Network (AGMARKNET) and India Meteorological Department (IMD) databases. The ASR models are developed using the Kaldi speech recognition toolkit. Task-specific speech data is collected from the different dialect regions of Karnataka (a state in India where Kannada is spoken) to develop the ASR models. A web crawler is used to get the commodity price and weather information from the AGMARKNET and IMD websites. The PostgreSQL database management system is used to manage the crawled data. Of the validated speech data, 80% is used for system training and 20% for testing. The accuracy and Word Error Rate (WER) of the ASR models are reported, and an end-to-end spoken query system is developed for the Kannada language.

17.
Dysarthria is a neurological impairment in the control of the motor speech articulators that compromises the speech signal. Automatic Speech Recognition (ASR) can be very helpful for speakers with dysarthria because they are often physically incapacitated. Mel-Frequency Cepstral Coefficients (MFCCs) have been proven to be an appropriate representation of dysarthric speech, but the question of which MFCC-based feature set represents dysarthric acoustic features most effectively has not been answered. Moreover, most current dysarthric speech recognisers are either speaker-dependent (SD) or speaker-adaptive (SA), and they generalise poorly when used as speaker-independent (SI) models. First, by comparing the results of 28 dysarthric SD speech recognisers, this study identifies the best-performing set of MFCC parameters for representing dysarthric acoustic features in Artificial Neural Network (ANN)-based ASR. Next, this paper studies the application of ANNs as a fixed-length isolated-word SI ASR system for individuals who suffer from dysarthria. The results show that the speech recognisers trained on conventional 12-coefficient MFCC features, without delta and acceleration features, provided the best accuracy, and the proposed SI ASR system recognised the speech of unseen dysarthric evaluation subjects with a word recognition rate of 68.38%.

18.
Conventional Hidden Markov Model (HMM)-based Automatic Speech Recognition (ASR) systems generally use cepstral features as acoustic observations and phonemes as basic linguistic units. Some of the most powerful features currently used in ASR systems are Mel-Frequency Cepstral Coefficients (MFCCs). Speech recognition is inherently complicated by the variability of the speech signal, which includes within- and across-speaker variability. This leads to several kinds of mismatch between acoustic features and acoustic models and hence degrades system performance. The sensitivity of MFCCs to speech signal variability has motivated many researchers to investigate new sets of speech feature parameters in order to make the acoustic models more robust to this variability and thus improve system performance. The combination of diverse acoustic feature sets has great potential to enhance the performance of ASR systems. This paper is part of ongoing research efforts to build an accurate Arabic ASR system for teaching and learning purposes. It addresses the integration of complementary features into standard HMMs in order to make them more robust and thus improve their recognition accuracy. The complementary features investigated in this work are voiced formants and pitch, in combination with conventional MFCC features. A series of experiments under various combination strategies was performed to determine which of these integrated features can significantly improve system performance. The Cambridge HTK tools were used as the development environment, and experimental results showed that the error rate was successfully decreased. The achieved results seem very promising, even without using language models.

19.
The success of Hidden Markov Models (HMMs) in speech recognition has motivated the adoption of these models for handwriting recognition, especially online handwriting, which closely resembles the speech signal as a sequential process. Some languages, such as Arabic, Farsi and Urdu, include a large number of delayed strokes, which are written above or below most letters and are usually written delayed in time. These delayed strokes represent a modeling challenge for the conventional left-right HMM that is commonly used in Automatic Speech Recognition (ASR) systems. In this paper, we introduce a new approach for handling delayed strokes in Arabic online handwriting recognition using HMMs. We also show that several modeling approaches currently used in most state-of-the-art ASR systems, such as context-based tri-grapheme models, speaker adaptive training and discriminative training, can provide similar performance improvements for Handwriting Recognition (HWR) systems. Finally, we show that using a multi-pass decoder that uses the computationally less expensive models in the early passes can provide an Arabic large-vocabulary HWR system with practical decoding time. We evaluated the performance of our proposed Arabic HWR system using two databases with small and large lexicons. For the small-lexicon data set, our system achieved results competitive with the best reported state-of-the-art Arabic HWR systems. For the large lexicon, our system achieved promising results (accuracy and time) for a vocabulary size of 64k words, with the possibility of adapting the models to specific writers to get even better results.

20.