首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
将标准普通话语音数据训练得到的声学模型应用于新疆维吾尔族说话人非母语汉语语音识别时,由于说话人的普通话发音存在较大偏误,将导致识别率急剧下降。针对这一问题,将多发音字典技术应用于新疆维吾尔族说话人汉语语音识别中,通过统计分析识别器的识别错误,建立音素混淆矩阵,获取音素的发音候选项。利用剪枝策略对发音候选项进行剪枝整合,扩展出符合维吾尔族说话人汉语发音规律的替代字典。对三种剪枝方法产生的发音字典的识别结果进行了对比。实验结果表明,使用相对最大剪枝策略产生的发音字典可以显著提高系统识别率。  相似文献   

2.
This article presents an approach for the automatic recognition of non-native speech. Some non-native speakers tend to pronounce phonemes as they would in their native language. Model adaptation can improve the recognition rate for non-native speakers, but has difficulties dealing with pronunciation errors like phoneme insertions or substitutions. For these pronunciation mismatches, pronunciation modeling can make the recognition system more robust. Our approach is based on acoustic model transformation and pronunciation modeling for multiple non-native accents. For acoustic model transformation, two approaches are evaluated: MAP and model re-estimation. For pronunciation modeling, confusion rules (alternate pronunciations) are automatically extracted from a small non-native speech corpus. This paper presents a novel approach to introduce confusion rules in the recognition system which are automatically learned through pronunciation modelling. The modified HMM of a foreign spoken language phoneme includes its canonical pronunciation along with all the alternate non-native pronunciations, so that spoken language phonemes pronounced correctly by a non-native speaker could be recognized. We evaluate our approaches on the European project HIWIRE non-native corpus which contains English sentences pronounced by French, Italian, Greek and Spanish speakers. Two cases are studied: the native language of the test speaker is either known or unknown. Our approach gives better recognition results than the classical acoustic adaptation of HMM when the foreign origin of the speaker is known. We obtain 22% WER reduction compared to the reference system.  相似文献   

3.
This paper presents our work in automatic speech recognition (ASR) in the context of under-resourced languages with application to Vietnamese. Different techniques for bootstrapping acoustic models are presented. First, we present the use of acoustic–phonetic unit distances and the potential of crosslingual acoustic modeling for under-resourced languages. Experimental results on Vietnamese showed that with only a few hours of target language speech data, crosslingual context independent modeling worked better than crosslingual context dependent modeling. However, it was outperformed by the latter one, when more speech data were available. We concluded, therefore, that in both cases, crosslingual systems are better than monolingual baseline systems. The proposal of grapheme-based acoustic modeling, which avoids building a phonetic dictionary, is also investigated in our work. Finally, since the use of sub-word units (morphemes, syllables, characters, etc.) can reduce the high out-of-vocabulary rate and improve the lack of text resources in statistical language modeling for under-resourced languages, we propose several methods to decompose, normalize and combine word and sub-word lattices generated from different ASR systems. The proposed lattice combination scheme results in a relative syllable error rate reduction of 6.6% over the sentence MAP baseline method for a Vietnamese ASR task.   相似文献   

4.
Pronunciation variation is a major obstacle in improving the performance of Arabic automatic continuous speech recognition systems. This phenomenon alters the pronunciation spelling of words beyond their listed forms in the pronunciation dictionary, leading to a number of out of vocabulary word forms. This paper presents a direct data-driven approach to model within-word pronunciation variations, in which the pronunciation variants are distilled from the training speech corpus. The proposed method consists of performing phoneme recognition, followed by a sequence alignment between the observation phonemes generated by the phoneme recognizer and the reference phonemes obtained from the pronunciation dictionary. The unique collected variants are then added to dictionary as well as to the language model. We started with a Baseline Arabic speech recognition system based on Sphinx3 engine. The Baseline system is based on a 5.4 hours speech corpus of modern standard Arabic broadcast news, with a pronunciation dictionary of 14,234 canonical pronunciations. The Baseline system achieves a word error rate of 13.39%. Our results show that while the expanded dictionary alone did not add appreciable improvements, the word error rate is significantly reduced by 2.22% when the variants are represented within the language model.  相似文献   

5.
6.
7.
Dysarthria is a motor speech disorder caused by neurological injury of the motor component of the motor-speech system. Because it affects respiration, phonation, and articulation, it leads to different types of impairments in intelligibility, audibility, and efficiency of vocal communication. Speech Assistive Technology (SAT) has been developed with different approaches for dysarthric speech and in this paper we focus on the approach that is based on modeling of pronunciation patterns. We present an approach that integrates multiple pronunciation patterns for enhancement of dysarthric speech recognition. This integration is performed by weighting the responses of an Automatic Speech Recognition (ASR) system when different language model restrictions are set. The weight for each response is estimated by a Genetic Algorithm (GA) that also optimizes the structure of the implementation technique (Metamodels) which is based on discrete Hidden Markov Models (HMMs). The GA makes use of dynamic uniform mutation/crossover to further diversify the candidate sets of weights and structures to improve the performance of the Metamodels. To test the approach with a larger vocabulary than in previous works, we orthographically and phonetically labeled extended acoustic resources from the Nemours database of dysarthric speech. ASR tests on these resources with the proposed approach showed recognition accuracies over those obtained with standard Metamodels and a well used speaker adaptation technique. These results were statistically significant.  相似文献   

8.
9.
对于具有大量特征数据和复杂发音变化的英语语音,与单词相比,在隐马尔可夫模型(HMM)中存在更多问题,例如维特比算法的复杂度计算和高斯混合模型中的概率分布问题。为了实现基于HMM和聚类的独立于说话人的英语语音识别系统,提出了用于降低语音特征参数维数的分段均值算法、聚类交叉分组算法和HMM分组算法的组合形式。实验结果表明,与单个HMM模型相比,该算法不仅提高了英语语音的识别率近3%,而且提高系统的识别速度20.1%。  相似文献   

10.
针对传统的语音识别系统采用数据驱动并利用语言模型来决策最优的解码路径,导致在部分场景下的解码结果存在明显的音对字错的问题,提出一种基于韵律特征辅助的端到端语音识别方法,利用语音中的韵律信息辅助增强正确汉字组合在语言模型中的概率。在基于注意力机制的编码-解码语音识别框架的基础上,首先利用注意力机制的系数分布提取发音间隔、发音能量等韵律特征;然后将韵律特征与解码端结合,从而显著提升了发音相同或相近、语义歧义情况下的语音识别准确率。实验结果表明,该方法在1 000 h及10 000 h级别的语音识别任务上分别较端到端语音识别基线方法在准确率上相对提升了5.2%和5.0%,进一步改善了语音识别结果的可懂度。  相似文献   

11.
We compared the performance of an automatic speech recognition system using n-gram language models, HMM acoustic models, as well as combinations of the two, with the word recognition performance of human subjects who either had access to only acoustic information, had information only about local linguistic context, or had access to a combination of both. All speech recordings used were taken from Japanese narration and spontaneous speech corpora.Humans have difficulty recognizing isolated words taken out of context, especially when taken from spontaneous speech, partly due to word-boundary coarticulation. Our recognition performance improves dramatically when one or two preceding words are added. Short words in Japanese mainly consist of post-positional particles (i.e. wa, ga, wo, ni, etc.), which are function words located just after content words such as nouns and verbs. So the predictability of short words is very high within the context of the one or two preceding words, and thus recognition of short words is drastically improved. Providing even more context further improves human prediction performance under text-only conditions (without acoustic signals). It also improves speech recognition, but the improvement is relatively small.Recognition experiments using an automatic speech recognizer were conducted under conditions almost identical to the experiments with humans. The performance of the acoustic models without any language model, or with only a unigram language model, were greatly inferior to human recognition performance with no context. In contrast, prediction performance using a trigram language model was superior or comparable to human performance when given a preceding and a succeeding word. These results suggest that we must improve our acoustic models rather than our language models to make automatic speech recognizers comparable to humans in recognition performance under conditions where the recognizer has limited linguistic context.  相似文献   

12.
In this paper, we investigate the contribution of tone in a Hidden Markov Model (HMM)-based speech synthesis of Ibibio (ISO 693-3: nic; Ethnologue: IBB), an under-resourced language. We review the language’s speech characteristics, required for building the front end components of the design and propose a finite state transducer (FST), useful for modelling the language’s tonetactics. The existing speech database of Ibibio is also studied and the quality of synthetic speech examined through a spectral analysis of voices obtained from two synthesis experiments, with and without tone feature labels. A confusion matrix classifying the results of a controlled listening test for both experiments is constructed and statistics comparing their performance quality presented. Results obtained revealed that synthesis systems with tone feature labels outperformed synthesis systems without tone feature labels, as more tone confusions were perceived by listeners in the latter.  相似文献   

13.
一种用于方言口音语音识别的字典自适应技术   总被引:2,自引:1,他引:1  
基于标准普通话的语音识别系统在识别带有方言口音的普通话时,识别率会下降很多。针对这一问题,论文介绍了一种“字典自适应技术”。文中首先提出了一种自动标注算法,然后以此为基础,通过分析语音数据,统计出带有方言口音普通话的发音规律,然后把这个规律编码到标准普通话字典里,构造出体现这种方言发音特征的新字典,最后把新字典整合于搜索框架,用于识别带有该方言口音的普通话,使识别率得到显著提高。  相似文献   

14.
母语与非母语语音识别声学建模   总被引:1,自引:1,他引:0       下载免费PDF全文
曾定  刘加 《计算机工程》2010,36(8):170-172
为了兼容母语与非母语说话人之间的发音变化,提出一种新的声学模型建模方法。分析中国人受母语影响产生的英语发音变化,利用中国人英语发音数据库自适应得到语音模型,采用声学模型融合技术构建融合2种发音规律的识别模型。实验结果证明,中国人英语发音的语音识别率提高了13.4%,但标准英语的语音识别率仅下降1.1%。  相似文献   

15.
Audio-visual speech recognition (AVSR) has shown impressive improvements over audio-only speech recognition in the presence of acoustic noise. However, the problems of region-of-interest detection and feature extraction may influence the recognition performance due to the visual speech information obtained typically from planar video data. In this paper, we deviate from the traditional visual speech information and propose an AVSR system integrating 3D lip information. The Microsoft Kinect multi-sensory device was adopted for data collection. The different feature extraction and selection algorithms were applied to planar images and 3D lip information, so as to fuse the planar images and 3D lip feature into the visual-3D lip joint feature. For automatic speech recognition (ASR), the fusion methods were investigated and the audio-visual speech information was integrated into a state-synchronous two stream Hidden Markov Model. The experimental results demonstrated that our AVSR system integrating 3D lip information improved the recognition performance of traditional ASR and AVSR system in acoustic noise environments.  相似文献   

16.
In speech recognition research,because of the variety of languages,corresponding speech recognition systems need to be constructed for different languages.Especially in a dialect speech recognition system,there are many special words and oral language features.In addition,dialect speech data is very scarce.Therefore,constructing a dialect speech recognition system is difficult.This paper constructs a speech recognition system for Sichuan dialect by combining a hidden Markov model(HMM)and a deep long short-term memory(LSTM)network.Using the HMM-LSTM architecture,we created a Sichuan dialect dataset and implemented a speech recognition system for this dataset.Compared with the deep neural network(DNN),the LSTM network can overcome the problem that the DNN only captures the context of a fixed number of information items.Moreover,to identify polyphone and special pronunciation vocabularies in Sichuan dialect accurately,we collect all the characters in the dataset and their common phoneme sequences to form a lexicon.Finally,this system yields a 11.34%character error rate on the Sichuan dialect evaluation dataset.As far as we know,it is the best performance for this corpus at present.  相似文献   

17.
18.
In recent years, the use of morphological decomposition strategies for Arabic Automatic Speech Recognition (ASR) has become increasingly popular. Systems trained on morphologically decomposed data are often used in combination with standard word-based approaches, and they have been found to yield consistent performance improvements. The present article contributes to this ongoing research endeavour by exploring the use of the ‘Morphological Analysis and Disambiguation for Arabic’ (MADA) tools for this purpose. System integration issues concerning language modelling and dictionary construction, as well as the estimation of pronunciation probabilities, are discussed. In particular, a novel solution for morpheme-to-word conversion is presented which makes use of an N-gram Statistical Machine Translation (SMT) approach. System performance is investigated within a multi-pass adaptation/combination framework. All the systems described in this paper are evaluated on an Arabic large vocabulary speech recognition task which includes both Broadcast News and Broadcast Conversation test data. It is shown that the use of MADA-based systems, in combination with word-based systems, can reduce the Word Error Rates by up to 8.1% relative.  相似文献   

19.
A common requirement in speech technology is to align two different symbolic representations of the same linguistic ‘message’. For instance, we often need to align letters of words listed in a dictionary with the corresponding phonemes specifying their pronunciation. As dictionaries become ever bigger, manual alignment becomes less and less tenable yet automatic alignment is a hard problem for a language like English. In this paper, we describe the use of a form of the expectation-maximization (EM) algorithm to learn alignments of English text and phonemes, starting from a variety of initializations. We use the British English Example Pronunciation (BEEP) dictionary of almost 200,000 words in this work. The quality of alignment is difficult to determine quantitatively since no ‘gold standard’ correct alignment exists. We evaluate the success of our algorithm indirectly from the performance of a pronunciation by analogy system using the aligned dictionary data as a knowledge base for inferring pronunciations. We find excellent performance—the best so far reported in the literature. There is very little dependence on the start point for alignment, indicating that the EM search space is strongly convex. Since the aligned BEEP dictionary is a potentially valuable resource, it is made freely available for research use.  相似文献   

20.
情感识别在人机交互中具有重要意义,为了提高情感识别准确率,将语音与文本特征融合。语音特征采用了声学特征和韵律特征,文本特征采用了基于情感词典的词袋特征(Bag-of-words,BoW)和N-gram模型。将语音与文本特征分别进行特征层融合与决策层融合,比较它们在IEMOCAP四类情感识别的效果。实验表明,语音与文本特征融合比单一特征在情感识别中表现更好;决策层融合比在特征层融合识别效果好。且基于卷积神经网络(Convolutional neural network,CNN)分类器,语音与文本特征在决策层融合中不加权平均召回率(Unweighted average recall,UAR)达到了68.98%,超过了此前在IEMOCAP数据集上的最好结果。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号