Similar Documents
20 similar documents found (search time: 203 ms)
1.
In learning a spoken language, it is essential to find and correct the learner's pronunciation errors. Mispronunciation detection is the technology of automatically diagnosing erroneous pronunciations in continuous speech, and it is one of the core topics in computer-assisted pronunciation training (CAPT) research. This paper surveys the state of research and application of mispronunciation detection, covering approaches based on speech recognition, on mispronunciation networks, and on acoustic phonetics. It then reviews the application of mispronunciation detection in CAPT systems and the development of automatic pronunciation assessment for Mandarin Chinese. The paper closes with the authors' analysis and recommendations.
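The recogniser-based detection family surveyed above commonly builds on a Goodness-of-Pronunciation (GOP) style score; a minimal sketch, with a hypothetical decision threshold:

```python
# Minimal sketch of a GOP-style mispronunciation score; the threshold value
# is hypothetical and would be tuned on annotated data in a real system.
def gop(logp_forced, logp_best_free, n_frames):
    """Forced-alignment log-likelihood of the canonical phone minus the best
    unconstrained phone log-likelihood, normalised by duration; strongly
    negative values suggest a mispronunciation."""
    return (logp_forced - logp_best_free) / n_frames

def is_mispronounced(logp_forced, logp_best_free, n_frames, threshold=-1.0):
    # Flag the phone when its GOP score falls below the (hypothetical) threshold.
    return gop(logp_forced, logp_best_free, n_frames) < threshold
```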

2.
To explore the role of intelligent speech technology in English pronunciation learning, we studied automatic phone-level error detection for English sentences read aloud by Chinese speakers. We first collected 900 read English sentences from 45 speakers, with phone-level pronunciation errors annotated in detail by two experts, and then built an automatic phone error detection system based on speech recognition technology. Targeting the two most common problems in Chinese speakers' English pronunciation, substitutions and deletions, we proposed phone-dependent detection thresholds and a constrained phone-alignment recognition network, which significantly improved system performance. The final system reached 49% recall at 52% precision, approaching the inter-expert performance of 59% precision at 69% recall.
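The phone-dependent threshold idea can be sketched as follows (scores, thresholds, and labels below are invented for illustration, not the paper's data):

```python
# Sketch of per-phone error-detection thresholds and recall/precision scoring.
def detect_errors(scores, thresholds):
    """scores: list of (phone, score); a phone is flagged as mispronounced
    when its score falls below that phone's own threshold."""
    return [(phone, s < thresholds[phone]) for phone, s in scores]

def recall_precision(flags, labels):
    """flags/labels: parallel lists of booleans (True = error)."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum((not f) and l for f, l in zip(flags, labels))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```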

3.
An English spoken-language teaching system based on speech recognition technology   Cited by: 1 (self-citations: 0; other citations: 1)
Many computer-assisted English learning applications lack assessment of and feedback on spoken language. This paper describes an English spoken-language learning system that uses speech recognition technology. Besides conventional pronunciation scoring, it provides error detection based on phone association and phone recognition. Combined with improvement suggestions from a correction knowledge base and prosody-corrected speech, it can give learners timely help. Experimental results show that it can correct most unintentional errors of learners with some proficiency.

4.
In recent years, articulatory attributes have often been used in computer-assisted pronunciation training (CAPT) systems. Addressing several difficulties in using such attributes, this paper proposes a method for modeling fine-grained speech attributes (FSA) and evaluates it on cross-lingual attribute recognition and mispronunciation detection. The best attribute detector set achieved an average recognition accuracy of about 95%; mispronunciation detection on two L2 test sets showed that, compared with the baseline, the FSA-based method obtained improvements of more than 1…

5.
When an acoustic model trained on standard Mandarin speech data is applied to non-native Mandarin speech from Uyghur speakers in Xinjiang, recognition accuracy drops sharply because of large pronunciation deviations. To address this, multi-pronunciation dictionary techniques are applied to Mandarin speech recognition for Uyghur speakers: the recogniser's errors are statistically analyzed to build a phone confusion matrix, from which pronunciation candidates are obtained. Pruning strategies then filter and consolidate the candidates, expanding the dictionary with alternatives that match Uyghur speakers' Mandarin pronunciation patterns. Recognition results with dictionaries produced by three pruning methods are compared. Experiments show that the dictionary produced by the relative-maximum pruning strategy significantly improves recognition accuracy.
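A toy version of the confusion-matrix construction and relative-maximum pruning described above (the counts and pruning ratio are invented for illustration):

```python
from collections import Counter, defaultdict

def build_confusion(pairs):
    """pairs: (canonical_phone, recognized_phone) tuples harvested from
    aligned recognition errors; returns a phone confusion matrix."""
    conf = defaultdict(Counter)
    for ref, hyp in pairs:
        conf[ref][hyp] += 1
    return conf

def relative_max_prune(conf, ratio=0.5):
    """Keep an alternative pronunciation only if its count reaches `ratio`
    of the most frequent alternative for that canonical phone."""
    lexicon = {}
    for ref, counts in conf.items():
        top = max(counts.values())
        lexicon[ref] = sorted(h for h, c in counts.items() if c >= ratio * top)
    return lexicon
```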

6.
Speech-corpus resources for Mongolian speech recognition are relatively scarce, yet large amounts of Mongolian audio with corresponding text exist in TV dramas and broadcasts. This paper proposes a speech-recognition-based method for automatically aligning long Mongolian audio with its text, enabling automatic annotation of Mongolian TV-drama speech and expanding the Mongolian speech corpus. In the front-end stage, voice activity detection based on Gaussian mixture models filters out and removes noise segments; in the recognition stage, a Mongolian acoustic model based on feedforward sequential memory networks is built; finally, using a vector space model, the hypothesis sequence from recognition and the reference phone sequence are matched at sentence level with a dynamic time warping algorithm. Experiments show that, compared with alignment based on the Needleman-Wunsch algorithm, the proposed method improves alignment accuracy by 31.09%.
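The final matching step can be illustrated with a bare-bones dynamic-programming alignment; here it is reduced to edit distance over phone labels, one simple instance of the DTW-style matching the abstract describes:

```python
# Edit-distance style dynamic-programming alignment between a hypothesis
# phone sequence and a reference phone sequence (unit costs are illustrative).
def dtw(hyp, ref, sub_cost=1, ins_cost=1, del_cost=1):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c = 0 if hyp[i - 1] == ref[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j - 1] + c,   # match / substitute
                          d[i - 1][j] + del_cost,  # delete from hypothesis
                          d[i][j - 1] + ins_cost)  # insert from reference
    return d[m][n]
```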

7.
Phone segmentation is frequently needed in Tibetan speech synthesis and phonetics research. Manual segmentation is laborious, but because Tibetan corpora are scarce, trained Tibetan acoustic models are not accurate and robust enough, and automatically segmented phone boundaries are imprecise. Taking the Lhasa dialect of Tibetan as the research object, after defining a Lhasa phone set and building a Lhasa pronunciation dictionary, the distances between phone models are computed to identify phones shared by Lhasa Tibetan and English, and the Lhasa and English GMM-HMM models are fused. Silence and short pauses in the speech are detected automatically, a word network is constructed for the utterance, the pronunciation dictionary is consulted to expand the word network into a model (phone) network, and the Viterbi algorithm maps each frame of feature parameters to a model state, from which the phones are segmented. Experiments show the segmentation outperforms a purely Tibetan-model approach.
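The frame-to-state Viterbi mapping at the heart of this segmentation can be sketched for a single left-to-right model (the log-likelihoods are toy values; a real system would run over the expanded phone network):

```python
# Viterbi alignment of frames to the states of a left-to-right model:
# from state s a frame may stay in s or advance to s+1.
def viterbi_align(frame_logp, n_states):
    """frame_logp[t][s]: log-likelihood of frame t under state s.
    Returns the best state index for each frame."""
    T = len(frame_logp)
    NEG = float("-inf")
    score = [[NEG] * n_states for _ in range(T)]
    back = [[0] * n_states for _ in range(T)]
    score[0][0] = frame_logp[0][0]          # must start in the first state
    for t in range(1, T):
        for s in range(n_states):
            stay = score[t - 1][s]
            adv = score[t - 1][s - 1] if s > 0 else NEG
            if stay >= adv:
                score[t][s], back[t][s] = stay, s
            else:
                score[t][s], back[t][s] = adv, s - 1
            score[t][s] += frame_logp[t][s]
    path = [n_states - 1]                   # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Phone boundaries fall where the state index changes between consecutive frames.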

8.
Zhao Bo, Tan Xiaohong. 《计算机应用》 (Journal of Computer Applications), 2009, 29(3): 761-763
Many computer-assisted English learning applications neglect spoken-language instruction or lack good assessment of and feedback on learning outcomes. Speech recognition technology can address this by scoring an utterance against reference models and reference speech, providing a basis for correction. This paper describes an English spoken-language learning system that uses speech recognition. Beyond conventional pronunciation scoring, the corrective feedback includes error detection based on phone association and phone recognition, as well as prosody correction. Improvement suggestions retrieved from a correction knowledge base according to error type give learners timely help. Experimental results show that the system can correct most unintentional errors of learners with some proficiency.

9.
A semi-supervised automatic speech segmentation algorithm is proposed for TV-drama speech that has long continuous text transcripts but no time stamps. A biased language model is first built from the raw transcript and then used in a semi-supervised way when recognizing the drama speech; finally, the decoding results are used to improve traditional segmentation algorithms based on distance metrics, model classification, and phone recognition. Experiments on the British science-fiction series "Doctor Who" show that the proposed algorithm clearly outperforms traditional segmentation algorithms: it effectively solves the automatic segmentation of long continuous audio in TV-drama speech recognition and also segments the corresponding long transcript, guaranteeing accurate time stamps and text for each resulting speech segment.
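One simple way to realise a biased language model is to interpolate statistics estimated from the episode's own transcript with a background model; a unigram-only sketch (the paper's actual model order and interpolation weight are not specified here):

```python
from collections import Counter

def biased_unigram(transcript_words, background_probs, lam=0.8):
    """P(w) = lam * P_transcript(w) + (1 - lam) * P_background(w).
    Biasing toward the known transcript makes the recogniser favour
    the words it is expected to hear."""
    counts = Counter(transcript_words)
    total = sum(counts.values())
    vocab = set(counts) | set(background_probs)
    return {w: lam * counts[w] / total + (1 - lam) * background_probs.get(w, 0.0)
            for w in vocab}
```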

10.
Automatic part annotation is a complex visual recognition task, and traditional training algorithms are ill-suited to parameter learning under distribution shift. This paper casts part annotation as a structured-output classification problem and proposes an adaptive learning algorithm for structured models. By introducing a similarity-based regularizer, the loss function of the structured SVM is redefined so that the training loss and the source-target parameter divergence are minimized simultaneously. Experiments show that the algorithm improves annotation accuracy by 2% to 4% over traditional supervised learning, and that distribution differences in part-location features affect adaptive learning performance more than appearance features do.
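The redefined objective, training loss plus a similarity-weighted source-target parameter penalty, can be written out in miniature (the hinge margins are assumed precomputed, and λ is hypothetical):

```python
# Toy form of an adapted structured-SVM objective: structured hinge loss
# plus a penalty tying target parameters w to source parameters w_src.
def adaptive_objective(w, w_src, margins, lam=0.1):
    """margins: precomputed structured-hinge margins, one per training example;
    lam: hypothetical weight on the source-target divergence term."""
    hinge = sum(max(0.0, m) for m in margins)
    reg = sum((wi - si) ** 2 for wi, si in zip(w, w_src))
    return hinge + lam * reg
```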

11.
12.
The high error rate in spontaneous speech recognition is due in part to the poor modeling of pronunciation variations. An analysis of acoustic data reveals that pronunciation variations include both complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by another alternative phone, such as ‘b’ being pronounced as ‘p’. Partial changes are variations within the phoneme, such as nasalization, centralization, devoicing, voicing, etc. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the hidden Markov state level. In this paper, we show that partial changes are far less clear-cut than previously assumed and cannot be modeled by mere representation by alternate phones or a concatenation of phone units. We propose modeling partial changes through acoustic model reconstruction. We first propose a partial change phone model (PCPM) to differentiate pronunciation variations. In order to improve the model resolution without increasing the parameter size too much, the PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for PCPM triphones, and one auxiliary decision tree can only be used by one standard decision tree. The acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus (1997 MBN) with different styles of speech. It gives a significant 2.39% absolute reduction in syllable error rate on spontaneous speech.

13.
In automatic speech recognition, the phone has probably been the dominant sub-word unit for more than a decade. Context-dependent phone (triphone) modeling accounts for contextual variations between adjacent phones, and state tying addresses the modeling of triphones not seen during training. Recently, the syllable has been gaining momentum as a new sub-word unit. Being a larger unit than the phone, the syllable absorbs the severe contextual variations between the phones within it. It is therefore more stable than a phone and models pronunciation variability in a systematic way. The Tamil language has challenging features such as agglutination and complex morpho-phonology. In this paper, attempts have been made to address these issues by using the syllable as a sub-word unit in an acoustic model. Initially, small-vocabulary context-independent word models and medium-vocabulary context-dependent phone models are developed. Subsequently, an algorithm based on prosodic syllables is proposed and two experiments are conducted. First, syllable-based context-independent models are trained and tested. Despite the large number of syllables, this system performs reasonably well compared to context-independent word models in terms of word error rate and out-of-vocabulary words. In the second experiment, syllable information is integrated into conventional triphone modeling, wherein cross-syllable triphones are replaced with monophones and the number of context-dependent phone models is reduced by 22.76% in untied units. In spite of the reduction in the number of models, the accuracy of the proposed system is comparable to that of the baseline triphone system.

14.
Pronunciation variations in spontaneous speech can be classified into complete changes and partial changes. A complete change is the replacement of a canonical phoneme by another alternative phone, such as 'b' being pronounced as 'p'. Partial changes are variations within the phoneme, such as nasalization, centralization and voicing. Most current work in pronunciation modeling for spontaneous Mandarin speech remains at the phone level and can model only complete changes, not partial changes. In this paper, we show that partial changes are much less clear-cut than previously assumed and cannot be modelled by mere representation by alternate phone units. We present a solution for modeling both complete changes and partial changes in spontaneous Mandarin speech. To model complete changes, we adapted decision tree-based pronunciation modeling from English to Mandarin to predict alternate pronunciations. To solve the data sparseness problem, we used cross-domain data to estimate pronunciation variability. To discard unreliable alternative pronunciations, we proposed a likelihood ratio test as a confidence measure to evaluate the degree of phonetic confusion. To model partial changes, we proposed partial change phone models (PCPM) with acoustic model reconstruction. PCPMs are regarded as extended units of standard phoneme or initial/final subword units, and can efficiently represent partial changes. To avoid model confusion, we generated auxiliary decision trees for PCPM triphones and used decision tree merging to perform acoustic model reconstruction. The effectiveness of these approaches was evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus with different styles of speech. Our phone-level pronunciation modeling provided an absolute 0.9% reduction in syllable error rate, and the acoustic model reconstruction approach covered pronunciation variations more effectively, yielding a significant 2.39% absolute reduction in syllable error rate for spontaneous speech. In addition, the proposed method deals with partial changes at the acoustic model level and can be applied to any automatic speech recognition system based on subword units.
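The likelihood-ratio confidence test for accepting an alternative pronunciation can be sketched as follows (the threshold is hypothetical):

```python
# Duration-normalised log likelihood ratio between an alternative pronunciation
# and the canonical one, used as a confidence measure for keeping the variant.
def llr_confidence(logp_alt, logp_canonical, n_frames):
    return (logp_alt - logp_canonical) / n_frames

def accept_variant(logp_alt, logp_canonical, n_frames, threshold=0.5):
    # Keep the alternative pronunciation only when the per-frame evidence
    # in its favour clears the (hypothetical) threshold.
    return llr_confidence(logp_alt, logp_canonical, n_frames) > threshold
```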

15.
16.
Chen Bin, Niu Tong, Zhang Lianhai, Li Bicheng, Qu Dan. 《自动化学报》 (Acta Automatica Sinica), 2014, 40(12): 2899-2907
A dynamic-weighting data selection method is proposed and applied to discriminative training of acoustic models for continuous speech recognition. The method selects data jointly by posterior probability and phone accuracy. First, the word lattice is pruned with a posterior-based beam algorithm, and candidate words are then dynamically assigned different weights based on posterior probability according to the error rate of the candidate paths on which they lie. Second, the degree of confusion between phone pairs is measured statistically, and easily confused phone pairs are dynamically penalized with different weights when computing phone accuracy. Finally, based on the estimated distribution of expected arc accuracy, a Gaussian function is used to softly weight the expected phone accuracy of all competing arcs. Experiments show that, compared with the minimum phone error criterion, this dynamic weighting method improves recognition accuracy by 0.61% and effectively reduces training time.
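The Gaussian soft-weighting of expected arc accuracy can be illustrated directly (the centring value and σ are hypothetical):

```python
import math

def gaussian_soft_weight(expected_acc, mean, sigma=1.0):
    """Gaussian weight applied to a competing arc's expected phone accuracy:
    arcs near the centre of the estimated accuracy distribution get weight
    close to 1, outliers are smoothly down-weighted."""
    return math.exp(-((expected_acc - mean) ** 2) / (2 * sigma ** 2))

def weight_arcs(arc_accs, mean, sigma=1.0):
    """Return (accuracy, weight) pairs for all competing arcs."""
    return [(a, gaussian_soft_weight(a, mean, sigma)) for a in arc_accs]
```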

17.
Source recording device recognition is an important emerging research field in digital media forensics. The literature has mainly focused on the source recording device identification problem, whereas few studies have addressed the source recording device verification problem. Sparse representation based classification methods have shown promise for many applications. This paper proposes a source cell phone verification scheme based on sparse representation. It can be further divided into three schemes, which utilize an exemplar dictionary, an unsupervised learned dictionary, and a supervised learned dictionary respectively. Specifically, the discriminative dictionary learned by the supervised learning algorithm, which, unlike unsupervised learning, considers representational and discriminative power simultaneously, is utilized to further improve the performance of verification systems based on sparse representation. Gaussian supervectors (GSVs) based on MFCCs, which have been shown to be effective in capturing the intrinsic characteristics of recording devices, are utilized for constructing and learning the dictionary. SCUTPHONE, a corpus of speech recordings from 15 cell phones, is presented. Evaluation experiments conducted on three corpora of speech recordings from cell phones demonstrate the effectiveness of the proposed methods for cell phone verification. In addition, the influence of the number of target examples in the exemplar dictionary and of the size of the unsupervised learned dictionary on source cell phone verification performance is also analyzed.
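A drastically simplified picture of sparse-representation classification, one atom per class and minimum reconstruction residual wins, conveys the decision rule (a real system would use learned multi-atom dictionaries over GSV features):

```python
# Toy one-atom-per-class reduction of sparse-representation classification:
# project the test vector onto each class atom and pick the class whose
# reconstruction residual is smallest.
def class_residual(x, atom):
    num = sum(xi * ai for xi, ai in zip(x, atom))
    den = sum(ai * ai for ai in atom) or 1.0  # guard against a zero atom
    alpha = num / den                          # least-squares coefficient
    return sum((xi - alpha * ai) ** 2 for xi, ai in zip(x, atom))

def src_classify(x, dictionary):
    """dictionary: {class_label: atom}."""
    return min(dictionary, key=lambda c: class_residual(x, dictionary[c]))
```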

18.
In this paper we introduce a set of related confidence measures for large vocabulary continuous speech recognition (LVCSR) based on local phone posterior probability estimates output by an acceptor HMM acoustic model. In addition to their computational efficiency, these confidence measures are attractive because they may be applied at the state, phone, word, or utterance level, potentially enabling discrimination between different causes of low-confidence recognizer output, such as unclear acoustics or mismatched pronunciation models. We have evaluated these confidence measures for utterance verification using a number of different metrics. Experiments reveal several trends in "profitability of rejection", as measured by the unconditional error rate of a hypothesis test. These trends suggest that crude pronunciation models can mask the relatively subtle reductions in confidence caused by out-of-vocabulary (OOV) words and disfluencies, but not the gross model mismatches elicited by non-speech sounds. The observation that a purely acoustic confidence measure can outperform a measure based on both acoustic and language model information for data drawn from the Broadcast News corpus, but not for data drawn from the North American Business News corpus, suggests that the quality of fit offered by a trigram language model is reduced for Broadcast News data. We also argue that acoustic confidence measures may be used to inform the search for improved pronunciation models.
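A duration-normalised posterior-based confidence measure of the kind described can be sketched as follows (the rejection threshold is hypothetical):

```python
import math

def phone_confidence(frame_posteriors):
    """Duration-normalised log posterior of the decoded phone over its frames;
    values near 0 mean high confidence, strongly negative means low."""
    return sum(math.log(p) for p in frame_posteriors) / len(frame_posteriors)

def utterance_reject(phone_confs, threshold=-1.0):
    """Reject the hypothesis when the mean phone-level confidence falls
    below a (hypothetical) threshold."""
    return sum(phone_confs) / len(phone_confs) < threshold
```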

19.
Humans are able to recognise a word before its acoustic realisation is complete. This is in contrast to conventional automatic speech recognition (ASR) systems, which compute the likelihood of a number of hypothesised word sequences and identify the recognised words on the basis of a trace back of the hypothesis with the highest eventual score, in order to maximise efficiency and performance. In the present paper, we present an ASR system, SpeM, based on principles known from the field of human word recognition, that is able to model the human capability of ‘early recognition’ by computing word activation scores (based on negative log likelihood scores) during the speech recognition process. Experiments on 1463 polysyllabic words in 885 utterances showed that 64.0% (936) of these polysyllabic words were recognised correctly at the end of the utterance. For 81.1% of the 936 correctly recognised polysyllabic words, the local word activation allowed us to identify the word before its last phone was available, and 64.1% of those words were already identified one phone after their lexical uniqueness point. We investigated two types of predictors for deciding whether a word is considered as recognised before the end of its acoustic realisation. The first type is related to the absolute and relative values of the word activation, which trade false acceptances for false rejections. The second type of predictor is related to the number of phones of the word that have already been processed and the number of phones that remain until the end of the word. The results showed that SpeM’s performance increases if the amount of acoustic evidence in support of a word increases and the risk of future mismatches decreases.
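The two predictor types, activation-based and phones-remaining-based, can be combined into a toy early-recognition decision (all threshold values are hypothetical):

```python
# Toy early-recognition predicate combining an absolute activation test,
# a relative margin over the best competitor, and a test on how much of
# the word's acoustic realisation has already been observed.
def early_recognise(activation, best_competitor, phones_seen, phones_total,
                    abs_thresh=0.8, rel_thresh=0.2, min_seen_frac=0.5):
    enough_evidence = phones_seen / phones_total >= min_seen_frac
    return (activation >= abs_thresh
            and activation - best_competitor >= rel_thresh
            and enough_evidence)
```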

20.
In a text-independent pronunciation quality assessment system, the spoken content of the test utterance must first be recognized before accurate assessment is possible. Real evaluation data often contain many factors that hurt recognition accuracy, including noise, dialect accents, channel noise, and casual speaking style. Targeting these factors, this paper investigates the acoustic model in depth: background noise is added to the training data to strengthen the model's noise robustness; speaker-based cepstral mean and variance normalization (SCMVN) reduces the influence of the channel and of individual speaker characteristics; maximum a posteriori (MAP) adaptation with read speech from the same region as the test data gives the model local accented pronunciation characteristics; and MAP adaptation with spontaneous speech data lets the model better describe the casual pronunciation phenomena of natural speech. Experiments show that these measures raise the recognition accuracy of the test speech by a relative 44.1%, which in turn raises the correlation between machine scores and expert scores by a relative 6.3%.
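Speaker-based cepstral mean and variance normalisation is straightforward to sketch: per speaker, each cepstral coefficient is normalised to zero mean and unit variance over that speaker's frames:

```python
# Per-speaker cepstral mean and variance normalisation (CMVN):
# subtract the mean and divide by the standard deviation of each coefficient,
# computed over all of the speaker's frames.
def cmvn(frames):
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    std = [v ** 0.5 or 1.0 for v in var]  # guard constant dimensions
    return [[(f[d] - mean[d]) / std[d] for d in range(dim)] for f in frames]
```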


Copyright © Beijing Qinyun Technology Development Co., Ltd.  京ICP备09084417号