Similar literature
Found 20 similar documents; search took 296 ms.
1.
Computer-aided pronunciation training (CAPT) technologies enable the use of automatic speech recognition to detect mispronunciations in second language (L2) learners' speech. To further facilitate learning, we aim to develop a principle-based method for grading the severity of mispronunciations. This paper presents an approach to gradation that is motivated by auditory perception. We have developed a computational method for generating a perceptual distance (PD) between two spoken phonemes. This is used to compute the auditory confusion of the native language (L1). PD is found to correlate well with the mispronunciations detected in a CAPT system for Chinese learners of English, i.e., L1 being Chinese (Mandarin and Cantonese) and L2 being US English. The results show that auditory confusion is indicative of pronunciation confusions in L2 learning. PD can also be used to grade the severity of errors (i.e., mispronunciations that confuse more distant phonemes are more severe) and accordingly to prioritize the order of corrective feedback generated for learners.
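
The grading idea in this abstract (errors that confuse more distant phonemes are more severe) can be sketched as a simple ranking. This is only an illustration, not the paper's method: the `PERCEPTUAL_DISTANCE` table, its values, and the function names are hypothetical, and the abstract does not say how PD is actually computed.

```python
# Hypothetical perceptual distances between phoneme pairs.
# The values are made up for illustration only.
PERCEPTUAL_DISTANCE = {
    ("th", "s"): 0.2,
    ("v", "w"): 0.4,
    ("r", "l"): 0.7,
}

def perceptual_distance(intended, produced):
    """Look up a symmetric distance; identical phonemes are at distance 0."""
    if intended == produced:
        return 0.0
    key = (intended, produced)
    if key not in PERCEPTUAL_DISTANCE:
        key = (produced, intended)
    return PERCEPTUAL_DISTANCE[key]

def rank_by_severity(errors):
    """Order detected mispronunciations so the most distant confusions come first."""
    return sorted(errors, key=lambda e: perceptual_distance(*e), reverse=True)

# Each error is an (intended phoneme, produced phoneme) pair.
detected = [("th", "s"), ("r", "l"), ("w", "v")]
feedback_order = rank_by_severity(detected)
```

Under these assumed distances, corrective feedback would address the /r/-/l/ confusion first and the /th/-/s/ confusion last.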

2.
Speech translation is a technology that helps people communicate across different languages. The most commonly used speech translation model is composed of automatic speech recognition, machine translation and text-to-speech synthesis components, which share information only at the text level. However, spoken communication is different from written communication in that it uses rich acoustic cues such as prosody in order to transmit more information through non-verbal channels. This paper is concerned with speech-to-speech translation that is sensitive to this paralinguistic information. Our long-term goal is to make a system that allows users to speak a foreign language with the same expressiveness as if they were speaking in their own language. Our method works by reconstructing input acoustic features in the target language. From the many different possible paralinguistic features to handle, in this paper we choose duration and power as a first step, proposing a method that can translate these features from input speech to the output speech in continuous space. This is done in a simple and language-independent fashion by training an end-to-end model that maps source-language duration and power information into the target language. Two approaches are investigated: linear regression and neural network models. We evaluate the proposed methods and show that paralinguistic information in the input speech of the source language can be reflected in the output speech of the target language.
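
Of the two approaches named in this abstract, linear regression is the simpler to sketch. The example below fits one least-squares line that maps a source-language duration to a target-language duration; power could be handled the same way. It is a simplified sketch under assumptions: the abstract does not give the paper's feature layout or model form, and `fit_line`, `predict`, and the toy numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new source value."""
    slope, intercept = model
    return slope * x + intercept

# Toy paired data (assumed): source-word duration -> target-word duration, in seconds.
src_duration = [0.20, 0.30, 0.40, 0.50]
tgt_duration = [0.25, 0.35, 0.45, 0.55]
duration_model = fit_line(src_duration, tgt_duration)

# A lengthened (emphasised) source word maps to a lengthened target word.
emphasised = predict(duration_model, 0.60)
```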

3.
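With the continuing development of large-scale automatic speech recognition, computer-assisted pronunciation teaching built on it has also advanced. As a supplement to traditional teaching, it largely compensates for the shortage of traditional educational resources and for the inability of traditional methods to give learners timely feedback. Verifying and evaluating the pronunciation errors of second-language (L2) learners is one of the more important research topics in computer-assisted pronunciation training. To address the lack of annotated L2 mispronunciation data in the mispronunciation verification task, this paper proposes a method based on acoustic phoneme embeddings and a Siamese network. Pairs of speech features that carry pairing information are used as the system input, and a neural network maps the speech features to a high-level representation with the aim of separating different phonemes. A Siamese network is introduced in training: the distance between the two output phoneme vectors is adjusted and optimized according to whether they come from the same phoneme class, and the optimization is carried out through a corresponding loss function. The results show that the Siamese network with a cosine max-margin distance loss achieved an accuracy of 89.93%, better than the other methods in the experiments. When applied to the mispronunciation verification task, the method also achieved a diagnostic accuracy of 89.19% without training on annotated L2 mispronunciation data.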
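
The pairwise training signal described here can be sketched with a cosine-based margin loss. The abstract does not give the exact loss, so the form below (a common cosine embedding loss) is an assumption, as are the function names and the default `margin` value.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_margin_loss(a, b, same_phoneme, margin=0.5):
    """Pull same-phoneme pairs together; push different-phoneme pairs
    apart until their cosine similarity drops below `margin`."""
    cos = cosine_similarity(a, b)
    if same_phoneme:
        return 1.0 - cos
    return max(0.0, cos - margin)

# Two embeddings of the same phoneme that already coincide: no loss.
loss_same = cosine_margin_loss([1.0, 0.0], [1.0, 0.0], same_phoneme=True)
# Two embeddings of different phonemes that coincide: penalised.
loss_diff = cosine_margin_loss([1.0, 0.0], [1.0, 0.0], same_phoneme=False)
```

In a Siamese setup both vectors would come from the same network with shared weights, and this loss would be minimized over many labeled pairs.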

4.
Speaker variability is known to have an adverse impact on speech systems that process linguistic content, such as speech and language recognition. However, changes in an individual's speech production due to stress and emotion have a similarly detrimental effect on speaker recognition, as they introduce a mismatch with speaker models that are typically trained on modal speech. The focus of this study is on the analysis of stress-induced variations in speech and the design of an automatic stress level assessment scheme that could be used in directing stress-dependent acoustic models or normalization strategies. Current stress detection methods typically employ a binary decision based on whether or not the speaker is under stress. In reality, the amount of stress in individuals varies and can change gradually. Using speech and biometric data collected in a real-world, variable-stress-level law enforcement training scenario, this study considers two methods for stress level assessment. The first approach uses a nearest neighbor clustering scheme at the vowel token and sentence levels to classify speech data into three levels of stress. The second approach employs Euclidean distance metrics within the multi-dimensional feature space to provide real-time stress level tracking capability. Evaluations on audio data confirmed by biometric readings show both methods to be effective in assessing stress level within a speaker (average accuracy of 55.6% in a 3-way classification task). In addition, the impact of high-level stress on in-set speaker recognition is evaluated; it is shown to reduce accuracy from 91.7% (low/mid stress) to 21.4% (high-level stress).

5.
A precise identification of prosodic phenomena and the construction of tools able to properly manage such phenomena are essential steps to disambiguate the meaning of certain utterances. In particular they are useful for a wide variety of tasks: automatic recognition of spontaneous speech, automatic enhancement of speech-generation systems, solving ambiguities in natural language interpretation, the construction of large annotated language resources, such as prosodically tagged speech corpora, and teaching languages to foreign students using Computer Aided Language Learning (CALL) systems. This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Prosodic prominence involves two different prosodic features: pitch accent and stress accent. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable nuclei duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable with the inter-human agreement reported in the literature. Two different prominence detectors were studied and developed: the first uses a training corpus to set up thresholds properly, while the second uses a pure unsupervised method. In both cases, it is worth stressing that only acoustic parameters derived directly from speech waveforms are exploited.

6.
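Correct use of collocations is an important feature that distinguishes idiomatic users of English from ordinary learners. Analysis of the Chinese Learner English Corpus (CLEC) shows that verb-noun collocation errors are among the errors learners commonly make. This paper proposes a hierarchical language model for correcting learners' verb-noun collocation errors. The model takes into account the dependencies among words within a sentence and breaks the sentence into clauses at different levels; words within the same clause are highly correlated, while words in different clauses are only weakly correlated. The model gives more stable results when sentence constituents change, and because collocation information is condensed, the resulting language model is more precise. The model is used to generate classifier features and to rank results. Applied to the detection and correction of English verb-noun collocation errors, this hierarchical language model performs better than traditional language models.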

7.
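Most existing automatic emotion annotation methods extract a single type of recognition feature, from either the acoustic layer or the linguistic layer. Because the Yi language has many dialect branches and is highly complex, automatic annotation with single-layer emotion features achieves low accuracy. Exploiting characteristics of Yi such as its rich emotional affixes, a two-layer feature fusion method is proposed: emotion features are extracted separately from the acoustic layer and the linguistic layer, the feature sequences are aligned by generating sequences and adding units on demand, and the automatic emotion annotation is then carried out with corresponding feature fusion and automatic annotation algorithms. Using Yi speech and text data from a poverty-alleviation log database as samples, comparison experiments were run with three different classifiers. The results show that the choice of classifier has no obvious effect on the annotation results, whereas accuracy improves markedly after two-layer feature fusion, from 48.1% for the acoustic layer and 34.4% for the linguistic layer to 64.2% for the fused layers.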

9.
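An English preposition error correction system automatically corrects the preposition errors that are common in the English of language learners. First, the preposition errors in an annotated corpus were classified and counted, yielding 21 common prepositions, and a training set was obtained from an English wiki corpus using an automatic error insertion algorithm. Then, on this training set, a maximum-entropy classifier was trained with features that include the context and the prepositional complement. Finally, the model is used to make predictions on input sentences and to correct usage errors. Experiments on the NUCLE corpus report the effects of corpus processing, model characteristics, training corpus size, and number of iterations on test-set performance, and compare the results with those of a naive Bayes model. The system reaches an F-score of 27.68 on the test data, a small improvement over the best result in the CoNLL-2013 shared task.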
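
For readers unfamiliar with the metric, an F-score balances precision and recall over proposed corrections. The sketch below assumes the balanced F1 variant reported as a percentage, which the abstract does not state explicitly; the counts are hypothetical and unrelated to the paper's results.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    proposed = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / proposed if proposed else 0.0
    recall = true_positives / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 3 correct edits, 1 wrong edit, 3 missed errors.
# precision = 3/4, recall = 3/6
score = f1_score(true_positives=3, false_positives=1, false_negatives=3)
percentage = 100 * score
```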

10.
English lexical stress is acoustically related to a combination of duration, intensity, fundamental frequency (F0) and vowel quality. Errors in any or all of these correlates could interfere with production of the stress contrast, but it is unknown which correlates are most difficult for L1 Bengali speakers to acquire. This study compares the use of these correlates in the production of English lexical stress contrasts by 10 L1 English and 20 L1 Bengali speakers. The results showed that L1 Bengali speakers produced significantly less native-like stress patterns, although they used all four acoustic correlates to distinguish stressed from unstressed syllables. L1 English speakers reduced vowel duration in unstressed vowels significantly more than L1 Bengali speakers did, and the increase in intensity and F0 in stressed vowels was greater for L1 English speakers than for L1 Bengali speakers. There were also significant differences in formant patterns across speaker groups: L1 Bengali speakers produced English-like vowel reduction in certain unstressed syllables, but in other cases they tended either not to reduce vowels in unstressed syllables or to reduce them incorrectly. The results suggest that L1 Bengali speakers' production of the English lexical stress contrast is influenced by L1 language experience and L1 phonology.

11.
This paper presents a study on the evaluation and analysis of natural language tweets. Based on our experience in text summarisation, we carry out a deep analysis of users' perception by evaluating tweets generated manually and automatically from news. Specifically, we consider two key properties of a tweet: its informativeness and its interestingness. We therefore analyse: (1) do users perceive manual and automatic tweets equally?; (2) what linguistic features should a good tweet have to be interesting as well as informative? The main challenge of this proposal is the analysis of tweets to help companies with their positioning and reputation on the Web. Our results show that: (1) informative and interesting natural language tweets can be generated automatically by summarisation approaches; and (2) we can characterise good and bad tweets based on specific linguistic features not present in other types of tweets.

12.
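Manual prosodic annotation of large-scale corpora consumes a great deal of time and labor. This paper studies how to make full use of a small amount of manually annotated data to train an automatic speech stress annotator that is as accurate as possible. Four training methods are listed and their results compared. Training combines an acoustic classifier and a linguistic classifier, with a combined classifier used for final optimization. In the experiments, the best annotation accuracy was obtained by training the acoustic classifier on machine-generated data and reserving the limited manual data for the final combined classifier. The final accuracy reached 94.0%, close to the 97.2% upper bound of manual annotation accuracy.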

13.
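An English speaking teaching system based on speech recognition technology   Cited by: 1 (self-citations: 0, citations by others: 1)
Many computer-assisted English learning applications lack assessment and feedback for spoken-language learning. This paper describes an English speaking learning system that uses speech recognition technology. Besides the usual pronunciation scoring, it provides error detection based on phoneme association and phoneme recognition. Combined with improvement suggestions from a correction knowledge base and prosody-corrected speech, it can give learners timely help. Experimental results show that it can correct most of the unintentional errors made by learners who already have some foundation.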

14.
15.
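A system for detecting and correcting grammatical errors in English learners' writing can supply parameters for automated essay scoring and assess the overall quality of an essay. It can also be used in computer-assisted English teaching to give students written corrective feedback and promote the development of their L2 writing ability. This paper surveys the application of natural language processing techniques to automatic grammatical error detection for English learners over the past decade. It first introduces three data-driven system design approaches based on large-scale native-speaker and learner corpora, then discusses evaluation criteria for error detection systems, and finally offers some suggestions for improving the accuracy of existing systems.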

16.
A major challenge in identifying singers from monaural popular music recordings is to remove or alleviate the influence of accompaniments. Our system is realized in two stages. In the first stage, we exploit computational auditory scene analysis (CASA) to segregate the singing voice units from a mixture signal. First, the pitch of the singing voice is estimated to extract the pitch-based features of each unit in an acoustic vector. These features are then exploited to estimate the binary time-frequency (T-F) masks, where 1 indicates that the corresponding T-F unit is dominated by the singing voice, and 0 indicates otherwise. The regions dominated by the singing voice are considered reliable, and other units are unreliable or missing. Thus the acoustic vector is incomplete. In the second stage, two missing feature methods, reconstruction of the acoustic vector and marginalization, are used to identify the singer by dealing with the incomplete acoustic vectors. For reconstruction, the complete acoustic vector is first reconstructed and then converted to obtain the Gammatone frequency cepstral coefficients (GFCCs), which are further used to identify the singer. For marginalization, the probabilities that the voice belongs to a certain singer are computed on the basis of only the reliable components. We find that the reconstruction method outperforms the marginalization method, while both methods perform significantly well in contrast to another system, especially at signal-to-accompaniment ratios (SARs) of 0 dB and −3 dB.
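
The marginalization step, scoring a singer using only the reliable components of the masked vector, can be sketched as follows. The abstract does not name the singer model, so a single diagonal Gaussian per singer is assumed here for brevity; the function names, model parameters, and observation are all hypothetical.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def marginal_log_likelihood(features, mask, means, variances):
    """Sum per-dimension log-likelihoods over reliable (mask == 1) units only."""
    return sum(
        log_gaussian(x, m, v)
        for x, keep, m, v in zip(features, mask, means, variances)
        if keep == 1
    )

def identify_singer(features, mask, singer_models):
    """Pick the singer whose model best explains the reliable components."""
    return max(
        singer_models,
        key=lambda name: marginal_log_likelihood(features, mask, *singer_models[name]),
    )

# Each model is (means, variances); values are made up for illustration.
singer_models = {
    "singer_a": ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
    "singer_b": ([3.0, 3.0, 3.0], [1.0, 1.0, 1.0]),
}
# The third component is dominated by accompaniment, so the mask drops it.
observation = [0.1, -0.2, 3.0]
mask = [1, 1, 0]
best = identify_singer(observation, mask, singer_models)
```

Because the unreliable third component is skipped rather than guessed, it cannot pull the decision toward the wrong singer.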

17.
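Research on continuous speech recognition methods for languages with scarce corpus resources   Cited by: 2 (self-citations: 0, citations by others: 2)
Because minority languages have their own characteristics, existing continuous speech recognition methods cannot simply be applied to them. Taking Mongolian as an example, this paper discusses the construction of acoustic and language models and implements a Mongolian speech recognition system on the continuous speech recognizer of the Advanced Telecommunications Research Institute International (ATR) in Japan. The paper focuses on building the language model: based on the agglutinative nature of Mongolian, it proposes building a multi-class N-gram model by clustering similar words. Experimental results show that with the proposed language model, recognition accuracy is 5.5% higher than with the traditional word N-gram approach.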
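
Class-based N-grams help when data is scarce because probabilities are shared among the words of a class. The sketch below is a standard unsmoothed class-based bigram, which is not necessarily the paper's multi-class variant. The word classes are assigned by hand here rather than by clustering, and the English toy corpus stands in for Mongolian data that the abstract does not provide.

```python
from collections import Counter

def train_class_bigram(sentences, word_to_class):
    """Count class transitions and word emissions for a class-based bigram."""
    transitions, contexts = Counter(), Counter()
    class_totals, word_totals = Counter(), Counter()
    for sentence in sentences:
        classes = [word_to_class[w] for w in sentence]
        transitions.update(zip(classes, classes[1:]))
        contexts.update(classes[:-1])  # classes that have a successor
        class_totals.update(classes)
        word_totals.update(sentence)
    return transitions, contexts, class_totals, word_totals

def bigram_probability(prev_word, word, model, word_to_class):
    """P(word | prev_word) ~= P(class(word) | class(prev_word)) * P(word | class(word))."""
    transitions, contexts, class_totals, word_totals = model
    prev_c, c = word_to_class[prev_word], word_to_class[word]
    p_class = transitions[(prev_c, c)] / contexts[prev_c]
    p_word = word_totals[word] / class_totals[c]
    return p_class * p_word

# Hand-assigned classes and a toy corpus (both assumptions).
word_to_class = {"cat": "NOUN", "dog": "NOUN", "runs": "VERB", "sleeps": "VERB"}
sentences = [["cat", "runs"], ["dog", "sleeps"], ["cat", "sleeps"]]
model = train_class_bigram(sentences, word_to_class)

# "dog runs" never occurs in the corpus, yet it gets a non-zero probability.
p_unseen = bigram_probability("dog", "runs", model, word_to_class)
```

A word-level bigram would assign the unseen pair "dog runs" zero probability without smoothing; sharing counts through the classes avoids that.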

18.
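For "stress-timed languages" such as English, stress is a very important prosodic feature. To address the shortcomings of the fixed frame length used in traditional feature extraction, a pitch-synchronous frame analysis method is used, and pitch-synchronous energy and pitch-synchronous peak features based on dynamic frame length are proposed. In word stress detection on continuous English speech, using the new features together with traditional features reduced the error rate by 6.65%.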
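
Pitch-synchronous framing means that frame boundaries follow the pitch marks, so frame length varies with the pitch period instead of being fixed. The abstract does not define the two features, so the sketch below assumes mean-square energy and maximum absolute amplitude per frame; the samples and pitch mark positions are hypothetical.

```python
def pitch_synchronous_features(samples, pitch_marks):
    """Return (energy, peak) per frame, where each frame spans two
    consecutive pitch marks, so frame length follows the pitch period."""
    features = []
    for start, end in zip(pitch_marks, pitch_marks[1:]):
        frame = samples[start:end]
        energy = sum(s * s for s in frame) / len(frame)  # mean-square energy
        peak = max(abs(s) for s in frame)                # largest magnitude
        features.append((energy, peak))
    return features

samples = [0, 2, -2, 0, 1, -1, 1, -1, 0, 0]
pitch_marks = [0, 4, 10]  # hypothetical sample indices of pitch marks
features = pitch_synchronous_features(samples, pitch_marks)
```

Here the first frame covers 4 samples and the second covers 6, which a fixed-length framing scheme could not represent.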

19.
Positive effects of learner control decrease when learners do not perceive the control given to them, make suboptimal choices, or are cognitively overloaded by the amount of choice. This study proposes shared control (i.e., learners choose from a pre-selection of suitable tasks) over highly variable tasks to tackle these problems. Ninety-four students participated in a 2 × 2 factorial experiment with the factors control (system, shared) and variability of surface features (low, high). Results show superior effects on training performance, transfer test performance, and task involvement of shared control when learners can choose from pre-selected tasks with surface features that are different from the surface features of previous tasks.

20.
Traditional studies of speaker state focus primarily upon one-stage classification techniques using standard acoustic features. In this article, we investigate multiple novel features and approaches to two recent tasks in speaker state detection: level-of-interest (LOI) detection and intoxication detection. In the task of LOI prediction, we propose a novel Discriminative TFIDF feature to capture important lexical information and a novel Prosodic Event detection approach using AuToBI; we combine these with acoustic features for this task using a new multilevel multistream prediction feedback and similarity-based hierarchical fusion learning approach. Our experimental results outperform the published results of all systems in the 2010 Interspeech Paralinguistic Challenge – Affect Subchallenge. In the intoxication detection task, we evaluate the performance of Prosodic Event-based, phone duration-based, phonotactic, and phonetic-spectral based approaches, finding that a combination of the phonotactic and phonetic-spectral approaches achieves a significant improvement over the 2011 Interspeech Speaker State Challenge – Intoxication Subchallenge baseline. We discuss our results using these new features and approaches and their implications for future research.
