Similar Literature
20 similar documents retrieved.
1.
For a manually annotated Uyghur continuous-speech corpus at different speaking rates, this paper statistically analyzes the formant frequencies, duration, and intensity of each phoneme, and completes an acoustic analysis of stops and affricates in consonant-vowel structures. By fusing Mel-frequency cepstral coefficients (MFCCs) with formant frequencies and other acoustic features, and by modifying the number of model states, the acoustic model for Uyghur phoneme recognition is improved, and the effect of different acoustic features on phoneme recognition is verified. Compared with the baseline system, the improved acoustic model achieves a modest gain in recognition rate. Phonetic knowledge is also used to analyze why certain Uyghur phonemes are easily confused, providing a reference for further improvement of the phoneme-recognition acoustic model.
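As an illustration of the feature fusion described above, here is a minimal Python sketch (not the authors' code) that appends LPC-root formant estimates to MFCC frames. librosa, the bundled example clip, and all frame and order settings are assumptions.

```python
# Sketch: fuse MFCCs with LPC-based formant estimates, one vector per frame.
import numpy as np
import librosa

def formants_lpc(frame, sr, order=12, n_formants=3):
    """Estimate the first few formant frequencies of one frame via LPC roots."""
    a = librosa.lpc(frame.astype(float), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    freqs = freqs[freqs > 90]                  # drop near-DC roots
    out = np.zeros(n_formants)
    out[:min(n_formants, len(freqs))] = freqs[:n_formants]
    return out

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
fmts = np.stack([formants_lpc(frames[:, i], sr)
                 for i in range(frames.shape[1])], axis=1)
n = min(mfcc.shape[1], fmts.shape[1])
fused = np.vstack([mfcc[:, :n], fmts[:, :n]])  # (13 + 3) x n fused features
```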

2.
With the goal of building a basic platform for Uyghur continuous phoneme recognition, several key language-dependent techniques are studied for the first time on top of HTK (the hidden Markov model toolkit). Drawing on the linguistic characteristics of Uyghur, base Uyghur texts are designed for language modeling and for building the speech corpus; a fairly large speech corpus is recorded to specific technical criteria; phonemes are chosen as the modeling unit and Uyghur acoustic models are trained; and, under a letter-based N-gram language model, recognition results mapping spoken sentences to letter-sequence sentences are obtained. Recognition rates for the 32 Uyghur phonemes are tallied, and the easily confused phonemes and the sources of the confusions are identified, laying the groundwork for further improvements in recognition rate.
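A toy rendering of the letter-based N-gram language model mentioned above: an add-one-smoothed letter bigram model in plain Python. The training strings are placeholders (Latin letters stand in for the Uyghur alphabet).

```python
# Sketch: letter-level bigram LM with add-one smoothing.
from collections import defaultdict
import math

def train_bigram(sentences):
    counts, context, vocab = defaultdict(int), defaultdict(int), set()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        vocab.update(chars)
        for a, b in zip(chars, chars[1:]):
            counts[(a, b)] += 1
            context[a] += 1
    V = len(vocab)
    def logprob(s):
        chars = ["<s>"] + list(s) + ["</s>"]
        return sum(math.log((counts[(a, b)] + 1) / (context[a] + V))
                   for a, b in zip(chars, chars[1:]))
    return logprob

lm = train_bigram(["salam", "salat", "alma"])   # placeholder training data
print(lm("sala"))
```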

3.
To reduce the phoneme error rate of acoustic features in speech recognition systems and improve system performance, a feature-extraction method combining a subspace Gaussian mixture model (SGMM) with a deep neural network is proposed. The parameter scale of the SGMM is analyzed, and after its computational complexity is reduced it is cascaded with the deep neural network to further raise phoneme recognition rates. Speech data after nonlinear feature transformation are fed into the model, the best configuration of the deep network structure is found, and a more reliably trained network model is built for feature extraction; system performance is judged by comparing phoneme error rates. Simulation results show that features extracted by this system clearly outperform the traditional acoustic model.
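A minimal PyTorch sketch of the "neural network as feature extractor" idea in this abstract: an MLP with a bottleneck layer whose activations serve as new features for a downstream acoustic model. The layer sizes and the use of a bottleneck are assumptions, not the paper's exact architecture.

```python
# Sketch: DNN trained on phone targets, hidden bottleneck reused as features.
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, n_in=40, n_hidden=512, n_bottleneck=42, n_phones=48):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck),      # bottleneck layer
        )
        self.classifier = nn.Linear(n_bottleneck, n_phones)

    def forward(self, x):
        z = self.encoder(x)                         # features for the HMM stage
        return self.classifier(z), z

model = BottleneckDNN()
frames = torch.randn(100, 40)                       # 100 frames of 40-dim input
logits, feats = model(frames)
print(feats.shape)                                  # torch.Size([100, 42])
```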

4.
陈斌, 牛铜, 张连海, 李弼程, 屈丹. 《自动化学报》, 2014, 40(12): 2899-2907.
A dynamically weighted data-selection method is proposed and applied to discriminative training of acoustic models for continuous speech recognition. The method selects data jointly by posterior probability and phone accuracy. First, a posterior-based beam algorithm prunes the lattice; on that basis, candidate words are dynamically assigned different posterior weights according to the error rates of the candidate paths they lie on. Second, the degree of confusion between phone pairs is tallied, and easily confused pairs are dynamically penalized with different weights when computing phone accuracy. Finally, given the estimated distribution of expected arc accuracies, the expected phone accuracies of all competing arcs are softly weighted with a Gaussian function. Experiments show that, compared with the minimum phone error criterion, the dynamic weighting method improves recognition accuracy by 0.61% and effectively reduces training time.
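The final step of the abstract, soft-weighting expected phone accuracies with a Gaussian, might look like the following numpy sketch; the choice of mean and variance is an assumption.

```python
# Sketch: Gaussian soft weighting of per-arc expected phone accuracies.
import numpy as np

def gaussian_soft_weight(expected_acc, mu=None, sigma=None):
    acc = np.asarray(expected_acc, dtype=float)
    mu = acc.mean() if mu is None else mu          # assumed centre
    sigma = acc.std() + 1e-8 if sigma is None else sigma
    return np.exp(-((acc - mu) ** 2) / (2 * sigma ** 2))

arc_acc = np.array([0.95, 0.60, 0.91, 0.20, 0.88])  # made-up arc accuracies
weights = gaussian_soft_weight(arc_acc)
print((weights * arc_acc).round(3))                 # soft-weighted accuracies
```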

5.
To recognize Chinese-English civil-aviation radiotelephony speech, a deep-learning-based cross-lingual recognition method is proposed. A cross-lingual acoustic model is built on a convolutional deep neural network (CDNN) with shared hidden layers; Chinese phonemes are merged with English (CMU) phonemes to construct a mixed language model; on this basis, the CMU English phonemes are mapped to TIMIT English phonemes to rebuild the language model used for recognition; and, to shorten training and decoding time, a low frame rate is adopted at the feature-extraction stage. Experiments show that the CDNN acoustic model transfers well to the radiotelephony domain, the phoneme-mapping method further improves recognition performance, and the low frame rate effectively shortens training time while bringing the word error rate down to 4.28%.
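The phoneme-mapping step (CMU to TIMIT symbols) reduces to a lexicon rewrite; a hedged sketch follows, with only a small, illustrative subset of the symbol map and a made-up lexicon entry.

```python
# Sketch: rewrite a CMU-style lexicon with a CMU -> TIMIT symbol map.
CMU_TO_TIMIT = {   # illustrative subset, not the paper's full table
    "AA": "aa", "IY": "iy", "K": "k", "L": "l",
    "N": "n", "R": "r", "T": "t", "EH": "eh",
}

def map_pron(cmu_phones):
    return [CMU_TO_TIMIT.get(p, p.lower()) for p in cmu_phones]

lexicon = {"CLEAR": ["K", "L", "IY", "R"]}   # made-up entry
timit_lexicon = {w: map_pron(p) for w, p in lexicon.items()}
print(timit_lexicon)                         # {'CLEAR': ['k', 'l', 'iy', 'r']}
```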

6.
When acoustic models trained on standard Mandarin speech are applied to non-native Mandarin spoken by Uyghur speakers in Xinjiang, the speakers' large pronunciation deviations cause the recognition rate to drop sharply. To address this, multi-pronunciation dictionary techniques are applied to Mandarin speech recognition for Uyghur speakers: the recognizer's errors are statistically analyzed to build a phoneme confusion matrix, from which candidate pronunciations for each phoneme are obtained. Pruning strategies then trim and consolidate the candidates into an alternative dictionary that reflects how Uyghur speakers actually pronounce Mandarin. The recognition results of dictionaries produced by three pruning methods are compared, and experiments show that the dictionary produced by the relative-maximum pruning strategy significantly improves the system's recognition rate.
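A sketch of one plausible reading of the relative-maximum pruning strategy: keep a confusion as a pronunciation variant only if its count is at least a fixed fraction of the row maximum. The counts and threshold are invented for illustration.

```python
# Sketch: prune pronunciation variants from a phone confusion matrix.
def prune_variants(confusions, alpha=0.5):
    """confusions: {canonical_phone: {surface_phone: count}}"""
    variants = {}
    for phone, row in confusions.items():
        cmax = max(row.values())
        variants[phone] = sorted(s for s, c in row.items()
                                 if c >= alpha * cmax)
    return variants

conf = {"zh": {"zh": 80, "z": 30, "j": 5},   # made-up confusion counts
        "u":  {"u": 90, "v": 50, "o": 10}}
print(prune_variants(conf))                  # {'zh': ['zh'], 'u': ['u', 'v']}
```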

7.
The classic large-vocabulary continuous speech recognition (LVCSR) approach, a hidden Markov model (HMM) over Gaussian mixture models (GMMs) with short-time acoustic features such as MFCC or PLP as input vectors, has achieved good recognition results. However, short-time acoustic features are poorly discriminative, so this paper uses two types of discriminative features produced by a multilayer perceptron (MLP), HATs and TANDEM, in place of the short-time features to train the GMM parameter models. Experiments show that the LVCSR system built on GMM-HMMs with these discriminative features outperforms the traditional short-time-feature system. To raise recognition accuracy further, the HATs and TANDEM features are combined into a composite MLP feature stream to rebuild the GMM-HMM, yielding a clear 2%-3.8% improvement in character error rate (CER).
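Classic TANDEM post-processing, which this abstract builds on, log-compresses MLP phone posteriors and decorrelates them before GMM-HMM training; a numpy sketch with assumed dimensions follows.

```python
# Sketch: TANDEM post-processing of MLP posteriors (log + PCA decorrelation).
import numpy as np

def tandem_features(posteriors, n_keep=25):
    logp = np.log(posteriors + 1e-10)            # log-compress posteriors
    logp = logp - logp.mean(axis=0)              # centre before PCA
    cov = np.cov(logp, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    pcs = eigvec[:, np.argsort(eigval)[::-1][:n_keep]]
    return logp @ pcs                            # decorrelated TANDEM features

posts = np.random.dirichlet(np.ones(40), size=200)  # 200 frames, 40 phones
print(tandem_features(posts).shape)                 # (200, 25)
```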

8.
This paper presents an automatic language identification system based on lattice-based parallel phone recognition, an extension of parallel phone recognition that uses the lattice produced during recognition to describe the space of acoustic candidates, which carries richer information than the single best phone sequence used in plain parallel phone recognition. Tests on real-environment broadcast speech show a performance gain of about 6% over parallel phone recognition, and with roughly four hours of training data per language the method is competitive with several other language identification approaches.
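A toy version of the underlying parallel-phone-recognition scoring: score a recognized phone sequence under per-language phone bigram models and pick the best. The lattice-based method of the abstract would additionally sum such scores over lattice paths weighted by posterior; the models below are made up.

```python
# Sketch: phone recognition followed by per-language LM scoring.
def score(seq, bigram_logprobs, backoff=-6.0):
    return sum(bigram_logprobs.get((a, b), backoff)
               for a, b in zip(seq, seq[1:]))

models = {   # made-up log-probabilities for two toy "languages"
    "L1": {("s", "a"): -0.5, ("a", "t"): -0.7},
    "L2": {("s", "a"): -2.0, ("a", "t"): -2.5},
}
hyp = ["s", "a", "t"]                     # a recognized phone sequence
best = max(models, key=lambda L: score(hyp, models[L]))
print(best)                               # L1
```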

9.
To improve speech emotion recognition accuracy, a two-pass feature selection method is applied to the multidimensional feature set built from basic acoustic features; it jointly considers the intrinsic relation between feature parameters and emotion classes to build an optimized feature subset with effective emotional separability. In the recognition stage, a binary-tree multi-classifier is designed to balance overall system performance against complexity, and a kernel-fusion method improves the SVM model, with a multi-kernel SVM recognizing the most confusable emotions. The algorithm is validated on samples of five emotional states from the Berlin emotional speech database; results show that combining two-pass feature selection with kernel fusion effectively improves emotion recognition accuracy while retaining some robustness to noise.
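The kernel-fusion step can be sketched with scikit-learn by training an SVC on a precomputed weighted sum of two Gram matrices; the data, fusion weight, and gamma are placeholders.

```python
# Sketch: multi-kernel SVM via a precomputed fused Gram matrix.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 12))                 # 60 utterances, 12 features
y = rng.integers(0, 2, size=60)               # two confusable emotions

w = 0.6                                       # fusion weight (assumption)
K = w * rbf_kernel(X, X, gamma=0.1) + (1 - w) * linear_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, y)

# At test time the Gram matrix is computed between test and training data.
Xt = rng.normal(size=(5, 12))
Kt = w * rbf_kernel(Xt, X, gamma=0.1) + (1 - w) * linear_kernel(Xt, X)
print(clf.predict(Kt))
```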

10.
A mel-scale spectrum and phoneme-segmentation based endpoint detection method for words in Chinese speech
Spectral analysis of the acoustic speech signal is used to locate segmentation points between frames of the continuous speech signal, and combining this with a phoneme-segmentation method successfully improves segmentation precision. Experiments show that the mel-scale spectrum method is better suited to speech segmentation than traditional endpoint detection methods that rely on simple decision features such as short-time energy and zero-crossing rate.
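A minimal sketch of mel-spectrum-based endpoint detection: threshold per-frame mel-band energy rather than raw short-time energy or zero-crossing rate. librosa, the example clip, and the threshold rule are assumptions.

```python
# Sketch: endpoint detection from per-frame mel-spectrum energy.
import numpy as np
import librosa

y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=26,
                                   n_fft=400, hop_length=160)
energy = np.log(S.sum(axis=0) + 1e-10)        # per-frame mel energy
thr = energy.min() + 0.3 * (energy.max() - energy.min())  # assumed rule
frames = np.flatnonzero(energy > thr)
print(f"speech from frame {frames[0]} to {frames[-1]}")
```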

11.
This paper proposes a new feature extraction technique using wavelet-based sub-band parameters (WBSP) for classification of unaspirated Hindi stop consonants. The extracted acoustic parameters show marked deviation from the values reported for English and other languages, Hindi having distinguishing manner-based features. Since acoustic parameters are difficult to extract automatically for speech recognition, Mel-frequency cepstral coefficient (MFCC) based features are usually used. MFCCs are based on the short-time Fourier transform (STFT), which assumes the speech signal is stationary over a short period; this assumption is particularly violated by stop consonants. In WBSP, guided by the acoustic study, features derived from CV syllables receive different weighting factors, with the middle segment weighted most heavily. The wavelet transform is applied to split the signal into 8 sub-bands of different bandwidths, and the variation of energy across sub-bands is also taken into account. WBSP gives improved classification scores, and the number of filters used for feature extraction (8) is smaller than the number (24) used for MFCC. Its classification performance has been compared with four other techniques using a linear classifier. Further, principal component analysis (PCA) is applied to reduce dimensionality.
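The sub-band split at the heart of WBSP can be sketched with PyWavelets: a 7-level DWT yields 8 sub-bands, and normalized per-band energies serve as features. The wavelet choice ('db4') and the synthetic input are assumptions.

```python
# Sketch: wavelet sub-band energies as features (8 bands from a 7-level DWT).
import numpy as np
import pywt

def wbsp_energies(signal, wavelet="db4", level=7):
    coeffs = pywt.wavedec(signal, wavelet, level=level)   # 8 sub-bands
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    return energies / energies.sum()                      # normalized per band

x = np.random.randn(4000)          # stand-in for a CV-syllable segment
print(wbsp_energies(x).round(4))   # 8 relative sub-band energies
```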

12.
Initial-consonant recognition has important clinical significance in dysarthria assessment, but initials are short and non-stationary, and traditional methods recognize them poorly. This paper applies wavelet-transform multiscale analysis to the initial-consonant signal to extract a new feature vector (DWTMFC-CT) that characterizes the differences between similar initials more finely, and then recognizes initials with a fuzzy multi-class support vector machine. To reduce the computational complexity the fuzzy SVM incurs in multi-class classification, a two-stage algorithm is used. Experiments...

13.
In phoneme recognition experiments, it was found that approximately 75% of misclassified frames were assigned labels within the same broad phonetic group (BPG). While the phoneme can be described as the smallest distinguishable unit of speech, phonemes within BPGs share very similar characteristics and are easily confused. However, different BPGs, such as vowels and stops, possess very different spectral and temporal characteristics. In order to accommodate the full range of phonemes, acoustic models of speech recognition systems calculate input features from all frequencies over a large temporal context window. A new phoneme classifier is proposed consisting of a modular arrangement of experts, with one expert assigned to each BPG and focused on discriminating between phonemes within that BPG. Due to the different temporal and spectral structure of each BPG, novel feature sets are extracted using mutual information to select a relevant time-frequency (TF) feature set for each expert. To construct a phone recognition system, the output of each expert is combined with a baseline classifier under the guidance of a separate BPG detector. In phoneme recognition experiments on the TIMIT continuous speech corpus, the proposed architecture afforded significant relative error rate reductions of up to 5%.
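The per-expert feature-selection step might look like the following scikit-learn sketch: rank time-frequency features by mutual information with the within-BPG phoneme label and keep the top ones. The data here is synthetic.

```python
# Sketch: mutual-information ranking of TF features for one BPG expert.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(1)
tf_feats = rng.normal(size=(500, 120))     # 500 frames x 120 TF features
labels = rng.integers(0, 5, size=500)      # 5 phonemes within one BPG

mi = mutual_info_classif(tf_feats, labels, random_state=1)
top = np.argsort(mi)[::-1][:20]            # 20 most informative features
expert_inputs = tf_feats[:, top]
print(expert_inputs.shape)                 # (500, 20)
```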

14.
Audio-visual speech modeling for continuous speech recognition
This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the lip movements of a given speaker and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows the definition of different temporal topologies and levels of stream integration and hence enables the modeling of temporal dependencies more accurately than traditional approaches. We present two different methods to learn the asynchrony between the two modalities and show how to incorporate them in the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (RelAtive SpecTrA) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
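The core multistream idea, combining per-state log-likelihoods from the two streams with stream exponents before decoding, reduces to a weighted sum; a numpy sketch with assumed weights follows.

```python
# Sketch: multistream combination of per-state log-likelihoods.
import numpy as np

def combine_streams(ll_audio, ll_video, w_audio=0.7):
    """Frame x state log-likelihood matrices, one per stream.
    The weight would typically track acoustic SNR (assumption)."""
    return w_audio * ll_audio + (1.0 - w_audio) * ll_video

ll_a = np.log(np.random.dirichlet(np.ones(3), size=10))  # 10 frames, 3 states
ll_v = np.log(np.random.dirichlet(np.ones(3), size=10))
print(combine_streams(ll_a, ll_v).shape)                 # (10, 3)
```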

15.
Due to the various techniques used in experimental phonetics and to language inventories, more and more has been learned about the nature of stops in the world's languages. Stop consonants occur in all languages, with voiceless unaspirated stops being the most common. The differences in voice onset time (VOT) have been termed lead vs. short lag, where VOT itself is defined as the timing between the onset of phonation and the release of the occlusion of the vocal tract. For Hungarian, no systematic analysis of the stops has been carried out thus far. This paper investigates the acoustic and perceptual properties of the VOTs of the three Hungarian voiceless stops both when they appear in isolation (in syllables and in words) and when they occur in spontaneous speech. The results of the acoustic analysis show a clear difference between careful and spontaneous speech. Bilabials and velars are significantly shorter in fluent speech than in careful speech (18.51 msec and 35.31 msec respectively, as opposed to 24.64 msec and 50.17 msec), while dentals seem to be unchanged (23.3 msec as opposed to 26.59 msec). The actual duration of VOT is therefore characteristic of the place of articulation of stops in spontaneous speech, while the VOTs of bilabials and dentals do not differ from each other in careful speech. Vowels following the stops influence them more in careful than in spontaneous speech, which can also be explained by the experimentally confirmed shift of present-day Hungarian vowel qualities toward the neutral vowel. Voice onset time is thus a specific feature of the Hungarian unaspirated plosive consonants. A further experiment was carried out to define the actual function of the VOTs of the voiceless stops in Hungarian listeners' perception.

16.
Feature extraction algorithms based on color moments, Hu moments, Zernike moments, and wavelet moments are introduced, and a size-feature extraction algorithm is improved. Because any single feature extraction algorithm captures incomplete feature information, cannot treat the samples to be recognized differentially, and yields low recognition rates, an improved feature extraction algorithm is proposed, generated from the five algorithms above through a self-optimizing combination based on feature distances. The algorithm's formulas and execution flow are described, and a feature library is built for the project. Recognition tests on several kinds of easily confused fruit show that the improved algorithm's recognition rate clearly exceeds that of any single algorithm; the average recognition time is slightly longer but still meets the requirements of real-time recognition.
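Two of the descriptors named above, color moments and Hu moments, can be fused with OpenCV as in the following sketch; the input image is synthetic, and the log scaling of the Hu moments is a common convention, not necessarily the paper's.

```python
# Sketch: concatenate color moments and Hu moments into one feature vector.
import numpy as np
import cv2
from scipy.stats import skew

def color_moments(img_bgr):
    feats = []
    for c in cv2.split(img_bgr):
        v = c.reshape(-1).astype(float)
        feats += [v.mean(), v.std(), skew(v)]   # mean/std/skew per channel
    return np.array(feats)                      # 9 color-moment features

def hu_moments(img_bgr):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    hu = cv2.HuMoments(cv2.moments(gray)).ravel()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)  # usual log scaling

img = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
fused = np.concatenate([color_moments(img), hu_moments(img)])
print(fused.shape)                              # (16,)
```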

17.
This paper presents an investigation into ways of integrating articulatory features into hidden Markov model (HMM)-based parametric speech synthesis. In broad terms, this may be achieved by estimating the joint distribution of acoustic and articulatory features during training. This may in turn be used in conjunction with a maximum-likelihood criterion to produce acoustic synthesis parameters for generating speech. Within this broad approach, we explore several variations that are possible in the construction of an HMM-based synthesis system which allow articulatory features to influence acoustic modeling: model clustering, state synchrony and cross-stream feature dependency. Performance is evaluated using the RMS error of generated acoustic parameters as well as formal listening tests. Our results show that the accuracy of acoustic parameter prediction and the naturalness of synthesized speech can be improved when shared clustering and asynchronous-state model structures are adopted for combined acoustic and articulatory features. Most significantly, however, our experiments demonstrate that modeling the dependency between these two feature streams can make speech synthesis systems more flexible. The characteristics of synthetic speech can be easily controlled by modifying generated articulatory features as part of the process of producing acoustic synthesis parameters.

18.
19.
A phonetic-knowledge-based method for classifying Chinese consonants
This paper proposes a framework for improving the recognition of Chinese consonants. Within it, a multi-level classifier based on acoustic-phonetic analysis is constructed that classifies all Chinese consonants without overlap, and the effect of combining the consonant classification results with probabilistic statistical models is tested. Several feature extraction techniques used for Chinese consonant classification and the corresponding experimental results are discussed in detail. The extracted parameters include the duration of the unvoiced segment (DUP) and normalized effective-band energy trends, drawing on time-domain, frequency-domain, and wavelet-domain analyses; the parameters are simple and effective, and are largely independent of the following vowel and of the speaker. The classifier divides the 21 Chinese consonants into 5 classes, {m, n, l, r}, {b, d, g}, {p, t, k, f, h}, {zh, ch, sh}, and {z, c, s, j, q, x}, with classification accuracies of 97.21%, 97.10%, 97.70%, 93.31%, and 94.80% respectively. The speech corpus used in the experiments contains isolated-syllable Chinese consonant recordings from 21 speakers.

20.
The cascading appearance-based (CAB) feature extraction technique has established itself as the state-of-the-art in extracting dynamic visual speech features for speech recognition. In this paper, we focus on investigating the effectiveness of this technique for the related speaker verification application. By investigating the speaker verification ability of each stage of the cascade, we demonstrate that the same steps taken to reduce static speaker and environmental information for the visual speech recognition application also provide similar improvements for visual speaker recognition. A further study compares synchronous HMM (SHMM) based fusion of CAB visual features and traditional perceptual linear predictive (PLP) acoustic features, showing that the higher complexity inherent in the SHMM approach does not appear to provide any improvement in the final audio-visual speaker verification system over simpler utterance-level score fusion.
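The utterance-level score fusion used as the simpler baseline reduces to a weighted sum of per-utterance scores with a decision threshold; both constants below are assumptions.

```python
# Sketch: utterance-level audio-visual score fusion for verification.
def fuse_and_verify(acoustic_score, visual_score, w=0.6, threshold=0.0):
    fused = w * acoustic_score + (1.0 - w) * visual_score
    return fused, fused > threshold          # accept if above threshold

score, accept = fuse_and_verify(acoustic_score=1.3, visual_score=-0.4)
print(score, accept)                         # 0.62 True
```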
