Similar Documents
20 similar documents found (search time: 671 ms)
1.
Motivated by the needs of speech-application research such as speech synthesis and speech recognition, and starting from the text-analysis module, this study is the first to make use of the "Uyghur Speech Acoustic Parameter Database". From it, 556 disyllabic words containing stops and affricates were selected, and the voice onset time (VOT) of stops and affricates in word-medial and word-final position was extracted and statistically analyzed. For the first time, the concepts of aspirated and unaspirated voiceless stops and voiceless affricates are proposed for Uyghur from the perspective of experimental phonetics. The analysis leads to two conclusions: (1) mean VOT clearly separates Uyghur stops and affricates into voiced and voiceless categories; (2) in terms of VOT type, voiceless stops are unaspirated when they occur syllable-initially in the second syllable of a disyllabic word (word-medial), whereas in syllable-final position of the second syllable (word-final) they are sometimes read as aspirated and sometimes as unaspirated, varying freely with context and individual pronunciation habits. The results are of considerable value for improving the naturalness of Uyghur speech synthesis and the recognition accuracy of Uyghur speech recognition systems.
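As an illustration of conclusion (1), a minimal sketch of VOT-based voicing classification; the VOT values and the zero-millisecond threshold below are hypothetical stand-ins, not measurements from the Uyghur database:

```python
import statistics

# Hypothetical VOT measurements in milliseconds (negative = prevoicing);
# real values would come from the acoustic parameter database.
vot_samples = {
    "b": [-85.0, -92.0, -78.0],   # voiced stop: voicing leads the release
    "p": [12.0, 15.0, 10.0],      # voiceless unaspirated: short positive lag
    "t": [14.0, 11.0, 16.0],
}

def classify_voicing(vot_ms, threshold_ms=0.0):
    """Label a stop as voiced or voiceless from its (mean) VOT."""
    return "voiced" if vot_ms < threshold_ms else "voiceless"

for phone, samples in vot_samples.items():
    mean_vot = statistics.mean(samples)
    print(phone, round(mean_vot, 1), classify_voicing(mean_vot))
```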

2.
This paper proposes a new feature extraction technique using wavelet based sub-band parameters (WBSP) for classification of unaspirated Hindi stop consonants. The extracted acoustic parameters show marked deviation from the values reported for English and other languages, Hindi having distinguishing manner-based features. Since acoustic parameters are difficult to extract automatically for speech recognition, Mel frequency cepstral coefficient (MFCC) based features are usually used. MFCCs are based on the short-time Fourier transform (STFT), which assumes the speech signal to be stationary over a short period; this assumption is specifically violated in the case of stop consonants. In WBSP, based on an acoustic study, the features derived from CV syllables are given different weighting factors, with the middle segment having the maximum. The wavelet transform is applied to split the signal into 8 sub-bands of different bandwidths, and the variation of energy across the sub-bands is also taken into account. WBSP gives improved classification scores. The number of filters used for feature extraction in WBSP (8) is smaller than the number used for MFCC (24). Its classification performance has been compared with four other techniques using a linear classifier. Further, principal components analysis (PCA) has also been applied to reduce dimensionality.
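The sub-band energy idea behind WBSP can be sketched with a hand-rolled wavelet-packet transform; this is a simplified stand-in (Haar filters, 3 levels giving 8 equal-width bands), not the filter bank actually used in the paper:

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: orthonormal averages (low) and differences (high)."""
    x = x[: len(x) // 2 * 2].reshape(-1, 2)
    low = (x[:, 0] + x[:, 1]) / np.sqrt(2.0)
    high = (x[:, 0] - x[:, 1]) / np.sqrt(2.0)
    return low, high

def wavelet_packet_energies(signal, levels=3):
    """Full wavelet-packet split into 2**levels sub-bands; energy per band."""
    bands = [np.asarray(signal, dtype=float)]
    for _ in range(levels):
        bands = [half for band in bands for half in haar_step(band)]
    return [float(np.sum(band ** 2)) for band in bands]

rng = np.random.default_rng(0)
sig = rng.standard_normal(512)
energies = wavelet_packet_energies(sig)
print(len(energies))  # 8 sub-bands, as in WBSP
```

Because the Haar steps are orthonormal, the sub-band energies sum to the energy of the input frame.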

3.
In general, speech is made up of sequences of consonants (fricatives, nasals and stops), vowels and glides. The classification of stop consonants remains one of the most challenging problems in speech recognition. In this paper, we propose a new approach based on the normalized energy in frequency bands during the release and closure phases in order to characterize and classify the Arabic stop consonants (/b/, /d/, /t/, /k/ and /q/) and to recognize the CV syllable. Classification experiments were performed using decision algorithms on stop consonants C and CV syllables extracted from an Arabic corpus. The results yielded an overall stop-consonant classification rate of 90.27% and CV-syllable recognition above 90% for all stops.
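Normalized band energies of the kind used to characterize the release and closure phases can be sketched as follows; the band edges and frame length are illustrative, not those of the paper:

```python
import numpy as np

def normalized_band_energies(frame, fs, edges=(0, 500, 1500, 3000, 6000)):
    """Per-band spectral energy of one frame, normalized by total band energy."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    energies = np.array([
        spectrum[(freqs >= lo) & (freqs < hi)].sum()
        for lo, hi in zip(edges[:-1], edges[1:])
    ])
    total = energies.sum()
    return energies / total if total > 0 else energies

# A 2 kHz tone at 16 kHz sampling: the energy lands in the 1500-3000 Hz band.
fs = 16000
t = np.arange(512) / fs
e = normalized_band_energies(np.sin(2 * np.pi * 2000 * t), fs)
print(np.argmax(e))  # band index 2
```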

4.
Amita Dev, AI & Society, 2009, 23(4): 603-612
As development of a speech recognition system depends entirely upon the spoken language used for its development, and since speech technology is highly language dependent and reverse engineering is not possible, there is an utmost need to develop such systems for Indian languages. In this paper we present the implementation of a time delay neural network (TDNN) system in a modular fashion, exploiting the hidden structure of previously trained phonetic subcategory networks, for recognition of Hindi consonants. For the present study we selected all the Hindi phonemes for recognition. A vocabulary of 207 Hindi words was designed for the task-specific environment and used as a database. For phoneme recognition, a three-layered network was constructed and trained using the back-propagation learning algorithm. Experiments were conducted to categorize Hindi voiced and unvoiced stops, semivowels, vowels, nasals and fricatives. A close observation of the confusion matrix of Hindi stops revealed maximum confusion of retroflex stops with their non-retroflex counterparts.

5.
A precise identification of prosodic phenomena and the construction of tools able to properly manage such phenomena are essential steps to disambiguate the meaning of certain utterances. In particular they are useful for a wide variety of tasks: automatic recognition of spontaneous speech, automatic enhancement of speech-generation systems, solving ambiguities in natural language interpretation, the construction of large annotated language resources, such as prosodically tagged speech corpora, and teaching languages to foreign students using Computer Aided Language Learning (CALL) systems. This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Prosodic prominence involves two different prosodic features: pitch accent and stress accent. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable-nucleus duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable to the inter-human agreement reported in the literature. Two different prominence detectors were studied and developed: the first uses a training corpus to set thresholds properly, while the second uses a purely unsupervised method. In both cases, it is worth stressing that only acoustic parameters derived directly from speech waveforms are exploited.

6.
Fuzzy algorithms provide a simpler and more powerful approach than statistical decision methods for describing non-ideal (fuzzy) environments in which there exists no precise boundary between the categories, due to inherent vagueness rather than randomness. This paper attempts to demonstrate the effectiveness of such an algorithm when applied to the computer recognition of patterns of biological origin, such as Telugu unaspirated plosives in the initial position of a large number of utterances in CVC context. A multicategorizer is described in which the fuzzy processor embodies a fuzzy property extractor and a similarity matrix generator. A provision for controlling fuzziness in property sets has been made by keeping two parameters, 'exponential' and 'denominational' fuzzifiers, in the components of the property matrices; their effect on recognition score is also studied.

The machines' performance is illustrated by plotting curves and through confusion matrices, with the transition, duration and slope of transition from the point of transient release of the stop closure to the steady state of only the first two formants used as input features. Voiced stops are differentiated more easily than unvoiced stops, with the maximum overall recognition score ranging from 60% for dentals to 85% for bilabials. The fuzzy hedge 'slightly', when applied to property sets, reduces the confusion relative to the hedge 'very', and consecutive applications of the operations 'CON', 'DIL' and 'INT' resulted in a wide variation of about 20 to 25% in the recognition score. Such a variation is found to be insignificant beyond an optimum value of the exponential fuzzifier.
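The hedges and operators mentioned ('very', 'slightly', concentration, dilation, intensification) have standard definitions in the fuzzy-set literature; a minimal sketch under Zadeh's usual definitions, with 'very' modeled as concentration and 'slightly' as dilation (an assumption, since the abstract does not give the paper's exact definitions):

```python
import numpy as np

def very(mu):
    """Concentration (CON): sharpens membership, mu**2."""
    return np.asarray(mu, dtype=float) ** 2

def slightly(mu):
    """Dilation (DIL): softens membership, mu**0.5."""
    return np.asarray(mu, dtype=float) ** 0.5

def intensify(mu):
    """Contrast intensification (INT): pushes memberships away from 0.5."""
    mu = np.asarray(mu, dtype=float)
    return np.where(mu <= 0.5, 2 * mu ** 2, 1 - 2 * (1 - mu) ** 2)

mu = np.array([0.2, 0.5, 0.9])
print(very(mu))       # memberships sharpened toward 0
print(intensify(mu))  # low values pushed down, high values pushed up
```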

7.
8.
The high error rate in spontaneous speech recognition is due in part to the poor modeling of pronunciation variations. An analysis of acoustic data reveals that pronunciation variations include both complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by another alternative phone, such as ‘b’ being pronounced as ‘p’. Partial changes are the variations within the phoneme, such as nasalization, centralization, voiceless, voiced, etc. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the hidden Markov state level. In this paper, we show that partial changes are a lot less clear-cut than previously assumed and cannot be modeled by mere representation by alternate phones or a concatenation of phone units. We propose modeling partial changes through acoustic model reconstruction. We first propose a partial change phone model (PCPM) to differentiate pronunciation variations. In order to improve the model resolution without increasing the parameter size too much, PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for PCPM triphones, and one auxiliary decision tree can only be used by one standard decision tree. The acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus (1997 MBN) with different styles of speech. It gives a significant 2.39% syllable error rate absolute reduction in spontaneous speech.

9.
The porting of a speech recognition system to a new language is usually a time-consuming and expensive process since it requires collecting, transcribing, and processing a large amount of language-specific training sentences. This work presents techniques for improved cross-language transfer of speech recognition systems to new target languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. For our experiments, we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 h of read speech is available.

10.
Recognition of easily confusable phonemes in Mandarin Chinese
Targeting the acoustic characteristics of easily confusable phonemes in Mandarin speech recognition, this paper applies wavelet-packet decomposition to perceptual linear prediction (PLP) features and proposes a new feature-extraction algorithm that describes the spectral characteristics of confusable phonemes more precisely. A Gaussian mixture model is used to classify the new acoustic features and thereby discriminate between the phonemes. Experimental results show that the new feature parameters outperform conventional PLP features, reducing the recognition error rate by more than 30%.
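The Gaussian-classification step can be sketched with a single diagonal-covariance Gaussian per class, i.e. a one-component stand-in for the mixture model described above; the 13-dimensional feature vectors are synthetic, not PLP features:

```python
import numpy as np

def fit_diag_gaussian(X):
    """Per-class model: mean and (floored) diagonal variance."""
    return X.mean(axis=0), X.var(axis=0) + 1e-6

def log_likelihood(x, mean, var):
    return -0.5 * float(np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

# Synthetic 13-dimensional feature vectors for two confusable phoneme classes.
rng = np.random.default_rng(1)
models = {
    "a": fit_diag_gaussian(rng.normal(0.0, 1.0, size=(200, 13))),
    "b": fit_diag_gaussian(rng.normal(1.5, 1.0, size=(200, 13))),
}

def classify(x):
    """Maximum-likelihood decision between the class models."""
    return max(models, key=lambda k: log_likelihood(x, *models[k]))

print(classify(np.zeros(13)))
```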

11.
The development of an audiovisual pronunciation teaching and training method and software system is discussed in this article. The method is designed to help children with speech and hearing disorders gain better control over their speech production. The teaching method is drawn up for progression from individual sound preparation to practice of sounds in sentences for four languages: English, Swedish, Slovenian, and Hungarian. The system is a general language-independent measuring tool and database editor. This database editor makes it possible to construct modules for all participant languages and for different sound groups. Two modules are under development for the system in all languages: one for teaching and training vowels to hearing-impaired children and the other for correction of misarticulated fricative sounds. In the article we present the measuring methods, the distance-score calculations applied to the visualized speech spectra, and problems in the evaluation of the new multimedia tool.

12.
A general method which combines formant synthesis by rule and time-domain concatenation is proposed. This method utilizes the advantages of both techniques by maintaining naturalness while minimizing difficulties such as prosodic modification and spectral discontinuities at the point of concatenation. An integrated sampled natural glottal source (Matsui et al., 1991) and sampled voiceless consonants were incorporated into a real-time text-to-speech formant synthesizer. In special cases, voicing amplitude envelopes and formant transitions derived from natural speech were also utilized. Several listening tests were performed to evaluate these methods. We obtained a significant overall improvement in intelligibility over our previous formant synthesizer. Such improvements in intelligibility were previously obtained with a Japanese text-to-speech system using a related hybrid system (Kamai and Matsui, 1993), indicating the applicability of this method for multi-lingual synthesis. The results of subjective analyses showed that these methods can also improve naturalness and listenability.

13.
14.
The paper proposes a diphone/sub-syllable method for Arabic text-to-speech (ATTS) systems. The proposed approach exploits the particular syllabic structure of Arabic words. For good quality, the boundaries of the speech segments are chosen to occur only in the sustained portion of vowels. The speech segments consist of consonant–half-vowel units, half-vowel–consonant units, half vowels, middle portions of vowels, and suffix consonants. The minimum set consists of about 310 segments for classical Arabic.

15.
16.
This article presents a cross-lingual study for Hungarian and Finnish on the segmentation of continuous speech at the word and phrase level by examination of supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language-modelling level by detecting word and phrase boundaries, thus significantly decreasing the search space during decoding. Search-space reduction is highly important in the case of agglutinative languages. In Hungarian and in Finnish, stress, when present, always falls on the first syllable of the word. Thus if stressed syllables can be detected, these must be at the beginning of a word. We have developed different algorithms based either on a rule-based or a data-driven approach. The rule-based algorithms and HMM-based methods are compared. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective, hence it was discarded. By use of supra-segmental features, word boundaries can be marked with high accuracy, even if we are unable to find all of them. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this we adapted our data-driven method to Finnish and obtained similar results.

17.
In a text-independent pronunciation quality assessment system, the spoken content of the test utterance must first be recognized before an accurate assessment can be made. Real evaluation data often contain many factors that hurt recognition accuracy, including background noise, dialectal accents, channel noise, and casual speaking style. To address these factors, this paper investigates the acoustic model in depth: background noise is added to the training data to strengthen the model's noise robustness; speaker-based cepstral mean and variance normalization (SCMVN) is applied to reduce the influence of the channel and of individual speaker characteristics; read-speech data from the same region as the test utterances are used for maximum a posteriori (MAP) adaptation, so that the model captures the pronunciation characteristics of the local dialect accent; and spontaneous-speech data are used for MAP adaptation, so that the model better describes the rather casual pronunciation phenomena of natural speech. Experimental results show that these measures raise the recognition accuracy of the test utterances by 44.1% relative, which in turn raises the correlation coefficient between machine scores and expert scores by 6.3% relative.

18.
Taking Uyghur as an example, this paper studies continuous speech recognition methods for minority languages that lack natural-speech corpora. HTK is used to generate seed models from a small amount of manually annotated speech, which then bootstrap acoustic-model training on a larger speech corpus; a statistical language model is built with the palmkit toolkit, and continuous speech recognition is carried out with the Julius decoder. In the experiments, a monophone acoustic model was trained on 6,400 short utterances of spontaneous speech from 64 native Uyghur speakers, and a class-based 3-gram language model was built from 100 MB of text and a 60,000-word lexicon. Test results show a recognition accuracy of 72.5%, 4.2 percentage points higher than using HTK alone.

19.
The purpose of this paper is the application of genetic algorithms (GAs) at the supervised classification level, in order to recognize Standard Arabic (SA) fricative consonants in continuous, naturally spoken speech. We use GAs because of their advantages in solving complicated optimization problems where analytic methods fail. To that end, we analyzed a corpus containing several sentences composed of the thirteen types of fricative consonants in initial, medial and final positions, recorded by several male Jordanian speakers. Nearly all the world's languages contain at least one fricative sound. The SA language occupies a rather exceptional position in that nearly half of its consonants are fricatives, and nearly half of the fricative inventory is situated far back in the uvular, pharyngeal and glottal areas. We used Mel-frequency cepstral analysis to extract vocal-tract coefficients from the speech signal. Among classifiers such as Bayesian, likelihood and distance classifiers, we chose the distance classifier, which is based on a classification measure criterion. We thus formulate supervised classification as a function-optimization problem and use the Mahalanobis-distance decision rule as the fitness function for GA evaluation. We report promising results, with a classification recognition accuracy of 82%.
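The Mahalanobis-distance decision rule that serves as the fitness criterion can be sketched as follows; the two-class data are synthetic and the GA search itself is omitted:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance of x from a class with the given mean and inverse covariance."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

def classify(x, class_stats):
    """Assign x to the class whose Mahalanobis distance is smallest."""
    return min(class_stats, key=lambda c: mahalanobis(x, *class_stats[c]))

# Synthetic 4-dimensional cepstral-like data for two fricative classes.
rng = np.random.default_rng(2)
class_stats = {}
for name, center in {"s": 0.0, "sh": 3.0}.items():
    X = rng.normal(center, 1.0, size=(100, 4))
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(4)
    class_stats[name] = (X.mean(axis=0), np.linalg.inv(cov))

print(classify(np.zeros(4), class_stats))
```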

20.
This paper presents our work in automatic speech recognition (ASR) in the context of under-resourced languages, with application to Vietnamese. Different techniques for bootstrapping acoustic models are presented. First, we present the use of acoustic–phonetic unit distances and the potential of crosslingual acoustic modeling for under-resourced languages. Experimental results on Vietnamese showed that with only a few hours of target-language speech data, crosslingual context-independent modeling worked better than crosslingual context-dependent modeling; however, it was outperformed by the context-dependent approach when more speech data were available. We concluded, therefore, that in both cases crosslingual systems are better than monolingual baseline systems. The proposal of grapheme-based acoustic modeling, which avoids building a phonetic dictionary, is also investigated in our work. Finally, since the use of sub-word units (morphemes, syllables, characters, etc.) can reduce the high out-of-vocabulary rate and mitigate the lack of text resources in statistical language modeling for under-resourced languages, we propose several methods to decompose, normalize and combine word and sub-word lattices generated from different ASR systems. The proposed lattice combination scheme results in a relative syllable error rate reduction of 6.6% over the sentence MAP baseline method for a Vietnamese ASR task.
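The syllable error rate reported above is conventionally the Levenshtein edit distance between the reference and hypothesized syllable sequences, divided by the reference length; a minimal sketch with made-up syllable strings:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def syllable_error_rate(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

# Made-up syllable sequences: one substitution and one deletion.
ref = ["xin", "chao", "viet", "nam"]
hyp = ["xin", "chau", "nam"]
print(syllable_error_rate(ref, hyp))  # 0.5
```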

