20 similar documents found; search took 671 ms
1.
Motivated by the needs of speech applications such as speech synthesis and speech recognition, and starting from the text-analysis module, this study makes first use of the "Uyghur Speech Acoustic Parameter Database". We selected 556 disyllabic words containing stops and affricates and extracted the voice onset time (VOT) parameters of stops and affricates occurring word-medially and word-finally in disyllabic words, and analyzed the measurements statistically. For the first time, the concepts of aspirated and unaspirated voiceless stops and affricates are proposed from the perspective of experimental phonetics. The analysis yields two conclusions: (1) mean VOT clearly separates Uyghur stops and affricates by voicing category; (2) in terms of VOT type, voiceless stops are unaspirated when they occur at the onset of the second syllable of a disyllabic word (word-medially), whereas at the coda of the second syllable (word-finally) they are sometimes read aspirated and sometimes unaspirated, varying freely with context and with individual speakers' pronunciation habits. The results are of considerable value for improving the naturalness of Uyghur speech synthesis and the recognition accuracy of speech recognition systems.
2.
This paper proposes a new feature extraction technique using wavelet-based sub-band parameters (WBSP) for the classification of unaspirated Hindi stop consonants. The extracted acoustic parameters show marked deviation from the values reported for English and other languages, Hindi having distinguishing manner-based features. Since acoustic parameters are difficult to extract automatically for speech recognition, Mel-Frequency Cepstral Coefficient (MFCC) based features are usually used. MFCCs are based on the short-time Fourier transform (STFT), which assumes the speech signal to be stationary over a short period; this assumption is specifically violated in the case of stop consonants. In WBSP, based on an acoustic study, the features derived from CV syllables are given different weighting factors, with the middle segment having the maximum. The wavelet transform is applied to split the signal into 8 sub-bands of different bandwidths, and the variation of energy across the sub-bands is also taken into account. WBSP gives improved classification scores, and the number of filters used for feature extraction (8) is smaller than the number used for MFCC (24). Its classification performance has been compared with four other techniques using a linear classifier. Further, principal components analysis (PCA) has also been applied to reduce dimensionality.
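The sub-band energy idea behind WBSP can be sketched with a uniform Haar wavelet-packet split into 8 bands. This is a minimal stand-in, not the paper's method: the actual WBSP uses sub-bands of different bandwidths, and the function name and level choice here are illustrative.

```python
import numpy as np

def haar_subband_energies(signal, levels=3):
    """Split a signal into 2**levels sub-bands with a Haar wavelet-packet
    decomposition and return the energy in each band.
    (Illustrative only: WBSP as described uses 8 sub-bands of *different*
    bandwidths; here we use a uniform split for simplicity.)"""
    bands = [np.asarray(signal, dtype=float)]
    for _ in range(levels):
        next_bands = []
        for b in bands:
            if len(b) % 2:                          # pad to even length
                b = np.append(b, 0.0)
            lo = (b[0::2] + b[1::2]) / np.sqrt(2)   # approximation (low-pass)
            hi = (b[0::2] - b[1::2]) / np.sqrt(2)   # detail (high-pass)
            next_bands.extend([lo, hi])
        bands = next_bands
    return [float(np.sum(b ** 2)) for b in bands]

# A constant (purely low-frequency) signal concentrates all energy
# in the first sub-band; total energy is preserved by the transform.
energies = haar_subband_energies(np.ones(64), levels=3)
```

The per-band energies (optionally normalized and combined with the segment weighting the abstract mentions) would then form the feature vector.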
3.
Karim Tahiry Badia Mounir Ilham Mounir Laila Elmazouzi Abdelmajid Farchi 《International Journal of Speech Technology》2017,20(4):869-880
In general, speech is made up of sequences of consonants (fricatives, nasals, and stops), vowels, and glides. The classification of stop consonants remains one of the most challenging problems in speech recognition. In this paper, we propose a new approach based on the normalized energy in frequency bands in the release and closure phases in order to characterize and classify the Arabic stop consonants (/b/, /d/, /t/, /k/ and /q/) and to recognize the CV syllable. Classification experiments were performed using decision algorithms on stop consonants C and CV syllables extracted from an Arabic corpus. The results yielded an overall stop-consonant classification rate of 90.27% and CV-syllable recognition above 90% for all stops.
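Normalized band-energy features of this kind can be sketched from a magnitude spectrum. The band edges below are made up for the example; the paper's actual bands and its release/closure segmentation are not reproduced here.

```python
import numpy as np

def normalized_band_energies(frame, sr, bands):
    """Energy in each frequency band divided by the frame's total energy,
    computed from the power spectrum of one analysis frame (e.g. the
    burst-release phase of a stop). Band edges are in Hz."""
    spec = np.abs(np.fft.rfft(np.asarray(frame, dtype=float))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = spec.sum()
    out = []
    for lo, hi in bands:
        mask = (freqs >= lo) & (freqs < hi)
        out.append(float(spec[mask].sum() / total))
    return out

# A 1 kHz tone at 8 kHz sampling puts essentially all of its energy
# into the 500-1500 Hz band.
t = np.arange(800) / 8000.0
feats = normalized_band_energies(np.sin(2 * np.pi * 1000 * t), 8000,
                                 [(0, 500), (500, 1500), (1500, 4001)])
```

A decision algorithm would then compare such feature vectors for the release and closure phases across the stop classes.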
4.
Amita Dev 《AI & Society》2009,23(4):603-612
As the development of a speech recognition system depends entirely upon the spoken language used for its development, and since speech technology is highly language dependent and reverse engineering is not possible, there is an utmost need to develop such systems for Indian languages. In this paper we present the implementation of a time delay neural network (TDNN) system in a modular fashion, exploiting the hidden structure of previously trained phonetic subcategory networks for the recognition of Hindi consonants. For the present study we selected all the Hindi phonemes for recognition. A vocabulary of 207 Hindi words was designed for the task-specific environment and used as a database. For phoneme recognition, a three-layered network was constructed and trained using the back-propagation learning algorithm. Experiments were conducted to categorize Hindi voiced and unvoiced stops, semivowels, vowels, nasals, and fricatives. A close observation of the confusion matrix of Hindi stops revealed maximum confusion of retroflex stops with their non-retroflex counterparts.
5.
A precise identification of prosodic phenomena and the construction of tools able to properly manage such phenomena are essential steps in disambiguating the meaning of certain utterances. In particular, they are useful for a wide variety of tasks: automatic recognition of spontaneous speech, automatic enhancement of speech-generation systems, solving ambiguities in natural language interpretation, the construction of large annotated language resources such as prosodically tagged speech corpora, and teaching languages to foreign students using Computer Aided Language Learning (CALL) systems. This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English, but with good prospects of application to other languages. Prosodic prominence involves two different prosodic features: pitch accent and stress accent. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable-nucleus duration and mid-to-high-frequency emphasis. This paper shows that a careful measurement of these acoustic parameters, as well as the identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable to the inter-human agreement reported in the literature. Two different prominence detectors were studied and developed: the first uses a training corpus to set thresholds properly, while the second uses a purely unsupervised method. In both cases, it is worth stressing that only acoustic parameters derived directly from speech waveforms are exploited.
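A threshold-based prominence detector of the kind described can be sketched as follows. This is a toy version: the two features, their units, and the mean-based fallback thresholds are assumptions for illustration, not the paper's actual design.

```python
import numpy as np

def detect_prominent(syllables, f0_thresh=None, energy_thresh=None):
    """Flag prominent syllables by thresholding two acoustic cues.
    `syllables` is a list of (f0_excursion, rms_energy) pairs, one per
    syllable. If no thresholds are supplied, the corpus means are used
    (a crude stand-in for the unsupervised variant); supplying thresholds
    estimated from a training corpus mimics the supervised variant."""
    f0 = np.array([s[0] for s in syllables], dtype=float)
    en = np.array([s[1] for s in syllables], dtype=float)
    if f0_thresh is None:
        f0_thresh = f0.mean()
    if energy_thresh is None:
        energy_thresh = en.mean()
    return [(a > f0_thresh) and (b > energy_thresh) for a, b in zip(f0, en)]

# Syllables 2 and 4 stand out in both F0 excursion and energy.
sylls = [(1.0, 0.2), (4.5, 0.9), (0.5, 0.3), (3.8, 0.8)]
flags = detect_prominent(sylls)   # [False, True, False, True]
```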
6.
SANKAR KUMAR PAL DWIJESH DUTTA MAJUMDER 《International journal of systems science》2013,44(8):873-886
Fuzzy algorithms provide a simpler and more powerful approach than statistical decision methods for describing non-ideal (fuzzy) environments in which there exists no precise boundary between categories, owing to inherent vagueness rather than randomness. This paper attempts to demonstrate the effectiveness of such an algorithm when applied to the computer recognition of patterns of biological origin, namely Telugu unaspirated plosives in initial position in a large number of utterances in CVC context. A multicategorizer is described in which the fuzzy processor embodies a fuzzy property extractor and a similarity-matrix generator. Provision for controlling fuzziness in the property sets has been made by keeping two parameters, the 'exponential' and 'denominational' fuzzifiers, in the components of the property matrices; their effect on the recognition score is also studied. The machine's performance is explained by plotted curves and through confusion matrices when the transition, duration, and slope of transition from the point of transient release of the stop closure to the steady state of only the first two formants were used as input features. Voiced stops are differentiated more easily than unvoiced stops, with the maximum overall recognition score ranging from 60% for dentals to 85% for bilabials. The fuzzy hedge 'slightly', when applied to property sets, reduces the confusion relative to the hedge 'very', and consecutive applications of the operations 'CON', 'DIL' and 'INT' resulted in a wide variation of about 20 to 25% in the recognition score. Such variation is found to be insignificant beyond an optimum value of the exponential fuzzifier.
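The hedges and operations mentioned ('very', 'slightly', CON, DIL, INT) correspond to standard Zadeh-style membership transforms, sketched below. The paper's 'exponential' and 'denominational' fuzzifiers generalize the fixed exponents used here; these definitions are the textbook forms, not necessarily the paper's exact parameterization.

```python
def CON(mu):
    """Concentration: sharpens a membership value (hedge 'very')."""
    return mu ** 2

def DIL(mu):
    """Dilation: softens a membership value (hedges like 'slightly')."""
    return mu ** 0.5

def INT(mu):
    """Contrast intensification: pushes memberships away from 0.5."""
    return 2 * mu ** 2 if mu <= 0.5 else 1 - 2 * (1 - mu) ** 2

mu = 0.64
very_mu = CON(mu)       # 0.4096 - more restrictive
slightly_mu = DIL(mu)   # 0.8    - more permissive
```

Applying such operators consecutively to the fuzzy property sets is what produces the 20-25% swing in recognition score reported in the abstract.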
7.
8.
The high error rate in spontaneous speech recognition is due in part to the poor modeling of pronunciation variations. An analysis of acoustic data reveals that pronunciation variations include both complete changes and partial changes. Complete changes are the replacement of a canonical phoneme by another alternative phone, such as 'b' being pronounced as 'p'. Partial changes are variations within the phoneme, such as nasalization, centralization, devoicing, voicing, etc. Most current work in pronunciation modeling attempts to represent pronunciation variations either by alternative phonetic representations or by the concatenation of subphone units at the hidden Markov state level. In this paper, we show that partial changes are far less clear-cut than previously assumed and cannot be modeled by mere representation by alternate phones or a concatenation of phone units. We propose modeling partial changes through acoustic model reconstruction. We first propose a partial change phone model (PCPM) to differentiate pronunciation variations. In order to improve the model resolution without increasing the parameter size too much, the PCPM is used as a hidden model and merged into the pre-trained acoustic model through model reconstruction. To avoid model confusion, auxiliary decision trees are established for PCPM triphones, and one auxiliary decision tree can only be used by one standard decision tree. The acoustic model reconstruction on triphones is equivalent to decision tree merging. The effectiveness of this approach is evaluated on the 1997 Hub4NE Mandarin Broadcast News corpus (1997 MBN) with different styles of speech. It gives a significant absolute reduction of 2.39% in syllable error rate on spontaneous speech.
9.
Nikos Chatzichrisafis Vassilios Diakoloukas Vassilios Digalakis Costas Harizakis 《IEEE transactions on audio, speech, and language processing》2007,15(3):928-938
The porting of a speech recognition system to a new language is usually a time-consuming and expensive process, since it requires collecting, transcribing, and processing a large amount of language-specific training sentences. This work presents techniques for improved cross-language transfer of speech recognition systems to new target languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. For our experiments, we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 h of read speech is available.
10.
11.
K. Vicsi P. Roach A. Öster Z. Kacic P. Barczikay A. Tantos F. Csatári Zs. Bakcsi A. Sfakianaki 《International Journal of Speech Technology》2000,3(3-4):289-300
The development of an audiovisual pronunciation teaching and training method and software system is discussed in this article. The method is designed to help children with speech and hearing disorders gain better control over their speech production. The teaching method is drawn up for progression from individual sound preparation to practice of sounds in sentences for four languages: English, Swedish, Slovenian, and Hungarian. The system is a general language-independent measuring tool and database editor. This database editor makes it possible to construct modules for all participant languages and for different sound groups. Two modules are under development for the system in all languages: one for teaching and training vowels to hearing-impaired children and the other for correcting misarticulated fricative sounds. In the article we present the measuring methods, the distance-score calculations used on the visualized speech spectra, and problems in the evaluation of the new multimedia tool.
12.
A general method which combines formant synthesis by rule and time-domain concatenation is proposed. This method utilizes the advantages of both techniques by maintaining naturalness while minimizing difficulties such as prosodic modification and spectral discontinuities at the point of concatenation. An integrated sampled natural glottal source (Matsui et al., 1991) and sampled voiceless consonants were incorporated into a real-time text-to-speech formant synthesizer. In special cases, voicing amplitude envelopes and formant transitions derived from natural speech were also utilized. Several listening tests were performed to evaluate these methods. We obtained a significant overall improvement in intelligibility over our previous formant synthesizer. Such improvements in intelligibility were previously obtained with a Japanese text-to-speech system using a related hybrid system (Kamai and Matsui, 1993), indicating the applicability of this method to multi-lingual synthesis. The results of subjective analyses showed that these methods can also improve naturalness and listenability.
13.
14.
The paper proposes a diphone/sub-syllable method for Arabic Text-to-Speech (ATTS) systems. The proposed approach exploits the particular syllabic structure of Arabic words. For good quality, the boundaries of the speech segments are chosen to occur only in the sustained portion of vowels. The speech segments consist of consonant-half-vowel units, half-vowel-consonant units, half vowels, middle portions of vowels, and suffix consonants. The minimum set consists of about 310 segments for classical Arabic.
15.
16.
This article presents a cross-lingual study for Hungarian and Finnish of the segmentation of continuous speech at the word and phrasal level by examination of supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language-modelling level by detecting word and phrase boundaries, which lets us significantly reduce the search space during decoding. Search-space reduction is highly important in the case of agglutinative languages.
In Hungarian and in Finnish, if stress is present, it always falls on the first syllable of the word. Thus, if stressed syllables can be detected, they must be at the beginning of a word. We have developed different algorithms based either on a rule-based or a data-driven approach, and the rule-based algorithms are compared with HMM-based methods. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together; syllable length was found to be much less effective and was therefore discarded. By use of supra-segmental features, word boundaries can be marked with high accuracy, even if we are unable to find all of them. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this, we adapted our data-driven method to Finnish and obtained similar results.
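For a fixed-stress language, the core idea can be caricatured as peak-picking on a combined F0/energy score: each prominence peak marks a stressed first syllable and hence a word-onset candidate. This is a toy stand-in for the paper's rule-based, HMM-based, and data-driven detectors; the equal weighting is an assumption.

```python
import numpy as np

def boundary_candidates(f0, energy, weight=0.5):
    """Return indices of word-boundary candidates: local maxima of a
    weighted F0/energy prominence score, taken per syllable. In a
    fixed-stress language each such peak should sit on the first
    syllable of a word."""
    score = (weight * np.asarray(f0, dtype=float)
             + (1 - weight) * np.asarray(energy, dtype=float))
    return [i for i in range(1, len(score) - 1)
            if score[i] > score[i - 1] and score[i] >= score[i + 1]]

# Two prominence peaks -> two word-onset candidates (indices 1 and 4).
onsets = boundary_candidates([1, 3, 1, 1, 4, 1], [1, 2, 1, 1, 3, 1])
```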
17.
In a text-independent pronunciation quality assessment system, the spoken content of the test utterance must first be recognized before an accurate assessment can be made. Real evaluation data contain many factors that hurt recognition accuracy, including noise, dialect accents, channel noise, and casual speaking style. To address these factors, this paper studies the acoustic model in depth: background noise is added to the training data to strengthen the model's noise robustness; speaker-based cepstral mean and variance normalization (SCMVN) is applied to reduce the influence of the channel and of individual speaker characteristics; maximum a posteriori (MAP) adaptation with read speech from the same region as the test utterances gives the model the pronunciation characteristics of the local dialect accent; and MAP adaptation with spontaneous speech data lets the model better describe the rather casual pronunciation phenomena of natural spoken language. Experimental results show that with these measures the recognition accuracy on the test utterances improves by 44.1% relative, which in turn improves the correlation coefficient between machine scores and expert scores by 6.3% relative.
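The SCMVN step listed above can be sketched as per-speaker mean-variance normalization of the cepstral frames. This is a minimal illustration; the actual system's feature pipeline and estimation details are not specified in the abstract.

```python
import numpy as np

def speaker_cmvn(features_by_speaker):
    """Speaker-based cepstral mean-variance normalization (SCMVN):
    normalize each speaker's cepstral frames with that speaker's own
    per-dimension mean and standard deviation, suppressing channel
    offsets and speaker-specific characteristics."""
    out = {}
    for spk, feats in features_by_speaker.items():
        feats = np.asarray(feats, dtype=float)
        mean = feats.mean(axis=0)
        std = feats.std(axis=0)
        out[spk] = (feats - mean) / np.where(std > 0, std, 1.0)
    return out

# After normalization, each speaker's frames have zero mean and unit
# variance in every cepstral dimension.
norm = speaker_cmvn({"spk1": [[1.0, 2.0], [3.0, 6.0]]})
```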
18.
19.
The purpose of this paper is the application of Genetic Algorithms (GAs) at the supervised classification level, in order to recognize Standard Arabic (SA) fricative consonants in continuous, naturally spoken speech. We used GAs because of their advantages in solving complicated optimization problems where analytic methods fail. To that end, we analyzed a corpus containing several sentences composed of the thirteen types of fricative consonants in initial, medial, and final positions, recorded by several male Jordanian speakers. Nearly all the world's languages contain at least one fricative sound. The SA language occupies a rather exceptional position in that nearly half of its consonants are fricatives, and nearly half of the fricative inventory is situated far back in the uvular, pharyngeal, and glottal areas. We used Mel-frequency cepstral analysis to extract vocal-tract coefficients from the speech signal. Among a set of classifiers such as Bayesian, likelihood, and distance classifiers, we used the distance classifier, which is based on a classification measure criterion. We thus formulate supervised classification as a function optimization problem, using the Mahalanobis distance decision rule as the fitness function for the GA evaluation. We report promising results, with a classification accuracy of 82%.
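The Mahalanobis-distance decision rule used as the GA fitness can be sketched as follows. The class means, covariances, and two-dimensional features below are illustrative only; the actual system works on MFCC vectors per fricative class.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of a feature vector x (e.g. cepstral
    coefficients of a fricative frame) from a class described by its
    mean vector and covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def classify(x, class_params):
    """Assign x to the class with the smallest Mahalanobis distance;
    class_params maps a class label to its (mean, covariance) pair."""
    return min(class_params, key=lambda c: mahalanobis(x, *class_params[c]))

# Hypothetical two-class example with identity covariances.
params = {"s": (np.array([0.0, 0.0]), np.eye(2)),
          "sh": (np.array([5.0, 5.0]), np.eye(2))}
label = classify([0.5, 0.2], params)   # closest class: "s"
```

In the GA formulation, this distance criterion (minimized over candidate class parameters) serves as the fitness evaluated for each chromosome.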
20.
《IEEE transactions on audio, speech, and language processing》2009,17(8):1471-1482