Similar Documents
20 similar documents found.
1.
The paper proposes a diphone/sub-syllable method for Arabic Text-to-Speech (ATTS) systems. The proposed approach exploits the particular syllabic structure of Arabic words. For good quality, the boundaries of the speech segments are chosen to occur only in the sustained portion of vowels. The speech segments consist of consonant–half-vowel units, half-vowel–consonant units, half vowels, middle portions of vowels, and suffix consonants. The minimum set consists of about 310 segments for classical Arabic.
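As a rough illustration of concatenating such units with boundaries placed in the sustained portion of vowels, here is a minimal Python sketch; the unit names, the inventory, and the crossfade length are hypothetical stand-ins for the paper's 310-segment set, not its actual data.

```python
# Minimal sketch of sub-syllable unit concatenation (illustrative inventory).
import numpy as np

def crossfade_concat(units, fade=200):
    """Concatenate unit waveforms, crossfading at boundaries that are assumed
    to lie in the steady (sustained) portion of vowels."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for seg in units[1:]:
        seg = seg.astype(float)
        out[-fade:] = out[-fade:] * (1 - ramp) + seg[:fade] * ramp
        out = np.concatenate([out, seg[fade:]])
    return out

# Hypothetical units: consonant–half-vowel, vowel middle, half-vowel–consonant.
inventory = {name: np.random.randn(1600) for name in ["b+a_half", "a_mid", "a_half+b"]}
word = crossfade_concat([inventory["b+a_half"], inventory["a_mid"], inventory["a_half+b"]])
```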

2.
This paper presents a new hybrid method for continuous Arabic speech recognition based on triphone modelling. To do this, we apply Support Vector Machines (SVM) as estimators of posterior probabilities within the standard Hidden Markov Model (HMM) framework. We also describe a new approach that categorises Arabic vowels as long or short during the labelling phase of the speech signals. Using this new labelling method, we find that the SVM/HMM hybrid model is more efficient than standard HMMs and than the hybrid Multi-Layer Perceptron (MLP)/HMM system. The recognition rates obtained for the triphone-based Arabic speech recognition system are 64.68 % with HMMs, 72.39 % with MLP/HMM and 74.01 % with the SVM/HMM hybrid model. The word error rates (WER) obtained for continuous speech recognition with the three systems confirm the performance of SVM/HMM, which achieves the lowest average WER, 11.42 %, over the 4 tested speakers.
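A minimal sketch of the SVM-as-posterior-estimator idea, assuming scikit-learn and synthetic frame features; the state labels, feature dimensions, and the division by class priors (the usual hybrid-model conversion of posteriors into scaled likelihoods) are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 13))          # MFCC-like frame features (placeholder)
y_train = rng.integers(0, 3, size=300)        # triphone-state labels (hypothetical)

svm = SVC(probability=True).fit(X_train, y_train)

X_test = rng.normal(size=(50, 13))
posteriors = svm.predict_proba(X_test)        # P(state | frame) from the SVM
priors = np.bincount(y_train) / len(y_train)
scaled_likelihoods = posteriors / priors      # used by HMM decoding in place of GMM likelihoods
```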

3.
A text-to-speech (TTS) system, also known as a speech synthesizer, has become an important technology in recent years due to its expanding field of applications. Considerable work on speech synthesis has been done for English and French, whereas many other languages, including Arabic, have only recently been taken into consideration. Arabic speech synthesis has not made sufficient progress and is still at an early stage, with low speech quality. In fact, speech synthesis systems face several problems (e.g. speech quality, articulatory effects, etc.). Different methods have been proposed to address these issues, such as the use of large inventories of units of different sizes; this approach is mainly implemented within the concatenative framework to improve speech quality, and several works have proved its effectiveness. This paper presents an efficient Arabic TTS system based on the statistical parametric approach and non-uniform unit speech synthesis. Our system includes a diacritization engine: modern Arabic text is written without the vowels, also called diacritic marks, yet these marks are essential for determining the correct pronunciation of the text, which explains the incorporation of the diacritization engine into our system. In this work, we propose a simple approach based on deep neural networks, which are trained to directly predict the diacritic marks and to predict the spectral and prosodic parameters. Furthermore, we propose a new, simple stacked neural network approach to improve the accuracy of the acoustic models. Experimental results show that our diacritization system generates fully diacritized text with high precision and that our synthesis system produces high-quality speech.
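A minimal sketch of the stacked-network idea described above, assuming scikit-learn; the character-context features, the diacritic classes, and the layer sizes are placeholders rather than the paper's actual configuration.

```python
# Stacking sketch: a second network receives the first network's posteriors as extra inputs.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 40))            # context window of character features (hypothetical)
y = rng.integers(0, 8, size=500)          # diacritic-mark classes (hypothetical)

base = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300).fit(X, y)
X_stacked = np.hstack([X, base.predict_proba(X)])   # stack base posteriors onto the input
stacked = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X_stacked, y)
```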

4.
To overcome the communication barrier between people with speech impairments and healthy speakers, a neural-network-based method for converting sign language to emotional speech is proposed. First, a gesture corpus, a facial-expression corpus, and an emotional-speech corpus are built. Deep convolutional neural networks are then used for gesture recognition and facial-expression recognition, and, with Mandarin initials and finals as the synthesis units, a speaker-adaptive deep neural network acoustic model and a speaker-adaptive hybrid long short-term memory network acoustic model for emotional speech are trained. Finally, the context-dependent labels of the gesture semantics and the emotion labels corresponding to the facial expressions are fed into the emotional speech synthesis model to synthesize the corresponding emotional speech. Experimental results show that the gesture recognition rate and the facial-expression recognition rate reach 95.86% and 92.42%, respectively, and the synthesized emotional speech achieves an EMOS score of 4.15, indicating a high degree of emotional expressiveness; the method can therefore support normal communication between people with speech impairments and healthy speakers.
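A very small sketch of the gesture-recognition front end only, assuming TensorFlow/Keras is available; the image size, number of gesture classes, and network depth are illustrative, and the synthesis stage is represented only by the resulting semantic label.

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 64x64 grayscale gesture images and 10 gesture classes (hypothetical).
x = np.random.rand(200, 64, 64, 1).astype("float32")
y = np.random.randint(0, 10, size=200)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=1, verbose=0)

gesture_id = int(model.predict(x[:1], verbose=0).argmax())  # semantic label passed on to synthesis
```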

5.
Building a large vocabulary continuous speech recognition (LVCSR) system requires many hours of segmented and labelled speech data. The Arabic language, like many other low-resourced languages, lacks such data, but automatic segmentation has proved to be a good alternative for making these resources available. In this paper, we suggest a combination of hidden Markov models (HMMs) and support vector machines (SVMs) to segment and label the speech waveform into phoneme units: the HMMs generate the sequence of phonemes and their boundaries, and the SVM refines the boundaries and corrects the labels. The resulting segmented and labelled units may serve as a training set for speech recognition applications. The HMM/SVM segmentation algorithm is assessed using both the hit rate and the word error rate (WER); the resulting scores were compared with those obtained from manual segmentation and from the well-known embedded learning algorithm. The results show that the speech recognizer built upon the HMM/SVM segmentation outperforms the one built upon the embedded learning segmentation by about 0.05% WER, even in noisy backgrounds.
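A minimal sketch of the boundary-refinement step, assuming scikit-learn: frames around an HMM-proposed boundary are re-classified by an SVM and the boundary moves to the point where the predicted labels switch. The features, the two-phone toy labels, and the ±5-frame search window are my illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
frames = rng.normal(size=(200, 13))                    # acoustic frames (placeholder)
phone_labels = np.repeat([0, 1], 100)                  # two phones; true change at frame 100
svm = SVC().fit(frames, phone_labels)

hmm_boundary = 96                                      # boundary proposed by the HMM
window = list(range(hmm_boundary - 5, hmm_boundary + 6))
pred = svm.predict(frames[window])
switch = next((i for i, (a, b) in enumerate(zip(pred, pred[1:])) if a != b), None)
refined = hmm_boundary - 5 + switch + 1 if switch is not None else hmm_boundary
```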

6.
7.
To improve the effectiveness of speech-based verification, a method for generating randomized audio CAPTCHAs based on formant synthesis, duration modification, and prosody adjustment is proposed. The method uses phonemes as the synthesis units, sets random speaking-rate parameters during rule-based synthesis, and adjusts the concatenation rules between units to randomize the prosody, so that speaking rate and prosody become uncertain and unpredictable. This effectively lowers the recognition rate of automatic speech recognition (ASR) systems on the audio codes and strengthens the CAPTCHA's resistance to attacks. The synthesized audio CAPTCHAs are recognized by human listeners at a rate of about 90%, while the ASR recognition rate is 28.8%; the mean opinion score (MOS) is 4, indicating satisfactory intelligibility and clarity. The experimental results confirm the feasibility of the proposed method.
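A small sketch of the randomized-rate idea only: each synthesis unit is time-stretched by a random factor so that rate and rhythm become unpredictable. The segments and the stretch range are placeholders, and the naive interpolation below is not the formant-based method of the paper.

```python
import numpy as np

def stretch(seg, factor):
    """Naive time-stretch by linear interpolation (also shifts pitch; a real
    system would use a formant-synthesis or PSOLA-style method)."""
    n_out = int(len(seg) * factor)
    return np.interp(np.linspace(0, len(seg) - 1, n_out), np.arange(len(seg)), seg)

rng = np.random.default_rng(3)
segments = [rng.standard_normal(800) for _ in range(4)]          # placeholder phoneme units
captcha = np.concatenate([stretch(s, rng.uniform(0.8, 1.3)) for s in segments])
```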

8.
This paper investigates the unique pharyngeal and uvular consonants of Arabic from the point of view of automatic speech recognition (ASR). The recognition error rates for these phonemes are compared and analyzed in five experiments involving different combinations of native and non-native Arabic speakers, and the three most confusing consonants for every investigated consonant are discussed. All experiments use the Hidden Markov Model Toolkit (HTK) and the Linguistic Data Consortium (LDC) WestPoint Modern Standard Arabic (MSA) database. Results confirm that these distinct Arabic consonants are a major source of difficulty for Arabic ASR. While the recognition rate for certain of these unique consonants such as // can drop below 35% when uttered by non-native speakers, there is an advantage to including non-native speakers in ASR. Moreover, regional differences in the pronunciation of MSA by native Arabic speakers require the attention of Arabic ASR research.
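A small sketch of the kind of per-consonant confusion analysis described above: a confusion table is built from aligned reference/hypothesis phone labels and the most frequent confusions for a target phone are listed. The IPA-style label strings are illustrative, not the paper's transcriptions.

```python
from collections import Counter

ref = ["q", "ʕ", "k", "q", "ħ", "q", "ʕ"]     # reference phones (illustrative)
hyp = ["k", "ʔ", "k", "q", "h", "k", "ʕ"]     # recognizer output (illustrative)

confusions = Counter((r, h) for r, h in zip(ref, hyp) if r != h)
target = "q"
top3 = Counter(h for (r, h), n in confusions.items() for _ in range(n) if r == target).most_common(3)
```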

9.
In this paper, we deal with a pre-processing step based on speech envelope modulation for intelligibility enhancement in large, reverberant public enclosed spaces. The blurring effect due to reverberation degrades speech perception in such conditions; this phenomenon results from the masking of consonants by the reverberated tails of the preceding vowels, and it is particularly pronounced for elderly persons suffering from presbycusis. The proposed pre-processing is inspired by the steady-state suppression technique, which consists in detecting the steady-state portions of speech and multiplying their waveforms by an attenuation coefficient in order to reduce their masking effect. While steady-state suppression is performed in the frequency domain, the pre-processing described in this paper is performed in the time domain. Its key novelty is the detection of voiced speech segments using a priori knowledge about the distributions of the powers and durations of voiced and unvoiced phonemes. The performance of this pre-processing is evaluated with an objective criterion and with subjective listening tests involving normal-hearing listeners, using a set of nonsense Vowel–Consonant–Vowel syllables and railway station vocal announcements.
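A time-domain sketch of the attenuation step only: frames classified as steady (here, crudely, high short-term power) are scaled down to reduce overlap masking. The frame size, the median threshold, and the 0.5 gain are assumptions; the paper's detector uses the power and duration distributions of voiced and unvoiced phonemes rather than this simple threshold.

```python
import numpy as np

def suppress_steady(x, frame=160, gain=0.5):
    y = x.astype(float).copy()
    power = np.array([np.mean(y[i:i + frame] ** 2) for i in range(0, len(y), frame)])
    threshold = np.median(power)                       # crude steady/voiced detector
    for k, p in enumerate(power):
        if p > threshold:
            y[k * frame:(k + 1) * frame] *= gain       # attenuate the steady portion
    return y

speech = np.random.randn(16000)                        # placeholder waveform
processed = suppress_steady(speech)
```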

10.
In this work, the Quantum Clustering (QC) algorithm is applied to a labeled dataset of Arabic vowels. Its accuracy and processing time are then compared with those of non-hierarchical kernel approaches for unsupervised clustering, namely k-means, self-organizing maps and fuzzy c-means. The choice of speech data follows large-database statistics which reveal that vowels represent about 60–70% of Arabic speech, with the remaining percentage distributed among the other sounds. The analysis features in this work are the mel-frequency cepstral coefficients. The results show that all algorithms are competitive in terms of accuracy, while QC additionally guarantees the stability of the solution.
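A sketch of the comparison protocol only, assuming scikit-learn (k-means shown; quantum clustering, SOM, and fuzzy c-means would be plugged in the same way). The features and labels are synthetic stand-ins for the MFCC vowel data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
mfcc = np.vstack([rng.normal(loc=c, size=(100, 13)) for c in (0.0, 2.0, 4.0)])
vowel_labels = np.repeat([0, 1, 2], 100)               # ground-truth vowel classes (synthetic)

pred = KMeans(n_clusters=3, n_init=10).fit_predict(mfcc)
print("agreement with labels:", adjusted_rand_score(vowel_labels, pred))
```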

11.
The spectro-temporal representation of speech has become one of the leading signal representation approaches in speech recognition systems in recent years. This representation suffers from the high dimensionality of the feature space, which makes it unsuitable for practical speech recognition systems. In this paper, a new clustering-based method is proposed for secondary feature selection/extraction in the spectro-temporal domain. In the proposed representation, Gaussian mixture models (GMM) and weighted K-means (WKM) clustering are applied in the spectro-temporal domain to reduce the dimensionality of the feature space; the elements of the centroid vectors and covariance matrices of the clusters are taken as the attributes of the secondary feature vector of each frame. To evaluate the efficiency of the proposed approach, the new feature vectors were tested on the classification of phonemes within the main phoneme categories of the TIMIT database. Employing the proposed secondary feature vector yielded a significant improvement in the classification rate of different sets of phonemes compared with MFCC features. The average improvement in the classification rate of voiced plosives compared to MFCC features is 5.9% using WKM clustering and 6.4% using GMM clustering. The greatest improvement, about 7.4%, is obtained by using WKM clustering in the classification of front vowels compared to MFCC features.
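A minimal sketch of the GMM-based secondary features, assuming scikit-learn: a small GMM is fitted to a frame's spectro-temporal points and its means and covariances become the reduced feature vector. The dimensionality of the points and the number of mixture components are assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
spectro_temporal = rng.normal(size=(500, 2))     # (scale, rate) points for one frame (placeholder)

gmm = GaussianMixture(n_components=3).fit(spectro_temporal)
secondary = np.concatenate([gmm.means_.ravel(), gmm.covariances_.ravel()])  # secondary feature vector
```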

12.
In general, speech is made up of sequences of consonants (fricatives, nasals and stops), vowels and glides. The classification of stop consonants remains one of the most challenging problems in speech recognition. In this paper, we propose a new approach based on the normalized energy in frequency bands during the release and closure phases in order to characterize and classify the Arabic stop consonants (/b/, /d/, /t/, /k/ and /q/) and to recognize the CV syllable. Classification experiments were performed using decision algorithms on stop consonants C and CV syllables extracted from an Arabic corpus. The results yield an overall stop consonant classification rate of 90.27% and CV syllable recognition above 90% for all stops.
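A small sketch of the band-energy feature: the energy of a release (or closure) segment in a few frequency bands, normalized by the total energy. The band edges and the sampling rate are illustrative choices, not those of the paper.

```python
import numpy as np

def normalized_band_energies(segment, fs=16000, bands=((0, 1000), (1000, 3000), (3000, 8000))):
    spectrum = np.abs(np.fft.rfft(segment)) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    total = spectrum.sum()
    return [spectrum[(freqs >= lo) & (freqs < hi)].sum() / total for lo, hi in bands]

release = np.random.randn(400)                 # placeholder release-phase samples
features = normalized_band_energies(release)   # fed to the decision algorithm
```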

13.
14.
This paper describes the principles behind the development of the AZBAT phonetic alphabet, which was created by analogy with the DARPAbet phonetic alphabet of the English language and is oriented toward creating a speech corpus and a synthesis system for Chechen speech. The experience of developers of other phonetic alphabets and databases was used, and account was also taken of the features of pronunciation and orthography, the rules of compatibility, and the variability of phonemes described in the works of well-known Chechen philologists. A classification of vowel and consonant phonemes is given, according to which each phoneme has the attributes necessary to implement the program code. The designed system for the synthesis of Chechen speech is assigned a basic set of acoustic-phonetic elements consisting of diphones and allophones. This set will be followed in building an acoustic-phonetic database that forms the basis of a system for automatic synthesis of Chechen speech.
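A minimal sketch of how such a phoneme inventory might be encoded for the program code, with each entry carrying the attributes the abstract mentions. The attribute names and example entries are purely illustrative and are not the AZBAT alphabet itself.

```python
from dataclasses import dataclass, field

@dataclass
class Phoneme:
    symbol: str                                   # AZBAT-style ASCII symbol (hypothetical)
    kind: str                                     # "vowel" or "consonant"
    features: dict = field(default_factory=dict)  # e.g. length, place, compatibility rules

inventory = [
    Phoneme("a", "vowel", {"length": "short"}),
    Phoneme("aa", "vowel", {"length": "long"}),
    Phoneme("q", "consonant", {"place": "uvular"}),
]
allophones = {p.symbol: [] for p in inventory}    # to be filled from recorded diphone/allophone units
```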

15.
Tone information is very important for speech recognition in a tonal language such as Thai. In this article, we present a method for isolated Thai tone recognition. First, we define three sets of tone features to capture the characteristics of Thai tones and employ a feedforward neural network to classify tones based on these features. Next, we describe several experiments using the proposed features, designed to study the effect of initial consonants, vowels, and final consonants on tone recognition. We find that there are some correlations between tones and the other phonemes, and the recognition performance is satisfactory. A human perception test is then conducted for comparison; the recognition rate of humans is much lower than that of the machine. Finally, we explore various combination schemes to enhance the recognition rate, and further improvements are found in most experiments.
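A minimal sketch of the tone-classification idea, assuming scikit-learn: a fixed-length F0 contour plus a simple shape statistic feeds a feedforward classifier. The contours, tone labels, added feature, and network size are synthetic placeholders, not the paper's three feature sets.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
f0_contours = rng.normal(size=(400, 20))            # normalised F0 sampled at 20 points (placeholder)
slopes = f0_contours[:, -1:] - f0_contours[:, :1]   # crude contour-shape feature
X = np.hstack([f0_contours, slopes])
tones = rng.integers(0, 5, size=400)                # five Thai tone classes

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300).fit(X, tones)
```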

16.
17.
In recent studies, the chaotic behavior of a signal has been confirmed using scalogram analysis of the continuous wavelet transform. The chaotic component of a speech signal can be verified through scalogram analysis, since it investigates the periodicity content of a signal; periodicity analysis helps prove that a signal is not periodic, which is a necessary condition for chaotic activity. In this work, a scale index based on scalogram analysis is calculated for a set of recordings of Arabic vowels. The largest Lyapunov exponents (LLE) are also computed for these recordings, and the obtained measures are compared. The comparison demonstrates the efficacy of the scale index for confirming chaotic behavior even for highly periodic waveforms, as is the case for speech vowels. Additionally, it is noted that both the LLE and the scale index exhibit classification ability for Arabic vowels.
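A simplified sketch of a scalogram-based scale index (ratio of the scalogram minimum after its peak to the peak value; values near 0 suggest periodicity, values near 1 suggest aperiodic or chaotic behavior), assuming PyWavelets is available. The wavelet, the scale range, and the test signal are illustrative, and this formula is a simplification of the published scale-index definition.

```python
import numpy as np
import pywt

signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 1000))   # placeholder vowel-like frame
scales = np.arange(1, 64)
coeffs, _ = pywt.cwt(signal, scales, "morl")

scalogram = np.sqrt((np.abs(coeffs) ** 2).sum(axis=1))     # energy per scale
peak = scalogram.argmax()
scale_index = scalogram[peak:].min() / scalogram[peak]
```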

18.
The speech signal is modeled using the zero-crossing interval distribution of the signal in the time domain. The distributions of these parameters are studied over five vowels of Malayalam, one of the most popular Indian languages. We found that the distribution patterns are almost identical for repeated utterances of the same vowel and vary from vowel to vowel. These distribution patterns are used for recognizing the vowels with a multilayer feedforward artificial neural network. After analyzing the distribution patterns and the vowel recognition results, we conclude that the zero-crossing interval distribution parameters can be used effectively for speech phone classification and recognition. The robustness of this parameter to noise is also studied by adding additive white Gaussian noise at different signal-to-noise ratios. The computational complexity of the proposed technique is also lower than that of the conventional spectral techniques, including FFT and cepstral methods, used in the parameterization of speech signals.
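A minimal sketch of the zero-crossing interval distribution: the histogram of sample counts between successive zero crossings, which would then feed the neural-network classifier. The bin count and the test tone are illustrative.

```python
import numpy as np

x = np.sin(2 * np.pi * 220 * np.arange(8000) / 8000)       # placeholder vowel-like signal
crossings = np.where(np.diff(np.signbit(x)))[0]             # indices where the sign changes
intervals = np.diff(crossings)                              # samples between zero crossings
hist, _ = np.histogram(intervals, bins=20, density=True)    # distribution used as the feature
```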

19.
In this paper we investigate Artificial Neural Network (ANN) based Automatic Speech Recognition (ASR) using limited Arabic vocabulary corpora. These limited Arabic vocabulary subsets are digits and vowels carried by specific carrier words. In addition, Hidden Markov Model (HMM) based ASR systems are designed and compared to two ANN based systems, namely Multilayer Perceptron (MLP) and recurrent architectures, using the same corpora. All systems are isolated-word speech recognizers. The ANN based recognition system achieved 99.5% correct digit recognition, whereas the HMM based recognition system achieved 98.1%. With vowel carrier words, the MLP and recurrent ANN based recognition systems achieved 92.13% and 98.06% correct vowel recognition, respectively, while the HMM based recognition system achieved 91.6%.

20.
Automatic recognition of children's speech is a challenging topic in computer-based speech recognition systems, and the conventional feature extraction method, the Mel-frequency cepstral coefficient (MFCC), is not efficient for it. This paper proposes a novel fuzzy-based discriminative feature representation to address the recognition of Malay vowels uttered by children. Because of the age-dependent variation of acoustical speech parameters, the performance of automatic speech recognition (ASR) systems degrades on children's speech. To solve this problem, this study addresses the representation of relevant and discriminative features for children's speech recognition. The method consists of extracting MFCCs with a narrower filter bank followed by a fuzzy-based feature selection step that provides relevant, discriminative, and complementary features. For this purpose, conflicting objective functions measuring the goodness of the features have to be satisfied simultaneously; to this end, a fuzzy formulation of the problem and fuzzy aggregation of the objectives are used to handle the uncertainties involved. The proposed method can reduce the dimensionality without compromising the speech recognition rate. To assess its capability, the study analyzed six Malay vowels from recordings of 360 children aged 7 to 12. After feature extraction, two well-known classification methods, MLP and HMM, were employed for the speech recognition task, with optimal parameter adjustment performed for each classifier, and the experiments were conducted in a speaker-independent manner. The proposed method performed better than conventional MFCCs and a number of conventional feature selection methods on the children's speech recognition task; the fuzzy-based feature selection allowed flexible selection of the MFCCs with the best discriminative ability, enhancing the separation between the vowel classes.
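A minimal sketch of the fuzzy-aggregation idea only: per-feature relevance and (inverted) redundancy scores are mapped to [0, 1] memberships and combined with a min t-norm, and the highest-scoring MFCC coefficients are kept. The scores, the number of coefficients, and the choice of t-norm are illustrative assumptions, not the paper's exact objectives.

```python
import numpy as np

rng = np.random.default_rng(7)
relevance = rng.random(39)                  # e.g. class-separability per MFCC coefficient (synthetic)
redundancy = rng.random(39)                 # e.g. correlation with already-chosen features (synthetic)

mu_rel = (relevance - relevance.min()) / np.ptp(relevance)
mu_nonred = 1.0 - (redundancy - redundancy.min()) / np.ptp(redundancy)
fuzzy_score = np.minimum(mu_rel, mu_nonred)        # conservative (min t-norm) aggregation
selected = np.argsort(fuzzy_score)[::-1][:20]      # keep the 20 best coefficients
```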
