Similar Literature
20 similar records found.
1.
2.
This paper presents the design and development of an Auto-Associative Neural Network (AANN) based unrestricted prosodic information synthesizer. An unrestricted Text-To-Speech (TTS) system is capable of synthesizing speech from different domains with improved quality. This paper deals with a corpus-driven text-to-speech system based on the concatenative synthesis approach. Concatenative speech synthesis involves the concatenation of basic units to synthesize intelligible, natural-sounding speech. A corpus-based (unit selection) method uses a large inventory from which units are selected and concatenated. Prosody prediction is done with a five-layer auto-associative neural network, which improves the quality of the synthesized speech. Syllables are used as the basic units of the synthesis database. The database consisting of the units along with their annotated information is called an annotated speech corpus. A clustering technique applied to the annotated speech corpus provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit. Discontinuities at the unit boundaries are reduced using a mel-LPC smoothing technique. Experiments have been carried out for the Dravidian language Tamil, and the results demonstrate the improved intelligibility and naturalness of the proposed method. The proposed system is applicable to other languages if the syllabification rules are changed.
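The unit selection described above, picking the candidate sequence with the lowest total join cost, can be sketched as a small dynamic-programming search. This is an illustrative sketch only: the candidate inventory, the `join_cost` function and the unit names below are hypothetical, not the paper's actual corpus or cost definition.

```python
# Hypothetical sketch of unit selection by minimum total join cost.

def select_units(candidates, join_cost):
    """Dynamic-programming search over candidate units per syllable,
    minimizing the sum of join costs at unit boundaries.
    candidates: list (one entry per syllable) of candidate-unit lists.
    join_cost:  function(prev_unit, next_unit) -> float."""
    n = len(candidates)
    # best[i][c] = (cumulative cost, back-pointer) for candidate c at position i
    best = [{c: (0.0, None) for c in candidates[0]}]
    for i in range(1, n):
        layer = {}
        for c in candidates[i]:
            cost, prev = min(
                ((best[i - 1][p][0] + join_cost(p, c), p) for p in candidates[i - 1]),
                key=lambda t: t[0])
            layer[c] = (cost, prev)
        best.append(layer)
    # Trace back the cheapest path
    end = min(best[-1], key=lambda c: best[-1][c][0])
    path = [end]
    for i in range(n - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

In a full system the per-transition cost would combine a target cost with the join cost; only the join-cost term from the abstract is shown here.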

3.
In this paper we describe the collection and organization of a speaker recognition database for the Indian scenario, named the IITG Multivariability Speaker Recognition Database. The database contains speech from 451 speakers speaking English and other Indian languages, in both conversational and read speech styles, recorded in parallel using various sensors under different environmental conditions. The database is organized into four phases on the basis of the different recording conditions employed. Results of initial studies on a speaker verification system, exploring the impact of mismatch between training and test conditions using the collected data, are also included. A copy of the database can be obtained by contacting the authors.

4.
A speech recognition system extracts the textual information present in speech. In the present work, a speaker-independent isolated word recognition system has been developed for Kannada, a South Indian language. For European languages such as English, a large amount of speech recognition research has been carried out; for Indian languages such as Kannada, significantly less work has been reported, and no standard speech corpora are readily available. In the present study, a speech database has been developed by recording utterances from a regional Kannada news corpus spoken by different speakers. The recognition system has been implemented using the Hidden Markov Model Toolkit (HTK). Two separate pronunciation dictionaries, phone-based and syllable-based, are built in order to design and evaluate phone-level and syllable-level sub-word acoustic models. Experiments have been carried out and results analyzed by varying the number of Gaussian mixtures in each state of the monophone Hidden Markov Model (HMM). Context-dependent triphone HMMs have also been built for the same Kannada speech corpus, and the recognition accuracies are compared. Mel-frequency cepstral coefficients, along with their first and second derivative coefficients, are used as feature vectors and are computed in acoustic front-end processing. Overall word recognition accuracies of 60.2 and 74.35 % are obtained for the monophone and triphone models, respectively. The study shows a good improvement in the accuracy of isolated-word Kannada speech recognition using triphone HMMs compared to monophone HMMs.
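The first- and second-derivative (delta and acceleration) coefficients mentioned in the acoustic front-end are conventionally computed with the standard regression formula; a minimal NumPy sketch, assuming edge padding at the utterance boundaries (a common choice, not necessarily the paper's):

```python
import numpy as np

def deltas(feats, N=2):
    """Delta coefficients via the standard regression formula:
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2).
    feats: (frames, coeffs) array; boundary frames are edge-padded.
    Applying the function twice gives the acceleration coefficients."""
    denom = 2 * sum(n * n for n in range(1, N + 1))
    T = len(feats)
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")
    d = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:T + N + n] - padded[N - n:T + N - n])
    return d / denom
```

The full 39-dimensional vector would then be the concatenation of the static MFCCs, `deltas(mfcc)` and `deltas(deltas(mfcc))`.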

5.
Speech processing is an important research area that includes speaker recognition, speech synthesis, speech coding, and speech noise reduction. Many languages have different speaking styles called accents or dialects. Identifying the accent before speech recognition can improve the performance of speech recognition systems; the more accents a language has, the more crucial accent recognition becomes. Telugu is an Indian language widely spoken in the southern part of India, with three main accents: coastal Andhra, Telangana, and Rayalaseema. In the present work, speech samples are collected from native speakers of the different Telugu accents for both training and testing. Mel-frequency cepstral coefficient (MFCC) features are extracted from both the training and test samples, and a Gaussian mixture model (GMM) is then used to classify the speech by accent. The overall accuracy of the proposed system in recognizing a speaker's region from accent is 91 %.
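The MFCC-plus-GMM pipeline can be sketched as one GMM per accent, with classification by the highest average log-likelihood over a test utterance's feature vectors. The accent labels, mixture count and data below are synthetic placeholders, and scikit-learn's `GaussianMixture` stands in for whatever GMM implementation the authors used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_accent_models(features_by_accent, n_components=4):
    """Fit one diagonal-covariance GMM on the MFCC vectors of each accent."""
    return {accent: GaussianMixture(n_components=n_components,
                                    covariance_type="diag",
                                    random_state=0).fit(X)
            for accent, X in features_by_accent.items()}

def classify(models, X):
    """Return the accent whose GMM assigns the test vectors the
    highest average log-likelihood."""
    return max(models, key=lambda accent: models[accent].score(X))
```

In practice `X` would be the MFCC matrix of an utterance; here any (frames, dims) array works.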

6.
The work presented in this paper is focused on the development of a simulated emotion database, particularly for excitation source analysis. Simultaneous electroglottogram (EGG) recordings for each emotion utterance help to accurately analyze variations in the source parameters across emotions. The paper describes the development of a comparatively large simulated emotion database for three emotions (anger, happiness and sadness), along with neutrally spoken utterances, in three languages (Tamil, Malayalam and Indian English). Emotion utterances in each language are recorded from 10 speakers, in multiple sessions for Tamil and Malayalam. Unlike existing simulated emotion databases, emotionally biased rather than emotionally neutral utterances are used for recording. Emotion recognition experiments show that emotions elicited from emotionally biased utterances exhibit more emotion discrimination than those from emotionally neutral utterances. Comparative experimental analysis also shows that the speech and EGG utterances of the proposed database preserve the general trend in excitation source characteristics (instantaneous F0 and strength of excitation) across emotions, as in the classical German emotion speech-EGG database (EmoDb). Finally, the emotion recognition rates obtained on the proposed speech-EGG database, using conventional mel-frequency cepstral coefficients and a Gaussian mixture model based recognition system, are comparable with those of the existing German (EmoDb) and IITKGP-SESC Telugu speech emotion databases.

7.
The porting of a speech recognition system to a new language is usually a time-consuming and expensive process since it requires collecting, transcribing, and processing a large amount of language-specific training sentences. This work presents techniques for improved cross-language transfer of speech recognition systems to new target languages. Such techniques are particularly useful for target languages where minimal amounts of training data are available. We describe a novel method to produce a language-independent system by combining acoustic models from a number of source languages. This intermediate language-independent acoustic model is used to bootstrap a target-language system by applying language adaptation. For our experiments, we use acoustic models of seven source languages to develop a target Greek acoustic model. We show that our technique significantly outperforms a system trained from scratch when less than 8 h of read speech is available.

8.
A novel approach to generating the Pitch Cycle Waveform (PCW) for Waveform Interpolation (WI) speech coding is presented in this paper. The approach has the advantages of low delay and low computational complexity. PCW-based synthesis using the proposed method is capable of producing highly intelligible speech at a low bit rate. For better efficiency and perceptual quality, voiced and unvoiced sounds are dealt with separately, and different coding schemes are applied to encode the PCWs in different frequency bands. A 2.0 kbps speech coder has been developed and evaluated by subjective listening tests. The output speech has fairly good quality, sufficient for communication purposes.

9.
In this paper, a sinusoidal model is proposed for the characterization and classification of different stress classes (emotions) in a speech signal. Frequency, amplitude and phase features of the sinusoidal model are analyzed and used as input features to a stressed speech recognition system. The performance of the sinusoidal model features is evaluated for recognition of different stress classes with a vector-quantization classifier and a hidden Markov model classifier. To assess the effectiveness of these features for recognizing emotions across languages, speech signals are recorded and tested in two languages, Telugu (an Indian language) and English. Average stressed speech index values are proposed for comparing differences between stress classes in a speech signal. Results show that sinusoidal model features successfully characterize different stress classes, and that they outperform linear prediction and cepstral features in recognizing the emotions in a speech signal.

10.
Automatic spoken language identification (LID) is the task of identifying a language from a short utterance of speech by an unknown speaker. This article describes a novel two-level identification system for Indian languages using acoustic features. In the first level, the system identifies the family of the spoken language; the second level then identifies the particular language within that family. The proposed system is modeled using Gaussian mixture models (GMM) and utilizes mel-frequency cepstral coefficients (MFCC) and shifted delta cepstra (SDC) as acoustic features. A new database has been created for nine Indian languages. It is shown that a GMM-based LID system using MFCC with delta and acceleration coefficients performs well, with 80.56% accuracy. The accuracy of the GMM-based LID system with SDC is also considerable.
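Shifted delta cepstra stack several delta vectors computed at frame offsets spaced P apart, widening the temporal context seen by the classifier. A minimal NumPy sketch, using common 7-1-3-7-style parameter values (d, P, k) that are illustrative rather than taken from the article:

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t, stack k delta vectors
    computed at offsets t, t+P, ..., t+(k-1)P, where each delta is
    c[t+iP+d] - c[t+iP-d]. Frames are edge-padded so the output has
    one (n_coeffs * k)-dimensional vector per input frame."""
    T, n = cep.shape
    pad = d + (k - 1) * P
    x = np.pad(cep, ((d, pad), (0, 0)), mode="edge")
    out = np.zeros((T, n * k))
    for t in range(T):
        # original frame t sits at padded index t + d
        blocks = [x[t + d + i * P + d] - x[t + d + i * P - d] for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out
```

With the usual 7 static cepstra and k = 7 blocks this yields the familiar 49-dimensional SDC vector per frame.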

11.
To suppress environmental noise in speech signals, a speech enhancement method based on sub-band spectral subtraction is proposed. The time-domain signal is first split into several frequency sub-bands by a filter bank; an improved spectral subtraction technique is then applied independently in each sub-band. Because the background noise in real environments is rarely distributed uniformly across frequency, estimating and subtracting the noise spectrum separately in each band is more targeted and more accurate. Experiments on real speech show that the proposed method suppresses noise while preserving the speech structure well, yielding enhanced speech with higher listening comfort and intelligibility.
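The sub-band idea, estimating and subtracting the noise spectrum separately in each frequency band, can be sketched for a single analysis frame as follows. The band count, spectral floor and noise-estimation scheme here are simplifying assumptions; the paper uses a filter bank and an improved subtraction rule rather than this plain magnitude subtraction.

```python
import numpy as np

def subband_spectral_subtract(frame, noise_frames, n_bands=4, beta=0.02):
    """Single-frame sketch of sub-band spectral subtraction: the FFT bins
    are split into equal bands, a per-band noise magnitude estimate
    (averaged over noise-only frames) is subtracted, and a small spectral
    floor (beta) prevents negative magnitudes. The noisy phase is kept."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    clean = np.empty_like(mag)
    edges = np.linspace(0, len(mag), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # A per-band over-subtraction factor could be inserted here;
        # plain subtraction with a floor is shown for clarity.
        clean[lo:hi] = np.maximum(mag[lo:hi] - noise_mag[lo:hi],
                                  beta * mag[lo:hi])
    return np.fft.irfft(clean * np.exp(1j * phase), n=len(frame))
```

A full enhancer would run this frame-by-frame with windowing and overlap-add.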

12.
Data transformation is the core process in migrating from a relational database to a NoSQL database such as a column-oriented database. However, there is no standard guideline for this transformation. A number of schema transformation techniques have been proposed that improve the data transformation process and yield better query processing times than the relational database. However, these approaches produce redundant tables in the resulting schema, which consume unnecessary storage and increase query processing time because of the redundant column families in the transformed column-oriented database. In this paper, an efficient data transformation technique from relational to column-oriented databases is proposed. The schema transformation is based on a combination of a denormalization approach, data access patterns and a multiple-nested schema. To validate the work, the proposed technique is implemented by transforming data from a MySQL database to an HBase database, and a benchmark transformation technique is run for comparison of query processing time and storage size. The experimental results show that the proposed technique yields significant improvements in query processing time and storage usage, owing to the reduced number of column families in the column-oriented database.
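The denormalization step at the heart of such a transformation, folding child-table rows into the parent row as a nested column family, can be sketched in a few lines. The table layout, key names and the `info`/`children` family names are made-up illustrations, not the paper's actual mapping to HBase.

```python
# Hypothetical sketch of relational-to-column-oriented denormalization.

def denormalize(parent_rows, child_rows, fk):
    """Produce one nested record per parent key: the parent's own columns
    in one column family ('info'), the matching child rows in another
    ('children'), keyed by a synthetic qualifier per child row."""
    out = {}
    for row in parent_rows:
        out[row["id"]] = {"info": {k: v for k, v in row.items() if k != "id"},
                          "children": {}}
    for i, row in enumerate(child_rows):
        rec = out.get(row[fk])  # join on the foreign key
        if rec is not None:
            rec["children"][f"c{i}"] = {k: v for k, v in row.items() if k != fk}
    return out
```

Each resulting record corresponds to one wide row in the column-oriented store, so the former join is answered by a single row read.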

13.
This paper proposes a method for Mandarin Chinese speech synthesis. The wavelet transform is used to detect the glottal closure instants (GCIs) of the speech signal; pitch-synchronous multi-pulse linear prediction analysis with sample selection is then performed using the GCIs, and the resulting parameters are stored in a speech database. By modifying the corresponding parameters, the duration, pitch and intensity of a syllable can be adjusted flexibly. The parameters obtained by this method are more accurate than those obtained by traditional methods; the synthesized speech is clear and natural, and the storage requirement of the speech database is greatly reduced, making the method well suited to small microcomputer systems.

14.
This paper proposes a method for the automatic detection of breathy voiced vowels in continuous Gujarati speech. As breathy voice is a phonetic feature predominantly present in Gujarati among Indian languages, it can be used for identifying the Gujarati language. The objective is to differentiate breathy voiced vowels from modal voiced vowels based on a loudness measure, which represents excitation source characteristics and is used to differentiate voice quality. In the proposed method, vowel regions in continuous speech are first determined using knowledge of vowel onset points and epochs; the hypothesized vowel segments are then classified using the loudness measure. Performance is evaluated on Gujarati speech utterances containing around 47 breathy and 192 modal vowels spoken by 5 male and 5 female speakers. Classification of vowels into breathy or modal voice is achieved with an accuracy of around 94 %.

15.
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are used as prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used for training the neural networks to predict duration and intonation. Features representing positional, contextual and phonological constraints are used for developing the prosody models. The use of prosody models is illustrated with speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set, weighted linear prediction cepstral coefficients (WLPCC).
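A feedforward duration model of the kind described, positional/contextual/phonological features in, predicted syllable duration out, can be sketched as a single-hidden-layer network. The layer sizes, tanh activation and feature layout are illustrative assumptions, and training is omitted.

```python
import numpy as np

class DurationModel:
    """Minimal feedforward network sketch for syllable duration prediction.
    Dimensions and initialization are placeholders, not the paper's setup."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, X):
        """Forward pass: tanh hidden layer, linear output (duration)."""
        h = np.tanh(X @ self.W1 + self.b1)
        return (h @ self.W2 + self.b2).ravel()
```

An intonation (F0) model would have the same shape with a different target; both would normally be trained by backpropagation on the labelled broadcast news data.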

16.
The success of Hidden Markov Models (HMMs) in speech recognition has motivated their adoption for handwriting recognition, especially online handwriting, which resembles the speech signal as a sequential process. Some languages, such as Arabic, Farsi and Urdu, include a large number of delayed strokes that are written above or below most letters, usually delayed in time. These delayed strokes pose a modeling challenge for the conventional left-to-right HMM commonly used in Automatic Speech Recognition (ASR) systems. In this paper, we introduce a new approach for handling delayed strokes in Arabic online handwriting recognition using HMMs. We also show that several modeling approaches used in most state-of-the-art ASR systems, such as context-based tri-grapheme models, speaker adaptive training and discriminative training, provide similar performance improvements for Handwriting Recognition (HWR) systems. Finally, we show that a multi-pass decoder that uses the computationally less expensive models in the early passes gives an Arabic large-vocabulary HWR system a practical decoding time. We evaluated the proposed Arabic HWR system on two databases with small and large lexicons. For the small-lexicon data set, our system achieved results competitive with the best reported state-of-the-art Arabic HWR systems. For the large lexicon, it achieved promising accuracy and decoding time for a vocabulary of 64k words, with the possibility of adapting the models to specific writers for even better results.

17.
In this paper, spectral and prosodic features extracted at different levels are explored for analyzing the language-specific information present in speech. Spectral features extracted from 20 ms frames (block processing), individual pitch cycles (pitch-synchronous analysis) and glottal closure regions are used for discriminating between languages. Prosodic features extracted at syllable, tri-syllable and multi-word (phrase) levels are proposed, in addition to the spectral features, for capturing language-specific information. Language-specific prosody is represented by intonation, rhythm and stress features at the syllable and tri-syllable (word) levels, whereas temporal variations in fundamental frequency (the F0 contour), syllable durations and temporal variations in intensity (the energy contour) represent prosody at the multi-word (phrase) level. The Indian language speech database (IITKGP-MLILSC) is used for analyzing the language-specific information in the proposed features, and Gaussian mixture models are used to capture it. The evaluation results indicate that language identification performance improves when the features are combined. The performance of the proposed features is also analyzed on the standard Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) database.

18.
This paper concerns the development of statistical language models of the Slovenian language for use in an automatic speech recognition system. The proposed techniques are language-independent and can be applied to other highly inflected Slavic languages. The large number of unique words in inflected languages is identified as the primary reason for performance degradation. The article discusses the concept of word formation in Slovenian, which is common to all Slavic languages, and outlines the main problems of word-based language models. A novel variation on the N-gram modelling theme is examined in which the modelling units are stems and endings instead of words. Only data-driven algorithms, which decompose words automatically, are employed. Using stems and endings for modelling Slovenian results in a significant reduction in the OOV rate. The final part of the article focuses on building a speech recogniser, employing two different decoding strategies: one-pass and two-pass search decoders. Language modelling experiments were performed using the VEČER newswire text corpus, and recognition experiments were conducted using the SNABI Slovenian speech database. The new language model reduced the OOV rate by 64%, and the recognition accuracy improved by 4.34%.
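The stems-and-endings idea, estimating the N-gram model over sub-word units instead of whole words, can be illustrated with a toy splitter. The suffix inventory below is a made-up stand-in for the paper's data-driven decomposition algorithm, and the `+` prefix marking endings is likewise only a convention chosen for this sketch.

```python
from collections import Counter

# Hypothetical suffix inventory; the paper derives its splits from data.
SUFFIXES = ["ega", "emu", "ih", "am", "a", "e", "i", "o", "u"]

def split_word(word):
    """Split a word into [stem, +ending] using the longest matching
    suffix, leaving at least a two-character stem; otherwise keep whole."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 1:
            return [word[: -len(suf)], "+" + suf]
    return [word]

def unit_bigrams(sentence):
    """Count bigrams over the sub-word unit stream, the statistic a
    stem/ending N-gram model would be estimated from."""
    units = ["<s>"] + [u for w in sentence.split() for u in split_word(w)] + ["</s>"]
    return Counter(zip(units, units[1:]))
```

Because many inflected forms share a stem, the unit vocabulary is far smaller than the word vocabulary, which is what drives the OOV-rate reduction reported above.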

19.
Automatic spoken Language IDentification (LID) is the task of identifying the language from a short duration of speech uttered by an unknown speaker. In this work, an attempt has been made to develop a two-level language identification system for Indian languages using acoustic features. In the first level, the system identifies the family of the spoken language; the result is fed to the second level, which identifies the particular language within that family. The performance of the system is analyzed for various acoustic features and classifiers, and suitable choices of acoustic feature and pattern classification model are suggested for effective identification of Indian languages. The system has been modeled using hidden Markov models (HMM), Gaussian mixture models (GMM) and artificial neural networks (ANN). We studied the discriminative power of the system for mel-frequency cepstral coefficients (MFCC), MFCC with delta and acceleration coefficients, and shifted delta cepstral (SDC) coefficients, and then studied LID performance as a function of the training and testing set sizes. To carry out the experiments, a new database has been created for 9 Indian languages. It is shown that a GMM-based LID system using MFCC with delta and acceleration coefficients performs well, with 80.56% accuracy. The performance of the GMM-based LID system with SDC is also considerable.

20.
Techniques of speech synthesis potentially suitable for machine voice output were demonstrated in research laboratories 20 years ago (see, for example, Holmes et al. 1964), but have so far been restricted in application by the difficulty of generating acceptable speech with a sufficiently flexible vocabulary. JSRU's current laboratory system produces highly intelligible speech from an unlimited English vocabulary. The technique of speech synthesis by rule enables synthetic speech to be generated from conventionally spelled English text, with provision for using modified spelling or phonetic symbols for the small proportion of words that would otherwise be pronounced incorrectly.

Recent advances in electronic technology have made it feasible to implement the most advanced systems for flexible speech synthesis in low-cost equipment. In addition to research towards improving the speech quality, JSRU is shortly expecting to demonstrate synthesis by rule in a self-contained voice output peripheral based on inexpensive microprocessor and signal processing integrated circuits. This paper considers some of the operational constraints which must be placed on the use of such a device if speech synthesis is to take its place as a general-purpose man-machine communication medium.
