Similar Documents
20 similar documents retrieved.
1.
This paper presents the design and development of an unrestricted text-to-speech (TTS) synthesis system for the Bengali language. An unrestricted TTS system is capable of synthesizing good-quality speech in different domains. In this work, syllables are used as the basic units for synthesis. The Festival framework has been used for building the TTS system. Speech collected from a female artist is used as the speech corpus. Initially, speech from five speakers is collected and a prototype TTS is built for each of them. The best speaker among the five is selected through subjective and objective evaluation of natural and synthesized waveforms. The unrestricted TTS is then developed by addressing the issues involved at each stage to produce a good-quality synthesizer. Evaluation is carried out in four stages by conducting objective and subjective listening tests on the synthesized speech. At the first stage, the TTS system is built with the basic Festival framework. In the following stages, additional features are incorporated into the system and the quality of synthesis is evaluated. The subjective and objective measures indicate that the proposed features and methods have improved the quality of the synthesized speech from stage 2 to stage 4.
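As an illustration of the kind of objective evaluation mentioned above, the following sketch computes mel-cepstral distortion (MCD) between a natural and a synthesized recording of the same utterance. It is a minimal example assuming the librosa library is available; the file names are placeholders and this is not the authors' own evaluation code.

```python
# Minimal sketch of an objective measure (mel-cepstral distortion) between a
# natural and a synthesized utterance. File names are placeholders; the paper's
# own evaluation procedure may differ.
import librosa
import numpy as np

def mel_cepstral_distortion(natural_wav, synthesized_wav, n_mfcc=13):
    # Load both waveforms at a common sampling rate.
    nat, sr = librosa.load(natural_wav, sr=16000)
    syn, _ = librosa.load(synthesized_wav, sr=16000)

    # Mel-cepstral features (drop c0, which mostly reflects energy).
    nat_mfcc = librosa.feature.mfcc(y=nat, sr=sr, n_mfcc=n_mfcc)[1:]
    syn_mfcc = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]

    # Align the two sequences with dynamic time warping.
    _, path = librosa.sequence.dtw(X=nat_mfcc, Y=syn_mfcc, metric="euclidean")

    # Average frame-wise Euclidean distance along the alignment path,
    # scaled by the conventional MCD constant.
    dists = [np.linalg.norm(nat_mfcc[:, i] - syn_mfcc[:, j]) for i, j in path]
    return (10.0 / np.log(10)) * np.sqrt(2.0) * float(np.mean(dists))

# Example call with placeholder file names:
# print(mel_cepstral_distortion("natural.wav", "synth.wav"))
```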

2.
3.
4.
The paper presents theoretical support for and describes the use of a fuzzy paradigm in implementing a TTS system for the Romanian language, employing a rule-based formant synthesizer. In the framework of classic TTS systems, we propose a new approach to improve formant trace computation, aiming to increase the perceptual quality of the synthetic speech. A fuzzy system is proposed for solving the problem of phonemes that are prone to multiple definitions in rule-based speech synthesis. In the introductory section, we briefly present the background of the problem and our previous results in speech synthesis. In the second section, we deal with the problem of context-dependent phonemes at the letter-to-sound module level of our TTS system. Then, we discuss the case of the phoneme /l/ and the solution adopted to define it for different contexts. A fuzzy system is associated with each parameter (denoted F1 and F2) to implement the results of the complete analysis of the behavior of the phoneme /l/. The knowledge used in implementing the fuzzy module is acquired by natural speech analysis. In the third section, we exemplify the computation of the synthesis parameters F1 and F2 of the phoneme /l/ in the context of two syllable sequences. The parameter values are contrasted with those obtained from spectrogram analysis of the natural speech sequences. The last section presents the main conclusions and further research objectives.
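The following is a minimal sketch of how a fuzzy rule base could map a context feature to an F1 target for /l/, in the spirit of the approach above. The membership functions, the context feature (backness of the neighbouring vowel) and the F1 anchor values are invented for illustration and are not the paper's actual rules.

```python
# Illustrative fuzzy estimate of an F1 target for /l/ from the backness of the
# neighbouring vowel. Membership functions and F1 anchor values are invented
# for this sketch, not taken from the paper.

def tri(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def f1_for_l(vowel_backness):
    # Fuzzy sets over a normalized backness scale in [0, 1].
    weights = {
        "front":   tri(vowel_backness, -0.5, 0.0, 0.5),
        "central": tri(vowel_backness,  0.0, 0.5, 1.0),
        "back":    tri(vowel_backness,  0.5, 1.0, 1.5),
    }
    # Rule consequents: an F1 anchor (Hz) for each context class.
    anchors = {"front": 330.0, "central": 380.0, "back": 450.0}

    # Weighted-average (centroid-style) defuzzification.
    num = sum(weights[k] * anchors[k] for k in anchors)
    den = sum(weights.values())
    return num / den if den > 0 else anchors["central"]

for b in (0.1, 0.5, 0.9):
    print(f"backness={b:.1f} -> F1 ~ {f1_for_l(b):.0f} Hz")
```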

5.
6.
Speech synthesis by rule has made considerable advances and is used today in numerous text-to-speech synthesis systems. Current systems are able to synthesise pleasant-sounding voices at high intelligibility levels. However, because their synthetic speech quality is still inferior to that of fluently produced human speech, rule-based synthesis has not found wide acceptance; instead it has been restricted mainly to applications for the handicapped or to restricted tasks in telecommunications. The problems with automatic speech synthesis are related to the methods of controlling speech synthesizer models so as to mimic the varying properties of the human speech production system during discourse. In this paper, artificial neural networks are developed for the control of a formant synthesizer. A set of common words consisting of larynx-produced phonemes was analysed and used to train a neural network cluster. The system was able to produce intelligible speech for certain phonemic combinations of new and unfamiliar words.
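A compressed sketch of the idea, a small neural network mapping a phoneme-context feature vector to formant control parameters, is given below. It uses scikit-learn's MLPRegressor on random stand-in data; the network size, features and targets are assumptions for illustration, not the paper's trained networks.

```python
# Sketch of a neural network learning to emit formant-synthesizer controls
# (e.g. F1-F3 targets) from phoneme-context features. The data here are random
# stand-ins; in the paper the training data come from analysed natural words.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Stand-in training set: each row is a context feature vector for one frame
# (current phoneme identity, neighbours, position in word, etc.), and each
# target row holds the formant frequencies to drive the synthesizer with.
X_train = rng.normal(size=(500, 12))
y_train = rng.normal(loc=[500.0, 1500.0, 2500.0], scale=100.0, size=(500, 3))

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

# At synthesis time, each frame's feature vector is mapped to formant targets
# that the formant synthesizer then renders as speech.
frame_features = rng.normal(size=(1, 12))
print("predicted (F1, F2, F3):", net.predict(frame_features)[0])
```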

7.
This paper discusses the importance and necessity of adding annotation markup to the input documents of a speech synthesis system, and explains the importance of establishing a unified text annotation scheme in order to achieve compatibility between synthesizers and to facilitate their integration with other systems.
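As a concrete illustration of annotation markup in TTS input, the snippet below builds a small annotated document in the general style of W3C SSML, one widely used unified scheme. The tag choices are only an example of the idea; they are not the specific annotation scheme proposed in the paper.

```python
# Example of annotating TTS input text with a unified markup scheme.
# The tags follow the general style of W3C SSML; they illustrate the idea of
# annotation markup rather than the specific scheme proposed in the paper.
from xml.sax.saxutils import escape

def annotate(text, rate="medium", pause_ms=300, emphasis=None):
    body = escape(text)
    if emphasis:
        body = f'<emphasis level="{emphasis}">{body}</emphasis>'
    return (f'<prosody rate="{rate}">{body}</prosody>'
            f'<break time="{pause_ms}ms"/>')

document = ("<speak>"
            + annotate("Welcome to the system.", rate="slow")
            + annotate("Please listen carefully.", emphasis="strong")
            + "</speak>")
print(document)
```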

8.
This article describes an unrestricted-vocabulary text-to-speech (TTS) conversion system for the synthesis of Standard Arabic (SA) speech. The system uses short phonetic clusters derived from Arabic syllables to synthesize Arabic. Basic and phonetic variants of the synthesis units are defined after qualitative and quantitative analyses of the phonetics of SA. A speech database of the synthesis units and their phonetic variations is created, and the units are tested to control their segmental quality. Besides the types of synthesis unit used, their enhancement with phonetic variants, and their segmental quality control, the production of good-quality speech also depends on waveform analysis and the method used to concatenate the synthesis units. Waveform analysis is needed to condition the selected synthesis units at their junctures so as to produce synthesized speech of better quality. The types of speech juncture between contiguous units, the phonetic characteristics of the sounds surrounding the junctures, and the concatenation artifacts occurring across the junctures are important and will be discussed. The results of waveform analysis and smoothing algorithms are presented. The intelligibility of the synthesized Arabic, measured by a standard intelligibility test method adapted to Arabic phonetic characteristics, and the scoring of the test results are also dealt with.
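To make the juncture smoothing concrete, here is a minimal sketch of concatenating two synthesis units with a short linear cross-fade at their boundary. The fade length and the use of a plain cross-fade are illustrative assumptions, not the exact smoothing algorithm described in the paper.

```python
# Minimal sketch of concatenating two synthesis units with a linear cross-fade
# at the juncture to reduce concatenation artifacts. The 10 ms fade length is
# an illustrative choice, not the paper's algorithm.
import numpy as np

def crossfade_concat(unit_a, unit_b, sr=16000, fade_ms=10):
    n = int(sr * fade_ms / 1000)             # number of overlapped samples
    n = min(n, len(unit_a), len(unit_b))
    fade_out = np.linspace(1.0, 0.0, n)      # applied to the tail of unit_a
    fade_in = 1.0 - fade_out                 # applied to the head of unit_b
    overlap = unit_a[-n:] * fade_out + unit_b[:n] * fade_in
    return np.concatenate([unit_a[:-n], overlap, unit_b[n:]])

# Two toy "units": short sine segments standing in for recorded clusters.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
unit_a = 0.5 * np.sin(2 * np.pi * 180 * t)
unit_b = 0.5 * np.sin(2 * np.pi * 220 * t)
joined = crossfade_concat(unit_a, unit_b, sr=sr)
print(joined.shape)
```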

9.
This paper presents the design and development of an Auto-Associative Neural Network (AANN) based unrestricted prosodic information synthesizer. An unrestricted text-to-speech (TTS) system is capable of synthesizing speech in different domains with improved quality. This paper deals with a corpus-driven text-to-speech system based on the concatenative synthesis approach. Concatenative speech synthesis involves the concatenation of basic units to synthesize intelligible, natural-sounding speech. A corpus-based method (unit selection) uses a large inventory from which to select the units to be concatenated. Prosody prediction is done with the help of a five-layer auto-associative neural network, which helps improve the quality of the synthesized speech. Syllables are used as the basic units of the speech synthesis database. The database consisting of the units along with their annotated information is called the annotated speech corpus. A clustering technique applied to the annotated speech corpus provides a way to select the appropriate unit for concatenation, based on the lowest total join cost of the speech unit. Discontinuities present at the unit boundaries are reduced using a mel-LPC smoothing technique. Experiments have been carried out for the Dravidian language Tamil, and the results demonstrate the improved intelligibility and naturalness of the proposed method. The proposed system is applicable to other languages if the syllabification rules are changed.
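The join-cost-based selection described above can be sketched as follows: each syllable position has several candidate units, and the sequence with the lowest total join cost, here a simple distance between boundary feature vectors, is found by dynamic programming. The cost definition and the random stand-in features are simplifying assumptions, not the paper's exact costs or clustering.

```python
# Sketch of unit selection by lowest total join cost. Each syllable slot has
# several candidate units; each candidate exposes boundary feature vectors,
# and the join cost between consecutive units is the distance between the
# right edge of one and the left edge of the next. Features here are random
# stand-ins; the paper's actual costs and clustering are richer.
import numpy as np

rng = np.random.default_rng(1)

def select_units(candidates):
    """candidates[i] is a list of (left_edge, right_edge) feature pairs."""
    n = len(candidates)
    best_cost = [np.zeros(len(candidates[0]))]   # no join cost before slot 0
    back = [None]
    for i in range(1, n):
        costs = np.empty(len(candidates[i]))
        ptrs = np.empty(len(candidates[i]), dtype=int)
        for j, (left_j, _) in enumerate(candidates[i]):
            join = [best_cost[i - 1][k] + np.linalg.norm(right_k - left_j)
                    for k, (_, right_k) in enumerate(candidates[i - 1])]
            ptrs[j] = int(np.argmin(join))
            costs[j] = join[ptrs[j]]
        best_cost.append(costs)
        back.append(ptrs)
    # Trace back the cheapest path of candidate indices.
    path = [int(np.argmin(best_cost[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return list(reversed(path))

# Three syllable slots, four candidate units each, 8-dim boundary features.
candidates = [[(rng.normal(size=8), rng.normal(size=8)) for _ in range(4)]
              for _ in range(3)]
print("selected candidate indices:", select_units(candidates))
```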

10.
Using TTS technology to implement speech synthesis from text files   (cited 8 times: 0 self-citations, 8 by others)
Based on TTS, a representative speech synthesis technology, this paper uses the Microsoft Speech SDK development kit, a TTS engine, and the MFC (Microsoft Foundation Classes) library to develop a text-to-speech application in the Visual C++ integrated environment, implementing automatic conversion from text files to speech.
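The original program is a VC++/MFC application built on the Microsoft Speech SDK; a rough Python analogue that reads a text file and speaks it through the same SAPI interface (via the pywin32 package, Windows only) might look like the sketch below. The file name is a placeholder.

```python
# Rough Python analogue of the text-file-to-speech idea: read a text file and
# hand it to the Windows SAPI TTS engine. Requires Windows and the pywin32
# package; the original work used VC++/MFC with the Microsoft Speech SDK.
import win32com.client

def speak_text_file(path, encoding="utf-8"):
    with open(path, "r", encoding=encoding) as f:
        text = f.read()
    voice = win32com.client.Dispatch("SAPI.SpVoice")  # default TTS engine
    voice.Speak(text)                                 # synchronous playback

if __name__ == "__main__":
    speak_text_file("document.txt")  # placeholder file name
```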

11.
12.
Emotive audio-visual avatars are virtual computer agents which have the potential of significantly improving the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the various technical approaches of a novel multimodal framework leading to a text-driven emotive audio-visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Under the guidance of this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.

13.
In concatenative Text-to-Speech, the size of the speech corpus is closely related to synthetic speech quality. In this paper, we describe our work on a new corpus-based Bell Labs' TTS system. This encompasses large acoustic inventories with a rich set of annotations, models and data structures for representing and managing such inventories, and an optimal unit selection algorithm that accommodates a broad range of possible cost criteria. We also propose a new method for setting weights in the cost functions based on a perceptual preference test. Our results show that this approach can successfully predict human preference patterns. Synthetic speech using weights determined in this manner consistently demonstrates smoother transitions and higher voice quality than speech using manually set weights.
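One way to read the weight-setting idea is as fitting cost weights to listener preferences: given A/B pairs of synthesized variants, the per-criterion cost differences, and which variant listeners preferred, weights can be fit so that the weighted cost predicts the preference. The logistic-regression formulation below is an illustrative reconstruction on random stand-in data, not the paper's actual method.

```python
# Illustrative reconstruction of setting cost-function weights from perceptual
# preference tests: for each A/B pair we have the per-criterion cost
# differences (A minus B) and which variant listeners preferred. Fitting a
# logistic model to these differences yields weights under which the lower
# weighted cost tends to be the preferred variant. Data are random stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

n_pairs, n_criteria = 200, 5
cost_diff = rng.normal(size=(n_pairs, n_criteria))   # cost(A) - cost(B)
true_w = np.array([1.0, 0.5, 2.0, 0.2, 1.5])         # hidden "listener" weights
# Listeners prefer A (label 1) when A's weighted cost is lower than B's.
prefers_a = (cost_diff @ true_w + rng.normal(scale=0.5, size=n_pairs) < 0)

model = LogisticRegression(fit_intercept=False)
model.fit(cost_diff, prefers_a.astype(int))

# The negated, scale-free coefficients play the role of cost weights.
weights = -model.coef_[0]
print("recovered relative weights:", np.round(weights / weights.max(), 2))
```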

14.
Wu Xing, Ji Sihui, Wang Jianjia, Guo Yike. Applied Intelligence, 2022, 52(13): 14839-14852

Human beings are capable of imagining a person's voice from his or her appearance because different people have different voice characteristics. Although researchers have made great progress in single-view speech synthesis, there are few studies on multi-view speech synthesis, especially speech synthesis from face images. On the basis of the implicit relationship between a speaker's face image and his or her voice, we propose a multi-view speech synthesis method called SSFE (Speech Synthesis with Face Embeddings). The proposed SSFE consists of three parts: a voice encoder, a face encoder and an improved multi-speaker text-to-speech (TTS) engine. On the one hand, the voice encoder generates voice embeddings from the speaker's speech and the face encoder extracts voice features from the speaker's face as f-voice embeddings. On the other hand, the multi-speaker TTS engine synthesizes speech from the voice embeddings and f-voice embeddings. We have conducted extensive experiments to evaluate the proposed SSFE on synthesized speech quality and face-voice matching degree, in which the Mean Opinion Score of SSFE is more than 3.7 and the matching degree is about 1.7. The experimental results show that the proposed SSFE method outperforms state-of-the-art methods in terms of speech quality and face-voice matching degree.
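A minimal sketch of the conditioning step, injecting a speaker embedding (whether a voice embedding or an f-voice embedding derived from a face image) into a multi-speaker TTS text encoder output, is given below in PyTorch. The layer sizes and the simple concatenation scheme are assumptions for illustration; they are not the SSFE architecture.

```python
# Sketch of conditioning a multi-speaker TTS text encoder on a speaker
# embedding, which could come from a voice encoder or, as in SSFE, from a
# face encoder ("f-voice" embedding). Layer sizes and the concatenation
# scheme are illustrative assumptions, not the SSFE architecture.
import torch
import torch.nn as nn

class ConditionedTextEncoder(nn.Module):
    def __init__(self, vocab_size=100, text_dim=128, spk_dim=64, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        self.rnn = nn.GRU(text_dim, text_dim, batch_first=True)
        # Fuse each text frame with the (broadcast) speaker embedding.
        self.fuse = nn.Linear(text_dim + spk_dim, out_dim)

    def forward(self, token_ids, speaker_embedding):
        h, _ = self.rnn(self.embed(token_ids))             # (B, T, text_dim)
        spk = speaker_embedding.unsqueeze(1).expand(-1, h.size(1), -1)
        return torch.tanh(self.fuse(torch.cat([h, spk], dim=-1)))

# Dummy batch: 2 utterances of 10 tokens, plus 64-dim speaker embeddings that
# could equally be produced by a voice encoder or a face encoder.
tokens = torch.randint(0, 100, (2, 10))
speaker_embedding = torch.randn(2, 64)
frames = ConditionedTextEncoder()(tokens, speaker_embedding)
print(frames.shape)  # torch.Size([2, 10, 128])
```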


15.
In this paper, speech coding techniques are integrated into a Mandarin text-to-speech system. By exploiting the intrinsic properties of Mandarin, we encode the acoustic features of 408 syllabic utterances into templates, each containing modeling parameters for speech synthesis. As a result, the developed TTS system requires merely 36 Kbytes to store all syllabic templates. In the synthesis stage, modeling parameters retrieved from the templates are modified according to the prosody estimated from a hierarchically layered model. To give a general view of the performance of this TTS system, we conducted listening tests, obtaining 86.4% intelligibility and 97% comprehensibility. A simplified Mandarin TTS system is also implemented on an FPGA development board. The realization on an FPGA leads us to believe that such a TTS synthesizer can easily be incorporated into other portable devices as a voice interface.
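The storage figure works out to roughly 36 KB / 408, about 90 bytes of parameters per syllable template. A toy sketch of the synthesis-time step, retrieving a template's modeling parameters and scaling them with estimated prosody, is shown below; the parameter layout and scaling rules are invented for illustration, not the paper's parameterization.

```python
# Toy sketch of template-based Mandarin synthesis: look up a syllable template,
# then modify its stored parameters with pitch and duration estimated by a
# prosody model. The template layout and scaling rules here are invented; they
# only illustrate the retrieve-and-modify step, not the paper's parameters.
# Note: 36 KB for 408 templates is roughly 90 bytes of parameters per syllable.
templates = {
    # syllable: (base_f0_hz, base_duration_ms, spectral_params)
    "ma1": (220.0, 180.0, [0.12, -0.03, 0.07]),
    "hao3": (200.0, 210.0, [0.09, 0.01, -0.05]),
}

def render_syllable(syllable, f0_scale=1.0, dur_scale=1.0):
    base_f0, base_dur, spectral = templates[syllable]
    return {
        "f0_hz": base_f0 * f0_scale,          # prosody model raises/lowers pitch
        "duration_ms": base_dur * dur_scale,  # and stretches/compresses timing
        "spectral_params": spectral,          # passed unchanged to the synthesizer
    }

# Prosody scale factors would come from the hierarchical prosody model.
print(render_syllable("ma1", f0_scale=1.1, dur_scale=0.9))
print(render_syllable("hao3", f0_scale=0.95, dur_scale=1.2))
```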

16.
17.
18.
To illustrate the potential of multilayer neural networks for adaptive interfaces, a VPL Data-Glove connected to a DECtalk speech synthesizer via five neural networks was used to implement a hand-gesture-to-speech system. Using minor variations of the standard backpropagation learning procedure, the complex mapping of hand movements to speech is learned using data obtained from a single 'speaker' in a simple training phase. With a 203-gesture-to-word vocabulary, the wrong word is produced less than 1% of the time, and no word is produced about 5% of the time. Adaptive control of the speaking rate and word stress is also available. The training times and final performance speed are improved by using small, separate networks for each naturally defined subtask. The system demonstrates that neural networks can be used to develop the complex mappings required in a high-bandwidth interface that adapts to the individual user.
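A stripped-down sketch of the core mapping, classifying a glove sensor vector into one of a vocabulary of words with a confidence threshold below which no word is produced, is given below. The sensor dimensionality, training data and threshold are stand-ins, and the original system split the task across five separate backpropagation networks rather than one classifier.

```python
# Stripped-down sketch of gesture-to-word mapping: a backprop-trained network
# maps a glove sensor vector to a word, and low-confidence outputs produce no
# word at all. Sensor dimensionality, training data and the 0.6 threshold are
# stand-ins; the original system used five smaller, specialised networks.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
n_words, n_sensors = 203, 16

# Stand-in training data: a few noisy samples of each gesture prototype.
prototypes = rng.normal(size=(n_words, n_sensors))
X = np.repeat(prototypes, 5, axis=0) + 0.05 * rng.normal(size=(n_words * 5, n_sensors))
y = np.repeat(np.arange(n_words), 5)

net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
net.fit(X, y)

def gesture_to_word(sensor_vector, threshold=0.6):
    probs = net.predict_proba(sensor_vector.reshape(1, -1))[0]
    best = int(np.argmax(probs))
    # Below the confidence threshold, say nothing rather than the wrong word.
    return best if probs[best] >= threshold else None

print(gesture_to_word(prototypes[42] + 0.05 * rng.normal(size=n_sensors)))
```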

19.
Reproducing smooth vocal tract trajectories is critical for high-quality articulatory speech synthesis. This paper presents an adaptive neural control scheme for this task using fuzzy logic and neural networks. The control scheme estimates motor commands from trajectories of flesh-points on selected articulators. These motor commands are then used to reproduce the trajectories of the underlying articulators in a 2nd-order dynamical system. Initial experiments show that the control scheme is able to manipulate the mass-spring based elastic tract walls in a 2-dimensional articulatory synthesizer and to realize efficient speech motor control. The proposed controller achieves high accuracy during on-line tracking of the lips, the tongue, and the jaw in the simulation of consonant–vowel sequences. It also offers salient features such as generality and adaptability for future development of control models in articulatory synthesis.
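To show the 2nd-order dynamics concretely, the sketch below simulates a single articulator flesh-point as a mass-spring-damper whose equilibrium is moved toward a target trajectory by a motor command. The mass, stiffness and damping values are illustrative, not those of the articulatory synthesizer in the paper.

```python
# Sketch of a 2nd-order (mass-spring-damper) articulator model tracking a
# target flesh-point trajectory: the "motor command" shifts the spring's
# equilibrium toward the target at every step. Mass, stiffness and damping are
# illustrative values, not those of the articulatory synthesizer in the paper.
import numpy as np

m, k, c = 1.0, 400.0, 30.0          # mass, stiffness, damping
dt, T = 0.001, 0.5                  # time step (s), duration (s)
steps = int(T / dt)
t = np.arange(steps) * dt

# Target trajectory of one flesh-point coordinate (e.g. a tongue point, in cm).
target = 0.5 + 0.3 * np.sin(2 * np.pi * 3 * t)

x, v = target[0], 0.0               # position and velocity
trajectory = np.empty(steps)
for i in range(steps):
    command = target[i]                    # motor command = desired equilibrium
    a = (k * (command - x) - c * v) / m    # spring toward command, with damping
    v += a * dt                            # simple Euler integration
    x += v * dt
    trajectory[i] = x

print("RMS tracking error (cm):", np.sqrt(np.mean((trajectory - target) ** 2)))
```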

20.
Photo-realistic talking-heads from image samples   (cited 1 time: 0 self-citations, 1 by others)
