1.
2.
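At present, speech corpus resources for Mongolian speech recognition are relatively scarce, yet there is a large amount of Mongolian audio, such as TV dramas and radio broadcasts, with corresponding text. This paper proposes a speech-recognition-based method for automatically aligning long Mongolian audio with text, which automatically annotates the speech of Mongolian TV dramas and expands the Mongolian speech corpus. In the front-end processing stage, voice activity detection based on a Gaussian mixture model is used to identify and remove noise segments; in the speech recognition stage, a Mongolian acoustic model based on a feedforward sequential memory network is built; finally, based on a vector space model, the hypothesis sequences produced by speech recognition are matched against the reference phoneme sequences at the sentence level using the dynamic time warping algorithm. Experimental results show that, compared with speech alignment based on the Needleman-Wunsch algorithm, the proposed method improves alignment accuracy by 31.09%.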
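The sentence-level matching step relies on dynamic time warping (DTW). The following is a minimal sketch of the textbook DTW recurrence only, not the paper's vector-space-model variant; the function names `dtw_distance` and `phoneme_cost` and the 0/1 mismatch cost are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def dtw_distance(a: Sequence[T], b: Sequence[T],
                 cost: Callable[[T, T], float]) -> float:
    """Return the minimum cumulative cost of warping sequence a onto b."""
    inf = float("inf")
    # dp[i][j] holds the best cost of aligning a[:i] with b[:j].
    dp = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = cost(a[i - 1], b[j - 1])
            # An element may be matched once (diagonal move) or stretched
            # over several elements of the other sequence (vertical or
            # horizontal move).
            dp[i][j] = step + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[len(a)][len(b)]


def phoneme_cost(x: str, y: str) -> float:
    """Hypothetical local cost: 0 for identical phoneme labels, 1 otherwise."""
    return 0.0 if x == y else 1.0
```

Because DTW lets one element absorb repeats in the other sequence, a hypothesis with a duplicated phoneme can still align with a reference at zero cost, which a strict one-to-one comparison would penalize.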
3.
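Speech synthesis technology is maturing steadily. To improve the quality of synthesized emotional speech, a method combining end-to-end emotional speech synthesis with prosody correction is proposed. Starting from emotional speech synthesized by a Tacotron model, the prosodic parameters are modified to improve the emotional expressiveness of the synthesis system. First, the Tacotron model is trained on a large neutral corpus and then on a small emotional corpus to synthesize speech with emotion. Next, the Praat acoustic analysis tool is used to analyze the prosodic features of the emotional speech in the corpus and to summarize the parameter patterns under different emotional states. Finally, using these patterns, the fundamental frequency, duration, and energy of the corresponding Tacotron-synthesized emotional speech are corrected so that the emotion is expressed more precisely. The results of objective emotion recognition experiments and subjective evaluation show that the method can synthesize emotional speech that is fairly natural and more expressive.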
4.
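To address the poor quality and low naturalness of traditional speech synthesis and the long training time and low efficiency of autoregressive models, a Chinese speech synthesis method based on a non-autoregressive model is proposed. Compared with autoregressive models, the method greatly improves training efficiency, and it uses a generative adversarial network in the vocoder, which noticeably improves audio quality over traditional speech synthesis methods. The input Chinese characters are first converted to phonemes by front-end processing and then mapped to the phoneme embedding layer through one-hot encoding; positional encoding provides the position information of the phoneme sequence; the feed-forward network in the encoder converts the phoneme sequence into a hidden sequence, to which the audio features predicted by the variance adaptor are added; finally, the decoder outputs a mel spectrogram to the vocoder, which generates the audio waveform. The experimental dataset consists of 10,000 sentences recorded by a professional Chinese female voice. The results show a mean opinion score of 3.76, synthesis quality clearly better than that of traditional parametric speech synthesis, and a training time only 15% of that of the autoregressive model.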
5.
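A unit-selection speech synthesis algorithm based on statistical acoustic models is proposed. In the model training stage, acoustic parameters such as the spectrum and fundamental frequency are first extracted from the speech data in the corpus; together with the segmental and prosodic annotations in the corpus, they are used to estimate the statistical acoustic model for each context-dependent phoneme, with hidden Markov models as the model structure. In the synthesis stage, the best synthesis units are selected under the criterion of maximizing the likelihood output by the acoustic models corresponding to the target sentence, and the synthesized speech is finally generated by smoothly concatenating the waveforms of the selected units. Based on this algorithm, a Chinese speech synthesis system using initials and finals as the basic concatenation units is built, and listening tests demonstrate that the algorithm improves the naturalness of synthesized speech relative to traditional algorithms.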
6.
7.
8.
9.
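Corpus resources are scarce when training Mongolian speech recognition models, so the low-resource corpus cannot support sufficient training of deep network models. To address this problem, this paper proposes a layer transfer method based on transfer learning, designs several transfer strategies for building a CNN-CTC (convolutional neural network and connectionist temporal classification) Mongolian layer-transfer speech recognition model, and investigates the different strategies to obtain the best model. On a 10,000-sentence English corpus and a 5,000-sentence Mongolian corpus, experiments were conducted on learning rate selection in layer-transfer model training, the effectiveness of layer transfer, transfer-layer selection strategies, and the effect of the amount of training data for the high-resource model on the layer-transfer model. The results show that the layer-transfer model speeds up training and effectively reduces the model's WER; the bottom-up transfer-layer selection strategy yields the best layer-transfer model; and, with limited Mongolian corpus resources, the CNN-CTC Mongolian layer-transfer speech recognition model reduces WER by 10.18% compared with an ordinary CNN-CTC Mongolian speech recognition model.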
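The WER (word error rate) figure reported above is conventionally the word-level edit distance between the reference and the recognizer output, divided by the reference length. The following is a minimal sketch of that standard definition; the function name `word_error_rate` is an assumption, and the abstract does not describe the paper's own scoring tool.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]  # aligning i reference words with nothing costs i deletions
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)
```

For instance, scoring the hypothesis `a x c` against the reference `a b c d` counts one substitution and one deletion over four reference words, giving a WER of 0.5.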
10.
11.
《Computer Speech and Language》2007,21(2):325-349
In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model. We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximations. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
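The two quantitative measures named above, RMSE and correlation, can be sketched as follows. This is a minimal illustration using the standard definitions, with correlation assumed to be the Pearson coefficient; the abstract does not say which correlation measure was used, and the function names are hypothetical.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between predicted and observed durations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def pearson_corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient (an assumed choice of 'Corr')."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)
```

A lower RMSE and a correlation closer to 1 on held-out data would both indicate that a duration model's predictions track the measured syllable durations more closely.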
12.
13.
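Prediction of Chinese Prosodic Structure Based on Grammatical Information (total citations: 8; self-citations: 4; citations by others: 8)
Prosodic structure prediction mainly covers two major aspects: automatic phrase segmentation and the distribution of stress levels. Building on an overview of Chinese prosodic structure, and drawing on the relationships between prosodic structure, syntactic structure, and part of speech observed in natural speech, this paper uses a new method that, through text analysis, comprehensively predicts the positions and levels of prosodic boundaries and then predicts the positions and levels of stress.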
14.
A.I.C. Monaghan 《International Journal of Speech Technology》2003,6(1):73-81
The model of prosody used in the Aculab TTS system is unusual in several respects. Firstly, it is based firmly on current metrical theories of prosody. Secondly, it is entirely knowledge-based: there are no stochastic components in the model. Thirdly, it makes use of a quasi-random element to avoid the predictability of conventional synthetic prosody. Fourthly, it is specifically designed for multilingual use: it currently handles several Germanic and Romance languages.
15.
G. Olaszy G. Németh P. Olaszi G. Kiss Cs. Zainkó G. Gordos 《International Journal of Speech Technology》2000,3(3-4):201-215
The latest Hungarian text-to-speech (TTS) system developed for telephone-based applications is described. The main features are an intelligible, human-like voice; robust software designed for continuous running; fully automatic conversion of declarative (short and very long) sentences and questions; and real-time parallel operation, running on a minimum of 30 channels. The concept of prosody generation and sound duration processing is introduced. Also, the development environment of Profivox is presented. The market-leading Hungarian mobile service provider applies the TTS system in an automatic e-mail reading application.
16.
Modifying prosody parameters such as pitch, duration and strength of excitation by a desired factor is termed prosody modification. The objective of this work is to develop a dynamic prosody modification method based on the zero frequency filtered signal (ZFFS), a byproduct of zero frequency filtering (ZFF). The existing epoch based prosody modification techniques use epochs as pitch markers, and the required prosody modification is achieved by the interpolation of the epoch intervals plot. Alternatively, this work proposes a method for prosody modification by the resampling of ZFFS. The existing epoch based prosody modification method is also further refined to modify the prosodic parameters at every epoch level, thus providing more flexibility for prosody modification. The general framework for deriving the modified epoch locations can also be used for obtaining dynamic prosody modification from existing PSOLA and epoch based prosody modification methods. The quality of the prosody modified speech is evaluated using waveforms, spectrograms and subjective studies. The usefulness of the proposed dynamic prosody modification is demonstrated for a neutral to emotional conversion task. The subjective evaluations performed for the emotion conversion indicate the effectiveness of dynamic prosody modification over fixed prosody modification for emotion conversion. The dynamic prosody modified speech files synthesized using the proposed, epoch based and TD-PSOLA methods are available at http://www.iitg.ac.in/eee/emstlab/demos/demo5.php.
17.
Navas E. Hernaez I. Iker Luengo 《IEEE transactions on audio, speech, and language processing》2006,14(4):1117-1127
Building a text corpus suitable to be used in corpus-based speech synthesis is a time-consuming process that usually requires some human intervention to select the desired phonetic content and the necessary variety of prosodic contexts. If an emotional text-to-speech (TTS) system is desired, the complexity of the corpus generation process increases. This paper presents a study aiming to validate or reject the use of a semantically neutral text corpus for the recording of both neutral and emotional (acted) speech. The use of this kind of text would eliminate the need to include semantically emotional texts in the corpus. The study has been performed for the Basque language. It has been made by performing subjective and objective comparisons between the prosodic characteristics of recorded emotional speech using both semantically neutral and emotional texts. At the same time, the performed experiments allow for an evaluation of the capability of prosody to carry emotional information in the Basque language. Prosody manipulation is the most common processing tool used in concatenative TTS. Experiments of automatic recognition of the emotions considered in this paper (the "Big Six emotions") show that prosody is an important emotional indicator, but cannot be the only manipulated parameter in an emotional TTS system, at least not for all the emotions. Resynthesis experiments transferring prosody from emotional to neutral speech have also been performed. They corroborate the results and support the use of a neutral-semantic-content text in databases for emotional speech synthesis.
18.
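The Chinese speech concatenation module is the most basic and most important module in a TTS system. Its function is to retrieve speech units from the speech database according to the results of text analysis and prosody generation, and to concatenate these units using some algorithm so as to generate an appropriate speech output file in real time. This paper analyzes the technique of implementing Chinese speech concatenation by waveform concatenation and describes the development of the main modules.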
19.
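Based on the characteristics of Chinese pronunciation and on statistics and analysis of fundamental frequency contour data from a large number of natural Chinese sentences, a mathematical model for data-driven generation of Chinese prosodic features is proposed. The model is built mainly on fundamental frequency parameters, supplemented by duration and gain parameters, and can represent multiple layers of prosodic information in Chinese: sentence mood, phrase rhythm, prosodic-word tone, and stress. The parameters of each layer can be trained and annotated by category according to linguistic knowledge. The model's various normalized "toneme" functions and tone sandhi rules are given. Simulation experiments demonstrate the model's effectiveness.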
20.
Catalin Ungurean Dragos Burileanu Aurelian Dervis 《International Journal of Speech Technology》2009,12(2-3):63-73
Lexical stress is primarily important to generate a correct pronunciation of words in many languages; hence its correct placement is a major task in prosody prediction and generation for high-quality TTS (text-to-speech) synthesis systems. This paper proposes a statistical approach to lexical stress assignment for TTS synthesis in Romanian. The method is essentially based on n-gram language models at character level, and uses a modified Katz backoff smoothing technique to solve the problem of data sparseness during training. Monosyllabic words are considered as not carrying stress, and are separated by an automatic syllabification algorithm. A maximum accuracy of 99.11% was obtained on a test corpus of about 47,000 words.