共查询到20条相似文献,搜索用时 140 毫秒
1.
语音合成中的韵律关联模型 总被引:4,自引:2,他引:4
基于大规模语音数据库的文语转换系统(Text-to-Speech , TTS)中,如何选取合适的语音基元是提高合成语音自然度的重要因素。本文研究了连续语流中的韵律关联现象,提出了包含韵律关联参数的汉语韵律特征参数集,基于数据挖掘中的关联规则模型(Association Rules Model)建立韵律关联模型,并将该模型应用于基元选取。实验表明,该方法有效地利用了语音基元的韵律及关联信息,符合人耳的知觉感受,使得合成语音自然度的主观评测MOS(Mean Opinion Score)得分与不考虑韵律关联时的结果相比提高了12.22%(3.49/3.11)。 相似文献
2.
3.
韵律结构的自动预测是高自然度文语转换(TTS)系统的关键组成部分,直接影响到合成语音的自然度和表现力。该文建立了一个同时具有语法信息与韵律结构标注的汉语语料库。在这一语料库的基础上,对汉语的韵律结构组成、韵律结构与语法语义之间的关系进行了分析,并进行了预测试验。研究发现,汉语的韵律结构虽与语法结构不同,但是有着密切的联系,韵律结构可以通过语法结构进行预测。韵律结构除与语法结构有关之外,还要受到语句语义的制约。 相似文献
4.
文语转换系统中的韵律研究 总被引:1,自引:0,他引:1
文语转换(Tex-To-Speech,TTS)是将文字形信息转换成自然语言的一种技术,在人机交互、通讯、资讯、家电等领域有着广泛的用途;然而,当前的TTS系统普遍存在着输出语音的机器味太浓、不够自然的现象,在很大程度上阻碍了它的推广和应用。其根本原理即在于合成语音中缺乏必要的韵律信息。我们认为,一个TTS系统应分为相对独立的上下两层;韵律结构分析和语言生成。上层负责分析语句的韵律结构,并标注相应的韵律标记,下层负责将之转换成相应的合成器参数,并输出语音。因此,当前的首要任务就是要研究韵律的主要特点、韵律的结构和主要内容,在此基础上,制订出一套相应的韵律标记方法。 相似文献
5.
汉语语音拼接模块是TTS系统中最基本、最重要的模块。它的功能是根据文本分析、韵律生成的结果从语音数据库中提取语音基元,并将这些语音基元按照某种算法拼接在一起,从而实时地生成适当的语音输出文件。本文主要剖析了采用波形拼接的方法实现汉语语音拼接的技术,阐述了主要模块的开发过程。 相似文献
6.
对汉语TTS系统的大规模语料库做了基本的韵律参数统计,分析了音节的韵律特征与其所在的韵律结构位置以及韵律结构边界的关系。进一步,对有调音节样本集基于基频包络采用k中心点算法进行聚类,通过听辩实验检验了聚类结果。并分析了音节聚类与其所在韵律结构之间的对应关系。 相似文献
7.
8.
本文首先简述语音合成技术和文语转换系统(TTS:Text to Speech)的基本原理,而后介绍一种可用来进行韵律修正的基音同步叠加算法(PSOLA),最后介绍调素论在系统中的应用,使合成语音的自然度得以提高。 相似文献
9.
韵律边界对言语表达的自然度和可理解度有着重要作用。韵律建模也是语音合成、语音理解中的重要方面。该文从相邻声调的相互作用角度出发,提出基于深度神经网络(DNN)及声调核声学特征的汉语韵律边界检测方法。该方法首先采用声调核部分的声学特征来计算边界检测相关参数。然后,利用深度神经网络进行建模。作为对比,实验中采用了以整个音节的声学特征为输入特征的基线系统。结果表明,只使用调核部分声学特征的系统优于使用整个音节的系统,韵律边界检测正确率相对提高了4%,这表明该文提出的汉语韵律边界检测方法的有效性。 相似文献
10.
11.
12.
Chin-Teng Lin Rui-Cheng Wu Jyh-Yeong Chang Sheng-Fu Liang 《IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics》2004,34(1):309-324
In this paper, a new technique for the Chinese text-to-speech (TTS) system is proposed. Our major effort focuses on the prosodic information generation. New methodologies for constructing fuzzy rules in a prosodic model simulating human's pronouncing rules are developed. The proposed Recurrent Fuzzy Neural Network (RFNN) is a multilayer recurrent neural network (RNN) which integrates a Self-cOnstructing Neural Fuzzy Inference Network (SONFIN) into a recurrent connectionist structure. The RFNN can be functionally divided into two parts. The first part adopts the SONFIN as a prosodic model to explore the relationship between high-level linguistic features and prosodic information based on fuzzy inference rules. As compared to conventional neural networks, the SONFIN can always construct itself with an economic network size in high learning speed. The second part employs a five-layer network to generate all prosodic parameters by directly using the prosodic fuzzy rules inferred from the first part as well as other important features of syllables. The TTS system combined with the proposed method can behave not only sandhi rules but also the other prosodic phenomena existing in the traditional TTS systems. Moreover, the proposed scheme can even find out some new rules about prosodic phrase structure. The performance of the proposed RFNN-based prosodic model is verified by imbedding it into a Chinese TTS system with a Chinese monosyllable database based on the time-domain pitch synchronous overlap add (TD-PSOLA) method. Our experimental results show that the proposed RFNN can generate proper prosodic parameters including pitch means, pitch shapes, maximum energy levels, syllable duration, and pause duration. Some synthetic sounds are online available for demonstration. 相似文献
13.
Sanghun Kim Youngjik Lee Keikichi Hirose 《International Journal of Speech Technology》2002,5(2):105-116
This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech. The poor naturalness is caused by excessive prosodic modification using a small speech database. To cope with this problem, we utilized a dynamic unit selection method based on a large speech database without prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set maximizing phonetic or prosodic coverage of Korean triphones. All the utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, we achieved a synthesis unit cost of zero if two synthesis units were placed consecutively in an utterance. This reduces the number of concatenating points that may occur due to concatenating mismatches. In this paper, we present data concerning the realization of major prosodic variations through a consideration of prosodic phrase break strength. The phrase break was divided into four kinds of strength based on pause length. Using phrase break strength, triphones were further classified to reflect major prosodic variations. To predict phrase break strength on texts, we adopted an HMM-like Part-of-Speech (POS) sequence model. The performance of the model showed 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search was performed to find the most appropriate triphone sequence, which has the minimum continuation cost of prosody and spectrum at concatenating boundaries. From the informal listening test, we found that the proposed Korean corpus-based TTS system showed better naturalness than the conventional demisyllable-based one. 相似文献
14.
《Computer Speech and Language》2005,19(2):129-146
Three experiments are reported that use new experimental methods for the evaluation of text-to-speech (TTS) synthesis from the user's perspective. Experiment 1, using sentence stimuli, and Experiment 2, using discrete “call centre” word stimuli, investigated the effect of voice gender and signal quality on the intelligibility of three concatenative TTS synthesis systems. Accuracy and search time were recorded as on-line, implicit indices of intelligibility during phoneme detection tasks. It was found that both voice gender and noise affect intelligibility. Results also indicate interactions of voice gender, signal quality, and TTS synthesis system on accuracy and search time. In Experiment 3 the method of paired comparisons was used to yield ranks of naturalness and preference. As hypothesized, preference and naturalness ranks were influenced by TTS system, signal quality and voice, in isolation and in combination. The pattern of results across the four dependent variables – accuracy, search time, naturalness, preference – was consistent. Natural speech surpassed synthetic speech, and TTS system C elicited relatively high scores across all measures. Intelligibility, judged naturalness and preference are modulated by several factors and there is a need to tailor systems to particular commercial applications and environmental conditions. 相似文献
15.
Yiqiang Chen Wen Gao Tingshao Zhu Charles Ling 《Journal of Intelligent Information Systems》2002,19(1):95-109
Higher quality synthesized speech is required for widespread use of text-to-speech (TTS) technology, and the prosodic pattern is the key feature that makes synthetic speech sound unnatural and monotonous, which mainly describes the variation of pitch. The rules used in most Chinese TTS systems are constructed by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from actual large mandarin speech databases to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by clustering analysis. Some machine learning techniques, including Rough Set, Artificial Neural Network (ANN) and Decision tree, are trained for fundamental frequency and energy contours, which can be directly used in a pitch-synchronous-overlap-add-based (PSOLA-based) TTS system. The experimental results showed that synthesized prosodic features greatly resembled their original counterparts for most syllables. 相似文献
16.
改进的跨语种语音合成模型自适应方法 总被引:1,自引:0,他引:1
统计参数语音合成中的跨语种模型自适应主要应用于目标说话人语种与源模型语种不同时,使用目标发音人少量语音数据快速构建具有其音色特征的源模型语种合成系统。本文对传统的基于音素映射和三音素模型的跨语种自适应方法进行改进,一方面通过结合数据挑选的音素映射方法以提高音素映射的可靠性,另一方面引入跨语种的韵律信息映射以弥补原有方法中三音素模型在韵律表征上的不足。在中英文跨语种模型自适应系统上的实验结果表明,改进后系统合成语音的自然度与相似度相对传统方法都有了明显提升。 相似文献
17.
中文语音合成系统中的一种两层韵律结构生成体系 总被引:1,自引:0,他引:1
韵律结构生成是改进一个语音合成系统中的合成语音的完整度和自然度的重要组成部分. 韵律词和韵律短语的自动切分是中文层级韵律结构的两个重要的基本层面, 本文调研了这个基本问题, 并提出了一种两层韵律结构生成体系. 为此, 我们建立了条件随机场模型为韵律词和韵律短语的预测选取不同的前端特征. 除此之外, 我们还引入了基于转换的错误驱动学习模块来修正后端的初始预测. 实验结果显示, 这种结合条件随机场和错误驱动学习的方法使得韵律词和韵律短语的自动分割的F-score值达到了94.66%. 相似文献
18.
提出了一种基于粒子群算法PSO优化广义回归神经网络GRNN模型的语音转换方法。首先,该方法利用训练语音的声道和激励源的个性化特征参数分别训练两个GRNN,得到GRNN的结构参数;然后,利用PSO对GRNN的结构参数进行优化,减少人为因素对转换结果的影响;最后,对语音的韵律特征、基音轮廓和能量分别进行了线性转换,使得转换后的语音包含更多源语音的个性化特征信息。主客观实验结果表明:与径向基神经网络RBF和GRNN相比,使用本文提出的转换模型获得的转换语音的自然度和似然度都得到了很大的提升,谱失真率明显降低并且更接近于目标语音。 相似文献
19.
20.
为预测英语文语转换(Text-to-Speech,TIS)系统中韵律生成模块的韵律边界,通过在中间短语、语调短语和语句后分别插入不同长度的停顿,产生使合成语音具有与真人语音类似的韵律结构.通过采用基于语块的中间短语切分,以中间短语为基本单位,生成一个语调短语边界预测的学习语料库,然后采用转换式学习法进行标注学习,从而实现韵律边界的切分.在对真人语料库进行测试的实验中,标注正确率达到81.32%,通过在学习中增加语调短语音节数和标点符号的约束规则,可进一步提高标注正确率. 相似文献