Similar Documents
18 similar documents found (search time: 125 ms)
1.
In Mandarin speech synthesis, prediction of the fundamental frequency contour (F0 contour) is the key factor determining the tonal naturalness of the synthesized speech. To make the generated F0 contours transition smoothly, this paper proposes connecting the F0 patterns of adjacent syllables through dedicated transition-segment F0 patterns. The transition-segment and syllable patterns are obtained by clustering followed by analysis and correction, and they overlap one another; at synthesis time, parameters determine which region of each pattern is selected for concatenation, and these parameters are set by rules summarized from experimental analysis. After integration into a PSOLA-based speech synthesis system, experiments confirmed a clear improvement in the tonal naturalness of the synthesized speech.
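A minimal sketch of the connection idea, assuming the syllable and transition F0 patterns are already available as numpy arrays at frame rate; the overlap length and the cross-fade are illustrative choices, not the paper's exact region-selection rules:

```python
import numpy as np

def connect_f0(syl_a, syl_b, transition, overlap=5):
    """Concatenate two syllable F0 patterns through a transition-segment
    pattern, cross-fading over `overlap` frames at each junction."""
    def crossfade(left, right, n):
        w = np.linspace(0.0, 1.0, n)                  # fade-in weights
        mixed = (1 - w) * left[-n:] + w * right[:n]   # blend the overlap region
        return np.concatenate([left[:-n], mixed, right[n:]])
    out = crossfade(syl_a, transition, overlap)       # syllable A -> transition
    return crossfade(out, syl_b, overlap)             # transition -> syllable B

# toy F0 patterns (Hz): falling tone, transition dip, rising tone
a = np.linspace(220, 180, 30)
t = np.linspace(180, 170, 12)
b = np.linspace(170, 240, 30)
f0_curve = connect_f0(a, b, t)
```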

2.
Design and Implementation of an Embedded Chinese TTS System (cited 5 times: 0 self-citations, 5 by others)
To suit the limited storage of handheld devices and PDAs, this paper proposes pruning the unit database by clustering syllables with the k-medoids algorithm on features of their F0 envelopes. Listening tests and statistical analysis of the clustering results show that the algorithm preserves similarity within clusters and dissimilarity between clusters. After analyzing the candidate synthesis units for Chinese, the system is the first to adopt a mixed inventory of initial/final half-syllables and full syllables, on which the unit database is built. Clustering and pruning each sample set separately compresses the database further, and an embedded TTS system was implemented on a PDA platform.
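The pruning step can be illustrated with a plain alternating k-medoids loop over fixed-length F0 envelopes; this is a generic sketch (the paper's distance measure and feature details are not specified in the abstract):

```python
import numpy as np

def k_medoids(X, k, iters=50, seed=0):
    """Assign points to the nearest medoid, then move each medoid to the
    member that minimizes the total in-cluster distance; repeat to convergence."""
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                new[c] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, labels

# toy F0 envelopes resampled to a fixed length (one row per syllable sample)
rng = np.random.default_rng(1)
envs = np.vstack([np.linspace(200, 250, 20) + 5 * rng.standard_normal(20) for _ in range(40)] +
                 [np.linspace(240, 180, 20) + 5 * rng.standard_normal(20) for _ in range(40)])
medoids, labels = k_medoids(envs, k=2)
# pruning keeps only the medoid sample of each cluster in the unit database
```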

3.
Basic prosodic-parameter statistics were computed over the large-scale corpus of a Chinese TTS system, analyzing how a syllable's prosodic features relate to its position in the prosodic structure and to prosodic boundaries. Further, the set of tonal syllable samples was clustered on F0 envelopes with the k-medoids algorithm, the clustering results were verified by listening tests, and the correspondence between the syllable clusters and the syllables' prosodic-structure positions was analyzed.

4.
This paper introduces an F0 prediction model that integrates a decision tree with conditional probabilities (F0 Prediction with Integrated Decision Tree and Conditional Probability Model, IDBCPM). F0 is a key prosodic parameter, and a high-accuracy F0 prediction model is essential for a high-quality speech synthesis system: from the information produced by text analysis, the model predicts an appropriate F0 contour for the current text. IDBCPM largely avoids the mismatch at neighboring syllables that other F0 models suffer, which arises because the prediction for the current syllable cannot effectively take the predictions for adjacent syllables into account. IDBCPM makes full use of the decision tree's output, both the predicted class and its probability, and it also applies prior conditional probabilities estimated from the training data to rule out implausible tree outputs. Experiments show that its prediction accuracy is clearly higher than that of a plain decision tree.
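The abstract describes combining the tree's class posteriors with prior transition probabilities between neighboring syllables; one natural reading is a Viterbi-style decode over the syllable sequence. A hedged sketch, with `tree_post` and `trans_prior` as hypothetical inputs:

```python
import numpy as np

def decode_f0_classes(tree_post, trans_prior):
    """Viterbi-style decoding: combine per-syllable decision-tree class
    posteriors (n x k) with prior class-transition probabilities (k x k)
    so that neighbouring syllables receive compatible F0 pattern classes."""
    n, k = tree_post.shape
    logp = np.log(tree_post + 1e-12)
    logt = np.log(trans_prior + 1e-12)
    score = logp[0].copy()
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + logt            # previous class -> current class
        back[i] = np.argmax(cand, axis=0)
        score = cand[back[i], np.arange(k)] + logp[i]
    path = [int(np.argmax(score))]
    for i in range(n - 1, 0, -1):               # trace the best path backwards
        path.append(int(back[i, path[-1]]))
    return path[::-1]

post = np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]])   # tree posteriors per syllable
prior = np.array([[0.8, 0.2], [0.3, 0.7]])              # P(next class | current class)
print(decode_f0_classes(post, prior))
```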

5.
An Objective Measure of the Naturalness of Synthesized Speech (cited 2 times: 1 self-citation, 1 by others)
The naturalness of current synthesized speech still needs improvement. Based on the state of research, this paper proposes an objective method for evaluating the naturalness of synthesized speech. Starting from the main prosodic parameters, the method computes the distance between natural and synthesized speech of the same speaker in F0, duration, and intensity. Because the F0 contours of the two utterances are not aligned in time, the DTW (Dynamic Time Warping) algorithm is used to time-warp and align them. The objective scores are then compared with subjective MOS results. The experimental data show a strong correlation between the proposed F0-contour distortion measure and MOS, so this prosody-based evaluation can measure the naturalness of synthesized speech.
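The DTW alignment of the two F0 contours can be sketched directly; the normalization by the path-length bound is an illustrative choice, not necessarily the paper's exact distortion formula:

```python
import numpy as np

def dtw_f0_distortion(f0_nat, f0_syn):
    """Align a natural and a synthesized F0 contour with dynamic time
    warping and return the accumulated |difference| along the optimal
    path, normalized by n+m so contours of different lengths compare."""
    n, m = len(f0_nat), len(f0_syn)
    cost = np.abs(f0_nat[:, None] - f0_syn[None, :])      # frame-pair distances
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j],
                                                 acc[i, j - 1],
                                                 acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

natural = 180 + 30 * np.sin(np.linspace(0, 4, 120))
synth = 180 + 25 * np.sin(np.linspace(0, 4, 100))         # shorter, flatter contour
print(dtw_f0_distortion(natural, synth))
```

In the paper's setup, scores like this would then be correlated against MOS ratings over a set of utterances.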

6.
The syllable is the smallest pronunciation unit of Uyghur, so most Uyghur speech synthesis systems use it as the basic synthesis unit. But Uyghur has a very large number of syllables, and a corpus can hardly cover samples of all of them, which makes the synthesized speech unstable and discontinuous. To address the instability, a unit-selection algorithm combining two unit types, monophones and triphones, is proposed. Adding prosodic-parameter matching to the unit-selection module picks the unit with the best prosodic match and removes the discontinuity. Experimental results show that the proposed method effectively eliminates both the instability and the discontinuity of the synthesized speech, thereby improving its naturalness.
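A toy version of prosody-matched selection: each candidate is scored by a context term (triphone hit, with monophone fallback) plus a prosodic distance; the weights and the relative-distance form are assumptions for illustration:

```python
import numpy as np

def select_unit(candidates, tgt_f0, tgt_dur, tgt_triphone, w_prosody=1.0, w_ctx=2.0):
    """Pick the candidate minimizing a combined cost: a triphone-context
    mismatch penalty (falling back to the monophone) plus a relative
    prosodic-parameter distance. `candidates` are dicts with
    'f0', 'dur', and 'triphone' keys."""
    best, best_cost = None, np.inf
    for c in candidates:
        ctx = 0.0 if c['triphone'] == tgt_triphone else 1.0   # miss -> monophone level
        prosody = abs(c['f0'] - tgt_f0) / tgt_f0 + abs(c['dur'] - tgt_dur) / tgt_dur
        cost = w_ctx * ctx + w_prosody * prosody
        if cost < best_cost:
            best, best_cost = c, cost
    return best

units = [{'f0': 190, 'dur': 0.21, 'triphone': 'b-a+r'},
         {'f0': 175, 'dur': 0.18, 'triphone': 'd-a+r'}]
print(select_unit(units, tgt_f0=180, tgt_dur=0.20, tgt_triphone='b-a+r'))
```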

7.
A Chinese Voice Conversion Method Using a Tone Mapping Codebook (cited 3 times: 0 self-citations, 3 by others)
While a Gaussian mixture model is used to transform the spectral envelope between speakers, a Chinese tone codebook mapping technique is proposed to further strengthen the target-speaker character of the converted speech. F0 contours of Chinese monosyllables are extracted from the source and target speech as the units of F0 transformation; after preprocessing and clustering they form the source and target tone codebooks, and a tone-pattern mapping codebook from the source feature space to the target feature space is built by time alignment. Voice conversion experiments evaluated the tone codebook mapping algorithm. The results show that it captures the mapping between the source and target speakers' F0 contours well and improves voice conversion performance.
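A compact sketch of building and applying such a mapping codebook, assuming parallel source/target syllable F0 contours already time-aligned and resampled to equal length; plain k-means stands in for whatever clustering the paper used:

```python
import numpy as np

def build_tone_mapping(src_contours, tgt_contours, k=4, iters=20, seed=0):
    """Cluster the source F0 contours into k tone patterns (k-means) and
    store the mean of the parallel target contours as the mapped codeword."""
    rng = np.random.default_rng(seed)
    cents = src_contours[rng.choice(len(src_contours), k, replace=False)].copy()
    for _ in range(iters):
        lab = np.argmin(((src_contours[:, None] - cents[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(lab == c):
                cents[c] = src_contours[lab == c].mean(0)
    mapped = np.stack([tgt_contours[lab == c].mean(0) if np.any(lab == c) else cents[c]
                       for c in range(k)])
    return cents, mapped

def convert_tone(f0, src_codebook, tgt_codebook):
    """Replace a source syllable's F0 contour with the mapped target pattern."""
    j = np.argmin(((src_codebook - f0) ** 2).sum(-1))
    return tgt_codebook[j]
```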

8.
Chinese is a tone language: the same syllable carries different meanings under different tones. When two or more syllables are pronounced in succession, the tone value of a syllable may change, a phenomenon called tone sandhi. Current speech synthesis systems ignore tone sandhi, which limits the naturalness of the synthesized speech. This paper uses TD-PSOLA to synthesize speech exhibiting tone sandhi; experiments show that the synthesized speech is quite natural, making TD-PSOLA a good algorithm for small-corpus speech synthesis.
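For context, the best-known Mandarin sandhi rule (tone 3 + tone 3 → tone 2 + tone 3, as in ni3 hao3 → ni2 hao3) can be written as a tiny preprocessing pass over tone labels before synthesis; this illustrates the phenomenon, not the paper's TD-PSOLA processing itself:

```python
def apply_third_tone_sandhi(tones):
    """Left-to-right application of the third-tone sandhi rule: a tone-3
    syllable immediately followed by another tone 3 is realized as tone 2.
    Input and output are lists of tone digits (one simplified realization;
    real phrasing-dependent sandhi is more involved)."""
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out

assert apply_third_tone_sandhi([3, 3]) == [2, 3]      # ni3 hao3 -> ni2 hao3
```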

9.
Speaker recognition degrades sharply when the test speech differs emotionally from the training speech, and there is rarely enough emotional speech to train speaker models. This paper proposes a speaker recognition method based on F0 clustering of emotional speech, which makes effective use of the small amount of emotional speech available to the system. Different F0 thresholds are set for male and female speakers; cepstral features are clustered according to these thresholds, and a model is built for each speaker in each F0 range. During feature matching, the score of the F0-range model with the maximum likelihood is taken as the speaker's score. Tests on a Chinese emotional speech corpus show a higher recognition rate than both the traditional GMM speaker recognition method trained on neutral speech and the structured training method.
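A simplified sketch of the per-F0-range modeling: frames are split by a gender-dependent F0 threshold and one diagonal Gaussian (standing in for a GMM) is fitted per range; at test time the best-scoring range model wins. The threshold value and features here are placeholders:

```python
import numpy as np

def fit_range_models(feats, f0, threshold):
    """Split a speaker's cepstral frames (rows of `feats`) by an F0
    threshold and fit one diagonal Gaussian per F0 range."""
    models = {}
    for name, mask in (('low', f0 < threshold), ('high', f0 >= threshold)):
        x = feats[mask]
        models[name] = (x.mean(0), x.var(0) + 1e-6)   # (mean, variance) per dim
    return models

def avg_log_likelihood(x, model):
    mu, var = model
    return (-0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)).sum(1).mean()

def score_speaker(test_feats, models):
    """Score = best average log-likelihood across the F0-range models."""
    return max(avg_log_likelihood(test_feats, m) for m in models.values())

rng = np.random.default_rng(0)
feats, f0 = rng.standard_normal((500, 12)), rng.uniform(80, 300, 500)
models = fit_range_models(feats, f0, threshold=160.0)    # placeholder male threshold
print(score_speaker(rng.standard_normal((100, 12)), models))
```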

10.
In HMM-based speech synthesis, a postfilter with fixed parameters cannot adapt to spectra with different degrees of distortion, which lowers the naturalness of the synthesized speech. This paper proposes an improved synthesis algorithm with adaptive postfilter parameters. The method adaptively selects the optimal short-term filter parameter according to the spectral flatness of the speech, enhancing the formant regions of the synthesized spectrum, and uses a long-term postfilter to refine the harmonic structure of the synthesized F0 and reduce its discontinuity. Simulation results show that the method effectively alleviates over-smoothing of the spectrum, and subjective tests show improved naturalness of the synthesized speech.
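Spectral flatness is a standard measure, and a hypothetical linear mapping from flatness to postfilter strength illustrates the adaptive idea (the paper's actual parameter-selection rule is not given in the abstract):

```python
import numpy as np

def spectral_flatness(frame):
    """Geometric mean / arithmetic mean of the power spectrum: near 1 for
    flat (over-smoothed) spectra, near 0 for strongly peaked ones."""
    p = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

def postfilter_strength(frame, beta_min=0.0, beta_max=0.5):
    """Map flatness to a formant-emphasis coefficient: the flatter the
    spectrum, the stronger the short-term postfiltering (illustrative
    linear mapping, not the paper's rule)."""
    sf = spectral_flatness(frame)
    return beta_min + (beta_max - beta_min) * sf

frame = np.random.default_rng(0).standard_normal(512)   # toy speech frame
print(postfilter_strength(frame))
```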

11.
Higher-quality synthesized speech is required for widespread use of text-to-speech (TTS) technology, and the prosodic pattern, which mainly describes the variation of pitch, is the key feature that makes synthetic speech sound unnatural and monotonous. The rules used in most Chinese TTS systems are constructed by experts, with weak quality control and low precision. In this paper, we propose a combination of clustering and machine learning techniques to extract prosodic patterns from large corpora of actual Mandarin speech, in order to improve the naturalness and intelligibility of synthesized speech. Typical prosody models are found by clustering analysis. Several machine learning techniques, including rough sets, artificial neural networks (ANN), and decision trees, are trained on fundamental frequency and energy contours, and their outputs can be used directly in a pitch-synchronous-overlap-add (PSOLA) based TTS system. The experimental results show that the synthesized prosodic features closely resemble their original counterparts for most syllables.

12.
This work attempts to convert given neutral speech to a target emotional style using signal processing techniques; sadness and anger are the emotions considered. For emotion conversion, we propose signal processing methods that process neutral speech in three ways: (i) modifying the energy spectra, (ii) modifying the source features, and (iii) modifying the prosodic features. Energy spectra of different emotions are analyzed, and a method is proposed to modify the energy spectra of neutral speech after dividing the speech into different frequency bands. For the source part, epoch strength and epoch sharpness are studied extensively, and a new method is proposed for modifying and incorporating these parameters using appropriate modification factors. Prosodic features such as the pitch contour and intensity are also modified: new pitch contours corresponding to the target emotions are derived from the pitch contours of neutral test utterances and incorporated into them, while intensity modification divides each neutral utterance into three equal segments and scales their intensities separately, according to modification factors suited to the target emotion. Subjective evaluation using mean opinion scores was carried out to assess the quality of the converted emotional speech. Though the modified speech does not completely resemble the target emotion, these subjective tests demonstrate the potential of the methods to change the style of the speech.
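The three-segment intensity modification is easy to sketch; the per-segment factors below are purely illustrative, not the paper's measured values:

```python
import numpy as np

def modify_intensity(x, factors):
    """Split an utterance into three equal segments and scale each by its
    own factor, so the intensity pattern can mimic a target emotion
    (e.g. anger might boost the onset most). `factors` are hypothetical
    per-emotion modification factors."""
    thirds = np.array_split(x.astype(float), 3)
    return np.concatenate([seg * g for seg, g in zip(thirds, factors)])

speech = np.random.default_rng(0).standard_normal(16000)   # toy 1 s utterance
angry_like = modify_intensity(speech, factors=(1.4, 1.2, 1.1))
```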

13.
Prosody conversion from neutral speech to emotional speech (cited 1 time: 0 self-citations, 1 by others)
Emotion is an important element in expressive speech synthesis. Unlike traditional discrete emotion simulations, this paper attempts to synthesize emotional speech using "strong", "medium", and "weak" classifications. It tests three models: a linear modification model (LMM), a Gaussian mixture model (GMM), and a classification and regression tree (CART) model. The LMM directly modifies sentence F0 contours and syllabic durations from acoustic distributions of emotional speech, such as F0 topline, F0 baseline, durations, and intensities. Further analysis shows that emotional speech is also related to stress and linguistic information. Unlike the linear modification method, the GMM and CART models try to map the subtle prosody distributions between neutral and emotional speech; the GMM uses only the acoustic features, while the CART model integrates linguistic features into the mapping. A pitch target model optimized to describe Mandarin F0 contours is also introduced. For all conversion methods, a deviation of perceived expressiveness (DPE) measure is created to evaluate the expressiveness of the output speech. The results show that the LMM gives the worst results of the three methods; the GMM is more suitable for a small training set, while the CART model gives better emotional speech output when trained with a large context-balanced corpus. The methods discussed in this paper indicate ways to generate emotional speech in speech synthesis, and the objective and subjective evaluation processes are also analyzed. These results support the use of neutral-semantic-content text in databases for emotional speech synthesis.
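The LMM's direct F0 modification can be illustrated as a linear rescaling of the contour between baseline and topline statistics; defining baseline/topline as the 5th/95th percentiles is an assumption for the sketch:

```python
import numpy as np

def lmm_f0(f0, tgt_base, tgt_top):
    """Linear modification model sketch: rescale an F0 contour so that
    its baseline/topline (5th/95th percentiles here) match target
    statistics drawn from emotional-speech distributions."""
    base, top = np.percentile(f0, [5, 95])
    return tgt_base + (f0 - base) * (tgt_top - tgt_base) / (top - base)

neutral = 120 + 20 * np.sin(np.linspace(0, 3, 200))   # toy neutral contour (Hz)
angry = lmm_f0(neutral, tgt_base=140, tgt_top=260)    # wider, higher target range
```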

14.
In this paper, we present a new approach to speech clustering with regard to speaker identity. It groups the homogeneous speech segments obtained at the end of the segmentation process by using the spatial information carried by stereophonic speech signals. The proposed method uses the differential energy of the two stereophonic signals collected by two cardioid microphones to cluster all speech segments belonging to the same speaker. The total number of clusters obtained at the end should equal the actual number of speakers present in the meeting room, and each cluster should contain the entire intervention of exactly one speaker. The proposed system is suitable for debates or multi-conferences in which the speakers sit at fixed positions. Essentially, our approach performs speaker localization with respect to the positions of the microphones, taken as a spatial reference; based on this localization, the method can recognize the speaker identity of any speech segment during the meeting, so each speaker's intervention is automatically detected and assigned by estimating the speaker's relative position. For comparison, two clustering methods were implemented and tested: the new approach, which we call Energy Differential based Spatial Clustering (EDSC), and a classical statistical approach called "Mono-Gaussian based Sequential Clustering" (MGSC). Speaker clustering experiments were run on a stereophonic speech corpus called DB15, composed of 15 stereophonic scenarios of about 3.5 minutes each; every scenario is a free discussion between two or three speakers seated at fixed positions in the meeting room. Results show the outstanding performance of the new approach in precision and speed, especially for short speech segments, where most clustering techniques fail badly.
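A minimal sketch of the energy-differential labeling behind EDSC, assuming segment boundaries from a prior segmentation step; the decibel margin and the three-way labels are illustrative:

```python
import numpy as np

def edsc_labels(left, right, seg_bounds, margin_db=2.0):
    """Label each speech segment by which cardioid microphone captures
    more energy, as a proxy for the speaker's fixed position.
    `seg_bounds` is a list of (start, end) sample indices."""
    labels = []
    for s, e in seg_bounds:
        e_l = np.sum(left[s:e] ** 2) + 1e-12
        e_r = np.sum(right[s:e] ** 2) + 1e-12
        diff_db = 10 * np.log10(e_l / e_r)        # differential energy in dB
        if diff_db > margin_db:
            labels.append('speaker_near_left')
        elif diff_db < -margin_db:
            labels.append('speaker_near_right')
        else:
            labels.append('center')
    return labels

rng = np.random.default_rng(0)
left, right = rng.standard_normal(48000), 0.3 * rng.standard_normal(48000)
print(edsc_labels(left, right, [(0, 16000), (16000, 48000)]))
```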

15.
A Study of Consistency Checking Methods for Corpus Part-of-Speech Tagging (cited 4 times: 0 self-citations, 4 by others)
In the deep processing of large-scale corpora, ensuring the consistency of part-of-speech (POS) tagging has become the primary problem in building high-quality corpora. This paper proposes a new clustering- and classification-based method for checking the consistency of corpus POS tagging. Instead of the rule-based or statistical methods used previously, it clusters the exemplars to derive a threshold and then classifies the test data to decide whether their tags are correct, thereby measuring the POS-tagging consistency of each text and further guaranteeing the correctness of large-scale corpus annotation.
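One simple reading of the cluster-then-threshold idea, with context feature vectors as hypothetical inputs: exemplars of a correct tagging define a centroid and a distance threshold, and a test instance falling outside the threshold is flagged for manual review:

```python
import numpy as np

def consistency_check(exemplar_vecs, test_vec):
    """Sketch: the threshold is the largest exemplar-to-centroid distance;
    a test tagging whose context vector lies outside it is flagged as a
    potential inconsistency (returns False)."""
    centroid = exemplar_vecs.mean(0)
    threshold = np.max(np.linalg.norm(exemplar_vecs - centroid, axis=1))
    return np.linalg.norm(test_vec - centroid) <= threshold

rng = np.random.default_rng(0)
exemplars = rng.standard_normal((50, 8))              # hypothetical context features
print(consistency_check(exemplars, rng.standard_normal(8)))
```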

16.
In this work, spectral features extracted from sub-syllabic regions and from pitch-synchronous analysis are proposed for speech emotion recognition. Linear prediction cepstral coefficients, mel-frequency cepstral coefficients, and features extracted from the high-amplitude regions of the spectrum are used to represent emotion-specific spectral information. These features are extracted from the consonant, vowel, and transition regions of each syllable, determined using vowel onset points, to study the contribution of each region to emotion recognition. Spectral features extracted from each pitch cycle are also used to recognize the emotions present in speech. The emotions used in this study are anger, fear, happiness, neutral, and sadness. The emotion recognition performance obtained with sub-syllabic speech segments is compared with the results of the conventional block-processing approach, in which the entire speech signal is processed frame by frame. The proposed emotion-specific features are evaluated on a simulated emotion speech corpus, IITKGP-SESC (Indian Institute of Technology, KharaGPur-Simulated Emotion Speech Corpus), and the results are compared with those on the Berlin emotion speech corpus. Emotion recognition systems are developed using Gaussian mixture models and auto-associative neural networks. The purpose of this study is to explore sub-syllabic regions for identifying the emotions embedded in a speech signal and, if possible, to avoid processing the entire speech signal for emotion recognition without seriously compromising performance.

17.
In this paper, global and local prosodic features extracted from sentences, words, and syllables are proposed for speech emotion (affect) recognition. Duration, pitch, and energy values are used to represent the prosodic information. Global prosodic features capture gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours; local prosodic features capture the temporal dynamics of the prosody. Global and local prosodic features are analyzed separately and in combination, at different levels, for the recognition of emotions. We also examine words and syllables at different positions (initial, middle, and final) separately, to analyze their contribution to emotion recognition. All studies are carried out on a simulated Telugu emotion speech corpus (IITKGP-SESC), and the results are compared with those on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that recognition performance with local prosodic features is better than with global prosodic features, and that words in the final position of a sentence and syllables in the final position of a word carry more emotion-discriminative information than words and syllables in other positions.
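The global feature set named in the abstract (mean, min, max, standard deviation, slope) is straightforward to extract; feeding the resulting vectors to an SVM classifier such as scikit-learn's `SVC` would be one way to realize the classification stage (that pairing is an assumption, not the paper's stated toolkit):

```python
import numpy as np

def global_prosodic_features(f0, energy, duration):
    """Gross statistics of the prosodic contours, as listed in the paper:
    mean, min, max, standard deviation, and least-squares slope of each
    contour, plus the segment duration."""
    def stats(c):
        t = np.arange(len(c))
        slope = np.polyfit(t, c, 1)[0]             # linear trend of the contour
        return [c.mean(), c.min(), c.max(), c.std(), slope]
    return np.array(stats(f0) + stats(energy) + [duration])

f0 = 200 + 15 * np.sin(np.linspace(0, 2, 80))      # toy contours for one word
energy = np.abs(np.sin(np.linspace(0, 3, 80)))
print(global_prosodic_features(f0, energy, duration=0.42))
```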

18.
Duration and Pitch of Chinese Prosodic Phrases (cited 2 times: 1 self-citation, 1 by others)
Analyzing and modeling the prosodic and information structure of sentences and discourse is key to improving the naturalness of speech synthesis and reducing the error rate of natural language recognition. Based on the prosodically annotated ASCCD corpus, this paper studies the duration and pitch of prosodic phrases and verifies the following conclusions. (1) Prosodic-phrase boundaries noticeably lengthen syllable duration, and the amount of lengthening varies with tone and with stress level. (2) The break at a prosodic-phrase boundary stands out more clearly against smaller prosodic boundaries. Pitch is clearly reset at prosodic-phrase boundaries: the pitch bottom line of a prosodic phrase always declines, while the top line declines only after a stress; at stressed syllables the pitch range is wide and the top line sits high.

