Similar literature (18 results)
1.
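Dynamic viseme model and its parameter estimation   Total citations: 3 (self-citations: 0, citations by others: 3)
王志明, 蔡莲红. 《软件学报》 (Journal of Software), 2003, 14(3): 461-466
Visual information can improve speech comprehension, but generating realistic, natural mouth shapes in visual speech synthesis is a complex problem. Based on an in-depth study of how mouth shape changes during speech, this paper proposes a dynamic viseme model based on the blending of control functions. Taking the characteristics of Chinese pronunciation into account, it also gives a systematic method for learning the model parameters from training data, which is more reliable than specifying them manually from subjective experience. Experimental results show that the viseme model, with parameters learned from training data, can effectively describe how mouth shape changes during Chinese pronunciation.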

2.
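This paper describes a Chinese text-to-speech system based on rule-based synthesis from speech parameters. The system uses Chinese syllables and words as synthesis units, preserving the suprasegmental information between and within syllables when syllables form words, which ensures the naturalness of the synthesized speech. The synthesis units are compressed with CELP speech coding, a currently successful method, and the synthesized speech remains highly intelligible even at a compression ratio of more than 20. The authors' thorough attention to the system software and their design of a user programming interface make this a Chinese text-to-speech system with a wide range of uses.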

3.
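Chinese text-to-visual speech synthesis based on a data-driven approach   Total citations: 7 (self-citations: 0, citations by others: 7)
王志明, 蔡莲红, 艾海舟. 《软件学报》 (Journal of Software), 2005, 16(6): 1054-1063
A computer text-to-visual speech synthesis system (TTVS) can improve speech intelligibility and make human-computer interfaces friendlier. This paper presents a Chinese text-to-visual speech synthesis system based on a data-driven (sample-based) approach, which generates new visual speech by concatenating short video segments. It gives an effective method for constructing visual confusion trees for Chinese initials and finals, and proposes a coarticulation model based on the visual confusion tree and a hardness factor; the model can be used for corpus selection in the analysis stage and for unit selection in the synthesis stage. Where the two frames at a concatenation boundary differ noticeably, image morphing is used to smooth the transition. Combined with an existing text-to-speech (TTS) system, a Chinese text-to-visual speech synthesis system is implemented.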

4.
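This paper discusses intonation modeling in Chinese text-to-speech systems. It analyzes the constituent elements of intonation, studies how intonation modulates acoustic parameters, points out the main problems in current intonation modeling, and proposes preliminary ideas for solving them.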

5.
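Visual speech parameter estimation plays an important role in visual speech research. Twenty-four parameters directly related to articulation are selected from the facial animation parameters (FAPs) defined in MPEG-4 to describe visual speech. Statistical learning is combined with a rule-based method: the probability distribution of facial color, together with prior knowledge of shape and edges, is used to track the lip contour and facial feature points, yielding fairly accurate tracking. After high-frequency noise in the reference-point tracking is filtered out, the four most prominent reference points on the face are used to estimate the main facial motion pose, which removes the effect of global motion. Finally, accurate visual speech parameters are computed from the motion of the facial feature points, and these have been used in practical applications.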

6.
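Research on prosodic phrase delimitation driven by structural particles   Total citations: 10 (self-citations: 5, citations by others: 5)
应宏, 蔡莲红. 《中文信息学报》 (Journal of Chinese Information Processing), 1999, 13(6): 42-46, 64
Improving the naturalness of synthesized speech is the core task of a Chinese text-to-speech system (CTTS), and prosodic phrase delimitation plays an important role in it. By analyzing the characteristics of function words, this paper studies the features and status of structural particles in continuous speech and their role in prosodic phrase delimitation, and obtains a corresponding set of rules and conclusions.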

7.
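A mouth-shape recognition method for Chinese speech recognition   Total citations: 1 (self-citations: 0, citations by others: 1)
Drawing on the physiological characteristics of mouth shapes in Chinese pronunciation, this work assists Chinese speech recognition at the phoneme level. It gives a method for mouth-shape recognition and gray-level statistics, along with its implementation. The experimental results largely agree with theoretical estimates: the five vowels are distinguished by mouth shape with an accuracy above 80%, providing a useful aid to acoustic speech recognition.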

8.
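A mouth-shape recognition method for Chinese speech recognition   Total citations: 3 (self-citations: 0, citations by others: 3)
钟晓, 周昌乐, 俞瑞钊. 《软件学报》 (Journal of Software), 1999, 10(2): 205-209
Drawing on the physiological characteristics of mouth shapes in Chinese pronunciation, this work assists Chinese speech recognition at the phoneme level. It gives a method for mouth-shape recognition and gray-level statistics, along with its implementation. The experimental results largely agree with theoretical estimates: the five vowels are distinguished by mouth shape with an accuracy above 80%, providing a useful aid to acoustic speech recognition.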

9.
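Selection of coarticulation units of different lengths in the speech database of a text-to-speech system   Total citations: 1 (self-citations: 0, citations by others: 1)
Against the background of a text-to-speech system we developed ourselves, and based on the characteristics of Mandarin Chinese, this paper studies, from the standpoint of handling coarticulation, the selection of disyllabic and trisyllabic words for the system's speech database, as well as the selection of coarticulation-related single syllables. Applying the experimental results to the system improved the naturalness of the synthesized speech.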

10.
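Design and implementation of a Chinese text-to-speech system under Windows   Total citations: 1 (self-citations: 0, citations by others: 1)
This paper briefly introduces Chinese text-to-speech conversion and focuses on the research and design of a Chinese text-to-speech system under Windows. The system performs unlimited-vocabulary Chinese text-to-speech conversion and lets the computer output a continuous, natural speech stream.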

11.
Many children with speech sound disorders cannot pronounce the sibilant consonants correctly. We have developed a serious game, controlled by the children's voices in real time, to help children practice producing European Portuguese (EP) sibilant consonants. For this, the game uses a sibilant consonant classifier. Since the game does not require any type of adult supervision, children can practice producing these sounds more often, which may lead to faster improvements in their speech. Recently, deep neural networks have brought considerable improvements to classification in a variety of use cases, from image classification to speech and language processing. Here, we propose to use deep convolutional neural networks to classify sibilant phonemes of EP in our serious game for speech and language therapy. We compared the performance of several different artificial neural networks that used Mel frequency cepstral coefficients or log Mel filterbanks. Our best deep learning model, a 2D convolutional model with log Mel filterbanks as input features, achieves a classification score of 95.48%. These results are then further improved for specific classes with simple binary classifiers.

12.
Computer-aided pronunciation training (CAPT) technologies enable the use of automatic speech recognition to detect mispronunciations in second language (L2) learners' speech. In order to further facilitate learning, we aim to develop a principle-based method for generating a gradation of the severity of mispronunciations. This paper presents an approach towards gradation that is motivated by auditory perception. We have developed a computational method for generating a perceptual distance (PD) between two spoken phonemes. This is used to compute the auditory confusion of the native language (L1). PD is found to correlate well with the mispronunciations detected in a CAPT system for Chinese learners of English, i.e., L1 being Chinese (Mandarin and Cantonese) and L2 being US English. The results show that auditory confusion is indicative of pronunciation confusions in L2 learning. PD can also be used to help us grade the severity of errors (i.e., mispronunciations that confuse more distant phonemes are more severe) and accordingly prioritize the order of corrective feedback generated for the learners.

13.
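A survey of text-to-visual speech synthesis   Total citations: 3 (self-citations: 1, citations by others: 2)
Visual information is very important for understanding speech content. Not only people with hearing impairments but also people with normal hearing lip-read to some degree during conversation, especially in noisy environments where speech quality is degraded. Just as a text-to-speech system lets a computer talk like a person, a text-to-visual speech synthesis system lets a computer simulate the bimodal nature of human speech and makes the computer interface friendlier. This paper reviews the development of text-to-visual speech synthesis. Text-driven visual speech synthesis methods fall into two classes: parameter-control methods and data-driven methods. The paper describes in detail several key issues in the parameter-control class and several different implementations in the data-driven class, and compares the advantages, disadvantages, and suitable settings of the two classes.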

14.
In this paper, we report our development of context-dependent allophonic hidden Markov models (HMMs) implemented in a 75 000-word speaker-dependent Gaussian-HMM recognizer. The context explored is the immediate left and/or right adjacent phoneme. To achieve reliable estimation of the model parameters, phonemes are grouped into classes based on their expected co-articulatory effects on neighboring phonemes. Only five separate preceding and following contexts are identified explicitly for each phoneme. By grouping the contexts we ensure that they occur frequently enough in the training data to allow reliable estimation of the parameters of the HMM representing the context-dependent units. Further improvement in the estimation reliability is obtained by tying the covariance matrices in the HMM output distributions across all contexts. Speech recognition experiments show that when a large amount of data (e.g. over 2500 words) is used to train context-dependent HMMs, the word recognition error rate is reduced by 33%, compared with the context-independent HMMs. For smaller amounts of training data the error reduction becomes less significant.

15.
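Following modular design principles, object-oriented ideas are applied to the development of virtual instruments. Test objects and their operations are abstracted as instances of corresponding classes, and the attributes and operations that characterize these real-world objects are then encapsulated. Using this approach, a virtual test system for drilling fluid parameters was developed with Visual C++ 6.0 under Windows on a hardware platform based on a data acquisition card, including encapsulated designs for the acquisition module, the objects under test, the test data, and various auxiliary modules. Engineering practice shows that the object-oriented approach simplifies the design process, opens up the instrument architecture, and makes the system easier to analyze and port.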

16.
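The two-factor Gaussian process latent variable model (TF-GPLVM) previously proposed by the authors does not account for the dynamic features of speech when used for voice conversion, and requires many parameters to be estimated during training. To address these problems, a hidden Markov model (HMM) is introduced to model the dynamic features of speech, and the HMM hidden states are used to perform a probabilistic soft classification of each speech frame with respect to semantic content. This yields a two-factor Gaussian process dynamic model (TF-GPDM) with higher separation accuracy and a lower computational load. Based on this model, a new vocal tract spectrum conversion scheme based on speaker feature replacement is designed. Subjective and objective experiments show that, compared with both traditional statistical mapping and frequency warping methods and the TF-GPLVM method, the proposed method improves speech quality and conversion similarity and achieves a better balance between the two.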

17.
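Objective: Traditional zero-shot learning (ZSL) aims to classify data from unseen classes based on data from seen classes and related auxiliary information, whereas in generalized zero-shot learning (GZSL) the class to be predicted may be either seen or unseen, which better matches real-world applications. In generative-model-based GZSL, the original and generated features do not necessarily encode the semantically relevant information referred to by shared attributes, which biases the model toward seen classes, and classification ignores useful feature-related information in the semantics. To disentangle the relevant visual features and semantic information, a visual-semantic dual disentanglement framework is proposed. Method: First, a conditional variational autoencoder generates visual features for unseen classes, and a feature disentanglement module decomposes them into semantic-consistent and semantic-unrelated features. Then, a semantic disentanglement module decomposes the semantic information into feature-related and feature-unrelated semantics. A total correlation penalty ensures independence between the two decomposed components; the feature disentanglement module uses a relation network to measure the semantic consistency of the decomposition, and the semantic disentanglement module uses cross-modal cross-reconstruction to ensure feature relevance. Finally, the semantic-consistent features and feature-related semantic information separated by the two modules are used to jointly learn a GZSL classifier. Results: Experiments on four generalized...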

18.
This article presents a cross-lingual study for Hungarian and Finnish on the segmentation of continuous speech at the word and phrase level by examining supra-segmental parameters. A word-level segmenter has been developed which can indicate word boundaries with acceptable precision for both languages. The ultimate aim is to increase the robustness of speech recognition at the language modelling level by detecting word and phrase boundaries, which can significantly reduce the search space during decoding. Search space reduction is highly important in the case of agglutinative languages. In Hungarian and in Finnish, if stress is present, it always falls on the first syllable of the stressed word. Thus, if stressed syllables can be detected, they must be at the beginning of a word. We have developed different algorithms based on either a rule-based or a data-driven approach. The rule-based algorithms and HMM-based methods are compared. The best results were obtained by data-driven algorithms using the time series of fundamental frequency and energy together. Syllable length was found to be much less effective and was therefore discarded. Using supra-segmental features, word boundaries can be marked with high accuracy, even if not all of them can be found. The method we evaluated is easily adaptable to other fixed-stress languages. To investigate this, we adapted our data-driven method to the Finnish language and obtained similar results.
