This correspondence presents a speech sentence compression scheme. A compressed word sequence is first extracted; the speech segments in the spoken document that correspond to the extracted words are then selected and concatenated. Evaluation of the proposed approach shows that the compressed speech sentence retains important and meaningful information as well as naturalness.

The synthesis of talking heads has been a flourishing research area over the last few years. Since human beings have an uncanny ability to read people's faces, most related applications (e.g., advertising, video-teleconferencing) require absolutely realistic photometric and behavioral synthesis of faces. This paper proposes a person-specific facial synthesis framework that allows high realism and includes a novel way to control visual emphasis (e.g., the level of exaggeration of visible articulatory movements of the vocal tract). There are three main contributions: a geodesic interpolation with visual unit selection, a parameterization of visual emphasis, and the design of minimum-size corpora. Perceptual tests with human subjects indicate high realism, with perceptual scores similar to those of real samples. Furthermore, the visual emphasis level and the two communication styles show a statistical interaction.

Realism is often a primary goal in computer graphics imagery, and we strive to create images that are perceptually indistinguishable from an actual scene. Rendering systems can now closely approximate the physical distribution of light in an environment. However, physical accuracy does not guarantee that the displayed images will have authentic visual appearance. In recent years the emphasis in realistic image synthesis has begun to shift from the simulation of light in an environment to images that look as real as the physical environment they portray. In other words the computer image should be not only physically correct but also perceptually equivalent to the scene it represents. This implies aspects of the Human Visual System (HVS) must be considered if realism is required. Visual perception is employed in many different guises in graphics to achieve authenticity. Certain aspects of the visual system must be considered to identify the perceptual effects that a realistic rendering system must achieve in order to reproduce effectively a similar visual response to a real scene. This paper outlines the manner in which knowledge about visual perception is increasingly appearing in state‐of‐the‐art realistic image synthesis. After a brief overview of the HVS, this paper is organized into four sections, each exploring the use of perception in realistic image synthesis, each with slightly different emphasis and application. First, Tone Mapping Operators, which attempt to map the vast range of computed radiance values to the limited range of display values, are discussed. Then perception based image quality metrics, which aim to compare images on a perceptual rather than physical basis, are presented. These metrics can be used to evaluate, validate and compare imagery. Thirdly, perception driven rendering algorithms are described. These algorithms focus on embedding models of the HVS directly into global illumination computations in order to improve their efficiency. 
Finally, techniques for comparing computer graphics imagery against the real world scenes they represent are discussed.

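Speech recognition systems based on hybrid language models can recognize out-of-vocabulary (OOV) words, but their recognition accuracy for OOV words is far lower than for in-vocabulary words. To further improve the performance of hybrid speech recognition systems, this paper proposes a multi-system fusion method based on complementary acoustic models. First, two hybrid speech recognition systems based on hidden Markov models and deep neural networks (HMM-DNN) are built using different acoustic modeling units. Then, exploiting the correlation between the two recognition tasks, multi-task learning (MTL-DNN) is adopted so that the input layer and hidden layers of the DNN are shared, and joint training improves modeling accuracy. Finally, the ROVER (Recognizer Output Voting Error Reduction) method is used to fuse the outputs of the two systems. Experimental results show that, compared with single-task learning DNN (STL-DNN) modeling, MTL-DNN achieves better recognition performance, and fusing the outputs of the two systems further reduces the word error rate.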
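
The fusion step can be illustrated with a simplified voting sketch. This is not the ROVER implementation used in the paper: full ROVER first aligns the recognizer outputs into a word transition network by dynamic programming, whereas the sketch below assumes the hypotheses are already aligned slot by slot. The function name `vote_aligned`, the confidence values, and the example words are all hypothetical.

```python
from collections import defaultdict


def vote_aligned(hypotheses):
    """Fuse pre-aligned hypotheses by confidence-weighted voting.

    hypotheses: one list per recognizer, all the same length. Each item is a
    (word, confidence) pair; word=None marks "no word in this slot".
    """
    n_slots = len(hypotheses[0])
    if any(len(hyp) != n_slots for hyp in hypotheses):
        raise ValueError("hypotheses must be aligned to the same number of slots")

    fused = []
    for slot in range(n_slots):
        scores = defaultdict(float)
        for hyp in hypotheses:
            word, confidence = hyp[slot]
            scores[word] += confidence  # accumulate support for each candidate
        best = max(scores, key=scores.get)
        if best is not None:  # a winning None leaves the slot empty
            fused.append(best)
    return fused


# Hypothetical outputs of two recognizers, already aligned into three slots.
system_a = [("the", 0.9), ("cat", 0.4), ("sat", 0.8)]
system_b = [("the", 0.8), ("cap", 0.7), (None, 0.3)]
print(vote_aligned([system_a, system_b]))  # ['the', 'cap', 'sat']
```

With only two systems, plain majority voting cannot break disagreements, which is why the sketch weights each vote by a confidence score.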

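Emotional speech synthesis can enhance the expressiveness of speech. To make synthesized emotional speech more natural, an emotional speech synthesis method combining time-domain pitch-synchronous overlap-add (PSOLA) with the discrete cosine transform (DCT) is proposed. Prosodic parameters of happy, sad, and neutral speech in an emotional speech database are analyzed to derive emotion rules, which are used to adjust the pitch frequency, energy, and duration of each syllable of neutral speech. The DCT is used to adjust the pitch frequency of pitch-marked speech segments, and the PSOLA algorithm then modifies the pitch frequency so that it approaches the fundamental frequency of the target emotional speech. Experimental results show that, compared with emotional speech synthesized by PSOLA alone, the proposed method yields speech with stronger emotional coloring, a higher subjective emotion recognition rate, and better quality.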

Visual Speech Synthesis by Morphing Visemes (cited 6 times: 0 self-citations, 6 by others)
We present MikeTalk, a text-to-audiovisual speech synthesizer which converts input text into an audiovisual speech stream. MikeTalk is built using visemes, which are a small set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject which is specifically designed to elicit one instantiation of each viseme. Using optical flow methods, correspondence from every viseme to every other viseme is computed automatically. By morphing along this correspondence, a smooth transition between viseme images may be generated. A complete visual utterance is constructed by concatenating viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer is exploited to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a photorealistic talking face.

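This paper proposes a method for locating the facial visual speech feature region using a support vector machine (SVM) combined with the distance from face space (DFFS) criterion derived from principal component analysis (PCA). The results are compared with those obtained by combining the DFFS criterion with Fisher discriminant analysis (FDA), the linear discriminant method based on the traditional Fisher criterion. With a limited number of samples, the SVM-DFFS method shows some advantage over the traditional linear FDA-DFFS method. The sample data used in the experiments come from the Chinese audio-visual bimodal database (CAVSRv1.0) of the Institute of Acoustics, Chinese Academy of Sciences.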

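To improve the naturalness and stability of speech synthesis, a speech synthesis method that combines HMMs with deep neural networks is proposed, with Uyghur as the experimental language. End-to-end speech synthesis based on deep learning is slow to generate and lacks stability and controllability, but produces highly natural speech; HMM-based methods are stable, but their synthesized speech is less natural than that of end-to-end methods. The front end of the system therefore uses an HMM (hidden Markov model) to extract the intrinsic linguistic features of Uyghur, and the back-end synthesis stage builds an autoregressive model within a deep neural network framework. With HMM-based text analysis in the front end, several neural network models were tried in the back end and compared experimentally, and the results were evaluated. The experiments show that the HMM+BiLSTM speech synthesis method performs best.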

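Speech synthesis technology is maturing. To improve the quality of synthesized emotional speech, a method combining end-to-end emotional speech synthesis with prosody correction is proposed. Prosodic parameters of emotional speech synthesized by a Tacotron model are modified to improve the emotional expressiveness of the synthesis system. First, the Tacotron model is trained on a large neutral corpus and then on a small emotional corpus, so that it synthesizes speech with emotion. Next, the Praat acoustic analysis tool is used to analyze the prosodic features of the emotional speech in the corpus and to summarize the parameter patterns under different emotional states. Finally, these patterns are used to correct the fundamental frequency, duration, and energy of the corresponding emotional speech synthesized by Tacotron, making the emotional expression more precise. Objective emotion recognition experiments and subjective evaluation show that the method synthesizes emotional speech that is fairly natural and more expressive.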

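Realistic graphics is an important part of the Computer Graphics course and a highly practical teaching component. Reflecting on specific problems encountered in teaching practice, the paper argues that courseware for teaching realistic rendering is needed, and describes how such courseware was built with mixed OpenGL and Cg (C for Graphics) programming. Since the courseware was introduced into teaching, students who have learned the basic principles of realistic graphics can also follow the latest developments in graphics technology; their hands-on skills have generally improved, and teaching effectiveness has improved noticeably.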

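Infrared and visible image fusion is an effective way to obtain high-quality target images in complex environments and is widely used in fields such as object detection and face recognition. Traditional infrared and visible image fusion methods do not make full use of the key information in the images, so the fused images have poor visual quality and lose background detail. To address this problem, an end-to-end fusion method based on attention and residual cascading is proposed. The source images are fed into a generator, a hierarchical feature extraction module extracts their hierarchical features, and a decoder with U-net connections fuses these features to produce an initial fused image. The generator is trained adversarially against a discriminator that takes a pre-fused image as input, and a detail loss function is used to optimize the generator and supply information missing from the fused image. In addition, spectral normalization is applied in the discriminator to stabilize training of the generative adversarial network. Experimental results show that the method achieves an information entropy of 7.1182, a standard deviation of 46.6292, a mutual information of 14.2363, and a spatial frequency of 20.321; compared with fusion methods such as FusionGAN, LP, and STDFusionNet, it extracts the information in the source images more fully, and the resulting images have better visual quality and image quality.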
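
Three of the reported metrics (information entropy, standard deviation, and spatial frequency) have common textbook definitions, sketched below for a grayscale image stored as a list of rows. The abstract does not state which exact definitions or normalizations were used, so the formulas here are assumptions, in particular the division by the full pixel count in the spatial frequency. The function names and the tiny test image are hypothetical.

```python
import math
from collections import Counter


def entropy(image):
    """Shannon entropy, in bits, of the gray-level histogram."""
    pixels = [p for row in image for p in row]
    total = len(pixels)
    counts = Counter(pixels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def standard_deviation(image):
    """Population standard deviation of the pixel values."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def spatial_frequency(image):
    """sqrt(RF^2 + CF^2) from horizontal and vertical first differences."""
    rows, cols = len(image), len(image[0])
    row_sq = sum((image[i][j] - image[i][j - 1]) ** 2
                 for i in range(rows) for j in range(1, cols))
    col_sq = sum((image[i][j] - image[i - 1][j]) ** 2
                 for i in range(1, rows) for j in range(cols))
    # Assumed normalization: both sums are divided by the pixel count rows * cols.
    return math.sqrt((row_sq + col_sq) / (rows * cols))


# Hypothetical 2x2 image with two vertical stripes.
stripes = [[0, 255], [0, 255]]
print(entropy(stripes), standard_deviation(stripes), spatial_frequency(stripes))
```

Higher values of all three are usually read as more information, more contrast, and more fine detail in the fused image, respectively.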

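Concatenative synthesis systems based on large corpora often suffer from mismatches between concatenated units, and mismatches at voiced joins in particular degrade the synthesized speech considerably. This paper proposes a smoothing algorithm based on time-domain unit fusion. It selects a suitable transition-segment template as the fusion unit by template matching, performs phase alignment at the same time, and then uses TD-PSOLA to fuse the concatenated units and the fusion unit by pitch-synchronous overlap-add in the time domain. The algorithm causes very little loss of sound quality and, because it operates directly in the time domain, is efficient. Comparative evaluation of spectrograms and of subjective listening impressions before and after smoothing shows a clear improvement.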

This paper deals with the problem of modelling the dynamics of articulation for a parameterised talking head based on phonetic input. Four different models are implemented and trained to reproduce the articulatory patterns of a real speaker, based on a corpus of optical measurements. Two of the models (Cohen-Massaro and Öhman) are based on coarticulation models from speech production theory and two are based on artificial neural networks, one of which is specially intended for streaming real-time applications. The different models are evaluated through comparison between predicted and measured trajectories, which shows that the Cohen-Massaro model produces trajectories that best match the measurements. A perceptual intelligibility experiment is also carried out, where the four data-driven models are compared against a rule-based model as well as an audio-alone condition. Results show that all models give significantly increased speech intelligibility over the audio-alone case, with the rule-based model yielding the highest intelligibility score.
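
The idea behind the Cohen-Massaro model can be sketched as follows: each phonetic segment has a target value for an articulatory parameter and a dominance function that decays with temporal distance from the segment, and the trajectory is the dominance-weighted average of the targets. The sketch is a simplification under stated assumptions: it uses one symmetric decay rate per segment, and the parameter values (`alpha`, `theta`, `c`), segment times, and targets are hypothetical rather than taken from the paper.

```python
import math


def dominance(t, center, alpha=1.0, theta=10.0, c=1.0):
    """Negative-exponential dominance of a segment centred at `center`."""
    return alpha * math.exp(-theta * abs(t - center) ** c)


def blend(t, segments):
    """Dominance-weighted average of segment targets at time t.

    segments: list of (center_time_in_seconds, target_value) pairs.
    """
    weights = [dominance(t, center) for center, _ in segments]
    total = sum(weights)
    return sum(w * target for w, (_, target) in zip(weights, segments)) / total


# Hypothetical lip-opening targets for two consecutive segments.
segments = [(0.1, 0.0), (0.3, 1.0)]
trajectory = [blend(step / 100, segments) for step in range(0, 41, 10)]
print(trajectory)
```

Because neighbouring segments always contribute some weight, the trajectory moves smoothly between targets instead of jumping, which is how the model expresses coarticulation.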

A Speech Synthesis and Analysis Method Based on the FD-PSOLA Algorithm (cited 3 times: 0 self-citations, 3 by others)
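A method for modifying the prosodic features of Chinese speech based on the FD-PSOLA algorithm is presented. During frequency-domain modification of the short-time signal, homomorphic filtering separates the spectral envelope from the excitation spectrum, and the excitation spectrum is compressed or stretched by rescaling the frequency axis. Experimental results show that FD-PSOLA is better suited than TD-PSOLA to speech analysis and synthesis that requires a wide range of frequency adjustment.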
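
The frequency-axis rescaling can be illustrated on its own. The sketch below stretches or compresses a magnitude spectrum by linear interpolation between bins. It assumes the excitation spectrum has already been separated from the envelope (the homomorphic filtering step is not shown), and the function name `scale_frequency_axis` and the toy spectrum are hypothetical.

```python
def scale_frequency_axis(spectrum, factor):
    """Stretch (factor > 1) or compress (factor < 1) a magnitude spectrum.

    Output bin k takes the value found at position k / factor on the original
    frequency axis, using linear interpolation between neighbouring bins.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(spectrum)
    scaled = []
    for k in range(n):
        src = k / factor  # position on the original axis
        lo = int(src)
        if lo >= n - 1:
            # At the last bin use it directly; beyond it there is no data.
            scaled.append(spectrum[-1] if src == n - 1 else 0.0)
            continue
        frac = src - lo
        scaled.append((1 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return scaled


# A single harmonic at bin 1 moves to bin 2 when the axis is stretched by 2.
excitation = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(scale_frequency_axis(excitation, 2.0))
```

Stretching the excitation spectrum raises the spacing of its harmonics and therefore the pitch, while the separately stored envelope keeps the formant structure unchanged.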

Xu Zhihang, Chen Bo, Zhang Hui, Yu Kai. Chinese Journal of Computers (计算机学报), 2022, 45(5): 1003-1017
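In speech synthesis, speaker adaptation with a small amount of user-recorded data has long faced one problem: how to raise the similarity of the synthesized voice without unduly lowering its naturalness. Existing adaptation methods, such as sentence-level and frame-level speaker embeddings, give low similarity when synthesizing the voices of speakers outside the training set. Adaptation methods that fine-tune a pretrained speech synthesis model on a small amount of user-recorded data can raise the similarity of the synthesized audio, but they also...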

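Pronunciation is a major difficulty for beginners learning English. In a non-English-speaking environment such as China, many primary school pupils lack tutoring from a qualified teacher after class and easily develop English pronunciation problems. This paper designs and develops EP Tutor, an English pronunciation tutoring system based on visual speech, which animates the face of a cartoon tutor to give pupils lively, friendly one-to-one pronunciation tutoring. The paper focuses on the design concept, the system architecture, the detailed design of some key functions, and the implementation of key technologies.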

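An audio-visual two-stream dynamic Bayesian network (DBN) speech recognition model based on articulatory features (AFAV_DBN) is constructed, and the conditional probability relations of its nodes are defined so that the states of the articulatory features can change asynchronously. Speech recognition experiments on an audio-visual speech database show that, by adjusting the asynchrony constraints between articulatory features, the AFAV_DBN model achieves a higher recognition rate than state-based synchronous and asynchronous DBN models and the audio-only single-stream model, and with respect to noise it also has...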

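Because the depth range of RGB-D cameras is limited and their depth data are susceptible to noise, a visual odometry method based on hybrid 3D-3D and 3D-2D motion estimation is proposed to improve accuracy. A 3D-3D model based on the RICP algorithm (RANSAC-ICP) is combined with a 3D-2D motion model, so that 2D feature points lacking depth information are included in the estimation; this makes full use of the image information and improves matching accuracy. Map information from both the keyframe and the previous frame is used in the iterative estimation, which increases the number of matched point pairs and provides more constraints. On top of this hybrid motion estimation, sparse bundle adjustment (SBA) is used to refine the pose estimates, giving high localization accuracy and low accumulated error. The method was validated on a mobile platform with a Kinect camera; offline and online experiments show that it runs in real time while effectively improving localization accuracy.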
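
The quantity that 3D-2D motion estimation and bundle adjustment typically minimize is the reprojection error: the pixel distance between an observed 2D feature and the projection of its 3D map point under the current pose estimate. The sketch below shows that residual for a pinhole camera. It is an illustration only, not the paper's estimator, and the function names, the intrinsic parameters, and the example pose are hypothetical.

```python
import math


def project(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point into pixel coordinates with a pinhole camera model."""
    # Camera-frame coordinates: X_c = R * X + t
    xc, yc, zc = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point is behind the camera")
    return fx * xc / zc + cx, fy * yc / zc + cy


def reprojection_error(point, pixel, rotation, translation, fx, fy, cx, cy):
    """Pixel distance between an observation and the projected map point."""
    u, v = project(point, rotation, translation, fx, fy, cx, cy)
    return math.hypot(u - pixel[0], v - pixel[1])


# Hypothetical pose (identity rotation, zero translation) and intrinsics.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin = [0.0, 0.0, 0.0]
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)

print(project((1.0, 0.0, 2.0), identity, origin, **intrinsics))
print(reprojection_error((1.0, 0.0, 2.0), (573.0, 244.0), identity, origin, **intrinsics))
```

Because this residual needs depth only for the map point and not for the observed feature, 2D features without a valid depth reading can still constrain the pose.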

