Similar Documents
20 similar documents found.
1.
This paper presents a photo-realistic facial animation synthesis approach based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchrony between articulatory features such as the lips, tongue and glottis/velum can be controlled. Perceptual Linear Prediction (PLP) features from the audio speech, together with active appearance model (AAM) features from the face images of an audio-visual continuous speech database, are used to train the AF_AVDBN model parameters. Given an input audio speech signal, the optimal AAM visual features are estimated from the trained model under a maximum likelihood estimation (MLE) criterion and then used to construct the face images for the animation. In our experiments, facial animations are synthesized for 20 continuous speech sentences using the proposed AF_AVDBN model as well as two state-of-the-art methods: the audio-visual state-synchronous DBN model (SS_DBN), which implements a multi-stream hidden Markov model, and the state-asynchronous DBN model (SA_DBN). Objective evaluations on the learned AAM features show that considerably more accurate visual features are obtained with the AF_AVDBN model. Subjective evaluations show that the facial animations synthesized with AF_AVDBN are better than those of the state-based SA_DBN and SS_DBN models in overall naturalness and in how accurately the mouth movements match the speech content.
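As a hedged illustration of the last step above, the sketch below shows how a decoded AAM parameter trajectory could be turned back into per-frame face-shape vectors with a linear AAM model; the function name, dimensions and random stand-in data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def reconstruct_aam(params, mean_vec, modes):
    """Linear AAM reconstruction: instance = mean + modes @ params.

    params   : (k,)   estimated AAM parameters for one frame
    mean_vec : (d,)   mean shape/appearance vector from AAM training
    modes    : (d, k) principal modes (eigenvectors) of variation
    """
    return mean_vec + modes @ params

# Hypothetical example: rebuild a face-shape vector for every frame
# of the MLE-decoded parameter trajectory (T frames, k parameters).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, T = 136, 12, 50                  # 68 landmarks (x, y), 12 modes, 50 frames
    mean_shape = rng.normal(size=d)
    shape_modes = rng.normal(size=(d, k))
    trajectory = rng.normal(size=(T, k))   # stand-in for the MLE-estimated AAM features
    frames = np.stack([reconstruct_aam(p, mean_shape, shape_modes) for p in trajectory])
    print(frames.shape)                    # (50, 136): one shape vector per animation frame
```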

2.
We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker-independent continuous speech. Unlike hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, the temporal dependency (synchrony), and the different speech classes of the two modalities. We derive an expectation-maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters into a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth seamlessly onto a background facial sequence. We have compared the animation performance of the CHMM with HMMs, multi-stream HMMs and factorial HMMs, both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs best. The proposed approach indicates that explicitly modeling audio-visual speech is promising for speech animation.
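The following sketch illustrates only the general idea behind HMM-style A/V conversion, using a deliberately simplified frame-wise posterior weighting of per-state visual means; it is not the paper's coupled-HMM EM algorithm (temporal transitions and audio-visual coupling are ignored), and all model parameters below are random stand-ins.

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_av_conversion(audio, audio_means, audio_covs, visual_means, priors):
    """Minimal A/V conversion: weight per-state visual means by the
    posterior probability of each state given the audio frame.
    (A stand-in for the CHMM/EM conversion; transitions are ignored.)
    """
    S = len(priors)
    lik = np.column_stack([
        multivariate_normal.pdf(audio, mean=audio_means[s], cov=audio_covs[s])
        for s in range(S)
    ])                                          # (T, S) frame likelihoods
    post = lik * priors
    post /= post.sum(axis=1, keepdims=True)     # state posteriors per frame
    return post @ visual_means                  # (T, visual_dim) expected visual params

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, da, dv, T = 4, 13, 6, 20                 # states, audio dim, visual dim, frames
    audio = rng.normal(size=(T, da))
    a_means = rng.normal(size=(S, da))
    a_covs = [np.eye(da) for _ in range(S)]
    v_means = rng.normal(size=(S, dv))
    priors = np.full(S, 1.0 / S)
    print(soft_av_conversion(audio, a_means, a_covs, v_means, priors).shape)  # (20, 6)
```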

3.
We propose a speech-driven facial animation synthesis method based on a state-asynchronous dynamic Bayesian network model (SA-DBN). Perceptual linear prediction features of the audio and active appearance model (AAM) features of the facial images in an audio-visual speech database are extracted to train the model parameters. For a given input speech signal, the corresponding optimal AAM feature sequence is learned by maximum likelihood estimation and used to synthesize the facial image sequence and the facial animation. Subjective evaluation of the synthesized facial animations shows that, compared with an audio-visually state-synchronous DBN model, the SA-DBN, by limiting the maximum degree of asynchrony between the acoustic and visual speech states, produces facial animation that is clear and natural and whose mouth movements are highly consistent with the input speech.

4.
Articulatory movement parameters describe the positions and movements of articulators such as the lips, tongue and palate during speech production. This paper studies the prediction of Chinese articulatory movement parameters given text and speech. First, an electromagnetic articulography (EMA) based method for acquiring and preprocessing articulatory data is designed, in which head-motion normalization and occlusal-plane normalization ensure the reliability of the articulatory parameters. Second, hidden Markov models are applied to Chinese articulatory parameter prediction: a two-stream model structure containing both acoustic and articulatory parameters realizes the mapping from acoustic features to articulatory features, and the effects of different context attributes, model clustering schemes and inter-stream dependency assumptions on prediction performance are analyzed and compared. Experimental results show that the best prediction performance is obtained with triphone models, separate clustering of the two streams, and inter-stream dependency taken into account.

5.
Animating expressive faces across languages (cited by 2)
This paper describes a morphing-based, audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. We present a novel scheme that makes the system language-independent while requiring a speech recognition system for only one language, in our case English. The method can also be used for text-to-audio-visual speech synthesis. Visemes in new expressions are synthesized so that animations with different facial expressions can be generated. Given an incoming audio stream and still pictures of a face representing different visemes, an animation sequence is constructed using optical flow between the visemes. The presented techniques improve the lip synchronization and naturalness of the animated video.
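A minimal sketch of the optical-flow morphing step, assuming OpenCV is available: it estimates dense flow between two viseme stills and blends an approximately warped version of the first image with the second. The viseme images, blend schedule and flow parameters are hypothetical, not the paper's settings.

```python
import cv2
import numpy as np

def flow_morph(img_a, img_b, alpha):
    """Warp viseme image A approximately toward viseme image B along the
    dense optical flow and cross-dissolve, producing one intermediate
    frame at blend factor alpha in [0, 1]."""
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + alpha * flow[..., 0]).astype(np.float32)
    map_y = (grid_y + alpha * flow[..., 1]).astype(np.float32)
    warped_a = cv2.remap(img_a, map_x, map_y, cv2.INTER_LINEAR)  # approximate warp of A
    return cv2.addWeighted(warped_a, 1.0 - alpha, img_b, alpha, 0)

# Hypothetical usage: interpolate 10 frames between two viseme stills.
# frames = [flow_morph(viseme_aa, viseme_oo, a) for a in np.linspace(0, 1, 10)]
```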

6.
An audio-visual two-stream speech recognition model based on articulatory features (cited by 1)
We build an audio-visual two-stream dynamic Bayesian network (DBN) speech recognition model based on articulatory features, defining the conditional probability relations of each node and the asynchrony constraints between the articulatory features. Speech recognition experiments are then carried out on an audio-visual connected-digit speech database, and the recognition performance under different signal-to-noise ratios is compared with that of audio-only and video-only single-stream DBN models. The results show that, at low signal-to-noise ratios, the articulatory-feature-based audio-visual two-stream model achieves the best recognition performance, and its recognition rate degrades only gently as the noise increases, indicating that the model is highly robust to noise and better suited to speech recognition in low-SNR environments.

7.
In this paper, we formulate the problem of synthesizing facial animation from an input audio sequence as a dynamic audio-visual mapping. We propose that the audio-visual mapping be modeled with an input-output hidden Markov model (IOHMM), an HMM whose output and transition probabilities are conditional on the input sequence. We train IOHMMs using the expectation-maximization (EM) algorithm with a novel architecture that explicitly models the relationship between the transition probabilities and the input using neural networks. Given an input sequence, the output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs can generate natural, good-quality facial animation sequences from the input audio.
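A toy sketch of the IOHMM idea follows: the transition matrix at each time step is a softmax of a linear function of the current input (a single weight tensor standing in for the paper's neural network), and the output is taken as the belief-weighted average of per-state output means. All shapes and parameters are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def iohmm_filter(inputs, W_trans, state_out_means, init_probs):
    """Toy IOHMM forward pass with input-conditional transitions.
    Returns the expected output sequence under the filtered state beliefs."""
    T = inputs.shape[0]
    belief = init_probs.copy()
    outputs = np.zeros((T, state_out_means.shape[1]))
    for t in range(T):
        # input-conditional transition matrix A_t[i, j] = P(s_t = j | s_{t-1} = i, u_t)
        logits = np.einsum('ijk,k->ij', W_trans, inputs[t])
        A_t = softmax(logits, axis=1)
        belief = belief @ A_t
        outputs[t] = belief @ state_out_means       # expected visual output at time t
    return outputs

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    S, du, dv, T = 3, 13, 6, 25
    inputs = rng.normal(size=(T, du))                 # e.g. audio features
    W_trans = rng.normal(scale=0.1, size=(S, S, du))  # stand-in for the trained network
    out_means = rng.normal(size=(S, dv))              # per-state visual means
    init = np.full(S, 1.0 / S)
    print(iohmm_filter(inputs, W_trans, out_means, init).shape)   # (25, 6)
```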

8.
This paper proposes a statistical parametric approach to a video-realistic, text-driven talking avatar. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audio-visual speech parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing simple geometric mouth shapes or video-realistic effects of the lip motion. Our approach uses the trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To achieve video-realistic effects with high fidelity, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly onto a whole-face image. Objective and subjective experiments show that the proposed approach produces natural facial animation.
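For the stitching step, OpenCV's `seamlessClone` provides an off-the-shelf Poisson image editing routine; the sketch below assumes this substitute rather than the authors' own implementation, and the mask and placement coordinates are hypothetical.

```python
import cv2
import numpy as np

def stitch_lower_face(synth_lower_face, background_frame, mouth_mask, center):
    """Blend a synthesized lower-face patch into a full-face background frame
    with Poisson image editing (gradient-domain blending) so that the seam
    between the generated region and the original video is not visible.
    `center` is the (x, y) position in the background where the patch goes."""
    return cv2.seamlessClone(synth_lower_face, background_frame,
                             mouth_mask, center, cv2.NORMAL_CLONE)

# Hypothetical usage with a rectangular mask covering the lower-face patch:
# mask = np.full(synth_lower_face.shape[:2], 255, dtype=np.uint8)
# frame = stitch_lower_face(synth_lower_face, background_frame, mask, (320, 400))
```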

9.
Audio/visual mapping with cross-modal hidden Markov models (cited by 1)
The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to the problem of speech recognition, can achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by their respective designers, but it remains unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of the different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences is then presented. Our results show that HMMI provides the best performance, on both synthetic and experimental audio-visual data.

10.
Speech-driven articulator motion synthesis based on deep neural networks (cited by 1)
唐郅, 侯进. 《自动化学报》, 2016, 42(6): 923-930
We implement a deep neural network based method for speech-driven articulator motion synthesis and apply it to speech-driven talking avatar animation. A deep neural network (DNN) learns the mapping between acoustic features and articulator position information; given input speech, the system estimates the articulator movement trajectories and renders them on a 3D virtual talker. First, an artificial neural network (ANN) and a DNN are compared under a range of parameter settings to obtain the best network; second, different context lengths of acoustic features are tested and the number of hidden units adjusted to find the best context length; finally, with the selected network structure, the articulator trajectories output by the DNN drive the articulator motion synthesis and produce the avatar animation. Experiments show that the animation synthesized by this method is efficient and realistic.
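A minimal sketch of such an acoustic-to-articulator DNN in PyTorch is given below; the context window length, layer sizes and articulator dimensionality are assumptions for illustration, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class Acoustic2Articulator(nn.Module):
    """Simple feed-forward DNN mapping a window of acoustic frames to the
    articulator positions of the centre frame (dimensions are placeholders)."""
    def __init__(self, context_frames=11, acoustic_dim=39, articulator_dim=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_frames * acoustic_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, articulator_dim),
        )

    def forward(self, x):                 # x: (batch, context_frames * acoustic_dim)
        return self.net(x)

if __name__ == "__main__":
    model = Acoustic2Articulator()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    x = torch.randn(32, 11 * 39)          # stand-in acoustic context windows
    y = torch.randn(32, 18)               # stand-in articulator coordinates (e.g. EMA)
    for _ in range(5):                    # a few illustrative training steps
        optim.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optim.step()
    print(model(x).shape)                 # torch.Size([32, 18])
```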

11.
In this paper we study the production and perception of speech in diverse conditions for the purposes of accurate, flexible and highly intelligible talking-face animation. We recorded audio, video and facial motion capture data of a talker uttering a set of 180 short sentences under three conditions: normal speech (in quiet), Lombard speech (in noise), and whispering. We then produced an animated 3D avatar with similar shape and appearance to the original talker and used an error minimization procedure to drive the animated version of the talker so that it matched the original performance as closely as possible. In a perceptual intelligibility study with degraded audio, we then compared the animated talker against the real talker and the audio alone, in terms of audio-visual word recognition rate across the three production conditions. We found that the visual intelligibility of the animated talker was on par with the real talker for the Lombard and whisper conditions. In addition, we created two incongruent conditions where normal speech audio was paired with animated Lombard speech or whispering. When compared to the congruent normal speech condition, Lombard animation yields a significant increase in intelligibility, despite the AV incongruence. In a separate evaluation, we gathered subjective opinions on the different animations and found that some degree of incongruence was generally accepted.

12.
Audio-visual bimodal speaker recognition based on dynamic Bayesian networks (cited by 4)
Dynamic Bayesian networks perform very well in describing complex stochastic processes with multiple channels. This work addresses audio-visual bimodal speaker recognition based on dynamic Bayesian networks. The hierarchical structure of joint audio-visual modeling is analyzed, dynamic Bayesian network models are built for the audio-visual relations at different levels, and speaker recognition experiments are conducted with these models. Analysis of the modeling process at the different levels and of the speaker recognition results shows that dynamic Bayesian networks provide an effective way to model the temporal and feature-level correlations between audio and video, and improve speaker recognition performance under different speech signal-to-noise ratios.

13.
This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-context, nonlinear mapping between the audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio stream is converted into acoustic features, i.e., Mel-frequency cepstral coefficients (MFCCs), and its textual labels are also extracted. The visual stream, in particular the lower face region, is compactly represented by active appearance model (AAM) parameters, with which the shape and texture variations can be jointly modeled. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from the audio to the visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for the lower-face animation. To further improve the realism of the proposed talking head, the trajectory tiling method is adopted, using the DBLSTM-predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. We then stitch the selected lower-face image sequence back onto a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
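A minimal PyTorch sketch of the audio-to-AAM DBLSTM mapping is shown below; the number of layers, hidden size and feature dimensions are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class DBLSTMAudio2AAM(nn.Module):
    """Deep bidirectional LSTM that maps a sequence of MFCC frames to a
    sequence of AAM parameters (layer sizes and dimensions are assumptions)."""
    def __init__(self, mfcc_dim=39, aam_dim=30, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(mfcc_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, aam_dim)

    def forward(self, mfcc_seq):          # (batch, frames, mfcc_dim)
        h, _ = self.blstm(mfcc_seq)
        return self.proj(h)               # (batch, frames, aam_dim)

if __name__ == "__main__":
    model = DBLSTMAudio2AAM()
    mfcc = torch.randn(4, 200, 39)        # 4 utterances, 200 frames each
    aam_pred = model(mfcc)
    print(aam_pred.shape)                 # torch.Size([4, 200, 30])
```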

14.
A machine-learning-based speech-driven facial animation method (cited by 19)
Synchronizing speech with lip movements and facial expressions is one of the difficulties of facial animation. We combine clustering and machine learning to learn the synchronization between the speech signal and lip/facial motion, and apply it in a speech-driven facial animation system based on the MPEG-4 standard. On the basis of a large-scale synchronized audio-visual database, unsupervised clustering discovers basic patterns that effectively characterize facial motion, and neural network training then realizes a direct mapping from prosody-bearing acoustic features to these basic facial-motion patterns. This avoids the limited robustness of speech recognition, and the learned results can directly drive the face mesh. Finally, quantitative and qualitative evaluation methods for the speech-driven facial animation system are given. Experimental results show that machine-learning-based speech-driven facial animation not only solves the audio-visual synchronization problem and enhances the realism of the animation, but also, since the MPEG-4-based learning results are independent of the face model, can drive a variety of face models, including real video, 2D cartoon characters and 3D virtual faces.
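The sketch below illustrates the clustering-plus-mapping pipeline with scikit-learn: k-means stands in for the unsupervised discovery of basic facial-motion patterns and an MLP regressor for the audio-to-pattern mapping; all data, dimensions and the soft-weighting scheme are stand-ins, not the paper's exact design.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
# Stand-in data: per-frame facial motion vectors (e.g. MPEG-4 FAPs) and
# prosody-bearing acoustic features taken from a synchronized A/V corpus.
face_motion = rng.normal(size=(5000, 20))
audio_feats = rng.normal(size=(5000, 16))

# 1) Unsupervised clustering discovers basic facial-motion patterns.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(face_motion)

# 2) Soft targets: distance-based weights of each frame over the patterns.
d = kmeans.transform(face_motion)                 # (frames, clusters) distances
weights = np.exp(-d) / np.exp(-d).sum(axis=1, keepdims=True)

# 3) A neural network maps acoustic features directly to pattern weights.
mapper = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
mapper.fit(audio_feats, weights)

# 4) At synthesis time, predicted weights blend the cluster centroids
#    into a facial-motion vector that can drive the face mesh.
pred_motion = mapper.predict(audio_feats[:10]) @ kmeans.cluster_centers_
print(pred_motion.shape)                          # (10, 20)
```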

15.
We are interested in recovering aspects of the vocal tract's geometry and dynamics from speech, a problem referred to as speech inversion. Traditional audio-only speech inversion techniques are inherently ill-posed, since the same speech acoustics can be produced by multiple articulatory configurations. To alleviate the ill-posedness of the audio-only inversion process, we propose an inversion scheme that also exploits visual information from the speaker's face. The complex audiovisual-to-articulatory mapping is approximated by an adaptive piecewise linear model. Model switching is governed by a Markovian discrete process which captures articulatory dynamic information. Each constituent linear mapping is effectively estimated via canonical correlation analysis. In the described multimodal context, we investigate alternative fusion schemes which allow interaction between the audio and visual modalities at various synchronization levels. For facial analysis, we employ active appearance models (AAMs) and demonstrate fully automatic face tracking and visual feature extraction. Using the AAM features in conjunction with audio features such as Mel-frequency cepstral coefficients (MFCCs) or line spectral frequencies (LSFs) leads to effective estimation of the trajectories followed by certain points of interest in the speech production system. We report experiments on the QSMT and MOCHA databases, which contain audio, video, and electromagnetic articulography data recorded in parallel. The results show that exploiting both the audio and visual modalities in a multistream hidden Markov model based scheme clearly improves performance relative to either audio-only or visual-only estimation.
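A hedged sketch of one constituent linear mapping estimated by canonical correlation analysis is given below, using scikit-learn's CCA on synthetic stand-in data; the Markovian switching between the piecewise mappings described in the paper is not included.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
# Stand-in data: fused audio-visual features (e.g. MFCCs + AAM parameters)
# and articulator point trajectories (e.g. EMA coil coordinates).
audiovisual = rng.normal(size=(2000, 50))
articulatory = 0.3 * audiovisual[:, :12] + 0.1 * rng.normal(size=(2000, 12))

# One constituent linear mapping of the piecewise model, estimated by CCA;
# in the paper, a Markovian discrete process switches among several such maps.
cca = CCA(n_components=8)
cca.fit(audiovisual[:1500], articulatory[:1500])
predicted = cca.predict(audiovisual[1500:])       # linear estimate of articulator positions

rmse = np.sqrt(np.mean((predicted - articulatory[1500:]) ** 2))
print(predicted.shape, round(float(rmse), 3))
```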

16.
Looking at the speaker's face can help a listener hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of the visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated with statistical tools, into audio blind source separation (BSS) techniques. This algorithm is applied to the difficult and realistic case of convolutive mixtures. The algorithm works mainly in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model, which is then used to solve the standard source permutation and scale-factor ambiguities encountered for each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 x 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures.
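The sketch below illustrates only the permutation-alignment step that follows per-frequency separation: candidate source orderings in each frequency bin are scored by envelope correlation against reference envelopes which, in the spirit of the paper, could be derived from the visual stream. The data, array shapes and brute-force permutation search are illustrative assumptions, not the paper's statistical model.

```python
import numpy as np
from itertools import permutations

def align_permutations(sep, reference):
    """Resolve the per-frequency source permutation after frequency-domain BSS.

    sep       : (freqs, sources, frames) magnitude envelopes of the separated
                per-frequency components (order is arbitrary in each bin)
    reference : (sources, frames) reference envelopes for each source, e.g.
                derived from the visual speech stream of the target speaker
    Returns sep with a consistent source ordering across frequency bins.
    """
    F, S, _ = sep.shape
    aligned = np.empty_like(sep)
    for f in range(F):
        best, best_score = None, -np.inf
        for perm in permutations(range(S)):
            # total correlation between the candidate ordering and the references
            score = sum(np.corrcoef(sep[f, p], reference[s])[0, 1]
                        for s, p in enumerate(perm))
            if score > best_score:
                best, best_score = perm, score
        aligned[f] = sep[f, list(best)]
    return aligned

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    ref = np.abs(rng.normal(size=(2, 100)))            # e.g. lip-opening based envelopes
    sep = np.stack([ref[list(rng.permutation(2))] + 0.05 * rng.normal(size=(2, 100))
                    for _ in range(64)])                # 64 frequency bins, shuffled
    aligned = align_permutations(sep, ref)
    print(aligned.shape)                                # (64, 2, 100)
```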

17.
This paper presents a novel approach for generating realistic, speech-synchronized 3D facial animation that copes with anticipatory and perseveratory coarticulation. The methodology is based on measuring the 3D trajectories of fiduciary points marked on the face of a real speaker during the production of CVCV nonsense words. The trajectories are measured from standard video sequences using stereo-vision photogrammetric techniques. The first stationary point of each trajectory associated with a phonetic segment is selected as its articulatory target. By clustering all articulatory targets of the same segment in different phonetic contexts according to geometric similarity, a set of phonetic-context-dependent visemes accounting for coarticulation is identified. These visemes are then used to drive a set of geometric transformation/deformation models that reproduce the rotation and translation of the temporomandibular joint on the 3D virtual face, as well as the behavior of the lips, such as protrusion and the opening width and height of the natural articulation. This approach is being used to generate 3D speech-synchronized animation from both natural speech and synthetic speech generated by a text-to-speech synthesizer.

18.
A new method for driving lip animation directly from speech is proposed. Based on the mechanics of human lip motion, a physical lip model is built that captures the stretching of the lip-related muscles and the rotation of the jaw. The input speech signal is analyzed to extract features related to lip motion, and these features are mapped directly onto the control parameters of the physical lip model to drive the lip-shape deformation, achieving real-time synchronization between the input speech and the lip animation. Simulation results show that the method effectively synchronizes speech and lip shape in real time and yields lip animation that is more natural and realistic. Moreover, since the physical lip model is independent of the face geometry model, it can be applied widely to speech-driven lip animation for various face models, offering good generality and extensibility.

19.
The extraction of speech feature parameters is the premise and foundation of speech visualization in speech-driven facial animation. Focusing on speech-driven facial animation, this paper systematically studies the extraction of speech parameters. To improve parameter accuracy, the idea of reconstructing the original signal with the wavelet transform is introduced, and the parameters are extracted from the reconstructed signal, laying the foundation for building a good visual mapping model in the speech-driven facial animation system.
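A hedged sketch of the wavelet reconstruction step followed by feature extraction is given below, using PyWavelets and librosa; the wavelet, threshold rule and MFCC features are common choices assumed for illustration, not necessarily the parameters used in the paper.

```python
import numpy as np
import pywt
import librosa

def wavelet_reconstruct(signal, wavelet="db4", level=5):
    """Decompose the speech signal with the discrete wavelet transform,
    soft-threshold the detail coefficients, and reconstruct the signal
    before acoustic parameters are extracted from it."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745           # noise scale estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))
    coeffs = [coeffs[0]] + [pywt.threshold(c, thresh, mode="soft")
                            for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]

if __name__ == "__main__":
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    speech = np.sin(2 * np.pi * 220 * t) + 0.05 * np.random.randn(sr)  # stand-in signal
    clean = wavelet_reconstruct(speech)
    mfcc = librosa.feature.mfcc(y=clean.astype(np.float32), sr=sr, n_mfcc=13)
    print(mfcc.shape)        # (13, frames): parameters for the visual mapping model
```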

20.
Reproducing smooth vocal tract trajectories is critical for high-quality articulatory speech synthesis. This paper presents an adaptive neural control scheme for this task using fuzzy logic and neural networks. The control scheme estimates motor commands from trajectories of flesh points on selected articulators. These motor commands are then used to reproduce the trajectories of the underlying articulators in a second-order dynamical system. Initial experiments show that the control scheme is able to manipulate the mass-spring based elastic tract walls in a two-dimensional articulatory synthesizer and to realize efficient speech motor control. The proposed controller achieves high accuracy during online tracking of the lips, the tongue, and the jaw in the simulation of consonant-vowel sequences. It also offers salient features such as generality and adaptability for future development of control models in articulatory synthesis.
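The sketch below shows a second-order mass-spring-damper articulator driven by a simple PD control law to track a target flesh-point trajectory; the PD controller is only a stand-in for the paper's fuzzy/neural motor-command estimator, and all constants are illustrative.

```python
import numpy as np

def track_articulator(target, mass=1.0, k=80.0, c=12.0, kp=400.0, kd=30.0, dt=0.005):
    """Drive a 2nd-order mass-spring-damper articulator model so that its
    position follows a target flesh-point trajectory, using a PD control law
    as a stand-in for a learned motor-command estimator."""
    x, v = target[0], 0.0
    out = np.empty_like(target)
    for i, xt in enumerate(target):
        u = kp * (xt - x) + kd * (0.0 - v)          # motor command from tracking error
        a = (u - k * x - c * v) / mass              # spring + damper + command
        v += a * dt                                  # explicit Euler integration
        x += v * dt
        out[i] = x
    return out

if __name__ == "__main__":
    t = np.linspace(0, 1, 200)
    target = 0.5 * np.sin(2 * np.pi * 3 * t)        # stand-in lip-opening trajectory
    tracked = track_articulator(target)
    print(float(np.max(np.abs(tracked - target))))  # peak tracking error
```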
