Similar Literature
 20 similar documents found (search time: 31 ms)
1.
A Machine Learning Approach to Speech-Driven Face Animation   Total citations: 19 (self: 0, others: 19)
Synchronizing speech with lip movements and facial expressions is one of the difficult problems in face animation. This work combines clustering and machine learning to learn the synchronization relationship between speech signals and lip/facial motion, and applies it in a speech-driven face animation system based on the MPEG-4 standard. On top of a large-scale synchronized audio-visual database, unsupervised clustering discovers basic patterns that effectively characterize facial motion. A neural network is then trained to map prosody-bearing speech features directly onto these basic facial-motion patterns, which both sidesteps the limited robustness of speech recognition and yields results that can directly drive a face mesh. Finally, quantitative and qualitative evaluation methods for the speech-driven face animation system are given. Experimental results show that machine-learning-based speech-driven face animation not only effectively solves the audio-visual synchronization problem and enhances the realism of the animation, but also, because the MPEG-4-based learning results are independent of the face model, can drive a variety of face models, including real video, 2D cartoon characters, and 3D virtual faces.
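The pipeline sketched in the abstract above — unsupervised clustering to discover basic face-motion patterns, then a neural network mapping audio features onto them — can be illustrated with a toy example. All data, dimensions, and model sizes below are invented stand-ins, not the paper's actual setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for a large audio-visual corpus:
# 500 frames of 12-D audio features paired with 30-D face-motion vectors.
audio = rng.normal(size=(500, 12))
motion = rng.normal(size=(500, 30))

# 1) Discover K basic face-motion patterns by unsupervised clustering.
K = 8
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(motion)
patterns = km.cluster_centers_               # (K, 30) basic motion modes

# 2) Train a network to map audio features to pattern weights
#    (here: one-hot targets derived from the cluster labels).
targets = np.eye(K)[km.labels_]
net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=300,
                   random_state=0).fit(audio, targets)

# 3) Drive the face: predicted weights blend the basic patterns.
weights = net.predict(audio[:5])             # (5, K) soft pattern weights
frames = weights @ patterns                  # (5, 30) synthesized motion
print(frames.shape)
```

Because the network outputs pattern weights rather than phoneme labels, no speech recognizer sits in the loop, which is the robustness advantage the abstract claims.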

2.
This paper presents a virtual character animation system for real-time multimodal interaction in an immersive virtual reality setting. Human-to-human interaction is highly multimodal, involving features such as verbal language, tone of voice, facial expression, gestures and gaze. This multimodality means that, in order to simulate social interaction, our characters must be able to handle many different types of interaction, and many different types of animation, simultaneously. Our system is based on a model of animation that represents different types of animations as instantiations of an abstract function representation. This makes it easy to combine different types of animation. It also encourages the creation of behavior out of basic building blocks, making it easy to create and configure new behaviors for novel situations. The model has been implemented in Piavca, an open source character animation system.

3.
Extracting speech feature parameters is the prerequisite and foundation of speech visualization in speech-driven face animation. Focusing on speech-driven face animation technology, this paper systematically studies speech parameter extraction. To improve parameter accuracy, the idea of reconstructing the original signal with a wavelet transform is introduced, and parameters are extracted from the reconstructed signal, laying the foundation for a good visual mapping model in a speech-driven face animation system.

4.
Animating expressive faces across languages   Total citations: 2 (self: 0, others: 2)
This paper describes a morphing-based audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme is presented that implements a language-independent audio-driven facial animation system given a speech recognition system for just one language (in our case, English). The method presented here can also be used for text to audio-visual speech synthesis. Visemes in new expressions are synthesized to be able to generate animations with different facial expressions. An animation sequence using optical flow between visemes is constructed, given an incoming audio stream and still pictures of a face representing different visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.

5.
6.
This paper presents an experimental study on an agent system with multimodal interfaces for a smart office environment. The agent system is based upon multimodal interfaces such as recognition modules for both speech and pen-mouse gesture, and identification modules for both face and fingerprint. As essential modules, speech recognition and synthesis were used for virtual interaction between user and system. In this study, a real-time speech recognizer based on a Hidden Markov Network (HM-Net) was incorporated into the proposed system. In addition, identification techniques based on both face and fingerprint were adopted to provide a specific user with the service of a user-customized interaction with security in an office environment. In evaluation, results showed that the proposed system was easy to use and would prove useful in a smart office environment, even though the performance of the speech recognizer was not satisfactory, mainly due to noisy environments.

7.
A real-time speech-driven synthetic talking face provides an effective multimodal communication interface in distributed collaboration environments. Nonverbal gestures such as facial expressions are important to human communication and should be considered by speech-driven face animation systems. In this paper, we present a framework that systematically addresses facial deformation modeling, automatic facial motion analysis, and real-time speech-driven face animation with expression using neural networks. Based on this framework, we learn a quantitative visual representation of the facial deformations, called the motion units (MUs). A facial deformation can be approximated by a linear combination of the MUs weighted by MU parameters (MUPs). We develop an MU-based facial motion tracking algorithm which is used to collect an audio-visual training database. Then, we construct a real-time audio-to-MUP mapping by training a set of neural networks using the collected audio-visual training database. The quantitative evaluation of the mapping shows the effectiveness of the proposed approach. Using the proposed method, we develop the functionality of real-time speech-driven face animation with expressions for the iFACE system. Experimental results show that the synthetic expressive talking face of the iFACE system is comparable with a real face in terms of the effectiveness of their influences on bimodal human emotion perception.
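The motion-unit idea in the abstract above — a facial deformation approximated as a linear combination of MU basis vectors weighted by MUPs — reduces to a small matrix product. The toy basis and weights below are invented for illustration; real MUs are learned from motion capture data.

```python
import numpy as np

n_vertices = 4                      # toy face mesh: 4 vertices, 3-D each
mean_face = np.zeros(n_vertices * 3)

# Two invented motion units (think "jaw open", "lip corner pull"),
# each a displacement over all vertex coordinates.
MUs = np.array([
    [0, -1, 0,  0, -1, 0,  0, 0, 0,  0, 0, 0],   # MU 1
    [1,  0, 0, -1,  0, 0,  0, 0, 0,  0, 0, 0],   # MU 2
], dtype=float)

def deform(mups):
    """Facial deformation = mean face + sum_i MUP_i * MU_i."""
    return mean_face + np.asarray(mups, dtype=float) @ MUs

d = deform([0.5, 0.2])
print(d[:6])   # displacement of the first two vertices
```

A speech-to-animation mapping then only has to predict the low-dimensional MUP vector per frame, not every vertex position.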

8.
Design and Implementation of a 3D Speech-Animation Chat Room   Total citations: 1 (self: 1, others: 0)
Chat rooms are an important means of online communication, but owing to hardware and network-bandwidth constraints, the chat rooms in wide use today support only text and voice, not facial imagery. Building on a previously implemented Chinese speech-animation system based on SAPI 5.0, this work designs and implements a 3D speech-animation chat room that combines text, speech, and face animation. The chat room consists of a client and a server; multiple users connect to the server through the client. A user can enter text and select from various expressions, which the client combines into expression-tagged text and sends to the server. The server forwards the user's 3D face model together with the expression-tagged text to the recipient, whose client synthesizes expressive speech animation from them. The face model needs to be downloaded only the first time a sender transmits to a recipient; thereafter only expression-tagged text is sent. The method is computationally simple with low communication overhead, and can produce high-quality expressive speech animation on an ordinary PC.

9.
1. Introduction. Face modeling and animation is one of the most challenging topics in computer graphics. First, the geometry of the human face is extremely complex: its surface carries countless fine wrinkles and subtle variations in color and texture, so building an accurate face model and generating a photorealistic face is very difficult. Second, facial motion results from the combined action of bone, muscle, subcutaneous tissue, and skin, and its mechanics are very complex, so generating realistic facial animation is very difficult. In addition, we humans are born with an ability to recognize and…

10.
We propose a coupled hidden Markov model (CHMM) approach to video-realistic speech animation, which realizes realistic facial animations driven by speaker-independent continuous speech. Different from hidden Markov model (HMM)-based animation approaches that use a single-state chain, we use CHMMs to explicitly model the subtle characteristics of audio-visual speech, e.g., the asynchrony, temporal dependency (synchrony), and different speech classes between the two modalities. We derive an expectation maximization (EM)-based A/V conversion algorithm for the CHMMs, which converts acoustic speech into decent facial animation parameters. We also present a video-realistic speech animation system. The system transforms the facial animation parameters to a mouth animation sequence, refines the animation with a performance refinement process, and finally stitches the animated mouth with a background facial sequence seamlessly. We have compared the animation performance of the CHMM with the HMMs, the multi-stream HMMs and the factorial HMMs both objectively and subjectively. Results show that the CHMMs achieve superior animation performance. The ph-vi-CHMM system, which adopts different state variables (phoneme states and viseme states) in the audio and visual modalities, performs the best. The proposed approach indicates that explicitly modelling audio-visual speech is promising for speech animation.

11.
In this paper, we present a simple and robust mixed reality (MR) framework that allows for real-time interaction with virtual humans in mixed reality environments under consistent illumination. We will look at three crucial parts of this system: interaction, animation and global illumination of virtual humans for an integrated and enhanced presence. The interaction system comprises a dialogue module, which is interfaced with a speech recognition and synthesis system. Next to speech output, the dialogue system generates face and body motions, which are in turn managed by the virtual human animation layer. Our fast animation engine can handle various types of motions, such as normal key-frame animations, or motions that are generated on-the-fly by adapting previously recorded clips. Real-time idle motions are an example of the latter category. All these different motions are generated and blended on-line, resulting in a flexible and realistic animation. Our robust rendering method operates in accordance with the previous animation layer, based on a precomputed radiance transfer (PRT) illumination model extended for virtual humans, resulting in a realistic rendition of such interactive virtual characters in mixed reality environments. Finally, we present a scenario that illustrates the interplay and application of our methods, combined under a unified framework for presence and interaction in MR.

12.
Synchronizing speech-driven lip animation is one of the difficult problems in face animation. First, using the syllable as the recognition unit and a strict initial/final (shengmu/yunmu) modeling scheme, the HTK toolkit recognizes the syllable sequence and its timing information from a speech file. Then, a basic mouth-shape library and a syllable-to-mouth-shape mapping table yield the mouth-shape sequence corresponding to the syllable sequence. Finally, the mouth-shape sequence is played back with interpolation driven by its timing information, producing speech-driven lip animation. Experiments show that the method greatly reduces the number of models, accurately recognizes syllable sequences and their timing, and effectively synchronizes speech with lip movement.
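The playback stage in the abstract above — look up each recognized syllable's mouth shape, then interpolate between shapes using the syllable timings — can be sketched as follows. The mouth-shape library, mapping table, and timed syllable sequence are all invented toy data; a real system would take them from an HTK-based recognizer.

```python
import numpy as np

# Toy mouth-shape library: each shape is a 2-D parameter vector
# (e.g. mouth width, mouth openness).
mouth_shapes = {
    "closed": np.array([0.2, 0.0]),
    "a":      np.array([0.6, 0.9]),
    "i":      np.array([0.9, 0.2]),
    "u":      np.array([0.3, 0.4]),
}

# Syllable -> mouth-shape mapping table (invented entries).
syllable_to_shape = {"ma": "a", "li": "i", "bu": "u", "sil": "closed"}

# Recognized (syllable, start_time, end_time) triples, in seconds.
timeline = [("sil", 0.0, 0.1), ("ma", 0.1, 0.4), ("li", 0.4, 0.7)]

def mouth_at(t):
    """Linearly interpolate between the shapes of adjacent syllables."""
    starts = [start for _, start, _ in timeline]
    shapes = np.array([mouth_shapes[syllable_to_shape[s]]
                       for s, _, _ in timeline])
    return np.array([np.interp(t, starts, shapes[:, d])
                     for d in range(shapes.shape[1])])

print(mouth_at(0.25))
```

At t = 0.25 s the animation sits halfway between the "a" shape (starting at 0.1 s) and the "i" shape (starting at 0.4 s), so the result is their midpoint.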

13.
We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third-grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective evaluation and objective evaluation are conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.

14.
Synthesizing expressive facial animation is a very challenging topic within the graphics community. In this paper, we present an expressive facial animation synthesis system enabled by automated learning from facial motion capture data. Accurate 3D motions of the markers on the face of a human subject are captured while he/she recites a predesigned corpus, with specific spoken and visual expressions. We present a novel motion capture mining technique that "learns" speech coarticulation models for diphones and triphones from the recorded data. A phoneme-independent expression eigenspace (PIEES) that encloses the dynamic expression signals is constructed by motion signal processing (phoneme-based time-warping and subtraction) and principal component analysis (PCA) reduction. New expressive facial animations are synthesized as follows: First, the learned coarticulation models are concatenated to synthesize neutral visual speech according to novel speech input; then a texture-synthesis-based approach is used to generate a novel dynamic expression signal from the PIEES model; and finally the synthesized expression signal is blended with the synthesized neutral visual speech to create the final expressive facial animation. Our experiments demonstrate that the system can effectively synthesize realistic expressive facial animation.
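The PCA-reduction step behind the expression eigenspace described above can be illustrated with plain numpy. The "expression signals" here are synthetic random vectors; the paper derives them from time-warped, speech-subtracted motion capture data, and the dimensions are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
signals = rng.normal(size=(200, 15))     # 200 frames, 15 expression dims

# PCA via SVD of the mean-centered data.
mean = signals.mean(axis=0)
U, S, Vt = np.linalg.svd(signals - mean, full_matrices=False)
k = 4
eigenspace = Vt[:k]                      # leading expression eigenvectors

# Project one frame into the eigenspace and reconstruct it: new
# expression signals live in (and are generated from) this subspace.
coeffs = (signals[0] - mean) @ eigenspace.T
recon = mean + coeffs @ eigenspace
print(recon.shape)
```

The low-dimensional coefficients, not the raw marker motions, are what a texture-synthesis-style generator would then produce new trajectories over.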

15.
Emotive audio–visual avatars are virtual computer agents which have the potential of improving the quality of human-machine interaction and human-human communication significantly. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the various technical approaches of a novel multimodal framework leading to a text-driven emotive audio–visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Under the guidance of this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis system module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.

16.
This paper proposes a statistical parametric approach to a video-realistic text-driven talking avatar. We follow the trajectory HMM approach where audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized based on the maximum likelihood criterion. Previous trajectory HMM approaches only focus on mouth animation, which synthesizes simple geometric mouth shapes or video-realistic effects of the lip motion. Our approach uses trajectory HMMs to generate visual parameters of the lower face, and it realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To realize video-realistic effects with high fidelity, we use a Poisson image editing technique to stitch the synthesized lower-face image to a whole-face image seamlessly. Objective and subjective experiments show that the proposed approach can produce natural facial animation.

17.
Facial expression analogy provides computer animation professionals with a tool to map expressions of an arbitrary source face onto an arbitrary target face. In the recent past, several algorithms have been presented in the literature that aim at putting the expression analogy paradigm into practice. Some of these methods exclusively handle expression mapping between 3-D face models, while others enable the transfer of expressions between images of faces only. None of them, however, represents a more general framework that can be applied to either of these two face representations. In this paper, we describe a novel generic method for analogy-based facial animation that employs the same efficient framework to transfer facial expressions between arbitrary 3-D face models, as well as between images of performers' faces. We propose a novel geometry encoding for triangle meshes, vertex-tent-coordinates, that enables us to formulate expression transfer in the 2-D and the 3-D case as a solution to a simple system of linear equations. Our experiments show that our method outperforms many previous analogy-based animation approaches in terms of achieved animation quality, computation time and generality.

18.
Image-Based Virtual Landscape Touring   Total citations: 19 (self: 2, others: 17)
An image-based method for virtual landscape touring is proposed. The method builds the virtual environment from panoramas and image-based animation; through a real-time image-processing browser, users can simulate the two typical touring behaviors, looking around and walking through. The method has been implemented in a scenic-touring simulation system (Virtrual-Touring).

19.
We present a novel method for transferring speech animation recorded in low-quality videos to high-resolution 3D face models. The basic idea is to synthesize the animated faces by an interpolation based on a small set of 3D key face shapes which span a 3D face space. The 3D key shapes are extracted by an unsupervised learning process in 2D video space to form a set of 2D visemes, which are then mapped to the 3D face space. The learning process consists of two main phases: 1) isomap-based nonlinear dimensionality reduction to embed the video speech movements into a low-dimensional manifold, and 2) k-means clustering in the low-dimensional space to extract 2D key viseme frames. Our main contribution is that we use the isomap-based learning method to extract the intrinsic geometry of the speech video space and thus make it possible to define the 3D key viseme shapes. To do so, we need only capture a limited number of 3D key face models using a general 3D scanner. Moreover, we also develop a skull movement recovery method based on simple anatomical structures to enhance 3D realism in local mouth movements. Experimental results show that our method can achieve realistic 3D animation effects with a small number of 3D key face models.
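The two-phase learning process in the abstract above — Isomap embedding of the video frames followed by k-means to pick key viseme frames — maps directly onto standard scikit-learn components. The "video frames" here are synthetic random vectors standing in for real lip-region features; all sizes are invented.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
frames = rng.normal(size=(300, 50))          # 300 frames, 50-D features

# Phase 1: nonlinear dimensionality reduction onto a 3-D manifold.
embedded = Isomap(n_neighbors=10, n_components=3).fit_transform(frames)

# Phase 2: k-means in the embedded space; the frame nearest each
# cluster center serves as a 2-D key viseme.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(embedded)
key_idx = [int(np.argmin(np.linalg.norm(embedded - c, axis=1)))
           for c in km.cluster_centers_]
print(len(key_idx))
```

Only the frames indexed by `key_idx` would then need corresponding 3D scans, which is why the method gets by with a small number of captured 3D key face models.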

20.
林鑫, 陈桦, 王开志, 王继成. 《计算机工程》 (Computer Engineering), 2007, 33(17): 237-238
Ten basic mouth shapes are defined. Using Mel-frequency cepstral coefficients (MFCCs) as speech features, an SVM classifier recognizes the vowels a, i, and u; based on the corresponding quantized speech energy, the result is mapped to a mouth-shape sequence, which is then median-filtered to remove singular points. The algorithm performs well when applied in a speech-driven face animation system.
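The classify-then-smooth pipeline in the abstract above can be sketched as follows: an SVM classifies frames into the vowels a/i/u, and the resulting label track is median-filtered to suppress isolated outlier frames. The "MFCC" features below are synthetic random vectors; a real system would compute them from audio.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic 13-D "MFCC" frames for three vowel classes: 0=a, 1=i, 2=u.
X = np.vstack([rng.normal(loc=c, size=(100, 13)) for c in range(3)])
y = np.repeat([0, 1, 2], 100)

clf = SVC(kernel="rbf").fit(X, y)

# Classify a frame sequence, then median-filter the label track to
# remove isolated "singular points" before mapping to mouth shapes.
labels = clf.predict(X[::10])
smoothed = medfilt(labels.astype(float), kernel_size=5)
print(smoothed.shape)
```

The median filter only changes labels that disagree with their neighborhood, so legitimate vowel transitions survive while one-frame misclassifications are removed.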
