Similar Articles
20 similar articles found.
1.
To improve the realism of Chinese lip-synchronized facial animation videos, this paper proposes a text- and audio-driven facial animation generation technique based on an improved Wav2Lip model. First, a Chinese lip-sync dataset is constructed and used to pre-train the lip-sync discriminator, making its judgments on Chinese lip-synchronized facial animation more accurate. Then, text features are introduced into the Wav2Lip model to improve the temporal synchronization between lips and audio and thereby the realism of the animation videos. The model combines the extracted text, audio, and speaker-face information and, under the supervision of the pre-trained lip-sync discriminator and a video-quality discriminator, generates highly realistic lip-synchronized facial animation videos. Comparative experiments against the ATVGnet and Wav2Lip models show that the videos generated by the proposed model improve the synchronization between lip shape and audio and increase the overall realism of the facial animation. The results provide a solution to current demands for facial animation generation.
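
A minimal sketch of one way text features could be fused with per-frame audio features before a Wav2Lip-style generator. This is not the paper's actual architecture; all module names, dimensions, and the concatenate-then-project design are illustrative assumptions.

```python
# Hypothetical fusion module: concatenate a text embedding with per-frame
# audio features and project back to the generator's expected size.
import torch
import torch.nn as nn

class TextAudioFusion(nn.Module):
    def __init__(self, audio_dim=512, text_dim=256, out_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + text_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, audio_feat, text_feat):
        # audio_feat: (batch, frames, audio_dim) from an audio encoder
        # text_feat:  (batch, text_dim) from a text encoder, broadcast per frame
        text_feat = text_feat.unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        fused = torch.cat([audio_feat, text_feat], dim=-1)
        return self.proj(fused)   # (batch, frames, out_dim) fed to the generator

fusion = TextAudioFusion()
audio = torch.randn(2, 16, 512)   # dummy audio-encoder output
text = torch.randn(2, 256)        # dummy text-encoder output
print(fusion(audio, text).shape)  # torch.Size([2, 16, 512])
```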

2.
The goal of audio-driven face reenactment is to make a target subject reenact the corresponding expressions from an arbitrary input speech clip. Existing methods cannot achieve controllable, high-fidelity audio-driven face reenactment when trained only on a single video captured in natural conditions. To address this, an audio-driven face reenactment method based on LD features is proposed. First, faces in the input video are aligned, facial landmarks are detected, and LD features are extracted. Then, an audio feature extraction module maps the input audio to a 64-dimensional latent code; an encoder and decoder based on multilayer perceptrons are constructed, where the decoder decodes the latent code into LD features and the encoder maps them back to a latent representation. Next, the updated latent code is fed into a grid-based NeRF to obtain sample-point densities and colors, and the reenacted head RGB frame is produced by volume rendering; meanwhile, the pose is fed into a body deformation module to synthesize the body part of the reenacted frame. Experimental results show that the method generates high-fidelity reenactment results from the content of the input speech and allows personalized control of the target subject's facial expressions during reenactment.

3.
Rapidly extracting the complete lip contour from video frames is one of the primary tasks of a computer lip-reading system. This paper proposes a lip detection method that combines Red Exclusion and the Fisher transform. The face is first detected in the video using a skin-color model and motion correlation. Then, in RGB space, the red channel is excluded and the (G, B) components are used as the Fisher transform vector to enhance the lip image in the lower third of the face. Exploiting the fact that the grey values of the enhanced image follow an approximately normal distribution, skin-color and lip-color thresholds are determined adaptively, and the lips are segmented from the background. The method extracts the complete lip contour, runs fast, and is insensitive to illumination, beards, and speaker identity.
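
A minimal sketch of the Red-Exclusion idea: enhance the mouth region from the (G, B) channels only and threshold adaptively, assuming the enhanced grey values are roughly Gaussian. The projection weights below stand in for the trained Fisher-discriminant weights and the file name is hypothetical; both are assumptions, not values from the paper.

```python
import cv2
import numpy as np

def segment_lips(face_bgr, w_g=-0.8, w_b=0.6, k=1.5):
    h = face_bgr.shape[0]
    mouth = face_bgr[2 * h // 3:, :, :].astype(np.float32)   # lower third of the face
    b, g, _ = cv2.split(mouth)                               # exclude the red channel
    enhanced = w_g * g + w_b * b                              # Fisher-style 1-D projection (illustrative weights)
    enhanced = cv2.normalize(enhanced, None, 0, 255, cv2.NORM_MINMAX)
    mu, sigma = enhanced.mean(), enhanced.std()
    thresh = mu + k * sigma                                   # adaptive threshold under a normal model
    return (enhanced > thresh).astype(np.uint8) * 255

face = cv2.imread("face.jpg")          # hypothetical input face crop
if face is not None:
    mask = segment_lips(face)
    cv2.imwrite("lip_mask.png", mask)
```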

4.
A Machine-Learning-Based Method for Speech-Driven Facial Animation
Synchronizing speech with lip movement and facial expression is one of the difficulties of facial animation. Clustering and machine learning are combined to learn the synchronization relationship between the speech signal and lip/facial motion, and the result is applied in an MPEG-4-based speech-driven facial animation system. On a large-scale synchronized audio-visual database, unsupervised clustering discovers basic motion patterns that effectively characterize facial movement; a neural network is then trained to map prosody-bearing speech features directly onto these basic facial motion patterns. This not only avoids the limited robustness of speech recognition, but also yields results that can directly drive a face mesh. Finally, quantitative and qualitative evaluation methods for the speech-driven facial animation system are given. Experimental results show that machine-learning-based speech-driven facial animation effectively solves the audio-visual synchronization problem and enhances the realism and naturalness of the animation; moreover, since the learned MPEG-4-based results are independent of the face model, they can drive a variety of face models, including real video, 2D cartoon characters, and 3D virtual faces.
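
A minimal sketch of the two-step idea above: unsupervised clustering of facial motion vectors into basic motion patterns, then a small neural network mapping speech features to a pattern index. The data, feature dimensions, and number of clusters are dummy placeholders, not the paper's audio-visual database.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
face_motion = rng.normal(size=(5000, 30))     # per-frame facial motion features (placeholder)
speech_feat = rng.normal(size=(5000, 20))     # synchronized prosodic speech features (placeholder)

# 1) Discover basic facial motion patterns without supervision.
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(face_motion)
pattern_labels = kmeans.labels_

# 2) Learn a direct mapping from speech features to motion patterns.
mlp = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0)
mlp.fit(speech_feat, pattern_labels)

# At synthesis time the predicted pattern index selects a cluster centroid
# that can drive the face mesh directly.
predicted = mlp.predict(speech_feat[:5])
print(kmeans.cluster_centers_[predicted].shape)   # (5, 30)
```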

5.
To generate facial expression animation driven by video, a performance-driven 2D facial expression synthesis method is proposed. The active appearance model (AAM) algorithm is used to locate facial key points, from which the facial motion parameters are extracted. The face is partitioned into regions and several sample images of the target face are acquired. Interpolation coefficients for the sample images are derived from the facial motion parameters, and the sample images are linearly combined to synthesize the expression image of the target face. The method is simple, efficient, and produces realistic results, and can be applied to digital entertainment, video conferencing, and similar fields.
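
A minimal sketch of the synthesis step: a target expression frame formed as a convex combination of pre-captured sample images, weighted by interpolation coefficients derived from the tracked motion parameters. The file names and the coefficient values are illustrative assumptions.

```python
import numpy as np
import cv2

def blend_samples(samples, weights):
    weights = np.asarray(weights, dtype=np.float32)
    weights = weights / weights.sum()                 # normalize to a convex combination
    out = np.zeros_like(samples[0], dtype=np.float32)
    for img, w in zip(samples, weights):
        out += w * img.astype(np.float32)
    return np.clip(out, 0, 255).astype(np.uint8)

samples = [cv2.imread(f) for f in ("neutral.png", "smile.png", "open.png")]  # hypothetical samples
if all(s is not None for s in samples):
    frame = blend_samples(samples, weights=[0.2, 0.5, 0.3])
    cv2.imwrite("synthesized.png", frame)
```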

6.
To conveniently and rapidly generate animated characters with realistic expressions on a computer, a facial animation generation method based on deep learning and expression AU parameters is proposed. The method defines 24 facial action unit parameters (expression AU parameters) to describe facial expressions, and builds and trains the corresponding parameter-regression network using a convolutional neural network and the FEAFA dataset. When generating facial animation from video, frames are first captured from a monocular camera; face detection is performed on each frame using a supervised descent method, and the expression AU parameter values are then regressed accurately from the resulting facial expression image. These values are treated as 3D facial expression basis coefficients and, combined with the virtual character's 24 corresponding basic 3D expression shapes and its neutral expression shape, drive the virtual character through an expression blendshape model to produce facial animation in unconstrained environments. The method eliminates the 3D reconstruction step of traditional approaches and accounts for the interactions between action unit parameters, making the expressions of the generated facial animation more natural and detailed. In addition, expression coefficients regressed from face images are more accurate than those regressed from feature points.
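
A minimal sketch of the expression blendshape model implied by the abstract: the animated mesh is the neutral shape plus AU-weighted offsets of the 24 basis expression shapes. Vertex counts and values below are dummy assumptions.

```python
import numpy as np

def blend_expression(neutral, bases, au_params):
    # neutral:   (V, 3) neutral-expression vertices
    # bases:     (24, V, 3) basic expression shapes, one per AU
    # au_params: (24,) regressed AU coefficients
    offsets = bases - neutral[None, :, :]
    return neutral + np.tensordot(au_params, offsets, axes=1)

V = 1000
neutral = np.random.rand(V, 3)
bases = neutral[None] + 0.05 * np.random.rand(24, V, 3)
au = np.random.rand(24)
mesh = blend_expression(neutral, bases, au)
print(mesh.shape)  # (1000, 3)
```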

7.
To address the problem in speech-driven facial animation of generating expression details such as eye blinks and eyebrow raises that appear naturally along with the speech motion, and thereby enhancing immersion in virtual environments, a speech-driven facial animation method capable of synthesizing expression details is proposed. The method has a training stage and a synthesis stage. In training, the features of expressive 3D facial speech motion-capture data are first resampled to reduce the amount of training data and improve training efficiency; a hidden Markov model (HMM) then learns the relationship between expressive facial speech motion and the synchronized speech, and the synthesis residual of the trained HMM on the training set is computed. In synthesis, the trained HMM first infers the matching expressive facial animation from new speech features, and expression details are then added based on the residual computed during training. Experimental results show that the method is more computationally efficient than existing methods, and the synthesized expression details passed user-study validation.

8.
Rapid, real-time generation of virtual faces with realistic expressions and natural poses remains a challenging research problem. A real-time facial expression transfer method combining a 3DMM with a GAN is proposed. Using a performance video of the target face, a mapping is established between the performer's and the target face's key points; the performer's facial key points are tracked in real time with a 2D RGB camera, the target virtual face's feature points are generated with a GAN, and the head pose is further estimated. A 3DMM is used to reconstruct a 3D face model from 2D, and the 2D facial expression under the current pose is rendered in real time; the performer's expression is then fused with the target face's expression to generate a target face with realistic expressions. Comparative experiments show that the method yields more realistic facial expressions, can imitate the target face's genuine expressions, and runs in real time, offering greater flexibility in creating realistic videos. In addition, a validation method for facial expression transfer is proposed that enables objective evaluation of the simulated faces.

9.
Facial Expression Image Warping Based on MPEG-4
To generate natural and realistic facial expressions in real time, a facial expression image warping method based on the MPEG-4 facial animation framework is proposed. The method first extracts 88 feature points from a face photo using a face alignment tool. On this basis, a standard face mesh is calibrated and deformed to produce a person-specific triangular mesh. The relevant facial key feature points and their neighbouring associated feature points are then moved according to the facial animation parameters (FAPs), while keeping the topology of the face triangle mesh unchanged under the combined action of multiple FAPs. Finally, the facial texture of all deformed triangle regions is filled by affine transformation, producing the facial expression image defined by the FAPs. The input of the method is a neutral face photo and a set of facial animation parameters; the output is the corresponding facial expression image. To synthesize subtle expression movements and a virtual talking head, an algorithm for generating eye-gaze expressions and inner-mouth detail texture is also designed. A subjective evaluation based on a 5-point mean opinion score (MOS) shows that expression images generated with this warping method score 3.67 for naturalness. Virtual talking-head synthesis experiments show that the method has good real-time performance, with an average processing speed of 66.67 fps on an ordinary PC, making it suitable for real-time video processing and facial animation generation.
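
A minimal sketch of the texture-filling step: each deformed triangle of the face mesh is filled by an affine warp from the source image. This is a generic piecewise-affine warp, not the paper's exact FAP-driven pipeline; the file name and triangle coordinates are illustrative assumptions.

```python
import cv2
import numpy as np

def warp_triangle(src_img, dst_img, tri_src, tri_dst):
    # Bounding boxes of the source and destination triangles.
    r1 = cv2.boundingRect(np.float32([tri_src]))
    r2 = cv2.boundingRect(np.float32([tri_dst]))
    t1 = np.float32([(x - r1[0], y - r1[1]) for x, y in tri_src])
    t2 = np.float32([(x - r2[0], y - r2[1]) for x, y in tri_dst])
    patch = src_img[r1[1]:r1[1] + r1[3], r1[0]:r1[0] + r1[2]]
    # Affine map defined by the three triangle vertices.
    M = cv2.getAffineTransform(t1, t2)
    warped = cv2.warpAffine(patch, M, (r2[2], r2[3]), flags=cv2.INTER_LINEAR,
                            borderMode=cv2.BORDER_REFLECT_101)
    # Composite only the pixels inside the destination triangle.
    mask = np.zeros((r2[3], r2[2], 3), dtype=np.float32)
    cv2.fillConvexPoly(mask, np.int32(t2), (1.0, 1.0, 1.0))
    roi = dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]]
    blended = roi.astype(np.float32) * (1.0 - mask) + warped.astype(np.float32) * mask
    dst_img[r2[1]:r2[1] + r2[3], r2[0]:r2[0] + r2[2]] = blended.astype(dst_img.dtype)

src = cv2.imread("neutral_face.png")      # hypothetical neutral face photo
if src is not None:
    dst = src.copy()
    warp_triangle(src, dst, [(10, 10), (60, 12), (30, 55)], [(12, 14), (58, 10), (34, 60)])
```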

10.
杨逸  侯进  王献 《计算机应用研究》2013,30(7):2236-2240
To meet the high-fidelity requirements for lip and tongue animation in visual speech and virtual talking-head systems, a 3D lip-tongue muscle control model based on motion-trajectory analysis is proposed. Following anatomical principles, mesh- and texture-based lip and tongue models are first built. Then, by analysing the lip motion trajectory, the orbicularis oris muscle is decomposed into two parts that jointly control lip motion, allowing various mouth shapes to be obtained. For tongue motion simulation, the tongue's trajectory is decomposed into a combination of mechanical motions, which are controlled using four muscle models. The result reproduces the various mouth shapes of a talking face as well as actions such as rolling the tongue and licking the lips. Experimental results show that the method can realistically animate lip and tongue motion.

11.
Automatic analysis of head gestures and facial expressions is a challenging research area with significant applications in human-computer interfaces. We develop a face and head gesture detector for video streams. The detector is based on the facial-landmark paradigm, using both the appearance and the configuration information of the landmarks. First, we detect and accurately track facial landmarks using adaptive templates, a Kalman predictor, and subspace regularization. The trajectories (time series) of facial landmark positions over the course of a head gesture or facial expression are then converted into various discriminative features, which can be landmark coordinate time series, facial geometric features, or patches on expressive regions of the face. We comparatively evaluate two feature-sequence classifiers, Hidden Markov Models (HMM) and Hidden Conditional Random Fields (HCRF), as well as feature-subspace classifiers, ICA (Independent Component Analysis) and NMF (Non-negative Matrix Factorization), on the spatiotemporal data. We achieve 87.3% correct gesture classification on a seven-gesture test database, and performance reaches 98.2% correct detection under a fusion scheme. Promising and competitive results are also achieved on the classification of naturally occurring gesture clips from the LIlir TwoTalk Corpus.
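
A minimal sketch of the landmark-tracking step with a constant-velocity Kalman filter for a single landmark, producing the position time series that would later feed the HMM/HCRF classifiers. The noise settings and measurements are illustrative assumptions, not the paper's configuration.

```python
import cv2
import numpy as np

# Constant-velocity Kalman filter for one landmark (state: x, y, vx, vy).
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.statePost = np.array([[100.0], [120.0], [0.0], [0.0]], dtype=np.float32)

trajectory = []
for measured in [(101, 121), (103, 122), (106, 124)]:        # dummy per-frame detections
    prediction = kf.predict()                                 # predicted landmark position
    kf.correct(np.array([[measured[0]], [measured[1]]], dtype=np.float32))
    trajectory.append((float(prediction[0, 0]), float(prediction[1, 0])))
print(trajectory)   # the time series later fed to a sequence classifier
```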

12.
Animating expressive faces across languages
This paper describes a morphing-based, audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme is presented for implementing a language-independent system for audio-driven facial animation given a speech recognition system for just one language, in our case English. The method can also be used for text-to-audio-visual speech synthesis. Visemes in new expressions are synthesized so that animations with different facial expressions can be generated. Given an incoming audio stream and still pictures of a face representing different visemes, an animation sequence is constructed using optical flow between visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.
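
A minimal sketch of flow-based morphing between two viseme stills: dense optical flow is estimated from B to A so that B(x) ≈ A(x + flow(x)), an intermediate frame warps A a fraction of the way along the flow, and a cross-dissolve covers the residual difference. The file names and the use of Farneback flow are illustrative assumptions, not the original system's method.

```python
import cv2
import numpy as np

def morph(viseme_a, viseme_b, alpha):
    gray_a = cv2.cvtColor(viseme_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(viseme_b, cv2.COLOR_BGR2GRAY)
    # Flow computed with B as "prev" and A as "next": B(x) ~ A(x + flow(x)).
    flow = cv2.calcOpticalFlowFarneback(gray_b, gray_a, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray_a.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + alpha * flow[..., 0]).astype(np.float32)
    map_y = (grid_y + alpha * flow[..., 1]).astype(np.float32)
    warped = cv2.remap(viseme_a, map_x, map_y, cv2.INTER_LINEAR)
    return cv2.addWeighted(warped, 1.0 - alpha, viseme_b, alpha, 0)

a, b = cv2.imread("viseme_aa.png"), cv2.imread("viseme_oo.png")  # hypothetical viseme stills
if a is not None and b is not None:
    frames = [morph(a, b, t / 10.0) for t in range(11)]
```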

13.
Automatically locating facial landmarks in images is an important task in computer vision. This paper proposes a novel context modeling method for facial landmark detection, which integrates context constraints together with a local texture model in the cascaded AdaBoost framework. The motivation of our method lies in the basic observation from human psychology that people use not only local texture information but also global context information to locate facial landmarks. Therefore, in our solution, a novel type of feature, called the Non-Adjacent Rectangle (NAR) Haar-like feature, is proposed to characterize the co-occurrence between facial landmarks and their surroundings, i.e., the context information, in terms of low-level features. For the locating task, traditional Haar-like features (characterizing local texture information) and NAR Haar-like features (characterizing context constraints in a global sense) are combined to form more powerful representations. Through Real AdaBoost learning, the most discriminative feature set is selected automatically and used for facial landmark detection. To verify the effectiveness of the proposed method, we evaluate our facial landmark detection algorithm on the BioID and Cohn-Kanade face databases. Experimental results convincingly show that the NAR Haar-like feature is effective for modeling context and that the proposed algorithm impressively outperforms published state-of-the-art methods. In addition, the generalization capability of the NAR Haar-like feature is further validated by extending its application to the face detection task on the FDDB face database.
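
A minimal sketch of a Non-Adjacent Rectangle Haar-like feature: the difference of mean intensities over two rectangles that need not touch, computed in constant time from an integral image. The rectangle placement is an illustrative assumption, not a learned configuration from the paper.

```python
import numpy as np

def rect_sum(ii, x, y, w, h):
    # ii is an integral image with a zero row/column prepended.
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def nar_feature(gray, rect_a, rect_b):
    ii = np.zeros((gray.shape[0] + 1, gray.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = gray.astype(np.float64).cumsum(axis=0).cumsum(axis=1)
    (xa, ya, wa, ha), (xb, yb, wb, hb) = rect_a, rect_b
    mean_a = rect_sum(ii, xa, ya, wa, ha) / (wa * ha)
    mean_b = rect_sum(ii, xb, yb, wb, hb) / (wb * hb)
    # Co-occurrence cue between a landmark patch and a distant context patch.
    return mean_a - mean_b

gray = np.random.randint(0, 256, (64, 64)).astype(np.uint8)
print(nar_feature(gray, rect_a=(10, 40, 12, 8), rect_b=(40, 10, 12, 8)))
```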

14.
With the development of image generation research in computer vision, face reenactment has attracted wide attention. The technique aims to synthesize new talking-head images or videos from the identity of a source face image and driving information such as mouth shape, expression, and pose. Face reenactment has a wide range of applications, such as virtual anchor generation, online teaching, game avatar customization, lip synchronization in dubbed videos, and video-conference compression. Although the technique has developed over a relatively short period, a large body of research has emerged. However, there are currently almost no surveys at home or abroad that focus specifically on face reenactment; overviews of face reenactment research appear only as deepfake content within surveys on deepfake detection. In view of this, this paper reviews and summarizes the development of the face reenactment field. Starting from face reenactment models, it describes the problems faced by face reenactment, the classification of models, and the representations of driving facial features; it lists and introduces the datasets commonly used to train face reenactment models and the metrics used to evaluate them; it summarizes, analyses, and compares recent face reenactment work; and finally it discusses evolution trends, current challenges, future research directions, potential harms, and countermeasures.

15.
This paper proposes a deep bidirectional long short-term memory (DBLSTM) approach to modeling the long-context, nonlinear mapping between audio and visual streams for a video-realistic talking head. In the training stage, an audio-visual stereo database is first recorded of a subject talking to a camera. The audio streams are converted into acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), and their textual labels are also extracted. The visual streams, in particular the lower face region, are compactly represented by active appearance model (AAM) parameters, by which the shape and texture variations can be jointly modeled. Given pairs of audio and visual parameter sequences, a DBLSTM model is trained to learn the sequence mapping from audio to visual space. For any unseen speech audio, whether originally recorded or synthesized by text-to-speech (TTS), the trained DBLSTM model can predict a convincing AAM parameter trajectory for the lower-face animation. To further improve the realism of the proposed talking head, a trajectory tiling method is adopted that uses the DBLSTM-predicted AAM trajectory as a guide to select a smooth sequence of real sample images from the recorded database. We then stitch the selected lower-face image sequence back onto a background face video of the same subject, resulting in a video-realistic talking head. Experimental results show that the proposed DBLSTM approach outperforms the existing HMM-based approach in both objective and subjective evaluations.
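
A minimal sketch of the audio-to-visual sequence regressor described above: a deep bidirectional LSTM mapping per-frame MFCC vectors to AAM parameter vectors, trained with a trajectory regression loss. Layer sizes, dimensions, and the MSE loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioToAAM(nn.Module):
    def __init__(self, mfcc_dim=39, aam_dim=40, hidden=256, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(mfcc_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, aam_dim)

    def forward(self, mfcc):
        # mfcc: (batch, frames, mfcc_dim) -> (batch, frames, aam_dim)
        out, _ = self.blstm(mfcc)
        return self.head(out)

model = AudioToAAM()
mfcc = torch.randn(4, 200, 39)                # dummy MFCC sequences
aam_trajectory = model(mfcc)
loss = nn.functional.mse_loss(aam_trajectory, torch.randn(4, 200, 40))
loss.backward()                               # trained against ground-truth AAM trajectories
```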

16.
This paper proposes a statistical parametric approach to a video-realistic, text-driven talking avatar. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized based on the maximum likelihood criterion. Previous trajectory HMM approaches focused only on mouth animation, synthesizing simple geometric mouth shapes or video-realistic effects of lip motion. Our approach uses a trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To realize video-realistic effects with high fidelity, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly onto a whole-face image. Objective and subjective experiments show that the proposed approach can produce natural facial animation.
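
A minimal sketch of the stitching step: Poisson image editing, here via OpenCV's seamlessClone, blends a synthesized lower-face patch into a whole-face frame. The file names, mask shape, and placement are illustrative assumptions.

```python
import cv2
import numpy as np

face = cv2.imread("background_face.png")      # hypothetical full-face frame of the subject
lower = cv2.imread("synth_lower_face.png")    # hypothetical synthesized lower-face image, same size
if face is not None and lower is not None:
    mask = np.zeros(lower.shape[:2], dtype=np.uint8)
    h, w = mask.shape
    # Elliptical mask roughly covering the lower face of the source patch.
    cv2.ellipse(mask, (w // 2, int(h * 0.7)), (w // 3, h // 4), 0, 0, 360, 255, -1)
    center = (w // 2, int(h * 0.7))            # where the patch lands in the target frame
    blended = cv2.seamlessClone(lower, face, mask, center, cv2.NORMAL_CLONE)
    cv2.imwrite("stitched_frame.png", blended)
```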

17.
This paper demonstrates the advantages of using certain properties of the human visual system to develop a set of fusion algorithms for automatic analysis and interpretation of global and local facial motions. The proposed fusion algorithms rely on information coming from human vision models, such as the human retina and primary visual cortex, previously developed at Gipsa-lab. Starting from a set of low-level bio-inspired modules (static and moving contour detector, motion event detector, and spectrum analyser) that are very efficient for video data pre-processing, it is shown how to organize them to achieve reliable face motion interpretation. In particular, algorithms are proposed and assessed for global head motion analysis such as head nods, local eye motion analysis such as blinking, local mouth motion analysis such as speech lip motion and yawning, and open/closed mouth/eye state detection. Thanks to the human vision model pre-processing, which decorrelates visual information in a reliable manner, the fusion algorithms are simplified and remain robust against traditional video acquisition problems (lighting changes, object detection failures, etc.).

18.
To address the problem that existing methods for detecting GAN-forged face images misclassify real faces when the face is at an angle or partly occluded, a GAN-forged face image detection method based on a Deep Alignment Network (DAN) is proposed. First, a facial landmark extraction network is designed based on DAN to extract the landmark positions of real and forged faces. Then, principal component analysis (PCA) is used to map each group of landmarks into a three-dimensional space, reducing redundant information and lowering the feature dimensionality. Finally, the features are classified with a support vector machine (SVM) using five-fold cross-validation, and the accuracy is computed. Experimental results show that, by improving the localization accuracy of facial landmarks, the method alleviates the facial incoherence caused by localization errors and thereby reduces the misclassification rate of real faces. Compared with the VGG19, XceptionNet, and Dlib-SVM methods, for frontal faces its area under the ROC curve (AUC) increases by 4.48 to 32.96 percentage points and its average precision (AP) by 4.26 to 33.12 percentage points; for angled or occluded faces, its AUC increases by 10.56 to 30.75 percentage points and its AP by 7.42 to 42.45 percentage points.
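
A minimal sketch of the classification stage: landmark-derived feature vectors are reduced with PCA to three components and classified real-vs-fake by an SVM scored with five-fold cross-validation. The data below are random placeholders; the feature layout and kernel choice are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(400, 68 * 2)        # placeholder landmark features (e.g., 68 2-D points)
y = np.random.randint(0, 2, 400)       # 0 = real face, 1 = GAN-forged face

clf = make_pipeline(StandardScaler(), PCA(n_components=3), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)          # 5-fold cross-validated accuracy
print(f"mean accuracy: {scores.mean():.3f}")
```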

19.
Synchronizing speech-driven lip animation is one of the difficulties in facial animation. First, taking the syllable as the recognition unit and using a strict initial-final (shengmu/yunmu) modelling approach, the syllable sequence and its timing information are recognized from the speech file with the HTK toolkit. Then, using a basic viseme library and a syllable-to-viseme mapping table, the viseme sequence corresponding to the syllable sequence is obtained. Finally, the viseme sequence is played back with interpolation according to its timing information, realizing speech-driven lip animation. Experiments show that the method not only greatly reduces the number of models, but also accurately recognizes the syllable sequence and its timing, effectively achieving synchronization between speech and lip motion.
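
A minimal sketch of the syllable-to-viseme playback logic: recognized syllables with start times are mapped to viseme IDs through a lookup table, and the lip shape at time t blends between the current and next key visemes. The table entries, timings, and the linear interpolation scheme are illustrative assumptions, not the paper's exact method.

```python
import bisect

SYLLABLE_TO_VISEME = {"ni": 3, "hao": 7, "ma": 1}          # hypothetical mapping table

def build_track(recognized):
    # recognized: list of (syllable, start_sec, end_sec) from the recognizer
    return [(start, SYLLABLE_TO_VISEME.get(syl, 0)) for syl, start, _ in recognized]

def viseme_at(track, t):
    # Blend weight between the current key viseme and the next one.
    times = [k[0] for k in track]
    i = max(bisect.bisect_right(times, t) - 1, 0)
    if i >= len(track) - 1:
        return track[-1][1], track[-1][1], 1.0
    t0, v0 = track[i]
    t1, v1 = track[i + 1]
    alpha = (t - t0) / (t1 - t0)
    return v0, v1, alpha

track = build_track([("ni", 0.00, 0.22), ("hao", 0.22, 0.55), ("ma", 0.55, 0.80)])
print(viseme_at(track, 0.30))   # (current viseme, next viseme, blend weight)
```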

20.
In this paper, a real-time system that creates a talking head from a video sequence without any user intervention is presented. The system includes a probabilistic approach for deciding whether extracted facial features are appropriate for creating a three-dimensional (3-D) face model. Two-dimensional facial features automatically extracted from a video sequence are fed into the proposed probabilistic framework before a corresponding 3-D face model is built, to avoid generating an unnatural or unrealistic 3-D face model. To extract the face shape, we also present a face shape extractor based on an ellipse model controlled by three anchor points, which is accurate and computationally cheap. To create a 3-D face model, a least-squares approach is presented to find the coefficient vector necessary to adapt a generic 3-D model to the extracted facial features. Experimental results show that the proposed system can efficiently build a 3-D face model from a video sequence without any user intervention for various Internet applications, including virtual conferencing and virtual storytelling, that do not require much head movement or high-quality facial animation.
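
A minimal sketch of the least-squares adaptation step: find the coefficient vector that best explains the extracted 2-D feature points as a linear combination of a generic model's deformation basis. The matrix shapes and the linear-basis formulation are illustrative assumptions about the generic model.

```python
import numpy as np

def fit_coefficients(basis, mean_shape, observed):
    # basis:      (2K, M) linearized deformation basis of the generic model
    # mean_shape: (2K,)   mean 2-D positions of the K feature points
    # observed:   (2K,)   feature points extracted from the video frame
    c, _, _, _ = np.linalg.lstsq(basis, observed - mean_shape, rcond=None)
    return c

K, M = 30, 10
basis = np.random.rand(2 * K, M)
mean_shape = np.random.rand(2 * K)
true_c = np.random.rand(M)
observed = mean_shape + basis @ true_c
print(np.allclose(fit_coefficients(basis, mean_shape, observed), true_c))  # True
```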
