Similar Documents
A total of 20 similar documents were retrieved.
1.
This work focuses on the development of expressive text-to-speech synthesis techniques for a Chinese spoken dialog system, where the expressivity is driven by the message content. We adapt the three-dimensional pleasure-displeasure, arousal-nonarousal and dominance-submissiveness (PAD) model for describing expressivity in input text semantics. The context of our study is based on response messages generated by a spoken dialog system in the tourist information domain. We use the $P$ (pleasure) and $A$ (arousal) dimensions to describe expressivity at the prosodic word level based on lexical semantics. The $D$ (dominance) dimension is used to describe expressivity at the utterance level based on dialog acts. We analyze contrastive (neutral versus expressive) speech recordings to develop a nonlinear perturbation model that incorporates the PAD values of a response message to transform neutral speech into expressive speech. Two levels of perturbation are implemented: local perturbation at the prosodic word level and global perturbation at the utterance level. Perceptual experiments involving 14 subjects indicate that the proposed approach can significantly enhance expressivity in response generation for a spoken dialog system.
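A minimal Python sketch of how such two-level perturbation could be organized; the function name, scaling coefficients, and tanh mapping are illustrative assumptions, not the paper's model:

```python
import numpy as np

def perturb_prosody(word_f0, word_dur, utterance_d, word_pa):
    """Hypothetical two-level perturbation of neutral prosody: a global
    factor from the utterance-level D value plus local factors from each
    prosodic word's (P, A) values."""
    global_gain = 1.0 + 0.10 * utterance_d          # dominance widens pitch range
    out_f0, out_dur = [], []
    for f0, dur, (p, a) in zip(word_f0, word_dur, word_pa):
        local_gain = 1.0 + 0.05 * np.tanh(p)        # pleasure lifts mean F0
        out_f0.append(np.asarray(f0, dtype=float) * global_gain * local_gain)
        out_dur.append(dur / (1.0 + 0.20 * max(a, 0.0)))  # arousal speeds delivery
    return out_f0, out_dur

# Toy usage: two prosodic words, each with a short F0 contour in Hz.
expressive_f0, expressive_dur = perturb_prosody(
    word_f0=[[200, 210, 205], [190, 195, 188]],
    word_dur=[0.42, 0.35],
    utterance_d=0.5,
    word_pa=[(0.8, 0.6), (-0.2, 0.1)],
)
```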

2.
Human facial gestures often exhibit natural stochastic variations such as how often the eyes blink, how often the eyebrows and the nose twitch, and how the head moves while speaking. The stochastic movements of facial features are key ingredients for generating convincing facial expressions. Although such small variations have been simulated using noise functions in many graphics applications, modulating noise functions to match natural variations induced by the affective states and the personality of characters is difficult and not intuitive. We present a technique for generating subtle expressive facial gestures (facial expressions and head motion) semi-automatically from motion capture data. Our approach is based on Markov random fields simulated at two levels. At the lower level, the coordinated movements of facial features are captured, parameterized, and transferred to synthetic faces using basis shapes. The upper level represents the independent stochastic behavior of facial features. The experimental results show that our system generates expressive facial gestures synchronized with input speech.

3.
Emotive audio–visual avatars are virtual computer agents that have the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to make realistic avatars that demonstrate interactions with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio–visual avatar. Our primary work is focused on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework of emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model. Guided by this framework, we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis system module based on the Festival-MBROLA architecture has been designed to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.

4.
A real-time speech-driven synthetic talking face provides an effective multimodal communication interface in distributed collaboration environments. Nonverbal gestures such as facial expressions are important to human communication and should be considered by speech-driven face animation systems. In this paper, we present a framework that systematically addresses facial deformation modeling, automatic facial motion analysis, and real-time speech-driven face animation with expressions using neural networks. Based on this framework, we learn a quantitative visual representation of the facial deformations, called the motion units (MUs). A facial deformation can be approximated by a linear combination of the MUs weighted by MU parameters (MUPs). We develop an MU-based facial motion tracking algorithm which is used to collect an audio-visual training database. Then, we construct a real-time audio-to-MUP mapping by training a set of neural networks on the collected audio-visual training database. The quantitative evaluation of the mapping shows the effectiveness of the proposed approach. Using the proposed method, we develop the functionality of real-time speech-driven face animation with expressions for the iFACE system. Experimental results show that the synthetic expressive talking face of the iFACE system is comparable with a real face in terms of its influence on bimodal human emotion perception.
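A minimal sketch of the linear motion-unit idea; the dimensions, the random placeholder basis, and the faked network output are assumptions for illustration, not the learned iFACE model:

```python
import numpy as np

V, K = 500, 7                              # vertices and motion units (illustrative)
mu_basis = np.random.randn(3 * V, K)       # placeholder for the learned MU basis
neutral_face = np.zeros(3 * V)             # placeholder neutral geometry

def deform_face(mups):
    """A facial deformation approximated as a linear combination of the
    motion units (MUs) weighted by the MU parameters (MUPs)."""
    return (neutral_face + mu_basis @ mups).reshape(V, 3)

# In the paper the MUPs come from trained audio-to-MUP neural networks;
# here a random vector stands in for that prediction.
predicted_mups = np.random.randn(K)
vertices = deform_face(predicted_mups)
```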

5.
To address the problems that existing virtual talking heads have rather monotonous facial expressions and that expressions and movements are poorly coordinated, a method for building a realistic, emotional virtual human is proposed. The method first simulates dynamic facial expressions using three parameters (onset, sustain and decay) and synthesizes complex expressions with blend-shape morphing; eye and head movements are then designed on the basis of statistical data from human psychology to make the virtual human look more lifelike; finally, the influence of external conditions such as camera position and lighting on the perceived realism of the virtual human is analyzed. Experimental results show that the virtual human built with this method is not only lifelike, natural and emotionally expressive, but also achieves good synchronization among speech, dynamic facial expressions, eye movements and head movements.
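A minimal sketch of a three-parameter (onset, sustain, decay) weight curve for one expression blendshape; the linear ramps and the example numbers are illustrative assumptions:

```python
import numpy as np

def expression_envelope(t, onset, sustain, decay, peak=1.0):
    """Hypothetical three-phase curve for a dynamic expression weight:
    rise during onset, hold during sustain, fall back during decay."""
    t = np.asarray(t, dtype=float)
    w = np.zeros_like(t)
    rise = t < onset
    hold = (t >= onset) & (t < onset + sustain)
    fall = (t >= onset + sustain) & (t < onset + sustain + decay)
    w[rise] = peak * t[rise] / onset
    w[hold] = peak
    w[fall] = peak * (1.0 - (t[fall] - onset - sustain) / decay)
    return w

t = np.linspace(0.0, 2.0, 200)
smile_weight = expression_envelope(t, onset=0.3, sustain=0.8, decay=0.5)
```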

6.
This paper proposes a statistical parametric approach to a video-realistic text-driven talking avatar. We follow the trajectory HMM approach, where audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized based on the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing simple geometric mouth shapes or video-realistic effects of the lip motion. Our approach uses the trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To realize video-realistic effects with high fidelity, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly into a whole-face image. Objective and subjective experiments show that the proposed approach can produce natural facial animation.
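A minimal sketch of the Poisson (seamless) stitching step using OpenCV's seamless cloning; the file names, full-frame mask, and paste position are assumptions, not the paper's pipeline:

```python
import cv2
import numpy as np

face_frame = cv2.imread("full_face.png")            # hypothetical background frame
lower_face = cv2.imread("synth_lower_face.png")     # hypothetical synthesized patch
mask = 255 * np.ones(lower_face.shape[:2], dtype=np.uint8)

h, w = face_frame.shape[:2]
center = (w // 2, int(0.75 * h))                    # rough lower-face location (assumption)
stitched = cv2.seamlessClone(lower_face, face_frame, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("stitched_frame.png", stitched)
```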

7.
To address the problem that existing methods for generating talking-face videos from speech ignore the speaker's head motion, a keypoint-based, speech-driven talking-face video generation method is proposed. Facial-contour keypoints and lip keypoints are used to represent the speaker's head motion and lip motion, respectively; a parallel multi-branch network maps the input speech to facial keypoints, and the talking-face video is finally generated from the continuous lip-keypoint and head-keypoint sequences together with a template image. Quantitative and qualitative experiments show that the method can synthesize clear, natural talking-face videos with head motion and achieves favorable performance metrics.

8.
Avatars are increasingly used to express emotions in online communication. Such avatars are used on the assumption that avatar expressions are interpreted universally across all cultures. This paper investigated cross-cultural evaluations of avatar expressions designed by Japanese and Western designers. The goals of the study were: (1) to investigate cultural differences in avatar expression evaluation and apply findings from psychological studies of human facial expression recognition, and (2) to identify expressions and design features that cause cultural differences in avatar facial expression interpretation. The results of our study confirmed that (1) there are cultural differences in interpreting avatars' facial expressions, and the psychological theory that physical proximity affects facial expression recognition accuracy is also applicable to avatar facial expressions, (2) positive expressions have wider cultural variance in interpretation than negative ones, and (3) the use of gestures and gesture marks may sometimes cause counter-effects in recognizing avatar facial expressions.

9.
Psychological research findings suggest that humans rely on the combined visual channels of face and body more than on any other channel when they make judgments about human communicative behavior. However, most existing systems attempting to analyze human nonverbal behavior are mono-modal and focus only on the face. Research that aims to integrate gestures as an expressive means has only recently emerged. Accordingly, this paper presents an approach to automatic visual recognition of expressive face and upper-body gestures from video sequences suitable for use in a vision-based affective multi-modal framework. Face and body movements are captured simultaneously using two separate cameras. For each video sequence, single expressive frames from both face and body are selected manually for analysis and recognition of emotions. Firstly, individual classifiers are trained from the individual modalities. Secondly, we fuse facial expression and affective body gesture information at the feature level and at the decision level. In the experiments performed, emotion classification using the two modalities achieved better recognition accuracy, outperforming classification using the facial or bodily modality alone.
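A minimal sketch contrasting feature-level and decision-level fusion on synthetic stand-in data; the feature dimensions, classifier choice, and equal fusion weights are assumptions, not the paper's setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_face = rng.normal(size=(200, 20))        # stand-in facial-expression features
X_body = rng.normal(size=(200, 30))        # stand-in body-gesture features
y = rng.integers(0, 4, size=200)           # four emotion classes, illustrative

# Feature-level fusion: concatenate the two modalities before classification.
feature_clf = SVC(probability=True).fit(np.hstack([X_face, X_body]), y)

# Decision-level fusion: train per-modality classifiers and average posteriors.
face_clf = SVC(probability=True).fit(X_face, y)
body_clf = SVC(probability=True).fit(X_body, y)
fused_posterior = 0.5 * (face_clf.predict_proba(X_face) +
                         body_clf.predict_proba(X_body))
fused_prediction = fused_posterior.argmax(axis=1)
```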

10.
Synthesizing expressive facial animation is a very challenging topic within the graphics community. In this paper, we present an expressive facial animation synthesis system enabled by automated learning from facial motion capture data. Accurate 3D motions of the markers on the face of a human subject are captured while he/she recites a predesigned corpus with specific spoken and visual expressions. We present a novel motion capture mining technique that "learns" speech coarticulation models for diphones and triphones from the recorded data. A phoneme-independent expression eigenspace (PIEES) that encloses the dynamic expression signals is constructed by motion signal processing (phoneme-based time-warping and subtraction) and principal component analysis (PCA) reduction. New expressive facial animations are synthesized as follows: first, the learned coarticulation models are concatenated to synthesize neutral visual speech according to novel speech input; then a texture-synthesis-based approach is used to generate a novel dynamic expression signal from the PIEES model; and finally the synthesized expression signal is blended with the synthesized neutral visual speech to create the final expressive facial animation. Our experiments demonstrate that the system can effectively synthesize realistic expressive facial animation.
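A minimal sketch of building an expression eigenspace by PCA reduction; the signal shapes, random stand-in data, and number of components are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for time-warped, phoneme-normalized expression signals:
# rows are frames, columns are stacked 3D marker coordinates.
expression_signals = np.random.randn(1000, 90)

piees = PCA(n_components=10).fit(expression_signals)
codes = piees.transform(expression_signals)        # low-dimensional expression codes
reconstructed = piees.inverse_transform(codes)     # back to marker space for blending
```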

11.
This paper presents a non-verbal and non-facial method for effective communication of a “mechanoid robot” by conveying emotions through gestures. This research focuses on human–robot interaction using a mechanoid robot that does not possess any anthropomorphic facial features for conveying gestures. Another feature of this research is the use of human-like smooth motion of the mechanoid robot, in contrast to the traditional trapezoidal velocity profile, for its communication. For conveying gestures, the connection between the robot's motion and perceived emotions is established by varying the velocity and acceleration of the mechanoid structure. The selected motion parameters are changed systematically to observe the variation in perceived emotions. The perceived emotions have been further investigated using three different emotional behavior models: Russell’s circumplex model of affect, the Tellegen–Watson–Clark model and the PAD model. The results show that the designated motion parameters are linked with the change of emotions. Moreover, the emotions perceived by the user are the same across all three models, validating the reliability of the three emotional scale models and of the emotions perceived by the user.

12.
《Advanced Robotics》2013,27(8):827-852
The purpose of a robot is to execute tasks for people. People should be able to communicate with robots in a natural way. People naturally express themselves through body language using facial gestures and expressions. We have built a human-robot interface based on head gestures for use in robot applications. Our interface can track a person's facial features in real time (30 Hz video frame rate). No special illumination or facial makeup is needed to achieve robust tracking. We use dedicated vision hardware based on correlation image matching to implement the face tracking. Tracking using correlation matching suffers from the problems of changing shade and deformation or even disappearance of facial features. By using multiple Kalman filters we are able to overcome these problems. Our system can accurately predict and robustly track the positions of facial features despite disturbances and rapid movements of the head (including both translational and rotational motion). Since we can reliably track faces in real time, we are also able to recognize motion gestures of the face. Our system can recognize a large set of gestures (15), ranging from yes, no and maybe to winks, blinks and sleeping. We have used an approach that decomposes each gesture into a set of atomic actions, e.g. a nod for yes consists of an atomic up followed by a down motion. Our system can understand gestures by monitoring the transitions between atomic actions.
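A minimal sketch of one constant-velocity Kalman filter for a single facial feature, of the kind the abstract describes using in multiples to ride out tracking dropouts; the noise covariances are illustrative assumptions:

```python
import cv2
import numpy as np

# State = [x, y, vx, vy], measurement = [x, y] from the correlation matcher.
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.eye(2, 4, dtype=np.float32)
kf.processNoiseCov = 1e-3 * np.eye(4, dtype=np.float32)
kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)

def track_step(measured_xy):
    """Predict the feature position; correct only when the correlation
    matcher actually found the feature in this frame."""
    predicted = kf.predict()
    if measured_xy is not None:
        kf.correct(np.asarray(measured_xy, dtype=np.float32).reshape(2, 1))
    return float(predicted[0]), float(predicted[1])
```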

13.
In this paper, a facial feature point tracker motivated by applications such as human-computer interfaces and facial expression analysis systems is proposed. The proposed tracker is based on a graphical model framework. The facial features are tracked through video streams by incorporating statistical relations in time as well as spatial relations between feature points. By exploiting the spatial relationships between feature points, the proposed method provides robustness in real-world conditions such as arbitrary head movements and occlusions. A Gabor feature-based occlusion detector is developed and used to handle occlusions. The performance of the proposed tracker has been evaluated on real video data under various conditions, including occluded facial gestures and head movements. It is also compared to two popular methods, one based on Kalman filtering exploiting temporal relations, and the other based on active appearance models (AAM). Improvements provided by the proposed approach are demonstrated through both visual displays and quantitative analysis.
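A minimal sketch of a Gabor filter bank over a feature-point patch, the kind of response an occlusion detector could threshold; the kernel size, wavelengths, and the thresholding strategy are assumptions, not the paper's detector:

```python
import cv2
import numpy as np

def gabor_energy(patch, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """Summed Gabor response energy of a grayscale patch; a sharp drop
    relative to an unoccluded template could flag occlusion."""
    energy = 0.0
    for theta in thetas:
        # ksize, sigma, theta, lambda, gamma, psi
        kernel = cv2.getGaborKernel((21, 21), 4.0, theta, 10.0, 0.5, 0.0)
        response = cv2.filter2D(patch.astype(np.float32), cv2.CV_32F, kernel)
        energy += float(np.sum(response ** 2))
    return energy

patch = np.random.rand(32, 32).astype(np.float32)   # stand-in eye-corner patch
print(gabor_energy(patch))
```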

14.
Speech-Driven Articulator Motion Synthesis Based on Deep Neural Networks
唐郅  侯进 《自动化学报》2016,42(6):923-930
A method for speech-driven articulator motion synthesis based on deep neural networks is implemented and applied to speech-driven virtual talking-head animation. A deep neural network (DNN) learns the mapping between acoustic features and articulator position information; the system estimates the articulators' motion trajectories from the input speech data and renders them on a 3-D virtual human. First, experimental results of an artificial neural network (ANN) and a DNN are compared under a range of parameter settings to obtain the optimal network; second, different lengths of contextual acoustic features are tested and the number of hidden units adjusted to obtain the best context length; finally, with the optimal network structure, the articulator motion trajectories output by the DNN control the articulator motion synthesis and drive the virtual-human animation. Experiments show that the animation synthesis method implemented in this paper is efficient and realistic.
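A minimal PyTorch sketch of an acoustic-feature-to-articulator-position mapping; the layer sizes, context length, and feature dimensions are assumptions rather than the paper's tuned configuration:

```python
import torch
import torch.nn as nn

context_frames, acoustic_dim, articulator_dim = 11, 39, 18   # illustrative sizes

dnn = nn.Sequential(
    nn.Linear(context_frames * acoustic_dim, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, articulator_dim),
)

# One window of contextual acoustic features in, one frame of articulator
# positions out; a sequence of such frames would drive the 3-D virtual human.
acoustic_window = torch.randn(1, context_frames * acoustic_dim)
articulator_frame = dnn(acoustic_window)
```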

15.
The use of avatars with emotionally expressive faces is potentially highly beneficial to communication in collaborative virtual environments (CVEs), especially when used in a distance learning context. However, little is known about how, or indeed whether, emotions can effectively be transmitted through the medium of a CVE. Given this, an avatar head model with limited but human-like expressive abilities was built, designed to enrich CVE communication. Based on the facial action coding system (FACS), the head was designed to express, in a readily recognisable manner, the six universal emotions. An experiment was conducted to investigate the efficacy of the model. Results indicate that the approach of applying the FACS model to virtual face representations is not guaranteed to work for all expressions of a particular emotion category. However, given appropriate use of the model, emotions can effectively be visualised with a limited number of facial features. A set of exemplar facial expressions is presented.
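A minimal sketch of driving an avatar head from a FACS-style emotion-to-action-unit table; the AU lists are simplified textbook mappings and the weight scheme is an assumption, not the paper's head model:

```python
# Simplified FACS-style mapping from the six universal emotions to action
# units; the exact AU sets used by the avatar head model are an assumption.
EMOTION_TO_AUS = {
    "happiness": [6, 12],
    "sadness":   [1, 4, 15],
    "surprise":  [1, 2, 5, 26],
    "fear":      [1, 2, 4, 5, 20, 26],
    "anger":     [4, 5, 7, 23],
    "disgust":   [9, 15, 16],
}

def expression_weights(emotion, intensity=1.0):
    """Return per-AU activation weights to feed the avatar's facial controls."""
    return {f"AU{au}": intensity for au in EMOTION_TO_AUS[emotion]}

print(expression_weights("happiness", intensity=0.8))
```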

16.
杨逸  侯进  王献 《计算机应用研究》2013,30(7):2236-2240
To meet the high-fidelity requirements for lip and tongue animation in visual speech and virtual talking-head systems, a 3D lip-tongue muscle control model based on motion trajectory analysis is proposed. The method first builds mesh- and texture-based lip and tongue models according to anatomical principles. Then, based on an analysis of lip motion trajectories, the orbicularis oris muscle is decomposed into two parts that jointly control lip motion, producing a variety of mouth shapes. For tongue motion simulation, the tongue's trajectory is decomposed into a combination of mechanical movements controlled by four muscle models. The result reproduces the various mouth shapes of a speaking face, as well as actions such as curling the tongue and licking the lips. Experimental results show that the method realistically animates lip and tongue motion.

17.
18.
In image coding of face sequences, model-based coding has attracted wide attention because it can achieve high subjective image quality at low bit rates, but reliable estimation of its motion parameters remains difficult. This paper therefore analyzes the characteristics of head motion and classifies it into three forms: rigid head motion, simple facial-expression motion and complex facial-expression motion. Rigid head motion parameters are extracted with a motion-parameter estimation algorithm based on feature-point pairs, for which a linear implementation is proposed. The paper also points out that improving the accuracy of motion-parameter estimation depends on selecting suitable feature points and building a 3-D wireframe model consistent with the specific face. In addition, deformation matrices are established for simple facial-expression motion, and finally the motion-parameter estimation error is evaluated with an area error function.

19.
The MPEG-4 standard supports the transmission and composition of facial animation with natural video by including a facial animation parameter (FAP) set that is defined based on the study of minimal facial actions and is closely related to muscle actions. The FAP set enables model-based representation of natural or synthetic talking-head sequences and allows intelligible visual reproduction of facial expressions, emotions, and speech pronunciations at the receiver. This paper describes two key components we have developed for building a model-based video coding system: (1) a method for estimating FAP parameters based on our previously proposed piecewise Bézier volume deformation (PBVD) model, and (2) various methods for encoding FAP parameters. PBVD is a linear deformation model suitable for both the synthesis and the analysis of facial images. Each FAP parameter is a basis function in this model. Experimental results on PBVD-based animation, model-based tracking, and spatial-temporal compression of FAP parameters are demonstrated in this paper.
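A minimal sketch of the linear FAP-driven deformation idea, where each FAP acts as a basis function; the mesh size, random placeholder basis, and example FAP index are assumptions, not the PBVD model itself:

```python
import numpy as np

V, F = 1000, 68                             # mesh vertices and FAP parameters (illustrative)
fap_basis = np.random.randn(F, V, 3)        # placeholder per-FAP displacement fields
neutral_mesh = np.zeros((V, 3))

def apply_faps(fap_values):
    """Deform the neutral face by the FAP-weighted sum of basis displacements."""
    return neutral_mesh + np.tensordot(fap_values, fap_basis, axes=1)

frame_faps = np.zeros(F)
frame_faps[3] = 0.6                         # push one hypothetical FAP to 60 % of its range
deformed_mesh = apply_faps(frame_faps)
```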

20.