Similar Documents
20 similar documents retrieved.
1.
Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text and non-linguistic input. Concurrent with the exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.  相似文献   

2.
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just like humans, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

3.
Controlling a crowd using multi‐touch devices appeals to the computer games and animation industries, as such devices provide a high‐dimensional control signal that can effectively define the crowd formation and movement. However, existing works relying on pre‐defined control schemes require the users to learn a scheme that may not be intuitive. We propose a data‐driven gesture‐based crowd control system, in which the control scheme is learned from example gestures provided by different users. In particular, we build a database with pairwise samples of gestures and crowd motions. To effectively generalize the gesture style of different users, such as the use of different numbers of fingers, we propose a set of gesture features for representing a set of hand gesture trajectories. Similarly, to represent crowd motion trajectories of different numbers of characters over time, we propose a set of crowd motion features that are extracted from a Gaussian mixture model. Given a run‐time gesture, our system extracts the K nearest gestures from the database and interpolates the corresponding crowd motions in order to generate the run‐time control. Our system is accurate and efficient, making it suitable for real‐time applications such as real‐time strategy games and interactive animation controls.  相似文献   
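As one way to picture the retrieval step described above, the sketch below finds the K nearest example gestures in feature space and blends their paired crowd motions with inverse-distance weights. The feature dimensions, database contents and distance metric are illustrative assumptions, not the representation used in the paper.

```python
import numpy as np

def knn_crowd_control(query_feat, gesture_feats, crowd_motions, k=3, eps=1e-8):
    """Retrieve the k nearest example gestures and blend their crowd motions.

    query_feat    : (d,)   feature vector of the run-time gesture
    gesture_feats : (n, d) features of the example gestures in the database
    crowd_motions : (n, m) flattened crowd-motion parameters paired with each example
    """
    dists = np.linalg.norm(gesture_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    # Inverse-distance weights: closer examples contribute more to the blend.
    weights = 1.0 / (dists[nearest] + eps)
    weights /= weights.sum()
    return weights @ crowd_motions[nearest]

# Toy usage with random data standing in for the real database.
rng = np.random.default_rng(0)
db_feats = rng.normal(size=(50, 16))      # 50 example gestures, 16-D features
db_motions = rng.normal(size=(50, 200))   # paired crowd-motion parameters
blended = knn_crowd_control(rng.normal(size=16), db_feats, db_motions, k=3)
print(blended.shape)                      # (200,)
```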

4.
Gesture plays an important role in recognizing lecture activities in video content analysis. In this paper, we propose a real-time gesture detection algorithm that integrates cues from vision, speech and electronic slides. In contrast to conventional “complete gesture” recognition, we emphasize detection by prediction from “incomplete gestures”. Specifically, intentional gestures are predicted by a modified hidden Markov model (HMM) that can recognize incomplete gestures before the whole gesture path is observed. The multimodal correspondence between speech and gesture is exploited to increase the accuracy and responsiveness of gesture detection. In lecture presentations, this algorithm enables on-the-fly editing of lecture slides by simulating appropriate camera motion to highlight the intention and flow of lecturing. We develop a real-time application, a simulated smartboard, and demonstrate the feasibility of our prediction algorithm with hand gestures and a laser pen, using a simple setup that requires no expensive hardware.
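The early-prediction idea can be illustrated with a standard HMM forward pass evaluated on an incomplete observation prefix: a gesture class is reported as soon as one model's partial log-likelihood dominates the others by a margin. The toy models, observation alphabet and margin below are assumptions for illustration; they are not the modified HMM of the paper.

```python
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Log-likelihood of a (possibly incomplete) observation prefix under a discrete HMM."""
    alpha = pi * B[:, obs[0]]
    logp = np.log(alpha.sum()); alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        logp += np.log(alpha.sum()); alpha /= alpha.sum()
    return logp

def predict_early(obs_prefix, models, margin=1.5):
    """Return a gesture class once one model dominates the others by `margin` nats."""
    scores = np.array([forward_loglik(obs_prefix, *m) for m in models])
    best = int(np.argmax(scores))
    runner_up = np.partition(scores, -2)[-2]
    return best if scores[best] - runner_up > margin else None   # None = keep observing

# Two toy 2-state gesture models over a 3-symbol observation alphabet.
pi = np.array([0.9, 0.1])
A  = np.array([[0.8, 0.2], [0.2, 0.8]])
B1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])   # model for gesture "point"
B2 = np.array([[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]])   # model for gesture "circle"
models = [(pi, A, B1), (pi, A, B2)]
print(predict_early([0, 0, 1, 0], models))           # 0 ("point"), before the gesture completes
```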

5.
Speech/gesture interface to a visual-computing environment
We developed a speech/gesture interface that uses visual hand-gesture analysis and speech recognition to control a 3D display in VMD, a virtual environment for structural biology. The reason we used a particular virtual environment context was to set the necessary constraints to make our analysis robust and to develop a command language that optimally combines speech and gesture inputs. Our interface uses: automatic speech recognition (ASR), aided by a microphone, to recognize voice commands; two strategically positioned cameras to detect hand gestures; and automatic gesture recognition (AGR), a set of computer vision techniques to interpret those hand gestures. The computer vision algorithms can extract the user's hand from the background, detect different finger positions, and distinguish meaningful gestures from unintentional hand movements. Our main goal was to simplify model manipulation and rendering to make biomolecular modeling more playful. Researchers can explore variations of their model and concentrate on biomolecular aspects of their task without undue distraction by computational aspects. They can view simulations of molecular dynamics, play with different combinations of molecular structures, and better understand the molecules' important properties. A potential benefit, for example, might be reducing the time to discover new compounds for new drugs  相似文献   
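A minimal sketch of the kind of hand-extraction step an AGR pipeline might begin with: skin-colour thresholding of an HSV frame followed by a centroid estimate of the hand region. The thresholds and the toy frame are illustrative assumptions, not the computer vision algorithms used in this system.

```python
import numpy as np

def hand_mask(hsv_frame, h_max=25, s_min=40, v_min=60):
    """Rough skin-colour segmentation of an HSV frame (H in [0,180], S/V in [0,255])."""
    h, s, v = hsv_frame[..., 0], hsv_frame[..., 1], hsv_frame[..., 2]
    return (h <= h_max) & (s >= s_min) & (v >= v_min)

def hand_centroid(mask):
    """Pixel coordinates of the segmented hand, or None if no hand is visible."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(xs.mean()), float(ys.mean())

# Toy frame: a skin-coloured square on a dark background.
frame = np.zeros((120, 160, 3), dtype=np.uint8)
frame[40:80, 60:100] = (12, 120, 200)            # H, S, V of the "hand" region
cx, cy = hand_centroid(hand_mask(frame))
print(cx, cy)                                    # approximately (79.5, 59.5)
```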

6.
We propose a new two-stage framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. In the first stage analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns, respectively, for a particular speaker. In the second stage, joint analysis of correlations between these elementary head gesture and prosody patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
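The synthesis stage can be sketched as a lookup from prosody patterns to the most probable gesture patterns, whose associated Euler angles are then turned into head rotations. The mapping table and angle values below are illustrative stand-ins for the Multi-Stream HMM model learned in the analysis stage; scipy is used only for the Euler-angle conversion.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Illustrative audio-visual mapping: P(gesture pattern | prosody pattern),
# standing in for the Multi-Stream HMM learned in the analysis stage.
AV_MAP = np.array([
    [0.7, 0.2, 0.1],   # prosody pattern 0 -> gesture patterns 0..2
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
])
# Mean Euler angles (degrees; yaw, pitch, roll) associated with each gesture pattern.
GESTURE_EULER = np.array([[ 5.0,  2.0, 0.0],
                          [-8.0,  1.0, 0.0],
                          [ 0.0, -6.0, 2.0]])

def synthesize_head_motion(prosody_patterns):
    """Map a prosody-pattern sequence to head rotations for the speaker's head model."""
    gesture_seq = AV_MAP[prosody_patterns].argmax(axis=1)   # most likely gesture per pattern
    angles = GESTURE_EULER[gesture_seq]
    return R.from_euler("ZYX", angles, degrees=True)        # rotations to apply to the head

rots = synthesize_head_motion(np.array([0, 0, 1, 2, 1]))
print(rots.as_quat().shape)   # (5, 4): one quaternion per pattern in the sequence
```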

7.
Generating expressive speech for storytelling applications
Work on expressive speech synthesis has long focused on the expression of basic emotions. In recent years, however, interest in other expressive styles has been increasing. The research presented in this paper aims at the generation of a storytelling speaking style, which is suitable for storytelling applications and more in general, for applications aimed at children. Based on an analysis of human storytellers' speech, we designed and implemented a set of prosodic rules for converting "neutral" speech, as produced by a text-to-speech system, into storytelling speech. An evaluation of our storytelling speech generation system showed encouraging results.  相似文献   
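A hedged sketch of what a rule set of this kind can look like: widen the pitch range around the speaker mean, slow the speaking rate and lengthen pauses. The gain values are illustrative assumptions, not the rules derived from the storytellers' speech in the paper.

```python
import numpy as np

def to_storytelling(f0_hz, durations_s, pause_mask,
                    pitch_range_gain=1.3, rate_factor=1.15, pause_gain=1.6):
    """Apply simple storytelling prosody rules to neutral TTS prosody.

    f0_hz       : per-frame fundamental frequency (0 for unvoiced frames)
    durations_s : per-phone durations in seconds
    pause_mask  : boolean array marking which phones are pauses
    """
    voiced = f0_hz > 0
    mean_f0 = f0_hz[voiced].mean()
    f0_out = f0_hz.copy()
    # Widen the pitch range around the speaker mean for a livelier delivery.
    f0_out[voiced] = mean_f0 + pitch_range_gain * (f0_hz[voiced] - mean_f0)
    # Slow down speech and lengthen pauses to create suspense.
    dur_out = durations_s * np.where(pause_mask, pause_gain, rate_factor)
    return f0_out, dur_out

# Toy usage on a short phrase with two internal pauses.
f0 = np.array([0, 180, 190, 185, 0, 200, 210, 0], dtype=float)
dur = np.array([0.06, 0.08, 0.07, 0.09, 0.20, 0.08, 0.07, 0.25])
pauses = np.array([True, False, False, False, True, False, False, True])
story_f0, story_dur = to_storytelling(f0, dur, pauses)
```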

8.
Style Transfer Functions for Illustrative Volume Rendering
Illustrative volume visualization frequently employs non-photorealistic rendering techniques to enhance important features or to suppress unwanted details. However, it is difficult to integrate multiple non-photorealistic rendering approaches into a single framework due to great differences in the individual methods and their parameters. In this paper, we present the concept of style transfer functions. Our approach enables flexible data-driven illumination which goes beyond using the transfer function to just assign colors and opacities. An image-based lighting model uses sphere maps to represent non-photorealistic rendering styles. Style transfer functions allow us to combine a multitude of different shading styles in a single rendering. We extend this concept with a technique for curvature-controlled style contours and an illustrative transparency model. Our implementation of the presented methods allows interactive generation of high-quality volumetric illustrations.  相似文献   
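The core lookup can be sketched as follows: the transfer function selects a style (a sphere map) and an opacity from the data value, and shading is fetched from the sphere map using the eye-space normal. The indexing convention, toy style images and ramp functions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def shade_sample(value, normal_eye, style_maps, style_tf, opacity_tf):
    """Shade one volume sample with a style transfer function.

    value      : scalar data value in [0, 1]
    normal_eye : unit gradient/normal in eye space, shape (3,)
    style_maps : (n_styles, H, W, 3) sphere-map images encoding lit appearance
    style_tf   : maps value -> style index
    opacity_tf : maps value -> opacity in [0, 1]
    """
    style = style_maps[style_tf(value)]
    h, w = style.shape[:2]
    # Standard sphere-map indexing: the normal's x/y components select a texel.
    u = int((normal_eye[0] * 0.5 + 0.5) * (w - 1))
    v = int((normal_eye[1] * 0.5 + 0.5) * (h - 1))
    return style[v, u], opacity_tf(value)

# Toy usage: two flat 64x64 "styles" (blue and orange) and simple ramps.
styles = np.stack([np.ones((64, 64, 3)) * np.array(c)
                   for c in ([0.2, 0.4, 0.9], [0.9, 0.5, 0.1])])
rgb, alpha = shade_sample(0.8, np.array([0.0, 0.7, 0.714]),
                          styles,
                          style_tf=lambda v: 0 if v < 0.5 else 1,
                          opacity_tf=lambda v: min(1.0, 2.0 * v))
print(rgb, alpha)
```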

9.
Actions performed by a virtual character can be controlled with verbal commands such as ‘walk five steps forward’. Similar control of the motion style, meaning how the actions are performed, is complicated by the ambiguity of describing individual motions with phrases such as ‘aggressive walking’. In this paper, we present a method for controlling motion style with relative commands such as ‘do the same, but more sadly’. Based on acted example motions, comparative annotations, and a set of calculated motion features, relative styles can be defined as vectors in the feature space. We present a new method for creating these style vectors by finding out which features are essential for a style to be perceived and eliminating those that show only incidental correlations with the style. We show with a user study that our feature selection procedure is more accurate than earlier methods for creating style vectors, and that the style definitions generalize across different actors and annotators. We also present a tool enabling interactive control of parametric motion synthesis by verbal commands. As the control method is independent from the generation of motion, it can be applied to virtually any parametric synthesis method.  相似文献   
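A sketch of how a relative command could be applied once a style vector exists: average the feature difference between styled and neutral examples, keep only the features judged essential, and step along the resulting vector. The feature mask and toy data are illustrative; the paper's actual contribution is the procedure for selecting the essential features.

```python
import numpy as np

def style_vector(neutral_feats, styled_feats, essential_mask):
    """Average feature difference between styled and neutral examples,
    restricted to the features judged essential for the style."""
    vec = (styled_feats - neutral_feats).mean(axis=0)
    return vec * essential_mask          # zero out incidental correlations

def apply_relative_command(current_feats, vec, amount=1.0):
    """'Do the same, but more sadly': step further along the style vector."""
    return current_feats + amount * vec

# Toy data: 8-D motion features for 5 neutral and 5 "sad" example motions.
rng = np.random.default_rng(1)
neutral = rng.normal(size=(5, 8))
sad = neutral + np.array([0, 0, -1.5, 0, 0.8, 0, 0, 0]) + 0.1 * rng.normal(size=(5, 8))
mask = np.array([0, 0, 1, 0, 1, 0, 0, 0])            # only features 2 and 4 are essential
sad_vec = style_vector(neutral, sad, mask)
more_sad = apply_relative_command(neutral[0], sad_vec, amount=1.5)
print(np.round(sad_vec, 2))                           # non-zero only in the essential features
```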

10.
11.
Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots directly interact with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities for controlling our mobile robot Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the upper-body extremities to track and recognize pointing and symbolic gestures, whether performed with one hand or with both. This framework constitutes our first contribution: it is shown to properly handle natural artifacts (self-occlusion, the hand leaving the camera's field of view, hand deformation) when 3D gestures are performed with either hand or with both. A speech recognition and understanding system based on the Julius engine is also developed and embedded to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses results from the speech and gesture components. This interpreter is shown to improve the classification rates of multimodal commands compared to using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are reported in the context of an interactive manipulation task, where users specify local motion commands to Jido and perform safe object exchanges.
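One simple way to picture the fusion step is a naive-Bayes-style product of the per-modality posteriors over commands, with a small floor for hypotheses missing from one modality. The command inventory and scores below are illustrative assumptions, not the interpreter used on Jido.

```python
def fuse_hypotheses(speech_hyps, gesture_hyps):
    """Fuse speech and gesture posteriors over commands (naive-Bayes style product).

    speech_hyps, gesture_hyps : dicts mapping command label -> posterior probability
    Returns commands ranked by fused score.
    """
    commands = set(speech_hyps) | set(gesture_hyps)
    floor = 1e-3                                  # keep commands missing from one modality alive
    fused = {c: speech_hyps.get(c, floor) * gesture_hyps.get(c, floor) for c in commands}
    total = sum(fused.values())
    return sorted(((c, s / total) for c, s in fused.items()), key=lambda x: -x[1])

# "Put it there" + pointing gesture: the speech is ambiguous, the gesture disambiguates.
speech = {"place_object": 0.45, "give_object": 0.40, "stop": 0.15}
gesture = {"place_object": 0.70, "give_object": 0.10, "stop": 0.20}
print(fuse_hypotheses(speech, gesture)[0])        # ('place_object', ...)
```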

12.
Humans use a combination of gesture and speech to interact with objects and usually do so more naturally without holding a device or pointer. We present a system that incorporates user body-pose estimation, gesture recognition and speech recognition for interaction in virtual reality environments. We describe a vision-based method for tracking the pose of a user in real time and introduce a technique that provides parameterized gesture recognition. More precisely, we train a support vector classifier to model the boundary of the space of possible gestures, and train Hidden Markov Models (HMM) on specific gestures. Given a sequence, we can find the start and end of various gestures using a support vector classifier, and find gesture likelihoods and parameters with a HMM. A multimodal recognition process is performed using rank-order fusion to merge speech and vision hypotheses. Finally we describe the use of our multimodal framework in a virtual world application that allows users to interact using gestures and speech.  相似文献   
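A minimal sketch of rank-order fusion in the Borda-count style: sum each hypothesis's ranks across the ranked lists from speech and vision and re-sort. This is one common formulation, assumed here for illustration rather than taken from the paper.

```python
def rank_order_fusion(*ranked_lists):
    """Merge several ranked hypothesis lists by summing ranks (lower total = better)."""
    scores = {}
    for hyps in ranked_lists:
        worst = len(hyps)                          # missing hypotheses get the worst rank
        for label in set().union(*ranked_lists):
            rank = hyps.index(label) if label in hyps else worst
            scores[label] = scores.get(label, 0) + rank
    return sorted(scores, key=scores.get)

speech_rank = ["rotate", "zoom", "select"]         # best-first hypotheses from speech
vision_rank = ["zoom", "rotate", "grab"]           # best-first hypotheses from gesture/vision
print(rank_order_fusion(speech_rank, vision_rank)) # 'rotate' and 'zoom' tie near the top
```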

13.
Objective: Controller-based motion generation methods have limitations; they adapt poorly to changes in skeletal parameters and motion style. We propose a walking-motion controller, and a method for generating it, that targets multiple skeletons and multiple styles. Method: First, an improved proportional-derivative (PD) controller is used as a preprocessing step so that the simulation can tolerate large PD gains. Specific rules are then applied to adjust the PD coefficients, which are optimized with a rotational iteration algorithm. Finally, a set of objective functions related to stability and style is defined, and the target poses in the controller are optimized with the covariance matrix adaptation evolution strategy (CMA-ES). Results: The proposed method can generate a family of walking controllers for different skeletons and motion styles, and offers advantages over other methods in efficiency, stability, robustness, and diversity; in particular, the stable running time can be improved by an order of magnitude. Conclusion: The controllers generated by the proposed method produce stable motion with diverse, controllable styles, require little manual tuning, and do not demand strong domain expertise from the user, which enhances controller adaptability and broadens the range of applications for controller-generated motion.
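The two ingredients named above can be sketched as a PD torque law for tracking a target pose and a simple (mu, lambda) evolution strategy optimizing that target pose against a stability/style objective. A real implementation would use CMA-ES (e.g. the cma package) rather than this stand-in, and the gains and objective below are illustrative assumptions.

```python
import numpy as np

def pd_torque(q, qdot, q_target, kp=300.0, kd=30.0):
    """Proportional-derivative torque driving joint angles q toward q_target."""
    return kp * (q_target - q) - kd * qdot

def optimize_target_pose(objective, dim, iters=200, pop=16, sigma=0.2, seed=0):
    """Minimal (mu, lambda) evolution strategy standing in for CMA-ES."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(iters):
        samples = mean + sigma * rng.normal(size=(pop, dim))
        costs = np.array([objective(s) for s in samples])
        elite = samples[np.argsort(costs)[: pop // 4]]
        mean = elite.mean(axis=0)                 # move the search distribution to the elites
    return mean

# Illustrative objective: stay close to a nominal pose (stability term)
# while lifting one joint for a stylized gait (style term).
nominal = np.zeros(6)
objective = lambda q: np.sum((q - nominal) ** 2) + 2.0 * (q[3] - 0.4) ** 2
best_pose = optimize_target_pose(objective, dim=6)
torque = pd_torque(q=np.zeros(6), qdot=np.zeros(6), q_target=best_pose)
```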

14.
Gesture and speech are co-expressive and complementary channels of a single human language system. While speech carries the major load of symbolic presentation, gesture provides the imagistic content. We investigate the role of oscillatory/cyclical hand motions in ‘carrying’ this image content. We present our work on the extraction of hand motion oscillation frequencies of gestures that accompany speech. The key challenges are that such motions are characterized by non-stationary oscillations, and multiple frequencies may be simultaneously extant. Also, the duration of the oscillations may be extended over very few cycles. We apply the windowed Fourier transform and wavelet transform to detect and extract gesticulatory oscillations. We tested these against synthetic signals (stationary and non-stationary) and real data sequences of gesticulatory hand movements in natural discourse. Our results show that both filters functioned well for the synthetic signals. For the real data, the wavelet bandpass filter bank is better for detecting and extracting hand gesture oscillations. We relate the hand motion oscillatory gestures detected by wavelet analysis to speech in natural conversation and apply to multimodal language analysis. We demonstrate the ability of our algorithm to extract gesticulatory oscillations and show how oscillatory gestures reveal portions of the multimodal discourse structure.  相似文献   
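The windowed-Fourier variant can be sketched with scipy's STFT: estimate the dominant oscillation frequency of a hand coordinate in each analysis window and flag windows whose peak stands out from the background. The synthetic trajectory, window length and threshold are illustrative; the wavelet filter bank the authors found preferable would replace the STFT here.

```python
import numpy as np
from scipy.signal import stft

fs = 30.0                                      # motion-capture / video frame rate in Hz
t = np.arange(0, 10, 1 / fs)
# Synthetic hand trajectory: a 1.5 Hz beat gesture that stops halfway, plus noise.
hand_x = np.where(t < 5, np.sin(2 * np.pi * 1.5 * t), 0.0) + 0.05 * np.random.randn(t.size)

f, seg_times, Z = stft(hand_x, fs=fs, nperseg=64, noverlap=48)
power = np.abs(Z) ** 2
dominant = f[power.argmax(axis=0)]             # dominant frequency per window
oscillating = power.max(axis=0) > 10 * power.mean()   # crude presence test per window

for ts, freq, on in zip(seg_times, dominant, oscillating):
    if on:
        print(f"t={ts:4.1f}s  oscillation at {freq:.2f} Hz")
```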

15.
In this paper, we aim for the recognition of a set of dance gestures from contemporary ballet. Our input data are motion trajectories followed by the joints of a dancing body provided by a motion-capture system. It is obvious that direct use of the original signals is unreliable and expensive. Therefore, we propose a suitable tool for non-uniform sub-sampling of spatiotemporal signals. The key to our approach is the use of a deformable model to provide a compact and efficient representation of motion trajectories. Our dance gesture recognition method involves a set of hidden Markov models (HMMs), each of them being related to a motion trajectory followed by the joints. The recognition of such movements is then achieved by matching the resulting gesture models with the input data via HMMs. We have validated our recognition system on 12 fundamental movements from contemporary ballet performed by four dancers.

16.
This paper presents an approach for the integration of Virtual Reality (VR) and Computer-Aided Design (CAD). Our general goal is to develop a VR-CAD framework making possible intuitive and direct 3D edition on CAD objects within Virtual Environments (VE). Such a framework can be applied to collaborative part design activities and to immersive project reviews. The cornerstone of our approach is a model that manages implicit editing of CAD objects. This model uses a naming technique of B-Rep components and a set of logical rules to provide straight access to the operators of Construction History Graphs (CHG). Another set of logical rules and the replay capacities of CHG make it possible to modify in real-time the parameters of these operators according to the user’s 3D interactions. A demonstrator of our model has been developed on the OpenCASCADE geometric kernel, but we explain how it can be applied to more standard CAD systems such as CATIA. We combined our VR-CAD framework with multimodal immersive interaction (using 6 DoF tracking, speech and gesture recognition systems) to gain direct and intuitive deformation of the objects’ shapes within a VE, thus avoiding explicit interactions with the CHG within a classical WIMP interface. In addition, we present several haptic paradigms specially conceptualized and evaluated to provide an accurate perception of B-Rep components and to help the user during his/her 3D interactions. Finally, we conclude on some issues for future researches in the field of VR-CAD integration.  相似文献   

17.
谢宁, 赵婷婷, 杨阳, 魏琴, Heng Tao SHEN. Journal of Software (软件学报), 2018, 29(4): 1071-1084
Learning-based intelligent image style painting is currently a hot research topic in multimedia, and in artistic stylization in particular. Image style learning algorithms study how multimedia data and machine learning methods can be used to automatically and intelligently render real sample data in artistic styles. Mainstream methods mainly learn from static samples of artistic images. However, because the information contained in static data is flattened, localized, and discontinuous, it is difficult to guarantee global consistency of the stylization. Targeting the large-scale, complex multimedia data produced in the creative process, this paper approaches intelligent artistic style painting from the theory of sequential task learning at three levels: a theoretical model, a design method, and an optimization method. The main contributions are: (1) a multimedia data acquisition device and software system for digital fine arts; (2) modeling of artistic-style behavior, and a method for its digital preservation, based on an inverse reinforcement learning (IRL) algorithm; and (3) a PGPE-based regularized policy learning method that improves the stability of the style learning process. Experimental results show that the proposed method effectively converts photographs into ink-wash paintings in a specific, personalized artistic style. Built on sequential multimedia data acquisition and analysis, the proposed mobile-Internet-oriented assistant system for automatic artistic style painting is not only theoretically novel but also of great practical value.
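Contribution (3) can be illustrated with a bare-bones PGPE loop (parameter-exploring policy gradients with symmetric sampling) in which an L2 penalty on the sampled parameters acts as the regularizer. The return function is a stand-in for the painting task, and the update rules follow the standard PGPE form rather than the exact method of the paper.

```python
import numpy as np

def pgpe(ret_fn, dim, iters=1000, lr_mu=0.05, lr_sigma=0.02, l2=0.01, seed=0):
    """PGPE with symmetric sampling and an L2 penalty on the sampled parameters.

    ret_fn : maps a parameter vector theta to an episodic return (higher is better).
    The L2 term regularizes the search, which helps keep style learning stable.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    baseline = None
    for _ in range(iters):
        eps = sigma * rng.normal(size=dim)
        # Regularized returns for the two symmetric samples.
        r_plus  = ret_fn(mu + eps) - l2 * np.sum((mu + eps) ** 2)
        r_minus = ret_fn(mu - eps) - l2 * np.sum((mu - eps) ** 2)
        r_mean = 0.5 * (r_plus + r_minus)
        baseline = r_mean if baseline is None else 0.9 * baseline + 0.1 * r_mean
        mu    += lr_mu * 0.5 * (r_plus - r_minus) * eps
        sigma += lr_sigma * (r_mean - baseline) * (eps ** 2 - sigma ** 2) / sigma
        sigma  = np.clip(sigma, 1e-3, 10.0)
    return mu

# Stand-in return: prefer brush-stroke parameters near a target "style" vector.
target = np.array([0.5, -0.3, 0.8, 0.0])
best = pgpe(lambda th: -np.sum((th - target) ** 2), dim=4)
print(np.round(best, 2))   # close to `target`, slightly shrunk by the L2 term
```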

18.
19.
The processing of captured motion is an essential task for undertaking the synthesis of high-quality character animation. The motion decomposition techniques investigated in prior work extract meaningful motion primitives that help to facilitate this process. Carefully selected motion primitives can play a major role in various motion-synthesis tasks, such as interpolation, blending, warping, editing or the generation of new motions. Unfortunately, for a complex character motion, finding generic motion primitives by decomposition is an intractable problem due to the compound nature of the behaviours of such characters. Additionally, decomposed motion primitives tend to be too limited for the chosen model to cover a broad range of motion-synthesis tasks. To address these challenges, we propose a generative motion decomposition framework in which the decomposed motion primitives are applicable to a wide range of motion-synthesis tasks. Technically, the input motion is smoothly decomposed into three motion layers. These are base-level motion, a layer with controllable motion displacements and a layer with high-frequency residuals. The final motion can easily be synthesized simply by changing a single user parameter that is linked to the layer of controllable motion displacements or by imposing suitable temporal correspondences to the decomposition framework. Our experiments show that this decomposition provides a great deal of flexibility in several motion synthesis scenarios: denoising, style modulation, upsampling and time warping.  相似文献   
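The three-layer split can be sketched with two moving-average filters: a heavily smoothed base layer, a mid-band of controllable displacements, and a high-frequency residual, with resynthesis driven by a single gain on the middle layer. The filters and window sizes are illustrative stand-ins for the decomposition used in the paper.

```python
import numpy as np

def smooth(x, win):
    """Centered moving average with edge padding."""
    kernel = np.ones(win) / win
    pad = win // 2
    xp = np.pad(x, pad, mode="edge")
    return np.convolve(xp, kernel, mode="same")[pad:pad + x.size]

def decompose(motion, base_win=31, detail_win=7):
    base = smooth(motion, base_win)                 # base-level motion
    mid = smooth(motion, detail_win) - base         # controllable motion-displacement layer
    residual = motion - base - mid                  # high-frequency residuals
    return base, mid, residual

def resynthesize(base, mid, residual, displacement_gain=1.0):
    """A single user parameter scales the displacement layer (exaggerate or flatten style)."""
    return base + displacement_gain * mid + residual

# Toy joint-angle trajectory: slow motion + faster detail + noise.
t = np.linspace(0, 4 * np.pi, 400)
joint_angle = np.sin(t) + 0.3 * np.sin(5 * t) + 0.03 * np.random.randn(t.size)
layers = decompose(joint_angle)
exaggerated = resynthesize(*layers, displacement_gain=1.8)
denoised = resynthesize(*layers[:2], residual=np.zeros_like(joint_angle))  # drop the noise layer
```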

20.
Despite the existence of advanced functions in smartphones, most blind people are still using old-fashioned phones with familiar layouts and dependence on tactile buttons. Smartphones support accessibility features including vibration, speech and sound feedback, and screen readers. However, these features are only intended to provide feedback to user commands or input. It is still a challenge for blind people to discover functions on the screen and to input the commands. Although voice commands are supported in smartphones, these commands are difficult for a system to recognize in noisy environments. At the same time, smartphones are integrated with sophisticated motion sensors, and motion gestures with device tilt have been gaining attention for eyes-free input. We believe that these motion gesture interactions offer more efficient access to smartphone functions for blind people. However, most blind people are not smartphone users and they are aware of neither the affordances available in smartphones nor the potential for interaction through motion gestures. To investigate the most usable gestures for blind people, we conducted a user-defined study with 13 blind participants. Using the gesture set and design heuristics from the user study, we implemented motion gesture based interfaces with speech and vibration feedback for browsing phone books and making a call. We then conducted a second study to investigate the usability of the motion gesture interface and user experiences using the system. The findings indicated that motion gesture interfaces are more efficient than traditional button interfaces. Through the study results, we provided implications for designing smartphone interfaces.  相似文献   

