Similar Literature
20 similar documents found (search time: 31 ms)
1.
Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. As in human gesticulation, this yields rich, natural variation in motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and well match the input speech. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.

2.
Despite the existence of advanced functions in smartphones, most blind people are still using old-fashioned phones with familiar layouts and dependence on tactile buttons. Smartphones support accessibility features including vibration, speech and sound feedback, and screen readers. However, these features are only intended to provide feedback to user commands or input. It is still a challenge for blind people to discover functions on the screen and to input commands. Although voice commands are supported in smartphones, these commands are difficult for a system to recognize in noisy environments. At the same time, smartphones are integrated with sophisticated motion sensors, and motion gestures with device tilt have been gaining attention for eyes-free input. We believe that these motion gesture interactions offer more efficient access to smartphone functions for blind people. However, most blind people are not smartphone users and they are aware of neither the affordances available in smartphones nor the potential for interaction through motion gestures. To investigate the most usable gestures for blind people, we conducted a user-defined study with 13 blind participants. Using the gesture set and design heuristics from the user study, we implemented motion gesture based interfaces with speech and vibration feedback for browsing phone books and making a call. We then conducted a second study to investigate the usability of the motion gesture interface and user experiences using the system. The findings indicated that motion gesture interfaces are more efficient than traditional button interfaces. Through the study results, we provided implications for designing smartphone interfaces.

3.
We propose a new two-stage framework for joint analysis of head gesture and speech prosody patterns of a speaker towards automatic realistic synthesis of head gestures from speech prosody. In the first stage analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns, respectively, for a particular speaker. In the second stage, joint analysis of correlations between these elementary head gesture and prosody patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
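The first-stage unsupervised segmentation can be illustrated with an off-the-shelf HMM library. The sketch below is a rough analogue of that step only, not the authors' implementation; the prosody feature layout, frame rate, and number of states are assumptions.

```python
# Sketch: unsupervised temporal segmentation of prosody features with a Gaussian HMM.
# Not the paper's implementation; feature choice and state count are assumptions.
import numpy as np
from hmmlearn.hmm import GaussianHMM

# Per-frame prosody features, e.g. [f0, energy] stacked over time (T x 2).
prosody = np.random.rand(500, 2)   # placeholder for real f0/energy tracks

hmm = GaussianHMM(n_components=8, covariance_type="diag", n_iter=50)
hmm.fit(prosody)                   # unsupervised EM training
patterns = hmm.predict(prosody)    # Viterbi state labels = elementary prosody patterns

# Contiguous runs of the same state give the temporal segmentation.
boundaries = np.flatnonzero(np.diff(patterns)) + 1
```

The same recipe, run separately on head-rotation features, would yield the elementary gesture patterns that the second stage then correlates with the prosody patterns.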

4.
The use of hand gestures offers an alternative to the commonly used human–computer interfaces (i.e. keyboard, mouse, gamepad, voice, etc.), providing a more intuitive way of navigating among menus and in multimedia applications. This paper presents a dataset for the evaluation of hand gesture recognition approaches in human–computer interaction scenarios. It includes natural data and synthetic data from several state-of-the-art dictionaries. The dataset considers single-pose and multiple-pose gestures, as well as gestures defined by pose and motion or just by motion. Data types include static pose videos and gesture execution videos—performed by a set of eleven users and recorded with a time-of-flight camera—and synthetically generated gesture images. A novel collection of critical factors involved in the creation of a hand gestures dataset is proposed: capture technology, temporal coherence, nature of gestures, representativeness, pose issues and scalability. Special attention is given to the scalability factor, proposing a simple method for the synthetic generation of depth images of gestures, making possible the extension of a dataset with new dictionaries and gestures without the need of recruiting new users, as well as providing more flexibility in the point-of-view selection. The method is validated for the presented dataset. Finally, a separability study of the pose-based gestures of a dictionary is performed. The resulting corpus, which exceeds existing state-of-the-art datasets in terms of representativeness and scalability, provides a significant evaluation scenario for different kinds of hand gesture recognition solutions.

5.
Gesture and speech are co-expressive and complementary channels of a single human language system. While speech carries the major load of symbolic presentation, gesture provides the imagistic content. We investigate the role of oscillatory/cyclical hand motions in ‘carrying’ this image content. We present our work on the extraction of hand motion oscillation frequencies of gestures that accompany speech. The key challenges are that such motions are characterized by non-stationary oscillations, and multiple frequencies may be simultaneously extant. Also, the oscillations may last for only a few cycles. We apply the windowed Fourier transform and wavelet transform to detect and extract gesticulatory oscillations. We tested these against synthetic signals (stationary and non-stationary) and real data sequences of gesticulatory hand movements in natural discourse. Our results show that both filters functioned well for the synthetic signals. For the real data, the wavelet bandpass filter bank is better for detecting and extracting hand gesture oscillations. We relate the hand motion oscillatory gestures detected by wavelet analysis to speech in natural conversation and apply the results to multimodal language analysis. We demonstrate the ability of our algorithm to extract gesticulatory oscillations and show how oscillatory gestures reveal portions of the multimodal discourse structure.
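A continuous wavelet transform handles the non-stationary, short-lived oscillations described above. The sketch below shows the general idea with PyWavelets on a single hand coordinate; the frame rate, wavelet choice and scale range are assumptions, not the authors' filter bank.

```python
# Sketch: extracting (possibly non-stationary) oscillation frequencies of a hand
# trajectory with a continuous wavelet transform. Wavelet and rates are assumptions.
import numpy as np
import pywt

fs = 30.0                                  # assumed motion-capture frame rate (Hz)
t = np.arange(0, 10, 1 / fs)
hand_x = np.sin(2 * np.pi * 1.5 * t)       # placeholder for a real hand coordinate

scales = np.arange(2, 64)
coeffs, freqs = pywt.cwt(hand_x, scales, "cmor1.5-1.0", sampling_period=1 / fs)

power = np.abs(coeffs) ** 2
dominant_freq = freqs[np.argmax(power, axis=0)]   # dominant oscillation per frame
```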

6.
We propose an approach to achieving early recognition of gesture patterns. Early recognition is a method for recognizing sequential patterns at their earliest stage. Therefore, in the case of gesture recognition, we can get a recognition result for human gestures before the gestures are finished. The most difficult problem in early recognition is knowing when the system can commit to a result. Most traditional approaches suffer from this problem, since gestures are often ambiguous. At the start of a gesture, in particular, it is very difficult to determine the recognition result since insufficient input data have been observed. Therefore, we have improved on the traditional approach by using a self-organizing map.

7.
Communicative behaviors are a very important aspect of human behavior and deserve special attention when simulating groups and crowds of virtual pedestrians. Previous approaches have tended to focus on generating believable gestures for individual characters and talker-listener behaviors for static groups. In this paper, we consider the problem of creating rich and varied conversational behaviors for data-driven animation of walking and jogging characters. We captured ground truth data of participants conversing in pairs while walking and jogging. Our stylized splicing method takes as input a motion captured standing gesture performance and a set of looped full body locomotion clips. Guided by the ground truth metrics, we perform stylized splicing and synchronization of gesture with locomotion to produce natural conversations of characters in motion. Copyright © 2016 John Wiley & Sons, Ltd.

8.
This paper reports on the utility of gestures and speech to manipulate graphic objects. In the experiment described herein, three different populations of subjects were asked to communicate with a computer using either speech alone, gestures alone, or both. The task was the manipulation of a three-dimensional cube on the screen. They were asked to assume that the computer could see their hands, hear their voices, and understand their gestures and speech as well as a human could. A gesture classification scheme was developed to analyse the gestures of the subjects. A primary objective of the classification scheme was to determine whether common features would be found among the gestures of different users and classes of users. The collected data show a surprising degree of commonality among subjects in the use of gestures as well as speech. In addition to the uniformity of the observed manipulations, subjects expressed a preference for a combined gesture/speech interface. Furthermore, all subjects easily completed the simulated object manipulation tasks. The results of this research, and of future experiments of this type, can be applied to develop a gesture-based or gesture/speech-based system which enables computer users to manipulate graphic objects using easily learned and intuitive gestures to perform spatial tasks. Such tasks might include editing a three-dimensional rendering, controlling the operation of vehicles or operating virtual tools in three dimensions, or assembling an object from components. Knowledge about how people intuitively use gestures to communicate with computers provides the basis for future development of gesture-based input devices.

9.
刘杰, 黄进, 田丰, 胡伟平, 戴国忠, 王宏安. 软件学报 (Journal of Software), 2017, 28(8): 2080-2095
This paper analyzes the current state and open problems of touch interaction on mobile handheld and wearable devices. Based on the temporal and spatial continuity of interaction movements, it proposes a hybrid gesture input method that combines the on-surface trajectory and the in-air trajectory of a touch interaction, thereby combining the characteristics and advantages of both mid-air and touch gestures. Building on the concept of a continuous interaction space, hybrid gestures, mid-air gestures and surface touch gestures are unified in a layered processing model of the continuous interaction space comprising an in-air layer, a surface layer and a hybrid layer. A unified data definition and data conversion workflow is given, a general-purpose gesture recognition framework is constructed, and the trajectory segmentation and gesture classification methods are described. Finally, an example application was designed, and experiments verified the usability of hybrid gestures and the feasibility of the layered continuous-space processing model. The experiments show that hybrid gesture input combines the advantages of surface touch input and mid-air gesture input, offering good spatial freedom while maintaining recognition efficiency.
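One way to picture the layered continuous interaction space is as a unified event record that every input source feeds into, with a simple rule assigning each trajectory to a layer. The field names and rule below are illustrative assumptions only, not the paper's data definition.

```python
# Sketch: a unified event record for the layered continuous interaction space
# (air / surface / hybrid). Field names and the rule are illustrative assumptions.
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    AIR = "air"          # in-air segment of the trajectory
    SURFACE = "surface"  # on-screen (touch) segment
    HYBRID = "hybrid"    # gesture spanning both layers

@dataclass
class GestureSample:
    t: float             # timestamp (s)
    x: float
    y: float
    z: float             # height above the surface; 0 while touching
    touching: bool

def classify_layer(samples: list[GestureSample]) -> Layer:
    """Assign a trajectory to a layer from its touch states (simplified rule)."""
    touch = [s.touching for s in samples]
    if all(touch):
        return Layer.SURFACE
    if not any(touch):
        return Layer.AIR
    return Layer.HYBRID
```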

10.
Machine learning is a technique for analyzing data that aids the construction of mathematical models. Because of the growth of the Internet of Things (IoT) and wearable sensor devices, gesture interfaces are becoming a more natural and expedient human-machine interaction method. This type of artificial intelligence, which requires minimal or no direct human intervention in decision-making, is predicated on the ability of intelligent systems to self-train and detect patterns. The rise of touch-free applications and the number of deaf people have increased the significance of hand gesture recognition. Potential applications of hand gesture recognition research span from online gaming to surgical robotics. The location of the hands, the alignment of the fingers, and the hand-to-body posture are the fundamental components of hierarchical emotions in gestures. Linguistic gestures may be difficult to distinguish from nonsensical motions in the field of gesture recognition. In this scenario, it may be difficult to overcome segmentation uncertainty caused by accidental hand motions or trembling. When a user performs the same dynamic gesture, the hand shapes and speeds of each user, as well as those often generated by the same user, vary. A machine-learning-based Gesture Recognition Framework (ML-GRF) for recognizing the beginning and end of a gesture sequence in a continuous stream of data is suggested to solve the problem of distinguishing meaningful dynamic gestures from scattered, unintentional motion. We have recommended using a similarity matching-based gesture classification approach to reduce the overall computing cost associated with identifying actions, and we have shown how an efficient feature extraction method can be used to reduce the information describing a single gesture to four-binary-digit gesture codes. The findings from the simulation support the reported accuracy, precision, gesture recognition, sensitivity, and efficiency rates. The Machine Learning-based Gesture Recognition Framework (ML-GRF) had an accuracy rate of 98.97%, a precision rate of 97.65%, a gesture recognition rate of 98.04%, a sensitivity rate of 96.99%, and an efficiency rate of 95.12%.
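The similarity-matching idea with short binary gesture codes can be illustrated in a few lines: compare an incoming code against stored templates and return the closest one. The abstract does not specify ML-GRF's actual encoding, so the templates and labels below are made-up assumptions.

```python
# Sketch: nearest-template classification of short binary gesture codes by
# Hamming similarity. The 4-digit codes below are illustrative, not ML-GRF's.
TEMPLATES = {
    "swipe_left": "0110",
    "swipe_right": "1001",
    "zoom_in": "1111",
    "zoom_out": "0000",
}

def hamming_similarity(a: str, b: str) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

def classify(code: str) -> tuple[str, float]:
    """Return the best-matching gesture label and its similarity score."""
    label = max(TEMPLATES, key=lambda k: hamming_similarity(code, TEMPLATES[k]))
    return label, hamming_similarity(code, TEMPLATES[label])

print(classify("1011"))   # ('swipe_right', 0.75)
```

Matching 4-bit codes instead of raw trajectories is what keeps the per-gesture classification cost essentially constant.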

11.
An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have only used lexical and prosodic cues, but have not used nonverbal cues, e.g., gesture. Gestures play an important role in human conversations, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we have investigated the use of gesture cues for enhancing the SU detection. Particularly, we have focused on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary’s probability of being an SU. On the basis of the data analyses, a set of novel gestural features were proposed for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than using only speech features in discriminative models. Findings in this paper support the view that human conversations are processes involving multimodal cues, and so they are more effectively modeled using information from both verbal and nonverbal channels.
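The feature-level combination of speech and gesture cues can be sketched as a discriminative classifier over per-word-boundary feature vectors. The feature names and random data below are placeholders, not the paper's feature set or models.

```python
# Sketch: fusing speech and gesture cues for sentence-unit (SU) boundary detection
# with a discriminative classifier. Features and data here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
speech_feats = rng.random((n, 4))    # e.g. pause duration, f0 reset, energy drop, LM score
gesture_feats = rng.random((n, 2))   # e.g. gesture-hold flag, time since last stroke
y = rng.integers(0, 2, n)            # 1 = word boundary is an SU boundary

X = np.hstack([speech_feats, gesture_feats])   # simple feature-level fusion
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("P(SU) for first boundary:", clf.predict_proba(X[:1])[0, 1])
```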

12.
To overcome the communication barrier between people with speech impairments and hearing people, a neural-network-based method for converting sign language into emotional speech is proposed. First, a gesture corpus, a facial expression corpus and an emotional speech corpus are built. Deep convolutional neural networks are then used for gesture recognition and facial expression recognition, and, with Mandarin initials and finals as synthesis units, a speaker-adaptive deep neural network acoustic model and a speaker-adaptive hybrid long short-term memory (LSTM) acoustic model for emotional speech are trained. Finally, the context-dependent labels of the gesture semantics and the emotion labels corresponding to the facial expressions are fed into the emotional speech synthesis model to synthesize the corresponding emotional speech. Experimental results show that the method achieves gesture and facial expression recognition rates of 95.86% and 92.42%, respectively, and that the synthesized emotional speech reaches an EMOS score of 4.15, indicating a high degree of emotional expressiveness; the method can therefore support everyday communication between people with speech impairments and hearing people.

13.
We present ZeroEGGS, a neural network framework for speech-driven gesture generation with zero-shot style control by example. This means style can be controlled via only a short example motion clip, even for motion styles unseen during training. Our model uses a Variational framework to learn a style embedding, making it easy to modify style through latent space manipulation or blending and scaling of style embeddings. The probabilistic nature of our framework further enables the generation of a variety of outputs given the input, addressing the stochastic nature of gesture motion. In a series of experiments, we first demonstrate the flexibility and generalizability of our model to new speakers and styles. In a user study, we then show that our model outperforms previous state-of-the-art techniques in naturalness of motion, appropriateness for speech, and style portrayal. Finally, we release a high-quality dataset of full-body gesture motion including fingers, with speech, spanning across 19 different styles. Our code and data are publicly available at https://github.com/ubisoft/ubisoft-laforge-ZeroEGGS.
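The released code at the linked repository is the authoritative implementation; the PyTorch fragment below only illustrates the general idea of a variational style encoder with the reparameterization trick. All dimensions, the GRU pooling, and the class name are assumptions, not ZeroEGGS internals.

```python
# Sketch: a variational style encoder (reparameterization trick) in PyTorch.
# Illustrative only; dimensions and architecture are assumptions, not ZeroEGGS code.
import torch
import torch.nn as nn

class StyleEncoder(nn.Module):
    def __init__(self, pose_dim: int = 75, style_dim: int = 64):
        super().__init__()
        self.gru = nn.GRU(pose_dim, 128, batch_first=True)
        self.mu = nn.Linear(128, style_dim)
        self.logvar = nn.Linear(128, style_dim)

    def forward(self, example_motion: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(example_motion)          # summarize the example clip
        h = h[-1]                                # (batch, 128)
        mu, logvar = self.mu(h), self.logvar(h)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)  # sampled style embedding

clip = torch.randn(1, 120, 75)                   # (batch, frames, pose features)
z = StyleEncoder()(clip)
# Style embeddings of two clips can then be blended, e.g. 0.5 * z_a + 0.5 * z_b.
```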

14.
To address the problems that depth cameras place strict demands on the environment in complex scenes, that wearable devices are unnatural to use, and that deep learning models trained on small datasets show poor recognition ability and robustness, a gesture recognition method is proposed that combines a semantic-segmentation-based deep learning model for hand segmentation with a transfer-learning neural network for recognition. The collected image dataset is first augmented by rotating images through different angles, flipping them, and similar operations; a segmentation model is trained to segment the hand region; a transfer-learning convolutional neural network extracts gesture feature vectors; and a Softmax layer performs gesture classification. Experiments on 10 gestures performed by 4 people against different backgrounds show that the method correctly recognizes gestures in complex background environments.
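A minimal transfer-learning recipe of the kind described above can be written with torchvision: freeze a pretrained backbone and retrain only the final softmax layer on segmented hand crops. This is a generic sketch (assuming torchvision ≥ 0.13), not the paper's exact network or training schedule.

```python
# Sketch: transfer-learning gesture classifier on segmented hand crops.
# A generic torchvision recipe, not the paper's exact model.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False                      # freeze the pretrained backbone
model.fc = nn.Linear(model.fc.in_features, 10)   # 10 gesture classes

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()                # softmax classification

batch = torch.randn(8, 3, 224, 224)              # segmented hand crops (placeholder)
labels = torch.randint(0, 10, (8,))
loss = criterion(model(batch), labels)
loss.backward()
optimizer.step()
```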

15.
Minyoung Kim. Pattern Recognition, 2011, 44(10-11): 2325-2333
We introduce novel discriminative semi-supervised learning algorithms for dynamical systems, and apply them to the problem of 3D human motion estimation. Our recent work on discriminative learning of dynamical systems has been shown to achieve superior performance to traditional generative learning approaches. However, one of the main issues in learning dynamical systems is gathering labeled output sequences, which are typically obtained from precise motion capture tools and are hence expensive. In this paper we utilize a large amount of unlabeled (input) video data to improve the prediction performance of the dynamical systems significantly. We suggest two discriminative semi-supervised learning approaches that extend the well-known algorithms in static domains to the sequential, real-valued multivariate output domains: (i) self-training, which we derive as coordinate ascent optimization of a proper discriminative objective over both model parameters and the unlabeled state sequences, and (ii) a minimum entropy approach which maximally reduces the model's uncertainty in state prediction for unlabeled data points. These approaches are shown to achieve significant improvement over traditional generative semi-supervised learning methods. We demonstrate the benefits of our approaches on 3D human motion estimation problems.
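The paper's self-training is formulated for dynamical systems with real-valued output sequences; as a much simpler static analogue, the sketch below shows the generic self-training loop (fit on labeled data, pseudo-label confident unlabeled points, refit). It is only an illustration of the idea, not the paper's coordinate-ascent procedure.

```python
# Sketch: generic self-training loop on a static classifier, a simplified
# analogue of the paper's self-training approach rather than its actual method.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, rounds=5, threshold=0.9):
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold   # keep only confident pseudo-labels
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, proba[confident].argmax(axis=1)])
        pool = pool[~confident]
    return clf
```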

16.
We have developed a gesture input system that provides a common interaction technique across mobile, wearable and ubiquitous computing devices of diverse form factors. In this paper, we combine our gestural input technique with speech output and test whether or not the absence of a visual display impairs usability in this kind of multimodal interaction. This is of particular relevance to mobile, wearable and ubiquitous systems where visual displays may be restricted or unavailable. We conducted the evaluation using a prototype for a system combining gesture input and speech output to provide information to patients in a hospital Accident and Emergency Department. A group of participants was instructed to access various services using gestural inputs. The services were delivered by automated speech output. Throughout their tasks, these participants could see a visual display on which a GUI presented the available services and their corresponding gestures. Another group of participants performed the same tasks but without this visual display. It was predicted that the participants without the visual display would make more incorrect gestures and take longer to perform correct gestures than the participants with the visual display. We found no significant difference in the number of incorrect gestures made. We also found that participants with the visual display took longer than participants without it. It was suggested that for a small set of semantically distinct services with memorable and distinct gestures, the absence of a GUI visual display does not impair the usability of a system with gesture input and speech output.

17.
Gestures have played an essential role in human communication since ancient times. Vision-based dynamic gesture recognition uses emerging technologies such as computer vision and IoT sensing, together with new devices such as 3D vision sensors, to let machines understand human gestures and thereby communicate with people more naturally, which makes it highly relevant to research on human-computer interaction. This survey introduces the sensor technologies used in dynamic gesture recognition and compares their technical parameters. By tracking recent vision-based dynamic gesture recognition work in China and abroad, it lays out the processing pipeline of dynamic gesture recognition: gesture detection and segmentation, gesture tracking, and gesture classification. A comparison of the methods used at each stage shows that deep learning offers strong fault tolerance, a high degree of parallelism, robustness to interference, and other advantages, and has achieved results in gesture recognition far beyond those of traditional learning algorithms. Finally, the current challenges of dynamic gesture recognition and possible future directions are analyzed.

18.
We live in a society that depends on high-tech devices for assistance with everyday tasks, including everything from transportation to health care, communication, and entertainment. Tedious tactile input interfaces to these devices result in inefficient use of our time. Appropriate use of natural hand gestures will result in more efficient communication if the underlying meaning is understood. Overcoming natural hand gesture understanding challenges is vital to meet the needs of these increasingly pervasive devices in our everyday lives. This work presents a graph-based approach to understand the meaning of hand gestures by associating dynamic hand gestures with known concepts and relevant knowledge. Conceptual-level processing is emphasized to robustly handle noise and ambiguity introduced during generation, data acquisition, and low-level recognition. A simple recognition stage is used to help relax scalability limitations of conventional stochastic language models. Experimental results show that this graph-based approach to hand gesture understanding is able to successfully understand the meaning of ambiguous sets of phrases consisting of three to five hand gestures. The presented approximate graph-matching technique to understand human hand gestures supports practical and efficient communication of complex intent to the increasingly pervasive high-tech devices in our society.
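The flavor of approximate graph matching can be illustrated with networkx by scoring a recognized gesture-phrase graph against small labeled concept graphs via graph edit distance. The toy graphs and labels below are assumptions for illustration only, not the paper's knowledge base or matching algorithm.

```python
# Sketch: scoring a recognized gesture phrase against known concept graphs with
# graph edit distance. Toy graphs; not the paper's knowledge base or matcher.
import networkx as nx

def concept_graph(edges):
    g = nx.Graph()
    for a, b in edges:
        g.add_node(a, label=a)
        g.add_node(b, label=b)
        g.add_edge(a, b)
    return g

KNOWN = {
    "move_object": concept_graph([("point", "object"), ("object", "sweep")]),
    "rotate_object": concept_graph([("point", "object"), ("object", "circle")]),
}

observed = concept_graph([("point", "object"), ("object", "sweep"), ("sweep", "stop")])

def best_match(query):
    scores = {name: nx.graph_edit_distance(
                  query, g, node_match=lambda a, b: a["label"] == b["label"])
              for name, g in KNOWN.items()}
    return min(scores, key=scores.get), scores   # lower edit distance = closer concept

print(best_match(observed))
```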

19.
Gesture plays an important role in recognizing lecture activities in video content analysis. In this paper, we propose a real-time gesture detection algorithm that integrates cues from visual, speech and electronic slides. In contrast to conventional “complete gesture” recognition, we emphasize detection by prediction from “incomplete gestures”. Specifically, intentional gestures are predicted by a modified hidden Markov model (HMM) which can recognize incomplete gestures before the whole gesture paths are observed. The multimodal correspondence between speech and gesture is exploited to increase the accuracy and responsiveness of gesture detection. In lecture presentation, this algorithm enables the on-the-fly editing of lecture slides by simulating appropriate camera motion to highlight the intention and flow of lecturing. We develop a real-time application, namely a simulated smartboard, and demonstrate the feasibility of our prediction algorithm using hand gestures and a laser pen, with a simple setup that does not involve expensive hardware.
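The early-decision idea (committing to a gesture before its path completes) can be sketched with ordinary per-gesture HMMs: score a growing observation prefix under each model and commit as soon as one model is clearly ahead. This uses hmmlearn as an assumption and only illustrates the idea; it is not the paper's modified HMM.

```python
# Sketch: committing to a gesture before it finishes by scoring growing prefixes
# under per-gesture HMMs. Illustrates early decision, not the paper's modified HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def early_recognize(models: dict[str, GaussianHMM], frames: np.ndarray,
                    margin: float = 20.0, min_frames: int = 5):
    """Return (label, frame_index) as soon as one model is clearly ahead."""
    assert len(frames) >= min_frames and len(models) >= 2
    for t in range(min_frames, len(frames) + 1):
        prefix = frames[:t]
        scores = sorted(((m.score(prefix), name) for name, m in models.items()),
                        reverse=True)
        if scores[0][0] - scores[1][0] > margin:   # confident enough to commit
            return scores[0][1], t
    return scores[0][1], len(frames)               # fall back to full-sequence result
```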

20.
To address the high energy consumption and difficult deployment of traditional gesture recognition methods, a WiFi-based gesture recognition method is proposed. Doppler frequency shift components are extracted from the channel state information (CSI) collected from WiFi signals, resolving the unclear mapping between the statistical features used by existing wireless gesture recognition methods and the actual gesture movements. A deep hybrid CGRU-ELM model is further proposed to extract features from the Doppler components and classify them, and six common human-computer interaction gestures are recognized. Experimental results show that the method achieves an average accuracy of 93.4% for gesture recognition with WiFi signals as the input.
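A simplified view of the Doppler extraction step is a short-time Fourier transform over a motion-induced CSI amplitude stream, yielding a frequency-by-time map that a classifier can consume. Real CSI pipelines also need phase sanitization and subcarrier fusion, which are omitted here; the sampling rate and window length are assumptions.

```python
# Sketch: a Doppler-style spectrogram from one CSI amplitude stream via STFT.
# Real CSI processing also needs phase sanitization and subcarrier fusion
# (omitted); sampling rate and window length are assumptions.
import numpy as np
from scipy.signal import stft

fs = 1000                                  # assumed CSI sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
csi_amplitude = 1 + 0.1 * np.sin(2 * np.pi * 40 * t)   # placeholder for real CSI

# Remove the static (DC) component so only motion-induced variation remains.
x = csi_amplitude - csi_amplitude.mean()

f, times, Zxx = stft(x, fs=fs, nperseg=256, noverlap=192)
doppler_power = np.abs(Zxx) ** 2           # frequency x time map fed to the classifier
dominant = f[np.argmax(doppler_power, axis=0)]          # strongest shift per window
```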

