Similar Articles
20 similar articles found.
1.
Assistance is currently a pivotal research area in robotics, with huge societal potential. Since assistant robots interact directly with people, finding natural and easy-to-use user interfaces is of fundamental importance. This paper describes a flexible multimodal interface based on speech and gesture modalities for controlling our mobile robot named Jido. The vision system uses a stereo head mounted on a pan-tilt unit and a bank of collaborative particle filters devoted to the upper human body extremities to track and recognize pointing and symbolic gestures, both one-handed and two-handed. This framework constitutes our first contribution: it is shown to properly handle natural artifacts (self-occlusion, the hand leaving the camera field of view, hand deformation) when 3D gestures are performed with either hand or with both. A speech recognition and understanding system based on the Julius engine is also developed and embedded in order to process deictic and anaphoric utterances. The second contribution is a probabilistic, multi-hypothesis interpreter framework that fuses results from the speech and gesture components. This interpreter is shown to improve the classification rates of multimodal commands compared with using either modality alone. Finally, we report on successful live experiments in human-centered settings. Results are reported in the context of an interactive manipulation task, where users specify local motion commands to Jido and perform safe object exchanges.
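As a rough illustration of the kind of multi-hypothesis fusion described above (a generic sketch, not the authors' actual interpreter), the following Python snippet combines posterior scores from hypothetical speech and gesture recognizers over a shared set of command hypotheses; the command names, weights, and scores are assumptions for illustration only.

```python
# Minimal sketch of probabilistic fusion of speech and gesture hypotheses.
# All command names, scores, and weights below are illustrative assumptions,
# not values from the paper.
import math

def fuse_hypotheses(speech_post, gesture_post, w_speech=0.6, w_gesture=0.4):
    """Combine per-command posteriors from two modalities by a weighted
    log-linear (product-of-experts style) rule and renormalize."""
    commands = set(speech_post) | set(gesture_post)
    eps = 1e-6  # floor so a hypothesis missing in one modality is not zeroed out
    scores = {}
    for c in commands:
        ps = speech_post.get(c, eps)
        pg = gesture_post.get(c, eps)
        scores[c] = w_speech * math.log(ps) + w_gesture * math.log(pg)
    # convert the combined scores back to a normalized distribution
    m = max(scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

speech = {"pick up the bottle": 0.55, "put down the bottle": 0.30, "stop": 0.15}
gesture = {"pick up the bottle": 0.70, "stop": 0.20, "come here": 0.10}
fused = fuse_hypotheses(speech, gesture)
print(max(fused, key=fused.get))  # most likely multimodal command
```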

2.
Humans use a combination of gesture and speech to interact with objects, and usually do so more naturally without holding a device or pointer. We present a system that incorporates user body-pose estimation, gesture recognition and speech recognition for interaction in virtual reality environments. We describe a vision-based method for tracking the pose of a user in real time and introduce a technique that provides parameterized gesture recognition. More precisely, we train a support vector classifier to model the boundary of the space of possible gestures, and train Hidden Markov Models (HMMs) on specific gestures. Given a sequence, we can find the start and end of various gestures using the support vector classifier, and find gesture likelihoods and parameters with an HMM. A multimodal recognition process is performed using rank-order fusion to merge speech and vision hypotheses. Finally, we describe the use of our multimodal framework in a virtual world application that allows users to interact using gestures and speech.
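The rank-order fusion step mentioned above can be pictured, under assumptions, as merging two ranked n-best lists by summed ranks; the hypothesis labels below are hypothetical and the paper's actual fusion rule may differ.

```python
# Minimal rank-order fusion sketch: merge two ranked n-best lists by rank sum.
# The hypothesis labels are hypothetical illustrations.

def rank_order_fuse(speech_nbest, vision_nbest, missing_penalty=None):
    """speech_nbest / vision_nbest: hypothesis lists ordered best-first.
    Returns hypotheses ordered by the sum of their ranks (lower is better)."""
    if missing_penalty is None:
        missing_penalty = max(len(speech_nbest), len(vision_nbest)) + 1
    speech_rank = {h: r for r, h in enumerate(speech_nbest)}
    vision_rank = {h: r for r, h in enumerate(vision_nbest)}
    hypotheses = set(speech_rank) | set(vision_rank)

    def combined_rank(h):
        return (speech_rank.get(h, missing_penalty)
                + vision_rank.get(h, missing_penalty))

    return sorted(hypotheses, key=combined_rank)

speech_nbest = ["rotate object", "delete object", "move object"]
vision_nbest = ["move object", "rotate object", "scale object"]
print(rank_order_fuse(speech_nbest, vision_nbest)[0])  # top fused hypothesis
```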

3.
An accurate estimation of sentence units (SUs) in spontaneous speech is important for (1) helping listeners to better understand speech content and (2) supporting other natural language processing tasks that require sentence information. There has been much research on automatic SU detection; however, most previous studies have used only lexical and prosodic cues, not nonverbal cues such as gesture. Gestures play an important role in human conversations, including providing semantic content, expressing emotional status, and regulating conversational structure. Given the close relationship between gestures and speech, gestures may provide additional contributions to automatic SU detection. In this paper, we investigate the use of gesture cues for enhancing SU detection. In particular, we focus on: (1) collecting multimodal data resources involving gestures and SU events in human conversations, (2) analyzing the collected data sets to enrich our knowledge about the co-occurrence of gestures and SUs, and (3) building statistical models for detecting SUs using speech and gestural cues. Our data analyses suggest that some gesture patterns influence a word boundary's probability of being an SU. On the basis of these analyses, a set of novel gestural features is proposed for SU detection. A combination of speech and gestural features was found to provide more accurate SU predictions than speech features alone in discriminative models. The findings in this paper support the view that human conversations are processes involving multimodal cues, and so they are more effectively modeled using information from both verbal and nonverbal channels.
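One way to picture combining speech and gestural features in a discriminative boundary classifier (a generic sketch, not the authors' model or feature set) is a per-word-boundary feature vector fed to a standard classifier; the feature names and toy data below are assumptions.

```python
# Generic sketch of a discriminative SU-boundary classifier that mixes
# prosodic cues with gestural cues. Feature names and the toy data are
# assumptions for illustration, not the paper's actual feature set.
import numpy as np
from sklearn.linear_model import LogisticRegression

# one row per word boundary:
# [pause_duration_s, pitch_reset, gesture_hold_nearby, hand_retraction_nearby]
X = np.array([
    [0.45, 1.0, 1, 1],   # long pause, pitch reset, gesture ends -> likely SU
    [0.02, 0.0, 0, 0],   # fluent continuation                   -> unlikely SU
    [0.30, 1.0, 0, 1],
    [0.05, 0.0, 1, 0],
])
y = np.array([1, 0, 1, 0])  # 1 = sentence-unit boundary

clf = LogisticRegression().fit(X, y)
new_boundary = np.array([[0.40, 1.0, 1, 0]])
print(clf.predict_proba(new_boundary)[0, 1])  # P(SU boundary)
```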

4.
Gesture and speech are co-expressive and complementary channels of a single human language system. While speech carries the major load of symbolic presentation, gesture provides the imagistic content. We investigate the role of oscillatory/cyclical hand motions in ‘carrying’ this image content. We present our work on the extraction of hand-motion oscillation frequencies of gestures that accompany speech. The key challenges are that such motions are characterized by non-stationary oscillations, that multiple frequencies may be simultaneously present, and that the oscillations may extend over only a few cycles. We apply the windowed Fourier transform and the wavelet transform to detect and extract gesticulatory oscillations. We tested these against synthetic signals (stationary and non-stationary) and real data sequences of gesticulatory hand movements in natural discourse. Our results show that both filters functioned well for the synthetic signals; for the real data, the wavelet bandpass filter bank is better at detecting and extracting hand gesture oscillations. We relate the hand-motion oscillatory gestures detected by wavelet analysis to speech in natural conversation and apply this to multimodal language analysis. We demonstrate the ability of our algorithm to extract gesticulatory oscillations and show how oscillatory gestures reveal portions of the multimodal discourse structure.
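A rough sketch of detecting oscillation frequencies in a hand trajectory with a windowed Fourier transform (one of the two approaches compared above, shown here generically) might look like the following; the synthetic trajectory, frame rate, and window settings are assumptions.

```python
# Sketch: track the dominant oscillation frequency of a 1-D hand trajectory
# with a windowed Fourier transform (spectrogram). The synthetic signal and
# window parameters are illustrative assumptions.
import numpy as np
from scipy.signal import spectrogram

fs = 30.0                          # assumed video frame rate (Hz)
t = np.arange(0, 6, 1 / fs)
# non-stationary oscillation: ~2 Hz for 3 s, then ~4 Hz, plus noise
x = np.where(t < 3, np.sin(2 * np.pi * 2 * t), np.sin(2 * np.pi * 4 * t))
x = x + 0.1 * np.random.randn(t.size)

f, seg_times, Sxx = spectrogram(x, fs=fs, nperseg=32, noverlap=24)
dominant = f[np.argmax(Sxx, axis=0)]   # dominant frequency per window
for time, freq in zip(seg_times, dominant):
    print(f"t={time:4.2f}s  dominant oscillation ~{freq:.1f} Hz")
```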

5.
Gesture plays an important role in recognizing lecture activities in video content analysis. In this paper, we propose a real-time gesture detection algorithm that integrates cues from visual, speech and electronic-slide modalities. In contrast to conventional “complete gesture” recognition, we emphasize detection through prediction from “incomplete gestures”. Specifically, intentional gestures are predicted by a modified hidden Markov model (HMM) that can recognize incomplete gestures before the whole gesture path has been observed. The multimodal correspondence between speech and gesture is exploited to increase the accuracy and responsiveness of gesture detection. In lecture presentation, this algorithm enables on-the-fly editing of lecture slides by simulating appropriate camera motion to highlight the intention and flow of lecturing. We develop a real-time application, a simulated smartboard, and demonstrate the feasibility of our prediction algorithm using hand gestures and a laser pen with a simple setup that involves no expensive hardware.
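To illustrate the general idea of scoring a partially observed gesture against several gesture models before the full path is seen (a generic forward-probability sketch, not the paper's modified HMM), one could compare partial-sequence likelihoods under each model; all model parameters below are made up.

```python
# Generic sketch: score an *incomplete* observation sequence against several
# discrete-output HMMs with the scaled forward algorithm, and predict the
# gesture whose partial likelihood is highest. All parameters are illustrative.
import numpy as np

def forward_loglik(obs, pi, A, B):
    """Scaled forward algorithm: log-likelihood of a (possibly partial) sequence."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    loglik = np.log(c)
    alpha /= c
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c
    return loglik

# two toy 2-state gesture models over 3 quantized motion directions
models = {
    "point_left":  (np.array([0.9, 0.1]),
                    np.array([[0.8, 0.2], [0.1, 0.9]]),
                    np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])),
    "point_right": (np.array([0.9, 0.1]),
                    np.array([[0.8, 0.2], [0.1, 0.9]]),
                    np.array([[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]])),
}

partial_obs = [0, 0, 2]   # only the first frames of the gesture observed
scores = {name: forward_loglik(partial_obs, *params) for name, params in models.items()}
print(max(scores, key=scores.get))  # predicted gesture before completion
```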

6.
We propose a new two-stage framework for the joint analysis of head gesture and speech prosody patterns of a speaker, towards automatic realistic synthesis of head gestures from speech prosody. In the first-stage analysis, we perform Hidden Markov Model (HMM) based unsupervised temporal segmentation of head gesture and speech prosody features separately to determine elementary head gesture and speech prosody patterns for a particular speaker. In the second stage, joint analysis of the correlations between these elementary head gesture and prosody patterns is performed using Multi-Stream HMMs to determine an audio-visual mapping model. The resulting audio-visual mapping model is then employed to synthesize natural head gestures from arbitrary input test speech, given a head model for the speaker. In the synthesis stage, the audio-visual mapping model is used to predict a sequence of gesture patterns from the prosody pattern sequence computed for the input test speech. The Euler angles associated with each gesture pattern are then applied to animate the speaker head model. Objective and subjective evaluations indicate that the proposed synthesis-by-analysis scheme provides natural-looking head gestures for the speaker with any input test speech, as well as in "prosody transplant" and "gesture transplant" scenarios.
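The final step, applying per-pattern Euler angles to animate the head model, can be illustrated with a small generic rotation sketch; the ZYX convention and the angle values here are assumptions, not taken from the paper.

```python
# Sketch: turn a sequence of (yaw, pitch, roll) Euler angles, one per predicted
# gesture pattern, into rotation matrices for driving a head model.
# The ZYX convention and the angle values are illustrative assumptions.
import numpy as np

def euler_zyx_to_matrix(yaw, pitch, roll):
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

# hypothetical per-frame angles (radians) for one predicted gesture pattern
for yaw, pitch, roll in [(0.0, 0.05, 0.0), (0.1, 0.08, -0.02), (0.15, 0.05, 0.0)]:
    R = euler_zyx_to_matrix(yaw, pitch, roll)
    nose_direction = R @ np.array([0.0, 0.0, 1.0])  # rotate the head's forward axis
    print(np.round(nose_direction, 3))
```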

7.
To address the instability of dynamic hand gesture tracking and the problem of recognition efficiency, a dynamic gesture tracking and recognition framework based on TLD and DTW is proposed. First, a static hand gesture classifier based on Haar features locates the gesture region; the TLD tracking algorithm then tracks this region to obtain the gesture trajectory; finally, trajectory features are extracted and recognized with an improved DTW algorithm. Experiments show that the framework can track the gesture region stably over long periods and significantly improves recognition efficiency while maintaining the recognition rate.
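A minimal dynamic time warping (DTW) sketch of the kind the trajectory-matching step relies on is shown below; this is plain DTW, not the improved variant in the paper, and the toy trajectories are assumptions.

```python
# Minimal DTW distance between two 2-D gesture trajectories (plain DTW,
# not the improved variant used in the paper). Toy data is illustrative.
import numpy as np

def dtw_distance(a, b):
    """a, b: arrays of shape (len, 2) holding (x, y) trajectory points."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

template = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)   # "swipe up-right"
observed = np.array([[0, 0], [0.9, 1.1], [1.8, 2.2], [3.1, 2.9]], dtype=float)
print(dtw_distance(observed, template))  # small distance -> same gesture class
```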

8.
We present a statistical approach to developing multimodal recognition systems and, in particular, to integrating the posterior probabilities of the parallel input signals involved in the multimodal system. We first identify the primary factors that influence multimodal recognition performance by evaluating the multimodal recognition probabilities. We then develop two techniques, an estimate approach and a learning approach, which are designed to optimize recognition accuracy during the multimodal integration process. We evaluate these methods using Quickset, a speech/gesture multimodal system, and report evaluation results based on an empirical corpus collected with Quickset. From an architectural perspective, the integration technique presented offers enhanced robustness. It is also premised on more realistic assumptions than previous multimodal systems using semantic fusion. From a methodological standpoint, the evaluation techniques that we describe provide a valuable tool for evaluating multimodal systems.
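Very loosely, an "estimate approach" to integration could weight each modality's posterior by an estimate of that recognizer's reliability measured on held-out data; the sketch below is a generic illustration, and the accuracies, commands, and scores are assumptions, not Quickset's actual figures or integration rule.

```python
# Loose sketch of reliability-weighted posterior integration. Accuracies,
# commands, and scores are illustrative assumptions.

def integrate(speech_post, gesture_post, speech_acc=0.90, gesture_acc=0.75):
    """Weight per-command posteriors by each recognizer's estimated accuracy."""
    w_s = speech_acc / (speech_acc + gesture_acc)
    w_g = gesture_acc / (speech_acc + gesture_acc)
    commands = set(speech_post) | set(gesture_post)
    combined = {c: w_s * speech_post.get(c, 0.0) + w_g * gesture_post.get(c, 0.0)
                for c in commands}
    total = sum(combined.values())
    return {c: v / total for c, v in combined.items()}

speech = {"create unit here": 0.6, "delete unit": 0.4}
gesture = {"create unit here": 0.3, "move unit": 0.7}
print(integrate(speech, gesture))   # normalized joint command scores
```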

9.
Traditional speech recognition systems are data-driven and rely on a language model to choose the optimal decoding path, so in some scenarios the decoded output is phonetically correct but uses the wrong characters. To address this, a prosody-assisted end-to-end speech recognition method is proposed that uses prosodic information in the speech signal to boost the language-model probability of the correct Chinese character combinations. Building on an attention-based encoder-decoder speech recognition framework, the method first extracts prosodic features such as inter-syllable pause duration and articulation energy from the distribution of the attention coefficients, and then combines these prosodic features with the decoder, which significantly improves recognition accuracy for homophones, near-homophones, and semantically ambiguous utterances. Experimental results show that, on 1,000-hour and 10,000-hour speech recognition tasks, the method achieves relative accuracy improvements of 5.2% and 5.0% respectively over the end-to-end baseline, further improving the intelligibility of the recognition results.

10.
Many modes of communication are used between humans and computers, and gesture is considered one of the most natural in a virtual reality system. Because of its intuitiveness and its capability to help the hearing-impaired or speech-impaired, we developed a gesture recognition system. Considering the world-wide use of ASL (American Sign Language), this system focuses on recognizing a continuous flow of ASL alphabet signs to spell a word, followed by speech synthesis, and adopts a simple and efficient windowed template matching strategy to achieve real-time, continuous recognition. In addition to the abduction and flex information in a gesture, we introduce the concept of a contact-point into our system to resolve the intrinsic ambiguities of some ASL gestures. Five tact switches, serving as contact-points and sensed by an analogue-to-digital board, are sewn onto a glove cover to enhance the functions of a traditional data glove.
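The windowed template matching strategy can be pictured roughly as sliding a window over the incoming glove readings and comparing the averaged pose against stored per-letter templates; in the generic sketch below, the window length, distance threshold, sensor layout, and templates are all assumptions.

```python
# Generic windowed template matching sketch for a stream of glove readings.
# Window size, threshold, feature layout, and templates are illustrative.
import numpy as np

templates = {                      # assumed mean sensor vector per letter
    "A": np.array([0.9, 0.9, 0.9, 0.9, 0.1]),
    "B": np.array([0.1, 0.1, 0.1, 0.1, 0.1]),
    "L": np.array([0.1, 0.9, 0.9, 0.9, 0.1]),
}

def match_window(window, threshold=0.15):
    """Average the frames in the window and return the closest template letter,
    or None if nothing is close enough (hand is transitioning between signs)."""
    mean_pose = window.mean(axis=0)
    letter, dist = min(((k, np.linalg.norm(mean_pose - v)) for k, v in templates.items()),
                       key=lambda kv: kv[1])
    return letter if dist < threshold else None

stream = np.vstack([np.tile(templates["B"], (8, 1)) + 0.02 * np.random.randn(8, 5),
                    np.tile(templates["A"], (8, 1)) + 0.02 * np.random.randn(8, 5)])
win = 5
spelled = []
for start in range(0, len(stream) - win + 1):
    letter = match_window(stream[start:start + win])
    if letter and (not spelled or spelled[-1] != letter):
        spelled.append(letter)
print("".join(spelled))  # e.g. "BA"
```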

11.
Within the context of hand gesture recognition, spatiotemporal gesture segmentation is the task of determining, in a video sequence, where the gesturing hand is located and when the gesture starts and ends. Existing gesture recognition methods typically assume either known spatial segmentation or known temporal segmentation, or both. This paper introduces a unified framework for simultaneously performing spatial segmentation, temporal segmentation, and recognition. In the proposed framework, information flows both bottom-up and top-down. A gesture can be recognized even when the hand location is highly ambiguous and when information about when the gesture begins and ends is unavailable. Thus, the method can be applied to continuous image streams where gestures are performed in front of moving, cluttered backgrounds. The proposed method consists of three novel contributions: a spatiotemporal matching algorithm that can accommodate multiple candidate hand detections in every frame, a classifier-based pruning framework that enables accurate and early rejection of poor matches to gesture models, and a subgesture reasoning algorithm that learns which gesture models can falsely match parts of other longer gestures. The performance of the approach is evaluated on two challenging applications: recognition of hand-signed digits gestured by users wearing short-sleeved shirts, in front of a cluttered background, and retrieval of occurrences of signs of interest in a video database containing continuous, unsegmented signing in American Sign Language (ASL).

12.
In this paper, we propose a new method for recognizing hand gestures in a continuous video stream using a dynamic Bayesian network (DBN) model. The DBN-based inference is preceded by steps of skin extraction and modelling, and motion tracking. We then develop gesture models for one- or two-hand gestures, which are used to define a cyclic gesture network for modelling a continuous gesture stream. We have also developed a DP-based real-time decoding algorithm for continuous gesture recognition. In our experiments with 10 isolated gestures, we obtained a recognition rate upwards of 99.59% with cross-validation. For recognizing a continuous stream of gestures, the method recorded 84% with a precision of 80.77% for the spotted gestures. The proposed DBN-based hand gesture model and the design of the gesture network model are believed to have strong potential for successful application to related problems such as sign language recognition, although that task is somewhat more complicated because it also requires analysis of hand shapes.
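The idea of DP-based decoding over a cyclic gesture network can be illustrated with a small Viterbi sketch in which a "rest" state links simple gesture models into a loop, so a continuous stream is segmented into rest and gesture spans. This is a generic illustration, not the paper's DBN; all states, transitions, and observations are assumptions.

```python
# Generic Viterbi sketch over a tiny cyclic network: a "rest" state connects
# two one-state gesture models, so a continuous stream can be segmented into
# rest / wave / point spans. All probabilities are illustrative assumptions.
import numpy as np

states = ["rest", "wave", "point"]
logA = np.log(np.array([[0.80, 0.10, 0.10],      # rest  -> rest/wave/point
                        [0.20, 0.80, 0.00001],   # wave  -> back to rest closes the cycle
                        [0.20, 0.00001, 0.80]]))
# observation symbols: 0 = still hand, 1 = lateral motion, 2 = forward motion
logB = np.log(np.array([[0.90, 0.05, 0.05],
                        [0.10, 0.85, 0.05],
                        [0.10, 0.05, 0.85]]))
logpi = np.log(np.array([0.9, 0.05, 0.05]))

obs = [0, 0, 1, 1, 1, 0, 2, 2, 0]
delta = logpi + logB[:, obs[0]]
back = []
for o in obs[1:]:
    trans = delta[:, None] + logA          # trans[i, j]: best score ending in j via i
    back.append(trans.argmax(axis=0))
    delta = trans.max(axis=0) + logB[:, o]
path = [int(delta.argmax())]
for bp in reversed(back):                  # trace the best state sequence backwards
    path.append(int(bp[path[-1]]))
print([states[s] for s in reversed(path)])
# expected: rest rest wave wave wave rest point point rest
```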

13.
Automatic speech-text alignment is widely used in speech recognition and synthesis, content production, and other fields; its main goal is to align speech with the corresponding reference text at the sentence, word, and phoneme levels and to obtain the time-alignment information between them. Most state-of-the-art alignment methods are based on speech recognition. On the one hand, their accuracy is limited by recognition quality: alignment precision drops markedly when the character error rate is high, so recognition errors have a large influence on alignment accuracy. On the other hand, such methods cannot effectively align long recordings against text that does not fully match. This paper proposes a speech-text alignment method based on anchors and prosodic information. Segment labelling weighted by boundary anchors divides the corpus into aligned and unaligned segments; for the unaligned segments, a dual-threshold endpoint detection method extracts prosodic information and detects sentence boundaries, reducing the dependence of recognition-based alignment on recognition quality. Experimental results show that, compared with a state-of-the-art recognition-based alignment method, the proposed method improves alignment accuracy by more than 45% even when the character error rate is 0.52, and by 3% when the audio-text mismatch level is 0.5.
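The dual-threshold endpoint detection step can be sketched roughly as follows: a high energy threshold opens a speech segment and a lower one extends it. This is a generic energy-only sketch; practical detectors usually also use the zero-crossing rate, and the thresholds and synthetic frame energies here are assumptions.

```python
# Rough dual-threshold endpoint detection sketch on frame energies:
# a segment opens when energy exceeds a high threshold and is extended
# backwards/forwards while it stays above a lower one. Thresholds and the
# synthetic signal are illustrative assumptions.
import numpy as np

def dual_threshold_endpoints(energy, high=0.5, low=0.2):
    """Return (start, end) frame-index pairs of detected speech segments."""
    segments, in_speech, start = [], False, 0
    for i, e in enumerate(energy):
        if not in_speech and e >= high:
            start = i
            while start > 0 and energy[start - 1] >= low:   # extend backwards
                start -= 1
            in_speech = True
        elif in_speech and e < low:
            segments.append((start, i))
            in_speech = False
    if in_speech:
        segments.append((start, len(energy)))
    return segments

frame_energy = np.array([0.05, 0.1, 0.3, 0.7, 0.9, 0.8, 0.3, 0.1, 0.05,
                         0.25, 0.6, 0.7, 0.25, 0.1])
print(dual_threshold_endpoints(frame_energy))   # [(2, 7), (9, 13)]
```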

14.
15.
A precise identification of prosodic phenomena, and the construction of tools able to properly manage such phenomena, are essential steps towards disambiguating the meaning of certain utterances. In particular, they are useful for a wide variety of tasks: automatic recognition of spontaneous speech, automatic enhancement of speech-generation systems, resolving ambiguities in natural language interpretation, the construction of large annotated language resources such as prosodically tagged speech corpora, and teaching languages to foreign students using Computer Aided Language Learning (CALL) systems. This paper presents a study on the automatic detection of prosodic prominence in continuous speech, with particular reference to American English but with good prospects of application to other languages. Prosodic prominence involves two different prosodic features: pitch accent and stress accent. Pitch accent is acoustically connected with fundamental frequency (F0) movements and overall syllable energy, whereas stress exhibits a strong correlation with syllable-nucleus duration and mid-to-high-frequency emphasis. This paper shows that careful measurement of these acoustic parameters, as well as identification of their connection to prosodic parameters, makes it possible to build an automatic system capable of identifying prominent syllables in utterances with performance comparable to the inter-human agreement reported in the literature. Two different prominence detectors were studied and developed: the first uses a training corpus to set thresholds appropriately, while the second uses a purely unsupervised method. In both cases, it is worth stressing that only acoustic parameters derived directly from speech waveforms are exploited.
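A toy version of a threshold-based prominence detector over syllable-level acoustic measurements could look like the sketch below; the paper's actual parameters, normalization, and thresholds are not reproduced, and the feature set, z-score threshold, voting rule, and data are assumptions.

```python
# Toy prominence detector: z-score each syllable's acoustic measurements over
# the utterance and mark a syllable prominent if enough cues exceed a threshold.
# Feature set, threshold, voting rule, and data are illustrative assumptions.
import numpy as np

# columns: F0 movement (semitones), nucleus duration (ms), mid/high-band energy (dB)
syllables = ["to", "MOR", "row", "MOR", "ning"]
feats = np.array([
    [1.0,  80, 55],
    [4.5, 160, 63],
    [1.5,  90, 56],
    [4.0, 150, 62],
    [0.8, 110, 54],
], dtype=float)

z = (feats - feats.mean(axis=0)) / feats.std(axis=0)
votes = (z > 0.5).sum(axis=1)           # how many cues are clearly above average
prominent = votes >= 2
for syl, p in zip(syllables, prominent):
    print(f"{syl:5s} {'PROMINENT' if p else ''}")
```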

16.
In this paper, global and local prosodic features extracted from sentences, words, and syllables are proposed for speech emotion or affect recognition. Duration, pitch, and energy values are used to represent the prosodic information for recognizing emotions from speech. Global prosodic features represent gross statistics such as the mean, minimum, maximum, standard deviation, and slope of the prosodic contours, whereas local prosodic features represent the temporal dynamics of the prosody. Global and local prosodic features are analyzed separately and in combination, at different levels, for the recognition of emotions. We have also explored words and syllables at different positions (initial, middle, and final) separately, to analyze their contribution to emotion recognition. All the studies are carried out using a simulated Telugu emotion speech corpus (IITKGP-SESC), and the results are compared with those on the internationally known Berlin emotion speech corpus (Emo-DB). Support vector machines are used to develop the emotion recognition models. The results indicate that recognition performance using local prosodic features is better than that using global prosodic features. Words in the final position of sentences and syllables in the final position of words carry more emotion-discriminative information than words and syllables in other positions.
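The global prosodic statistics listed above (mean, minimum, maximum, standard deviation, and slope of a contour) are straightforward to compute; below is a small sketch over a hypothetical F0 contour, not tied to the IITKGP-SESC or Emo-DB corpora.

```python
# Sketch: global prosodic statistics of a pitch contour (mean, min, max,
# standard deviation, and slope of a least-squares line). The contour values
# are a made-up example, not data from the corpora used in the paper.
import numpy as np

def global_prosodic_features(f0_hz, frame_shift_s=0.01):
    """f0_hz: voiced-frame F0 values for one unit (sentence, word, or syllable)."""
    f0 = np.asarray(f0_hz, dtype=float)
    t = np.arange(f0.size) * frame_shift_s
    slope = np.polyfit(t, f0, 1)[0]          # Hz per second
    return {"mean": f0.mean(), "min": f0.min(), "max": f0.max(),
            "std": f0.std(), "slope": slope}

contour = [182, 190, 205, 214, 220, 216, 208, 196, 185]   # hypothetical F0 (Hz)
print(global_prosodic_features(contour))
```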

17.
Machine learning is a technique for analyzing data that aids the construction of mathematical models. Because of the growth of the Internet of Things (IoT) and wearable sensor devices, gesture interfaces are becoming a more natural and expedient human-machine interaction method. This type of artificial intelligence, which requires minimal or no direct human intervention in decision-making, is predicated on the ability of intelligent systems to self-train and detect patterns. The rise of touch-free applications and the size of the deaf population have increased the significance of hand gesture recognition. Potential applications of hand gesture recognition research span from online gaming to surgical robotics. The location of the hands, the alignment of the fingers, and the hand-to-body posture are the fundamental components of hierarchical emotions in gestures. Linguistic gestures may be difficult to distinguish from nonsensical motions in the field of gesture recognition, and it may be difficult to overcome the segmentation uncertainty caused by accidental hand motions or trembling. When a user performs the same dynamic gesture, the hand shapes and speeds vary across users, and often even for the same user. A machine-learning-based Gesture Recognition Framework (ML-GRF) for recognizing the beginning and end of a gesture sequence in a continuous stream of data is suggested to solve the problem of distinguishing meaningful dynamic gestures from scattered generation. We recommend a similarity-matching-based gesture classification approach to reduce the overall computing cost associated with identifying actions, and we show how an efficient feature extraction method can be used to reduce the thousands of data points describing a single gesture to four-binary-digit gesture codes. The findings from the simulation support the reported accuracy, precision, gesture recognition, sensitivity, and efficiency rates: ML-GRF achieved an accuracy rate of 98.97%, a precision rate of 97.65%, a gesture recognition rate of 98.04%, a sensitivity rate of 96.99%, and an efficiency rate of 95.12%.
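The idea of reducing a gesture to a four-binary-digit code and classifying it by similarity matching can be illustrated with a small Hamming-distance sketch; the code assignments and gesture labels are hypothetical, not taken from the paper.

```python
# Sketch: classify a 4-bit gesture code by nearest Hamming distance against a
# small codebook. The particular codes and gesture labels are hypothetical.
codebook = {
    "swipe_left":  "0110",
    "swipe_right": "1001",
    "zoom_in":     "1111",
}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def classify(code):
    label, dist = min(((g, hamming(code, c)) for g, c in codebook.items()),
                      key=lambda kv: kv[1])
    return label, dist

print(classify("0010"))   # ('swipe_left', 1): one bit away from "0110"
```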

18.
Human hand recognition plays an important role in a wide range of applications, from sign language translation, gesture recognition, augmented reality, surveillance, and medical image processing to various Human Computer Interaction (HCI) domains. The human hand is a complex articulated object consisting of many connected parts and joints, so for applications that involve HCI there are many challenges in establishing a system with high detection and recognition accuracy for hand postures and/or gestures. A hand posture is defined as a static hand configuration without any movement, whereas a hand gesture is a sequence of hand postures connected by continuous motion. Over the past decades, many approaches have been presented for hand posture and/or gesture recognition. In this paper, we provide a survey of approaches based on Hidden Markov Models (HMM) for hand posture and gesture recognition in HCI applications.

19.
This paper presents an experimental study of an agent system with multimodal interfaces for a smart office environment. The agent system is based upon multimodal interfaces comprising recognition modules for both speech and pen-mouse gestures, and identification modules for both face and fingerprint. As essential modules, speech recognition and synthesis were used for virtual interaction between user and system. In this study, a real-time speech recognizer based on a Hidden Markov Network (HM-Net) was incorporated into the proposed system. In addition, identification techniques based on both face and fingerprint were adopted to provide a specific user with a user-customized, secure interaction in an office environment. In evaluation, the results showed that the proposed system was easy to use and would prove useful in a smart office environment, even though the performance of the speech recognizer was not satisfactory, mainly due to noisy environments.

20.
This paper presents the visual recognition of static gestures (SGs) and dynamic gestures (DGs). Gesture is one of the most natural interface tools for human–computer interaction (HCI) as well as for communication between human beings. To implement a human-like interface, gestures should be recognized using only visual information, much as the human visual mechanism does, and SGs and DGs should be processed concurrently. This paper aims at recognizing hand gestures from visual images on a 2D image plane, without any external devices. Gestures are spotted by a task-specific state transition based on natural human articulation. SGs are recognized using image moments of the hand posture, while DGs are recognized by analyzing their moving trajectories with hidden Markov models (HMMs). We have applied our gesture recognition approach to gesture-driven editing systems operating in real time.
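For the static-gesture branch, image moments of the hand silhouette are the classifying feature; a tiny sketch of computing normalized central moments of a binary hand mask with NumPy follows. The mask and the particular choice of moments are illustrative assumptions, not the paper's configuration.

```python
# Sketch: normalized central image moments of a binary hand mask, the kind of
# shape descriptor usable for static-gesture classification. The tiny mask
# below is an illustrative stand-in for a segmented hand image.
import numpy as np

def central_moment(mask, p, q):
    ys, xs = np.nonzero(mask)
    m00 = mask.sum()
    x_bar, y_bar = xs.mean(), ys.mean()
    mu = (((xs - x_bar) ** p) * ((ys - y_bar) ** q)).sum()
    # scale-normalized central moment (standard normalization for p + q >= 2)
    return mu / m00 ** (1 + (p + q) / 2)

mask = np.zeros((8, 8), dtype=int)
mask[2:6, 3:5] = 1          # a small upright blob standing in for a hand region
features = [central_moment(mask, p, q) for p, q in [(2, 0), (0, 2), (1, 1)]]
print(np.round(features, 4))  # elongation along y shows up as eta02 > eta20
```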
