Similar Documents
20 similar documents found.
1.
Advances in computer processing power and emerging algorithms are allowing new ways of envisioning human-computer interaction. Although the benefit of audio-visual fusion for affect recognition is expected from both psychological and engineering perspectives, most existing approaches to automatic human affect analysis are unimodal: the information processed by the computer system is limited to either face images or speech signals. This paper focuses on the development of a computing algorithm that uses both audio and visual sensors to detect and track a user's affective state to aid computer decision making. Using our multistream fused hidden Markov model (MFHMM), we analyzed coupled audio and visual streams to detect four cognitive states (interest, boredom, frustration and puzzlement) and seven prototypical emotions (neutral, happiness, sadness, anger, disgust, fear and surprise). The MFHMM builds an optimal connection among multiple streams according to the maximum entropy principle and the maximum mutual information criterion. Person-independent experimental results from 20 subjects in 660 sequences show that the MFHMM approach outperforms face-only HMM, pitch-only HMM, energy-only HMM, and independent HMM fusion under both clean and varying audio channel noise conditions.
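The independent-HMM fusion baseline mentioned in this abstract can be pictured as per-class, per-stream HMMs whose log-likelihoods are combined with stream weights. The sketch below illustrates only that baseline, not the MFHMM itself; the feature shapes, class labels, state counts, and stream weights are assumptions.

```python
# Hedged sketch of the independent-HMM fusion baseline (not the MFHMM itself).
# Assumes per-frame audio/visual feature sequences have already been extracted.
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

STATES = {"audio": 3, "video": 3}          # assumed model sizes
WEIGHTS = {"audio": 0.6, "video": 0.4}     # assumed stream reliabilities

def train_stream_models(train_data, n_states):
    """train_data: {class_label: [T_i x D feature sequences]} for one stream."""
    models = {}
    for label, seqs in train_data.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def fuse_and_classify(stream_models, test_seqs):
    """stream_models: {"audio": {label: HMM}, "video": {label: HMM}};
       test_seqs: {"audio": T_a x D_a array, "video": T_v x D_v array}."""
    labels = list(next(iter(stream_models.values())).keys())
    scores = {}
    for label in labels:
        # Weighted sum of per-stream log-likelihoods for this candidate class.
        scores[label] = sum(
            WEIGHTS[stream] * stream_models[stream][label].score(test_seqs[stream])
            for stream in stream_models
        )
    return max(scores, key=scores.get)
```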

2.
The ability of a computer to detect and appropriately respond to changes in a user's affective state has significant implications for human-computer interaction (HCI). In this paper, we present our efforts toward audio-visual affect recognition on 11 affective states customized for HCI applications (four cognitive/motivational and seven basic affective states) of 20 nonactor subjects. A smoothing method is proposed to reduce the detrimental influence of speech on facial expression recognition. The feature selection analysis shows that subjects tend to use brow movement in the face, and pitch and energy in prosody, to express their affects while speaking. For person-dependent recognition, we apply a voting method to combine the frame-based classification results from both audio and visual channels. The result shows a 7.5% improvement over the best unimodal performance. For the person-independent test, we apply a multistream HMM to combine the information from multiple component streams. This test shows a 6.1% improvement over the best component performance.
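The frame-level voting fusion used for the person-dependent experiments can be sketched very simply; the per-frame label arrays and the optional channel weights below are assumptions made for illustration, not details taken from the paper.

```python
# Hedged sketch: fuse per-frame class decisions from audio and visual channels by voting.
from collections import Counter

def vote_fusion(audio_frame_labels, video_frame_labels, audio_weight=1.0, video_weight=1.0):
    """Each argument is a sequence of per-frame predicted class labels for one utterance."""
    votes = Counter()
    for lab in audio_frame_labels:
        votes[lab] += audio_weight
    for lab in video_frame_labels:
        votes[lab] += video_weight
    return votes.most_common(1)[0][0]

# Example: audio frames mostly say "frustration", video frames mostly say "interest".
print(vote_fusion(["frustration"] * 30 + ["interest"] * 10,
                  ["interest"] * 25 + ["frustration"] * 15))
```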

3.
News segmentation based on text and audio-visual multimodal information
This paper proposes an automatic TV news segmentation scheme that fuses text, audio, and video multimodal features. Taking the characteristics of each medium into account, the scheme first pre-segments the text using a vector-space model and a GMM, pre-segments the speech using spectrograms and an HMM, and pre-segments the video using improved histograms and an SVM classifier. After temporal synchronization, the pre-segmented data are fused by an ANN under a composite strategy, yielding video segments with coherent semantic content. Experimental results demonstrate the effectiveness of the method: the resulting video segments carry fairly complete semantic information, and the drawback of over-fragmented segmentation is avoided.
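The ANN-based fusion step could be approximated, very roughly, by feeding the time-aligned boundary scores of the three pre-segmenters into a small classifier. The feature layout, the toy data, and the use of scikit-learn's MLPClassifier are assumptions for illustration only.

```python
# Hedged sketch: fuse text/audio/video pre-segmentation boundary scores with a small ANN.
import numpy as np
from sklearn.neural_network import MLPClassifier

# X: one row per candidate boundary time, columns = [text_score, audio_score, video_score].
# y: 1 if a true news-story boundary, else 0 (assumed labeling scheme, toy values).
X_train = np.array([[0.9, 0.8, 0.7], [0.1, 0.2, 0.3], [0.8, 0.4, 0.9], [0.2, 0.1, 0.2]])
y_train = np.array([1, 0, 1, 0])

fusion_net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
fusion_net.fit(X_train, y_train)

candidate = np.array([[0.85, 0.6, 0.75]])
print("boundary" if fusion_net.predict(candidate)[0] == 1 else "no boundary")
```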

4.
Audio/visual mapping with cross-modal hidden Markov models
The audio/visual mapping problem of speech-driven facial animation has intrigued researchers for years. Recent research efforts have demonstrated that hidden Markov model (HMM) techniques, which have been applied successfully to the problem of speech recognition, could achieve a similar level of success in audio/visual mapping problems. A number of HMM-based methods have been proposed and shown to be effective by the respective designers, but it is yet unclear how these techniques compare to each other on a common test bed. In this paper, we quantitatively compare three recently proposed cross-modal HMM methods, namely the remapping HMM (R-HMM), the least-mean-squared HMM (LMS-HMM), and HMM inversion (HMMI). The objective of our comparison is not only to highlight the merits and demerits of different mapping designs, but also to study the optimality of the acoustic representation and HMM structure for the purpose of speech-driven facial animation. This paper presents a brief overview of these models, followed by an analysis of their mapping capabilities on a synthetic dataset. An empirical comparison on an experimental audio-visual dataset consisting of 75 TIMIT sentences is finally presented. Our results show that HMMI provides the best performance, both on synthetic and experimental audio-visual data.

5.
Objective: Vision and hearing in the same video are two co-occurring modalities that complement each other and happen simultaneously, which naturally yields a self-supervised learning signal. With contrastive learning achieving strong results in the visual domain, applying this self-supervised representation learning paradigm to the audio-visual multimodal domain has attracted great interest. This paper focuses on constructing an effective audio-visual negative-sample space to improve the audio-visual feature fusion ability of contrastive learning. Method: An audio-visual adversarial contrastive learning method for multimodal self-supervised feature fusion is proposed: 1) visual and auditory adversarial negative-sample sets are introduced to construct the audio-visual negative-sample space; 2) adversarial contrastive learning is performed both across and within modalities, so that the visual and auditory adversarial negatives in this space continually track hard-to-distinguish audio-visual samples, effectively promoting audio-visual self-supervised feature fusion. On this basis, the audio-visual adversarial contrastive learning framework is further simplified. Results: The method is trained on a subset of the Kinetics-400 dataset to obtain audio-visual features, which are then used to drive action recognition and audio classification tasks with good results. Specifically, on the action recognition datasets UCF-101 and HMDB-51 (human motion database), compared with Cross-AVID (cross-audio visual instance discrimination...
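As a very reduced illustration of audio-visual contrastive learning, a plain cross-modal InfoNCE objective in PyTorch could look like the following. This deliberately omits the adversarial negative-sample sets that are this paper's contribution; the embedding dimension and temperature are assumptions.

```python
# Hedged sketch: plain cross-modal InfoNCE loss between audio and visual embeddings.
# The adversarial negative-sample mechanism described in the abstract is not shown.
import torch
import torch.nn.functional as F

def av_info_nce(audio_emb, video_emb, temperature=0.07):
    """audio_emb, video_emb: (batch, dim) embeddings of the same clips, index-aligned."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature            # similarity of every audio to every video
    targets = torch.arange(a.size(0), device=a.device)
    # Matching audio/video pairs are positives; other clips in the batch are negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = av_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```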

6.
Computer simulated avatars and humanoid robots have an increasingly prominent place in today's world. Acceptance of these synthetic characters depends on their ability to properly and recognizably convey basic emotion states to a user population. This study presents an analysis of the interaction between emotional audio (human voice) and video (simple animation) cues. The emotional relevance of the channels is analyzed with respect to their effect on human perception and through the study of the extracted audio-visual features that contribute most prominently to human perception. As a result of the unequal level of expressivity across the two channels, the audio was shown to bias the perception of the evaluators. However, even in the presence of a strong audio bias, the video data were shown to affect human perception. The feature sets extracted from emotionally matched audio-visual displays contained both audio and video features while feature sets resulting from emotionally mismatched audio-visual displays contained only audio information. This result indicates that observers integrate natural audio cues and synthetic video cues only when the information expressed is in congruence. It is therefore important to properly design the presentation of audio-visual cues as incorrect design may cause observers to ignore the information conveyed in one of the channels.

7.
In this paper, we formulate the problem of synthesizing facial animation from an input audio sequence as a dynamic audio-visual mapping. We propose that audio-visual mapping should be modeled with an input-output hidden Markov model, or IOHMM. An IOHMM is an HMM for which the output and transition probabilities are conditional on the input sequence. We train IOHMMs using the expectation-maximization (EM) algorithm with a novel architecture to explicitly model the relationship between transition probabilities and the input using neural networks. Given an input sequence, the output sequence is synthesized by maximum likelihood estimation. Experimental results demonstrate that IOHMMs can generate natural and good-quality facial animation sequences from the input audio.
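One way to picture the input-conditioned transition probabilities of an IOHMM is a small network that maps the current input frame to a row-stochastic transition matrix. This is only a structural sketch under assumed dimensions; the EM training procedure described in the abstract is not shown.

```python
# Hedged sketch: input-dependent HMM transition matrix, as used in an IOHMM.
import torch
import torch.nn as nn

class InputConditionedTransitions(nn.Module):
    def __init__(self, input_dim, n_states, hidden=32):
        super().__init__()
        self.n_states = n_states
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_states * n_states),
        )

    def forward(self, x):
        """x: (batch, input_dim) acoustic frame -> (batch, n_states, n_states) transitions."""
        logits = self.net(x).view(-1, self.n_states, self.n_states)
        return torch.softmax(logits, dim=-1)   # each row sums to 1, conditioned on the input

trans = InputConditionedTransitions(input_dim=13, n_states=5)
A = trans(torch.randn(4, 13))                  # 4 frames -> 4 input-specific matrices
```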

8.
Existing sentiment analysis methods do not take full account of the information in short videos, which leads to inappropriate sentiment analysis results. The audio-visual multimodal sentiment analysis (AV-MSA) model is proposed to address this: it performs sentiment analysis of short videos by exploiting the visual features of video frames together with the audio information. The model has two branches, visual and audio. The audio branch uses a convolutional neural network (CNN) architecture to extract emotion-related features from audio spectrograms; the visual branch uses 3D convolutions to capture the temporal correlation of visual features. On top of a ResNet backbone, an attention mechanism is added to highlight emotion-related features and improve the model's sensitivity to informative features. Finally, a cross-voting mechanism is designed to fuse the results of the visual and audio branches and produce the final sentiment prediction. The AV-MSA model is evaluated on the IEMOCAP and Weibo audio-visual (WB-AV) datasets; experimental results show that AV-MSA achieves a considerable improvement in classification accuracy over existing algorithms.
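The two-branch layout described above might be skeletonized as follows. The layer sizes and input shapes are assumptions, and the simple late-fusion averaging stands in for the paper's cross-voting mechanism, which is not specified here.

```python
# Hedged skeleton of a two-branch audio-visual sentiment model (layer sizes assumed).
import torch
import torch.nn as nn

class AVSentimentNet(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        # Audio branch: 2D CNN over a 1-channel spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, n_classes),
        )
        # Visual branch: 3D CNN over stacked RGB frames to keep temporal correlation.
        self.video = nn.Sequential(
            nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, n_classes),
        )

    def forward(self, spectrogram, frames):
        """spectrogram: (B, 1, F, T); frames: (B, 3, T, H, W)."""
        a = self.audio(spectrogram)
        v = self.video(frames)
        # Simple late fusion as a stand-in for the paper's cross-voting mechanism.
        return (a.softmax(dim=1) + v.softmax(dim=1)) / 2

net = AVSentimentNet()
probs = net(torch.randn(2, 1, 64, 100), torch.randn(2, 3, 8, 64, 64))
```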

9.
This paper presents an articulatory modelling approach to convert acoustic speech into realistic mouth animation. We directly model the movements of articulators, such as lips, tongue, and teeth, using a dynamic Bayesian network (DBN)-based audio-visual articulatory model (AVAM). A multiple-stream structure with a shared articulator layer is adopted in the model to synchronously associate the two building blocks of speech, i.e., audio and video. This model not only describes the synchronization between visual articulatory movements and audio speech, but also reflects the linguistic fact that different articulators evolve asynchronously. We also present a Baum-Welch DBN inversion (DBNI) algorithm to generate optimal facial parameters from audio given the trained AVAM under the maximum likelihood (ML) criterion. Extensive objective and subjective evaluations on the JEWEL audio-visual dataset demonstrate that, compared with phonemic HMM approaches, facial parameters estimated by our approach follow the true parameters more accurately, and the synthesized facial animation sequences are so lively that 38% of them are indistinguishable.

10.
Joint scene classification and segmentation based on hidden Markov model
Scene classification and segmentation are fundamental steps for efficiently accessing, retrieving and browsing large amounts of video data. We have developed a scene classification scheme using a hidden Markov model (HMM)-based classifier. By utilizing the temporal behaviors of different scene classes, the HMM classifier can effectively classify presegmented clips into one of the predefined scene classes. In this paper, we describe three approaches for joint classification and segmentation based on HMMs, which search for the most likely class transition path using dynamic programming. All of these approaches utilize audio and visual information simultaneously. The first two approaches search for the optimal scene class transition based on likelihood values computed for short video segments belonging to a particular class, but with different search constraints. The third approach searches for the optimal path in a super HMM formed by concatenating the HMMs for different scene classes.
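The dynamic-programming search over scene-class transitions can be illustrated with a Viterbi-style pass over per-segment class log-likelihoods. The uniform switch penalty below is an assumption standing in for the paper's search constraints.

```python
# Hedged sketch: Viterbi-style search for the most likely scene-class path
# over pre-computed per-segment class log-likelihoods.
import numpy as np

def best_class_path(loglik, switch_penalty=2.0):
    """loglik: (n_segments, n_classes) log-likelihoods; returns one class per segment."""
    n_seg, n_cls = loglik.shape
    score = loglik[0].copy()
    back = np.zeros((n_seg, n_cls), dtype=int)
    for t in range(1, n_seg):
        # Score of arriving in class c from class p: previous score minus a switch penalty.
        trans = score[:, None] - switch_penalty * (1 - np.eye(n_cls))
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + loglik[t]
    path = [int(score.argmax())]
    for t in range(n_seg - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: the first two segments favor class 0, the last two favor class 1.
print(best_class_path(np.array([[0.0, -3.0], [-0.5, -2.0], [-4.0, -0.2], [-3.5, -0.1]])))
```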

11.
In recent years, action recognition based on joint audio-visual learning has attracted growing attention. In both video (the visual modality) and audio (the auditory modality), actions are transient, and it is usually only the information within the time span in which the action occurs that clearly expresses the action class. How to better exploit the action-discriminative information carried by the key frames of the audio-visual modalities is one of the open problems in audio-visual action recognition. To address this problem, a key-frame selection network, KFIA-S, is proposed, which, by means of a fully-connected-layer-based linear temporal...

12.
We present our studies on the application of coupled hidden Markov models (CHMMs) to sports highlights extraction from broadcast video using both audio and video information. First, we generate audio labels using audio classification via Gaussian mixture models, and video labels using quantization of the average motion vector magnitudes. Then, we model sports highlights using discrete-observation CHMMs on audio and video labels classified from a large training set of broadcast sports highlights. Our experimental results on unseen golf and soccer content show that CHMMs outperform hidden Markov models (HMMs) trained on audio-only or video-only observations. Next, we study how the coupling between the two single-modality HMMs improves modelling capability by refining the states of the models. We also show that the number of states optimized in this fashion gives better classification results than other numbers of states. We conclude that CHMMs provide a promising tool for information fusion in the sports domain for audio-visual event detection and analysis.
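Generating the discrete audio and video observation labels described above could be sketched as follows; the GMM component counts, the number of motion bins, and the feature choices are assumptions, and the CHMM itself is not shown.

```python
# Hedged sketch: discrete audio labels from per-class GMMs and video labels from
# quantized average motion-vector magnitudes (the CHMM itself is not shown).
import numpy as np
from sklearn.mixture import GaussianMixture

def train_audio_gmms(train_feats, n_components=4):
    """train_feats: {audio_class: (n_frames, n_features) array of training frames}."""
    return {c: GaussianMixture(n_components=n_components, random_state=0).fit(X)
            for c, X in train_feats.items()}

def audio_labels(gmms, feats):
    """Assign each audio frame to the GMM class with the highest log-likelihood."""
    classes = list(gmms)
    ll = np.column_stack([gmms[c].score_samples(feats) for c in classes])
    return [classes[i] for i in ll.argmax(axis=1)]

def video_labels(avg_motion_magnitudes, n_bins=4):
    """Quantize per-frame average motion magnitude into discrete labels 0..n_bins-1."""
    edges = np.quantile(avg_motion_magnitudes, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(avg_motion_magnitudes, edges)
```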

13.
In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio-visual speech modalities with respect to two major questions. Firstly, at what level should integration occur? Secondly, given a level of integration, how should this integration be implemented? Our work is based around the well-known hidden Markov model (HMM) classifier framework for modeling speech. A novel framework for modeling the mismatch between train and test observation sets is proposed, so as to provide effective classifier combination performance between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speech processing applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than trying to model any bimodal speech dependencies. To this end a strategy is recommended, based on theory and empirical evidence, using a hybrid between the weighted product and weighted sum rules in the presence of varying acoustic noise for the task of text-dependent speaker recognition.
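The weighted sum and weighted product rules discussed above are simple enough to write down directly; the example posteriors and the weight value below are made up for illustration.

```python
# Hedged sketch: weighted sum vs. weighted product combination of two classifiers'
# class posteriors (acoustic and visual), with a single reliability weight alpha.
import numpy as np

def weighted_sum(p_audio, p_video, alpha=0.7):
    fused = alpha * p_audio + (1 - alpha) * p_video
    return fused / fused.sum()

def weighted_product(p_audio, p_video, alpha=0.7):
    fused = (p_audio ** alpha) * (p_video ** (1 - alpha))
    return fused / fused.sum()

p_a = np.array([0.6, 0.3, 0.1])   # acoustic classifier posteriors (assumed)
p_v = np.array([0.2, 0.5, 0.3])   # visual classifier posteriors (assumed)
print(weighted_sum(p_a, p_v), weighted_product(p_a, p_v))
```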

14.
The classic hidden Markov model (HMM) is a statistical signal model that plays an important role in content-based audio retrieval systems. Since audio classification emphasizes type rather than content, a single-state HMM is applied to audio classification, avoiding the inaccurate assumptions involved in assigning initial state probabilities and transition probabilities when a multi-state HMM is initialized. Experimental results show that the single-state HMM approach effectively reduces the misclassification rate and improves the accuracy of audio classification.
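A single-state HMM with Gaussian-mixture emissions degenerates to a per-class GMM, so the classification idea can be sketched as below; the mixture size and feature layout are assumptions.

```python
# Hedged sketch: single-state HMM audio classification, which reduces to scoring each
# clip's frames under one GMM per audio class and picking the best total score.
from sklearn.mixture import GaussianMixture

def train(class_frames, n_mix=8):
    """class_frames: {audio_class: (n_frames, n_features) training frames}."""
    return {c: GaussianMixture(n_components=n_mix, random_state=0).fit(X)
            for c, X in class_frames.items()}

def classify(models, clip_frames):
    """Sum per-frame log-likelihoods over the clip for each class model."""
    scores = {c: m.score_samples(clip_frames).sum() for c, m in models.items()}
    return max(scores, key=scores.get)
```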

15.
This paper investigates a face recognition method based on Gabor wavelets and hidden Markov models (HMMs). A multi-resolution Gabor wavelet transform is first applied to the face image; a grid of nodes is then placed over the image, with each node described by the multi-scale Gabor magnitude features at that location, and independent component analysis is used to decorrelate and reduce the dimensionality of each node. Finally, the resulting feature nodes are used as observation vectors to train the HMM, and the optimized model parameters are used for face recognition. Experiments on the ORL face database show that the method achieves a high recognition rate and is easy to apply in engineering practice.
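Extracting multi-scale Gabor magnitude features at grid nodes and decorrelating them with ICA could be sketched as follows; the grid spacing, filter parameters, and the use of scikit-image/scikit-learn are assumptions, and the HMM training step is omitted.

```python
# Hedged sketch: multi-scale Gabor magnitude features at grid nodes, reduced with ICA.
# The HMM training on the resulting observation vectors is omitted.
import numpy as np
from skimage.filters import gabor
from sklearn.decomposition import FastICA

def node_features(image, frequencies=(0.1, 0.2, 0.3), n_angles=4, step=8):
    """Return one multi-scale Gabor magnitude vector per grid node of a grayscale image."""
    responses = []
    for f in frequencies:
        for k in range(n_angles):
            real, imag = gabor(image, frequency=f, theta=k * np.pi / n_angles)
            responses.append(np.hypot(real, imag))        # Gabor magnitude response
    mags = np.stack(responses, axis=-1)                   # (H, W, n_filters)
    return mags[step // 2::step, step // 2::step].reshape(-1, mags.shape[-1])

def decorrelate(node_feats, n_components=8):
    """ICA-based decorrelation / dimensionality reduction of the node features."""
    return FastICA(n_components=n_components, random_state=0).fit_transform(node_feats)
```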

16.
Moving vehicle detection and classification using multimodal data is a challenging task in data collection, audio-visual alignment, data labeling and feature selection under uncontrolled environments with occlusions, motion blur, varying image resolutions and perspective distortions. In this work, we propose an effective multimodal temporal panorama (MTP) approach for moving vehicle detection and classification using a novel long-range audio-visual sensing system. A new audio-visual vehicle (AVV) dataset is created, which features automatic vehicle detection and audio-visual alignment, accurate vehicle extraction and reconstruction, and efficient data labeling. In particular, vehicles' visual images are reconstructed once detected in order to remove most of the occlusions, motion blur, and variations of perspective view. Multimodal audio-visual features are extracted, including global geometric features (aspect ratios, profiles), local structure features (HOGs), as well as various audio features (MFCCs, etc.). Using radial-based SVMs, the effectiveness of integrating these multimodal features is thoroughly and systematically studied. The concept of the MTP is not necessarily limited to visual, motion and audio modalities; it could also be applicable to other sensing modalities that can obtain data in the temporal domain.
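A reduced sketch of combining the visual and audio features listed above with an RBF-kernel SVM might look like this; the feature parameters and the naive concatenation-based fusion are assumptions.

```python
# Hedged sketch: concatenate HOG (visual) and MFCC (audio) features per vehicle
# instance and classify with an RBF-kernel SVM.
import numpy as np
import librosa
from skimage.feature import hog
from sklearn.svm import SVC

def vehicle_features(gray_image, audio, sr):
    visual = hog(gray_image, orientations=9, pixels_per_cell=(16, 16),
                 cells_per_block=(2, 2))                               # local structure (HOG)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13).mean(axis=1)  # audio features
    aspect_ratio = gray_image.shape[1] / gray_image.shape[0]           # simple global geometry
    return np.concatenate([visual, mfcc, [aspect_ratio]])

# Assumed usage: rows of X are vehicle_features(...) outputs, y are vehicle class labels.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
# clf.fit(X_train, y_train); clf.predict(X_test)
```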

17.
Concept detection stands as an important problem for efficient indexing and retrieval in large video archives. In this work, the KavTan System, which performs high-level semantic classification in one of the largest TV archives of Turkey, is presented. In this system, concept detection is performed using generalized visual and audio concept detection modules that are supported by video text detection, audio keyword spotting and specialized audio-visual semantic detection components. The performance of the presented framework was assessed objectively over a wide range of semantic concepts (5 high-level, 14 visual, 9 audio, 2 supplementary) using a significant amount of precisely labeled ground-truth data. The KavTan System achieves successful high-level concept detection performance in unconstrained TV broadcasts by efficiently utilizing multimodal information that is systematically extracted from both the spatial and temporal extent of multimedia data.

18.
This paper presents a real-time speech-driven talking face system with low computational complexity and smooth visual output. A novel embedded confusable-set scheme is proposed to generate an efficient phoneme-viseme mapping table, constructed by grouping phonemes with the Houtgast similarity approach based on viseme similarity estimated with histogram distance, exploiting the fact that visemes are visually ambiguous. The generated mapping table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time speech-driven talking face system includes: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature extraction for robust acoustic feature vectors; 2) recognition network processing, in which an HMM and an MCSVM are combined as a recognition network for phoneme recognition and viseme classification: the HMM is good at handling sequential inputs, while the MCSVM shows superior classification performance with good generalization properties, especially for limited samples, and the phoneme-viseme mapping table lets the MCSVM decide which viseme class the observation sequence produced by the HMM belongs to; 3) visual processing, which arranges the lip-shape images of the visemes in temporal order and adds realism through dynamic alpha blending with different alpha settings. Experiments show that the speech signal processing, on noisy speech compared with clean speech, yields improvements of 1.1% (16.7% to 15.6%) in PER and 4.8% (30.4% to 35.2%) in WER, respectively. For viseme classification, the error rate is reduced from 19.22% to 9.37%. Finally, we simulated GSM communication between a mobile phone and a PC to rate visual quality and the feeling of speech-driven animation using the mean opinion score. In summary, our method reduces the number of visemes and lip-shape images through confusable sets and enables real-time operation.
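The phoneme-to-viseme lookup and the dynamic alpha blending of lip-shape images could be sketched like this; the tiny mapping table and the blending schedule are invented placeholders, not the confusable sets built by the paper.

```python
# Hedged sketch: phoneme -> viseme lookup plus alpha blending between consecutive
# lip-shape images (the mapping entries and blend schedule are placeholders).
import numpy as np

PHONEME_TO_VISEME = {"p": "bilabial", "b": "bilabial", "f": "labiodental",
                     "aa": "open", "iy": "spread"}     # assumed toy table

def viseme_sequence(phonemes):
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

def blend_frames(img_from, img_to, n_steps=5):
    """Cross-fade between two lip-shape images with a linearly increasing alpha."""
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [((1 - a) * img_from + a * img_to).astype(img_from.dtype) for a in alphas]

lips_a = np.zeros((64, 64), dtype=np.float32)
lips_b = np.ones((64, 64), dtype=np.float32)
frames = blend_frames(lips_a, lips_b)
print(viseme_sequence(["p", "aa", "f"]), len(frames))
```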

19.
20.
Animating expressive faces across languages
This paper describes a morphing-based audio-driven facial animation system. Based on an incoming audio stream, a face image is animated with full lip synchronization and synthesized expressions. A novel scheme is presented for implementing a language-independent audio-driven facial animation system given a speech recognition system for just one language, in our case English. The method presented here can also be used for text-to-audio-visual speech synthesis. Visemes in new expressions are synthesized so that animations with different facial expressions can be generated. Given an incoming audio stream and still pictures of a face representing different visemes, an animation sequence is constructed using optical flow between visemes. The presented techniques give improved lip synchronization and naturalness to the animated video.
