Similar Articles
20 similar articles retrieved.
1.
Automatically recognizing human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This article presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human–computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, we propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. LSTM networks are able to incorporate knowledge about how emotions typically evolve over time so that the inferred emotion estimates are produced under consideration of an optimal amount of context. Extensive evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how acoustic, linguistic, and visual features contribute to the recognition of different affective dimensions as annotated in the SEMAINE database. We apply the same acoustic features as used in the challenge baseline system whereas visual features are computed via a novel facial movement feature extractor. Comparing our results with the recognition scores of all Audiovisual Sub-Challenge participants, we find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
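
As a rough illustration of the kind of model described above, here is a minimal PyTorch sketch of an LSTM that regresses continuous emotion dimensions from a sequence of word-level audio-visual feature vectors. The feature dimension, network size, and the choice of two output dimensions (e.g., arousal and valence) are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch: an LSTM mapping word-level audio-visual feature
# sequences to continuous emotion estimates (e.g., arousal, valence).
# All dimensions are invented for illustration.
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, feat_dim=120, hidden_dim=64, num_targets=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_targets)

    def forward(self, x):
        # x: (batch, num_words, feat_dim); one prediction per word,
        # so each estimate is conditioned on all preceding context
        out, _ = self.lstm(x)
        return self.head(out)

model = EmotionLSTM()
dummy = torch.randn(4, 25, 120)      # 4 utterances, 25 words each
pred = model(dummy)                  # (4, 25, 2): per-word arousal/valence
loss = nn.MSELoss()(pred, torch.zeros_like(pred))
loss.backward()
```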

2.
This paper proposes an organization of presentation and control that implements a flexible audio management system we call “audio windows”. The result is a new user interface integrating an enhanced spatial sound presentation system, an audio emphasis system, and a gestural input recognition system. We have implemented these ideas in a modest prototype, also described, designed as an audio server appropriate for a teleconferencing system. Our system combines a gestural front end (currently based on a DataGlove, but whose concepts are appropriate for other devices as well) with an enhanced spatial sound system that uses digital signal processing to separate multiple sound sources, augmented with “filtears”: audio feedback cues that convey added information without distraction or loss of intelligibility. Our prototype employs a manual front end (requiring no keyboard or mouse) driving an auditory back end (requiring no CRT or visual display).

3.
《Real》2000,6(5):391-405
This paper describes a solution to the problem of transmitting “talking head” video in real-time across low-bandwidth transmission media. Our solution is based on a video reconstruction system, which can generate realistic audio-visual narrations from arbitrary pieces of text. The system uses standard synthetic speech techniques to create an audio track and then produces a synchronized “talking head” using full-frame morphing of real-world facial images (key-frames). We discuss issues of original data capture, transmission protocols, synchronization, interpolation, and the modeling of various physical activities which are vital to maintaining the “plausibility” of the final synthetic video sequence.
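
For intuition only, the sketch below shows the simplest ingredient of key-frame interpolation: a linear cross-dissolve between two facial key-frames (represented here by placeholder arrays). The full-frame morphing described in the abstract additionally warps facial geometry between key-frames, which is omitted here.

```python
# Hedged sketch: cross-dissolve component of key-frame interpolation.
# Real morphing would also warp geometry; key-frames here are placeholders.
import numpy as np

def cross_dissolve(frame_a, frame_b, num_steps=10):
    frames = []
    for i in range(num_steps):
        alpha = i / (num_steps - 1)
        blend = (1 - alpha) * frame_a + alpha * frame_b
        frames.append(blend.astype(np.uint8))
    return frames

key_a = np.zeros((120, 160, 3), dtype=float)        # placeholder key-frames
key_b = np.full((120, 160, 3), 255.0)
sequence = cross_dissolve(key_a, key_b)
print(len(sequence), sequence[5].mean())
```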

4.
Multimodal detection is an effective means of filtering adult video, but existing methods lack an accurate representation of audio semantics. This paper therefore proposes an adult-video detection method that fuses audio words with visual features. We first propose a periodicity-based segmentation algorithm for energy envelope units (EEUs), which accurately divides the audio stream into a sequence of EEUs; we then propose an audio semantics representation based on EEUs and bag-of-words (BoW), describing each EEU by the occurrence probabilities of audio words. A composite weighting scheme fuses the detection results of the audio words and the visual features. We also propose a periodicity-based adult-video decision algorithm that works together with the periodicity-based EEU segmentation to make full use of periodicity during detection. Experimental results show that, compared with methods based only on visual features, the proposed method significantly improves detection performance: at a false positive rate of 9.76%, the true positive rate reaches 94.44%.

5.
Understanding how images of objects and scenes behave in response to specific egomotions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose a new “embodied” visual learning paradigm, exploiting proprioceptive motor signals to train visual representations from egocentric video with no manual supervision. Specifically, we enforce that our learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct egomotions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.
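
One common way to encode the equivariance constraint mentioned above is to learn, for each discrete egomotion, a map that should carry the feature of a frame onto the feature of the frame observed after that egomotion. The PyTorch sketch below follows that formulation; the encoder, image size, and number of egomotion classes are placeholders, and this is not necessarily the authors' exact objective.

```python
# Hedged sketch of an equivariance loss: M_g z(x) should approximate z(g(x)).
import torch
import torch.nn as nn

feat_dim, num_motions = 64, 4
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim))
motion_maps = nn.ModuleList(nn.Linear(feat_dim, feat_dim, bias=False)
                            for _ in range(num_motions))

def equivariance_loss(frame_t, frame_t1, motion_id):
    z_t = encoder(frame_t)
    z_t1 = encoder(frame_t1)
    pred = motion_maps[motion_id](z_t)    # predicted feature after the egomotion
    return ((pred - z_t1) ** 2).mean()

x0 = torch.randn(8, 3, 32, 32)            # frames before the egomotion
x1 = torch.randn(8, 3, 32, 32)            # frames after it
loss = equivariance_loss(x0, x1, motion_id=2)
loss.backward()
```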

6.
7.
Advances in the media and entertainment industries, including streaming audio and digital TV, present new challenges for managing and accessing large audio-visual collections. Current content management systems support retrieval using low-level features, such as motion, color, and texture. However, low-level features often have little meaning for naive users, who much prefer to identify content using high-level semantics or concepts. This creates a gap between systems and their users that must be bridged for these systems to be used effectively. To this end, in this paper, we first present a knowledge-based video indexing and content management framework for domain specific videos (using basketball video as an example). We provide a solution to explore video knowledge by mining associations from video data. Explicit definitions and evaluation measures (e.g., temporal support and confidence) for video associations are proposed by integrating the distinct features of video data. Our approach uses video processing techniques to find visual and audio cues (e.g., court field, camera motion activities, and applause), introduces multilevel sequential association mining to explore associations among the audio and visual cues, classifies the associations by assigning each of them a class label, and uses their appearances in the video to construct video indices. Our experimental results demonstrate the performance of the proposed approach.

8.
Image and video processing techniques are frequently used in medical science applications. Computer vision-based systems have successfully replaced various manual medical processes, such as the analysis of physical and biomechanical parameters and the physical examination of patients. These systems are gaining popularity because of their robustness and the objectivity they bring to various medical procedures. Hammersmith Infant Neurological Examinations (HINE) is a set of physical tests that are carried out on infants in the age group of 3–24 months with neurological disorders. However, these tests are graded through visual observations, which can be highly subjective. Therefore, a computer vision-aided approach can be used to assist the experts in the grading process. In this paper, we present a method of automatic exercise classification through visual analysis of HINE videos recorded at hospitals. We use scale-invariant-feature-transform features to generate a bag-of-words from the image frames of the video sequences. The frequency of these visual words is then used to classify the video sequences with an HMM. We also present a method of event segmentation in long videos containing more than two exercises. Event segmentation coupled with a classifier can help in the automatic indexing of long and continuous video sequences of the HINE set. Our proposed framework is a step forward in the automation of HINE tests through computer vision-based methods. We conducted tests on a dataset comprising 70 HINE video sequences and found that the proposed method can successfully classify exercises with accuracy as high as 84%. The proposed work has direct applications in the automatic or semiautomatic analysis of the “vertical suspension” and “ventral suspension” tests of HINE. Although some of the critical tests, such as “pulled-to-sit,” “lateral tilting,” or “adductor’s angle measurement,” have already been addressed using image- and video-guided techniques, there is scope for further improvement.
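
The bag-of-words step described above can be sketched as follows: cluster local descriptors into a visual vocabulary with k-means, then represent each frame by its normalized word histogram. In the sketch, random vectors stand in for SIFT descriptors, the vocabulary size is arbitrary, and the downstream HMM classification is not shown.

```python
# Hedged sketch of a bag-of-visual-words representation for video frames.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab_size = 50
# pretend these are 128-D SIFT descriptors pooled from training frames
train_descriptors = rng.normal(size=(5000, 128))
vocab = KMeans(n_clusters=vocab_size, n_init=10, random_state=0).fit(train_descriptors)

def frame_histogram(descriptors):
    words = vocab.predict(descriptors)                      # nearest visual word
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / max(hist.sum(), 1.0)                      # normalized word frequencies

frame = rng.normal(size=(300, 128))                         # descriptors of one frame
print(frame_histogram(frame)[:10])
```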

9.
Multi-modal emotion recognition lacks an explicit mapping between emotion states and audio/image features, so extracting effective emotion information from audio-visual data remains a challenging issue. In addition, noise and data redundancy are not modeled well, so emotion recognition models often suffer from low efficiency. Deep neural networks (DNNs) perform excellently at feature extraction and highly non-linear feature fusion, and cross-modal noise modeling has great potential for handling data pollution and data redundancy. Inspired by these observations, this paper proposes a deep weighted fusion method for audio-visual emotion recognition. First, we conduct cross-modal noise modeling for the audio and video data, which eliminates most of the data pollution in the audio channel and the data redundancy in the visual channel. The noise modeling is implemented by voice activity detection (VAD), and the data redundancy in the visual channel is removed by aligning the speech regions in the audio and visual data. Then, we extract audio emotion features and visual expression features via two feature extractors. The audio emotion feature extractor, audio-net, is a 2D CNN that accepts image-based Mel-spectrograms as input. The facial expression feature extractor, visual-net, is a 3D CNN that is fed facial expression image sequences. To train the two convolutional neural networks efficiently on a small dataset, we adopt a transfer learning strategy. Next, we employ a deep belief network (DBN) for highly non-linear fusion of the multi-modal emotion features, training the feature extractors and the fusion network synchronously. Finally, emotion classification is performed by a support vector machine on the output of the fusion network. With cross-modal feature fusion, denoising, and redundancy removal taken into account, our fusion method shows excellent performance on the selected dataset.
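
As a hedged sketch of the audio branch only, the snippet below defines a small 2D CNN that classifies image-like Mel-spectrograms; the layer sizes, input shape, and six-class output are invented for illustration rather than the paper's architecture, and the spectrograms would be computed beforehand (e.g., with a library such as librosa).

```python
# Hypothetical "audio-net"-style 2D CNN over Mel-spectrogram images.
import torch
import torch.nn as nn

class AudioNet(nn.Module):
    def __init__(self, num_emotions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, num_emotions)

    def forward(self, mel):                     # mel: (batch, 1, n_mels, frames)
        h = self.features(mel).flatten(1)
        return self.classifier(h)

net = AudioNet()
mel_batch = torch.randn(4, 1, 64, 128)          # 4 Mel-spectrogram "images"
print(net(mel_batch).shape)                     # torch.Size([4, 6])
```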

10.
《Pattern recognition》2014,47(2):509-524
This paper presents a computationally efficient 3D face recognition system based on a novel facial signature called Angular Radial Signature (ARS) which is extracted from the semi-rigid region of the face. Kernel Principal Component Analysis (KPCA) is then used to extract the mid-level features from the extracted ARSs to improve the discriminative power. The mid-level features are then concatenated into a single feature vector and fed into a Support Vector Machine (SVM) to perform face recognition. The proposed approach addresses the expression variation problem by using facial scans with various expressions of different individuals for training. We conducted a number of experiments on the Face Recognition Grand Challenge (FRGC v2.0) and the 3D track of Shape Retrieval Contest (SHREC 2008) datasets, and a superior recognition performance has been achieved. Our experimental results show that the proposed system achieves very high Verification Rates (VRs) of 97.8% and 88.5% at a 0.1% False Acceptance Rate (FAR) for the “neutral vs. nonneutral” experiments on the FRGC v2.0 and the SHREC 2008 datasets respectively, and 96.7% for the ROC III experiment of the FRGC v2.0 dataset. Our experiments also demonstrate the computational efficiency of the proposed approach.
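
The mid-level feature step above (Kernel PCA on the raw signatures, followed by an SVM) can be sketched with scikit-learn as below; random vectors stand in for the Angular Radial Signatures, and all hyperparameters are placeholders.

```python
# Hedged sketch: Kernel PCA features followed by an SVM classifier.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))            # stand-in for concatenated ARS vectors
y = rng.integers(0, 10, 200)              # 10 hypothetical identities

model = make_pipeline(
    KernelPCA(n_components=40, kernel="rbf", gamma=0.01),
    SVC(kernel="linear"),
)
model.fit(X, y)
print(model.score(X, y))                  # training accuracy on the toy data
```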

11.
For the aging population, surveillance in household environments has become more and more important. In this paper, we present a household robot that can detect abnormal events by utilizing video and audio information. In our approach, moving targets are detected by the robot using a passive acoustic location device. The robot then tracks the targets by employing a particle filter algorithm. To adapt to different lighting conditions, the target model is updated regularly based on an update mechanism. To ensure robust tracking, the robot detects abnormal human behavior by tracking the upper body of a person. For audio surveillance, Mel frequency cepstral coefficients (MFCC) are used to extract features from the audio information. Those features are input to a support vector machine classifier for analysis. Experimental results show that the robot can detect abnormal behavior such as “falling down” and “running”. An 88.17% accuracy rate is also achieved in the detection of abnormal audio information such as “crying”, “groaning”, and “gun shooting”. To reduce false alarms from the abnormal sound detection system, the passive acoustic location device directs the robot to the scene where the abnormal event occurs, and the robot can use its camera to further confirm the occurrence of the event. Finally, the robot sends the captured image to its owner's mobile phone.
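
A minimal sketch of the audio surveillance branch, assuming MFCC statistics per clip fed to an SVM; the synthetic signals, the two-class setup, and all parameters are placeholders rather than the paper's data or configuration.

```python
# Hedged sketch: per-clip MFCC statistics classified by an SVM.
import numpy as np
import librosa
from sklearn.svm import SVC

def clip_features(y, sr=16000):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, frames)
    # summarize the clip by per-coefficient mean and std
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

rng = np.random.default_rng(0)
sr = 16000
X, labels = [], []
for label in (0, 1):                        # e.g., 0 = normal, 1 = abnormal sound
    for _ in range(20):
        sig = rng.normal(0, 1, sr)          # 1-second synthetic "audio"
        if label:
            sig = np.cumsum(sig)            # crude low-frequency-heavy signal
        X.append(clip_features(sig.astype(np.float32), sr))
        labels.append(label)

clf = SVC(kernel="rbf").fit(np.array(X), np.array(labels))
print(clf.predict(np.array(X[:3])))
```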

12.
Detection of Violent Scenes in Video Based on Audio and Visual Features   Cited by: 2 (self-citations: 0, others: 2)
This paper introduces a method for detecting violent scenes in video. The method analyzes the video using both audio and image features, which improves detection accuracy. It can be used to index high-level semantic features of video and thus to support content-based video retrieval.

13.
Pornographic video detection based on multimodal fusion is an effective approach for filtering pornography. However, existing methods lack accurate representation of audio semantics and pay little attention to the characteristics of pornographic audios. In this paper, we propose a novel framework of fusing audio vocabulary with visual features for pornographic video detection. The novelty of our approach lies in three aspects: an audio semantics representation method based on an energy envelope unit (EEU) and bag-of-words (BoW), a periodicity-based audio segmentation algorithm, and a periodicity-based video decision algorithm. The first one, named the EEU+BoW representation method, is proposed to describe the audio semantics via an audio vocabulary. The audio vocabulary is constructed by k-means clustering of EEUs. The latter two aspects echo with each other to make full use of the periodicities in pornographic audios. Using the periodicity-based audio segmentation algorithm, audio streams are divided into EEU sequences. After these EEUs are classified, videos are judged to be pornographic or not by the periodicity-based video decision algorithm. Before fusion, two support vector machines are respectively applied for the audio-vocabulary-based and visual-features-based methods. To fuse their results, a keyframe is selected from each EEU in terms of the beginning and ending positions, and then an integrated weighted scheme and a periodicity-based video decision algorithm are adopted to yield final detection results. Experimental results show that our approach outperforms the traditional one which is only based on visual features, and achieves satisfactory performance. The true positive rate achieves 94.44% while the false positive rate is 9.76%.
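
Two pieces of the pipeline above lend themselves to a compact sketch: building the audio vocabulary by k-means clustering of EEU feature vectors, and a simple weighted fusion of audio-based and visual-based scores. The EEU features, vocabulary size, scores, and weight below are invented, and the paper's integrated weighted scheme may differ in detail.

```python
# Hedged sketch: k-means audio vocabulary over EEU features plus a
# simple weighted fusion of audio and visual detection scores.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
eeu_features = rng.normal(size=(1000, 20))      # one row per segmented EEU
audio_vocab = KMeans(n_clusters=32, n_init=10, random_state=0).fit(eeu_features)

def audio_word_distribution(eeus):
    words = audio_vocab.predict(eeus)
    probs = np.bincount(words, minlength=32).astype(float)
    return probs / probs.sum()                  # occurrence probabilities of audio words

def fuse(audio_score, visual_score, w_audio=0.4):
    # convex combination; the weight would be tuned in practice
    return w_audio * audio_score + (1 - w_audio) * visual_score

print(audio_word_distribution(rng.normal(size=(50, 20)))[:8])
print(fuse(0.9, 0.6))
```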

14.
15.
While most existing sports video research focuses on detecting events in sports such as soccer and baseball, little work has addressed flexible content summarization of racquet sports video, e.g., tennis and table tennis. By taking advantage of the periodicity of video shot content and audio keywords in racquet sports video, we propose a novel flexible video content summarization framework. Our approach combines a structure event detection method with a highlight ranking algorithm. First, unsupervised shot clustering and supervised audio classification are performed to obtain visual and audio mid-level patterns, respectively. Then, a temporal voting scheme for structure event detection is proposed that utilizes the correspondence between audio and video content. Finally, using the affective features extracted from the detected events, a linear highlight model is adopted to rank the detected events in terms of their degree of excitement. Experimental results show that the proposed approach is effective.
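
The highlight-ranking step can be illustrated with a toy linear model: each detected event is described by a few affective features and scored by a weighted sum, and events are then ranked by score. The feature meanings, values, and weights below are invented.

```python
# Hedged sketch of linear highlight ranking over detected events.
import numpy as np

# rows: events; columns: e.g. cheer energy, rally length, motion intensity
events = np.array([
    [0.9, 0.4, 0.7],
    [0.2, 0.8, 0.3],
    [0.6, 0.6, 0.9],
])
weights = np.array([0.5, 0.2, 0.3])     # learned or hand-tuned in practice

scores = events @ weights               # linear "exciting degree" per event
ranking = np.argsort(scores)[::-1]      # most exciting first
print(list(ranking), scores[ranking])
```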

16.
Most existing videos contain only mono audio and lack the sense of space that binaural (two-channel) audio provides. To address this problem, this paper proposes a multimodal-perception-based method for generating binaural audio: building on an analysis of the visual information in the video, it fuses the video's spatial information with the audio content and automatically adds spatial characteristics to the original mono audio, generating binaural audio that is closer to a real listening experience. We first adopt an improved audio-video fusion analysis network with an encoder-decoder structure to encode the mono video, then perform multi-scale fusion of the video and audio features and analyze the video and audio information jointly, so that the generated binaural audio carries spatial information absent from the original mono track. Experimental results on public datasets show that the method outperforms existing models for binaural audio generation, improving both the STFT-distance and ENV-distance metrics.

17.
赵奇  刘皎瑶  徐敬东 《计算机工程》2007,33(22):134-136
This paper proposes a new method for detecting video content that combines audio and visual features in order to improve the accuracy of video content recognition. The method analyzes visual features and audio features separately, introduces a support vector machine to classify audio segments, and judges the video content by combining the analysis results from the audio and visual domains. Analysis of specific video clips shows that the combined audio-visual approach is feasible and effective and can be applied to video content monitoring and to the retrieval and segmentation of specific video clips.

18.
This paper analyzes the issue of catastrophic fusion, a problem that occurs in multimodal recognition systems that integrate the output from several modules while working in non-stationary environments. For concreteness we frame the analysis with regard to the problem of automatic audio visual speech recognition (AVSR), but the issues at hand are very general and arise in multimodal recognition systems which need to work in a wide variety of contexts. Catastrophic fusion is said to have occurred when the performance of a multimodal system is inferior to the performance of some isolated modules, e.g., when the performance of the audio visual speech recognition system is inferior to that of the audio system alone. Catastrophic fusion arises because recognition modules make implicit assumptions and thus operate correctly only within a certain context. Practice shows that when modules are tested in contexts inconsistent with their assumptions, their influence on the fused product tends to increase, with catastrophic results. We propose a principled solution to this problem based upon Bayesian ideas of competitive models and inference robustification. We study the approach analytically on a classic Gaussian discrimination task and then apply it to a realistic problem on audio visual speech recognition (AVSR) with excellent results.
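
To make the notion of catastrophic fusion concrete, the toy example below reproduces it on a Gaussian discrimination task: naive product-of-likelihoods fusion of a clean "video" channel with a mismatched, noisy "audio" channel scores worse than the video channel alone, while down-weighting the unreliable stream recovers sensible behaviour. The fixed stream weight is only a stand-in for the paper's competitive-model approach, and all numbers are invented.

```python
# Hedged toy illustration of catastrophic fusion and its mitigation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 2000
labels = rng.integers(0, 2, n)                 # true class 0 or 1
video = labels + rng.normal(0, 1.0, n)         # clean channel
audio = labels + rng.normal(0, 5.0, n)         # heavily corrupted channel

def loglik(x, mean, sigma):
    return norm.logpdf(x, mean, sigma)

def classify(audio_weight):
    # per-class log-likelihoods under (mismatched) unit-variance models
    scores = [audio_weight * loglik(audio, c, 1.0) + loglik(video, c, 1.0)
              for c in (0, 1)]
    return (scores[1] > scores[0]).astype(int)

video_only = ((loglik(video, 1, 1.0) > loglik(video, 0, 1.0)).astype(int) == labels).mean()
print(f"video alone: accuracy {video_only:.3f}")
for w in (1.0, 0.1):                           # naive fusion vs. down-weighted audio
    acc = (classify(w) == labels).mean()
    print(f"audio weight {w}: accuracy {acc:.3f}")
```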

19.
In recent years, action recognition through joint audio-visual learning has attracted growing attention. In both video (the visual modality) and audio (the auditory modality), actions are transient, and it is usually the information within the time span in which the action occurs that expresses the action category most distinctly. How to better exploit the salient action information carried by the key frames of the audio and visual modalities is one of the open problems in audio-visual action recognition. To address this problem, we propose a key-frame selection network, KFIA-S, which uses a fully-connected-layer-based linear temporal...

20.
A Video Content Detection Method Based on Audio and Visual Features   Cited by: 1 (self-citations: 1, others: 1)
蔡群  陆松年  杨树堂 《计算机工程》2007,33(22):240-242
This paper proposes a new method for detecting video content that combines audio and visual features in order to improve the accuracy of video content recognition. The method analyzes visual features and audio features separately, introduces a support vector machine to classify audio segments, and judges the video content by combining the analysis results from the audio and visual domains. Analysis of specific video clips shows that the combined audio-visual approach is feasible and effective and can be applied to video content monitoring and to the retrieval and segmentation of specific video clips.
