首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Video question answering (Video QA) involves a thorough understanding of video content and question language, as well as the grounding of the textual semantic to the visual content of videos. Thus, to answer the questions more accurately, not only the semantic entity should be associated with certain visual instance in video frames, but also the action or event in the question should be localized to a corresponding temporal slot. It turns out to be a more challenging task that requires the ability of conducting reasoning with correlations between instances along temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are firstly embedded into graph nodes, which benefits the integration of intra- and inter-modality. Then, we propose graph causal convolution (GCC) on graph-structured sequence with a large receptive field to capture more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on TVQA+ dataset, which contains the groundtruth of instance grounding and temporal localization, three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method. Specifically, our method outperforms the state-of-the-art methods on these benchmarks.  相似文献   

2.
3.
Moving vehicle detection and classification using multimodal data is a challenging task in data collection, audio-visual alignment, data labeling and feature selection under uncontrolled environments with occlusions, motion blurs, varying image resolutions and perspective distortions. In this work, we propose an effective multimodal temporal panorama approach for moving vehicle detection and classification using a novel long-range audio-visual sensing system. A new audio-visual vehicle (AVV) dataset is created, which features automatic vehicle detection and audio-visual alignment, accurate vehicle extraction and reconstruction, and efficient data labeling. In particular, vehicles’ visual images are reconstructed once detected in order to remove most of the occlusions, motion blurs, and variations of perspective views. Multimodal audio-visual features are extracted, including global geometric features (aspect ratios, profiles), local structure features (HOGs), as well various audio features (MFCCs, etc.). Using radial-based SVMs, the effectiveness of the integration of these multimodal features is thoroughly and systematically studied. The concept of MTP may not be only limited to visual, motion and audio modalities; it could also be applicable to other sensing modalities that can obtain data in the temporal domain.  相似文献   

4.
Gao  Bingkun  Ma  Ke  Bi  Hongbo  Wang  Ling  Wu  Chenlei 《Multimedia Tools and Applications》2021,80(19):29251-29265

The human pose estimation in images and videos is a challenging task in many applications. Most of the network structures used to estimate the pose only use the convolution feature of the last layer, which will cause the loss of information. In this paper, we propose a multi-scales fusion framework based on the hourglass network for the human pose estimation, which can effectively obtain sufficient information of different resolutions. In the process of extracting different resolution features, the network constantly complements the high resolution features. Additionally, we design the depth pyramid residual module to fuse different various scales features. The whole network is stacked by sub-networks. For applying in limited storage space better, we only use 2-stage stacked network. We test the network on standard benchmarks MPII dataset, our method achieves 88.9% PCKh score and improves the PCK score by 0.7%, compared with the original network. Our approach gains state-of-the-art results.

  相似文献   

5.
Video contents contain complex structures due to the variety of the components and events involved. For example, surveillance videos often record multi-object interactions and consist of various scales of motion detail; Web videos are composed of multimodal cues, and each cue generally consists of a variety of scales of information. Generally, video contents comprise two types of the combination of the inherent structures: multi-modality/multi-scale and multi-object /multi-scale. Therefore, in this paper, we propose a new framework for video content modeling, under which video contents are decomposed into multiple interacting processes by double decomposition that aims at each type of combination of structures. To model the resulting processes, we propose a method named double-decomposed hidden Markov models (DDHMMs). DDHMMs contain multiple state chains that correspond to the interacting processes. To make the switching frequency of states in each chain consistent with the scale of the corresponding process, a durational state variable is introduced in DDHMMs. The proposed method performs well in modeling the relations among the interacting processes and the dynamics of each. We discuss the appropriate features under the proposed framework and evaluate DDHMMs in two applications, human motion recognition and web video categorization. The experimental results demonstrate that the double decomposition enhances video categorization performance in both cases.  相似文献   

6.
With the advance of digital video recording and playback systems, the request for efficiently managing recorded TV video programs is evident so that users can readily locate and browse their favorite programs. In this paper, we propose a multimodal scheme to segment and represent TV video streams. The scheme aims to recover the temporal and structural characteristics of TV programs with visual, auditory, and textual information. In terms of visual cues, we develop a novel concept named program-oriented informative images (POIM) to identify the candidate points correlated with the boundaries of individual programs. For audio cues, a multiscale Kullback-Leibler (K-L) distance is proposed to locate audio scene changes (ASC), and accordingly ASC is aligned with video scene changes to represent candidate boundaries of programs. In addition, latent semantic analysis (LSA) is adopted to calculate the textual content similarity (TCS) between shots to model the inter-program similarity and intra-program dissimilarity in terms of speech content. Finally, we fuse the multimodal features of POIM, ASC, and TCS to detect the boundaries of programs including individual commercials (spots). Towards effective program guide and attracting content browsing, we propose a multimodal representation of individual programs by using POIM images, key frames, and textual keywords in a summarization manner. Extensive experiments are carried out over an open benchmarking dataset TRECVID 2005 corpus and promising results have been achieved. Compared with the electronic program guide (EPG), our solution provides a more generic approach to determine the exact boundaries of diverse TV programs even including dramatic spots.  相似文献   

7.
Anticipating future actions without observing any partial videos of future actions plays an important role in action prediction and is also a challenging task. To obtain abundant information for action anticipation, some methods integrate multimodal contexts, including scene object labels. However, extensively labelling each frame in video datasets requires considerable effort. In this paper, we develop a weakly supervised method that integrates global motion and local fine-grained features from current action videos to predict next action label without the need for specific scene context labels. Specifically, we extract diverse types of local features with weakly supervised learning, including object appearance and human pose representations without ground truth. Moreover, we construct a graph convolutional network for exploiting the inherent relationships of humans and objects under present incidents. We evaluate the proposed model on two datasets, the MPII-Cooking dataset and the EPIC-Kitchens dataset, and we demonstrate the generalizability and effectiveness of our approach for action anticipation.  相似文献   

8.
遥感视觉问答(remote sensing visual question answering,RSVQA)旨在从遥感图像中抽取科学知识.近年来,为了弥合遥感视觉信息与自然语言之间的语义鸿沟,涌现出许多方法.但目前方法仅考虑多模态信息的对齐和融合,既忽略了对遥感图像目标中的多尺度特征及其空间位置信息的深度挖掘,又缺乏对尺度特征的建模和推理的研究,导致答案预测不够全面和准确.针对以上问题,本文提出了一种多尺度引导的融合推理网络(multi-scale guided fusion inference network,MGFIN),旨在增强RSVQA系统的视觉空间推理能力.首先,本文设计了基于Swin Transformer的多尺度视觉表征模块,对嵌入空间位置信息的多尺度视觉特征进行编码;其次,在语言线索的引导下,本文使用多尺度关系推理模块以尺度空间为线索学习跨多个尺度的高阶群内对象关系,并进行空间层次推理;最后,设计基于推理的融合模块来弥合多模态语义鸿沟,在交叉注意力基础上,通过自监督范式、对比学习方法、图文匹配机制等训练目标来自适应地对齐融合多模态特征,并辅助预测最终答案.实验结果表明,本文提出的模型在两个公共RSVQA数据集上具有显著优势.  相似文献   

9.
目的 时序动作检测(temporal action detection)作为计算机视觉领域的一个热点课题,其目的是检测视频中动作发生的具体区间,并确定动作的类别。这一课题在现实生活中具有深远的实际意义。如何在长视频中快速定位且实现时序动作检测仍然面临挑战。为此,本文致力于定位并优化动作发生时域的候选集,提出了时域候选区域优化的时序动作检测方法TPO (temporal proposal optimization)。方法 采用卷积神经网络(convolutional neural network,CNN)和双向长短期记忆网络(bidirectional long short term memory,BLSTM)来捕捉视频的局部时序关联性和全局时序信息;并引入联级时序分类优化(connectionist temporal classification,CTC)方法,评估每个时序位置的边界概率和动作概率得分;最后,融合两者的概率得分曲线,优化时域候选区域候选并排序,最终实现时序上的动作检测。结果 在ActivityNet v1.3数据集上进行实验验证,TPO在各评价指标,如一定时域候选数量下的平均召回率AR@100(average recall@100),曲线下的面积AUC (area under a curve)和平均均值平均精度mAP (mean average precision)上分别达到74.66、66.32、30.5,而各阈值下的均值平均精度mAP@IoU (mAP@intersection over union)在阈值为0.75和0.95时也分别达到了30.73和8.22,与SSN (structured segment network)、TCN (temporal context network)、Prop-SSAD (single shot action detector for proposal)、CTAP (complementary temporal action proposal)和BSN (boundary sensitive network)等方法相比,TPO的所有性能指标均有提高。结论 本文提出的模型兼顾了视频的全局时序信息和局部时序信息,使得预测的动作候选区域边界更为准确和灵活,同时也验证了候选区域的准确性能够有效提高时序动作检测的精确度。  相似文献   

10.
基于时空关注度LSTM的行为识别   总被引:1,自引:0,他引:1  
针对现有基于视频整体序列结构建模的行为识别方法中,存在着大量时空背景混杂信息,而引起的行为表达的判决能力低和行为类别错误判定的问题,提出一种基于双流特征的时空关注度长短时记忆网络模型.首先,本文定义了一种基于双流的时空关注度模块,其中,空间关注度用于抑制空间背景混杂,时间关注度用于抑制低信息量的视频帧.其次,本文为双流模型设计了两种不同的时空关注度模块,分别讨论不带融合形式和双流融合的形式对行为识别的影响.最后,为了适应不同长度视频的处理需求,本文方法采用分段策略构建行为识别框架,通过调整段的数量自适应视频长度.在UCF101和HMDB51两个数据集上进行实验验证,与现有多种基于时间和空间显著性模型的行为识别方法进行比较,实验结果表明,本文方法在识别率上优于现有行为识别方法I3D,在UCF101上提高了0.66%,在HMDB51上提高了0.75%.  相似文献   

11.

Videos are tampered by the forgers to modify or remove their content for malicious purpose. Many video authentication algorithms are developed to detect this tampering. At present, very few standard and diversified tampered video dataset is publicly available for reliable verification and authentication of forensic algorithms. In this paper, we propose the development of total 210 videos for Temporal Domain Tampered Video Dataset (TDTVD) using Frame Deletion, Frame Duplication and Frame Insertion. Out of total 210 videos, 120 videos are developed based on Event/Object/Person (EOP) removal or modification and remaining 90 videos are created based on Smart Tampering (ST) or Multiple Tampering. 16 original videos from SULFA and 24 original videos from YouTube (VTD Dataset) are used to develop different tampered videos. EOP based videos include 40 videos for each tampering type of frame deletion, frame insertion and frame duplication. ST based tampered video contains multiple tampering in a single video. Multiple tampering is developed in three categories (1) 10-frames tampered (frame deletion, frame duplication or frame insertion) at 3-different locations (2) 20-frames tampered at 3- different locations and (3) 30-frames tampered at 3-different locations in the video. Proposed TDTVD dataset includes all temporal domain tampering and also includes multiple tampering videos. The resultant tampered videos have video length ranging from 6 s to 18 s with resolution 320X240 or 640X360 pixels. The database is comprised of static and dynamic videos with various activities, like traffic, sports, news, a ball rolling, airport, garden, highways, zoom in zoom out etc. This entire dataset is publicly accessible for researchers, and this will be especially valuable to test their algorithms on this vast dataset. The detailed ground truth information like tampering type, frames tampered, location of tampering is also given for each developed tampered video to support verifying tampering detection algorithms. The dataset is compared with state of the art and validated with two video tampering detection methods.

  相似文献   

12.
13.
图像的自然语言描述(image captioning)是一个融合计算机视觉、自然语言处理和机器学习的跨领域课题。它作为多模态处理的关键技术,近年来取得了显著成果。当前研究大多针对图像生成英文摘要,而对于中文摘要的生成方法研究较少。该文提出了一种基于多模态神经网络的图像中文摘要生成方法。该方法由编码器和解码器组成,编码器基于卷积神经网络,包括单标签视觉特征提取网络和多标签关键词特征预测网络,解码器基于长短时记忆网络,由多模态摘要生成网络构成。在解码过程中,该文针对长短时记忆网络的特点提出了四种多模态摘要生成方法CNIC-X、CNIC-H、CNIC-C和CNIC-HC。在中文摘要数据集Flickr8k-CN上实验,结果表明该文提出的方法优于现有的中文摘要生成模型。  相似文献   

14.
针对动态复杂场景下的操作动作识别,提出一种基于手势特征融合的动作识别框架,该框架主要包含RGB视频特征提取模块、手势特征提取模块与动作分类模块。其中RGB视频特征提取模块主要使用I3D网络提取RGB视频的时间和空间特征;手势特征提取模块利用Mask R-CNN网络提取操作者手势特征;动作分类模块融合上述特征,并输入到分类器中进行分类。在EPIC-Kitchens数据集上,提出的方法识别抓取手势的准确性高达89.63%,识别综合动作的准确度达到了74.67%。  相似文献   

15.
In heterogeneous networks, different modalities are coexisting. For example, video sources with certain lengths usually have abundant time-varying audiovisual data. From the users’ perspective, different video segments will trigger different kinds of emotions. In order to better interact with users in heterogeneous networks and improve their user experiences, affective video content analysis to predict users’ emotions is essential. Academically, users’ emotions can be evaluated by arousal and valence values, and fear degree, which provides an approach to quantize the prediction accuracy of the reaction of the audience and users towards videos. In this paper, we propose the multimodal data fusion method for integrating the visual and audio data in order to perform the affective video content analysis. Specifically, to align the visual and audio data, the temporal attention filters are proposed to obtain the time-span features of the entire video segments. Then, by using the two-branch network structure, matched visual and audio features are integrated in the common space. At last, the fused audiovisual feature is employed for the regression and classification subtasks in order to measure the emotional responses of users. Simulation results show that the proposed method can accurately predict the subjective feelings of users towards the video contents, which provides a way to predict users’ preferences and recommend videos according to their own demand.  相似文献   

16.
目的 立体视频能提供身临其境的逼真感而越来越受到人们的喜爱,而视觉显著性检测可以自动预测、定位和挖掘重要视觉信息,可以帮助机器对海量多媒体信息进行有效筛选。为了提高立体视频中的显著区域检测性能,提出了一种融合双目多维感知特性的立体视频显著性检测模型。方法 从立体视频的空域、深度以及时域3个不同维度出发进行显著性计算。首先,基于图像的空间特征利用贝叶斯模型计算2D图像显著图;接着,根据双目感知特征获取立体视频图像的深度显著图;然后,利用Lucas-Kanade光流法计算帧间局部区域的运动特征,获取时域显著图;最后,将3种不同维度的显著图采用一种基于全局-区域差异度大小的融合方法进行相互融合,获得最终的立体视频显著区域分布模型。结果 在不同类型的立体视频序列中的实验结果表明,本文模型获得了80%的准确率和72%的召回率,且保持了相对较低的计算复杂度,优于现有的显著性检测模型。结论 本文的显著性检测模型能有效地获取立体视频中的显著区域,可应用于立体视频/图像编码、立体视频/图像质量评价等领域。  相似文献   

17.
Interaction and integration of multimodality media types such as visual, audio, and textual data in video are the essence of video semantic analysis. Contextual information propagation is useful for both intra- and inter-shot correlations. However, the traditional concatenated vector representation of videos weakens the power of the propagation and compensation among the multiple modalities. In this paper, we introduce a higher-order tensor framework for video analysis. We represent image frame, audio, and text in video shots as data points by the 3rd-order tensor. Then we propose a novel dimension reduction algorithm which explicitly considers the manifold structure of the tensor space from contextual temporal associated cooccurring multimodal media data. Our algorithm inherently preserves the intrinsic structure of the sub- manifold where tensorshots are sampled and is also able to map out-of-sample data points directly. We propose a new transductive support tensor machines algorithm to train effective classifier using large amount of unlabeled data together with the labeled data. Experiment results on TREVID 2005 data set show that our method improves the performance of video semantic concept detection.  相似文献   

18.
Jiang  Guanghao  Jiang  Xiaoyan  Fang  Zhijun  Chen  Shanshan 《Applied Intelligence》2021,51(10):7043-7057

Due to illumination changes, varying postures, and occlusion, accurately recognizing actions in videos is still a challenging task. A three-dimensional convolutional neural network (3D CNN), which can simultaneously extract spatio-temporal features from sequences, is one of the mainstream models for action recognition. However, most of the existing 3D CNN models ignore the importance of individual frames and spatial regions when recognizing actions. To address this problem, we propose an efficient attention module (EAM) that contains two sub-modules, that is, a spatial efficient attention module (EAM-S) and a temporal efficient attention module (EAM-T). Specifically, without dimensionality reduction, EAM-S concentrates on mining category-based correlation by local cross-channel interaction and assigns high weights to important image regions, while EAM-T estimates the importance score of different frames by cross-frame interaction between each frame and its neighbors. The proposed EAM module is lightweight yet effective, and it can be easily embedded into 3D CNN-based action recognition models. Extensive experiments on the challenging HMDB-51 and UCF-101 datasets showed that our proposed module achieves state-of-the-art performance and can significantly improve the recognition accuracy of 3D CNN-based action recognition methods.

  相似文献   

19.
陈师哲  王帅  金琴 《软件学报》2018,29(4):1060-1070
自动情感识别是一个非常具有挑战性的课题,并且有着广泛的应用价值.本文探讨了在多文化场景下的多模态情感识别问题.我们从语音声学和面部表情等模态分别提取了不同的情感特征,包括传统的手工定制特征和基于深度学习的特征,并通过多模态融合方法结合不同的模态,比较不同单模态特征和多模态特征融合的情感识别性能.我们在CHEAVD中文多模态情感数据集和AFEW英文多模态情感数据集进行实验,通过跨文化情感识别研究,我们验证了文化因素对于情感识别的重要影响,并提出3种训练策略提高在多文化场景下情感识别的性能,包括:分文化选择模型、多文化联合训练以及基于共同情感空间的多文化联合训练,其中基于共同情感空间的多文化联合训练通过将文化影响与情感特征分离,在语音和多模态情感识别中均取得最好的识别效果.  相似文献   

20.

TV news channels present rich and complete experience of various events through audio-visual content. This makes television news an influential medium to affect masses and thus persuaded various social scientists and regulators to monitor and analyze the content of broadcast videos. An organized archive of newscast is a prerequisite for any such analysis. Creating such archive requires segmentation of continuous news videos into suitable logical units. Based on the application, these logical units may be one of channel content obtained after advertisement removal, different shows, news stories or video shots. In this work, we propose an end to end system with software architecture for segmenting the TV broadcast videos at all these four granularities. The videos are segmented into shots. Video shots are used as basic unit for all further processing. Video shots are first subjected to advertisement detection and removal to obtain the non-commercial channel content. This channel content is further processed to identify various program boundaries. We propose to identify three types of shows based on the presentation format viz. news bulletins, interviews and debates. News bulletins so obtained are processed further to obtain news stories. We propose a modular and scalable framework and software architecture for the broadcast segmentation system for deployment on a computation cluster. This involves scheduler based recording module and broadcast segmentation module. We have presented the detailed software architecture for individual modules, automation of entire processing pipeline along with resource and database management systems. We have implemented and verified the software architecture by deploying the proposed system on a cluster of nine desktops and one workstation. The deployed system was used for round the clock processing of three Indian English news channels.

  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号