Activity Recognition Based on Spatial-Temporal Attention LSTM
Cite this article: XIE Zhao, ZHOU Yi, WU Ke-Wei, ZHANG Shun-Ran. Activity Recognition Based on Spatial-Temporal Attention LSTM[J]. Chinese Journal of Computers, 2021, 44(2): 261-274.
Authors: XIE Zhao, ZHOU Yi, WU Ke-Wei, ZHANG Shun-Ran
Affiliation (all authors): School of Computer and Information, Hefei University of Technology, Hefei 230601, China
Funding: Supported by a key project of the National Key Research and Development Program of China and by the National Natural Science Foundation of China.
Abstract: Existing activity recognition methods that model the whole structure of a video sequence contain a large amount of mixed spatio-temporal background information, which weakens the discriminative power of the activity representation and leads to misclassification of activity categories. To address this, we propose a spatial-temporal attention long short-term memory (LSTM) network model based on two-stream features. First, we define a two-stream spatial-temporal attention module, in which spatial attention suppresses spatial background clutter and temporal attention suppresses low-information video frames. Second, we design two different spatial-temporal attention modules for the two-stream model, and separately examine how the unfused and the two-stream-fused forms affect activity recognition. Finally, to handle videos of different lengths, the method builds its recognition framework on a segmentation strategy, adapting to the video length by adjusting the number of segments. Experiments on the UCF101 and HMDB51 datasets compare the proposed method against a variety of existing activity recognition methods based on temporal and spatial saliency models. The results show that our method surpasses the existing activity recognition method I3D in recognition rate, improving by 0.66% on UCF101 and by 0.75% on HMDB51.
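The serial spatial-then-temporal attention described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the softmax scoring of raw feature magnitudes and the toy per-frame feature vectors are choices made for the example only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def spatial_temporal_attention(frames):
    """frames: list of T frames, each a list of N spatial feature scores.
    Serial attention, as in the abstract: spatial attention first pools
    each frame (suppressing background locations), then temporal attention
    weights the frames (suppressing low-information ones)."""
    # Spatial attention: weight locations within each frame.
    pooled = []
    for feat in frames:
        w = softmax(feat)  # hypothetical scoring by feature magnitude
        pooled.append(sum(wi * fi for wi, fi in zip(w, feat)))
    # Temporal attention: weight frames by their pooled response.
    tw = softmax(pooled)
    return [twi * p for twi, p in zip(tw, pooled)]
```

A frame with one strongly responding location (foreground) ends up with a larger weighted response than a flat, uninformative frame, which is the denoising effect the module is designed to provide.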

Keywords: activity recognition  spatial-temporal attention  two-stream fusion  long short-term memory  deep features

Activity Recognition Based on Spatial-Temporal Attention LSTM
XIE Zhao, ZHOU Yi, WU Ke-Wei, ZHANG Shun-Ran. Activity Recognition Based on Spatial-Temporal Attention LSTM[J]. Chinese Journal of Computers, 2021, 44(2): 261-274.
Authors:XIE Zhao  ZHOU Yi  WU Ke-Wei  ZHANG Shun-Ran
Affiliation: (School of Computer and Information, Hefei University of Technology, Hefei 230601)
Abstract: Objective: With the increasingly mature video media business, activity recognition in videos is an important research topic in the field of computer vision. Activity recognition is the high-level processing stage of human activity visual analysis, which enables a computer system to understand an activity and its situation by extracting vision information of interest and identifying the ongoing activity in video. Various existing models of activity recognition are usually built on the whole structure of the video sequence. However, massive videos often have uneven content, so these models carry a large amount of spatio-temporal background noise, which leads to uncertain activity representation and false activity recognition. To solve these problems, we propose a spatio-temporal attention long short-term memory network (STA-LSTM) based on two-stream features. Firstly, we define a spatial-temporal attention module with two-stream features, the two streams being appearance and motion. Specifically, we apply GoogLeNet/I3D to the RGB image and the optical flow image separately, and use the last-layer feature map of GoogLeNet/I3D as our appearance feature and motion feature. Because spatial attention precedes temporal attention in human visual perception, we design a serial process for the spatial-temporal attention module, in which spatial attention is used to suppress spatial background noise, and temporal attention is used to suppress low-information video frames. Secondly, we design two different fusion approaches for the spatial-temporal attention modules with two-stream features. One approach applies spatial-temporal attention to the appearance and motion channels separately. The other applies spatial-temporal attention to fused appearance and motion channels, in which we average the two-stream attention in both the spatial and the temporal attention. Once the spatial and temporal attention are obtained, we use them as weights to denoise low-information spatio-temporal features. Finally, in order to adapt to recognizing videos of different lengths, our method adopts a segmentation strategy for activity recognition. Each video segment is modeled by an STA-LSTM. We use softmax to estimate the confidence of each activity category in the STA-LSTM of each segment, then apply sum pooling to predict the activity label over all segments in one video. Our framework is therefore extensible: it adjusts to the video length by setting the number of segments. Experiments are carried out on two acknowledged datasets, UCF101 and HMDB51. We carefully select comparative activity recognition methods according to temporal and spatial attention. (1) Methods without an attention module include Inflated 3D ConvNet (I3D), Composite LSTM Model, Two-Stream 3DNet, Hidden Two-Stream, and so on. (2) Methods with a spatial attention module include Soft Attention Model and Recurrent Mixture Density Network. (3) Methods with a temporal attention module include Beyond Short Snippets Models. (4) Methods with a spatio-temporal attention module include Spatio-Temporal Attention CNN. The experimental results show that our method is superior to the existing activity recognition method I3D: our performance increases by 0.66% on the UCF101 dataset and by 0.75% on the HMDB51 dataset. Our model also outperforms other models with spatial or temporal attention modules, which supports our contributions: (1) the spatio-temporal attention module improves activity representation by denoising spatio-temporal background information; (2) the module with fused appearance and motion channels produces better spatial-temporal attention on uneven content.
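The two-stream fusion and segment-level prediction steps described above can be sketched in plain Python. This is a hedged illustration of the ideas only: the helper names (`fuse_attention`, `predict_video`) and the toy scores are assumptions introduced for the example, not the paper's code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

def fuse_attention(app_att, mot_att):
    """Average-fuse appearance-stream and motion-stream attention weights
    (the fused two-stream variant), renormalizing so they sum to 1."""
    fused = [(a + m) / 2.0 for a, m in zip(app_att, mot_att)]
    s = sum(fused)
    return [f / s for f in fused]

def predict_video(segment_scores):
    """segment_scores: one list of per-class scores per video segment
    (each segment modeled by its own STA-LSTM in the paper's framework).
    Softmax each segment's scores into confidences, sum-pool across
    segments, and return the index of the winning class."""
    per_class = [0.0] * len(segment_scores[0])
    for scores in segment_scores:
        probs = softmax(scores)
        per_class = [p + q for p, q in zip(per_class, probs)]
    return max(range(len(per_class)), key=lambda c: per_class[c])
```

Because each segment contributes a normalized confidence vector before sum pooling, adding or removing segments rescales but does not distort the vote, which is what lets the framework adapt to video length by simply changing the segment count.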
Keywords: action recognition  spatial-temporal attention  two-stream fusion  long short-term memory  deep features
This article is indexed in databases including VIP and Wanfang Data.