T-STAM: end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism
Citation: Shi Xiangbin, Li Yiying, Liu Fang, Dai Qin. T-STAM: end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism[J]. Application Research of Computers, 2021, 38(4): 1235-1239, 1276.
Authors: Shi Xiangbin, Li Yiying, Liu Fang, Dai Qin
Affiliations: College of Information, Liaoning University, Shenyang 110036, China; College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China; College of Information, Shenyang Institute of Engineering, Shenyang 110136, China
Funding: National Natural Science Foundation of China
Abstract: Two-stream methods for video action recognition tend to ignore the interdependencies between feature channels, and their features contain a large amount of redundant spatio-temporal information. To address these problems, this paper proposes T-STAM, an end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism, which makes full use of the key spatio-temporal information in a video. First, it introduces a channel attention mechanism into the two-stream base networks, calibrating channel information by modeling the dependencies between feature channels to improve the expressiveness of the features. Second, it proposes a CNN-based temporal attention model that learns an attention score for each frame with few parameters, focusing on frames with pronounced motion. It also proposes a multi-spatial attention model that computes attention scores for every position in a frame from different perspectives, extracting multiple motion-salient regions, and fuses the temporal and spatial features to further strengthen the video's feature representation. Finally, it feeds the fused features into a classification network and combines the outputs of the two streams with different weights to obtain the recognition result. Experimental results on the HMDB51 and UCF101 datasets show that T-STAM can effectively recognize actions in video.

Keywords: action recognition; two-stream; channel information; spatio-temporal attention; motion saliency areas
Received: 2020-02-24
Revised: 2021-03-10

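The abstract names three building blocks: channel attention that recalibrates feature channels, temporal attention that scores frames, and weighted late fusion of the two streams. A minimal NumPy sketch of these ideas follows; the layer sizes, the linear frame-scoring stand-in for the paper's CNN-based temporal model, and the 0.6/0.4 fusion weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat, w1, w2):
    """SE-style channel recalibration (a common form of the channel
    attention the abstract describes): squeeze by global average pooling,
    excite with a two-layer bottleneck, then rescale each channel.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    z = feat.mean(axis=(1, 2))                            # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0))))  # gate: (C,)
    return feat * s[:, None, None]                        # recalibrated map

def temporal_attention(frame_feats, w):
    """Score each frame and pool frames by attention weight.
    A linear scorer stands in for the paper's small CNN.
    frame_feats: (T, D); w: (D,)."""
    scores = softmax(frame_feats @ w)                     # (T,) over frames
    return scores @ frame_feats                           # weighted sum: (D,)

def fuse_streams(rgb_logits, flow_logits, w_rgb=0.6, w_flow=0.4):
    """Late fusion of the two streams with fixed weights
    (placeholder values; the paper's weights are not given here)."""
    return w_rgb * softmax(rgb_logits) + w_flow * softmax(flow_logits)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))        # C=8 channels, 4x4 spatial map
w1, w2 = rng.standard_normal((2, 8)), rng.standard_normal((8, 2))
print(channel_attention(feat, w1, w2).shape)  # (8, 4, 4)

frames = rng.standard_normal((16, 32))       # T=16 frames, D=32 features
wt = rng.standard_normal(32)
print(temporal_attention(frames, wt).shape)  # (32,)

probs = fuse_streams(rng.standard_normal(5), rng.standard_normal(5))
print(np.argmax(probs))                      # predicted class index
```

Because each stream's softmax output sums to 1, the fused scores sum to w_rgb + w_flow, so choosing weights that sum to 1 keeps the fused output a valid probability distribution.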