T-STAM: end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism
Citation: Shi Xiangbin, Li Yiying, Liu Fang, Dai Qin. T-STAM: end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism[J]. Application Research of Computers, 2021, 38(4): 1235-1239, 1276.
Authors: Shi Xiangbin, Li Yiying, Liu Fang, Dai Qin
Affiliations: College of Information, Liaoning University, Shenyang 110036, China; College of Computer Science, Shenyang Aerospace University, Shenyang 110136, China; College of Information, Shenyang Institute of Engineering, Shenyang 110136, China
Funding: National Natural Science Foundation of China
Abstract: Two-stream methods for video action recognition tend to ignore the interdependencies between feature channels, and their features contain a large amount of redundant spatio-temporal information. To address these problems, this paper proposes T-STAM, an end-to-end action recognition model based on a two-stream network with a spatio-temporal attention mechanism, which makes full use of the key spatio-temporal information in a video. First, it introduces a channel attention mechanism into the two-stream base networks, calibrating channel information by modeling the dependencies between feature channels to improve the expressiveness of the features. Second, it proposes a CNN-based temporal attention model that learns an attention score for each frame with few parameters, focusing on frames with pronounced motion. It also proposes a multi-spatial attention model that computes attention scores for every position in a frame from different perspectives, extracting multiple motion-salient regions, and fuses the temporal and spatial features to further strengthen the video's feature representation. Finally, it feeds the fused features into a classification network and combines the outputs of the two streams with different weights to obtain the recognition result. Experimental results on the HMDB51 and UCF101 datasets show that T-STAM can effectively recognize actions in video.

Keywords: action recognition; two-stream; channel information; spatio-temporal attention; motion saliency areas
Received: 2020-02-24
Revised: 2021-03-10

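The abstract names three building blocks: channel attention that recalibrates feature channels, temporal attention that scores frames, and weighted late fusion of the two streams. A minimal NumPy sketch of these ideas follows; the layer sizes, the linear frame-scoring stand-in for the paper's CNN-based temporal model, and the 0.6/0.4 fusion weights are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(feat, w1, w2):
    """SE-style channel recalibration (a common form of the channel
    attention the abstract describes): squeeze by global average pooling,
    excite with a two-layer bottleneck, then rescale each channel.
    feat: (C, H, W); w1: (C//r, C); w2: (C, C//r)."""
    z = feat.mean(axis=(1, 2))                            # squeeze: (C,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ np.maximum(w1 @ z, 0))))  # gate: (C,)
    return feat * s[:, None, None]                        # recalibrated map

def temporal_attention(frame_feats, w):
    """Score each frame and pool frames by attention weight.
    A linear scorer stands in for the paper's small CNN.
    frame_feats: (T, D); w: (D,)."""
    scores = softmax(frame_feats @ w)                     # (T,) over frames
    return scores @ frame_feats                           # weighted sum: (D,)

def fuse_streams(rgb_logits, flow_logits, w_rgb=0.6, w_flow=0.4):
    """Late fusion of the two streams with fixed weights
    (placeholder values; the paper's weights are not given here)."""
    return w_rgb * softmax(rgb_logits) + w_flow * softmax(flow_logits)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))        # C=8 channels, 4x4 spatial map
w1, w2 = rng.standard_normal((2, 8)), rng.standard_normal((8, 2))
print(channel_attention(feat, w1, w2).shape)  # (8, 4, 4)

frames = rng.standard_normal((16, 32))       # T=16 frames, D=32 features
wt = rng.standard_normal(32)
print(temporal_attention(frames, wt).shape)  # (32,)

probs = fuse_streams(rng.standard_normal(5), rng.standard_normal(5))
print(np.argmax(probs))                      # predicted class index
```

Because each stream's softmax output sums to 1, the fused scores sum to w_rgb + w_flow, so choosing weights that sum to 1 keeps the fused output a valid probability distribution.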