Action recognition method based on video spatio-temporal features

Cite this article: Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017
Authors: Ranyan NI, Yi ZHANG
Affiliation: College of Computer Science, Sichuan University, Chengdu 610065, China
Funding: National Natural Science Foundation of China (U20A20161)

Abstract: Two-stream networks cannot be trained end to end because they must compute optical flow maps in advance to extract motion information, while three-dimensional convolutional networks carry a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. The method extracts the spatio-temporal information in videos efficiently, without any optical flow computation or three-dimensional convolution. Firstly, a motion information extraction module based on an attention mechanism was used to capture the motion displacement between two adjacent frames, thereby simulating the role of optical flow maps in two-stream networks. Secondly, a decoupled spatio-temporal information extraction module was proposed to replace three-dimensional convolution and encode the spatio-temporal information. Finally, the two modules were embedded into a two-dimensional residual network to perform end-to-end action recognition. Experiments on several mainstream action recognition datasets show that, with only RGB video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on the UCF101, HMDB51 and Something-Something-V1 datasets respectively, and improves the accuracy on UCF101 by 2.5 percentage points over the two-stream Temporal Segment Network (TSN) method. These results indicate that the proposed method extracts the spatio-temporal features in videos efficiently.

Keywords: Convolutional Neural Network (CNN); action recognition; spatio-temporal information; temporal reasoning; motion information
Received: 2022-01-07
Revised: 2022-03-18

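To make the first step concrete, the sketch below shows one plausible form of an attention-based motion extraction module: feature-level differences between adjacent frames are re-weighted by channel attention so that motion-salient channels stand in for an optical flow map. This is a minimal PyTorch-style sketch under stated assumptions, not the authors' implementation; the class name, the squeeze-and-excitation-style attention, and the reduction parameter are all illustrative.

    import torch
    import torch.nn as nn

    class MotionExtraction(nn.Module):
        """Stand-in for optical flow: channel attention over feature
        differences between adjacent frames (all names illustrative)."""

        def __init__(self, channels, reduction=16):
            super().__init__()
            self.attention = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                       # squeeze
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                  # excite
            )

        def forward(self, x):
            # x: (N, T, C, H, W) -- features of T frames per clip
            n, t, c, h, w = x.shape
            # Displacement between adjacent frames; pad the last step
            # with zeros so the temporal length stays T.
            diff = torch.cat([x[:, 1:] - x[:, :-1],
                              torch.zeros_like(x[:, :1])], dim=1)
            diff = diff.reshape(n * t, c, h, w)
            motion = diff * self.attention(diff)  # attention-weighted motion
            # Residual connection preserves the appearance stream.
            return x + motion.reshape(n, t, c, h, w)
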
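The second step replaces three-dimensional convolution with decoupled spatial and temporal operations. Below is a minimal sketch of one such factorization, assuming a per-frame 2D convolution followed by a depthwise 1D convolution along the frame axis; the module name, kernel sizes, and the depthwise choice are assumptions for illustration, not the decomposition published in the paper.

    import torch
    import torch.nn as nn

    class DecoupledSTModule(nn.Module):
        """Encodes spatio-temporal information without 3D convolution:
        a 2D spatial convolution on each frame, then a depthwise 1D
        convolution across frames (details illustrative)."""

        def __init__(self, channels):
            super().__init__()
            # Spatial branch: ordinary 2D convolution on each frame.
            self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
            # Temporal branch: depthwise 1D convolution over the frame axis.
            self.temporal = nn.Conv1d(channels, channels, 3, padding=1,
                                      groups=channels)

        def forward(self, x):
            # x: (N, T, C, H, W)
            n, t, c, h, w = x.shape
            s = self.spatial(x.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
            # Fold the spatial dims into the batch so Conv1d runs over time.
            tmp = s.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
            out = self.temporal(tmp).reshape(n, h, w, c, t)
            return out.permute(0, 4, 3, 1, 2)  # back to (N, T, C, H, W)

In the full method, such blocks, together with the motion module above, would sit inside the residual branches of a 2D ResNet, so the whole network can be trained end to end on RGB frames only.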