Action recognition method based on video spatio-temporal features

Cite this article: Ranyan NI, Yi ZHANG. Action recognition method based on video spatio-temporal features[J]. Journal of Computer Applications, 2023, 43(2): 521-528. DOI: 10.11772/j.issn.1001-9081.2022010017
Authors: Ranyan NI, Yi ZHANG
Affiliation: College of Computer Science, Sichuan University, Chengdu 610065, China
Funding: National Natural Science Foundation of China (U20A20161)

Abstract: Two-stream networks cannot be trained end to end because they must compute optical flow maps in advance to extract motion information, while three-dimensional convolutional networks carry a huge number of parameters. To address these problems, an action recognition method based on video spatio-temporal features was proposed. The method extracts the spatio-temporal information in videos efficiently, without any optical flow computation or three-dimensional convolution. Firstly, a motion information extraction module based on an attention mechanism was used to capture the motion displacement between two adjacent frames, thereby simulating the role of optical flow maps in two-stream networks. Secondly, a decoupled spatio-temporal information extraction module was proposed to replace three-dimensional convolution and encode the spatio-temporal information. Finally, the two modules were embedded into a two-dimensional residual network to perform end-to-end action recognition. Experiments on several mainstream action recognition datasets show that, with only RGB video frames as input, the proposed method achieves recognition accuracies of 96.5%, 73.1% and 46.6% on the UCF101, HMDB51 and Something-Something-V1 datasets respectively, and improves the accuracy on UCF101 by 2.5 percentage points over the two-stream Temporal Segment Network (TSN) method. These results indicate that the proposed method extracts the spatio-temporal features in videos efficiently.

Keywords: Convolutional Neural Network (CNN); action recognition; spatio-temporal information; temporal reasoning; motion information
Received: 2022-01-07
Revised: 2022-03-18

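To make the first step concrete, the sketch below shows one plausible form of an attention-based motion extraction module: feature-level differences between adjacent frames are re-weighted by channel attention so that motion-salient channels stand in for an optical flow map. This is a minimal PyTorch-style sketch under stated assumptions, not the authors' implementation; the class name, the squeeze-and-excitation-style attention, and the reduction parameter are all illustrative.

    import torch
    import torch.nn as nn

    class MotionExtraction(nn.Module):
        """Stand-in for optical flow: channel attention over feature
        differences between adjacent frames (all names illustrative)."""

        def __init__(self, channels, reduction=16):
            super().__init__()
            self.attention = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                       # squeeze
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),                                  # excite
            )

        def forward(self, x):
            # x: (N, T, C, H, W) -- features of T frames per clip
            n, t, c, h, w = x.shape
            # Displacement between adjacent frames; pad the last step
            # with zeros so the temporal length stays T.
            diff = torch.cat([x[:, 1:] - x[:, :-1],
                              torch.zeros_like(x[:, :1])], dim=1)
            diff = diff.reshape(n * t, c, h, w)
            motion = diff * self.attention(diff)  # attention-weighted motion
            # Residual connection preserves the appearance stream.
            return x + motion.reshape(n, t, c, h, w)
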
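The second step replaces three-dimensional convolution with decoupled spatial and temporal operations. Below is a minimal sketch of one such factorization, assuming a per-frame 2D convolution followed by a depthwise 1D convolution along the frame axis; the module name, kernel sizes, and the depthwise choice are assumptions for illustration, not the decomposition published in the paper.

    import torch
    import torch.nn as nn

    class DecoupledSTModule(nn.Module):
        """Encodes spatio-temporal information without 3D convolution:
        a 2D spatial convolution on each frame, then a depthwise 1D
        convolution across frames (details illustrative)."""

        def __init__(self, channels):
            super().__init__()
            # Spatial branch: ordinary 2D convolution on each frame.
            self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
            # Temporal branch: depthwise 1D convolution over the frame axis.
            self.temporal = nn.Conv1d(channels, channels, 3, padding=1,
                                      groups=channels)

        def forward(self, x):
            # x: (N, T, C, H, W)
            n, t, c, h, w = x.shape
            s = self.spatial(x.reshape(n * t, c, h, w)).reshape(n, t, c, h, w)
            # Fold the spatial dims into the batch so Conv1d runs over time.
            tmp = s.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
            out = self.temporal(tmp).reshape(n, h, w, c, t)
            return out.permute(0, 4, 3, 1, 2)  # back to (N, T, C, H, W)

In the full method, such blocks, together with the motion module above, would sit inside the residual branches of a 2D ResNet, so the whole network can be trained end to end on RGB frames only.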