Long Video Action Recognition Method Based on Multimodal Feature Fusion
Citation: WANG Ting, LIU Guanghui, ZHANG Yumin, MENG Yuebo, XU Shengjun. Long Video Action Recognition Method Based on Multimodal Feature Fusion[J]. Computer Measurement & Control, 2021, 29(11): 165-170.
Authors: WANG Ting  LIU Guanghui  ZHANG Yumin  MENG Yuebo  XU Shengjun
Affiliation: School of Information and Control Engineering, Xi'an University of Architecture and Technology, Xi'an 710055, China
Funding: General Program of the Natural Science Foundation of Shaanxi Province (2020JM-473, 2020JM-472); Basic Research Fund of Xi'an University of Architecture and Technology (JC1703); Natural Science Foundation of Xi'an University of Architecture and Technology (ZR19046)
Abstract: Action recognition has important application value in video retrieval. To address the problems of convolutional neural network based action recognition methods, namely the limited ability to recognize long-duration actions, the difficulty of extracting multi-scale features, and interference from illumination changes and complex backgrounds, a long-video action recognition method based on multimodal feature fusion is proposed. First, since adjacent frames of a long action sequence differ little, sampling every frame produces redundant video frames; a uniform sparse sampling strategy is therefore used to model the temporal structure of the whole video, fully retaining long-range temporal information while reducing frame redundancy. Second, multi-column convolution is used to extract multi-scale spatio-temporal features, weakening the interference that viewpoint changes introduce into video images. Optical flow data are then introduced, and deep features of the optical flow are extracted by a feature extraction network guided by a spatial attention mechanism, so that the complementary strengths of the different data modalities improve the accuracy and robustness of the network across scenes. Finally, the multi-scale spatio-temporal features and the optical flow features are fused in the fully connected layer of the network, realizing end-to-end long-video action recognition. Experimental results show that the proposed method achieves average accuracies of 97.2% on UCF101 and 72.8% on HMDB51, outperforming the compared methods and demonstrating its effectiveness.

Keywords: deep learning; action recognition; feature extraction; multimodal feature fusion
Received: 2021-04-08
Revised: 2021-05-11
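The uniform sparse sampling strategy described in the abstract is in the spirit of TSN-style segment sampling: divide the full video into equal-length segments and draw one frame per segment, so the whole duration is covered without near-duplicate neighbouring frames. Below is a minimal sketch under that assumption; the paper's exact sampling rule is not given in the abstract, and `sparse_sample_indices` and its parameters are illustrative only.

```python
import numpy as np

def sparse_sample_indices(num_frames, num_segments=8, rng=None):
    """Split the video into `num_segments` equal segments and draw one
    frame index per segment, covering the whole duration while skipping
    redundant neighbouring frames."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1)
    lo = np.floor(edges[:-1]).astype(int)                     # segment starts
    hi = np.maximum(np.ceil(edges[1:]).astype(int) - 1, lo)   # segment ends
    # Random frame per segment for training; (lo + hi) // 2 gives a
    # deterministic centre-frame variant for evaluation.
    return np.array([rng.integers(l, h + 1) for l, h in zip(lo, hi)])

# Example: sample 8 frames from a 300-frame video.
print(sparse_sample_indices(300, 8))
```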

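The multi-column convolution for multi-scale spatio-temporal features can plausibly be read as parallel convolution branches with different receptive fields whose outputs are concatenated. A hedged PyTorch sketch follows; the kernel sizes (3/5/7) and channel widths are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiColumnConv(nn.Module):
    """Three parallel columns with different receptive fields; their
    outputs are concatenated into one multi-scale feature map."""
    def __init__(self, in_ch, col_ch=32):
        super().__init__()
        self.columns = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, col_ch, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(col_ch),
                nn.ReLU(inplace=True),
            )
            for k in (3, 5, 7)  # small / medium / large receptive fields
        ])

    def forward(self, x):
        return torch.cat([col(x) for col in self.columns], dim=1)

x = torch.randn(2, 3, 224, 224)        # a batch of sampled RGB frames
print(MultiColumnConv(3)(x).shape)     # torch.Size([2, 96, 224, 224])
```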
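For the spatial-attention-guided optical-flow branch, one common realization is CBAM-style spatial attention: pool the feature map across channels, derive a per-pixel gate, and re-weight the flow features with it. The abstract does not specify the attention design, so this sketch is only an assumed variant.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention: channel-wise average and max maps
    are fused by a convolution into a per-pixel gate in [0, 1]."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat):
        avg_map = feat.mean(dim=1, keepdim=True)
        max_map = feat.max(dim=1, keepdim=True).values
        gate = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return feat * gate  # spatially re-weight the optical-flow features

flow_feat = torch.randn(2, 64, 56, 56)   # features from stacked flow fields
print(SpatialAttention()(flow_feat).shape)
```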
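Finally, fusing the two modalities at the fully connected layer amounts to late fusion: concatenate the pooled RGB (multi-scale) descriptor with the flow descriptor and classify. The feature dimensions and hidden width below are assumptions; 101 output classes matches UCF101.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Concatenate pooled RGB and optical-flow descriptors and classify
    with fully connected layers (late fusion at the FC stage)."""
    def __init__(self, rgb_dim, flow_dim, num_classes=101):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(rgb_dim + flow_dim, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, rgb_feat, flow_feat):
        return self.classifier(torch.cat([rgb_feat, flow_feat], dim=1))

head = LateFusionHead(rgb_dim=2048, flow_dim=1024)   # dims are assumptions
logits = head(torch.randn(4, 2048), torch.randn(4, 1024))
print(logits.shape)  # torch.Size([4, 101])
```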