Funding: National Natural Science Foundation of China (62172349, 62172350); General Project of Degree and Postgraduate Education Reform of Hunan Province (2021JGYB085)

Received: 2022-03-14
Revised: 2022-07-07

Collaborative Convolutional Transformer Network Based on Skeleton Action Recognition
SHI Yuexiang, ZHU Maoqing. Collaborative Convolutional Transformer Network Based on Skeleton Action Recognition[J]. Journal of Electronics & Information Technology, 2023, 45(4): 1485-1493. doi: 10.11999/JEIT220270
Authors:SHI Yuexiang  ZHU Maoqing
Affiliation:School of Computer Science and Cyberspace Security, Xiangtan University, Xiangtan 411105, China
Abstract: In recent years, skeleton-based human action recognition has attracted widespread attention because of the robustness and generalization ability of skeleton data. Among existing approaches, graph convolutional networks that model the human skeleton as a spatiotemporal graph have achieved remarkable performance. However, graph convolutions learn long-term interactions mainly through a series of 3D convolutions; such interactions are local in nature and limited by the convolution kernel size, so long-range dependencies cannot be captured effectively. In this paper, a Collaborative Convolutional Transformer (Co-ConvT) network is proposed, which establishes long-range dependencies by introducing the self-attention mechanism of the Transformer and combines it with Graph Convolutional Networks (GCNs) for action recognition, enabling the model both to extract local information through graph convolution and to capture rich long-range dependencies through the Transformer. In addition, because the Transformer's self-attention is computed at the pixel level, it incurs a large computational cost; the proposed model therefore divides the network into two stages: the first stage uses pure convolution to extract shallow spatial features, and the second stage uses the proposed ConvT block to capture high-level semantic information, reducing the computational complexity. Moreover, the linear embeddings of the original Transformer are replaced with convolutional embeddings, which enhance local spatial information and allow the positional encoding of the original model to be removed, making the model lighter. Validated on two large-scale authoritative datasets, NTU-RGB+D and Kinetics-Skeleton, the model achieves Top-1 accuracies of 88.1% and 36.6%, respectively. The experimental results demonstrate that the performance of the model is greatly improved.
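The long-range dependency modeling that the abstract attributes to self-attention can be illustrated with a minimal sketch. The snippet below is not the paper's ConvT block; it is a generic single-head scaled dot-product self-attention over a set of tokens (e.g., one token per skeleton joint), written with NumPy, where the function name and projection matrices `wq`, `wk`, `wv` are illustrative assumptions:

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over a token sequence.

    x           : (n_tokens, d) token features, e.g. one token per joint
    wq, wk, wv  : (d, d) query/key/value projection matrices

    Every token attends to every other token, so dependencies between
    distant joints are modeled in a single step, unlike a convolution,
    whose receptive field is bounded by the kernel size.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])       # (n, n) pairwise scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # row-wise softmax
    return attn @ v                              # weighted mix of values
```

In the paper's two-stage design, a block like this would operate on feature maps produced by the convolutional stage, with convolutional embeddings supplying the local positional cues that make an explicit positional encoding unnecessary.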
Keywords:Action recognition  Graph Convolutional Neural Networks (GCNs)  Self-attention mechanism  Transformer