Spatial temporal convolutional Transformer network for skeleton-based action recognition
Cite this article: Liu Binbin, Zhao Hongtao, Wang Tian, Yang Yi. Spatial temporal convolutional Transformer network for skeleton-based action recognition[J]. Electronic Measurement Technology, 2024, 47(1): 169-177
Authors: Liu Binbin, Zhao Hongtao, Wang Tian, Yang Yi
Affiliations: 1. Zhengzhou Hengda Intelligent Control Technology Co., Ltd.; 2. School of Electrical Engineering and Automation, Henan Polytechnic University; 3. Research Institute for Artificial Intelligence, Beihang University
Funding: National Natural Science Foundation of China (61972016)
Abstract: Graph-convolution-based skeleton action recognition methods rely heavily on hand-designed graph topology when modelling joint features and cannot capture global dependencies between joints. To address this, a spatial-temporal convolutional Transformer is designed to model spatial and temporal joint features. For spatial joint feature modelling, a dynamic grouping decoupled Transformer is proposed: the input skeleton sequence is split into groups along the channel dimension, and a different attention matrix is generated dynamically for each group, allowing global spatial dependencies between joints to be modelled without prior knowledge of the human body topology. For temporal joint feature modelling, multi-scale temporal convolution extracts action features at different temporal scales. Finally, a spatial-temporal-channel joint attention module is proposed to further refine the extracted spatial-temporal features. The method achieves Top-1 recognition accuracies of 92.5% and 89.3% under the cross-subject evaluation protocol on the NTU-RGB+D and NTU-RGB+D 120 datasets, respectively, demonstrating its effectiveness.

Keywords: action recognition; human skeleton; self-attention mechanism; Transformer

Spatial temporal convolutional Transformer network for skeleton-based action recognition
Liu Binbin, Zhao Hongtao, Wang Tian, Yang Yi. Spatial temporal convolutional Transformer network for skeleton-based action recognition[J]. Electronic Measurement Technology, 2024, 47(1): 169-177
Authors:Liu Binbin  Zhao Hongtao  Wang Tian  Yang Yi
Affiliation: Zhengzhou Hengda Intelligent Control Technology Company Limited, Zhengzhou 450000, China; School of Electrical Engineering and Automation, Henan Polytechnic University, Jiaozuo 454003, China; Research Institute for Artificial Intelligence, Beihang University, Beijing 100191, China
Abstract: Skeleton-based action recognition methods built on graph convolution rely heavily on hand-designed graph topology when modelling joint features and lack the ability to model global joint dependencies. To address this issue, we proposed a spatio-temporal convolutional Transformer network to model spatial and temporal joint features. In the spatial joint feature modeling, we proposed a dynamic grouping decoupling Transformer that grouped the input skeleton sequence in the channel dimension and dynamically generated different attention matrices for each group, establishing global dependencies between joints without requiring knowledge of the human topology. In the temporal joint feature modeling, multi-scale temporal convolution was used to extract features of target behaviors at different scales. Finally, we proposed a spatio-temporal channel joint attention module to further refine the extracted spatio-temporal features. The proposed method achieved Top-1 recognition accuracy rates of 92.5% and 89.3% on the cross-subject evaluation criteria for the NTU-RGB+D and NTU-RGB+D 120 datasets, respectively, demonstrating its effectiveness.
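The core idea of the dynamic grouping decoupling Transformer described above — splitting channels into groups and giving each group its own attention matrix over the joints, with no predefined skeleton graph — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the authors' implementation: the function name, the random projection weights, and the tensor shapes (frames T, joints V, channels C) are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_joint_attention(x, num_groups, rng):
    """Channel-grouped self-attention over skeleton joints (illustrative).

    x: (T, V, C) array -- frames, joints, channels.
    The channels are split into `num_groups` groups; each group gets its
    own query/key projections and therefore its own V x V attention matrix
    per frame. No skeleton graph topology is supplied: every joint can
    attend to every other joint, so dependencies are modelled globally.
    """
    T, V, C = x.shape
    d = C // num_groups
    out = np.empty_like(x)
    attn_mats = []
    for g in range(num_groups):
        xg = x[:, :, g * d:(g + 1) * d]                    # (T, V, d)
        # Stand-ins for learned per-group parameters.
        Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        q, k = xg @ Wq, xg @ Wk
        # Distinct attention matrix for this channel group: (T, V, V).
        attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
        out[:, :, g * d:(g + 1) * d] = attn @ xg
        attn_mats.append(attn)
    return out, attn_mats

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 25, 16))  # 4 frames, 25 joints (NTU skeleton), 16 channels
y, attn = grouped_joint_attention(x, num_groups=4, rng=rng)
```

Each of the four groups here yields its own set of per-frame attention matrices, which is what "decoupling" refers to: the groups do not share a single joint-to-joint weighting. In the paper this spatial module is followed by multi-scale temporal convolution and a spatial-temporal-channel attention refinement, which are omitted from this sketch.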
Keywords:action recognition;human skeleton;self-attention mechanism;Transformer
