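A Transformer-based 3D human pose estimation method
Cite this article: WANG Yu-ping, ZENG Yi, LI Sheng-hui, ZHANG Lei. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145. DOI: 10.11996/JG.j.2095-302X.2023010139
Authors: WANG Yu-ping  ZENG Yi  LI Sheng-hui  ZHANG Lei
Affiliations: 1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China; 2. College of Big Data, Henan Electromechanical Vocational College, Zhengzhou, Henan 450064, China; 3. School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
Funding: Science and Technology Research Project of the Henan Provincial Department of Science and Technology (222102210174)
Abstract: 3D human pose estimation is the foundation of human behavior understanding, yet predicting plausible 3D human pose sequences remains challenging. To address this, a Transformer-based 3D human pose estimation method is proposed that uses multi-layer long short-term memory (LSTM) units and a multi-scale Transformer structure to improve the accuracy of human pose sequence prediction. First, a time-series-based generator is designed, in which a pre-trained ResNet extracts image features. Second, multi-layer LSTM units learn the relationships between human poses across a temporally continuous image sequence and output a plausible sequence of SMPL parametric human body models. Finally, a discriminator based on a multi-scale Transformer is constructed; the multi-scale structure learns detailed features at several segmentation granularities, and in particular the Transformer block encodes relative position to strengthen local feature learning. Experimental results show that the method achieves better prediction accuracy than VIBE: its mean per-joint position error (MPJPE) is 7.5% lower than that of VIBE on the 3DPW dataset and 1.8% lower on the MPI-INF-3DHP dataset.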


A Transformer-based 3D human pose estimation method
WANG Yu-ping, ZENG Yi, LI Sheng-hui, ZHANG Lei. A Transformer-based 3D human pose estimation method[J]. Journal of Graphics, 2023, 44(1): 139-145. DOI: 10.11996/JG.j.2095-302X.2023010139
Authors: WANG Yu-ping  ZENG Yi  LI Sheng-hui  ZHANG Lei
Affiliation: 1. School of Information Engineering, Zhengzhou University of Science and Technology, Zhengzhou, Henan 450064, China; 2. College of Big Data, Henan Electromechanical Vocational College, Zhengzhou, Henan 450064, China; 3. School of Information Engineering, Zhengzhou University, Zhengzhou, Henan 450001, China
Abstract: 3D human pose estimation is the foundation of human behavior understanding, but predicting reasonable 3D human pose sequences remains a challenging problem. To solve this problem, a Transformer-based 3D human pose estimation method is proposed, which uses multi-layer long short-term memory (LSTM) units and a multi-scale Transformer structure to improve the accuracy of human pose sequence prediction. First, a generator based on time series is designed, in which a pre-trained ResNet extracts image features. Second, multi-layer LSTM units learn the relationships between human poses in temporally continuous image sequences and output a reasonable sequence of skinned multi-person linear (SMPL) parametric human body models. Finally, a multi-scale Transformer-based discriminator is constructed; the multi-scale Transformer structure learns detailed features at multiple segmentation granularities, and in particular the Transformer block encodes relative position to enhance local feature learning. Experimental results show that the proposed method yields better prediction accuracy than the VIBE method: its mean per-joint position error (MPJPE) is 7.5% lower than that of VIBE on the 3DPW dataset and 1.8% lower than VIBE's MPJPE on the MPI-INF-3DHP dataset.
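
For readers unfamiliar with the metric, the following is a minimal sketch of how MPJPE and a relative reduction against a baseline could be computed. It is not the authors' evaluation code. The function names `mpjpe` and `relative_reduction`, the toy coordinates, and the baseline value are all illustrative assumptions. The sketch also assumes that predicted and ground-truth joints are already in the same coordinate frame, whereas published protocols typically align the root joint first.

```python
import math
from typing import Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, z); units are whatever the data uses, e.g. mm


def mpjpe(pred: Sequence[Sequence[Joint]], gt: Sequence[Sequence[Joint]]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints,
    averaged over every joint of every frame (no root alignment applied)."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of frames")
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        if len(pred_frame) != len(gt_frame):
            raise ValueError("each frame must contain the same number of joints")
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(tuple(p), tuple(g))
            count += 1
    if count == 0:
        raise ValueError("no joints to evaluate")
    return total / count


def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy data (hypothetical, not from the paper): one frame, two joints.
toy_gt = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
toy_pred = [[(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]]
print(mpjpe(toy_pred, toy_gt))  # 2.5, since the joint errors are 5.0 and 0.0

# Hypothetical baseline error of 100.0 versus 92.5 corresponds to a 7.5% reduction.
print(relative_reduction(100.0, 92.5))
```

Under this reading, "7.5% lower" is a relative figure: the difference between the two errors divided by the baseline error, not an absolute difference in millimetres.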