Skip to main content
Log in

U-shaped spatial–temporal transformer network for 3D human pose estimation

  • Original Paper
  • Published:
Machine Vision and Applications Aims and scope Submit manuscript

Abstract

3D human pose estimation has achieved much progress with the development of convolution neural networks. There still have some challenges to accurately estimate 3D joint locations from single-view images or videos due to depth ambiguity and severe occlusion. Motivated by the effectiveness of introducing vision transformer into computer vision tasks, we present a novel U-shaped spatial–temporal transformer-based network (U-STN) for 3D human pose estimation. The core idea of the proposed method is to process the human joints by designing a multi-scale and multi-level U-shaped transformer model. We construct a multi-scale architecture with three different scales based on the human skeletal topology, in which the local and global features are processed through three different scales with kinematic constraints. Furthermore, a multi-level feature representations is introduced by fusing intermediate features from different depths of the U-shaped network. With a skeletal constrained pooling and unpooling operations devised for U-STN, the network can transform features across different scales and extract meaningful semantic features at all levels. Experiments on two challenging benchmark datasets show that the proposed method achieves a good performance on 2D-to-3D pose estimation. The code is available at https://github.com/l-fay/Pose3D.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others

References

  1. Zheng, C., Zhu, S., Mendieta, M., et al: 3D human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11656–11665 (2021)

  2. Malik, Z., Shapiai, M.: Human action interpretation using convolutional neural network: a survey. Mach. Vis. Appl. 33(3), 1–23 (2022)

    Article  Google Scholar 

  3. Moon, G., Lee, K.M.: I2l-meshnet: Image to-lixel prediction network for accurate 3d human pose and mesh estimation from a single rgb image. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 752–768 (2020)

  4. Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7307–7316 (2018)

  5. Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., Luo, J.: Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Trans. Circuits Syst. Video Technol. 32(1), 198–209 (2022). https://doi.org/10.1109/TCSVT.2021.3057267

    Article  Google Scholar 

  6. Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. In: Proceedings of the European Conference on Computer Vision 2020 (ECCV), pp. 764–780. Springer, (2020)

  7. Wang, R., Tong, J., Wang, X.: Enhancing feature fusion for human pose estimation. Mach. Vis. Appl. 31, 60 (2020). https://doi.org/10.1007/s00138-020-01104-2

    Article  Google Scholar 

  8. Cai, Y., Ge, L., Liu, J., et al.: exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2272–2281 (2019)

  9. Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 6869–8486, Springer, (2018)

  10. Pavllo, D., Feichtenhofer, C., Grangier, D., et al.: 3d human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7745–7754 (2019)

  11. Huang, Z., Shen, X., Tian, X., et al.: Spatio-temporal inception graph convolutional networks for skeleton-based action recognition. In: ACM Deep Learning of Multimedia, Seattle, WA, USA, pp. 2122–2130 (2020). https://doi.org/10.1145/3394171.3413666

  12. Li, S., Chan, A.: 3D human pose estimation from monocular images with deep convolutional neural network. In: Asian Conference on Computer Vision, pp. 332–347 (2014)

  13. Park, S., Hwang, J., Kwak, N.: 3D human pose estimation using convolutional neural networks with 2d pose information. In: European Conference on Computer Vision (ECCV), pp. 156–169, Springer, (2016)

  14. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7025–7034(2017)

  15. Zeng, A., Sun, X., Huang, F., et al.: SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 507–523 (2020)

  16. Martinez, J., Hossain, R., Romero, J., Little, J.J: A simple yet effective baseline for 3d human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2659–2668 (2017) https://doi.org/10.1109/ICCV.2017.288.

  17. Xu, T., Takano, W.: Graph stacked hourglass networks for 3d human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 16105–16114 (2021)

  18. Liu, J., Guang, Y., Rojas, J.: A graph attention spatio-temporal convolutional network for 3D human pose estimation in video. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 3374–3380 (2021)

  19. Li, W., Liu, H., Tang, H., et al.: MHFormer: multi-hypothesis transformer for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13147–13156 (2022)

  20. Li, W., Liu, H., Ding, R., et al.: Exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Trans. Multimed. (2022). https://doi.org/10.1109/TMM.2022.3141231

    Article  Google Scholar 

  21. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1954–1963, (2021) https://doi.org/10.1109/CVPR46437.2021.00199

  22. Lin, T., Dollar, P., Girshick, R., He, K., Hariharan, H., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944 (2017) https://doi.org/10.1109/CVPR.2017.106

  23. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Proceedings of the European Conference on Computer Vision 2020 (ECCV), pp. 483–499 (2020)

  24. Sun, K., Xiao, B., Liu, D., et al.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5686–5696 (2019)

  25. Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., Ling, H.: M2det: a single-shot object detector based on multi-level feature pyramid network. In: The Thirty-Third AAAI Conference on Artificial Intellilgence (AAAI), pp. 9259–9266, (2019) https://doi.org/10.1609/aaai.v33i01.33019259

  26. Hua, G., Li, W., Zhang, Q., et al.: Weakly-supervised 3D human pose estimation with cross-view U-shaped graph convolutional network. In: IEEE Transactions on Multimedia, arXiv preprint http://arxiv.org/abs/2105.10882, (2022) https://doi.org/10.48550/arXiv.2105.10882

  27. Dosovitskiy, A., Beyer, L., Kolesnikov., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint http://arxiv.org/abs/2010.11929 (2021) https://doi.org/10.48550/arXiv.2010.11929

  28. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. IEEE Trans. Patt. Anal. Mach. Intell. 42(8), 2011–2023 (2020). https://doi.org/10.1109/TPAMI.2019.2913372

    Article  Google Scholar 

  29. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Patt. Anal. Mach. Intell. 36(7), 1325–1339 (2014)

    Article  Google Scholar 

  30. Sigal, L., Balan, A.O., Black, M.J.: Humaneva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 87(1), 4–27 (2010)

    Article  Google Scholar 

  31. Zheng, C., Wu, W., Yang, T., Zhu, S., Chen, C., Liu, R., Shen, J., Kehtarnavaz, N., Shah, M.: Deep learning-based human pose estimation: a http://arxiv.org/abs/2012.13392v4, https://doi.org/10.48550/arXiv.2012.13392

  32. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR), pp. 1–15 (2015), https://doi.org/10.48550/arXiv.1412.6980.

  33. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: 1Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)

    MathSciNet  MATH  Google Scholar 

  34. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Proceedings of the European conference on computer vision (ECCV), pp. 646–661 (2016)

  35. Fang, H., Xu, Y., Wang, W., Liu, X., Zhu, S.: Learning pose grammar to encode human body configuration for 3D pose estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, pp. 6821–6828 (2018)

  36. Zou, Z., Tang, W.: Modulated graph convolutional network for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 11477–11487 (2021)

  37. Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N..: Semantic graph convolutional networks for 3D human pose regression. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp: 3425–3435 (2019)

  38. Yeh, R.A., Hu, Y., Schwing, A.G.: Chirality nets for human pose regression. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (NIPS), pp. 8163–8173 (2019) https://doi.org/10.48550/arXiv.1911.00029

  39. Lin, J., Lee, G.H.: Trajectory space factorization for deep video-based 3d human pose estimation. In: Proceedings of the British Machine Vision Conference (BMVC), pp. 1–13(2019) https://doi.org/10.48550/arXiv.1908.08289

  40. Gong, K., Zhang, J., Poseaug, J.F.: A differentiable pose augmentation framework for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8575–8584(2021) https://doi.org/10.48550/arXiv.2105.02465

  41. Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., Zhang, W.: Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 896–905, (2020) https://doi.org/10.1109/CVPR42600.2020.00098

  42. Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S., Asari, V.: Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5063–5072 (2020) https://doi.org/10.1109/CVPR42600.2020.00511.

  43. Lee, K., Lee, I., Lee, S.: Propagating lstm: 3D pose estimation based on joint interdependency. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 123–141 (2018) https://doi.org/10.1007/978-3-030-01234-2_8

Download references

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No.61907028, No.11872036), the Young science and technology stars in Shaanxi Province (2021KJXX-91), the Young Talent fund of University Association for Science and Technology in Shaanxi (No. 20200105) and the Fundamental Research Funds for the Central Universities (No. GK202103114, No. GK2021011004, No. 2022TD-26).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Yumei Zhang or Xiaojun Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Yang, H., Guo, L., Zhang, Y. et al. U-shaped spatial–temporal transformer network for 3D human pose estimation. Machine Vision and Applications 33, 82 (2022). https://doi.org/10.1007/s00138-022-01334-6

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00138-022-01334-6

Keywords

Navigation