Video-based body geometric aware network for 3D human pose estimation

Optoelectronics Letters

Abstract

Three-dimensional human pose estimation (3D HPE) has broad application prospects in trajectory prediction, posture tracking and action analysis. However, frequent self-occlusions and the substantial depth ambiguity of two-dimensional (2D) representations limit further improvements in accuracy. In this paper, we propose a novel video-based human body geometric aware network to mitigate these problems. The network implicitly learns the geometric constraints of the human body by capturing spatial and temporal context information from 2D skeleton data. Specifically, a novel skeleton attention (SA) mechanism is proposed to model geometric context dependencies among different body joints, thereby improving the spatial feature representation ability of the network. To enhance temporal consistency, a novel multilayer perceptron (MLP)-Mixer based structure is exploited to comprehensively learn temporal context information from input sequences. We conduct experiments on publicly available, challenging datasets to evaluate the proposed approach. On the Human3.6m dataset, the proposed approach outperforms the previous best approach by 0.5 mm. It also demonstrates significant improvements on the HumanEva-I dataset.
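
To make the temporal part of the abstract more concrete, the following is a minimal sketch of a generic MLP-Mixer style block applied to a sequence of 2D skeletons. It is not the authors' network: the function names (`mixer_block`, `init_mixer_params`), the window of 9 frames, the 17 joints, the hidden sizes and the random weights are all illustrative assumptions, and the skeleton attention mechanism is not modelled here. The sketch only shows the general idea of mixing information along the frame axis (token mixing) and then along the feature axis (channel mixing).

```python
import numpy as np


def gelu(x):
    # Tanh approximation of the Gaussian error linear unit.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # Normalise over the last axis (no learned scale or shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def mlp(x, w1, b1, w2, b2):
    # Two-layer perceptron applied along the last axis of x.
    return gelu(x @ w1 + b1) @ w2 + b2


def init_mixer_params(rng, num_frames, channels, hidden_t, hidden_c):
    # Hypothetical initialiser: small random weights, zero biases.
    def dense(n_in, n_out):
        return rng.normal(0.0, 0.02, size=(n_in, n_out)), np.zeros(n_out)

    return {
        "token": dense(num_frames, hidden_t) + dense(hidden_t, num_frames),
        "channel": dense(channels, hidden_c) + dense(hidden_c, channels),
    }


def mixer_block(x, params):
    # x has shape (T, C): T frames, C features per frame.
    # Token mixing: transpose so the MLP acts along the frame (time) axis.
    x = x + mlp(layer_norm(x).T, *params["token"]).T
    # Channel mixing: the MLP acts along the feature axis of each frame.
    x = x + mlp(layer_norm(x), *params["channel"])
    return x


# Illustrative input: 9 frames of 17 joints with (x, y) coordinates each.
rng = np.random.default_rng(0)
num_frames, num_joints = 9, 17
poses_2d = rng.normal(size=(num_frames, num_joints * 2))
params = init_mixer_params(rng, num_frames, num_joints * 2, hidden_t=16, hidden_c=64)
out = mixer_block(poses_2d, params)
print(out.shape)  # (9, 34)
```

Because the token-mixing MLP sees all frames of the window at once, every output frame can depend on every input frame, which is one plausible way such a structure could encourage temporal consistency.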



Author information

Corresponding author

Correspondence to Sheng Liu.

Additional information

This work was supported by the National Key R&D Program of China (No. 2018YFB1305200).

Statements and Declarations

The authors declare that there are no conflicts of interest related to this article.

About this article

Cite this article

Li, C., Liu, S., Yao, L. et al. Video-based body geometric aware network for 3D human pose estimation. Optoelectron. Lett. 18, 313–320 (2022). https://doi.org/10.1007/s11801-022-2015-8

  • DOI: https://doi.org/10.1007/s11801-022-2015-8
