Video-based body geometric aware network for 3D human pose estimation

Li, Chaonan; Liu, Sheng; Yao, Lu; Zou, Siyu

doi:10.1007/s11801-022-2015-8

Published: 07 June 2022

Volume 18, pages 313–320, (2022)

Optoelectronics Letters

Abstract

Three-dimensional human pose estimation (3D HPE) has broad application prospects in trajectory prediction, posture tracking and action analysis. However, frequent self-occlusions and the substantial depth ambiguity of two-dimensional (2D) representations hinder further improvements in accuracy. In this paper, we propose a novel video-based human body geometric aware network to mitigate these problems. The network implicitly learns the geometric constraints of the human body by capturing spatial and temporal context information from 2D skeleton data. Specifically, a novel skeleton attention (SA) mechanism models geometric context dependencies among different body joints, thereby improving the spatial feature representation ability of the network. To enhance temporal consistency, a novel multilayer perceptron (MLP)-Mixer based structure is used to comprehensively learn temporal context information from input sequences. We conduct experiments on challenging publicly available datasets to evaluate the proposed approach. On the Human3.6M dataset, it outperforms the previous best approach by 0.5 mm. It also shows significant improvements on the HumanEva-I dataset.
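
The two ideas named in the abstract can be illustrated with a small NumPy sketch. The full text of the article is not part of this page, so the code below is not the authors' architecture. It assumes that attention among joints can be approximated by plain scaled dot-product self-attention within each frame, and that the temporal part resembles the token-mixing MLP of MLP-Mixer applied along the frame axis, with layer normalization omitted. All names, shapes and weights (`joint_attention`, `temporal_mixing`, 17 joints, 9 frames, random weights) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # Tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def joint_attention(x, w_q, w_k, w_v):
    """Scaled dot-product attention among the joints of each frame.

    x: (T, J, C) 2D joint coordinates for T frames and J joints.
    Returns features of shape (T, J, D) and attention weights of shape (T, J, J).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each joint attends to every joint
    return weights @ v, weights

def temporal_mixing(h, w1, w2):
    """MLP-Mixer-style token mixing along the frame axis, with a residual.

    h: (T, F) one feature vector per frame; w1: (T, H); w2: (H, T).
    """
    mixed = gelu(h.T @ w1) @ w2  # (F, T): every feature channel mixes across frames
    return h + mixed.T

# Hypothetical sizes: 9 frames, 17 joints, 2D input, 16 attention channels.
rng = np.random.default_rng(0)
T, J, C, D, H = 9, 17, 2, 16, 32
pose_2d = rng.normal(size=(T, J, C))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(C, D)) for _ in range(3))

spatial, attn = joint_attention(pose_2d, w_q, w_k, w_v)
frames = spatial.reshape(T, J * D)  # flatten the joint features of each frame
w1 = rng.normal(scale=0.1, size=(T, H))
w2 = rng.normal(scale=0.1, size=(H, T))
out = temporal_mixing(frames, w1, w2)
print(spatial.shape, attn.shape, out.shape)
```

In this sketch each row of `attn` sums to one, so every joint's feature is a weighted combination of all joints in the same frame, while `temporal_mixing` lets every frame contribute to every other frame in the sequence.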

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, 310023, China
Chaonan Li, Sheng Liu, Lu Yao & Siyu Zou

Corresponding author

Correspondence to Sheng Liu.

Additional information

This work was supported by the National Key R&D Program of China (No. 2018YFB1305200).

Statements and Declarations

The authors declare that there are no conflicts of interest related to this article.

About this article

Cite this article

Li, C., Liu, S., Yao, L. et al. Video-based body geometric aware network for 3D human pose estimation. Optoelectron. Lett. 18, 313–320 (2022). https://doi.org/10.1007/s11801-022-2015-8

Received: 03 February 2022
Revised: 10 March 2022
Published: 07 June 2022
Issue Date: May 2022
DOI: https://doi.org/10.1007/s11801-022-2015-8

Document code

A
