Multi-task neural network with physical constraint for real-time multi-person 3D pose estimation from monocular camera

Luo, Dingli; Du, Songlin; Ikenaga, Takeshi

doi:10.1007/s11042-021-10982-1

Published: 15 May 2021

Multimedia Tools and Applications, Volume 80, pages 27223–27244 (2021)

Dingli Luo¹, Songlin Du²,³,⁴,⁵ & Takeshi Ikenaga¹

Abstract

3D human pose estimation has many important applications in human-computer interaction and human action recognition. Simultaneously achieving real-time speed, support for a varying number of people, and high accuracy from a single RGB image is a challenging problem. To this end, this paper proposes a multi-task, multi-level neural network structure with a physical constraint. The network estimates 3D human poses from a single RGB image in an end-to-end way and achieves both high accuracy and high speed. Experimental results show that the proposed system runs at 21 fps on an RTX 2080 GPU with an accuracy loss of only 33 mm compared with conventional methods. The mechanism of the network is also analyzed through network visualization. This work shows that 3D human poses can be estimated from a single monocular RGB camera at real-time speed.

Acknowledgements

This work was jointly supported by the Waseda University Grant for Special Research Projects under grants 2020C-657 and 2020R-040, the National Natural Science Foundation of China under grant 62001110, the Natural Science Foundation of Jiangsu Province under grant BK20200353, the Guangdong Basic and Applied Basic Research Foundation under grant 2020A1515110145, the Shenzhen Science and Technology Program under grant RCBS20200714114858072, the 111 Project under grant B17040, and the Fundamental Research Funds for the Central Universities under grant 2242021R10115.

Author information

Authors and Affiliations

Graduate School of Information, Production and Systems, Waseda University, Kitakyushu, 808-0135, Japan
Dingli Luo & Takeshi Ikenaga
School of Automation, Southeast University, Nanjing, 210096, China
Songlin Du
Shenzhen Institute of Southeast University, Shenzhen, 518063, China
Songlin Du
Hubei Key Laboratory of Advanced Control and Intelligent Automation for Complex Systems, Wuhan, 430074, China
Songlin Du
Engineering Research Center of Intelligent Geodetection Technology, Ministry of Education, Wuhan, 430074, China
Songlin Du

Corresponding author

Correspondence to Songlin Du.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Luo, D., Du, S. & Ikenaga, T. Multi-task neural network with physical constraint for real-time multi-person 3D pose estimation from monocular camera. Multimed Tools Appl 80, 27223–27244 (2021). https://doi.org/10.1007/s11042-021-10982-1

Received: 30 July 2020
Revised: 16 April 2021
Accepted: 30 April 2021
Published: 15 May 2021
Issue Date: July 2021
DOI: https://doi.org/10.1007/s11042-021-10982-1
