Abstract
3D human pose estimation has many important applications in human-computer interaction and human action recognition. Simultaneously achieving real-time speed, support for a varying number of people, and high accuracy from a single RGB image is a challenging problem. To this end, this paper proposes a multi-task, multi-level neural network structure with a physical constraint. The network estimates 3D human poses from a single RGB image in an end-to-end way and achieves both high accuracy and high speed. Experimental results show that the proposed system achieves 21 fps on an RTX 2080 GPU with an accuracy loss of only 33 mm compared with conventional works. The mechanism of the network is also analyzed through network visualization. This work shows that 3D human pose can be estimated from a single monocular RGB camera at real-time speed.
Acknowledgements
This work was jointly supported by the Waseda University Grant for Special Research Projects under grants 2020C-657 and 2020R-040, the National Natural Science Foundation of China under grant 62001110, the Natural Science Foundation of Jiangsu Province under grant BK20200353, the Guangdong Basic and Applied Basic Research Foundation under grant 2020A1515110145, the Shenzhen Science and Technology Program under grant RCBS20200714114858072, the 111 Project under grant B17040, and the Fundamental Research Funds for the Central Universities under grant 2242021R10115.
About this article
Cite this article
Luo, D., Du, S. & Ikenaga, T. Multi-task neural network with physical constraint for real-time multi-person 3D pose estimation from monocular camera. Multimed Tools Appl 80, 27223–27244 (2021). https://doi.org/10.1007/s11042-021-10982-1
DOI: https://doi.org/10.1007/s11042-021-10982-1