
Consensus-Based Optimization for 3D Human Pose Estimation in Camera Coordinates

Published in: International Journal of Computer Vision

Abstract

3D human pose estimation is frequently cast as the task of estimating 3D poses relative to the root body joint. Alternatively, we propose a 3D human pose estimation method in camera coordinates, which allows the effective combination of 2D annotated data with 3D poses, as well as a straightforward multi-view generalization. To that end, we cast the problem as pose estimation in the view frustum space, where absolute depth prediction and joint-relative depth estimation are disentangled. Final 3D predictions are obtained in camera coordinates through the inverse camera projection. Building on this, we also present a consensus-based optimization algorithm for multi-view predictions from uncalibrated images, which requires only a single monocular training procedure. Although our method is indirectly tied to the training camera intrinsics, it still converges for cameras with different intrinsic parameters, resulting in coherent estimations up to a scale factor. Our method improves the state of the art on well-known 3D human pose datasets, reducing the prediction error by 32% on the most common benchmark. We also report our results in terms of absolute pose position error, achieving 80 mm for monocular estimations and 51 mm for multi-view estimations, on average. Source code is available at https://github.com/dluvizon/3d-pose-consensus.
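The inverse camera projection mentioned in the abstract can be illustrated with a standard pinhole back-projection. The snippet below is a minimal sketch, not the authors' released implementation: the intrinsic parameters (fx, fy, cx, cy), pixel coordinates, and depth values are illustrative placeholders.

```python
# Minimal sketch (assumed pinhole model, not the paper's code): back-project a
# 2D joint position plus a predicted absolute depth into camera coordinates.
import numpy as np

def inverse_camera_projection(uv, z, fx, fy, cx, cy):
    """Map a pixel coordinate (u, v) and depth z (along the camera axis)
    to a 3D point in camera coordinates."""
    u, v = uv
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example: a joint detected at pixel (640, 360) with a predicted depth of 3.2 m,
# using illustrative intrinsics (values are assumptions, not dataset calibration).
p_cam = inverse_camera_projection((640.0, 360.0), 3200.0,
                                  fx=1150.0, fy=1150.0, cx=512.0, cy=512.0)
print(p_cam)  # 3D joint position in camera coordinates (mm)
```

In this formulation, the network only needs to predict the joint's image position and its depth; recovering the absolute 3D pose in camera coordinates is then a deterministic per-joint back-projection.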





Author information


Corresponding author

Correspondence to Diogo C. Luvizon.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially funded by CNPq (Brazil)—Grant 233342/2014-1.


About this article


Cite this article

Luvizon, D.C., Picard, D. & Tabia, H. Consensus-Based Optimization for 3D Human Pose Estimation in Camera Coordinates. Int J Comput Vis 130, 869–882 (2022). https://doi.org/10.1007/s11263-021-01570-9



  • DOI: https://doi.org/10.1007/s11263-021-01570-9
