
Consensus-Based Optimization for 3D Human Pose Estimation in Camera Coordinates

Published in: International Journal of Computer Vision

Abstract

3D human pose estimation is frequently cast as the task of estimating 3D poses relative to the root body joint. Alternatively, we propose a 3D human pose estimation method in camera coordinates, which allows the effective combination of 2D annotated data with 3D poses, as well as a straightforward multi-view generalization. To that end, we cast the problem as pose estimation in the view frustum space, where absolute depth prediction and joint-relative depth estimation are disentangled. Final 3D predictions are obtained in camera coordinates through the inverse camera projection. Building on this, we also present a consensus-based optimization algorithm for multi-view predictions from uncalibrated images, which requires only a single monocular training procedure. Although our method is indirectly tied to the training camera intrinsics, it still converges for cameras with different intrinsic parameters, resulting in coherent estimations up to a scale factor. Our method improves the state of the art on well-known 3D human pose datasets, reducing the prediction error by 32% on the most common benchmark. We also report our results in terms of absolute pose position error, achieving 80 mm for monocular estimations and 51 mm for multi-view estimations, on average. Source code is available at https://github.com/dluvizon/3d-pose-consensus.
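The inverse camera projection mentioned in the abstract can be illustrated with a standard pinhole back-projection. The snippet below is a minimal sketch, not the authors' released implementation: the intrinsic parameters (fx, fy, cx, cy), pixel coordinates, and depth values are illustrative placeholders.

```python
# Minimal sketch (assumed pinhole model, not the paper's code): back-project a
# 2D joint position plus a predicted absolute depth into camera coordinates.
import numpy as np

def inverse_camera_projection(uv, z, fx, fy, cx, cy):
    """Map a pixel coordinate (u, v) and depth z (along the camera axis)
    to a 3D point in camera coordinates."""
    u, v = uv
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Example: a joint detected at pixel (640, 360) with a predicted depth of 3.2 m,
# using illustrative intrinsics (values are assumptions, not dataset calibration).
p_cam = inverse_camera_projection((640.0, 360.0), 3200.0,
                                  fx=1150.0, fy=1150.0, cx=512.0, cy=512.0)
print(p_cam)  # 3D joint position in camera coordinates (mm)
```

In this formulation, the network only needs to predict the joint's image position and its depth; recovering the absolute 3D pose in camera coordinates is then a deterministic per-joint back-projection.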





Author information


Corresponding author

Correspondence to Diogo C. Luvizon.

Additional information

Communicated by Yasushi Yagi.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was partially funded by CNPq (Brazil)—Grant 233342/2014-1.


About this article


Cite this article

Luvizon, D.C., Picard, D. & Tabia, H. Consensus-Based Optimization for 3D Human Pose Estimation in Camera Coordinates. Int J Comput Vis 130, 869–882 (2022). https://doi.org/10.1007/s11263-021-01570-9



  • DOI: https://doi.org/10.1007/s11263-021-01570-9
