International Journal of Computer Vision, Volume 126, Issue 12, pp 1326–1341

Learning Latent Representations of 3D Human Pose with Deep Neural Networks

  • Isinsu Katircioglu
  • Bugra Tekin
  • Mathieu Salzmann
  • Vincent Lepetit
  • Pascal Fua


Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.
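To make the notion of an overcomplete autoencoder concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a pose of 17 joints flattened to 51 values and a 2000-dimensional latent code; these sizes, the single-layer ReLU encoder, the linear decoder, and the random untrained weights are all illustrative assumptions. The only point it demonstrates is that the latent code is deliberately larger than the pose vector it encodes, and that the decoder maps it back to pose space.

```python
import numpy as np

# Illustrative sizes (assumptions, not taken from the paper).
NUM_JOINTS = 17
POSE_DIM = NUM_JOINTS * 3   # flattened (x, y, z) per joint
LATENT_DIM = 2000           # overcomplete: LATENT_DIM > POSE_DIM


def init_params(rng):
    """Random, untrained weights for a one-layer encoder and decoder."""
    enc_scale = 1.0 / np.sqrt(POSE_DIM)
    dec_scale = 1.0 / np.sqrt(LATENT_DIM)
    return {
        "W_enc": rng.normal(0.0, enc_scale, size=(POSE_DIM, LATENT_DIM)),
        "b_enc": np.zeros(LATENT_DIM),
        "W_dec": rng.normal(0.0, dec_scale, size=(LATENT_DIM, POSE_DIM)),
        "b_dec": np.zeros(POSE_DIM),
    }


def encode(params, pose):
    """Map poses of shape (batch, POSE_DIM) to latent codes through a ReLU."""
    return np.maximum(0.0, pose @ params["W_enc"] + params["b_enc"])


def decode(params, latent):
    """Map latent codes back to pose space with a linear layer."""
    return latent @ params["W_dec"] + params["b_dec"]


def reconstruction_loss(params, pose):
    """Mean squared error between input poses and their reconstructions."""
    recon = decode(params, encode(params, pose))
    return float(np.mean((recon - pose) ** 2))


rng = np.random.default_rng(0)
params = init_params(rng)
poses = rng.normal(size=(4, POSE_DIM))   # random stand-in for real 3D poses
latent = encode(params, poses)
loss = reconstruction_loss(params, poses)
print(latent.shape, loss)
```

Training such a model would minimise `reconstruction_loss` over a set of real poses; the sketch stops at the forward pass to keep the structure visible.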


Keywords: 3D human pose estimation, Structured prediction, Deep learning



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Isinsu Katircioglu (1)
  • Bugra Tekin (1)
  • Mathieu Salzmann (1)
  • Vincent Lepetit (2)
  • Pascal Fua (1)

  1. Computer Vision Laboratory (CVLab), École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  2. LaBRI, University of Bordeaux, Talence, France
