Learning Latent Representations of 3D Human Pose with Deep Neural Networks

International Journal of Computer Vision

Abstract

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.
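
As a rough illustration of the overcomplete autoencoder idea mentioned in the abstract, the following pure-Python sketch trains an autoencoder whose latent layer is wider than its input pose vector. Everything specific here is an assumption made for illustration and is not taken from the paper: the class name `OvercompleteAutoencoder`, the toy dimensions (a 6-value pose and a 10-value latent code), the single ReLU encoder layer, the plain squared-error loss, and the random toy poses, which carry no real joint dependencies. The abstract does not state layer sizes, regularisation, or training details, and the image regression and LSTM stages are not shown.

```python
import random


def affine(W, v, b):
    """Return W @ v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


class OvercompleteAutoencoder:
    """Hypothetical toy model: one ReLU encoder layer, one linear decoder layer."""

    def __init__(self, pose_dim, latent_dim, seed=0):
        if latent_dim <= pose_dim:
            raise ValueError("overcomplete means latent_dim > pose_dim")
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0.0, 0.1) for _ in range(pose_dim)]
                      for _ in range(latent_dim)]
        self.b_enc = [0.0] * latent_dim
        self.W_dec = [[rng.gauss(0.0, 0.1) for _ in range(latent_dim)]
                      for _ in range(pose_dim)]
        self.b_dec = [0.0] * pose_dim

    def encode(self, pose):
        # Map the pose vector to the wider latent code (ReLU activation).
        return [max(0.0, a) for a in affine(self.W_enc, pose, self.b_enc)]

    def decode(self, latent):
        # Map a latent code back to pose space (linear output).
        return affine(self.W_dec, latent, self.b_dec)

    def loss(self, pose):
        # Squared reconstruction error, 0.5 * ||decode(encode(p)) - p||^2.
        recon = self.decode(self.encode(pose))
        return 0.5 * sum((r - p) ** 2 for r, p in zip(recon, pose))

    def train_step(self, pose, lr=0.02):
        # One stochastic gradient descent step on the reconstruction loss.
        h = self.encode(pose)
        err = [r - p for r, p in zip(self.decode(h), pose)]
        # Gradient w.r.t. the latent code, computed before the decoder
        # weights change and gated by the ReLU (zero where h[j] == 0).
        grad_h = [
            sum(self.W_dec[i][j] * err[i] for i in range(len(err))) if h[j] > 0.0 else 0.0
            for j in range(len(h))
        ]
        for i, e in enumerate(err):
            for j, hj in enumerate(h):
                self.W_dec[i][j] -= lr * e * hj
            self.b_dec[i] -= lr * e
        for j, g in enumerate(grad_h):
            for k, pk in enumerate(pose):
                self.W_enc[j][k] -= lr * g * pk
            self.b_enc[j] -= lr * g


# Toy data (assumption): 8 random "poses" of 2 joints x 3 coordinates each.
rng = random.Random(1)
POSE_DIM, LATENT_DIM = 6, 10
poses = [[rng.uniform(-1.0, 1.0) for _ in range(POSE_DIM)] for _ in range(8)]

model = OvercompleteAutoencoder(POSE_DIM, LATENT_DIM)
before = sum(model.loss(p) for p in poses) / len(poses)
for _ in range(200):
    for p in poses:
        model.train_step(p)
after = sum(model.loss(p) for p in poses) / len(poses)
print(f"mean reconstruction loss: {before:.4f} -> {after:.4f}")
```

One plausible reading of the abstract is that a separate network would then regress from the image or the 2D heatmaps to such a latent code, with the decoder mapping the code back to a 3D pose; that stage is outside the scope of this sketch.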

Notes

  1. We experimented on only 6 actions due to time limitations of the submission server.

Author information

Corresponding author

Correspondence to Bugra Tekin.

Additional information

Communicated by Edwin Hancock, Richard Wilson, Will Smith, Adrian Bors and Nick Pears.

About this article

Cite this article

Katircioglu, I., Tekin, B., Salzmann, M. et al. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. Int J Comput Vis 126, 1326–1341 (2018). https://doi.org/10.1007/s11263-018-1066-6

  • DOI: https://doi.org/10.1007/s11263-018-1066-6
