Learning Latent Representations of 3D Human Pose with Deep Neural Networks

Katircioglu, Isinsu; Tekin, Bugra; Salzmann, Mathieu; Lepetit, Vincent; Fua, Pascal

doi:10.1007/s11263-018-1066-6

Published: 31 January 2018

International Journal of Computer Vision, Volume 126, pages 1326–1341 (2018)

Isinsu Katircioglu¹, Bugra Tekin¹, Mathieu Salzmann¹, Vincent Lepetit² & Pascal Fua¹

Abstract

Most recent approaches to monocular 3D pose estimation rely on Deep Learning. They either train a Convolutional Neural Network to directly regress from an image to a 3D pose, which ignores the dependencies between human joints, or model these dependencies via a max-margin structured learning framework, which involves a high computational cost at inference time. In this paper, we introduce a Deep Learning regression architecture for structured prediction of 3D human pose from monocular images or 2D joint location heatmaps that relies on an overcomplete autoencoder to learn a high-dimensional latent pose representation and accounts for joint dependencies. We further propose an efficient Long Short-Term Memory network to enforce temporal consistency on 3D pose predictions. We demonstrate that our approach achieves state-of-the-art performance both in terms of structure preservation and prediction accuracy on standard 3D human pose estimation benchmarks.
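
As a rough illustration of the idea in the abstract, the sketch below shows the forward pass of an overcomplete autoencoder (latent dimension larger than the pose dimension) and a predict-in-latent-space-then-decode pipeline. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract gives no layer sizes, activations, or training details, so the 17-joint pose, the 200-dimensional latent space, the ReLU activation, the random untrained weights, and all function and class names are hypothetical choices made for illustration only.

```python
import random

POSE_DIM = 17 * 3    # assumption: 17 joints with (x, y, z) each
LATENT_DIM = 200     # assumption: any value larger than POSE_DIM is "overcomplete"


def relu(v):
    """Element-wise rectifier."""
    return [max(0.0, x) for x in v]


def matvec(W, v, b):
    """Affine map W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a real model would learn these from data."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class OvercompleteAutoencoder:
    """Single hidden layer autoencoder whose latent code is wider than its input."""

    def __init__(self, pose_dim, latent_dim, seed=0):
        if latent_dim <= pose_dim:
            raise ValueError("overcomplete means latent_dim > pose_dim")
        rng = random.Random(seed)
        self.W_enc = init_matrix(latent_dim, pose_dim, rng)
        self.b_enc = [0.0] * latent_dim
        self.W_dec = init_matrix(pose_dim, latent_dim, rng)
        self.b_dec = [0.0] * pose_dim

    def encode(self, pose):
        """Map a flattened 3D pose to the high-dimensional latent code."""
        return relu(matvec(self.W_enc, pose, self.b_enc))

    def decode(self, latent):
        """Map a latent code back to a flattened 3D pose."""
        return matvec(self.W_dec, latent, self.b_dec)


def reconstruction_loss(ae, pose):
    """Squared error between a pose and its reconstruction."""
    recon = ae.decode(ae.encode(pose))
    return sum((r - p) ** 2 for r, p in zip(recon, pose))


def predict_pose(features, W_reg, b_reg, ae):
    """Regress features to the latent space, then decode to a 3D pose.

    A linear layer stands in for the image (or heatmap) network here.
    """
    latent = relu(matvec(W_reg, features, b_reg))
    return ae.decode(latent)


# Small demonstration with random, untrained weights.
rng = random.Random(1)
pose = [rng.uniform(-1.0, 1.0) for _ in range(POSE_DIM)]
ae = OvercompleteAutoencoder(POSE_DIM, LATENT_DIM)
print(len(pose), len(ae.encode(pose)))
print(reconstruction_loss(ae, pose))
```

In this reading, the decoder is what ties the joints together: the regressor only has to land somewhere in the latent space, and the decoder maps that code back to a full pose, so dependencies between joints are carried by the learned representation rather than predicted joint by joint.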

Notes

We experimented on only 6 actions due to time limitations of the submission server.

Author information

Isinsu Katircioglu and Bugra Tekin contributed equally as co-first authors.

Authors and Affiliations

Computer Vision Laboratory (CVLab), École Polytechnique Fédérale de Lausanne (EPFL), 1015, Lausanne, Switzerland
Isinsu Katircioglu, Bugra Tekin, Mathieu Salzmann & Pascal Fua
LaBRI, University of Bordeaux, 33405, Talence, France
Vincent Lepetit

Corresponding author

Correspondence to Bugra Tekin.

Additional information

Communicated by Edwin Hancock, Richard Wilson, Will Smith, Adrian Bors and Nick Pears.

About this article

Cite this article

Katircioglu, I., Tekin, B., Salzmann, M. et al. Learning Latent Representations of 3D Human Pose with Deep Neural Networks. Int J Comput Vis 126, 1326–1341 (2018). https://doi.org/10.1007/s11263-018-1066-6

Received: 01 February 2017
Accepted: 10 January 2018
Published: 31 January 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s11263-018-1066-6
