3D Human Pose Estimation from Monocular Images with Deep Convolutional Neural Network

  • Sijin Li
  • Antoni B. Chan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9004)


In this paper, we propose a deep convolutional neural network for 3D human pose estimation from monocular images. We train the network using two strategies: (1) a multi-task framework that jointly trains pose regression and body part detectors; and (2) a pre-training strategy in which the pose regressor is initialized using a network trained for body part detection. We evaluate our network on a large data set and achieve significant improvement over baseline methods. Human pose estimation is a structured prediction problem, i.e., the locations of the body parts are highly correlated. Although we do not add constraints on the correlations between body parts to the network, we empirically show that the network has disentangled the dependencies among different body parts and learned their correlations.
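As a rough illustration of the multi-task idea, the sketch below combines a pose regression loss with a body part detection loss into a single training objective. The specific loss forms (squared error for regression, binary cross-entropy for detection), the weighting parameter `det_weight`, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def regression_loss(pred_joints, true_joints):
    """Mean squared Euclidean distance between predicted and true 3D joints.

    Assumed loss form; each joint is an (x, y, z) tuple.
    """
    total = 0.0
    for pred, true in zip(pred_joints, true_joints):
        total += sum((p - t) ** 2 for p, t in zip(pred, true))
    return total / len(true_joints)


def detection_loss(pred_probs, labels, eps=1e-12):
    """Mean binary cross-entropy over body part detection outputs.

    Assumed loss form; pred_probs are probabilities in [0, 1] and
    labels are 0 or 1.
    """
    total = 0.0
    for q, y in zip(pred_probs, labels):
        q = min(max(q, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
    return total / len(labels)


def multitask_loss(pred_joints, true_joints, pred_probs, labels, det_weight=1.0):
    """Joint objective: regression loss plus a weighted detection loss.

    det_weight is a hypothetical balancing hyperparameter.
    """
    return (regression_loss(pred_joints, true_joints)
            + det_weight * detection_loss(pred_probs, labels))


if __name__ == "__main__":
    pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    true = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
    print(multitask_loss(pred, true, [0.5, 0.5], [1, 0]))
```

Under the same assumptions, the pre-training strategy would correspond to first minimizing only `detection_loss`, then reusing the learned weights of the shared layers as the starting point for minimizing `regression_loss`.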


Detection Task; Convolutional Neural Network; Deep Neural Network; Joint Point; Regression Task
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 123212 and CityU 110513).

Supplementary material

Supplementary material (mov 27,852 KB)



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
