International Journal of Computer Vision, Volume 122, Issue 1, pp 149–168

Maximum-Margin Structured Learning with Deep Networks for 3D Human Pose Estimation



This paper focuses on structured-output learning using deep neural networks for 3D human pose estimation from monocular images. Our network takes an image and 3D pose as inputs and outputs a score value, which is high when the image-pose pair matches and low otherwise. The network structure consists of a convolutional neural network for image feature extraction, followed by two sub-networks for transforming the image features and pose into a joint embedding. The score function is then the dot-product between the image and pose embeddings. The image-pose embedding and score function are jointly trained using a maximum-margin cost function. Our proposed framework can be interpreted as a special form of structured support vector machines where the joint feature space is discriminatively learned using deep neural networks. We also propose an efficient recurrent neural network for performing inference with the learned image-embedding. We test our framework on the Human3.6m dataset and obtain state-of-the-art results compared to other recent methods. Finally, we present visualizations of the image-pose embedding space, demonstrating the network has learned a high-level embedding of body-orientation and pose-configuration.
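The scoring and training objective described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation. The single-layer ReLU sub-networks, the parameter names (`W_img`, `b_img`, `W_pose`, `b_pose`), the helper names, and the choice of mean per-joint distance as the margin (a margin-rescaling hinge loss in the style of structured SVMs) are all assumptions made for the example. The convolutional feature extractor is assumed to have already produced `img_feat`.

```python
import numpy as np

def embed(x, W, b):
    """One linear layer plus ReLU; a hypothetical stand-in for a sub-network."""
    return np.maximum(0.0, x @ W + b)

def score(img_feat, pose, params):
    """Dot product between the image embedding and the pose embedding."""
    f_img = embed(img_feat, params["W_img"], params["b_img"])
    f_pose = embed(pose, params["W_pose"], params["b_pose"])
    return float(f_img @ f_pose)

def pose_error(pose_a, pose_b, n_joints):
    """Mean Euclidean distance per joint between two flattened 3D poses."""
    diff = (pose_a - pose_b).reshape(n_joints, 3)
    return float(np.linalg.norm(diff, axis=1).mean())

def max_margin_loss(img_feat, true_pose, wrong_pose, params, n_joints):
    """Hinge loss (assumed form): the true pose should outscore a wrong pose
    by at least the pose error between the two."""
    margin = pose_error(true_pose, wrong_pose, n_joints)
    s_true = score(img_feat, true_pose, params)
    s_wrong = score(img_feat, wrong_pose, params)
    return max(0.0, s_wrong + margin - s_true)
```

Under these assumptions, the loss is zero once the matching pose outscores the competing pose by at least their pose error, so poses far from the ground truth are pushed to lower scores than poses close to it.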


Keywords: Structured learning, Deep learning, Human pose estimation



This work was supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU 123212), and by a Strategic Research Grant from City University of Hong Kong (Project Nos. 7004417 and 7004682). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40 GPU used for this research.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong
