Human Pose Estimation by a Series of Residual Auto-Encoders

  • M. Farrajota
  • João M. F. Rodrigues
  • J. M. H. du Buf
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10255)


Pose estimation is the task of predicting the pose of an object in an image or in a sequence of images. Here, we focus on articulated human pose estimation in scenes with a single person. We employ a series of residual auto-encoders to produce multiple predictions, which are then combined to provide a heatmap prediction of body joints. In this network topology, features are processed across all scales, which captures the various spatial relationships associated with the body. We apply repeated bottom-up and top-down processing, with intermediate supervision for each auto-encoder network. We propose some improvements to this type of regression-based network to further increase performance, namely: (a) increase the number of parameters of the auto-encoder networks in the pipeline, (b) use stronger regularization along with heavy data augmentation, (c) use sub-pixel precision for more precise joint localization, and (d) combine the output heatmaps of all auto-encoders into a single prediction, which further increases body joint prediction accuracy. We demonstrate state-of-the-art results on the popular FLIC and LSP datasets.
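
Points (c) and (d) can be illustrated with a small sketch. The abstract does not specify how the heatmaps are combined or how the sub-pixel offset is computed, so the code below makes two assumptions: the per-auto-encoder heatmaps are simply averaged, and the peak is refined by fitting a parabola through the maximum and its two neighbours along each axis. The function names (`combine_heatmaps`, `locate_joint`) are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows of scores for one body joint


def combine_heatmaps(heatmaps: List[Heatmap]) -> Heatmap:
    """Average equally sized heatmaps into one (assumed combination rule)."""
    n = len(heatmaps)
    rows, cols = len(heatmaps[0]), len(heatmaps[0][0])
    return [[sum(h[r][c] for h in heatmaps) / n for c in range(cols)]
            for r in range(rows)]


def _parabolic_offset(left: float, centre: float, right: float) -> float:
    """Vertex offset of a parabola through three samples, clamped to [-0.5, 0.5]."""
    denom = left - 2.0 * centre + right
    if denom >= 0.0:
        # Flat region or not a strict peak: skip the refinement.
        return 0.0
    offset = 0.5 * (left - right) / denom
    return max(-0.5, min(0.5, offset))


def locate_joint(heatmap: Heatmap) -> Tuple[float, float]:
    """Return the (x, y) peak location, refined to sub-pixel precision."""
    rows, cols = len(heatmap), len(heatmap[0])
    # Integer-precision peak: the pixel with the highest score.
    r0, c0 = max(((r, c) for r in range(rows) for c in range(cols)),
                 key=lambda rc: heatmap[rc[0]][rc[1]])
    x, y = float(c0), float(r0)
    # Refine each axis only when both neighbours exist.
    if 0 < c0 < cols - 1:
        x += _parabolic_offset(heatmap[r0][c0 - 1], heatmap[r0][c0],
                               heatmap[r0][c0 + 1])
    if 0 < r0 < rows - 1:
        y += _parabolic_offset(heatmap[r0 - 1][c0], heatmap[r0][c0],
                               heatmap[r0 + 1][c0])
    return x, y
```

Calling `locate_joint(combine_heatmaps(stage_outputs))`, where `stage_outputs` holds one heatmap per auto-encoder for a given joint, yields a fractional pixel coordinate that is pulled towards the stronger neighbour of the peak. Other refinement rules, such as a fixed quarter-pixel shift towards the higher neighbour, would fit the same interface.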


Keywords: Human pose, ConvNet, Neural networks, Auto-encoders



This work was supported by the FCT project LARSyS (UID/EEA/50009/2013) and FCT PhD grant to author MF (SFRH/BD/79812/2011).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Vision Laboratory, LARSyS, University of the Algarve, Faro, Portugal
