Multi-scale Attention Aided Multi-Resolution Network for Human Pose Estimation

  • Srinika Selvam
  • Deepak Mishra
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11941)


In this paper, we propose attention maps at various scales on a multi-resolution feature-extractor baseline network for human pose estimation. The baseline network captures information across scales through a repeated bottom-up and top-down approach that uses successive pooling and up-sampling. We propose a network named Refinement Net that regresses the predicted heatmaps to 2D joint locations to remove ambiguities in the predicted positions. We experiment with three levels of attention schemes: global, heatmap and multi-resolution. Attention masks help generate a basin of attraction that guides the network in deciding where to “look”. The performance of the proposed network is on par with state-of-the-art two-dimensional pose estimation methods on the MPII dataset.
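To make the idea of an attention mask acting on a joint heatmap concrete, here is a minimal NumPy sketch. It is not the network described in this paper: the function names, the spatial-softmax mask, the argmax decoding and the toy 5x5 arrays are all illustrative assumptions. It only shows how a mask can shift an ambiguous heatmap toward one candidate location before the heatmap is reduced to a 2D coordinate.

```python
import numpy as np

def spatial_softmax(scores):
    """Normalise a 2D score map into an attention mask that sums to 1."""
    shifted = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def apply_attention(heatmap, scores):
    """Reweight a joint heatmap element-wise by the attention mask."""
    return heatmap * spatial_softmax(scores)

def heatmap_to_joint(heatmap):
    """Return the (x, y) pixel location of the heatmap maximum."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)

# Toy heatmap with two candidate peaks, i.e. an ambiguous prediction.
heatmap = np.zeros((5, 5))
heatmap[1, 1] = 0.9
heatmap[3, 4] = 1.0

# Hypothetical attention scores that favour the upper-left candidate.
scores = np.zeros((5, 5))
scores[1, 1] = 3.0

attended = apply_attention(heatmap, scores)
joint_before = heatmap_to_joint(heatmap)   # location chosen without attention
joint_after = heatmap_to_joint(attended)   # location chosen with attention
```

In this toy case the unweighted heatmap selects the slightly stronger peak at (4, 3), while the attended heatmap selects (1, 1), because the mask suppresses responses outside the attended region. A learned regressor such as the Refinement Net named in the abstract would replace the simple argmax used here.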


Human pose estimation · Multi-resolution · Attention maps



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Indian Institute of Space Science and Technology, Thiruvananthapuram, India
