Multi-Scale Structure-Aware Network for Human Pose Estimation

  • Lipeng Ke
  • Ming-Ching Chang
  • Honggang Qi
  • Siwei Lyu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)

Abstract

We develop a robust multi-scale structure-aware neural network for human pose estimation. The method improves on recent deep conv-deconv hourglass models in four ways: (1) multi-scale supervision, which strengthens contextual feature learning for matching body keypoints by combining feature heatmaps across scales; (2) a multi-scale regression network at the end of the model, which globally optimizes the structural matching of the multi-scale features; (3) a structure-aware loss, used in the intermediate supervision and at the regression stage, which improves the matching of keypoints and their respective neighbors in order to infer higher-order matching configurations; and (4) a keypoint masking training scheme that effectively fine-tunes the network to robustly localize occluded keypoints via adjacent matches. Our method effectively improves on state-of-the-art pose estimation methods that struggle with scale variation, occlusion, and complex multi-person scenarios. The multi-scale supervision integrates tightly with the regression network to (i) localize keypoints using the ensemble of multi-scale features and (ii) infer the global pose configuration by maximizing structural consistency across multiple keypoints and scales. The keypoint masking training enhances these advantages by focusing learning on hard occlusion samples. Our method achieves the leading position on the MPII challenge leaderboard among state-of-the-art methods.
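The idea of multi-scale supervision can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the loss or the target format, so the Gaussian target heatmaps, the mean-squared-error loss, the fixed sigma, the choice of scales, and all function names (gaussian_heatmap, mse, multiscale_loss) are assumptions made for illustration only.

```python
import math

def gaussian_heatmap(size, center, sigma=1.0):
    """Return a size x size heatmap (list of rows) with a Gaussian peak at center=(x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(size)]
            for y in range(size)]

def mse(a, b):
    """Mean squared error between two equally sized 2D lists."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b)) / n

def multiscale_loss(predictions, keypoint, full_size):
    """Sum the heatmap MSE over all supervised scales (assumed loss form).

    predictions: dict mapping heatmap side length -> predicted heatmap.
    keypoint:    (x, y) location in full-resolution coordinates.
    full_size:   side length of the full-resolution heatmap.
    """
    total = 0.0
    for size, pred in predictions.items():
        scale = size / full_size
        # Rescale the keypoint to this resolution and build the target there.
        target = gaussian_heatmap(size, (keypoint[0] * scale, keypoint[1] * scale))
        total += mse(pred, target)
    return total

# Hypothetical usage: one keypoint supervised at four assumed resolutions.
keypoint = (40, 24)
sizes = (64, 32, 16, 8)
perfect = {s: gaussian_heatmap(s, (keypoint[0] * (s / 64), keypoint[1] * (s / 64)))
           for s in sizes}
blank = {s: [[0.0] * s for _ in range(s)] for s in sizes}
perfect_loss = multiscale_loss(perfect, keypoint, 64)
blank_loss = multiscale_loss(blank, keypoint, 64)
```

In this sketch a prediction that matches the target at every scale gives zero loss, while an error at any single scale raises the total, which is one simple way to read "combining feature heatmaps across scales".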

Keywords

Human pose estimation · Conv-deconv network · Multi-scale supervision

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Lipeng Ke (1)
  • Ming-Ching Chang (2)
  • Honggang Qi (1)
  • Siwei Lyu (2)

  1. University of Chinese Academy of Sciences, Beijing, China
  2. University at Albany, State University of New York, New York City, USA
