Scale-aware attention-based multi-resolution representation for multi-person pose estimation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

The performance of multi-person pose estimation has improved significantly with the development of deep convolutional neural networks. However, two challenging issues remain largely ignored even though they are key factors in the deterioration of keypoint localization: the scale variation of human body parts, and the heavy information loss caused by consecutive striding followed by multiple upsampling steps. In this paper, we present a novel network, the Scale-aware attention-based Multi-resolution representation Network (SaMr-Net), which is designed to make the method robust to scale variation and to prevent the loss of detail information during upsampling, leading to more precise keypoint estimation. The proposed architecture adopts the high-resolution network (HRNet) as its backbone. We first introduce dilated convolution into the backbone to expand the receptive field. Then, an attention-based multi-scale feature fusion module is devised to modify the exchange units in HRNet, allowing the network to learn the weight of each fusion component. Finally, we design a scale-aware keypoint regressor that gradually integrates features from low to high resolution, enhancing invariance to the different scales at which body-part keypoints are estimated. We demonstrate the superiority of the proposed algorithm on two benchmark datasets: (1) the MS COCO keypoint benchmark and (2) the MPII human pose dataset. The comparison shows that our approach achieves superior results.
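
The authors' implementation is not included on this page; as a rough illustration of the attention-based multi-scale fusion described above, the following is a minimal PyTorch sketch that projects HRNet-style branches to a common width, upsamples them to the highest resolution, and weights each contribution with a learned channel-attention gate. The module names (ChannelAttention, AttentionFusion), the channel widths, the squeeze-and-excitation style gate, and the bilinear upsampling are assumptions made for illustration only, not the paper's exact design.

# Minimal sketch (not the authors' implementation) of attention-weighted
# multi-scale fusion across HRNet-style branches.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate producing per-channel weights in [0, 1]."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, C, H, W) -> per-channel weights of shape (N, C, 1, 1)
        w = self.fc(x.mean(dim=(2, 3)))
        return w.view(x.size(0), -1, 1, 1)


class AttentionFusion(nn.Module):
    """Fuse multi-resolution features into the highest-resolution branch,
    letting the network learn how much each resolution contributes."""

    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list]
        )
        self.gates = nn.ModuleList(
            [ChannelAttention(out_channels) for _ in in_channels_list]
        )

    def forward(self, features):
        # features: list of tensors, highest resolution first
        target_size = features[0].shape[2:]
        fused = 0
        for feat, proj, gate in zip(features, self.projs, self.gates):
            f = proj(feat)
            if f.shape[2:] != target_size:
                f = F.interpolate(f, size=target_size, mode="bilinear",
                                  align_corners=False)
            fused = fused + gate(f) * f  # learned weight for each fusion component
        return fused


if __name__ == "__main__":
    # Three HRNet-like branches at 1x, 1/2x, 1/4x resolution (channel widths assumed).
    feats = [torch.randn(1, 32, 64, 48),
             torch.randn(1, 64, 32, 24),
             torch.randn(1, 128, 16, 12)]
    fusion = AttentionFusion([32, 64, 128], out_channels=32)
    print(fusion(feats).shape)  # torch.Size([1, 32, 64, 48])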

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1402102), the National Natural Science Foundation of China (Nos. 61907028, 11872036, 11772178, 61971273), the Young Talent Fund of the University Association for Science and Technology in Shaanxi (No. 20200105), the China Postdoctoral Science Foundation (No. 2018M640950), the Natural Science Foundation of Shaanxi Province (Nos. 2019JQ-574, 2019GY-217, 2019ZDLSF07-01), and the Fundamental Research Funds for the Central Universities (No. GK202103114).

Author information

Corresponding authors

Correspondence to Xiaojun Wu or Yumei Zhang.

Additional information

Communicated by Q. Tian.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, H., Guo, L., Wu, X. et al. Scale-aware attention-based multi-resolution representation for multi-person pose estimation. Multimedia Systems 28, 57–67 (2022). https://doi.org/10.1007/s00530-021-00795-5
