Scale-aware attention-based multi-resolution representation for multi-person pose estimation

  • Regular Paper
  • Published in Multimedia Systems

Abstract

The performance of multi-person pose estimation has improved significantly with the development of deep convolutional neural networks. However, two challenging issues remain largely ignored even though they are key factors in the deterioration of keypoint localization: the scale variation of human body parts, and the heavy information loss caused by consecutive striding followed by multiple upsampling steps. In this paper, we present a novel network, the Scale-aware attention-based Multi-resolution representation Network (SaMr-Net), which is designed to make the method robust to scale variation and to prevent the loss of detail information during upsampling, leading to more precise keypoint estimation. The proposed architecture adopts the high-resolution network (HRNet) as its backbone. We first introduce dilated convolution into the backbone to expand the receptive field. Then, an attention-based multi-scale feature fusion module is devised to modify the exchange units in HRNet, allowing the network to learn the weight of each fusion component. Finally, we design a scale-aware keypoint regressor that gradually integrates features from low to high resolution, enhancing invariance to the different scales at which body-part keypoints are estimated. We demonstrate the superiority of the proposed algorithm on two benchmark datasets: (1) the MS COCO keypoint benchmark and (2) the MPII human pose dataset. The comparison shows that our approach achieves superior results.
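
The authors' implementation is not included on this page; as a rough illustration of the attention-based multi-scale fusion described above, the following is a minimal PyTorch sketch that projects HRNet-style branches to a common width, upsamples them to the highest resolution, and weights each contribution with a learned channel-attention gate. The module names (ChannelAttention, AttentionFusion), the channel widths, the squeeze-and-excitation style gate, and the bilinear upsampling are assumptions made for illustration only, not the paper's exact design.

# Minimal sketch (not the authors' implementation) of attention-weighted
# multi-scale fusion across HRNet-style branches.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style gate producing per-channel weights in [0, 1]."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (N, C, H, W) -> per-channel weights of shape (N, C, 1, 1)
        w = self.fc(x.mean(dim=(2, 3)))
        return w.view(x.size(0), -1, 1, 1)


class AttentionFusion(nn.Module):
    """Fuse multi-resolution features into the highest-resolution branch,
    letting the network learn how much each resolution contributes."""

    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        self.projs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels_list]
        )
        self.gates = nn.ModuleList(
            [ChannelAttention(out_channels) for _ in in_channels_list]
        )

    def forward(self, features):
        # features: list of tensors, highest resolution first
        target_size = features[0].shape[2:]
        fused = 0
        for feat, proj, gate in zip(features, self.projs, self.gates):
            f = proj(feat)
            if f.shape[2:] != target_size:
                f = F.interpolate(f, size=target_size, mode="bilinear",
                                  align_corners=False)
            fused = fused + gate(f) * f  # learned weight for each fusion component
        return fused


if __name__ == "__main__":
    # Three HRNet-like branches at 1x, 1/2x, 1/4x resolution (channel widths assumed).
    feats = [torch.randn(1, 32, 64, 48),
             torch.randn(1, 64, 32, 24),
             torch.randn(1, 128, 16, 12)]
    fusion = AttentionFusion([32, 64, 128], out_channels=32)
    print(fusion(feats).shape)  # torch.Size([1, 32, 64, 48])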

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China (No. 2017YFB1402102), the National Natural Science Foundation of China (Nos. 61907028, 11872036, 11772178, 61971273), the Young Talent Fund of the University Association for Science and Technology in Shaanxi (No. 20200105), the China Postdoctoral Science Foundation (No. 2018M640950), the Natural Science Foundation of Shaanxi Province (Nos. 2019JQ-574, 2019GY-217, 2019ZDLSF07-01), and the Fundamental Research Funds for the Central Universities (No. GK202103114).

Author information

Corresponding authors

Correspondence to Xiaojun Wu or Yumei Zhang.

Additional information

Communicated by Q. Tian.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, H., Guo, L., Wu, X. et al. Scale-aware attention-based multi-resolution representation for multi-person pose estimation. Multimedia Systems 28, 57–67 (2022). https://doi.org/10.1007/s00530-021-00795-5
