Fixed-resolution representation network for human pose estimation

  • Regular Paper
  • Multimedia Systems

Abstract

Human pose estimation from a single image is a fundamental yet challenging task in computer vision. Most existing methods, such as Hourglass, HRNet, and their variants, progressively downsample high-resolution features to lower resolutions, then recover a higher resolution from the low-resolution features and use it to generate the final pose heatmaps. In this paper, we propose a novel architecture, the fixed-resolution representation network for human pose estimation, which maintains a fixed resolution throughout the whole process to preserve rich spatial-structural information. First, we propose an Improved Pyramid Convolutional Bottleneck (IPCB) that encodes feature maps with multiple receptive fields at the same resolution. Second, we introduce an efficient channel attention mechanism to enhance the feature extraction and information selection capability of IPCB. Third, to address the deviation introduced by the flip test at inference, we adopt an existing technique, Unbiased Data Processing. Fourth, because of the change in model structure and limited computing resources, we introduce an iterative retraining strategy to address the pre-training problem. We empirically demonstrate the effectiveness of our method: with 1.7M parameters and 3G FLOPs, it achieves 89.5 PCKh@0.5 on MPII and 92.7 PCK@0.2 on LSP, which is competitive with state-of-the-art methods on these benchmark keypoint detection datasets.
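
The two accuracy figures in the abstract are PCK-style metrics: a predicted joint counts as correct when its distance to the ground-truth joint is at most a fraction (0.5 or 0.2) of a per-person reference length, conventionally the head size for PCKh and the torso size for PCK. The following is a minimal sketch of that computation, not the official MPII or LSP evaluation code. The function name `pck`, its argument layout, and the toy coordinates are assumptions made for illustration, and the handling of unannotated joints used by the official protocols is omitted.

```python
import math


def pck(pred, gt, ref_lengths, alpha):
    """Percentage of Correct Keypoints (illustrative sketch).

    pred, gt     : list of poses; each pose is a list of (x, y) joints.
    ref_lengths  : one normalising length per pose, supplied by the caller
                   (head size for PCKh, torso size for PCK).
    alpha        : threshold fraction, e.g. 0.5 for PCKh@0.5.
    """
    correct = 0
    total = 0
    for pred_pose, gt_pose, ref in zip(pred, gt, ref_lengths):
        for (px, py), (gx, gy) in zip(pred_pose, gt_pose):
            # A joint is correct if its error is within alpha * reference length.
            dist = math.hypot(px - gx, py - gy)
            if dist <= alpha * ref:
                correct += 1
            total += 1
    if total == 0:
        raise ValueError("no joints to evaluate")
    return 100.0 * correct / total


# Toy example (made-up coordinates): one person, two joints, head size 10.
gt_poses = [[(10.0, 10.0), (20.0, 20.0)]]
pred_poses = [[(11.0, 10.0), (30.0, 20.0)]]
# Threshold is 0.5 * 10 = 5 pixels; the errors are 1 and 10, so 1 of 2 is correct.
print(pck(pred_poses, gt_poses, [10.0], alpha=0.5))  # 50.0
```

Passing the reference length in explicitly keeps the same function usable for both metrics; only the caller's choice of normalising length and `alpha` changes.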

References

  1. Wang, C., Wang, Y., Yuille, A.L.: An approach to pose-based action recognition. In: CVPR, pp. 915–922 (2013)

  2. Zheng, L., Huang, Y., Lu, H., Yang, Y.: Pose invariant embedding for deep person re-identification. IEEE Trans. Image Process. 28, 4500–4509 (2019)

  3. Zhang, Z.: Microsoft kinect sensor and its effect. IEEE MultiMedia 19, 4–10 (2012)

  4. Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware Siamese networks for visual object tracking. In: ECCV, pp. 103–119 (2018)

  5. Li, Y., Chen, X., Zhu, Z., Xie, L., Huang, G., Du, D., Wang, X.: Attention-guided unified network for panoptic segmentation. In: CVPR, pp. 7019–7028 (2019)

  6. Zhu, J., Zou, W., Xu, L., Hu, Y., Zhu, Z., Chang, M., Huang, J., Huang, G., Du, D.: Action machine: rethinking action recognition in trimmed videos. In: arXiv (2018)

  7. Zhu, J., Zou, W., Zhu, Z., Hu, Y.: Convolutional relation network for skeleton-based action recognition. Neurocomputing 370, 109–117 (2019)

  8. Zhu, J., Zou, W., Zhu, Z.: End-to-end video-level representation learning for action recognition. In: ICPR, pp. 645–650 (2018)

  9. Zhu, J., Zou, W., Zhu, Z.: Two-stream gated fusion convnets for action recognition. In: ICPR, pp. 597–602 (2018)

  10. Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. NIPS 27, 1799–1807 (2014)

  11. Toshev, A., Szegedy, C.: DeepPose: human pose estimation via deep neural networks. In: CVPR, pp. 1653–1660 (2014)

  12. Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. ECCV 9912, 483–499 (2016)

  13. Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. In: CVPR, pp. 4733–4742 (2016)

  14. Wei, S., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR, pp. 4724–4732 (2016)

  15. Chen, Y., Tian, Y., He, M.: Monocular human pose estimation: a survey of deep learning-based methods. Comput. Vis. Image Underst. 192, 102897 (2020)

  16. Yang, W., Li, S., Ouyang, W., Li, H., Wang, X.: Learning feature pyramids for human pose estimation. ICCV 27, 1799–1807 (2017)

  17. Rafi, U., Leibe, B., Gall, J., Kostrikov, I.: An efficient convolutional network for human pose estimation. In: BMVC (2016)

  18. Belagiannis, V., Zisserman, A.: Recurrent human pose estimation. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017), pp. 468–475 (2017)

  19. Bulat, A., Tzimiropoulos, G.: Human pose estimation via convolutional part heatmap regression. ECCV 9911, 717–732 (2016)

  20. Nie, X., Feng, J., Zuo, Y., Yan, S.: Human pose estimation with parsing induced learner. In: CVPR (2018)

  21. Zhang, F., Zhu, X., Ye, M.: Fast human pose estimation. In: CVPR, pp. 3512–3521 (2019)

  22. Ke, L., Chang, M.-C., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. In: ECCV (2018)

  23. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: CVPR, pp. 5686–5696 (2019)

  24. Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L.: HigherHRNet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR, pp. 5385–5394 (2020)

  25. Cai, Y., Wang, Z., Luo, Z., Yin, B., Du, A., Wang, H., Zhang, X., Zhou, X., Zhou, E., Sun, J.: Learning delicate local representations for multi-person pose estimation. ECCV 12348, 455–472 (2020)

  26. Kim, S.-T., Lee, H.J.: Lightweight stacked hourglass network for human pose estimation. Appl. Sci. 10 (2020)

  27. Yang, L., Qin, Y., Zhang, X.: Lightweight densely connected residual network for human pose estimation. J. Real-Time Image Process. 18, 825–827 (2021)

  28. Xiao, Y., Yu, D., Wang, X., Lv, T., Fan, Y., Wu, L.: SPCNet: spatial preserve and content-aware network for human pose estimation. In: European Conference on Artificial Intelligence, pp. 2776–2783 (2020)

  29. Yu, C., Xiao, B., Gao, C., et al.: Lite-HRNet: a lightweight high-resolution network. In: CVPR, pp. 10440–10450 (2021)

  30. Zhang, F., Zhu, X., Ye, M.: Fast human pose estimation. In: CVPR, pp. 3517–3526 (2019)

  31. Ren, Z., Zhou, Y., Chen, Y., et al.: Efficient human pose estimation by maximizing fusion and high-level spatial attention. In: arXiv (2021)

  32. Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: CVPR, pp. 648–656 (2015)

  33. Hou, L., Cao, J., Zhao, Y., et al.: \(P^{2}\) Net: augmented parallel-pyramid net for attention guided pose estimation. In: ICPR, pp. 9658–9665 (2020)

  34. Yang, H., Guo, L., Wu, X., et al.: Scale-aware attention-based multi-resolution representation for multi-person pose estimation. In: Multimedia Systems (2021)

  35. Artacho, B., Savakis, A.: OmniPose: a multi-scale framework for multi-person pose estimation. In: arXiv (2021)

  36. Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. In: CVPR, pp. 6450–6458 (2017)

  37. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. In: CVPR, pp. 7132–7141 (2018)

  38. Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv:1412.7755 (2014)

  39. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: CVPR, pp. 21–29 (2016)

  40. Chu, X., Yang, W., Ouyang, W., Ma, C., Yuille, A.L., Wang, X.: Multi-context attention for human pose estimation. In: CVPR, pp. 5669–5678 (2017)

  41. Su, K., Yu, D., Xu, Z., Geng, X., Wang, C.: Multi-person pose estimation with enhanced channel-wise and spatial information. In: CVPR, pp. 5674–5682 (2019)

  42. Yuan, Y., Fu, R., Huang, L., et al.: HRFormer: high-resolution transformer for dense prediction. In: arXiv (2021)

  43. Huang, L., Yuan, Y., Guo, J., et al.: Interlaced sparse self-attention for semantic segmentation. In: arXiv (2019)

  44. Luo, Z., Wang, Z., Cai, Y., et al.: Efficient human pose estimation by learning deeply aggregated representations. In: arXiv (2020)

  45. Wang, Q., Wu, B., Zhu, P., Li, P., Zuo, W., Hu, Q.: ECA-Net: efficient channel attention for deep convolutional neural networks. CVPR 9912, 7132–7141 (2020)

  46. Sun, X., Xiao, B., Wei, F., et al.: Integral human pose regression. In: ECCV, pp. 536–553 (2018)

  47. Zhang, F., Zhu, X., Dai, H., et al.: Distribution-aware coordinate representation for human pose estimation. In: CVPR, pp. 7091–7100 (2020)

  48. Huang, J., Zhu, Z., Guo, F., Huang, G.: The devil is in the details: delving into unbiased data processing for human pose estimation. In: CVPR, pp. 5699–5708 (2020)

  49. Xiao, B., Wu, H., Wei, Y.: Simple baselines for human pose estimation and tracking. In: ECCV, pp. 472–487 (2018)

  50. Zhang, Z., Tang, J., Wu, G.: Simple and lightweight human pose estimation. In: arXiv (2020)

  51. Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: CVPR, pp. 7103–7112 (2018)

  52. Duta, I.C., Liu, L., Zhu, F., Shao, L.: Pyramidal convolution: rethinking convolutional neural network for visual recognition. In: arXiv (2020)

  53. Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2D human pose estimation: new benchmark and state of the art analysis. In: arXiv (2020)

  54. Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: British Machine Vision Conference (2010)

  55. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Computer Science, vol. 12 (2014)

  56. Peng, X., Tang, Z., Yang, F., Feris, R., Metaxas, D.: Jointly optimize data augmentation and network training: adversarial data augmentation in human pose estimation. In: CVPR, pp. 2226–2234 (2018)

  57. Su, Z., Ye, M., Zhang, G., Dai, L., Sheng, J.: Cascade feature aggregation for human pose estimation. arXiv:1902.07837 (2019)

  58. Bin, Y., Cao, X., Chen, X., Ge, Y., Tai, Y., Wang, C., Li, J., Huang, F., Gao, C., Sang, N.: Adversarial semantic data augmentation for human pose estimation. In: ECCV (2020)

  59. Chen, X., Yuille, A.L.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: Advances in neural information processing systems (2014)

  60. Ning, G., Zhang, Z., He, Z.: Knowledge-guided deep fractal neural networks for human pose estimation. IEEE Trans. Multim. 20, 1246–1259 (2018)

  61. Bulat, A., Kossaifi, J., Tzimiropoulos, G., Pantic, M.: Toward fast and accurate human pose estimation via soft-gated skip connections. In: FG, pp. 8–15 (2020)

Author information

Corresponding author

Correspondence to Yongxiang Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by I. Bartolini.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, Y., Hou, X. Fixed-resolution representation network for human pose estimation. Multimedia Systems 28, 1597–1609 (2022). https://doi.org/10.1007/s00530-022-00919-5

  • DOI: https://doi.org/10.1007/s00530-022-00919-5
