Abstract
Hard-joint localization in human pose estimation is challenging for several reasons: joints may be hidden by clothing or poor lighting, occluded in complex environments, or cut off from the dependencies that normally link them to other joints. Most existing approaches to hard-joint pose estimation achieve high accuracy by extracting more high-level feature information. However, most networks lose information during down-sampling, which degrades joint location information. Compensating for this loss introduces irrelevant information into network learning, which hinders the extraction of useful information associated with hard joints. Herein, a residual down-sampling module is proposed to address the information loss; it replaces the pooling layer for down-sampling and fuses high-level features with low-resolution feature maps. In addition, a strategy based on the attention mechanism is proposed to guide network learning so that the network focuses on useful feature information: a convolutional block attention module is combined with a residual module outside the basic sub-network, allowing the network to learn more effective high-level features. An eight-stack hourglass network is used as the basic network, and the proposed method is validated on the MPII Human Pose and LSP datasets. Compared with the eight-stack hourglass network and HRNet, the proposed method achieves higher accuracy for hard-joint localization, which shows that it is effective for this task.
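The idea of replacing a pooling layer with a residual down-sampling step can be illustrated with a minimal sketch. The abstract does not specify the module's internal layout, so everything below is an assumption made for illustration: a single-channel feature map, a main branch built from a 3x3 stride-2 convolution followed by ReLU, and a ResNet-style shortcut built from a 1x1 stride-2 projection (here a scalar `w_skip`). The function names are hypothetical and do not come from the paper.

```python
def strided_conv3x3(fmap, kernel, stride=2):
    """3x3 convolution with zero padding 1 on a single-channel map (list of lists)."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h, stride):
        row = []
        for j in range(0, w, stride):
            acc = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:  # positions outside the map count as zero
                        acc += kernel[di + 1][dj + 1] * fmap[y][x]
            row.append(acc)
        out.append(row)
    return out


def residual_downsample(fmap, kernel, w_skip=1.0):
    """Halve the resolution with a learned branch plus a shortcut (assumed layout).

    Main branch: 3x3 stride-2 convolution followed by ReLU.
    Shortcut:    1x1 stride-2 projection, which for one channel is a scalar weight.
    """
    main = [[max(v, 0.0) for v in row] for row in strided_conv3x3(fmap, kernel)]
    skip = [[w_skip * fmap[i][j] for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]
    # Summing the two branches lets the shortcut carry input information
    # across the resolution change even when the learned branch outputs zero.
    return [[m + s for m, s in zip(m_row, s_row)] for m_row, s_row in zip(main, skip)]


# Toy 4x4 feature map and two hand-picked kernels.
feature_map = [[0.0, 1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0, 7.0],
               [8.0, 9.0, 10.0, 11.0],
               [12.0, 13.0, 14.0, 15.0]]
zero_kernel = [[0.0] * 3 for _ in range(3)]
center_kernel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

shortcut_only = residual_downsample(feature_map, zero_kernel)
both_branches = residual_downsample(feature_map, center_kernel)
```

With the zero kernel the output is just the sub-sampled input, which shows that the shortcut alone preserves information through the down-sampling step; in a real network the kernel and projection would be learned, multi-channel, and typically combined with normalization.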
Availability of data and materials
The datasets generated during and/or analyzed during the current study are available in the MPII Human Pose Dataset, http://human-pose.mpi-inf.mpg.de/.
Author information
Contributions
Qiaoning Yang: Conceptualization, Methodology, Writing (review and editing); Weimin Shi: Software, Data curation, Writing (original draft); Juan Chen: Supervision, Validation; Yang Tang: Data preprocessing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file1 (MP4 40676 kb)
About this article
Cite this article
Yang, Q., Shi, W., Chen, J. et al. Localization of hard joints in human pose estimation based on residual down-sampling and attention mechanism. Vis Comput 38, 2447–2459 (2022). https://doi.org/10.1007/s00371-021-02122-5
DOI: https://doi.org/10.1007/s00371-021-02122-5