Keypoint-based contextual representations for hand pose estimation

Multimedia Tools and Applications

Abstract

Most current methods for hand pose estimation ignore the pixel-level relationships among hand keypoints; for example, the four keypoints of a single finger form a semantically continuous region at the pixel level. To make full use of the pixel-level semantic information extracted from the original RGB image, we propose a novel keypoint-based contextual representation (KCR) scheme for hand pose estimation, which leverages pixel-level continuous contextual features based on the hand structure without requiring any additional labeling information. To extract hand structure information from the contextual features, we design keypoint and finger representations by fusing the keypoint features within a specific group. A cross-attention mechanism then computes the relation between the finger representations and the contextual features to improve feature integration. The resulting augmented feature carries more hand structure information for the final hand pose estimation. Experimental results demonstrate that our method achieves competitive performance on various 2D and 3D hand pose estimation benchmarks.
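
The pipeline outlined in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 21-keypoint layout (wrist plus four keypoints per finger), the heatmap-weighted pooling, the mean fusion within each finger group, the residual addition, and all function names are assumptions made for illustration only.

```python
import numpy as np

# Assumed 21-keypoint layout: index 0 is the wrist, then four keypoints per finger.
FINGER_GROUPS = [list(range(1 + 4 * f, 5 + 4 * f)) for f in range(5)]


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def keypoint_representations(features, heatmaps):
    """Pool pixel features into one vector per keypoint (assumed scheme).

    features: (C, H, W) contextual feature map
    heatmaps: (K, H, W) coarse keypoint heatmaps
    returns:  (K, C) keypoint representations
    """
    C = features.shape[0]
    K = heatmaps.shape[0]
    pixels = features.reshape(C, -1).T                   # (HW, C)
    weights = softmax(heatmaps.reshape(K, -1), axis=1)   # (K, HW), rows sum to 1
    return weights @ pixels                              # (K, C)


def finger_representations(kp_repr, groups=FINGER_GROUPS):
    """Fuse the keypoints of each finger by averaging (assumed fusion)."""
    return np.stack([kp_repr[g].mean(axis=0) for g in groups])  # (F, C)


def cross_attention(features, finger_repr):
    """Pixels act as queries; finger representations act as keys and values."""
    C, H, W = features.shape
    queries = features.reshape(C, -1).T                  # (HW, C)
    scores = queries @ finger_repr.T / np.sqrt(C)        # (HW, F)
    attn = softmax(scores, axis=1)                       # attention over fingers
    context = attn @ finger_repr                         # (HW, C)
    augmented = queries + context                        # residual add (assumption)
    return augmented.T.reshape(C, H, W), attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((8, 16, 16))    # toy feature map
    heatmaps = rng.standard_normal((21, 16, 16))   # toy keypoint heatmaps
    fingers = finger_representations(keypoint_representations(features, heatmaps))
    augmented, attn = cross_attention(features, fingers)
    print(augmented.shape, attn.shape)
```

In this sketch every pixel attends over the five finger vectors, so the augmented map keeps the spatial resolution of the input features while mixing in finger-level structure. A learned model would normally add projection layers for queries, keys, and values, which are omitted here for brevity.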

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDC02070600.

Author information

Corresponding author

Correspondence to Weiwei Li.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, W., Du, R. & Chen, S. Keypoint-based contextual representations for hand pose estimation. Multimed Tools Appl 83, 28357–28372 (2024). https://doi.org/10.1007/s11042-023-15713-2

  • DOI: https://doi.org/10.1007/s11042-023-15713-2
