Keypoint-based contextual representations for hand pose estimation

Li, Weiwei; Du, Rong; Chen, Shudong

doi:10.1007/s11042-023-15713-2

Published: 14 September 2023

Volume 83, pages 28357–28372 (2024)

Multimedia Tools and Applications

Abstract

Most current methods for hand pose estimation ignore the pixel-level relationships among hand keypoints; for example, the four keypoints of a single finger form a semantically continuous area at the pixel level. To make full use of the pixel-level semantic information extracted from the original RGB image, we propose a novel keypoint-based contextual representation (KCR) scheme for hand pose estimation, which leverages pixel-level continuous contextual features based on the hand structure without using any additional labeling information. To extract hand structure information from the contextual features, we design a keypoint representation and finger representation scheme that fuses the keypoint features within a specific group. A cross-attention mechanism then computes the relation between the finger representations and the contextual features to improve feature integration. The augmented feature contains more hand structure information for the final hand pose estimation. Experimental results demonstrate that our method achieves competitive performance on various 2D and 3D hand pose estimation benchmarks.
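
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch: pixel features are pooled into keypoint representations, the keypoints of each finger are fused into a finger representation, and cross-attention lets every pixel attend to the five finger representations. This is not the authors' implementation. The 21-keypoint layout (wrist plus four keypoints per finger), the heatmap-weighted pooling, the mean fusion, the residual addition, and the absence of learned projections are all assumptions made for illustration, and every function name is hypothetical.

```python
import numpy as np

# Assumed 21-keypoint hand layout: index 0 is the wrist,
# followed by four keypoints for each of the five fingers.
FINGER_GROUPS = [list(range(1 + 4 * f, 5 + 4 * f)) for f in range(5)]


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def keypoint_representations(features, heatmaps):
    """Pool pixel features into one vector per keypoint.

    features: (N, C) pixel-level contextual features, N = H * W
    heatmaps: (K, N) coarse per-pixel keypoint scores
    returns:  (K, C) keypoint representations
    """
    weights = softmax(heatmaps, axis=1)  # normalise each heatmap over pixels
    return weights @ features


def finger_representations(kp_repr, groups=FINGER_GROUPS):
    """Fuse the keypoint vectors of each finger (mean fusion is an assumption)."""
    return np.stack([kp_repr[g].mean(axis=0) for g in groups])  # (5, C)


def cross_attention_augment(features, finger_repr):
    """Pixels act as queries, finger representations as keys and values."""
    c = features.shape[1]
    scores = features @ finger_repr.T / np.sqrt(c)  # (N, 5) pixel-finger relation
    attn = softmax(scores, axis=1)                  # each pixel's weights over fingers
    context = attn @ finger_repr                    # (N, C) finger context per pixel
    return features + context, attn                 # residual augmentation (assumed)


# Toy inputs standing in for backbone features and coarse heatmaps.
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 21
feats = rng.standard_normal((H * W, C))
heat = rng.standard_normal((K, H * W))

kp = keypoint_representations(feats, heat)
fingers = finger_representations(kp)
augmented, attn = cross_attention_augment(feats, fingers)
```

In this sketch the augmented features keep the shape of the input features, so they could be passed to any downstream pose head; a trained model would normally add learned query, key, and value projections.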

Funding

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDC02070600.

Author information

Authors and Affiliations

Institute of Microelectronics of the Chinese Academy of Sciences, Beijing, 100024, China
Weiwei Li, Rong Du & Shudong Chen
School of Microelectronics, University of Chinese Academy of Sciences, Beijing, 100024, China
Weiwei Li, Rong Du & Shudong Chen

Corresponding author

Correspondence to Weiwei Li.

Ethics declarations

Conflicts of interest

The authors have no competing interests to declare that are relevant to the content of this article.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, W., Du, R. & Chen, S. Keypoint-based contextual representations for hand pose estimation. Multimed Tools Appl 83, 28357–28372 (2024). https://doi.org/10.1007/s11042-023-15713-2

Received: 17 August 2021
Revised: 22 January 2022
Accepted: 23 April 2023
Published: 14 September 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-15713-2
