Abstract
Most current hand pose estimation methods ignore the pixel-level relationships among hand keypoints; for example, the four keypoints of a single finger can form a semantically continuous region at the pixel level. To make full use of the pixel-level semantic information extracted from the original RGB image, we propose a novel keypoint-based contextual representation (KCR) scheme for hand pose estimation, which leverages pixel-level continuous contextual features based on the hand structure without using any additional labeling information. To extract hand structure information from the contextual features, we design a keypoint representation and a finger representation scheme by fusing the features of the keypoints in each specific group. A cross-attention mechanism then computes the relation between the finger representations and the contextual features to improve feature integration. The augmented features contain more hand structure information for the final hand pose estimation. Experimental results demonstrate that our method achieves competitive performance on various 2D and 3D hand pose estimation benchmarks.
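The pipeline outlined in the abstract (keypoint features fused into finger representations, then cross-attention against the contextual features) can be illustrated with a minimal NumPy sketch. The abstract gives no architectural details, so everything concrete here is an assumption made for illustration: the 21-keypoint layout (wrist at index 0, then four keypoints per finger), heatmap-weighted pooling to obtain keypoint features, mean fusion within a finger, pixels as queries with fingers as keys and values, and concatenation as the final integration step. All function names are hypothetical and do not come from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# ASSUMPTION: index 0 is the wrist, followed by four keypoints per finger.
FINGER_GROUPS = [list(range(1 + 4 * f, 5 + 4 * f)) for f in range(5)]

def keypoint_representations(features, heatmaps):
    """Pool one feature vector per keypoint (assumed heatmap-weighted average).

    features: (C, H, W) contextual feature map
    heatmaps: (K, H, W) coarse per-keypoint score maps
    returns:  (K, C)
    """
    C, H, W = features.shape
    pixels = features.reshape(C, H * W).T                           # (HW, C)
    weights = softmax(heatmaps.reshape(len(heatmaps), -1), axis=1)  # (K, HW)
    return weights @ pixels

def finger_representations(kp_repr, groups=FINGER_GROUPS):
    """Fuse the keypoints of each finger; mean fusion is an assumption."""
    return np.stack([kp_repr[g].mean(axis=0) for g in groups])     # (5, C)

def cross_attention(features, finger_repr):
    """Scaled dot-product attention: pixels are queries, fingers are keys/values.

    returns the augmented map (2C, H, W) and the attention weights (HW, 5).
    """
    C, H, W = features.shape
    queries = features.reshape(C, H * W).T                          # (HW, C)
    attn = softmax(queries @ finger_repr.T / np.sqrt(C), axis=1)    # (HW, 5)
    context = attn @ finger_repr                                    # (HW, C)
    # ASSUMPTION: integrate by concatenating original and attended features.
    augmented = np.concatenate([queries, context], axis=1)          # (HW, 2C)
    return augmented.T.reshape(2 * C, H, W), attn

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16, 16))      # toy feature map
    hms = rng.normal(size=(21, 16, 16))       # toy keypoint heatmaps
    fingers = finger_representations(keypoint_representations(feats, hms))
    augmented, attn = cross_attention(feats, fingers)
    print(augmented.shape, attn.shape)
```

In this sketch each pixel receives a soft assignment over the five fingers, so pixels lying on the same finger tend to share the same attended context vector, which is one plausible reading of the "semantically continuous area" described above. A trained model would normally add learned projections for queries, keys, and values; they are omitted here to keep the example short.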
Funding
This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDC02070600.
Ethics declarations
Conflicts of interest
The authors have no competing interests to declare that are relevant to the content of this article.
About this article
Cite this article
Li, W., Du, R. & Chen, S. Keypoint-based contextual representations for hand pose estimation. Multimed Tools Appl 83, 28357–28372 (2024). https://doi.org/10.1007/s11042-023-15713-2
DOI: https://doi.org/10.1007/s11042-023-15713-2