Abstract
3D Hand Pose Estimation (HPE) is the task of estimating the 3D joints of the hand from static images or dynamic video frames, and it is an important topic in human-computer interaction and computer vision. Although the rise of deep learning has substantially improved accuracy, previous CNN-based methods still suffer from two inherent drawbacks. On the one hand, real-valued CNNs break the original structure of RGB images and learn color feature dependencies and hidden relations in an entangled way, which makes prediction difficult when skin color is similar. On the other hand, the feature representation of each pixel is learned only by local convolution, while global context information is ignored. To alleviate these problems, we propose a deep learning architecture, named Quaternion Multi-Graph Reasoning Network (QMGR-Net), which contains two novel modules: the Quaternion Separation Learning Module (RSLM) and the Global Quaternion Reasoning Module (GQRM). The RSLM learns color feature dependencies and hidden relations, such as edges or shapes, at different levels, with features learned in the quaternion space. The GQRM explicitly captures the interactions among hand joints and models the relationships between non-adjacent nodes. Compared with other state-of-the-art (SOTA) methods on the STB dataset and the RHD dataset, we achieve the best results, with 7.27 EPE and 18.59 EPE, respectively, which demonstrates the effectiveness of our method.
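To make two of the ideas above concrete, the following is a minimal sketch and not the authors' implementation. It assumes the encoding commonly used in quaternion CNNs, in which an RGB pixel becomes the pure quaternion 0 + Ri + Gj + Bk, and it assumes that EPE denotes the mean Euclidean distance between predicted and ground-truth joints; the abstract states neither definition. The function names `rgb_to_quaternion`, `hamilton_product`, and `mean_epe` are hypothetical.

```python
import math


def rgb_to_quaternion(r, g, b):
    """Encode an RGB pixel as a pure quaternion (w, x, y, z) = (0, R, G, B).

    Assumption: this is the usual encoding in quaternion CNNs; the abstract
    does not say how QMGR-Net maps pixels into the quaternion space.
    """
    return (0.0, r, g, b)


def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (w, x, y, z) tuples.

    The product mixes all four components, so the three color channels are
    processed jointly instead of as independent real-valued inputs.
    """
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return (
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    )


def mean_epe(pred_joints, gt_joints):
    """Mean end-point error: average Euclidean distance over joints.

    Assumption: both arguments are equal-length sequences of (x, y, z)
    coordinates expressed in the same unit.
    """
    distances = [math.dist(p, g) for p, g in zip(pred_joints, gt_joints)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    pixel = rgb_to_quaternion(0.8, 0.6, 0.5)
    weight = (0.5, 0.5, 0.5, 0.5)  # hypothetical quaternion filter weight
    print(hamilton_product(weight, pixel))
    print(mean_epe([(0, 0, 0), (3, 4, 0)], [(0, 0, 0), (0, 0, 0)]))
```

The Hamilton product is not commutative (i·j = k, but j·i = −k), so the order of weight and input matters in a quaternion layer. The EPE helper only illustrates how a single score such as those reported above could be computed from joint coordinates.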
Acknowledgements
This work was supported in part by the National Key Research and Development Project under Grant 2018YFB1802400 and by the Key-Area Research and Development Program of Guangdong Province under Grants 2019B010121001 and 2019B010118001.
About this article
Cite this article
Ni, H., Xie, S., Xu, P. et al. QMGR-Net: quaternion multi-graph reasoning network for 3D hand pose estimation. Int. J. Mach. Learn. & Cyber. 14, 4029–4045 (2023). https://doi.org/10.1007/s13042-023-01879-6
DOI: https://doi.org/10.1007/s13042-023-01879-6