
Robust vision-based glove pose estimation for both hands in virtual reality

  • Original Article, published in Virtual Reality

Abstract

In virtual reality (VR) applications, haptic gloves provide feedback and more direct control than bare hands do. Most VR gloves contain flex and inertial measurement sensors for tracking the finger joints of a single hand; however, they lack a mechanism for tracking two-hand interactions. In this paper, a vision-based method is proposed for improved two-handed glove tracking that requires only one camera attached to a VR headset. A photorealistic glove data generation framework was established to synthesize large quantities of training data for identifying the left glove, the right glove, or both gloves in images with complex backgrounds. We also incorporated a “glove pose hypothesis” into the training stage, in which spatial cues regarding relative joint positions are exploited to predict glove positions accurately under severe self-occlusion or motion blur. In our experiments, a system based on the proposed method achieved 94.06% accuracy on a validation set and high-speed tracking at 65 fps on a consumer graphics processing unit.
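The “glove pose hypothesis” amounts to using relative joint positions as an additional training signal. The sketch below is a hypothetical illustration of that idea, not the paper's implementation: it augments a standard per-joint regression loss with an auxiliary term on the offsets between skeleton-connected joints, so that an occluded or blurred joint is constrained through its visible neighbours. The function name, joint pairing, and weighting are assumptions.

    # Illustrative auxiliary loss inspired by the "glove pose hypothesis":
    # penalize errors in relative joint offsets in addition to absolute
    # joint positions. All names and weights here are assumptions.
    import torch
    import torch.nn.functional as F

    def glove_pose_loss(pred: torch.Tensor,
                        target: torch.Tensor,
                        joint_pairs,
                        rel_weight: float = 0.5) -> torch.Tensor:
        """pred, target: (batch, num_joints, 2) predicted / ground-truth
        image-space joint positions; joint_pairs: (i, j) skeleton edges."""
        # Absolute term: robust smooth-L1 (Huber-style) error per joint.
        abs_loss = F.smooth_l1_loss(pred, target)
        # Relative term: offsets between connected joints should also match,
        # which constrains occluded joints through their visible neighbours.
        i, j = zip(*joint_pairs)
        pred_rel = pred[:, list(i)] - pred[:, list(j)]
        target_rel = target[:, list(i)] - target[:, list(j)]
        rel_loss = F.smooth_l1_loss(pred_rel, target_rel)
        return abs_loss + rel_weight * rel_loss

Under these assumptions the relative term shapes training only and adds no inference cost, which is compatible with the high-speed tracking the abstract reports.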

Data availability

The authors confirm that the data supporting the findings of this study are available within the article.


Acknowledgements

This study was supported by the Industrial Technology Research Institute and the National Science and Technology Council, Taiwan (Grant Numbers NSTC 111-2222-E-A49-008 and NSTC 112-2221-E-A49-129).

Author information

Corresponding author

Correspondence to Fu-Song Hsu.

Ethics declarations

Conflict of interest

The authors have no relevant financial or nonfinancial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (MP4, 83,881 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Hsu, FS., Wang, TM. & Chen, LH. Robust vision-based glove pose estimation for both hands in virtual reality. Virtual Reality 27, 3133–3148 (2023). https://doi.org/10.1007/s10055-023-00860-6
