
Static hand gesture recognition method based on the Vision Transformer

Published in: Multimedia Tools and Applications

Abstract

Hand gesture recognition (HGR) is an essential component of human-computer interaction (HCI). Static hand gesture recognition is equivalent to the classification of hand gesture images, which is currently performed mainly with Convolutional Neural Network (CNN) methods. The Vision Transformer (ViT) architecture dispenses with convolutional layers entirely and instead uses a multi-head attention mechanism to learn global information. This paper therefore proposes a static hand gesture recognition method based on the Vision Transformer. A self-made dataset and two publicly available American Sign Language (ASL) datasets are used to train and evaluate the ViT architecture. The depth information provided by a Microsoft Kinect camera is used to capture hand gesture images and filter out the background; an eight-connected discrimination algorithm and a distance transformation algorithm then remove redundant arm information, and the resulting images constitute the self-made dataset. The paper also studies the impact of several data augmentation strategies on recognition performance, using accuracy, F1 score, recall, and precision as evaluation metrics. The proposed model achieves validation accuracies of 99.44%, 99.37%, and 96.53% on the three datasets, respectively, outperforming several CNN architectures.
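The full paper is not included in this preview, so the following is only a minimal, self-contained sketch of the kind of classifier the abstract describes: an image is split into patches, a class token and positional embeddings are added, a multi-head self-attention encoder learns global information, and a linear head predicts the gesture class. All hyperparameters below (patch size, embedding dimension, depth, number of heads, class count) are illustrative assumptions, not the authors' configuration.

```python
# Minimal ViT-style gesture classifier sketch (PyTorch).
# Illustrative only: the paper's exact architecture, preprocessing, and
# training setup are not given in this preview.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=6, num_heads=3, num_classes=29):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each patch to embed_dim.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Multi-head self-attention encoder: learns global relations between
        # patches without any further convolutions.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) patch tokens
        cls = self.cls_token.expand(b, -1, -1)   # prepend learnable class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))     # classify from the class token


if __name__ == "__main__":
    # Assumed class count: e.g. 29 classes as in the ASL Alphabet dataset.
    model = TinyViT(num_classes=29)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                          # torch.Size([2, 29])
```

On a held-out validation split, the reported metrics (accuracy, precision, recall, F1 score) could then be computed from the predicted and true labels, for example with scikit-learn's accuracy_score and precision_recall_fscore_support.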


Data Availability

The datasets used to support the findings of this study are freely available at the following links:

Dataset a: https://www.kaggle.com/datasets/zhangyu0123456/hand-dataset-first

Dataset b: https://www.kaggle.com/grassknoted/asl-alphabet

Dataset c: https://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset



Acknowledgements

This research was supported in part by the National Natural Science Foundation of China under Grant 51965047, in part by the Natural Science Foundation of Inner Mongolia Autonomous Region of China under Grant 2021MS06012, and in part by the Science and Technology Planning Project of Inner Mongolia Autonomous Region of China under Grant 2020GG0185.

Author information

Corresponding authors

Correspondence to Junlin Wang or Xin Wang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Y., Wang, J., Wang, X. et al. Static hand gesture recognition method based on the Vision Transformer. Multimed Tools Appl 82, 31309–31328 (2023). https://doi.org/10.1007/s11042-023-14732-3


