
Static hand gesture recognition method based on the Vision Transformer

Published in: Multimedia Tools and Applications

Abstract

Hand gesture recognition (HGR) is an essential component of human-computer interaction (HCI). Static hand gesture recognition is equivalent to the classification of hand gesture images, which is currently performed mainly with Convolutional Neural Network (CNN) methods. The Vision Transformer (ViT) architecture dispenses with convolutional layers entirely and instead uses a multi-head attention mechanism to learn global information. This paper therefore proposes a static hand gesture recognition method based on the Vision Transformer. A self-made dataset and two publicly available American Sign Language (ASL) datasets are used to train and evaluate the ViT architecture. The depth information provided by a Microsoft Kinect camera is used to capture hand gesture images and filter out the background; an eight-connected discrimination algorithm and a distance transformation algorithm then remove redundant arm information, and the resulting images constitute the self-made dataset. The paper also studies the impact of several data augmentation strategies on recognition performance, using accuracy, F1 score, recall, and precision as evaluation metrics. The proposed model achieves validation accuracies of 99.44%, 99.37%, and 96.53% on the three datasets, respectively, outperforming several CNN architectures.
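The full paper is not included in this preview, so the following is only a minimal, self-contained sketch of the kind of classifier the abstract describes: an image is split into patches, a class token and positional embeddings are added, a multi-head self-attention encoder learns global information, and a linear head predicts the gesture class. All hyperparameters below (patch size, embedding dimension, depth, number of heads, class count) are illustrative assumptions, not the authors' configuration.

```python
# Minimal ViT-style gesture classifier sketch (PyTorch).
# Illustrative only: the paper's exact architecture, preprocessing, and
# training setup are not given in this preview.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=6, num_heads=3, num_classes=29):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: a strided convolution splits the image into
        # non-overlapping patches and projects each patch to embed_dim.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Multi-head self-attention encoder: learns global relations between
        # patches without any further convolutions.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads,
            dim_feedforward=embed_dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        b = x.shape[0]
        x = self.patch_embed(x)                  # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) patch tokens
        cls = self.cls_token.expand(b, -1, -1)   # prepend learnable class token
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))     # classify from the class token


if __name__ == "__main__":
    # Assumed class count: e.g. 29 classes as in the ASL Alphabet dataset.
    model = TinyViT(num_classes=29)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                          # torch.Size([2, 29])
```

On a held-out validation split, the reported metrics (accuracy, precision, recall, F1 score) could then be computed from the predicted and true labels, for example with scikit-learn's accuracy_score and precision_recall_fscore_support.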


Data Availability

The datasets used to support the findings of this study are freely available at the following links:

Dataset a: https://www.kaggle.com/datasets/zhangyu0123456/hand-dataset-first

Dataset b: https://www.kaggle.com/grassknoted/asl-alphabet

Dataset c: https://empslocal.ex.ac.uk/people/staff/np331/index.php?section=FingerSpellingDataset



Acknowledgements

This research was supported in part by the National Natural Science Foundation of China under Grant 51965047, in part by the Natural Science Foundation of Inner Mongolia Autonomous Region of China under Grant 2021MS06012, and in part by the Science and Technology Planning Project of Inner Mongolia Autonomous Region of China under Grant 2020GG0185.

Author information

Corresponding authors

Correspondence to Junlin Wang or Xin Wang.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, Y., Wang, J., Wang, X. et al. Static hand gesture recognition method based on the Vision Transformer. Multimed Tools Appl 82, 31309–31328 (2023). https://doi.org/10.1007/s11042-023-14732-3


