DATE: a video dataset and benchmark for dynamic hand gesture recognition

Original Article · Neural Computing and Applications

Abstract

This paper proposes a new Dynamic hAnd gesTurE (DATE) dataset for dynamic hand gesture recognition. The DATE dataset contains 13,500 videos of 22 different subjects, who wear different clothes, appear against different backgrounds, and are filmed from various camera angles. Two benchmarks for our self-built DATE dataset are also proposed: the first is a high-accuracy approach, while the second is a lightweight approach. Both benchmarks operate in two phases. In the first phase, videos are preprocessed with detection or segmentation tasks; in the second phase, the processed data are classified by customized cutting-edge deep learning models. Experimental results showed that our benchmarks obtained high accuracies on both the self-built dataset and a publicly recognized dataset.
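
As a rough illustration of this two-phase design, the sketch below crops each frame to a localized hand region (phase one) and then passes the preprocessed clip to a classifier (phase two). Every function, model, and parameter here is a hypothetical placeholder standing in for the paper's actual detection, segmentation, and classification networks, not the authors' implementation.

```python
"""Minimal sketch of a two-phase gesture pipeline: hand localization
(phase 1) followed by clip classification (phase 2). All components
are illustrative placeholders, not the DATE benchmark code."""
from typing import List, Tuple
import numpy as np

def detect_hand_bbox(frame: np.ndarray) -> Tuple[int, int, int, int]:
    # Placeholder detector: a real pipeline would run a trained hand
    # detector or segmenter here; this stub returns a dummy central box.
    h, w = frame.shape[:2]
    return w // 4, h // 4, w // 2, h // 2  # (x, y, box_w, box_h)

def preprocess(frames: List[np.ndarray]) -> List[np.ndarray]:
    # Phase 1: crop every frame to its detected hand region so the
    # classifier sees mostly the gesture, not the background.
    crops = []
    for frame in frames:
        x, y, bw, bh = detect_hand_bbox(frame)
        crops.append(frame[y:y + bh, x:x + bw])
    return crops

def classify(clip: List[np.ndarray], num_classes: int = 10) -> int:
    # Phase 2: a trained video classifier would score the preprocessed
    # clip; this stub maps mean intensity to a dummy class index.
    score = float(np.mean([c.mean() for c in clip]))
    return int(score) % num_classes

# Usage on a dummy 30-frame RGB clip:
clip = [np.zeros((240, 320, 3), dtype=np.uint8) for _ in range(30)]
label = classify(preprocess(clip))
```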

Data availability

Details of our dataset can be found online at https://users.soict.hust.edu.vn/linhdt/gestures/. Access to our dataset is available from the corresponding author upon reasonable request.

Acknowledgements

This research is funded by Hanoi University of Science and Technology (HUST) under grant number T2022-PC-052. This research is also partially supported by NAVER Corporation within the framework of collaboration with the International Research Center for Artificial Intelligence (BKAI), School of Information and Communications Technology, HUST under project NAVER.2022.DA02.

Author information

Corresponding author

Correspondence to Tuan Linh Dang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest with regard to this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dang, T.L., Pham, T.H., Dao, D.M. et al. DATE: a video dataset and benchmark for dynamic hand gesture recognition. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09990-7
