Video-based isolated hand sign language recognition using a deep cascaded model


In this paper, we propose an efficient cascaded model for sign language recognition from videos that exploits spatio-temporal hand-based information using deep learning approaches, specifically the Single Shot Detector (SSD), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). Our simple yet efficient and accurate model consists of two main parts: hand detection and sign recognition. Three types of spatial features, namely hand features, Extra Spatial Hand Relation (ESHR) features, and Hand Pose (HP) features, are fused in the model and fed to the LSTM for temporal feature extraction. We train the SSD model for hand detection on videos collected from five online sign dictionaries. We evaluate the model on our previously proposed dataset (Rastgoo et al., Expert Syst Appl 150: 113336, 2020), which includes 10,000 sign videos of 100 Persian signs performed by 10 contributors against 10 different backgrounds, and on the isoGD dataset. Under 5-fold cross-validation, our model outperforms state-of-the-art alternatives in sign language recognition.
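The data flow described in the abstract can be illustrated with a minimal sketch. It shows two things only: how the three per-frame spatial feature vectors might be fused into one sequence for the LSTM, and how a 5-fold split over a dataset of 10,000 videos and 100 signs could be built. Everything here is an assumption for illustration: the abstract does not say how the features are fused (plain concatenation is used below), nor whether the folds are stratified by sign, and the function names are hypothetical rather than taken from the authors' code.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def fuse_frame_features(hand: Sequence[float],
                        eshr: Sequence[float],
                        pose: Sequence[float]) -> List[float]:
    """Fuse hand, ESHR, and HP features of one frame.

    Assumption: fusion is plain concatenation; the abstract only
    states that the three feature types are fused.
    """
    return list(hand) + list(eshr) + list(pose)


def build_sequence(frames: Sequence[Tuple[Sequence[float],
                                          Sequence[float],
                                          Sequence[float]]]) -> List[List[float]]:
    """Turn per-frame (hand, eshr, pose) triples into one fused vector
    per frame, i.e. the time-ordered input a recurrent model would read."""
    return [fuse_frame_features(h, e, p) for h, e, p in frames]


def stratified_folds(labels: Sequence[int], k: int = 5) -> List[List[int]]:
    """Assign sample indices to k folds, spreading each class round-robin.

    Assumption: stratification by sign label; the abstract only says
    5-fold cross-validation is used.
    """
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        for pos, idx in enumerate(by_class[label]):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(labels: Sequence[int],
                      k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


if __name__ == "__main__":
    # Hypothetical labels: 10,000 videos, 100 signs, 100 videos per sign.
    labels = [i % 100 for i in range(10_000)]
    for train, test in train_test_splits(labels):
        print(len(train), len(test))
```

With 100 videos per sign, each held-out fold in this sketch contains 20 videos of every sign, so every sign is represented in both the training and the test portion of each split.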


  1. Acton B, Koum J (2009)

  2. Chai X, Guang L, Lin Y, Xu Zh, Tang Y, Chen X, Zhou M (2013) Sign language recognition and translation with Kinect. In: IEEE International conference on automatic face and gesture recognition (FG2013). April 22–26. Shanghai

  3. Chen Ch, Zhang B, Zhenjie H, Jiang J, Liu M, Yang Y (2017) Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features. Multimedia Tools and Applications

  4. Cooper H, Ong W-J, Pugeault N, Bowden R (2012) Sign language recognition using sub-units. J Mach Learn Res 13:2205–2231

  5. Duan J, Zhou Sh, Wan J, Guo X, Li SZ (2016) Multi-modality fusion based on consensus-voting and 3D convolution for isolated gesture recognition, arXiv:

  6. El Khattabi Z, Tabii Y, Benkaddour A (2015) Video summarization: techniques and applications. Int J Comput Inform Eng 4:9

  7. Forster, et al. (2012) RWTH-PHOENIX v1 - German sign language RWTH-PHOENIX v2

  8. Ge L, Liang H, Yuan J, Thalmann D (2018) Robust 3D hand pose estimation in single depth images: from single-view CNN to multi-view CNNs. IEEE Transactions on Image Processing

  9. Goodwyn S, Acredolo L, Brown C (2000) Impact of symbolic gesturing on early language development. Nonverbal Behavior, 81–103.

  10. He K, Zhang X, Ren Sh, Sun J (2016) Deep residual learning for image recognition. CVPR

  11. Jameson L, et al. (2004) American Sign Language

  12. Kang B, Tripathi S, Nguyen TQ (2015) Real-time sign language fingerspelling recognition using convolutional neural networks from depth map. In: 3rd IAPR Asian conference on pattern recognition (ACPR)

  13. Kapuscinski T, Oszust M, Wysocki M, Warchol D (2015) Recognition of hand gestures observed by depth cameras. International Journal of Advanced Robotic Systems

  14. Kim S, Ban Y, Lee S (2017) Tracking and classification of in-air hand gesture based on thermal guided joint filter. Sensors

  15. Koller O, Forster J, Hermann N (2015) Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput Vis Image Underst, 108–125

  16. Le TH, Jaw DW, Lin ICh, Liu HB, Huang ShCh (2018) An efficient hand detection method based on convolutional neural network. In: The 7th IEEE international symposium on next-generation electronics

  17. Liu W, Anguelov D, Erhan D, Szegedy Ch, Reed S, Fu ChY, Berg AC (2016) SSD: single shot MultiBox detector. ECCV, 21–37

  18. Miao Q, Li Y, Ouyang W, Ma Z, Xu X, Shi W, Cao X, Liu Z, Chai X, Liu Z et al (2017) Multimodal gesture recognition based on the ResC3D network. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  19. Miller J, Winn B, Winn J (2019) Signing savvy. Online dictionary

  20. Narayana P, Beveridge JR, Bruce AD (2018) Gesture recognition: focus on the hands. CVPR, 5235–5244

  21. Neverova N, Wolf Ch, Taylor GW, Nebout F (2014) Hand segmentation with structured convolutional learning. In: Asian conference on computer vision (ACCV) 2014: computer vision, pp 687–702

  22. Ong WJ, Cooper H, Pugeault N, Bowden R (2012) Sign language recognition using sequential pattern trees. CVPR

  23. Oszust M, Wysocki M (2013) Polish sign language words recognition with Kinect. In: 6th International conference on human system interactions (HSI)

  24. Pagebites Inc. (2019) United States.

  25. Pugeault N, Bowden R (2011) Spelling it out: real-time ASL fingerspelling recognition. In: Proceedings of the 1st IEEE workshop on consumer depth cameras for computer vision, jointly with ICCV’2011

  26. Rastgoo R, Kiani K, Escalera S (2018) Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy 20:809

  27. Rastgoo R, Kiani K, Escalera S (2020) Hand sign language recognition using multi-view hand skeleton. Expert Syst Appl 150:113336.

  28. Ren S, He K, Girshick R, Sun J (2015) Faster R-CNN: towards real-time object detection with region proposal networks. NIPS

  29. Ronchetti F, Quiroga F, Estrebou C, Lanzarini L (2016) Handshape recognition for Argentinian sign language using ProbSom. JCS-T

  30. Ronchetti F, Quiroga F, Estrebou C, Lanzarini LC, Rosete A (2016) LSA64: an Argentinian sign language dataset. Congreso Argentino de Ciencias de la Computación (CACIC 2016)

  31. Scogin J (2008) Texas math sign language dictionary.

  32. Simon T, Joo H, Matthews I, Sheikh Y (2017) Hand keypoint detection in single images using multi-view bootstrapping. arXiv:

  33. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. arXiv:

  34. Sun A, Wei Y, Liang S, Tang X, Sun J (2015) Cascaded hand pose regression. CVPR, 824–832

  35. Thangali A, Nash J, Sclaroff S, Neidle C (2011) Exploiting phonological constraints for handshape inference in ASL video. CVPR

  36. Wang H, Wang P, Song Z, Li W (2017) Large-scale multimodal gesture recognition using heterogeneous networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  37. William V (2013) American sign language. William Vicars Publisher

  38. Yan Sh, Xia Y, Smith JS, Lu W, Zhang B (2017) Multi-scale convolutional neural networks for hand detection. Applied Computational Intelligence and Soft Computing

  39. Zhang L, Zhu G, Shen P, Song J, Shah SA, Bennamoun M (2017) Learning spatiotemporal features using 3DCNN and convolutional LSTM for gesture recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition

  40. Zhou Y, Lu J, Lin X, Sun Y, Ma X (2018) HBE: hand branch ensemble network for real-time 3D Hand Pose Estimation. ECCV

  41. Zimmermann Ch, Brox T (2017) Learning to estimate 3D hand pose from single RGB images. ICCV


This work is partially supported by the Spanish project TIN2016-74946-P (MINECO/FEDER,UE), CERCA Programme /Generalitat de Catalunya, ICREA under the ICREA Academia programme, and High Intelligent Solution (HIS) company of Iran. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan XP GPU used for this research. Also, we would like to thank two deaf centers of Iran (Semnan and Tehran) and the Computer Vision Center (CVC) of Spain for their collaborations.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Kourosh Kiani.

Ethics declarations

Conflict of interests

The authors certify that they have no conflict of interest.

About this article

Cite this article

Rastgoo, R., Kiani, K. & Escalera, S. Video-based isolated hand sign language recognition using a deep cascaded model. Multimed Tools Appl 79, 22965–22987 (2020).


  • Sign language
  • Deep learning
  • Dataset
  • Single Shot Detector (SSD)
  • Hand pose
  • Video