
Two-stream lightweight sign language transformer

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Despite recent progress in video-based continuous sign language translation, many deep learning models are difficult to apply to real-time translation under limited computing resources. We present a two-stream lightweight sign transformer network for recognizing and translating continuous sign language. This lightweight framework captures both the static spatial information and the full-body dynamic features of the signer, and its transformer-style decoder translates sentences in real time from the spatio-temporal context around the signer. Additionally, its attention mechanism focuses on the signer's moving hands and mouth, which are often crucial for the semantic understanding of sign language. In this paper, we also introduce the Chinese Sign Language corpus of the Business Scene (CSLBS), which consists of 3080 high-quality videos and provides a strong impetus for further research on Chinese sign language translation. Experiments are carried out on PHOENIX-Weather 2014T (Camgoz et al., in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 7784–7793, 2018), the Chinese Sign Language dataset (Huang et al., in: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 2257–2264, 2018), and our CSLBS; the proposed model outperforms the state of the art in inference time and accuracy using only raw RGB and RGB-difference frames as input.
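The paper's code is not reproduced on this page, but the architecture the abstract describes lends itself to a brief illustration: one stream extracts static spatial features from raw RGB frames, a second stream captures motion from RGB-difference frames, and the fused per-frame features feed a transformer-style decoder. The PyTorch sketch below is a hypothetical rendering of that two-stream idea only; the module names, layer sizes, and fusion strategy are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stream input pipeline described in the
# abstract: a spatial stream over raw RGB frames and a motion stream over
# RGB-difference frames, fused per frame before a transformer decoder.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Lightweight per-frame CNN backbones (stand-ins for whatever
        # lightweight architecture the paper actually uses).
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.diff_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) raw RGB video clip.
        b, t, c, h, w = frames.shape
        # RGB-difference frames approximate motion cheaply (no optical
        # flow); pad with a zero frame to preserve the sequence length.
        diffs = frames[:, 1:] - frames[:, :-1]
        diffs = torch.cat([torch.zeros_like(frames[:, :1]), diffs], dim=1)
        rgb_feat = self.rgb_stream(frames.reshape(b * t, c, h, w))
        diff_feat = self.diff_stream(diffs.reshape(b * t, c, h, w))
        fused = self.fuse(torch.cat([rgb_feat, diff_feat], dim=-1))
        return fused.reshape(b, t, -1)  # per-frame features for the decoder

encoder = TwoStreamEncoder()
clip = torch.randn(2, 16, 3, 112, 112)  # 2 clips of 16 frames each
print(encoder(clip).shape)              # torch.Size([2, 16, 256])
```

A transformer decoder in the style of Vaswani et al. [25] would then attend over these per-frame features to emit the translated sentence token by token; the RGB-difference stream stands in for optical flow at a fraction of the cost, consistent with the abstract's claim of using only raw RGB and RGB-difference frames as input.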

References

  1. Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)

  2. Barros, P., Magg, S., Weber, C.: A multichannel convolutional neural network for hand posture recognition. In: International Conference on Artificial Neural Networks. Springer (2014)

  3. Zhang, J., Zhou, W., Li, H.: A threshold-based HMM-DTW approach for continuous sign language recognition. In: Proceedings of ACM International Conference on Internet Multimedia Computing and Service (2014)

  4. Koller, O., Zargaran, S., Ney, H.: Re-sign: re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)

  5. Huang, J., Zhou, W., Li, H.: Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 32, 1 (2018)

  6. Cui, R., Liu, H., Zhang, C.: A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimedia 21(7), 1880–1891 (2019)

  7. Buehler, P., Zisserman, A., Everingham, M.: Learning sign language by watching TV (using weakly aligned subtitles). In: CVPR (2009)

  8. Pfister, T., Charles, J., Zisserman, A.: Large-scale learning of sign language by watching TV (using co-occurrences). In: BMVC (2013)

  9. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 141, 108–125 (2015)

  10. Cihan Camgoz, N., Hadfield, S., Koller, O., Bowden, R.: SubUNets: end-to-end hand shape and continuous sign language recognition. In: ICCV (2017)

  11. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 7784–7793 (2018)

  12. Ko, S.K., Kim, C.J., Jung, H., et al.: Neural sign language translation based on human key-point estimation. Appl. Sci. 9(13), 2683 (2019). https://doi.org/10.3390/app9132683

  13. Camgoz, N.C., Koller, O., Hadfield, S., Bowden, R.: Sign language transformers: joint end-to-end sign language recognition and translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020) (2020)

  14. Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for continuous sign language recognition. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 13009–13016 (2020)

  15. Orbay, A., Akarun, L.: Neural sign language translation by learning tokenization. arXiv:2002.00479 (2020)

  16. Hu, H., Zhou, W., Pu, J., Li, H.: Global-local enhancement network for NMF-aware sign language recognition. ACM Trans. Multimedia Comput. Commun. Appl. 17(3), 80:1–80:19 (2021)

  17. Tang, S.G., Guo, D., Richang, H., Wang, M.: Graph-based multimodal sequential embedding for sign language translation. IEEE Trans. Multimedia (2021). https://doi.org/10.1109/TMM.2021.3117124

  18. Zhou, Z.X., Tam, V.W.L., Lam, E.Y.: SignBERT: a BERT-based deep learning framework for continuous sign language recognition. IEEE Access 9, 10 (2021). https://doi.org/10.1109/ACCESS.2021.3132668

  19. Wei, C.C., Zhao, J., Zhou, W.G., Li, H.Q.: Semantic boundary detection with reinforcement learning for continuous sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 31(3), 1138–1149 (2021)

  20. Zhou, H., Zhou, W.G., Zhou, Y., Li, H.Q.: Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Trans. Multimedia (2021)

  21. Hu, H.Z., Zhou, W., Li, H.: Hand-model-aware sign language recognition. In: AAAI Conference on Artificial Intelligence (AAAI 2021), pp. 1558–1566

  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision Pattern Recognition. IEEE Computer Society (2016)

  23. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009)

  24. Kozlov, A., Andronov, V., Gritsenko, Y.: Lightweight network architecture for real-time action recognition. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing, pp. 2074–2080 (2020) https://doi.org/10.1145/3341105.3373906

  25. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  26. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  27. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)

  28. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign language recognition based on transition-movement models. IEEE Trans. Syst. Man Cybern. 37(1), 1–9 (2007)

  29. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017)

  30. Huang, J., et al.: Video-based sign language recognition without temporal segmentation. In: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 2257–2264 (2018)

  31. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., et al.: Temporal segment networks: towards good practices for deep action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  32. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks (2014). arXiv preprint arXiv:1412.4729

  33. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015)

  34. Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)

  35. Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language (2015). arXiv preprint arXiv:1505.01861

  36. Yang, W., Tao, J., Ye, Z.: Continuous sign language recognition using level building based on fast hidden Markov model. Pattern Recognit. Lett. 78, 28–35 (2016)

  37. Zhang, J., Zhou, W., Li, H.: A threshold-based HMM-DTW approach for continuous sign language recognition. In: Proceedings of ACM International Conference on Internet Multimedia Computing and Service, p. 237 (2014)

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61973334), the Research Center of Security Video and Image Processing Engineering Technology of Guizhou (China) under Grant SRC-Open Project ([2020]001), and the Beijing Advanced Innovation Center for Intelligent Robots and Systems (China) under Grant 2018IRS20. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work. The raw/processed data required to reproduce these findings cannot be shared at this time, as the data also form part of an ongoing study.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yuming Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Y., Mei, X. & Qin, X. Two-stream lightweight sign language transformer. Machine Vision and Applications 33, 79 (2022). https://doi.org/10.1007/s00138-022-01330-w
