
Two-stream lightweight sign language transformer

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Despite recent progress in video-based continuous sign language translation, many deep learning models are difficult to apply to real-time translation under limited computing resources. We present a two-stream lightweight sign transformer network for recognizing and translating continuous sign language. This lightweight framework captures both the static spatial information and the full-body dynamic features of the signer, and its transformer-style decoder translates sentences in real time from the spatio-temporal context around the signer. Additionally, its attention mechanism focuses on the signer's moving hands and mouth, which are often crucial for the semantic understanding of sign language. In this paper, we also introduce the Chinese Sign Language corpus of the Business Scene (CSLBS), which consists of 3080 high-quality videos and provides a strong impetus for further research on Chinese sign language translation. Experiments are carried out on PHOENIX-Weather 2014T (Camgoz et al., in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 7784–7793, 2018), the Chinese Sign Language dataset (Huang et al., in: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 2257–2264, 2018), and our CSLBS; the proposed model outperforms the state of the art in inference time and accuracy using only raw RGB and RGB-difference frames as input.
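The paper's code is not reproduced on this page, but the architecture the abstract describes lends itself to a brief illustration: one stream extracts static spatial features from raw RGB frames, a second stream captures motion from RGB-difference frames, and the fused per-frame features feed a transformer-style decoder. The PyTorch sketch below is a hypothetical rendering of that two-stream idea only; the module names, layer sizes, and fusion strategy are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stream input pipeline described in the
# abstract: a spatial stream over raw RGB frames and a motion stream over
# RGB-difference frames, fused per frame before a transformer decoder.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Lightweight per-frame CNN backbones (stand-ins for whatever
        # lightweight architecture the paper actually uses).
        self.rgb_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.diff_stream = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W) raw RGB video clip.
        b, t, c, h, w = frames.shape
        # RGB-difference frames approximate motion cheaply (no optical
        # flow); pad with a zero frame to preserve the sequence length.
        diffs = frames[:, 1:] - frames[:, :-1]
        diffs = torch.cat([torch.zeros_like(frames[:, :1]), diffs], dim=1)
        rgb_feat = self.rgb_stream(frames.reshape(b * t, c, h, w))
        diff_feat = self.diff_stream(diffs.reshape(b * t, c, h, w))
        fused = self.fuse(torch.cat([rgb_feat, diff_feat], dim=-1))
        return fused.reshape(b, t, -1)  # per-frame features for the decoder

encoder = TwoStreamEncoder()
clip = torch.randn(2, 16, 3, 112, 112)  # 2 clips of 16 frames each
print(encoder(clip).shape)              # torch.Size([2, 16, 256])
```

A transformer decoder in the style of Vaswani et al. [25] would then attend over these per-frame features to emit the translated sentence token by token; the RGB-difference stream stands in for optical flow at a fraction of the cost, consistent with the abstract's claim of using only raw RGB and RGB-difference frames as input.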

References

  1. Ji, S., Xu, W., Yang, M.: 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)

  2. Barros, P., Magg, S., Weber, C.: A multichannel convolutional neural network for hand posture recognition. In: International Conference on Artificial Neural Networks. Springer (2014)

  3. Zhang, J., Zhou, W., Li, H.: A threshold-based HMM-DTW approach for continuous sign language recognition. In: Proceedings of ACM International Conference on Internet Multimedia Computing and Service (2014)

  4. Koller, O., Zargaran, S., Ney, H.: Re-sign: re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017)

  5. Huang, J., Zhou, W., Li, H.: Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 32, 1 (2018)

  6. Cui, R., Liu, H., Zhang, C.: A deep neural framework for continuous sign language recognition by iterative training. IEEE Trans. Multimedia 21(7), 1880–1891 (2019)

  7. Buehler, P., Zisserman, A., Everingham, M.: Learning sign language by watching TV (using weakly aligned subtitles). In: CVPR (2009)

  8. Pfister, T., Charles, J., Zisserman, A.: Large-scale learning of sign language by watching TV (using co-occurrences). In: BMVC (2013)

  9. Koller, O., Forster, J., Ney, H.: Continuous sign language recognition: towards large vocabulary statistical recognition systems handling multiple signers. Comput. Vis. Image Underst. 141, 108–125 (2015)

  10. Cihan Camgoz, N., Hadfield, S., Koller, O., Bowden, R.: SubUNets: end-to-end hand shape and continuous sign language recognition. In: ICCV (2017)

  11. Camgoz, N.C., Hadfield, S., Koller, O., Ney, H., Bowden, R.: Neural sign language translation. In: Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 7784–7793 (2018)

  12. Ko, S.K., Kim, C.J., Jung, H., et al.: Neural sign language translation based on human key-point estimation. Appl. Sci. 9(13), 2683 (2019). https://doi.org/10.3390/app9132683

  13. Camgoz, N.C., Koller, O., Hadfield, S., Bowden, R.: Sign language transformers: joint end-to-end sign language recognition and translation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR2020) (2020)

  14. Zhou, H., Zhou, W., Zhou, Y., Li, H.: Spatial-temporal multi-cue network for continuous sign language recognition. In: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), pp. 13009–13016 (2020)

  15. Orbay, A., Akarun, L.: Neural sign language translation by learning tokenization. arXiv:2002.00479 (2020)

  16. Hu, H., Zhou, W., Pu, J., Li, H.: Global-local enhancement network for NMF-aware sign language recognition. ACM Trans. Multimedia Comput. Commun. Appl. 17(3), 80:1–80:19 (2021)

  17. Tang, S.G., Guo, D., Richang, H., Wang, M.: Graph-based multimodal sequential embedding for sign language translation. IEEE Trans. Multimedia (2021). https://doi.org/10.1109/TMM.2021.3117124

  18. Zhou, Z.X., Tam, V.W.L., Lam, E.Y.: SignBERT: a BERT-based deep learning framework for continuous sign language recognition. IEEE Access 9, 10 (2021). https://doi.org/10.1109/ACCESS.2021.3132668

  19. Wei, C.C., Zhao, J., Zhou, W.G., Li, H.Q.: Semantic boundary detection with reinforcement learning for continuous sign language recognition. IEEE Trans. Circuits Syst. Video Technol. 31(3), 1138–1149 (2021)

  20. Zhou, H., Zhou, W.G., Zhou, Y., Li, H.Q.: Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Trans. Multimedia (2021)

  21. Hu, H.Z., Zhou, W., Li, H.: Hand-model-aware sign language recognition. In: AAAI Conference on Artificial Intelligence (AAAI 2021), pp. 1558–1566

  22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision Pattern Recognition. IEEE Computer Society (2016)

  23. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009)

  24. Kozlov, A., Andronov, V., Gritsenko, Y.: Lightweight network architecture for real-time action recognition. In: Proceedings of the 35th Annual ACM Symposium on Applied Computing, pp. 2074–2080 (2020) https://doi.org/10.1145/3341105.3373906

  25. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)

  26. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)

  27. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)

  28. Fang, G., Gao, W., Zhao, D.: Large-vocabulary continuous sign language recognition based on transition-movement models. IEEE Trans. Syst. Man Cybern. 37(1), 1–9 (2007)

  29. Rennie, S.J., Marcheret, E., Mroueh, Y., Ross, J., Goel, V.: Self-critical sequence training for image captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2017)

  30. Huang, J., et al.: Video-based sign language recognition without temporal segmentation. In: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 2257–2264 (2018)

  31. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., et al.: Temporal segment networks: towards good practices for deep action recognition. IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  32. Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks (2014). arXiv preprint arXiv:1412.4729

  33. Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4534–4542 (2015)

  34. Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A.: Describing videos by exploiting temporal structure. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4507–4515 (2015)

  35. Pan, Y., Mei, T., Yao, T., Li, H., Rui, Y.: Jointly modeling embedding and translation to bridge video and language (2015). arXiv preprint arXiv:1505.01861

  36. Yang, W., Tao, J., Ye, Z.: Continuous sign language recognition using level building based on fast hidden Markov model. Pattern Recognit. Lett. 78, 28–35 (2016)

  37. Zhang, J., Zhou, W., Li, H.: A threshold-based HMM-DTW approach for continuous sign language recognition. In: Proceedings of ACM International Conference on Internet Multimedia Computing and Service, p. 237 (2014)

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61973334), the Research Center of Security Video and Image Processing Engineering Technology of Guizhou (China) under Grant SRC-Open Project ([2020]001), and the Beijing Advanced Innovation Center for Intelligent Robots and Systems (China) under Grant 2018IRS20. We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the submitted work. The raw/processed data required to reproduce these findings cannot be shared at this time, as the data also form part of an ongoing study.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Yuming Chen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, Y., Mei, X. & Qin, X. Two-stream lightweight sign language transformer. Machine Vision and Applications 33, 79 (2022). https://doi.org/10.1007/s00138-022-01330-w
