Abstract
Continuous sign language recognition is challenging: unsegmented glosses must be identified from long videos in a weakly supervised manner. Some previous methods extract information from multiple modalities to enrich feature representations, but this often complicates the network and places too much emphasis on visual features. Because sign language data takes the form of long videos, a model may forget information from early time steps as the temporal span grows; the ability to model long-range temporal dependencies therefore directly affects recognition performance. A Global-Temporal Enhancement (GTE) module is proposed to strengthen this temporal learning ability. Most current continuous sign language recognition networks adopt a three-stage architecture consisting of visual, sequence, and alignment modules; however, such an architecture is difficult to train sufficiently under the Connectionist Temporal Classification (CTC) loss alone. Two auxiliary supervision methods are therefore proposed: Temporal-Consistency Self-Distillation (TCSD) and the GTE loss. TCSD uses two global temporal outputs from different depths to supervise local temporal information, while the GTE loss provides moderate supervision that balances the features extracted by deep and shallow layers. The proposed model achieves state-of-the-art or competitive performance on the PHOENIX14 and PHOENIX14-T datasets.
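The abstract describes TCSD only at a high level (global temporal outputs supervising local temporal information). The sketch below illustrates the generic self-distillation mechanics such an objective typically relies on: a frame-wise KL divergence between temperature-softened gloss distributions from a teacher (global) and a student (local) branch. The function names, temperature value, and tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_kl(student_logits, teacher_logits, temperature=8.0):
    """Frame-wise KL(teacher || student) between temperature-softened
    gloss distributions, averaged over time. In a self-distillation
    setup the teacher branch would be detached so that gradients flow
    only into the student (local) branch."""
    t = softmax(teacher_logits / temperature)
    s = softmax(student_logits / temperature)
    kl = np.sum(t * (np.log(t + 1e-12) - np.log(s + 1e-12)), axis=-1)
    return float(np.mean(kl))

# Toy example: 16 frames, 40 gloss classes per frame (shapes are illustrative).
rng = np.random.default_rng(0)
local_logits = rng.normal(size=(16, 40))   # student: local temporal branch
global_logits = rng.normal(size=(16, 40))  # teacher: deeper global branch
loss = distillation_kl(local_logits, global_logits)
```

This auxiliary term would be added to the main CTC loss during training; the temperature softens both distributions so the student also learns from the teacher's relative class similarities rather than only its top prediction.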
Acknowledgements
This work was funded by the National Natural Science Foundation of China (NSFC), Grant No. 92048205, and by the China Scholarship Council (CSC), Grant No. 202008310014.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Qin, X., Wang, H., He, C., Zhang, X. (2023). Global-Temporal Enhancement for Sign Language Recognition. In: Iliadis, L., Papaleonidas, A., Angelov, P., Jayne, C. (eds) Artificial Neural Networks and Machine Learning – ICANN 2023. ICANN 2023. Lecture Notes in Computer Science, vol 14261. Springer, Cham. https://doi.org/10.1007/978-3-031-44198-1_23
Print ISBN: 978-3-031-44197-4
Online ISBN: 978-3-031-44198-1