Multi-channel Transformers for Multi-articulatory Sign Language Translation

  • Conference paper
  • Part of the proceedings: Computer Vision – ECCV 2020 Workshops (ECCV 2020)
  • Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12538)

Abstract

Sign languages use multiple asynchronous information channels (articulators): not just the hands but also the face and body, which computational approaches often ignore. In this paper we tackle the multi-articulatory sign language translation task and propose a novel multi-channel transformer architecture. The proposed architecture allows both inter- and intra-channel contextual relationships between different sign articulators to be modelled within the transformer network itself, while also maintaining channel-specific information. We evaluate our approach on the RWTH-PHOENIX-Weather-2014T dataset and report competitive translation performance. Importantly, we overcome the reliance on gloss annotations that underpins other state-of-the-art approaches, thereby removing the need for expensive curated datasets.
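The paper defines the architecture in full; as a rough sketch of the idea only, the PyTorch snippet below shows one plausible way to combine per-channel (intra-articulator) self-attention with cross-channel (inter-articulator) attention while keeping channel-specific weights. Every name and hyperparameter here (`MultiChannelEncoderLayer`, `d_model`, `n_channels`, the use of `nn.MultiheadAttention`) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: one plausible way to model intra- and
# inter-channel attention over articulator streams (e.g. hand, face, body).
# Module names and hyperparameters are assumptions, not the paper's.
import torch
import torch.nn as nn


class MultiChannelEncoderLayer(nn.Module):
    """Each channel attends to itself (intra-channel context), then
    queries the concatenated other channels (inter-channel context),
    with channel-specific weights throughout."""

    def __init__(self, d_model=256, n_heads=4, n_channels=3):
        super().__init__()
        mha = lambda: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.ModuleList(mha() for _ in range(n_channels))
        self.cross_attn = nn.ModuleList(mha() for _ in range(n_channels))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_channels))
        self.norms = nn.ModuleList(
            nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))
            for _ in range(n_channels))

    def forward(self, channels):
        out = []
        for i, x in enumerate(channels):
            n1, n2, n3 = self.norms[i]
            # Intra-channel self-attention preserves channel-specific context.
            h, _ = self.self_attn[i](x, x, x)
            x = n1(x + h)
            # Inter-channel attention: this channel queries all the others.
            others = torch.cat(
                [c for j, c in enumerate(channels) if j != i], dim=1)
            h, _ = self.cross_attn[i](x, others, others)
            x = n2(x + h)
            x = n3(x + self.ffn[i](x))
            out.append(x)
        return out


# Toy usage: three articulator channels, batch of 2, 50 frames, 256-dim features.
layer = MultiChannelEncoderLayer()
feats = [torch.randn(2, 50, 256) for _ in range(3)]
print([t.shape for t in layer(feats)])  # three (2, 50, 256) tensors
```

Keeping separate attention and feed-forward weights per channel is one way to maintain channel-specific information while the cross-attention step models relationships between articulators; the paper integrates both inside a single multi-channel transformer, so treat this as an approximation rather than a reproduction.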

Notes

  1. Mouthings are lip patterns that accompany a sign.

  2. Glosses can be considered the minimal lexical items of sign languages.

  3. Note that we use a vectorized formulation in our equations. All softmax and bias addition operations are done row-wise.
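To make the row-wise convention of note 3 concrete, here is a minimal sketch assuming standard scaled dot-product attention; the shapes, the bias over key positions, and all variable names are illustrative rather than taken from the paper.

```python
# Minimal sketch of the row-wise convention in a vectorized attention
# computation (illustrative; not the paper's code).
import torch

Q = torch.randn(5, 8)                 # 5 query positions, feature dim 8
K = torch.randn(7, 8)                 # 7 key positions
b = torch.randn(7)                    # bias over key positions

scores = Q @ K.T / 8 ** 0.5           # (5, 7) scaled dot-product scores
scores = scores + b                   # bias broadcast onto every row
attn = torch.softmax(scores, dim=-1)  # softmax normalises each row

# Every row of the attention matrix sums to one.
assert torch.allclose(attn.sum(dim=-1), torch.ones(5))
```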


Acknowledgements

This work received funding from the SNSF Sinergia project ‘SMILE’ (CRSII2_160811), the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 762021 ‘Content4All’, and the EPSRC project ‘ExTOL’ (EP/R03298X/1). This work reflects only the authors’ view and the Commission is not responsible for any use that may be made of the information it contains. We would also like to thank NVIDIA Corporation for their GPU grant.

Author information

Corresponding author

Correspondence to Necati Cihan Camgoz.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF, 135 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Camgoz, N.C., Koller, O., Hadfield, S., Bowden, R. (2020). Multi-channel Transformers for Multi-articulatory Sign Language Translation. In: Bartoli, A., Fusiello, A. (eds) Computer Vision – ECCV 2020 Workshops. ECCV 2020. Lecture Notes in Computer Science, vol. 12538. Springer, Cham. https://doi.org/10.1007/978-3-030-66823-5_18

  • DOI: https://doi.org/10.1007/978-3-030-66823-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-66822-8

  • Online ISBN: 978-3-030-66823-5

  • eBook Packages: Computer Science, Computer Science (R0)
