
BSL-1K: Scaling Up Co-articulated Sign Language Recognition Using Mouthing Cues

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12356)

Abstract

Recent progress in fine-grained gesture and action classification, and in machine translation, points to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data—the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks—we exceed the state of the art on both the MS-ASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area.
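To make the annotation strategy in the abstract concrete, the sketch below illustrates (under stated assumptions, not the authors' implementation) how weakly-aligned subtitles can nominate candidate windows that a visual keyword spotter then refines: for each vocabulary word appearing in a subtitle, the mouth region is scanned over an extended window around the subtitle timing, and the peak spotting probability is kept as a candidate sign annotation if it exceeds a threshold. The helpers `extract_mouth_crops` and `spot_keyword`, and the `video.fps` attribute, are hypothetical placeholders.

```python
# Illustrative sketch of subtitle-guided sign localisation via mouthing cues.
# This is an assumption-laden outline, not the paper's released code.

from dataclasses import dataclass

@dataclass
class Subtitle:
    text: str     # subtitle text, only weakly aligned to the signing
    start: float  # start time in seconds
    end: float    # end time in seconds

def annotate_episode(subtitles, video, vocabulary,
                     padding=4.0, threshold=0.5):
    """Return (word, time_in_seconds, confidence) candidate annotations."""
    annotations = []
    for sub in subtitles:
        # Only vocabulary words that occur in the subtitle are searched for.
        words = set(sub.text.lower().split()) & vocabulary
        for word in words:
            # Extend the search window, since subtitles are weakly aligned
            # with the signing they accompany.
            window = (max(0.0, sub.start - padding), sub.end + padding)
            mouth_crops = extract_mouth_crops(video, window)  # hypothetical helper
            probs = spot_keyword(mouth_crops, word)           # hypothetical spotter
            peak = max(range(len(probs)), key=probs.__getitem__)
            if probs[peak] >= threshold:
                time = window[0] + peak / video.fps
                annotations.append((word, time, probs[peak]))
    return annotations
```

In this reading, the spotter's confidence acts as an automatic annotator: only high-confidence peaks are retained, which trades recall for the precision needed to train recognition models without manual labelling.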

Keywords

Sign language recognition · Visual keyword spotting

Notes

Acknowledgements

This work was supported by EPSRC grant ExTol. We also thank T. Stafylakis, A. Brown, A. Dutta, L. Dunbar, A. Thandavan, C. Camgoz, O. Koller, H. V. Joze, O. Kopuklu for their help.

Supplementary material

504452_1_En_3_MOESM1_ESM.pdf (9.8 mb)
Supplementary material 1 (pdf 10032 KB)

Supplementary material 2 (mp4 57023 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Visual Geometry Group, University of Oxford, Oxford, UK
  2. Naver Corporation, Seoul, South Korea
  3. Deafness, Cognition and Language Research Centre, University College London, London, UK
