Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition

  • Oscar Koller
  • Hermann Ney
  • Richard Bowden
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8689)


This work presents a framework for recognising signer-independent mouthings in continuous sign language without requiring manual annotations. Mouthings are lip movements that correspond to the pronunciation of words, or parts of words, during signing. Research on sign language recognition has focused extensively on the hands as features. However, sign language is multi-modal, and a full understanding, particularly with respect to its lexical variety, language idioms and grammatical structures, is not possible without further exploring the remaining information channels. To our knowledge, no previous work has explored dedicated viseme recognition in the context of sign language recognition. The approach is trained on over 180,000 unlabelled frames and reaches 47.1% precision on the frame level. Generalisation across individuals and the influence of context-dependent visemes are analysed.


Keywords: Sign Language Recognition, Viseme Recognition, Mouthing, Lip Reading



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Oscar Koller (1, 2)
  • Hermann Ney (1)
  • Richard Bowden (2)
  1. Human Language Technology and Pattern Recognition, RWTH Aachen, Germany
  2. Centre for Vision, Speech and Signal Processing, University of Surrey, UK
