Which Phoneme-to-Viseme Maps Best Improve Visual-Only Computer Lip-Reading?

  • Helen L. Bear
  • Richard W. Harvey
  • Barry-John Theobald
  • Yuxuan Lan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8888)


A critical assumption of all current visual speech recognition systems is that there exist visual speech units, called visemes, which can be mapped to units of acoustic speech, the phonemes. Although a number of phoneme-to-viseme maps have been published, their effectiveness is rarely tested, particularly for visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings and consider whether any are stable across talkers. We show a method for devising maps based on the phoneme confusions of an automated lip-reading system, and we present new mappings that show improvements for individual talkers.
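To make the idea of a confusion-derived map concrete, here is a minimal sketch. It assumes a phoneme confusion matrix stored as nested dictionaries (true phoneme, then recognised phoneme, then count) and groups phonemes by single linkage (union-find) whenever the confusions between two phonemes, summed over both directions, reach a chosen threshold. The function names `derive_viseme_map` and `to_visemes`, the threshold rule and the confusion counts are all hypothetical. This is an illustration of the general technique, not the clustering procedure used in the paper, and the numbers are invented rather than taken from its results.

```python
# Hypothetical sketch: derive a phoneme-to-viseme map from confusion counts.
# The grouping rule (single linkage over summed two-way confusions) is an
# assumption for illustration, not the paper's exact method.

Confusions = dict[str, dict[str, int]]  # confusions[true][recognised] = count


def derive_viseme_map(confusions: Confusions, threshold: int) -> dict[str, str]:
    """Merge phonemes whose two-way confusion count reaches `threshold`."""
    phonemes = sorted(set(confusions) | {q for row in confusions.values() for q in row})
    parent = {p: p for p in phonemes}

    def find(p: str) -> str:
        # Union-find root lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a in phonemes:
        for b in phonemes:
            if a < b:  # each unordered pair once; diagonal (correct) counts ignored
                mutual = confusions.get(a, {}).get(b, 0) + confusions.get(b, {}).get(a, 0)
                if mutual >= threshold:
                    parent[find(a)] = find(b)

    # Label groups V1, V2, ... in the order their first (alphabetical) member appears.
    labels: dict[str, str] = {}
    mapping: dict[str, str] = {}
    for p in phonemes:
        root = find(p)
        if root not in labels:
            labels[root] = f"V{len(labels) + 1}"
        mapping[p] = labels[root]
    return mapping


def to_visemes(phones: list[str], mapping: dict[str, str]) -> list[str]:
    """Rewrite a phoneme transcription as a viseme transcription."""
    return [mapping[p] for p in phones]


# Invented counts, for illustration only.
confusions = {
    "p": {"p": 20, "b": 9, "m": 6},
    "b": {"b": 18, "p": 7},
    "m": {"m": 21, "p": 5},
    "f": {"f": 15, "v": 8},
    "v": {"v": 14, "f": 6},
    "k": {"k": 19},
}
viseme_map = derive_viseme_map(confusions, threshold=5)
print(to_visemes(["p", "f", "k", "m"], viseme_map))  # → ['V1', 'V2', 'V3', 'V1']
```

Raising `threshold` keeps more phonemes apart and so yields more, smaller viseme classes; lowering it merges more phonemes into fewer classes. Because a confusion matrix can be computed per talker, the same procedure can produce talker-specific maps.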


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Helen L. Bear (1)
  • Richard W. Harvey (1)
  • Barry-John Theobald (1)
  • Yuxuan Lan (1)
  1. School of Computing Sciences, University of East Anglia, Norwich, UK