Abstract
A critical assumption of all current visual speech recognition systems is that there exist visual speech units, called visemes, which can be mapped to the units of acoustic speech, the phonemes. Although a number of phoneme-to-viseme maps have been published, their effectiveness is rarely tested, particularly on visual-only lip-reading (many works use audio-visual speech). Here we examine 120 mappings and consider whether any are stable across talkers. We present a method for devising maps from the phoneme confusions of an automated lip-reading system, and we report new mappings that improve performance for individual talkers.
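The abstract describes deriving phoneme-to-viseme maps from the phoneme confusions of an automated lip-reader. One simple way to realise such a data-driven map is to treat the confusion matrix as a similarity measure and agglomeratively merge the most-confused phoneme classes until a target number of visemes remains. The sketch below is a minimal illustration of that general idea, not the authors' exact algorithm; the function name, the dict-based confusion representation, and the greedy merge criterion are all assumptions made for this example.

```python
from itertools import combinations

def cluster_phonemes(confusions, num_visemes):
    """Greedily merge the phoneme classes that are most often confused
    with one another until `num_visemes` clusters (visemes) remain.

    confusions: dict mapping (phoneme_a, phoneme_b) -> confusion count.
    Missing pairs count as zero; both orderings of a pair are summed.
    """
    phonemes = sorted({p for pair in confusions for p in pair})
    clusters = [{p} for p in phonemes]  # start with one cluster per phoneme

    def score(c1, c2):
        # Total confusion mass between the two candidate clusters.
        return sum(confusions.get((a, b), 0) + confusions.get((b, a), 0)
                   for a in c1 for b in c2)

    while len(clusters) > num_visemes:
        # Find the pair of clusters with the highest mutual confusion...
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: score(clusters[ij[0]], clusters[ij[1]]))
        # ...and merge them (i < j, so deleting index j is safe).
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Toy example: bilabials /p b m/ confuse among themselves,
# labiodentals /f v/ confuse with each other.
toy = {("p", "b"): 10, ("b", "m"): 8, ("p", "m"): 6,
       ("f", "v"): 9, ("p", "f"): 1}
visemes = cluster_phonemes(toy, 2)
print(visemes)  # → [{'b', 'm', 'p'}, {'f', 'v'}]
```

With the toy counts above, the two recovered clusters match the classic lip-shape groupings (bilabial vs. labiodental), which is the behaviour a confusion-driven map is meant to capture.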
© 2014 Springer International Publishing Switzerland
Bear, H.L., Harvey, R.W., Theobald, BJ., Lan, Y. (2014). Which Phoneme-to-Viseme Maps Best Improve Visual-Only Computer Lip-Reading?. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2014. Lecture Notes in Computer Science, vol 8888. Springer, Cham. https://doi.org/10.1007/978-3-319-14364-4_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14363-7
Online ISBN: 978-3-319-14364-4
eBook Packages: Computer Science (R0)