
Lip-Reading: Furhat Audio Visual Intelligibility of a Back Projected Animated Face

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7502)

Abstract

Back-projecting a computer-animated face onto a static three-dimensional physical model of a face is a promising technology that is gaining ground as a solution for building situated, flexible, and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for multimodal multiparty human-machine interaction, and its benefits over virtual characters and mechanical robotic heads; we then motivate the need to investigate the contribution of Furhat’s face to speech intelligibility. We present an audio-visual speech intelligibility experiment in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip reading a face visualized on a 2D screen with that from a 3D back-projected face, viewed from different angles. The results show that audio-visual speech intelligibility is preserved when the avatar is projected onto a static face model (as in Furhat) and, rather surprisingly, even exceeds that of the flat display. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal to, or even higher than, that of the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception in 3D projected faces.
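The "gain in intelligibility" compared in the experiment is typically computed as the difference in keyword recognition accuracy between an audio-visual condition and an audio-only baseline. A minimal sketch of that scoring, using entirely hypothetical responses and keywords (the paper's actual stimuli and scoring procedure are not reproduced here):

```python
# Sketch of audio-visual intelligibility-gain scoring.
# All sentences and keywords below are hypothetical illustrations,
# not the study's actual materials.

def keyword_accuracy(responses, keyword_lists):
    """Proportion of target keywords a listener reported correctly."""
    correct = total = 0
    for response, keywords in zip(responses, keyword_lists):
        heard = set(response.lower().split())
        for kw in keywords:
            total += 1
            if kw.lower() in heard:
                correct += 1
    return correct / total

# Target keywords per sentence, and one listener's transcriptions
# in each condition (hypothetical data).
keywords = [["ball", "red"], ["dog", "ran"]]
audio_only = ["the ball was blue", "a dog ran home"]
audio_visual = ["the red ball fell", "the dog ran fast"]

gain = (keyword_accuracy(audio_visual, keywords)
        - keyword_accuracy(audio_only, keywords))
print(f"AV gain: {gain:+.2f}")  # → AV gain: +0.25
```

Scoring each condition the same way and subtracting isolates the contribution of the visible face; a positive gain means lip reading helped the listener recover keywords lost in the degraded audio.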

Keywords

  • Furhat
  • Talking head
  • Robot heads
  • Lip reading
  • Visual speech





Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Al Moubayed, S., Skantze, G., Beskow, J. (2012). Lip-Reading: Furhat Audio Visual Intelligibility of a Back Projected Animated Face. In: Nakano, Y., Neff, M., Paiva, A., Walker, M. (eds) Intelligent Virtual Agents. IVA 2012. Lecture Notes in Computer Science(), vol 7502. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33197-8_20


  • Print ISBN: 978-3-642-33196-1

  • Online ISBN: 978-3-642-33197-8
