Abstract
Back-projecting a computer-animated face onto a static, three-dimensional physical model of a face is a promising technology that is gaining ground as a solution for building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for multimodal, multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution Furhat’s face makes to speech intelligibility. We present an audio-visual speech intelligibility experiment in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip reading a face visualized on a 2D screen with that from a 3D back-projected face, viewed from different angles. The results show that audio-visual speech intelligibility is preserved when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds that of the flat display. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal to, or even higher than, that of the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception in 3D-projected faces.
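The abstract does not specify how the speech signal was degraded; a common choice in audio-visual intelligibility studies is additive noise at a fixed signal-to-noise ratio, so that the visual channel carries measurable extra information. A minimal sketch, assuming additive white noise (the function name and parameters are illustrative, not taken from the paper):

```python
import numpy as np

def degrade_with_noise(signal: np.ndarray, snr_db: float, seed=None) -> np.ndarray:
    """Mix a speech signal with white noise scaled to a target SNR in dB."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(signal.shape)
    # Scale the noise so that 10*log10(P_signal / P_noise) == snr_db.
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    noise *= np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + noise

# Example: degrade a 1-second 220 Hz tone (a stand-in for speech) at 0 dB SNR.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 220 * t)
noisy = degrade_with_noise(clean, snr_db=0.0, seed=0)
```

Because the noise is rescaled to the exact target power, the resulting mixture hits the requested SNR precisely, which makes it easy to run the same sentences at several degradation levels.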
Keywords
- Furhat
- Talking Head
- Robot Heads
- Lip reading
- Visual Speech
© 2012 Springer-Verlag Berlin Heidelberg
Cite this paper
Al Moubayed, S., Skantze, G., Beskow, J. (2012). Lip-Reading: Furhat Audio Visual Intelligibility of a Back Projected Animated Face. In: Nakano, Y., Neff, M., Paiva, A., Walker, M. (eds) Intelligent Virtual Agents. IVA 2012. Lecture Notes in Computer Science(), vol 7502. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33197-8_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33196-1
Online ISBN: 978-3-642-33197-8
