Multimedia Tools and Applications

Volume 54, Issue 1, pp 165–179

On creating multimodal virtual humans—real time speech driven facial gesturing

  • Goranka Zoric
  • Robert Forchheimer
  • Igor S. Pandzic


With the widespread use of diverse computing devices, human-computer interaction design is moving toward user-centric interfaces that incorporate the modalities humans use in everyday communication. Virtual humans who look and behave believably fit naturally into this concept of designing interfaces in a more natural, effective, and socially oriented way. In this paper we present a novel method for automatic speech-driven facial gesturing for virtual humans, capable of real-time performance. The facial gestures covered include various nods and head movements, blinks, eyebrow gestures, and gaze. The mapping from speech to facial gestures is based on prosodic information obtained from the speech signal and is realized using a hybrid approach combining Hidden Markov Models, rules, and global statistics. We further test the method with an application prototype: a system for speech-driven facial gesturing suitable for virtual presenters. Subjective evaluation of the system confirmed that the synthesized facial movements are consistent and time-aligned with the underlying speech, and thus provide natural behavior of the whole face.
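To make the hybrid mapping concrete, the rule-and-statistics side of such a pipeline can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the thresholds, gesture labels, and blink period are assumptions for demonstration, and a frame-wise F0 (pitch) contour is assumed to have been extracted from the speech signal beforehand.

```python
# Illustrative sketch: a rule-plus-statistics stage that turns a frame-wise
# F0 (pitch) contour into timed facial-gesture events. All numeric values
# and gesture names here are assumed for demonstration, not from the paper.

def gestures_from_f0(f0, frame_ms=10, accent_ratio=1.2, blink_period_ms=4000):
    """Return sorted (time_ms, gesture) events for an F0 contour.

    Rule: a voiced frame whose F0 exceeds accent_ratio * the utterance's
    mean voiced F0 is treated as a pitch accent and triggers an eyebrow
    raise on its rising edge. Global statistic: blinks are inserted at a
    fixed average period, independent of the signal.
    """
    voiced = [f for f in f0 if f > 0]
    mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0

    events = []
    in_accent = False
    for i, f in enumerate(f0):
        accented = f > accent_ratio * mean_f0 > 0
        if accented and not in_accent:  # rising edge -> one gesture event
            events.append((i * frame_ms, "eyebrow_raise"))
        in_accent = accented

    # Blink schedule from a global rate over the utterance duration.
    duration_ms = len(f0) * frame_ms
    t = blink_period_ms
    while t < duration_ms:
        events.append((t, "blink"))
        t += blink_period_ms

    return sorted(events)

# Toy 1-second contour: ~200 Hz baseline with two short pitch peaks.
contour = ([0.0] * 10 + [200.0] * 20 + [260.0] * 5
           + [200.0] * 30 + [270.0] * 5 + [200.0] * 30)
print(gestures_from_f0(contour))  # eyebrow raises at the two pitch peaks
```

In the actual method the statistical side is carried by Hidden Markov Models trained on prosodic features rather than a simple threshold, but the event-scheduling structure is of this general shape.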


Keywords: Facial gestures · Visual prosody · Multimodal interfaces · Facial animation · Speech processing · Human-computer interaction



The work was partly carried out within the research project “Embodied Conversational Agents as interface for networked and mobile services” supported by the Ministry of Science, Education and Sports of the Republic of Croatia. This work was partly supported by grants from The National Foundation for Science, Higher Education and Technological Development of the Republic of Croatia and The Swedish Institute, Sweden.



Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Goranka Zoric (1)
  • Robert Forchheimer (2)
  • Igor S. Pandzic (1)

  1. Department of Telecommunications, Faculty of Electrical Engineering and Computing, University of Zagreb, Zagreb, Croatia
  2. Department of Electrical Engineering, Linköping University, Linköping, Sweden
