Journal on Multimodal User Interfaces, Volume 3, Issue 4, pp 299–309

Auditory visual prominence

From intelligibility to behavior
  • Samer Al Moubayed
  • Jonas Beskow
  • Björn Granström
Original Paper


Auditory prominence refers to an acoustic segment being made salient in its context. Prominence is one of the prosodic functions shown to correlate strongly with facial movements. In this work, we investigate the effects of facial prominence cues, realized as gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment was conducted: speech quality was acoustically degraded, the fundamental frequency was removed from the signal, and the speech was then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to adding the same gestures to speech at random. We also present a follow-up study examining how the behavior of the talking head is perceived when gestures are added over pitch accents. Using eye-gaze tracking and questionnaires with 10 moderately hearing-impaired subjects, the gaze data show that when gestures are coupled with pitch accents, users look at the face much as they would look at a natural face, as opposed to when the face carries no gestures. The questionnaire results further show that these gestures significantly increase the perceived naturalness and understandability of the talking head.
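The abstract does not specify how the speech was degraded; a common way to reduce quality while removing the fundamental frequency, yet preserving the slow temporal envelopes that carry intelligibility, is noise-excited channel vocoding (in the spirit of Shannon et al., 1995). The sketch below is an illustration under that assumption; the band count, band edges, and envelope cutoff are illustrative choices, not the paper's actual parameters.

```python
import numpy as np

def bandpass_fft(x, fs, lo, hi):
    """Crude band-pass filter via FFT masking (illustrative, not production DSP)."""
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < lo) | (f >= hi)] = 0.0
    return np.fft.irfft(X, len(x))

def envelope(x, fs, cutoff=30.0):
    """Amplitude envelope: rectify, then smooth with a moving average
    roughly equivalent to a low-pass at `cutoff` Hz."""
    win = max(1, int(fs / cutoff))
    return np.convolve(np.abs(x), np.ones(win) / win, mode="same")

def noise_vocode(x, fs, n_bands=4, f_lo=100.0, f_hi=4000.0):
    """Replace the fine structure (and hence F0) in each frequency band
    with band-limited noise modulated by that band's envelope."""
    rng = np.random.default_rng(0)
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)
    out = np.zeros_like(x, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        env = envelope(bandpass_fft(x, fs, lo, hi), fs)
        carrier = bandpass_fft(rng.standard_normal(len(x)), fs, lo, hi)
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-12)  # normalize to [-1, 1]
```

Because the carrier is noise rather than the original waveform, the periodicity that encodes F0 is discarded, so any pitch-accent information must come from the visual channel — the manipulation the intelligibility experiment relies on.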


Keywords: Prominence · Visual prosody · Gesture · ECA · Eye gaze · Head nod · Eyebrows





Copyright information

© OpenInterface Association 2010

Authors and Affiliations

  • Samer Al Moubayed (1)
  • Jonas Beskow (1)
  • Björn Granström (1)
  1. Center for Speech Technology, KTH, Stockholm, Sweden
