Visual Contribution to Word Prominence Detection in a Playful Interaction Setting

  • Martin Heckmann
Conference paper


This paper investigates how prominent words can be distinguished from non-prominent ones in a setting where a user interacts with a computer in a small game designed as a Wizard-of-Oz experiment. System misunderstandings were triggered, and the user was asked to correct them naturally, i.e., using prosodic cues. Consequently, the corrected word is expected to be highly prominent. Audio-visual recordings were made with a distant microphone and without visual markers. As acoustic features, relative energy, duration, and fundamental frequency were calculated. From the visual channel, rigid head movements and image-transformation-based features of the mouth region were extracted. Different feature combinations are evaluated with respect to their power to discriminate prominent from non-prominent words using an SVM. Depending on the features, accuracies of approximately 70%–80% are achieved. The visual features are particularly beneficial when the acoustic features are weaker.
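The evaluation idea in the abstract, training one classifier per feature combination and comparing accuracies, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic random numbers (three stand-in acoustic dimensions, two stand-in visual dimensions), the function names are hypothetical, and a plain linear SVM trained by Pegasos-style stochastic subgradient descent stands in for the paper's SVM, whose kernel and settings are not given here. The printed accuracies say nothing about the paper's results.

```python
import random


def train_linear_svm(X, y, lam=0.01, epochs=20, seed=0):
    """Train a linear SVM by Pegasos-style stochastic subgradient descent.

    X: list of feature vectors, y: labels in {-1, +1}.
    Returns the weight vector; its last entry acts as the bias.
    """
    rng = random.Random(seed)
    Xb = [list(x) + [1.0] for x in X]  # append a constant bias feature
    w = [0.0] * len(Xb[0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(Xb)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, Xb[i]))
            # Shrink the weights (L2 regularization) ...
            w = [(1.0 - eta * lam) * wj for wj in w]
            # ... and step on the hinge loss if the margin is violated.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, Xb[i])]
    return w


def accuracy(w, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    correct = 0
    for x, label in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, list(x) + [1.0]))
        pred = 1 if score >= 0 else -1
        correct += (pred == label)
    return correct / len(X)


def make_synthetic_words(n=400, seed=1):
    """Synthetic stand-in data: +1 = prominent word, -1 = non-prominent."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice([-1, 1])
        # Placeholders for e.g. relative energy, duration, F0.
        acoustic = [rng.gauss(0.8 * label, 1.0) for _ in range(3)]
        # Placeholders for e.g. head movement and mouth-region features.
        visual = [rng.gauss(0.5 * label, 1.0) for _ in range(2)]
        data.append((acoustic, visual, label))
    return data


def evaluate_feature_sets(data, n_train=300):
    """Train and test one linear SVM per feature combination."""
    feature_sets = {
        "acoustic": lambda a, v: a,
        "visual": lambda a, v: v,
        "acoustic+visual": lambda a, v: a + v,
    }
    y = [label for _, _, label in data]
    results = {}
    for name, select in feature_sets.items():
        X = [select(a, v) for a, v, _ in data]
        w = train_linear_svm(X[:n_train], y[:n_train])
        results[name] = accuracy(w, X[n_train:], y[n_train:])
    return results


results = evaluate_feature_sets(make_synthetic_words())
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

With real data, the synthetic generator would be replaced by per-word feature vectors, and a single train/test split would typically give way to cross-validation across speakers.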



I want to thank Petra Wagner, Britta Wrede, and Heiko Wersing for fruitful discussions. Furthermore, I am very grateful to Rujiao Yan and Samuel Kevin Ngouoko for their help in setting up the visual processing and the forced alignment, respectively. Many thanks to Mark Dunn for support with the cameras and the recording system, as well as to Mathias Franzius for support with tuning the SVMs. Special thanks go to my subjects for their patience and effort.



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Honda Research Institute Europe GmbH, Offenbach/Main, Germany
