Predicting Head Pose in Dyadic Conversation

  • Conference paper

Intelligent Virtual Agents (IVA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10498)

Abstract

Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution that visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek features from the speech mode to predict head pose. Several previous authors have shown that such prediction is possible, but experiments are typically confined to rigidly produced dialogue.

Expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. People involved in dyadic conversation adapt their speech and head motion in response to the other's speech and head motion. Using Deep Bi-Directional Long Short-Term Memory (BLSTM) neural networks, we demonstrate that it is possible to predict from the speech signal not just the head motion of the speaker, but also the head motion of the listener.
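
A minimal sketch of the kind of model described above, assuming a Keras implementation: a deep bidirectional LSTM that regresses per-frame head-pose parameters (pitch, yaw, roll) from per-frame speech features. The feature dimensions, layer sizes, optimiser and training data here are illustrative assumptions, not the configuration used in the paper.

    # Sketch only: speech-to-head-pose regression with a stacked BLSTM.
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

    N_FRAMES = 100   # frames per training window (assumed)
    N_AUDIO = 26     # speech feature dimension per frame, e.g. MFCCs (assumed)
    N_POSE = 3       # head-pose parameters per frame: pitch, yaw, roll (assumed)

    model = Sequential([
        # Stacked bidirectional LSTMs let each output frame depend on both
        # past and future speech context.
        Bidirectional(LSTM(128, return_sequences=True),
                      input_shape=(N_FRAMES, N_AUDIO)),
        Bidirectional(LSTM(128, return_sequences=True)),
        # Per-frame linear regression onto the head-pose parameters.
        TimeDistributed(Dense(N_POSE)),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Toy usage with random arrays; in practice X would hold one partner's
    # speech features and Y the tracked head pose of either the speaker or
    # the listener.
    X = np.random.randn(8, N_FRAMES, N_AUDIO).astype("float32")
    Y = np.random.randn(8, N_FRAMES, N_POSE).astype("float32")
    model.fit(X, Y, epochs=1, verbose=0)

The same regression can be trained with the listener's head pose as the target, which reflects the paper's central claim: the speech signal carries information about the head motion of both conversational partners.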

Author information

Correspondence to David Greenwood, Stephen Laycock or Iain Matthews.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Greenwood, D., Laycock, S., Matthews, I. (2017). Predicting Head Pose in Dyadic Conversation. In: Beskow, J., Peters, C., Castellano, G., O'Sullivan, C., Leite, I., Kopp, S. (eds) Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, vol. 10498. Springer, Cham. https://doi.org/10.1007/978-3-319-67401-8_18

  • DOI: https://doi.org/10.1007/978-3-319-67401-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67400-1

  • Online ISBN: 978-3-319-67401-8

  • eBook Packages: Computer Science, Computer Science (R0)
