Predicting Head Pose in Dyadic Conversation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10498)


Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution that visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek features from the speech mode to predict the head pose. Several previous authors have shown that such prediction is possible, but experiments are typically confined to rigidly produced dialogue.

Expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. People involved in dyadic conversation adapt their speech and head motion in response to the other's speech and head motion. Using Deep Bi-Directional Long Short Term Memory (BLSTM) neural networks, we demonstrate that it is possible to predict not just the head motion of the speaker, but also the head motion of the listener, from the speech signal.
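To make the architecture concrete, the following is a minimal sketch of a deep BLSTM that maps a sequence of acoustic feature frames to per-frame head-pose parameters. It is not the authors' exact model: the feature dimensionality, layer sizes, pose parameterisation (pitch, yaw, roll) and loss are illustrative assumptions.

```python
# Minimal sketch (assumed dimensions, not the paper's exact model):
# a stacked bidirectional LSTM predicting head pose from speech features.
import numpy as np
from tensorflow.keras import layers, models

n_frames, n_acoustic, n_pose = 100, 40, 3  # assumed: frames, MFCC-like dims, pose angles

model = models.Sequential([
    layers.Input(shape=(n_frames, n_acoustic)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(n_pose)),  # one pose vector per frame
])
model.compile(optimizer="adam", loss="mse")

# Dummy batch: speech features in, head-pose trajectories out.
x = np.random.randn(2, n_frames, n_acoustic).astype("float32")
y_hat = model.predict(x, verbose=0)
print(y_hat.shape)
```

The bidirectional layers let each frame's prediction draw on both past and future acoustic context, which matters because head movements often anticipate or lag the speech they accompany.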


Keywords: Speech animation · Head motion synthesis · Visual prosody · Dyadic conversation · Generative models · BLSTM · CVAE





Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

University of East Anglia, Norwich, UK
