Predicting Head Pose in Dyadic Conversation

  • Conference paper

Intelligent Virtual Agents (IVA 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10498)

Abstract

Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution that visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek features from the speech mode to predict head pose. Several previous authors have shown that such prediction is possible, but experiments are typically confined to rigidly produced dialogue.

Expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. People involved in dyadic conversation adapt their speech and head motion in response to the other's speech and head motion. Using Deep Bi-Directional Long Short-Term Memory (BLSTM) neural networks, we demonstrate that it is possible to predict from the speech signal not just the head motion of the speaker, but also the head motion of the listener.
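
A minimal sketch of the kind of model described above, assuming a Keras implementation: a deep bidirectional LSTM that regresses per-frame head-pose parameters (pitch, yaw, roll) from per-frame speech features. The feature dimensions, layer sizes, optimiser and training data here are illustrative assumptions, not the configuration used in the paper.

    # Sketch only: speech-to-head-pose regression with a stacked BLSTM.
    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Bidirectional, LSTM, TimeDistributed, Dense

    N_FRAMES = 100   # frames per training window (assumed)
    N_AUDIO = 26     # speech feature dimension per frame, e.g. MFCCs (assumed)
    N_POSE = 3       # head-pose parameters per frame: pitch, yaw, roll (assumed)

    model = Sequential([
        # Stacked bidirectional LSTMs let each output frame depend on both
        # past and future speech context.
        Bidirectional(LSTM(128, return_sequences=True),
                      input_shape=(N_FRAMES, N_AUDIO)),
        Bidirectional(LSTM(128, return_sequences=True)),
        # Per-frame linear regression onto the head-pose parameters.
        TimeDistributed(Dense(N_POSE)),
    ])
    model.compile(optimizer="adam", loss="mse")

    # Toy usage with random arrays; in practice X would hold one partner's
    # speech features and Y the tracked head pose of either the speaker or
    # the listener.
    X = np.random.randn(8, N_FRAMES, N_AUDIO).astype("float32")
    Y = np.random.randn(8, N_FRAMES, N_POSE).astype("float32")
    model.fit(X, Y, epochs=1, verbose=0)

The same regression can be trained with the listener's head pose as the target, which reflects the paper's central claim: the speech signal carries information about the head motion of both conversational partners.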

Author information

Correspondence to David Greenwood, Stephen Laycock or Iain Matthews.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Greenwood, D., Laycock, S., Matthews, I. (2017). Predicting Head Pose in Dyadic Conversation. In: Beskow, J., Peters, C., Castellano, G., O'Sullivan, C., Leite, I., Kopp, S. (eds) Intelligent Virtual Agents. IVA 2017. Lecture Notes in Computer Science, vol. 10498. Springer, Cham. https://doi.org/10.1007/978-3-319-67401-8_18

  • DOI: https://doi.org/10.1007/978-3-319-67401-8_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67400-1

  • Online ISBN: 978-3-319-67401-8

  • eBook Packages: Computer Science, Computer Science (R0)
