Comparing Cascaded LSTM Architectures for Generating Head Motion from Speech in Task-Oriented Dialogs

  • Duc-Canh Nguyen
  • Gérard Bailly
  • Frédéric Elisei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10903)


To generate action events for a humanoid robot in human-robot interaction (HRI), multimodal interactive behavioral models are typically conditioned on the observed actions of the human partner(s). In previous research, we built an interactive model that generates discrete events for gaze and arm gestures, which can be used to drive our iCub humanoid robot [19, 20]. In this paper, we investigate how to generate continuous head motion in the context of a collaborative scenario where head motion contributes to verbal as well as nonverbal functions. We show that in this scenario the fundamental frequency of speech (the F0 feature) is not sufficient to drive head motion, whereas gaze contributes significantly to head motion generation. We propose a cascaded Long Short-Term Memory (LSTM) model that first estimates gaze from the speech content and the hand gestures performed by the partner. This estimate is then used as an additional input for head motion generation. The results show that the proposed method outperforms a single-task model with the same inputs.
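The cascaded architecture described above can be sketched in code: a first recurrent stage maps speech and partner-gesture features to a gaze estimate, and a second stage consumes the original features concatenated with that estimate to produce head motion. The sketch below is a minimal, untrained illustration in NumPy; the feature dimensions, hidden size, linear readouts, and random weights are all hypothetical and do not reproduce the paper's actual features or training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(n_in, n_hid):
    # weights for the four gates stacked: input, forget, output, candidate
    s = 1.0 / np.sqrt(n_hid)
    return {
        "W": rng.uniform(-s, s, (4 * n_hid, n_in)),
        "U": rng.uniform(-s, s, (4 * n_hid, n_hid)),
        "b": np.zeros(4 * n_hid),
    }

def lstm_run(params, xs):
    # run a single-layer LSTM over a sequence, return hidden states per frame
    n_hid = params["b"].size // 4
    h, c = np.zeros(n_hid), np.zeros(n_hid)
    hs = []
    for x in xs:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i = sigmoid(z[:n_hid])            # input gate
        f = sigmoid(z[n_hid:2 * n_hid])   # forget gate
        o = sigmoid(z[2 * n_hid:3 * n_hid])  # output gate
        g = np.tanh(z[3 * n_hid:])        # cell candidate
        c = f * c + i * g
        h = o * np.tanh(c)
        hs.append(h)
    return np.stack(hs)

# hypothetical per-frame features: speech (e.g. F0, energy) + partner gestures
T, n_speech, n_gesture = 50, 2, 4
n_gaze, n_head = 3, 3   # gaze encoding; head pitch/yaw/roll
x = rng.standard_normal((T, n_speech + n_gesture))

# stage 1: estimate gaze from speech + partner-gesture features
gaze_lstm = init_lstm(n_speech + n_gesture, 16)
W_gaze = rng.standard_normal((n_gaze, 16)) * 0.1
gaze_est = lstm_run(gaze_lstm, x) @ W_gaze.T          # (T, n_gaze)

# stage 2: generate head motion from original inputs plus estimated gaze
head_lstm = init_lstm(n_speech + n_gesture + n_gaze, 16)
W_head = rng.standard_normal((n_head, 16)) * 0.1
head = lstm_run(head_lstm, np.hstack([x, gaze_est])) @ W_head.T

print(head.shape)  # (50, 3): one head pose per frame
```

The point of the cascade is that the intermediate gaze estimate gives the second stage a signal that speech alone does not carry; in practice both stages would be trained jointly or sequentially on recorded interaction data.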


Keywords: Head motion generation · Human interactions · Multi-task learning · LSTM · Human-robot interaction



This research is supported by the ANR SOMBRERO (ANR-14-CE27-0014), EQUIPEX ROBOTEX (ANR-10-EQPX-44-01) and the RHUM action of PERSYVAL (11-LABX-0025). The first author is funded by SOMBRERO.


References

  1. Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: human trajectory prediction in crowded spaces. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–971 (2016)
  2. Ben Youssef, A., Shimodaira, H., Braude, D.A.: Articulatory features for speech-driven head motion synthesis. In: Interspeech, pp. 2758–2762 (2013)
  3. Boersma, P., Weenink, D.: PRAAT: a system for doing phonetics by computer. Report of the Institute of Phonetic Sciences of the University of Amsterdam. University of Amsterdam, Amsterdam (1996)
  4. Brimijoin, W.O., Boyd, A.W., Akeroyd, M.A.: The contribution of head movement to the externalization and internalization of sounds. PLoS One 8(12), e83068 (2013)
  5. Busso, C., Deng, Z., Grimm, M., Neumann, U., Narayanan, S.: Rigid head motion in expressive speech animation: analysis and synthesis. IEEE Trans. Audio Speech Lang. Process. 15(3), 1075–1086 (2007)
  6. Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., Douville, B., Prevost, S., Stone, M.: Animated conversation: rule-based generation of facial expression, gesture & spoken intonation for multiple conversational agents. In: Annual Conference on Computer Graphics and Interactive Techniques, pp. 413–420. ACM (1994)
  7. Dehon, C., Filzmoser, P., Croux, C.: Robust methods for canonical correlation analysis. In: Kiers, H.A.L., Rasson, J.P., Groenen, P.J.F., Schader, M. (eds.) Data Analysis, Classification, and Related Methods. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 321–326. Springer, Heidelberg (2000)
  8. Ding, Y., Pelachaud, C., Artières, T.: Modeling multimodal behaviors from speech prosody. In: Aylett, R., Krenn, B., Pelachaud, C., Shimodaira, H. (eds.) IVA 2013. LNCS (LNAI), vol. 8108, pp. 217–228. Springer, Heidelberg (2013)
  9. Graf, H.P., Cosatto, E., Strom, V., Huang, F.J.: Visual prosody: facial movements accompanying speech. In: Automatic Face and Gesture Recognition (FG), pp. 396–401. IEEE (2002)
  10. Guitton, D., Volle, M.: Gaze control in humans: eye-head coordination during orienting movements to targets within and beyond the oculomotor range. J. Neurophysiol. 58(3), 427–459 (1987)
  11. Haag, K., Shimodaira, H.: Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis. In: Traum, D., Swartout, W., Khooshabeh, P., Kopp, S., Scherer, S., Leuski, A. (eds.) IVA 2016. LNCS (LNAI), vol. 10011, pp. 198–207. Springer, Cham (2016)
  12. Lee, J., Marsella, S.: Nonverbal behavior generator for embodied conversational agents. In: Gratch, J., Young, M., Aylett, R., Ballin, D., Olivier, P. (eds.) IVA 2006. LNCS (LNAI), vol. 4133, pp. 243–255. Springer, Heidelberg (2006)
  13. Levine, S., Theobalt, C., Koltun, V.: Real-time prosody-driven synthesis of body language. In: ACM Transactions on Graphics (TOG), vol. 28, Article no. 172. ACM (2009)
  14. Liu, C., Ishi, C.T., Ishiguro, H., Hagita, N.: Generation of nodding, head tilting and eye gazing for human-robot dialogue interaction. In: Human-Robot Interaction (HRI), pp. 285–292. IEEE (2012)
  15. Mariooryad, S., Busso, C.: Generating human-like behaviors using joint, speech-driven models for conversational agents. IEEE Trans. Audio Speech Lang. Process. 20(8), 2329–2340 (2012)
  16. May, T., Ma, N., Brown, G.J.: Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues. In: Acoustics, Speech and Signal Processing (ICASSP), pp. 2679–2683. IEEE (2015)
  17. Mihoub, A., Bailly, G., Wolf, C., Elisei, F.: Graphical models for social behavior modeling in face-to-face interaction. Pattern Recogn. Lett. 74, 82–89 (2016)
  18. Munhall, K.G., Jones, J.A., Callan, D.E., Kuratate, T., Vatikiotis-Bateson, E.: Visual prosody and speech intelligibility: head movement improves auditory speech perception. Psychol. Sci. 15(2), 133–137 (2004)
  19. Nguyen, D.-C., Bailly, G., Elisei, F.: Conducting neuropsychological tests with a humanoid robot: design and evaluation. In: Cognitive Infocommunications (CogInfoCom), pp. 337–342. IEEE (2016)
  20. Nguyen, D.-C., Bailly, G., Elisei, F.: Learning off-line vs. on-line models of interactive multimodal behaviors with recurrent neural networks. Pattern Recognition Letters (PRL) (accepted with minor revision)
  21. Sadoughi, N., Busso, C.: Speech-driven animation with meaningful behaviors (2017). arXiv preprint arXiv:1708.01640
  22. Thórisson, K.R.: Natural turn-taking needs no manual: computational theory and model, from perception to action. In: Granström, B., House, D., Karlsson, I. (eds.) Multimodality in Language and Speech Systems. Text, Speech and Language Technology, vol. 19, pp. 173–207. Springer, Dordrecht (2002)
  23. Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., Sloetjes, H.: ELAN: a professional framework for multimodality research. In: International Conference on Language Resources and Evaluation (LREC) (2006)
  24. Yehia, H., Kuratate, T., Vatikiotis-Bateson, E.: Facial animation and head motion driven by speech acoustics. In: 5th Seminar on Speech Production: Models and Data, pp. 265–268. Kloster Seeon, Germany (2000)
  25. Wolpert, D.M., Doya, K., Kawato, M.: A unifying computational framework for motor control and social interaction. Philos. Trans. R. Soc. B Biol. Sci. 358(1431), 593–602 (2003)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Duc-Canh Nguyen (1)
  • Gérard Bailly (1)
  • Frédéric Elisei (1)

  1. GIPSA-Lab, Grenoble-Alpes Univ. and CNRS, Grenoble, France
