Gaze, Prosody and Semantics: Relevance of Various Multimodal Signals to Addressee Detection in Human-Human-Computer Conversations

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11096)


The present research focuses on multimodal addressee detection in human-human-computer conversations. A modern spoken dialogue system operating under realistic conditions, which may include multiparty interaction (several people solving a cooperative task address the system while also talking to each other), must distinguish machine-addressed utterances from human-addressed ones. Machine-addressed queries should receive a direct response, whereas human-addressed utterances should be ignored or processed implicitly. We propose a multimodal system that performs visual, acoustic-prosodic, and textual analysis of user utterances and outperforms the existing baseline for the Smart Video Corpus. We also investigated how the individual models perform across speech categories of varying spontaneity: the acoustic model has difficulty classifying constrained speech, the textual model performs worse on spontaneous speech, and the performance of the visual model drops significantly on read and spontaneous human-addressed speech due to the ambiguous behaviour of users.
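The abstract describes combining visual, acoustic-prosodic, and textual classifiers into one addressee decision. As a minimal illustration of how such modality scores might be combined, the sketch below uses a weighted late fusion; the weights, threshold, and function name are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical late-fusion sketch for multimodal addressee detection.
# Each modality-specific classifier is assumed to output a probability
# that the current utterance is machine-addressed; a weighted average
# of these probabilities yields the final decision. The weights and
# threshold below are illustrative, not values from the paper.

def fuse_addressee_scores(p_visual, p_acoustic, p_textual,
                          weights=(0.4, 0.3, 0.3), threshold=0.5):
    """Return (decision, fused_score) for a single utterance.

    decision is True when the fused score indicates the utterance
    is addressed to the machine rather than to another human.
    """
    scores = (p_visual, p_acoustic, p_textual)
    fused = sum(w * p for w, p in zip(weights, scores)) / sum(weights)
    return fused >= threshold, fused

# Example: frontal face detected, moderately "read" prosody,
# command-like wording -> classified as machine-addressed.
decision, score = fuse_addressee_scores(0.9, 0.6, 0.7)
# decision is True, score is 0.75
```

In practice the fusion weights would be tuned on held-out data, since (as the abstract notes) each modality's reliability varies with speech spontaneity.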


Computational paralinguistics · Off-talk · Speaking style · Text classification · Frontal face detection · Spoken dialogue system



This work was supported in part by the Government of the Russian Federation (Grant No. 08-08) and by the DAAD programmes ‘Research Grants for Doctoral Candidates and Young Academics and Scientists’ and ‘Leonhard-Euler’.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Ulm University, Ulm, Germany
  2. ITMO University, Saint Petersburg, Russia
