Are You Addressing Me? Multimodal Addressee Detection in Human-Human-Computer Conversations

  • Oleg Akhtiamov
  • Dmitrii Ubskii
  • Evgeniia Feldina
  • Aleksei Pugachev
  • Alexey Karpov
  • Wolfgang Minker
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


The goal of addressee detection is to answer the question ‘Are you addressing me?’ To participate in multiparty conversations, a spoken dialogue system must determine whether a user is addressing the system or another human. The present paper describes three levels of speech and text analysis (acoustical, lexical, and syntactical) for multimodal addressee detection and shows how each level relates to classification performance for different categories of speech. We propose several classification models and compare their performance with the results of the original research performed by the authors of the Smart Video Corpus, which we use in our experiments. Our most effective meta-classifier, which works with acoustical, syntactical, and lexical features, achieves an unweighted average recall of 0.917, a nine percent advantage over the best baseline model, even though the baseline classifier additionally uses head orientation data. We also propose an LSTM neural network for text classification that replaces the lexical and syntactical classifiers with a single model and reaches the same performance as the most effective meta-classifier, even though the meta-classifier additionally analyses acoustical data.
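The unweighted average recall (UAR) reported above is the mean of the per-class recalls, so each class counts equally regardless of how many utterances it contains. The following is a minimal sketch of that metric only, not the authors' evaluation code; the function name and the toy labels ("on" for system-directed speech, "off" for human-directed speech) are assumptions made for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # utterances per true class
    correct = defaultdict(int)  # correctly classified utterances per true class
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels, not taken from the Smart Video Corpus.
y_true = ["on", "on", "on", "on", "off", "off"]
y_pred = ["on", "on", "on", "off", "off", "off"]
uar = unweighted_average_recall(y_true, y_pred)
```

In this toy case the recall is 3/4 for "on" and 2/2 for "off", so the UAR is 0.875, whereas plain accuracy would be 5/6. The difference illustrates why UAR is often preferred when the classes are imbalanced.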


Off-Talk, Speaking style, Text classification, Long short-term memory, Data fusion, Multimodal interaction, Spoken dialogue system



This research is partially supported by DAAD together with the Ministry of Education and Science of the Russian Federation within Michail Lomonosov Program (project No. 8.704.2016/DAAD), by RFBR (project No. 16-37-60100), by the Government of the Russian Federation (grant No. 074-U01), and by the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems” which is funded by the German Research Foundation (DFG).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Oleg Akhtiamov (1, 2)
  • Dmitrii Ubskii (2)
  • Evgeniia Feldina (2)
  • Aleksei Pugachev (2, 3)
  • Alexey Karpov (2, 3)
  • Wolfgang Minker (1)

  1. Ulm University, Ulm, Germany
  2. ITMO University, St. Petersburg, Russia
  3. SPIIRAS, St. Petersburg, Russia
