Abstract
The goal of addressee detection is to answer the question ‘Are you addressing me?’ To participate in multiparty conversations, a spoken dialogue system must determine whether a user is addressing the system or another human. This paper describes three levels of speech and text analysis (acoustical, lexical, and syntactical) for multimodal addressee detection and shows how each level relates to classification performance for different categories of speech. We propose several classification models and compare their performance with the results reported by the authors of the Smart Video Corpus, which we use in our experiments. Our most effective meta-classifier, which combines acoustical, syntactical, and lexical features, achieves an unweighted average recall of 0.917, a nine percent advantage over the best baseline model, even though the baseline classifier additionally uses head orientation data. We also propose an LSTM neural network for text classification that replaces the lexical and syntactical classifiers with a single model. It matches the performance of the most effective meta-classifier, even though the meta-classifier additionally analyses acoustical data.
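The unweighted average recall reported above is the mean of the per-class recalls, so each class counts equally however many utterances it contains. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name, the class labels `"system"` and `"human"`, and the toy predictions are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of true samples per class
    correct = defaultdict(int)  # correctly predicted samples per class
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: 8 system-directed and 2 human-directed utterances.
y_true = ["system"] * 8 + ["human"] * 2
y_pred = ["system"] * 8 + ["system", "human"]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On this toy data the recall is 1.0 for the majority class and 0.5 for the minority class, so the unweighted average recall is 0.75 while plain accuracy is 0.9. The gap illustrates why a class-balanced metric is informative when one addressee category dominates.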
Acknowledgements
This research is partially supported by DAAD together with the Ministry of Education and Science of the Russian Federation within the Michail Lomonosov Program (project No. 8.704.2016/DAAD), by RFBR (project No. 16-37-60100), by the Government of the Russian Federation (grant No. 074-U01), and by the Transregional Collaborative Research Centre SFB/TRR 62 “Companion-Technology for Cognitive Technical Systems”, which is funded by the German Research Foundation (DFG).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Akhtiamov, O., Ubskii, D., Feldina, E., Pugachev, A., Karpov, A., Minker, W. (2017). Are You Addressing Me? Multimodal Addressee Detection in Human-Human-Computer Conversations. In: Karpov, A., Potapova, R., Mporas, I. (eds) Speech and Computer. SPECOM 2017. Lecture Notes in Computer Science(), vol 10458. Springer, Cham. https://doi.org/10.1007/978-3-319-66429-3_14
DOI: https://doi.org/10.1007/978-3-319-66429-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66428-6
Online ISBN: 978-3-319-66429-3
eBook Packages: Computer Science (R0)