Deep Learning for Acoustic Addressee Detection in Spoken Dialogue Systems

  • Aleksei Pugachev
  • Oleg Akhtiamov
  • Alexey Karpov
  • Wolfgang Minker
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 789)

Abstract

The addressee detection problem arises in real spoken dialogue systems (SDSs), which must distinguish speech addressed to them from speech addressed to humans. In this work, several modalities were analyzed, and acoustic data was chosen as the main modality because it is the most flexible to use in modern SDSs. To solve the addressee detection problem, deep learning methods, namely fully-connected neural networks and Long Short-Term Memory (LSTM) networks, were applied. The developed models were improved by using different optimization methods, activation functions, and a learning rate optimization method. The models were also optimized with a recursive feature elimination method and multiple initialization to increase the training speed. The fully-connected neural network reaches an average recall of 0.78, while the LSTM network reaches an average recall of 0.65. The advantages and disadvantages of both architectures for this particular task are discussed.
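
The reported metric can be illustrated with a minimal sketch. It assumes that "average recall" means the per-class recalls averaged with equal weight (often called unweighted average recall); the abstract does not spell out the exact definition. The function name and the toy labels below are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def average_recall(y_true, y_pred):
    """Mean of per-class recalls, with every class weighted equally."""
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels (assumption, not data from the paper):
# 1 = speech addressed to the system, 0 = speech addressed to a human.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]

uar = average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(uar, accuracy)
```

Under this assumption, the metric treats both classes as equally important regardless of how many utterances each contains, so on imbalanced data it can differ noticeably from plain accuracy.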

Keywords

Off-talk, Multiparty conversation, LSTM, Fully-connected neural network, Speech processing, Speaking style

Notes

Acknowledgments

This work was partially supported by a grant of the President of Russia (No. MD-254.2017.8) and by the RFBR (project No. 16-37-60100).

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Aleksei Pugachev (1, 2)
  • Oleg Akhtiamov (1, 3)
  • Alexey Karpov (1, 2)
  • Wolfgang Minker (3)

  1. ITMO University, Saint-Petersburg, Russia
  2. SPIIRAS Institute, Saint-Petersburg, Russia
  3. Ulm University, Ulm, Germany
