Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor

  • Robin Ruede
  • Markus Müller
  • Sebastian Stüker
  • Alex Waibel
Chapter
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 510)

Abstract

Using supporting backchannel (BC) cues can make human-computer interaction more social. BCs provide feedback from the listener to the speaker, indicating that the speaker is still being listened to. BCs can be expressed in different ways, depending on the modality of the interaction, for example as gestures or acoustic cues. In this work, we considered only acoustic cues. We propose an approach to detecting BC opportunities based on acoustic input features such as power and pitch. While other work in the field relies on hand-written rule sets or specialized features, we made use of artificial neural networks, which are capable of deriving higher-order features from the input features themselves. In our setup, we first used a fully connected feed-forward network to establish an updated baseline in comparison to our previously proposed setup. We then extended this setup with Long Short-Term Memory (LSTM) networks, which have been shown to outperform feed-forward setups on various tasks. Our best system achieved an F1-score of 0.37 using power and pitch features. Adding linguistic information using word2vec increased the score to 0.39.
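
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score is computed from prediction counts. The function name and the example counts are hypothetical illustrations and are not taken from this work; in particular, the criterion for counting a predicted BC as correct (for example, a timing tolerance) is not specified in the abstract and is left out here.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall.

    Returns 0.0 when precision or recall is undefined (no predictions,
    or no reference events) or when both are zero.
    """
    predicted = true_positives + false_positives  # all predicted BCs
    actual = true_positives + false_negatives     # all reference BCs
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only for illustration.
example = f1_score(true_positives=30, false_positives=50, false_negatives=50)
print(f"F1 = {example:.3f}")
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both reasonably high, which makes it a stricter summary than accuracy when the events being predicted are rare.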

Keywords

Backchannels, Building rapport, Artificial intelligence, Speech recognition

Notes

Acknowledgements

This work was conducted in the SecondHands project, which has received funding from the European Union's Horizon 2020 Research and Innovation programme (call: H2020-ICT-2014-1, RIA) under grant agreement No 643950.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2019

Authors and Affiliations

  • Robin Ruede (1)
  • Markus Müller (1, corresponding author)
  • Sebastian Stüker (1)
  • Alex Waibel (1, 2)

  1. Karlsruhe Institute of Technology, Karlsruhe, Germany
  2. Carnegie Mellon University, PA, USA
