Recognising Conversational Speech: What an Incremental ASR Should Do for a Dialogue System and How to Get There

  • Timo Baumann
  • Casey Kennington
  • Julian Hough
  • David Schlangen
Chapter
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 427)

Abstract

Automatic speech recognition (ASR) is becoming not only increasingly accurate, but also increasingly adapted to producing timely, incremental output. However, overall accuracy and timeliness alone are insufficient for interactive dialogue systems, which require stability in the output and responsiveness to the utterance as it unfolds. Furthermore, for a dialogue system to achieve a deep understanding of user utterances and to deal with phenomena such as disfluencies, these phenomena should be preserved or marked up for use by downstream components, such as language understanding, rather than filtered out. Similarly, word timing can be informative for analyzing deictic expressions in a situated environment and should be available for analysis. Here we investigate the overall accuracy and incremental performance of three widely used systems and discuss their suitability from these perspectives. From the differing performance along these measures we derive a picture of the requirements for incremental ASR in dialogue systems and describe freely available tools for using and evaluating incremental ASR.
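The output stability mentioned above can be made concrete: an incremental ASR emits a growing sequence of partial hypotheses, and any word it later retracts ("revokes") is wasted work for downstream components. The following sketch (a minimal illustration, not the authors' InproTK implementation; all names are our own) diffs successive partial hypotheses into revoke/add edits and computes an edit-overhead-style stability score, the fraction of edits that turn out to be superfluous:

```python
# Illustrative sketch of incremental-ASR output stability (hypothetical
# helper names, not the paper's actual tooling).

def diff_hypotheses(prev, curr):
    """Return (revokes, adds) that turn word list `prev` into `curr`:
    words after the longest common prefix are revoked, then new words added."""
    common = 0
    while common < min(len(prev), len(curr)) and prev[common] == curr[common]:
        common += 1
    return prev[common:], curr[common:]

def edit_overhead(partials):
    """Fraction of superfluous edits over a sequence of incremental
    hypotheses whose last element is the final result. A perfectly
    stable recognizer needs exactly one add per final word, so any
    edit beyond that count is overhead."""
    total_edits = 0
    prev = []
    for curr in partials:
        revokes, adds = diff_hypotheses(prev, curr)
        total_edits += len(revokes) + len(adds)
        prev = curr
    necessary = len(partials[-1]) if partials else 0
    return (total_edits - necessary) / total_edits if total_edits else 0.0

# Toy sequence of partial hypotheses for the utterance "take the red cross":
partials = [
    ["take"],
    ["take", "the"],
    ["tape", "the"],                 # unstable: "take" revoked, "tape" added
    ["take", "the", "red"],          # reverts again
    ["take", "the", "red", "cross"],
]
print(edit_overhead(partials))  # prints 0.6666...: two thirds of all edits were superfluous
```

A recognizer with identical final word error rate but a lower edit overhead is far easier to build responsive dialogue behavior on, since downstream modules retract fewer partial interpretations.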

Keywords

Incremental ASR · Conversational speech · System requirements · Evaluation

Notes

Acknowledgements

This work is supported by a Daimler and Benz Foundation PostDoc Grant to the first author, by the BMBF KogniHome project, DFG DUEL project (grant SCHL 845/5-1) and the Cluster of Excellence Cognitive Interaction Technology ‘CITEC’ (EXC 277) at Bielefeld University.

Copyright information

© Springer Science+Business Media Singapore 2017

Authors and Affiliations

  • Timo Baumann (1), email author
  • Casey Kennington (2)
  • Julian Hough (2)
  • David Schlangen (2)
  1. Natural Language Systems Group, Informatics Department, Universität Hamburg, Hamburg, Germany
  2. Dialogue Systems Group, Faculty of Linguistics and Literature and CITEC, Bielefeld University, Bielefeld, Germany