
Metrics and Evaluation of Spoken Dialogue Systems

A chapter in Data-Driven Methods for Adaptive Spoken Dialogue Systems

Abstract

The ultimate goal of an evaluation framework is to determine a dialogue system’s performance, which can be defined as “the ability of a system to provide the function it has been designed for” [32]. Also important, particularly for industrial systems, is dialogue quality or usability. Usability can be measured with subjective metrics such as User Satisfaction or the likelihood of future use; these metrics are difficult to obtain and depend on the context and on the individual user, whose goals and values may differ from those of other users. This chapter surveys evaluation frameworks and discusses their advantages and disadvantages. We examine metrics for evaluating system performance and dialogue quality. We also discuss evaluation techniques that can automatically detect problems in a dialogue, filtering out good dialogues and leaving poor ones for further evaluation and investigation [62].
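To make the last point concrete, below is a minimal sketch (not taken from the chapter) of the kind of “problematic dialogue” filter described above, in the spirit of [62]: a classifier is trained on human-labelled dialogues and then used to discard dialogues it considers unproblematic, leaving the likely-problematic ones for manual review. The per-dialogue features used here (turn count, mean ASR confidence, number of re-prompts) and the logistic-regression model are illustrative assumptions, not the setup of the cited work.

```python
# Illustrative sketch of a "problematic dialogue" filter in the spirit
# of [62]. Feature set and model choice are assumptions, for illustration.
from dataclasses import dataclass
from typing import List

from sklearn.linear_model import LogisticRegression


@dataclass
class Dialogue:
    n_turns: int           # total number of system/user turns
    asr_confidence: float  # mean ASR confidence across the dialogue
    n_reprompts: int       # times the system had to re-ask for input
    problematic: bool      # human label, used only for training


def features(d: Dialogue) -> List[float]:
    return [float(d.n_turns), d.asr_confidence, float(d.n_reprompts)]


def train_filter(labelled: List[Dialogue]) -> LogisticRegression:
    """Fit a binary classifier on human-labelled dialogues."""
    X = [features(d) for d in labelled]
    y = [d.problematic for d in labelled]
    return LogisticRegression().fit(X, y)


def flag_for_review(model: LogisticRegression,
                    dialogues: List[Dialogue],
                    threshold: float = 0.5) -> List[Dialogue]:
    """Filter out 'good' dialogues; return the likely-problematic rest."""
    probs = model.predict_proba([features(d) for d in dialogues])[:, 1]
    return [d for d, p in zip(dialogues, probs) if p >= threshold]
```

The cited work learns from features extracted automatically from dialogue logs; the point of the sketch is only the filtering pattern itself: cheap automatic triage first, costly human evaluation reserved for the dialogues that are flagged.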


Notes

  1. http://www.classic-project.org

  2. http://www.parlance-project.eu

References

  1. Ai, H., Litman, D.: Assessing dialog system user simulation evaluation measures using human judges. In: Proceedings of ACL, Columbus, Ohio (USA), pp. 622–629 (2008)

  2. Araki, M., Doshita, S.: Automatic evaluation environment for spoken dialogue systems. In: ECAI Workshop on Dialogue Processing in Spoken Language Systems ’96, pp. 183–194 (1996)

  3. Balentine, B., Morgan, D.P.: How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues. Enterprise Integration Group (2002)

  4. Black, A.W., Burger, S., Conkie, A., Hastie, H., Keizer, S., Lemon, O., Merigaud, N., Parent, G., Schubiner, G., Thomson, B., Williams, J.D., Yu, K., Young, S., Eskenazi, M.: Spoken dialog challenge 2010: comparison of live and control test results. In: Proceedings of SIGdial (2011)

  5. Bonneau-Maynard, H., Devillers, L., Rosset, S.: Predictive performance of dialog systems. In: Proceedings of the Language Resources and Evaluation Conference (LREC) (2000)

  6. Cohen, M.H., Giangola, J.P., Balogh, J.: Voice User Interface Design. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA (2004)

  7. Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H.: Human-computer dialogue simulation using hidden Markov models. In: Proceedings of ASRU, pp. 290–295 (2005)

  8. Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language system. CoRR (1996)

  9. Devillers, L., Bonneau-Maynard, H.: Evaluation of dialog strategies for a tourist information retrieval system. In: Proceedings of ICSLP, pp. 1187–1190 (1998)

  10. Eckert, W., Levin, E., Pieraccini, R.: User modelling for spoken dialogue system evaluation. In: Proceedings of ASRU, pp. 80–87 (1997)

  11. Engelbrecht, K.P., Gödde, F., Hartard, F., Ketabdar, H., Möller, S.: Modeling user satisfaction with hidden Markov models. In: Proceedings of SIGdial (2009)

  12. Engelbrecht, K.P., Quade, M., Möller, S.: Analysis of a new simulation approach to dialog system evaluation. Speech Commun. 51, 1234–1252 (2009)

  13. Frostad, K.: Best practices in designing speech interfaces (2004). http://msdn.microsoft.com/en-us/library/ms994646.aspx

  14. Georgila, K., Henderson, J., Lemon, O.: User simulation for spoken dialogue systems: learning and evaluation. In: Proceedings of Interspeech (2006)

  15. Gorin, A.L., Riccardi, G., Wright, J.H.: How may I help you? Speech Commun. 23, 113–127 (1997)

  16. Grice, H.P.: Logic and conversation. In: Syntax and Semantics, Vol. 3: Speech Acts, pp. 41–58 (1975)

  17. Hartikainen, M., Salonen, E.P., Turunen, M.: Subjective evaluation of spoken dialogue systems using SERVQUAL method. In: Proceedings of Interspeech (2004)

  18. Henderson, J., Lemon, O., Georgila, K.: Hybrid reinforcement/supervised learning for dialogue policies from Communicator data. In: Proceedings of the IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems (2005)

  19. Hirschman, L., Pao, C.: The cost of errors in a spoken language system. In: Proceedings of Eurospeech ’93 (1993)

  20. Hone, K.S., Graham, R.: Towards a tool for the subjective assessment of speech system interfaces (SASSI). Nat. Lang. Eng. 6, 287–303 (2000)

  21. ITU-T Supplement 24: Parameters describing the interaction with spoken dialogue systems. Technical report, International Telecommunication Union (2005)

  22. ITU-T Rec. P.851: Subjective quality evaluation of telephone services based on spoken dialogue systems. Technical report, International Telecommunication Union (2003)

  23. Janarthanam, S., Lemon, O.: Learning to adapt to unknown users: referring expression generation in spoken dialogue systems. In: Proceedings of ACL (2010)

  24. Janarthanam, S., Lemon, O.: A two-tier user simulation model for reinforcement learning of adaptive referring expression generation policies. In: Proceedings of SIGdial (2009)

  25. Kamm, C.: User interfaces for voice applications, pp. 422–442. National Academy Press, Washington, DC, USA (1994)

  26. Keeney, R.L., Raiffa, H.: Decisions with Multiple Objectives: Preferences and Value Tradeoffs. John Wiley and Sons, New York (1976)

  27. Lamel, L., Rosset, S., Gauvain, J.L.: Considerations in the design and evaluation of spoken language dialog systems. In: Proceedings of ICSLP (2000)

  28. Levin, E., Pieraccini, R., Eckert, W.: A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans. Speech Audio Process. 8(1), 11–23 (2000)

  29. Lin, B.S., Lee, L.S.: Computer-aided analysis and design for spoken dialogue systems based on quantitative simulations. IEEE Trans. Speech Audio Process. 9(5), 534–548 (2001)

  30. López-Cózar, R., Callejas, Z., McTear, M.F.: Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artif. Intell. Rev. 26(4), 291–323 (2006)

  31. Möller, S., Englert, R., Engelbrecht, K., Hafner, V., Jameson, A., Oulasvirta, A., Raake, A., Reithinger, N.: MeMo: towards automatic usability evaluation of spoken dialogue services by user error simulations. In: Proceedings of Interspeech (2006)

  32. Möller, S.: Quality of Telephone-Based Spoken Dialogue Systems. Springer (2005)

  33. Möller, S., Ward, N.G.: A framework for model-based evaluation of spoken dialog systems. In: Proceedings of SIGdial (2008)

  34. Paek, T.: Empirical methods for evaluating dialog systems. In: Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, Vol. 16. Association for Computational Linguistics (2001)

  35. Paek, T.: Toward evaluation that leads to best practices: reconciling dialog evaluation in research and industry. In: Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pp. 40–47. Association for Computational Linguistics (2007)

  36. Pieraccini, R., Huerta, J.: Where do we go from here? Research and commercial spoken dialog systems. In: Proceedings of the 6th SIGdial Workshop on Discourse and Dialog (2005)

  37. Pietquin, O.: A Framework for Unsupervised Learning of Dialogue Strategies. Presses universitaires de Louvain (2004)

  38. Pietquin, O., Hastie, H.: A survey on metrics for the evaluation of user simulations. Knowl. Eng. Rev. (2013, accepted for publication)

  39. Putois, G., Young, S., Henderson, J., Lemon, O., Rieser, V., Liu, X., Bretier, P., Laroche, R.: Initial communication architecture and module interface definitions. Technical report, CLASSiC Deliverable D5.1.1 (2008)

  40. Rahim, M., Fabbrizio, G.D., Kamm, C., Walker, M., Pokrovsky, A., Ruscitti, P., Levin, E., Lee, S., Syrdal, A., Schlosser, K.: Voice-IF: a mixed-initiative spoken dialogue system for AT&T conference services. In: Proceedings of Eurospeech (2001)

  41. Rieser, V., Lemon, O.: Simulations for learning dialogue strategies. In: Proceedings of Interspeech, Pittsburgh (USA) (2006)

  42. Rieser, V., Lemon, O.: Reinforcement Learning for Adaptive Dialogue Systems: A Data-driven Methodology for Dialogue Management and Natural Language Generation. Springer (2011)

  43. Rieser, V., Lemon, O.: Automatic learning and evaluation of user-centered objective functions for dialogue system optimisation. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC) (2008)

  44. Rieser, V., Lemon, O.: Learning effective multimodal dialogue strategies from Wizard-of-Oz data: bootstrapping and evaluation. In: Proceedings of ACL (2008)

  45. Schatzmann, J., Georgila, K., Young, S.: Quantitative evaluation of user simulation techniques for spoken dialogue systems. In: Proceedings of SIGdial ’05 (2005)

  46. Scheffler, T., Roller, R., Reithinger, N.: SpeechEval – evaluating spoken dialog systems by user simulation. In: Proceedings of the 6th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Pasadena, CA, USA, pp. 93–98 (2009)

  47. Schmitt, A., Schatz, B., Minker, W.: Modeling and predicting quality in spoken human-computer interaction. In: Proceedings of SIGdial (2011)

  48. Shriberg, E., Wade, E., Price, P.: Human-machine problem solving using spoken language systems (SLS): factors affecting performance and user satisfaction. In: HLT ’91: Proceedings of the Workshop on Speech and Natural Language, pp. 49–54. Association for Computational Linguistics (1992)

  49. Suendermann, D., Evanini, K., Liscombe, J., Hunter, P., Dayanidhi, K., Pieraccini, R.: From rule-based to statistical grammars: continuous improvement of large-scale spoken dialog systems. In: Proceedings of ICASSP, Taipei, Taiwan (2009)

  50. Suendermann, D., Liscombe, J., Pieraccini, R.: Contender. In: Proceedings of the IEEE Workshop on Spoken Language Technology (SLT) (2010)

  51. Suendermann, D., Liscombe, J., Dayanidhi, K., Pieraccini, R.: A handsome set of metrics to measure utterance classification performance in spoken dialog systems. In: Proceedings of SIGdial, pp. 349–356 (2009)

  52. Walker, M.A., Langkilde-Geary, I., Wright-Hastie, H., Wright, J., Gorin, A.: Automatically training a problematic dialogue predictor for a spoken dialogue system. J. Artif. Intell. Res. 16, 293–319 (2002)

  53. Walker, M., Rudnicky, A., Aberdeen, J., Bratt, E., Garofolo, J., Hastie, H., Le, A., Pellom, B., Potamianos, A., Passonneau, R., Prasad, R., Roukos, S., Sanders, G., Seneff, S., Stallard, D.: DARPA Communicator evaluation: progress from 2000 to 2001. In: Proceedings of ICSLP ’02, pp. 273–276 (2002)

  54. Walker, M.A., Passonneau, R., Boland, J.E.: Quantitative and qualitative evaluation of DARPA Communicator spoken dialogue systems. In: Proceedings of ACL (2001)

  55. Walker, M.A., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: the June 2000 data collection. In: Proceedings of Eurospeech (2001)

  56. Walker, M.A., Rudnicky, A., Aberdeen, J., Bratt, E., Garofolo, J., Hastie, H., Le, A., Pellom, B., Potamianos, A., Passonneau, R., Prasad, R., Roukos, S., Sanders, G., Seneff, S., Stallard, D.: DARPA Communicator: cross-system results for the 2001 evaluation. In: Proceedings of ICSLP (2002)

  57. Walker, M.A., Kamm, C.A., Litman, D.J.: Towards developing general models of usability with PARADISE. Nat. Lang. Eng. 6(3), 363–377 (2000)

  58. Walker, M., Passonneau, R.: DATE: a dialogue act tagging scheme for evaluation. In: Proceedings of the Human Language Technology Conference (HLT) (2001)

  59. Walker, M.A.: An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email. J. Artif. Intell. Res. 12, 387–416 (2000)

  60. Walker, M.A.: Can we talk? Methods for evaluation and training of spoken dialogue systems. Lang. Resour. Eval. 39(1), 65–75 (2005)

  61. Walker, M.A., Boland, J., Kamm, C.: The utility of elapsed time as a usability metric for spoken dialogue systems. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU ’99) (1999)

  62. Wright-Hastie, H., Prasad, R., Walker, M.: What’s the trouble: automatically identifying problematic dialogues in DARPA Communicator dialogue data. In: Proceedings of ACL, pp. 384–391 (2002)

  63. Young, S., Gasic, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., Yu, K.: The hidden information state model: a practical framework for POMDP-based spoken dialogue management. Comput. Speech Lang. 24(2), 150–174 (2010)


Acknowledgements

I would like to acknowledge Olivier Pietquin and Oliver Lemon for their guidance in writing this chapter. The research leading to this work has received funding from the EC’s FP7 programmes: (FP7/2007–13) under grant agreement no. 216594 (CLASSiC); (FP7/2011–14) under grant agreement no. 248765 (Help4Mood); (FP7/2011–14) under grant agreement no. 287615 (PARLANCE).

Author information

Corresponding author

Correspondence to Helen Hastie.


Copyright information

© 2012 Springer Science+Business Media New York

About this chapter

Cite this chapter

Hastie, H. (2012). Metrics and Evaluation of Spoken Dialogue Systems. In: Lemon, O., Pietquin, O. (eds) Data-Driven Methods for Adaptive Spoken Dialogue Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4803-7_7


  • DOI: https://doi.org/10.1007/978-1-4614-4803-7_7

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-4802-0

  • Online ISBN: 978-1-4614-4803-7

