Abstract
The ultimate goal of an evaluation framework is to determine a dialogue system’s performance, which can be defined as “the ability of a system to provide the function it has been designed for” [32]. Also important, particularly for industrial systems, is dialogue quality or usability. To measure usability, one can use subjective measures such as User Satisfaction or likelihood of future use. These subjective metrics are difficult to measure and depend on the context and the individual user, whose goals and values may differ from those of other users. This chapter will survey evaluation frameworks and discuss their advantages and disadvantages. We will examine metrics for evaluating system performance and dialogue quality. We will also discuss evaluation techniques that can be used to automatically detect problems in the dialogue, thus filtering out good dialogues and leaving poor dialogues for further evaluation and investigation [62].
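To make the last point concrete, automatic problem detection is often cast as binary classification over per-dialogue features, in the spirit of the problematic-dialogue predictors of Walker et al. [58] and Wright-Hastie et al. [68]. The sketch below is illustrative only: the features (mean ASR confidence, reprompt rate), the training data, and all function names are invented for this example, and a plain logistic regression stands in for whatever learner a real system would use.

```python
# Minimal sketch of a problematic-dialogue classifier (illustrative only).
# Features, data, and names are hypothetical, not from the chapter.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent for logistic regression."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

# Hypothetical per-dialogue features: [mean ASR confidence, reprompt rate]
X = [[0.9, 0.0], [0.8, 0.1], [0.4, 0.6], [0.3, 0.5], [0.85, 0.05], [0.2, 0.7]]
y = [0, 0, 1, 1, 0, 1]  # 1 = problematic dialogue, 0 = good dialogue

w, b = train_logreg(X, y)

def is_problematic(features, threshold=0.5):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features)) + b) > threshold

# Flag likely-problematic dialogues for manual inspection; filter out the rest.
flagged = [i for i, xi in enumerate(X) if is_problematic(xi)]
print(flagged)
```

In practice the same filtering idea applies at scale: only dialogues the classifier flags are passed on for manual evaluation, while the bulk of good dialogues are discarded.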
References
Ai, H., Litman, D.: Assessing dialog system user simulation evaluation measures using human judges. In: Proceedings of ACL, Columbus, Ohio (USA), pp. 622–629 (2008)
Araki, M., Doshita, S.: Automatic evaluation environment for spoken dialogue systems. In: ECAI Workshop on Dialogue Processing in Spoken Language Systems’96, pp. 183–194 (1996)
Balentine, B., Morgan, D.P.: How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues. Enterprise Integration Group (2002)
Black, A.W., Burger, S., Conkie, A., Hastie, H., Keizer, S., Lemon, O., Merigaud, N., Parent, G., Schubiner, G., Thomson, B., Williams, J.D., Yu, K., Young, S., Eskenazi, M.: Spoken dialog challenge 2010: Comparison of live and control test results. In: Proceedings of SIGdial (2011)
Bonneau-Maynard, H., Devillers, L., Rosset, S.: Predictive performance of dialog systems. In: Proceedings of the Language Resources and Evaluation Conference (LREC) (2000)
Cohen, M.H., Giangola, J.P., Balogh, J.: Voice User Interface Design. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA (2004)
Cuayáhuitl, H., Renals, S., Lemon, O., Shimodaira, H.: Human-computer dialogue simulation using hidden Markov models. In: Proceedings of ASRU, pp. 290–295 (2005)
Danieli, M., Gerbino, E.: Metrics for evaluating dialogue strategies in a spoken language system. CoRR (1996)
Devillers, L., Bonneau-Maynard, H.: Evaluation of dialog strategies for a tourist information retrieval system. In: Proceedings of ICSLP, pp. 1187–1190 (1998)
Eckert, W., Levin, E., Pieraccini, R.: User modelling for spoken dialogue system evaluation. In: Proceedings of ASRU, pp. 80–87 (1997)
Engelbrecht, K.P., Gödde, F., Hartard, F., Ketabdar, H., Möller, S.: Modeling user satisfaction with hidden Markov models. In: Proceedings of SIGdial (2009)
Engelbrecht, K.P., Quade, M., Möller, S.: Analysis of a new simulation approach to dialog system evaluation. Speech Commun. 51, 1234–1252 (2009)
Frostad, K.: Best practices in designing speech interfaces. (2004) http://msdn.microsoft.com/en-us/library/ms994646.aspx
Georgila, K., Henderson, J., Lemon, O.: User Simulation for Spoken Dialogue Systems: Learning and Evaluation. In: Proceedings of Interspeech (2006)
Gorin, A.L., Riccardi, G., Wright, J.H.: How may I help you? Speech Commun. 23, 113–127 (1997)
Grice, H.P.: Logic and conversation. In: Syntax and Semantics, vol. 3: Speech Acts, pp. 41–58 (1975)
Hartikainen, M., Salonen, E.P., Turunen, M.: Subjective evaluation of spoken dialogue systems using SERVQUAL method. In: Proceedings of Interspeech (2004)
Henderson, J., Lemon, O., Georgila, K.: Hybrid reinforcement/supervised learning for dialogue policies from communicator data. In: Proceedings of the IJCAI workshop on Knowledge and Reasoning in Practical Dialogue Systems (2005)
Hirschman, L., Pao, C.: The cost of errors in a spoken language system. In: Proceedings of Eurospeech’93 (1993)
Hone, K.S., Graham, R.: Towards a tool for the subjective assessment of speech system interfaces (SASSI). Nat. Lang. Eng. 6, 303–387 (2000)
ITU-T Supplement 24: Parameters describing the interaction with spoken dialogue systems. Technical report, International Telecommunication Union (2005)
ITU-T Rec. P.851: Subjective quality evaluation of telephone services based on spoken dialogue systems. Technical report, International Telecommunication Union (2003)
Janarthanam, S., Lemon, O.: Learning to adapt to unknown users: referring expression generation in spoken dialogue systems. In: Proceedings of ACL ’10 (2010)
Janarthanam, S., Lemon, O.: A Two-tier User Simulation Model for Reinforcement Learning of Adaptive Referring Expression Generation Policies. In: Proceedings of SIGdial (2009)
Kamm, C.: User Interfaces for voice applications, pp. 422–442. National Academy Press, Washington, DC, USA (1994)
Keeney, R.L., Raiffa, H.: Decisions with multiple objectives: Preferences and value tradeoffs. John Wiley and Sons, New York (1976)
Lamel, L., Rosset, S., Gauvain, J.L.: Considerations in the design and evaluation of spoken language dialog systems. In: Proceedings of ICSLP (2000)
Levin, E., Pieraccini, R., Eckert, W.: A stochastic model of human-machine interaction for learning dialog strategies. IEEE Trans. Speech. Audio. Process. 8(1), 11–23 (2000)
Lin, B.S., Lee, L.S.: Computer-aided analysis and design for spoken dialogue systems based on quantitative simulations. IEEE Trans. Speech. Audio. Process. 9(5), 534–548 (2001)
López-Cózar, R., Callejas, Z., McTear, M.F.: Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artif. Intell. Rev. 26(4), 291–323 (2006)
Möller, S., Englert, R., Engelbrecht, K., Hafner, V., Jameson, A., Oulasvirta, A., Raake, A., Reithinger, N.: MeMo: Towards automatic usability evaluation of spoken dialogue services by user error simulations (2006)
Möller, S.: Quality of Telephone-Based Spoken Dialogue Systems. Springer (2005)
Möller, S., Ward, N.G.: A framework for model-based evaluation of spoken dialog systems. In: Proceedings of SIGdial (2008)
Paek, T.: Empirical methods for evaluating dialog systems. In: Proceedings of the Second SIGdial Workshop on Discourse and Dialogue, vol. 16. Association for Computational Linguistics (2001)
Paek, T.: Toward evaluation that leads to best practices: reconciling dialog evaluation in research and industry. In: Proceedings of the Workshop on Bridging the Gap: Academic and Industrial Research in Dialog Technologies, pp. 40–47, Association for Computational Linguistics (2007)
Pieraccini, R., Huerta, J.: Where do we go from here? Research and commercial spoken dialog systems. In: Proceedings of the 6th SIGdial Workshop on Discourse and Dialog (2005)
Pietquin, O.: A framework for unsupervised learning of dialogue strategies. Presses univ. de Louvain (2004)
Pietquin, O., Hastie, H.: A survey on metrics for the evaluation of user simulations. Knowl. Eng. Rev. (2013, accepted for publication)
Putois, G., Young, S., Henderson, J., Lemon, O., Rieser, V., Liu, X., Bretier, P., Laroche, R.: Initial communication architecture and module interface definitions. Technical report, Classic Deliverable D5.1.1 (2008)
Rahim, M., Fabbrizio, G.D., Kamm, C., Walker, M., Pokrovsky, A., Ruscitti, P., Levin, E., Lee, S., Syrdal, A., Schlosser, K.: Voice-if: A mixed-initiative spoken dialogue system for. In: Proceedings of Eurospeech (2001)
Rieser, V., Lemon, O.: Simulations for learning dialogue strategies. In: Proceedings of Interspeech, Pittsburg (USA) (2006)
Rieser, V., Lemon, O.: Reinforcement Learning for Adaptive Dialogue Systems: A Data-driven Methodology for Dialogue Management and Natural Language Generation. Springer (2011)
Rieser, V., Lemon, O.: Automatic learning and evaluation of user-centered objective functions for dialogue system optimisation. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC) (2008)
Rieser, V., Lemon, O.: Learning effective multimodal dialogue strategies from wizard-of-oz data: bootstrapping and evaluation (2008)
Schatzmann, J., Georgila, K., Young, S.: Quantitative evaluation of user simulation techniques for spoken dialogue systems. In: Proceedings of SIGdial’05 (2005)
Scheffler, T., Roller, R., Reithinger, N.: SpeechEval – evaluating spoken dialog systems by user simulation. In: Proceedings of the 6th IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Pasadena, CA, USA, pp. 93–98 (2009)
Schmitt, A., Schatz, B., Minker, W.: Modeling and Predicting Quality in Spoken Human-Computer Interaction. In: Proceedings of SIGdial (2011)
Shriberg, E., Wade, E., Price, P.: Human-machine problem solving using spoken language systems (SLS): factors affecting performance and user satisfaction. In: HLT ’91: Proceedings of the workshop on Speech and Natural Language, pp. 49–54. Association for Computational Linguistics (1992)
Suendermann, D., Evanini, K., Liscombe, J., Hunter, P., Dayanidhi, K., Pieraccini, R.: From rule-based to statistical grammars: Continuous improvement of large-scale spoken dialog systems. In: Proceedings of ICASSP 2009, Taipei, Taiwan (2009)
Suendermann, D., Liscombe, J., Pieraccini, R.: Contender. In: Proceedings of the SLT 2010 IEEE Workshop on Spoken Language Technology (2010)
Suendermann, D., Liscombe, J., Dayanidhi, K., Pieraccini, R.: A handsome set of metrics to measure utterance classification performance in spoken dialog systems. In: Proceedings of SIGdial, pp. 349–356 (2009)
Walker, M.A., Langkilde-Geary, I., Wright-Hastie, H., Wright, J., Gorin, A.: Automatically training a problematic dialogue predictor for a spoken dialogue system. J. Artif. Intell. Res. 16, 293–319 (2002)
Walker, M., Rudnicky, A., Aberdeen, J., Bratt, E.O., Garofolo, J., Hastie, H., Le, A., Pellom, B., Potamianos, A., Passonneau, R., Prasad, R., Roukos, S., Sanders, G., Seneff, S., Stallard, D.: DARPA Communicator evaluation: Progress from 2000 to 2001. In: Proceedings of ICSLP 02, pp. 273–276 (2002)
Walker, M.A., Passonneau, R., Boland, J.E.: Quantitative and qualitative evaluation of DARPA Communicator spoken dialogue systems. In: Proceedings of ACL (2001)
Walker, M.A., Aberdeen, J., Boland, J., Bratt, E., Garofolo, J., Hirschman, L., Le, A., Lee, S., Narayanan, S., Papineni, K., Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S.: DARPA Communicator dialog travel planning systems: The June 2000 data collection. In: Proceedings of Eurospeech (2001)
Walker, M.A., Rudnicky, A., Aberdeen, J., Bratt, E., Garofolo, J., Hastie, H., Le, A., Pellom, B., Potamianos, A., Passonneau, R., Prasad, R., Roukos, S., Sanders, G., Seneff, S., Stallard, D.: DARPA Communicator: Cross-system results for the 2001 evaluation. In: Proceedings of ICSLP (2002)
Walker, M.A., Kamm, C.A., Litman, D.J.: Towards Developing General Models of Usability with PARADISE. Nat. Lang. Eng., 6(3), 363–377 (2000)
Walker, M., Passoneau, R.: DATE: A dialogue act tagging scheme for evaluation. In: Proceedings of the Human Language Technology Conference (HLT) (2001)
Walker, M.A.: An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email. J. Artif. Intell. Res. 12, 387–416 (2000)
Walker, M.A.: Can we talk? methods for evaluation and training of spoken dialogue systems. Lang. Resour. Evaluation 39(1), 65–75 (2005)
Walker, M., Boland, J., Kamm, C.: The utility of elapsed time as a usability metric for spoken dialogue systems. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU99) (1999)
Wright-Hastie, H., Prasad, R., Walker, M.: What’s the trouble: Automatically identifying problematic dialogues in DARPA Communicator dialogue systems. In: Proceedings of ACL, pp. 384–391 (2002)
Young, S., Gasic, M., Keizer, S., Mairesse, F., Schatzmann, J., Thomson, B., Yu, K.: The hidden information state model: a practical framework for POMDP-based spoken dialogue management. Computer Speech and Language 24(2), 150–174 (2010)
Acknowledgements
I would like to acknowledge Olivier Pietquin and Oliver Lemon for their guidance in writing this chapter. The research leading to this work has received funding from the EC’s FP7 programmes: (FP7/2007–13) under grant agreement no. 216594 (CLASSiC); (FP7/2011–14) under grant agreement no. 248765 (Help4Mood); (FP7/2011–14) under grant agreement no. 287615 (PARLANCE).
Copyright information
© 2012 Springer Science+Business Media New York
Cite this chapter
Hastie, H. (2012). Metrics and Evaluation of Spoken Dialogue Systems. In: Lemon, O., Pietquin, O. (eds) Data-Driven Methods for Adaptive Spoken Dialogue Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4803-7_7
Print ISBN: 978-1-4614-4802-0
Online ISBN: 978-1-4614-4803-7