Semi-formal Evaluation of Conversational Characters

  • Ron Artstein
  • Sudeep Gandhe
  • Jillian Gerten
  • Anton Leuski
  • David Traum
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5533)


Conversational dialogue systems cannot be evaluated in a fully formal manner, because dialogue is heavily dependent on context and current dialogue theory is not precise enough to specify a target output ahead of time. Instead, we evaluate dialogue systems in a semi-formal manner, using human judges to rate the coherence of a conversational character and correlating these judgments with measures extracted from within the system. We present a series of three evaluations of a single conversational character over the course of a year, demonstrating how this kind of evaluation helps bring about an improvement in overall dialogue coherence.
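The correlation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which statistic the authors used, so the choice of Spearman rank correlation here is an assumption, as are the names `judge_ratings` and `system_scores` and every number in the example (all invented for illustration, not taken from the paper).

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: coherence ratings (1-5) from three judges for each of
# five character responses, and one system-internal score per response.
judge_ratings = [[4, 5, 4], [2, 1, 2], [5, 5, 4], [3, 2, 3], [1, 2, 1]]
system_scores = [0.71, 0.20, 0.88, 0.15, 0.12]

mean_ratings = [mean(r) for r in judge_ratings]
rho = spearman(mean_ratings, system_scores)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based statistic is used in this sketch because ratings on a short ordinal scale need not be evenly spaced; a different measure could be substituted without changing the overall procedure.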


Keywords: Speech Recognition; Automatic Speech Recognition; Dialogue System; Selection Policy; Word Error Rate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Ron Artstein (1)
  • Sudeep Gandhe (1)
  • Jillian Gerten (1)
  • Anton Leuski (1)
  • David Traum (1)

  1. Institute for Creative Technologies, University of Southern California, Marina del Rey, USA
