Extrinsic Versus Intrinsic Evaluation of Natural Language Generation for Spoken Dialogue Systems and Social Robotics

  • Helen Hastie
  • Heriberto Cuayáhuitl
  • Nina Dethlefs
  • Simon Keizer
  • Xingkun Liu
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 427)


In the past 10 years, very few published studies have included any kind of extrinsic evaluation of an NLG component in an end-to-end system, be it for phone or mobile-based dialogues or social robotic interaction. This may be because such evaluations are very costly to set up and run for a single component. The question therefore arises whether there is anything to be gained over and above the intrinsic quality measures obtained in off-line experiments. In this article, we describe a case study evaluating two variants of an NLG surface realiser and show that there are significant differences in both extrinsic and intrinsic measures. These differences can be used to inform further iterations of component and system development.


Natural language generation; Evaluation; Spoken dialogue systems



This research was funded by the European Commission FP7 programme FP7/2011-14 under grant agreement no. 287615 (PARLANCE). We thank all members of the PARLANCE consortium for their help in designing, building and testing the Parlance end-to-end spoken dialogue system. We would also like to acknowledge other members of the Heriot-Watt Parlance team, in particular Prof. Oliver Lemon and Dr. Verena Rieser.



Copyright information

© Springer Science+Business Media Singapore 2017

Authors and Affiliations

  • Helen Hastie (1)
  • Heriberto Cuayáhuitl (2)
  • Nina Dethlefs (3)
  • Simon Keizer (1)
  • Xingkun Liu (1)
  1. School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, UK
  2. School of Computer Science, University of Lincoln, Lincoln, UK
  3. School of Engineering and Computer Science, University of Hull, Hull, UK
