Human Evaluation of a German Surface Realisation Ranker

  • Aoife Cahill
  • Martin Forst
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5790)


In this chapter we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context affects these choices. We show that native speakers accept a considerable amount of variation in word order, but that certain factors clearly make some realisation alternatives more natural than others. We then examine correlations between native-speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the machine translation and summarisation communities and find that, for the relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on the naturalness judgement task, the correlation between the human judgements and the automatic metrics is quite weak; the General Text Matcher (GTM) tool provides the only metric that correlates with the human judgements at a statistically significant level.
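The correlation analysis described above can be illustrated with a small sketch. The scores below are invented for illustration only (they are not the chapter's data), and Spearman's rank correlation is used as a representative rank-based measure of agreement between human judgements and an automatic metric:

```python
# Sketch: rank correlation between human judgements and an automatic
# metric score. All numbers are hypothetical; the chapter's actual data
# and metrics (e.g. GTM, BLEU) are not reproduced here.

def ranks(values):
    """Assign 1-based ranks to values; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical human naturalness scores vs. an automatic metric
human = [4.5, 3.0, 2.0, 4.0, 1.5]
metric = [0.81, 0.40, 0.55, 0.70, 0.35]
print(round(spearman(human, metric), 3))  # → 0.9
```

A rank-based coefficient is a natural fit here because both the relative ranking task and the naturalness judgement task compare orderings rather than absolute scores.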


Keywords: generation evaluation · surface realisation · human evaluation · German · human judgements · automatic metrics · correlation





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Aoife Cahill¹
  • Martin Forst²
  1. Institut für Maschinelle Sprachverarbeitung (IMS), University of Stuttgart, Stuttgart, Germany
  2. Powerset/Microsoft, San Francisco, USA
