
Human Evaluation of a German Surface Realisation Ranker

Chapter in: Empirical Methods in Natural Language Generation (EACL/ENLG 2009)

Abstract

In this chapter we present a human evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and of automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent preceding context has an effect on choice. We show that native speakers accept considerable variation in word order, but that there are also clearly factors that make certain realisation alternatives more natural than others. We then examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We consider a number of metrics from the machine translation and summarisation communities and find that, for the relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on the naturalness judgement task, the correlations between the human judgements and the automatic metrics were quite weak, with the General Text Matcher (GTM) tool providing the only metric that correlated with the human judgements at a statistically significant level.
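As an illustration of the correlation analysis described above, the minimal Python sketch below correlates per-sentence human judgements with per-sentence automatic metric scores using Spearman's rank correlation. All scores and metric names here are invented for illustration, and the choice of Spearman's coefficient is an assumption; none of this is taken from the chapter itself.

```python
# A minimal sketch (assumed setup, not the chapter's code or data):
# correlate human naturalness judgements with automatic metric scores.
from scipy.stats import spearmanr

# Hypothetical per-sentence human naturalness judgements.
human_judgements = [4.5, 3.0, 2.5, 4.0, 1.5, 3.5]

# Hypothetical per-sentence scores from two automatic metrics,
# standing in for e.g. a BLEU-style score and a GTM-style F-measure.
metric_scores = {
    "bleu_like": [0.62, 0.41, 0.45, 0.58, 0.20, 0.50],
    "gtm_like": [0.70, 0.48, 0.40, 0.66, 0.25, 0.55],
}

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(human_judgements, scores)
    # A metric correlates with the human judgements "at a statistically
    # significant level" when the p-value is below the chosen threshold
    # (commonly 0.05).
    verdict = "significant" if p_value < 0.05 else "not significant"
    print(f"{name}: rho = {rho:.3f}, p = {p_value:.3f} ({verdict})")
```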

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Cahill, A., Forst, M. (2010). Human Evaluation of a German Surface Realisation Ranker. In: Krahmer, E., Theune, M. (eds.) Empirical Methods in Natural Language Generation (EACL/ENLG 2009). Lecture Notes in Computer Science, vol. 5790. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15573-4_11

  • DOI: https://doi.org/10.1007/978-3-642-15573-4_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15572-7

  • Online ISBN: 978-3-642-15573-4
