Empirical Methods in Natural Language Generation

Volume 5790 of the series Lecture Notes in Computer Science, pp. 201-221

Human Evaluation of a German Surface Realisation Ranker

  • Aoife Cahill, Institut für Maschinelle Sprachverarbeitung (IMS), University of Stuttgart
  • Martin Forst, Powerset/Microsoft


In this chapter we present a human-based evaluation of surface realisation alternatives. We examine the relative rankings of naturally occurring corpus sentences and automatically generated strings chosen by statistical models (a language model and a log-linear model), as well as the naturalness of the strings chosen by the log-linear model. We also investigate to what extent the preceding context affects the choice. We show that native speakers accept considerable variation in word order, but that there are also clearly factors that make certain realisation alternatives more natural than others. We then examine correlations between native speaker judgements of automatically generated German text and automatic evaluation metrics. We look at a number of metrics from the MT and Summarisation communities and find that, for a relative ranking task, most automatic metrics perform equally well and correlate fairly strongly with the human judgements. In contrast, on a naturalness judgement task, the correlation between the human judgements and the automatic metrics is quite weak; the General Text Matcher (GTM) tool provides the only metric that correlates with the human judgements at a statistically significant level.


Keywords: generation evaluation; surface realisation; human evaluation; German; human judgements; automatic metrics; correlation