Evaluating Evaluation Methods for Generation in the Presence of Variation

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2005)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Abstract

Recent years have seen increasing interest in automatic metrics for the evaluation of generation systems. When a system can generate syntactic variation, automatic evaluation becomes more difficult. In this paper, we compare the performance of several automatic evaluation metrics using a corpus of automatically generated paraphrases. We show that these evaluation metrics can at least partially measure adequacy (similarity in meaning), but are not good measures of fluency (syntactic correctness). We make several proposals for improving the evaluation of generation systems that produce variation.
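
The gap between adequacy and fluency can be illustrated with a toy example. The sketch below is not the paper's experiment: it assumes a simple clipped n-gram precision, in the style of n-gram overlap metrics from machine translation evaluation, and the abstract does not name the metrics actually compared. The function names and all three sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against a single reference.

    Hypothetical helper for illustration; not a metric taken from the paper.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference (Counter returns 0 for unseen n-grams).
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Invented sentences: a fluent paraphrase with different syntax, and an
# ungrammatical reordering that reuses exactly the reference words.
reference = "the committee approved the proposal yesterday"
paraphrase = "yesterday the proposal was approved by the committee"
scrambled = "proposal the yesterday approved committee the"

for name, sentence in [("paraphrase", paraphrase), ("scrambled", scrambled)]:
    scores = [round(ngram_precision(sentence, reference, n), 2) for n in (1, 2)]
    print(name, scores)
```

On these sentences the fluent paraphrase scores 0.75 on unigrams and 0.29 on bigrams, while the ungrammatical reordering scores 1.0 on unigrams and 0.0 on bigrams. Word overlap therefore says something about shared content but nothing about syntactic correctness, and higher-order n-grams penalize a legitimate syntactic variant as well as a broken one.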

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stent, A., Marge, M., Singhai, M. (2005). Evaluating Evaluation Methods for Generation in the Presence of Variation. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_38

  • DOI: https://doi.org/10.1007/978-3-540-30586-6_38

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24523-0

  • Online ISBN: 978-3-540-30586-6

  • eBook Packages: Computer Science, Computer Science (R0)
