Evaluating Evaluation Methods for Generation in the Presence of Variation

Stent, Amanda; Marge, Matthew; Singhai, Mohit

doi:10.1007/978-3-540-30586-6_38

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3406)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Recent years have seen increasing interest in automatic metrics for the evaluation of generation systems. When a system can generate syntactic variation, automatic evaluation becomes more difficult. In this paper, we compare the performance of several automatic evaluation metrics using a corpus of automatically generated paraphrases. We show that these evaluation metrics can at least partially measure adequacy (similarity in meaning), but are not good measures of fluency (syntactic correctness). We make several proposals for improving the evaluation of generation systems that produce variation.

Author information

Authors and Affiliations

Stony Brook University, Stony Brook, NY, 11794, USA
Amanda Stent, Matthew Marge & Mohit Singhai

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, México
Alexander Gelbukh

About this paper

Cite this paper

Stent, A., Marge, M., Singhai, M. (2005). Evaluating Evaluation Methods for Generation in the Presence of Variation. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2005. Lecture Notes in Computer Science, vol 3406. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30586-6_38

DOI: https://doi.org/10.1007/978-3-540-30586-6_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-24523-0
Online ISBN: 978-3-540-30586-6
eBook Packages: Computer Science, Computer Science (R0)
