
The Significance of Recall in Automatic Metrics for MT Evaluation

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 3265)

Abstract

Recent research has shown that a balanced harmonic mean (F1 measure) of unigram precision and recall outperforms the widely used BLEU and NIST metrics for Machine Translation evaluation in terms of correlation with human judgments of translation quality. We show that significantly better correlations can be achieved by placing more weight on recall than on precision. While this may seem unexpected, since BLEU and NIST focus on n-gram precision and disregard recall, our experiments show that correlation with human judgments is highest when almost all of the weight is assigned to recall. We also show that stemming is significantly beneficial not just to simpler unigram precision and recall based metrics, but also to BLEU and NIST.
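The weighted harmonic mean described in the abstract can be sketched as follows. This is a minimal illustration only: the function name, the whitespace tokenization, and the example weight of 0.9 on recall are assumptions for demonstration, not the paper's exact method or parameter settings, and no stemming is applied.

```python
from collections import Counter

def unigram_prf(candidate, reference, alpha=0.9):
    """Unigram precision, recall, and a weighted harmonic mean.

    alpha is the weight placed on recall: alpha=0.5 gives the balanced
    F1 measure, and alpha close to 1 makes the score track recall.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped matches: a candidate token counts at most as many times
    # as it appears in the reference.
    matches = sum(min(n, ref[w]) for w, n in cand.items())
    p = matches / max(sum(cand.values()), 1)
    r = matches / max(sum(ref.values()), 1)
    if p == 0.0 or r == 0.0:
        return p, r, 0.0
    # Weighted harmonic mean: 1 / (alpha/R + (1 - alpha)/P)
    f = (p * r) / (alpha * p + (1 - alpha) * r)
    return p, r, f
```

With alpha = 0.5 the mean reduces to the familiar F1 = 2PR/(P + R); as alpha approaches 1 the score approaches recall alone, which is the regime the abstract reports as correlating best with human judgments.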

Keywords

  • Machine Translation
  • Human Judgment
  • Word Error Rate
  • Sentence Level
  • Bleu Score



References

  1. Papineni, K., Roukos, S., Ward, T., Zhu, W.-J.: BLEU: a Method for Automatic Evaluation of Machine Translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318 (July 2002)

  2. Doddington, G.: Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics. In: Proceedings of the Second Conference on Human Language Technology (HLT 2002), San Diego, CA, pp. 128–132 (2002)

  3. Su, K.-Y., Wu, M.-W., Chang, J.-S.: A New Quantitative Quality Measure for Machine Translation Systems. In: Proceedings of the Fourteenth International Conference on Computational Linguistics (COLING 1992), Nantes, France, pp. 433–439 (1992)

  4. Akiba, Y., Imamura, K., Sumita, E.: Using Multiple Edit Distances to Automatically Rank Machine Translation Output. In: Proceedings of MT Summit VIII, Santiago de Compostela, Spain, pp. 15–20 (2001)

  5. Niessen, S., Och, F.J., Leusch, G., Ney, H.: An Evaluation Tool for Machine Translation: Fast Evaluation for Machine Translation Research. In: Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, pp. 39–45 (2000)

  6. Leusch, G., Ueffing, N., Ney, H.: String-to-String Distance Measure with Applications to Machine Translation Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, pp. 240–247 (September 2003)

  7. Melamed, I.D., Green, R., Turian, J.P.: Precision and Recall of Machine Translation. In: Proceedings of HLT-NAACL 2003, Short Papers, Edmonton, Canada, pp. 61–63 (May 2003)

  8. Turian, J.P., Shen, L., Melamed, I.D.: Evaluation of Machine Translation and its Evaluation. In: Proceedings of MT Summit IX, New Orleans, LA, pp. 386–393 (September 2003)

  9. Lin, C.-Y., Hovy, E.: Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, pp. 71–78 (May 2003)

  10. van Rijsbergen, C.J.: Information Retrieval, 2nd edn. Butterworths, London (1979)

  11. Coughlin, D.: Correlating Automated and Human Assessments of Machine Translation Quality. In: Proceedings of MT Summit IX, New Orleans, LA, pp. 63–70 (September 2003)

  12. Efron, B., Tibshirani, R.: Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science 1(1), 54–77 (1986)

  13. Doddington, G.: Automatic Evaluation of Language Translation using N-gram Co-occurrence Statistics. Presentation at the DARPA/TIDES 2003 MT Workshop, NIST, Gaithersburg, MD (July 2003)

  14. Pang, B., Knight, K., Marcu, D.: Syntax-based Alignment of Multiple Translations: Extracting Paraphrases and Generating New Sentences. In: Proceedings of HLT-NAACL 2003, Edmonton, Canada, pp. 102–109 (May 2003)



Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lavie, A., Sagae, K., Jayaraman, S. (2004). The Significance of Recall in Automatic Metrics for MT Evaluation. In: Frederking, R.E., Taylor, K.B. (eds) Machine Translation: From Real Users to Research. AMTA 2004. Lecture Notes in Computer Science, vol 3265. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30194-3_16


  • DOI: https://doi.org/10.1007/978-3-540-30194-3_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23300-8

  • Online ISBN: 978-3-540-30194-3

  • eBook Packages: Springer Book Archive