
Machine Translation Evaluation and Optimization

  • Bonnie Dorr
  • Joseph Olive
  • John McCary
  • Caitlin Christianson
Chapter

Abstract

The evaluation of machine translation (MT) systems is a vital field of research, both for determining the effectiveness of existing MT systems and for optimizing their performance. This part describes a range of evaluation approaches used in the GALE community and introduces the evaluation protocols and methodologies used in the program. We discuss the development and use of automatic, human, task-based, and semi-automatic (human-in-the-loop) methods of evaluating machine translation, focusing on the use of human-mediated translation error rate (HTER) as the evaluation standard used in GALE. We discuss the workflow associated with this measure, including post-editing, quality control, and scoring. We document the evaluation tasks, data, protocols, and results of recent GALE MT evaluations. In addition, we present a range of approaches for optimizing MT systems on the basis of different measures. We outline the requirements and specific problems that arise when using different optimization approaches and describe how the characteristics of different MT metrics affect optimization. Finally, we describe recent and ongoing work on the development of fully automatic MT evaluation metrics that have the potential to substantially improve the effectiveness of both evaluation and optimization of MT systems.
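HTER measures the minimum number of word-level edits needed to turn system output into a human post-edited version of that output, normalized by the length of the post-edited reference. The following is a minimal sketch of the underlying edit-rate computation; it ignores the block-shift edits that full TER also counts, and the function names are illustrative, not taken from the GALE tooling:

```python
# Simplified (H)TER sketch: word-level Levenshtein distance between a
# hypothesis and a (post-edited) reference, normalized by reference
# length. Full TER additionally allows block shifts at unit cost.
def edit_distance(hyp, ref):
    """Count insertions, deletions, and substitutions via dynamic programming."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all hypothesis words
    for j in range(n + 1):
        d[0][j] = j  # insert all reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    return d[m][n]

def simplified_ter(hypothesis, reference):
    """Edit rate: edits needed per reference word (lower is better)."""
    hyp, ref = hypothesis.split(), reference.split()
    return edit_distance(hyp, ref) / len(ref)
```

In the HTER workflow described above, `reference` would be the human post-edited version of the system output itself, so the score reflects only the edits a human judged necessary.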

Keywords

Machine Translation · Human Judgment · Word Error Rate · BLEU Score · Reference Translation


Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Bonnie Dorr (1)
  • Joseph Olive (2)
  • John McCary (2)
  • Caitlin Christianson (2)
  1. University of Maryland, College Park, USA
  2. Defense Advanced Research Projects Agency, Arlington, USA
