Principles of Context-Based Machine Translation Evaluation

Abstract

This article defines a Framework for Machine Translation Evaluation (FEMTI) which relates the quality model used to evaluate a machine translation system to the purpose and context of the system. Our proposal attempts to put together, into a coherent picture, previous attempts to structure a domain characterised by overall complexity and local difficulties. In this article, we first summarise these attempts, then present an overview of the ISO/IEC guidelines for software evaluation (ISO/IEC 9126 and ISO/IEC 14598). As an application of these guidelines to machine translation software, we introduce FEMTI, a framework made of two interrelated classifications or taxonomies. The first classification enables evaluators to define an intended context of use, while its links to the second classification generate a quality model (quality characteristics and metrics) relevant to that context. The second classification provides definitions of the various metrics used by the community. Further on, as part of ongoing, long-term research, we explain how metrics are analysed, first from the general point of view of “meta-evaluation”, then focusing on examples. Finally, we show how consensus on the present framework is sought, and how feedback from the community is taken into account in the FEMTI life-cycle.
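
The mechanism the abstract describes, a context-of-use taxonomy whose links into a second taxonomy select the relevant quality characteristics and metrics, can be pictured as a small data structure. The following Python sketch is purely illustrative: the node names, links and metrics are invented for the example and are not the actual FEMTI taxonomies.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    """A concrete measurement procedure for a quality characteristic."""
    name: str
    procedure: str

@dataclass
class QualityCharacteristic:
    """Node in the second taxonomy: a quality characteristic with its metrics."""
    name: str
    metrics: list[Metric] = field(default_factory=list)

@dataclass
class ContextCharacteristic:
    """Node in the first taxonomy: one aspect of the intended context of use,
    linked to the quality characteristics it makes relevant."""
    name: str
    links: list[QualityCharacteristic] = field(default_factory=list)

def quality_model(context: list[ContextCharacteristic]) -> dict[str, list[str]]:
    """Follow the links from a declared context of use to obtain a quality
    model, i.e. the relevant characteristics and their candidate metrics."""
    model: dict[str, list[str]] = {}
    for ctx in context:
        for qc in ctx.links:
            model.setdefault(qc.name, []).extend(m.name for m in qc.metrics)
    return model

# Hypothetical context: translations used for gisting, where fidelity
# matters more than polished style.
fidelity = QualityCharacteristic(
    "fidelity", [Metric("adequacy rating", "human judgement on a 1-5 scale")])
intelligibility = QualityCharacteristic(
    "intelligibility", [Metric("fluency rating", "human judgement on a 1-5 scale")])
gisting = ContextCharacteristic("assimilation of foreign-language text",
                                [fidelity, intelligibility])

print(quality_model([gisting]))
# {'fidelity': ['adequacy rating'], 'intelligibility': ['fluency rating']}
```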

References

  • ALPAC: 1966, Language and Machines: Computers in Translation and Linguistics, A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Research Council, Washington, DC: National Academy of Sciences. [Available at: http://www.nap.edu/openbook/ARC000005/html/]

  • Arnold, Doug, R. Lee Humphreys and Louisa Sadler (eds): 1993, Special Issue on Evaluation of MT Systems, Machine Translation 8.1–2.

  • Blasband, Marc: 1999, ‘Practice of Validation: The ARISE Application of the Eagles Framework’, in Proceedings of the European Evaluation of Language Systems Conference (EELS), Hoevelaken, The Netherlands. [Available at: http://www.computeer.nl/eels.htm]

  • Canelli, Maria, Daniele Grasso and Margaret King: 2000, ‘Methods and Metrics for the Evaluation of Dictation Systems: A Case Study’, in LREC 2000: Second International Conference on Language Resources and Evaluation, Athens, pp. 1325–1331.

  • Church, Kenneth W. and Eduard H. Hovy: 1993, ‘Good Applications for Crummy Machine Translation’, Machine Translation 8, 239–258.

  • Crook, M. and H. Bishop: 1965, Evaluation of Machine Translation, Final report, Institute for Psychological Research, Tufts University, Medford, MA.

  • Daly-Jones, Owen, Nigel Bevan and Cathy Thomas: 1999, Handbook of User-Centred Design, Deliverable 6.2.1, INUSE European Project IE-2016. [Available at: http://www.ejeisa.com/nectar/inuse/]

  • Dostert, B. H.: 1973, User's Evaluation of Machine Translation: Georgetown MT System, 1963-1973, Report RADC-TR-73-239, Rome Air Development Center, Griffiss Air Force Base, NY, and Report AD-768-451, Texas A&M University, College Station, TX.

  • Eagles-EWG: 1996, Eagles Evaluation of Natural Language Processing Systems, Final Report EAG-EWG-PR.2, Project LRE-61-100, Center for Sprogteknologi, Copenhagen, Denmark. [Available at: http://www.issco.unige.ch/projects/ewg96/]

  • Eagles-EWG: 1999, Eagles Evaluation of Natural Language Processing Systems, Final Report EAG-II-EWG-PR.2, Project LRE-61-100, Center for Sprogteknologi, Copenhagen, Denmark. [Available at: http://www.issco.unige.ch/projects/eagles/]

  • Flanagan, Mary A.: 1994, ‘Error Classification for MT Evaluation’, in Technology Partnerships for Crossing the Language Barrier: Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, pp. 65–72.

  • Forner, Monika and John S. White: 2001, ‘Predicting MT Fidelity from Noun-Compound Handling’, in MT Summit VIII Workshop on MT Evaluation “Who did what to whom?”, Santiago de Compostela, Spain, pp. 45–48.

  • Fuji, M., N. Hatanaka, E. Ito, S. Kamei, H. Kumai, T. Sukehiro and H. Isahara: 2001, ‘Evaluation Method for Determining Groups of Users Who Find MT “Useful”’, in MT Summit VIII: Machine Translation in the Information Age, Santiago de Compostela, Spain, pp. 103–108.

  • Halliday, M. A. K. and R. Hasan: 1976, Cohesion in English, London: Longman.

  • Halliday, T. and E. Briss: 1977, The Evaluation and Systems Analysis of the Systran Machine Translation System, Report RADC-TR-76-399, Rome Air Development Center, Griffiss Air Force Base, NY.

  • Hovy, Eduard H.: 1999, ‘Toward Finely Differentiated Evaluation Metrics for Machine Translation’, in Proceedings of EAGLES Workshop on Standards and Evaluation, Pisa, Italy.

  • Hovy, Eduard H., Margaret King and Andrei Popescu-Belis: 2002, ‘Computer-Aided Specification of Quality Models for MT Evaluation’, in LREC 2002: Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, pp. 1239–1246.

  • Infoshop: 1999, Language Translations: World Market Overview, Current Developments and Competitive Assessment, Kawasaki, Japan: Infoshop Japan/Global Information Inc.

  • ISO/IEC: 1991, ISO/IEC 9126:1991 (E) - Information Technology - Software Product Evaluation - Quality Characteristics and Guidelines for Their Use, Geneva: International Organization for Standardization & International Electrotechnical Commission, December, 1991.

  • ISO/IEC: 1998, ISO/IEC 14598-5:1998 (E) - Software engineering - Product evaluation - Part 5: Process for evaluators, Geneva: International Organization for Standardization & International Electrotechnical Commission, July, 1998.

  • ISO/IEC: 1999a, ISO/IEC 14598-1:1999 (E) - Information technology - Software product evaluation - Part 1: General overview, Geneva: International Organization for Standardization & International Electrotechnical Commission, April, 1999.

  • ISO/IEC: 1999b, ISO/IEC 14598-4:1999 (E) - Software engineering - Product evaluation - Part 4: Process for acquirers, Geneva: International Organization for Standardization & International Electrotechnical Commission, October, 1999.

  • ISO/IEC: 2000a, ISO/IEC 14598-2:2000 (E) - Software engineering - Product evaluation -Part 2: Planning and management, Geneva: International Organization for Standardization & International Electrotechnical Commission, February, 2000.

  • ISO/IEC: 2000b, ISO/IEC 14598-3:2000 (E) - Software engineering - Product evaluation - Part 3: Process for developers, Geneva: International Organization for Standardization & International Electrotechnical Commission, February, 2000.

  • ISO/IEC: 2001a, ISO/IEC 14598-6:2001 (E) - Software engineering - Product evaluation - Part 6: Documentation of evaluation modules, Geneva: International Organization for Standardization & International Electrotechnical Commission, June, 2001.

  • ISO/IEC: 2001b, ISO/IEC 9126-1:2001 (E) - Software engineering - Product quality - Part 1: Quality model, Geneva: International Organization for Standardization & International Electrotechnical Commission, June, 2001.

  • Kay, Martin: 1980, The Proper Place of Men and Machines in Language Translation, Research Report CSL-80-11, Xerox PARC, Palo Alto, CA; repr. in Machine Translation 12 (1997), 3–23.

  • King, Margaret and Kirsten Falkedal: 1990, ‘Using Test Suites in Evaluation of Machine Translation Systems’, in COLING-90: Papers presented to the 13th International Conference on Computational Linguistics, Helsinki, vol. 2, pp. 211–216.

  • Leavitt, A., J. Gates and S. Shannon: 1971, Machine Translation Quality and Production Process Evaluation, Report RADC-TR-71-206, Rome Air Development Center, Griffiss Air Force Base, NY.

  • Lehrberger, John and Laurent Bourbeau: 1988, Machine Translation: Linguistic Characteristics of MT Systems and General Methodology of Evaluation, Amsterdam: John Benjamins Press.

  • Mann, William C. and Sandra A. Thompson: 1988, ‘Rhetorical Structure Theory: A Theory of Text Organization’, Text 8, 243–281.

  • Mason, Jane and Adriane Rinsche: 1995, Translation Technology Products, London: OVUM Ltd.

  • Miller, Keith J. and Michelle Vanni: 2001, ‘Scaling the ISLE Taxonomy: Development of Metrics for the Multi-Dimensional Characterisation of MT Quality’, in MT Summit VIII: Machine Translation in the Information Age, Santiago de Compostela, Spain, pp. 229–234.

  • Morris, J. and G. Hirst: 1991, ‘Lexical Cohesion, the Thesaurus, and the Structure of Text’, Computational Linguistics 17, 21–48.

  • Nagao, Makoto: 1980, A Japanese View on Machine Translation in Light of the Considerations and Recommendations Reported by ALPAC, USA, Tokyo: Japan Electronic Industry Development Association (JEIDA).

  • Niessen, Sonja, Franz Josef Och, Gregor Leusch and Hermann Ney: 2000, ‘An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research’, in LREC 2000: Second International Conference on Language Resources and Evaluation, Athens, Greece, pp. 39–45.

  • Nomura, Hirosato: 1992, JEIDA Methodology and Criteria on Machine Translation Evaluation, Tokyo: Japan Electronic Industry Development Association (JEIDA).

  • Nomura, Hirosato and Hitoshi Isahara: 1992, ‘Evaluation Surveys: The JEIDA Methodology and Survey’, in MT Evaluation: Basis for Future Directions, Proceedings of a workshop sponsored by the National Science Foundation, San Diego, CA, pp. 11–12.

  • Orr, D. and V. Small: 1967, ‘Comprehensibility of Machine-Aided Translations of Russian Scientific Documents’, Mechanical Translation and Computational Linguistics 10, 1–10.

  • Papineni, Kishore, Salim Roukos, Todd Ward and Wei-Jing Zhu: 2001, BLEU: a Method for Automatic Evaluation of Machine Translation, Computer Science Research Report RC22176 (W0109-022), IBM Research Division, T. J. Watson Research Center, Yorktown Heights, NY. [Available at: http://domino.watson.ibm.com/library/Cyberdig.nsf/home]

  • Pfafflin, S.: 1965, ‘Evaluation of Machine Translations by Reading Comprehension Tests and Subjective Judgments’, Mechanical Translation 8, 2–8.

  • Popescu-Belis, Andrei: 1999a, ‘Evaluation of Natural Language Processing Systems: a Model for Coherence Verification of Quality Measures’, in Marc Blasband and Patrick Paroubek (eds), A Blueprint for a General Infrastructure for Natural Language Processing Systems Evaluation Using Semi-Automatic Quantitative Black Box Approach in a Multilingual Environment, Deliverable D1.1, Project LE-4-8340, LIMSI-CNRS, Orsay, France.

  • Popescu-Belis, Andrei: 1999b, ‘L'évaluation en Génie Linguistique: Un Modèle pour Vérifier la Cohérence des Mesures’ [Evaluation in Language Engineering: A Model for Coherence Verification of Measures], Langues 2, 151–162.

  • Popescu-Belis, Andrei, Sandra Manzi and Margaret King: 2001, ‘Towards a Two-stage Taxonomy for Machine Translation Evaluation’, in MT Summit VIII Workshop on MT Evaluation “Who did what to whom?”, Santiago de Compostela, Spain, pp. 1–8.

  • Rajman, Martin and Anthony Hartley: 2002, ‘Automatic Ranking of MT Systems’, in LREC 2002: Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, pp. 1247–1253.

  • Sinaiko, H. W.: 1979, ‘Measurement of Usefulness by Performance Test’, in van Slype (1979), pp. 91ff.

  • Sparck Jones, Karen and Julia Rose Galliers: 1993, Evaluating Natural Language Processing Systems, Technical Report 291, University of Cambridge Computer Laboratory.

  • Sparck Jones, Karen and Julia Rose Galliers: 1996, Evaluating Natural Language Processing Systems: An Analysis and Review, Berlin: Springer-Verlag.

  • Taylor, Kathryn B. and John S. White: 1998, ‘Predicting What MT is Good for: User Judgements and Task Performance’, in David Farwell, Laurie Gerber and Eduard H. Hovy (eds), Machine Translation and the Information Soup, Berlin: Springer-Verlag, pp. 364–373.

  • TEMAA: 1996, TEMAA Final Report, LRE-62-070, Center for Sprogteknologi, Copenhagen, Denmark. [Available at: http://cst.dk/temaa/D16/d16exp.html]

  • Thompson, Henry S. (ed.): 1992, Proceedings of the Workshop on The Strategic Role of Evaluation in Natural Language Processing and Speech Technology, HCRC, University of Edinburgh.

  • Tomita, Masaru: 1992, ‘Application of the TOEFL Test to the Evaluation of English-Japanese MT’, in MT Evaluation: Basis for Future Directions, Proceedings of a workshop sponsored by the National Science Foundation, San Diego, CA, p. 59.

  • Vanni, Michelle and Keith J. Miller: 2001, ‘Scoring Methods for Multi-Dimensional Measurement of Machine Translation Quality’, in MT Summit VIII Workshop on MT Evaluation “Who did what to whom?”, Santiago de Compostela, Spain, pp. 21–28.

  • Vanni, Michelle and Keith J. Miller: 2002, ‘Scaling the ISLE Framework: Use of Existing Corpus Resources for Validation of MT Metrics across Languages’, in LREC 2002: Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, pp. 1254–1262.

  • van Slype, Georges: 1979, Critical Study of Methods for Evaluating the Quality of Machine Translation, Final report BR 19142, Brussels: Bureau Marcel van Dijk. [Available at: http://issco-www.unige.ch/projects/isle/van-slype.pdf]

  • Vasconcellos, Muriel (ed.): 1992, MT Evaluation: Basis for Future Directions, Proceedings of a workshop sponsored by the National Science Foundation, San Diego, CA.

  • Vauquois, Bernard: 1979, ‘Measurement of Intelligibility of Sentences on Two Scales’, in van Slype (1979), pp. 71ff.

  • White, John S.: 2001, ‘Predicting Intelligibility from Fidelity in MT Evaluation’, in MT Summit VIII Workshop on MT Evaluation “Who did what to whom?”, Santiago de Compostela, Spain, pp. 35–38.

  • White, John S. and Theresa A. O'Connell: 1994a, ARPA Workshops on Machine Translation: A Series of Four Workshops on Comparative Evaluation, 1992-1994, McLean, VA: Litton PRC Inc.

  • White, John S. and Theresa A. O'Connell: 1994b, ‘The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches’, in Technology Partnerships for Crossing the Language Barrier: Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, pp. 193–205.

  • White, John S. and Kathryn B. Taylor: 1998, ‘A Task-Oriented Evaluation Metric for Machine Translation’, in LREC 1998: First International Conference on Language Resources and Evaluation, Granada, Spain, pp. 21–25.

Cite this article

Hovy, E., King, M. & Popescu-Belis, A. Principles of Context-Based Machine Translation Evaluation. Machine Translation 17, 43–75 (2002). https://doi.org/10.1023/A:1025510524115
