Machine Translation, Volume 27, Issue 3–4, pp 213–238

Quality estimation for machine translation: some lessons learned

  • Guillaume Wisniewski
  • Anil Kumar Singh
  • François Yvon

Abstract

The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the unreliability of SMT outputs, the quality of which varies greatly. It would therefore be valuable for MT systems to assess their own output translations using automatically derived quality measures. Predicting such quality measures was the goal of a shared task at the Workshop on SMT in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the second part, we reexamine the shared task data and protocol, show that several factors contributed to the difficulty of the task, and discuss alternative evaluation designs.

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Guillaume Wisniewski (1)
  • Anil Kumar Singh (2)
  • François Yvon (1)
  1. LIMSI, Université Paris Sud, Orsay, France
  2. LIMSI, CNRS, Orsay, France
