# Quality estimation for machine translation: some lessons learned

## Abstract

The dissemination of statistical machine translation (SMT) systems in the professional translation industry is still limited by the unreliability of SMT outputs, whose quality varies widely. It would therefore be valuable for MT systems to assess their own output translations with automatically derived quality measures. Predicting such quality measures was the goal of a shared task at the Workshop on Statistical Machine Translation in 2012. In this contribution, we first report our results for this shared task, detailing the features that we found to be the most predictive of quality. In the second part, we reexamine the shared task data and protocol, show that several factors contributed to the difficulty of the task, and discuss alternative evaluation designs.

## Keywords

Machine Translation, Ridge Regression, Quality Estimation, Mean Absolute Error, Target Sentence

## Notes

### Acknowledgments

The authors wish to thank LE Hai Son for helping us with the SOUL language model. This work was partially funded by the French National Research Agency under project ANR-CONTINT-TRACE.
