Data-driven annotation of binary MT quality estimation corpora based on human post-editions

Abstract

Advanced computer-assisted translation (CAT) tools include automatic quality estimation (QE) mechanisms to support post-editors in identifying and selecting useful suggestions. Based on supervised learning techniques, QE relies on high-quality data annotations obtained through expensive manual procedures. However, as the notion of MT quality is inherently subjective, such procedures may result in unreliable or uninformative annotations. To overcome these issues, we propose an automatic method to obtain binary annotated data that explicitly discriminate between useful (suitable for post-editing) and useless suggestions. Our approach is fully data-driven and bypasses the need for explicit human labelling. Experiments with different language pairs and domains demonstrate that it yields better models than those obtained by adapting the available QE corpora into binary datasets. Furthermore, our analysis suggests that the learned thresholds separating useful from useless translations are significantly lower than those suggested in the existing guidelines for human annotators. Finally, a verification experiment with several translators operating with a CAT tool confirms our empirical findings.

Notes

  1. Henceforth we will use the term target to indicate the output of an MT system.

  2. http://www.statmt.org/wmt14/.

  3. Possible editing operations include the insertion, deletion, and substitution of single words, as well as shifts of word sequences (a simplified sketch of counting such edits is given after these notes).

  4. http://www.statmt.org/wmt11/translation-task.html.

  5. Such biases support the idea that labelling translations with quality scores is per se a highly subjective task.

  6. http://www.matecat.com/.

  7. Partitions with thresholds below 2 were also considered, including the most intuitive partition with the cut-off set to 1. However, the resulting number of negative instances, if any, was too small, and the overall dataset too unbalanced, for standard supervised learning methods to be effective.

  8. The partition most closely related to our task (i.e. 1-1-1) was impossible to produce, since none of the examples was labelled with 1 by all the annotators. Even for 1-1-X, the negative class contains only one example. Moreover, the human scores did not allow us to create balanced datasets for comparison.

  9. Note that, since only HTER labels are available for the CAT dataset, only HTER-based partitions could be produced.

  10. This assumption is supported by the fact that reference sentences are, by definition, free translations produced manually, independently and without any influence from the target.

  11. Monolingual stem-to-stem exact matches between TGT and correct_translation are inferred by computing the HTER, as in Blain et al. (2012).

  12. All ROUGE scores, described in Lin (2004), have been calculated using the software available at http://www.berouge.com (a toy sketch of the underlying n-gram recall is given after these notes).

  13. Such partitions are: average effort score = 3, human scores = 3-3-3, HTER = 0.45 for WMT-12, and HTER = 0.3 for CAT.

  14. Each threshold corresponds to the HTER value t that maximizes the number of rewritings above t and the number of post-editions below t. To set t, we computed these counts for 0 < HTER < 1 with a step of 0.001 (see the sketch after these notes).

  15. Hence, independently of the HTER, some instances previously marked as positive examples are now considered negative, and vice versa.

  16. PET is the time required to transform the target into a publishable sentence.

  17. FPR = FP/(FP + TN), where FP and TN stand for the number of false positives and true negatives, respectively.

  18. FDR = 1 - precision = FP/(TP + FP), where TP stands for the number of true positives (Benjamini and Hochberg 1995). A sketch computing both FPR and FDR is given after these notes.

  19. For the WMT dataset, twenty classifiers are trained on partitions based on average effort scores (AES), human scores (HS) and HTER, while one is trained on data resulting from our automatic annotation method. For the CAT dataset, ten classifiers are trained on HTER-based partitions, while one is trained on automatically-labelled data.

  20. For the WMT-12 training data (see Table 1), the distribution of positive/negative instances in the training sets is: 1194/638 for classifier 3 AES, 1095/737 for classifier 0.35 HTER, and 1418/414 for classifier AA. For the CAT data, the distribution is: 470/476 for classifier 0.30 HTER and 494/452 for classifier AA.

  21. ExpertRating: http://www.expertrating.com.

  22. Nothing guarantees that the translations obtained from System-100 are actually good and that those obtained from System-30 are bad. However, the large difference in BLEU score between the two systems is likely to lead to suggestions requiring different amounts of correction.

  23. Even when measured in a controlled lab environment, post-editing time shows high variability due to a myriad of factors that are impossible to control. For this reason, the Forward Search algorithm (Atkinson and Riani 2000; Atkinson et al. 2004) was run to remove possible outliers (e.g. four rewritings and five post-editions in total for Translator 1). For our experiments, we used the FSDA Matlab toolbox (Riani et al. 2012).
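The editing operations mentioned in note 3 can be counted with standard word-level Levenshtein dynamic programming. The sketch below is only illustrative: it omits the shifts of word sequences that TER/HTER additionally allows, and the function names (edit_operations, hter_like) are ours rather than part of the tooling used in the paper.

```python
# Minimal sketch: count word-level insertions, deletions and substitutions
# needed to turn an MT output (target) into its post-edited version.
# Shifts of word sequences, which TER/HTER also permits, are omitted here.

def edit_operations(hyp_words, ref_words):
    """Return (insertions, deletions, substitutions) turning hyp into ref."""
    n, m = len(hyp_words), len(ref_words)
    # cost[i][j] = (total_edits, ins, dele, sub) for hyp_words[:i] -> ref_words[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                      # delete all hyp words
        t, ins, dele, sub = cost[i - 1][0]
        cost[i][0] = (t + 1, ins, dele + 1, sub)
    for j in range(1, m + 1):                      # insert all ref words
        t, ins, dele, sub = cost[0][j - 1]
        cost[0][j] = (t + 1, ins + 1, dele, sub)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, ins, dele, sub = cost[i - 1][j]
            candidates = [(t + 1, ins, dele + 1, sub)]          # deletion
            t, ins, dele, sub = cost[i][j - 1]
            candidates.append((t + 1, ins + 1, dele, sub))      # insertion
            t, ins, dele, sub = cost[i - 1][j - 1]
            if hyp_words[i - 1] == ref_words[j - 1]:
                candidates.append((t, ins, dele, sub))          # match (free)
            else:
                candidates.append((t + 1, ins, dele, sub + 1))  # substitution
            cost[i][j] = min(candidates)
    _, ins, dele, sub = cost[n][m]
    return ins, dele, sub

def hter_like(target, post_edition):
    """Edit operations divided by the length of the post-edition (no shifts)."""
    ins, dele, sub = edit_operations(target.split(), post_edition.split())
    return (ins + dele + sub) / max(len(post_edition.split()), 1)
```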
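With reference to note 12, the snippet below is only a toy illustration of the n-gram recall underlying ROUGE-N as described in Lin (2004); the scores reported in the paper were computed with the package available at http://www.berouge.com, not with this code.

```python
# Toy illustration of ROUGE-N recall (Lin 2004): the fraction of reference
# n-grams that also appear in the candidate, with clipped counts.
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=2):
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())

# Example:
# rouge_n_recall("the cat sat on the mat", "the cat is on the mat", n=2)  # -> 0.6
```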
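The grid search described in note 14 can be pictured as follows. This is a minimal sketch under the assumption that the objective is the sum of the two counts (rewritings above the cut-off plus post-editions below it); the function and variable names are illustrative, not those of the original scripts.

```python
# Scan candidate HTER cut-offs in (0, 1) with a step of 0.001 and keep the value t
# that maximizes: #rewritings with HTER > t  +  #post-editions with HTER < t.

def find_hter_threshold(rewriting_hters, postediting_hters, step=0.001):
    best_t, best_count = 0.0, -1
    n_steps = int(round(1.0 / step))
    for k in range(1, n_steps):                  # t ranges over (0, 1)
        t = k * step
        count = (sum(1 for h in rewriting_hters if h > t)
                 + sum(1 for h in postediting_hters if h < t))
        if count > best_count:
            best_t, best_count = t, count
    return best_t

# Example:
# find_hter_threshold([0.9, 0.8, 0.75], [0.1, 0.2, 0.3])  # -> ~0.301
```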
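Finally, the FPR and FDR definitions in notes 17 and 18 translate directly into code. The helper below is a small illustrative sketch of those two formulas, not part of the evaluation scripts used for the paper.

```python
# False positive rate and false discovery rate from raw confusion counts:
#   FPR = FP / (FP + TN)          FDR = 1 - precision = FP / (TP + FP)

def fpr(fp, tn):
    return fp / (fp + tn) if (fp + tn) else 0.0

def fdr(tp, fp):
    return fp / (tp + fp) if (tp + fp) else 0.0

# Example: TP=40, FP=10, TN=45  ->  fpr(10, 45) ~= 0.18, fdr(40, 10) = 0.2
```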

References

  1. Atkinson AC, Riani M (2000) Robust diagnostic regression analysis. Springer Series in Statistics. Springer, New York

  2. Atkinson AC, Riani M, Cerioli A (2004) Exploring multivariate data with the forward search. Springer Series in Statistics. Springer, New York

  3. Bach N, Huang F, Al-Onaizan Y (2011) Goodness: a method for measuring machine translation confidence. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, Portland, Oregon, USA, pp 211–219

  4. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodological) 57:289–300

  5. Blain F, Schwenk H, Senellart J (2012) Incremental adaptation using translation information and post-editing analysis. In: Proceedings of the international workshop on spoken language translation, Hong-Kong, China, pp 234–241

  6. Blatz J, Fitzgerald E, Foster G, Gandrabur S, Goutte C, Kulesza A, Sanchis A, Ueffing N (2004) Confidence estimation for machine translation. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, pp 315–321

  7. Bojar O, Buck C, Callison-Burch C, Federmann C, Haddow B, Koehn P, Monz C, Post M, Soricut R, Specia L (2013) Findings of the 2013 workshop on statistical machine translation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 1–44

  8. Callison-Burch C, Koehn P, Monz C, Post M, Soricut R, Specia L (2012) Findings of the 2012 workshop on statistical machine translation. In: Proceedings of the seventh workshop on statistical machine translation (WMT-2012), Montréal, Canada, pp 10–51

  9. Carl M, Dragsted B, Elming J, Hardt D, Jakobsen AL (2011) The process of post-editing: a pilot study. In: Proceedings of the 8th international NLPSC workshop. Special theme: Human-machine interaction in translation, Copenhagen, Denmark, pp 131–142

  10. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2(3):27:1–27:27. doi:10.1145/1961189.1961199

  11. Chen CY, Yeh JY, Ke HR (2010) Plagiarism detection using ROUGE and WordNet. J Comput 2(3):34–44

  12. Cohn T, Specia L (2013) Modelling annotator bias with multi-task gaussian processes: an application to machine translation quality estimation. In: Proceedings of the 51st annual meeting of the association for computational linguistics, Sofia, Bulgaria, pp 32–42

  13. Camargo de Souza JG, Turchi M, Negri M (2014) Machine translation quality estimation across domains. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: technical papers, Dublin City University and Association for Computational Linguistics. Dublin, Ireland, pp 409–420, http://www.aclweb.org/anthology/C14-1040

  14. Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7(1):1–26

  15. Federico M, Cattelan A, Trombetti M (2012) Measuring user productivity in machine translation enhanced computer assisted translation. In: Proceedings of the Tenth conference of the association for machine translation in the Americas, San Diego, California

  16. Federico M, Bertoldi N, Cettolo M, Negri M, Turchi M, Trombetti M, Cattelan A, Farina A, Lupinetti D, Martines A, Massidda A, Schwenk H, Barrault L, Blain F, Koehn P, Buck C, Germann U (2014) The MateCat tool. In: Proceedings of COLING 2014, the 25th international conference on computational linguistics: system demonstrations, Dublin City University and Association for Computational Linguistics, Dublin, Ireland, pp 129–132. http://www.aclweb.org/anthology/C14-2028

  17. Garcia I (2011) Translating by post-editing: is it the way forward? Mach Transl 25(3):217–237

  18. Graham Y, Baldwin T, Moffat A, Zobel J (2013) Continuous measurement scales in human evaluation of machine translation. In: Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, Sofia, Bulgaria, pp 33–41

  19. Green S, Heer J, Manning CD (2013) The efficacy of human post-editing for language translation. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, Paris, France, pp 439–448

  20. Guerberof A (2009) Productivity and quality in MT post-editing. In: Proceedings of Machine Translation Summit XII—Workshop: Beyond translation memories: new tools for translators MT, Ottawa, Ontario, Canada

  21. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422

  22. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit X, Phuket, Thailand, pp 79–86

  23. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R et al (2007) Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, Prague, Czech Republic, pp 177–180

  24. Koponen M (2012) Comparing human perceptions of post-editing effort with post-editing operations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 181–190

  25. Koponen M, Aziz W, Ramos L, Specia L (2012) Post-editing time as a measure of cognitive effort. In: Proceedings of the AMTA 2012 workshop on post-editing technology and practice, San Diego, CA, USA

  26. Läubli S, Fishel M, Massey G, Ehrensberger-Dow M, Volk M (2013) Assessing post-editing efficiency in a realistic translation environment. In: Proceedings of Machine Translation Summit XIV Workshop on Post-editing Technology and Practice, Nice, France, pp 83–91

  27. Lesk M (1986) Automated sense disambiguation using machine-readable dictionaries: How to tell a pine cone from an ice cream cone. In: Proceedings of the 5th annual international conference on systems documentation (SIGDOC86), Toronto, Canada, pp 24–26

  28. Lin CY (2004) ROUGE: A package for automatic evaluation of summaries. In: Proceedings of the ACL workshop on text summarization branches out, Barcelona, Spain, pp 74–81

  29. Mehdad Y, Negri M, Federico M (2012) Match without a referee: Evaluating MT adequacy without reference translations. In: Proceedings of the seventh workshop on statistical machine translation, Montréal, Canada, pp 171–180

  30. O’Brien S (2011) Towards predicting post-editing productivity. Mach Transl 25(3):197–215

  31. Papadopoulos H, Proedrou K, Vovk V, Gammerman A (2002) Inductive confidence machines for regression. In: Proceedings of the 13th European conference on machine learning, Helsinki, Finland, pp 345–356

  32. Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, Philadelphia, Pennsylvania, pp 311–318

  33. Porter M (2001) Snowball: a language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html, Accessed 01 Aug 2014

  34. Potet M, Esperança-Rodier E, Besacier L, Blanchon H (2012) Collection of a large database of French-English SMT output corrections. In: Proceedings of the eighth international conference on language resources and evaluation, Istanbul, Turkey, pp 4043–4048

  35. Potthast M, Barrón-Cedeño A, Eiselt A, Stein B, Rosso P (2010) Overview of the 2nd international competition on plagiarism detection. In: Notebook papers of the CLEF 2010 LABs and workshops, Padua, Italy

  36. Press WH, Teukolsky SA, Vetterling WT, Flannery BP (2007) Numerical recipes: the art of scientific computing, 3rd edn. Cambridge University Press, New York

  37. Quirk CB (2004) Training a sentence-level machine translation confidence measure. In: Proceedings of the fourth international conference on language resources and evaluation, pp 825–828

  38. Riani M, Perrotta D, Torti F (2012) FSDA: a MATLAB toolbox for robust analysis and interactive data exploration. Chemom Intell Lab Syst 116:17–32

  39. Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of the association for machine translation in the Americas, Cambridge, Massachusetts, USA, pp 223–231

  40. Soricut R, Echihabi A (2010) TrustRank: inducing trust in automatic translations via ranking. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, pp 612–621

  41. Specia L (2011) Exploiting objective annotations for measuring translation post-editing effort. In: Proceedings of the 15th conference of the European association for machine translation, Leuven, Belgium, pp 73–80

  42. Specia L, Cancedda N, Dymetman M, Turchi M, Cristianini N (2009a) Estimating the sentence-level quality of machine translation systems. In: Proceedings of the 13th annual conference of the European Association for machine translation, Barcelona, Spain, pp 28–35

  43. Specia L, Turchi M, Wang Z, Shawe-Taylor J, Saunders C (2009b) Improving the confidence of machine translation quality estimates. In: Proceedings of machine translation Summit XII, Ottawa, Ontario, Canada

  44. Specia L, Raj D, Turchi M (2010) Machine translation evaluation versus quality estimation. Mach Transl 24(1):39–50

  45. Specia L, Shah K, C de Souza JG, Cohn T (2013) QuEst - a translation quality estimation framework. In: Proceedings of the 51st annual meeting of the association for computational linguistics: system demonstrations, Sofia, Bulgaria, pp 79–84

  46. Steinberger R, Pouliquen B, Widiger A, Ignat C, Erjavec T, Tufis D, Varga D (2006) The JRC-Acquis: a multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058

  47. Turchi M, Negri M, Federico M (2013) Coping with the subjectivity of human judgements in MT quality estimation. In: Proceedings of the eighth workshop on statistical machine translation, Sofia, Bulgaria, pp 240–251

  48. Turchi M, Anastasopoulos A, C de Souza JG, Negri M (2014) Adaptive quality estimation for machine translation. In: Proceedings of the 52nd annual meeting of the association for computational linguistics (Volume 1: Long Papers), Association for Computational Linguistics. Baltimore, Maryland, pp 710–720. http://www.aclweb.org/anthology/P14-1067

  49. Zhechev V (2012) Machine translation infrastructure and post-editing performance at Autodesk. In: AMTA 2012 workshop on post-editing technology and practice, San Diego, USA, pp 87–96

Acknowledgments

This work has been partially supported by the EC-funded project MateCat (ICT-2011.4.2-287688).

Author information

Corresponding author

Correspondence to Marco Turchi.

About this article

Cite this article

Turchi, M., Negri, M. & Federico, M. Data-driven annotation of binary MT quality estimation corpora based on human post-editions. Machine Translation 28, 281–308 (2014). https://doi.org/10.1007/s10590-014-9162-z

Keywords

  • Statistical MT
  • Quality estimation
  • Productivity
  • Use of post-editing data