Machine Translation, Volume 28, Issue 3, pp 281–308

Data-driven annotation of binary MT quality estimation corpora based on human post-editions

Article

DOI: 10.1007/s10590-014-9162-z

Cite this article as:
Turchi, M., Negri, M. & Federico, M. Machine Translation (2014) 28: 281. doi:10.1007/s10590-014-9162-z

Abstract

Advanced computer-assisted translation (CAT) tools include automatic quality estimation (QE) mechanisms to support post-editors in identifying and selecting useful suggestions. Based on supervised learning techniques, QE relies on high-quality data annotations obtained from expensive manual procedures. However, as the notion of MT quality is inherently subjective, such procedures might result in unreliable or uninformative annotations. To overcome these issues, we propose an automatic method to obtain binary annotated data that explicitly discriminate between useful (suitable for post-editing) and useless suggestions. Our approach is fully data-driven and bypasses the need for explicit human labelling. Experiments with different language pairs and domains demonstrate that it yields better models than those based on the adaptation of the available QE corpora into binary datasets. Furthermore, our analysis suggests that the learned thresholds separating useful from useless translations are significantly lower than those suggested in the existing guidelines for human annotators. Finally, a verification experiment with several translators operating with a CAT tool confirms our empirical findings.

Keywords

Statistical MT; Quality estimation; Productivity; Use of post-editing data

Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Fondazione Bruno Kessler, Povo, Italy
