Searching for Poor Quality Machine Translated Text: Learning the Difference between Human Writing and Machine Translations

  • Dave Carter
  • Diana Inkpen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7310)

Abstract

As machine translation (MT) tools have become mainstream, machine-translated text has increasingly appeared on multilingual websites. Trustworthy multilingual websites are used as training corpora for statistical machine translation tools; large amounts of MT text in training data may make such products less effective. We performed three experiments to determine whether a support vector machine (SVM) could distinguish machine-translated text from human-written text (both original text and human translations). Machine-translated versions of the Canadian Hansard were detected with an F-measure of 0.999. Machine-translated versions of six Government of Canada websites were detected with an F-measure of 0.98. We validated these results with a decision tree classifier. An experiment to find MT text on Government of Ontario websites using Government of Canada training data was unfruitful, with a high rate of false positives. Machine-translated text appears to be learnable and detectable when the training corpus is similar to the text being classified.
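For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of how such a score is computed from classification counts. The abstract does not state which F-measure variant was used; the sketch assumes the balanced F1 (beta = 1). The function name `f_measure` and the counts in the usage lines are hypothetical illustrations, not figures from the paper.

```python
def f_measure(true_positives: int, false_positives: int,
              false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score for one class (here: "machine translated").

    beta = 1.0 gives the balanced F1 score, which is an assumption;
    the abstract only says "F-measure".
    """
    if true_positives == 0:
        # No correct detections: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts chosen only for illustration (not from the paper):
# 980 MT documents correctly flagged, 20 human documents wrongly flagged,
# 20 MT documents missed.
score = f_measure(true_positives=980, false_positives=20, false_negatives=20)
print(f"{score:.3f}")
```

A high rate of false positives, as in the Government of Ontario experiment, lowers precision and therefore lowers the F-measure even when recall stays high.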

Keywords

Support Vector Machine, Natural Language Processing, Machine Translation, Original Text, Statistical Machine Translation

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Dave Carter (1, 2)
  • Diana Inkpen (1)
  1. School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
  2. Institute for Information Technology, National Research Council Canada, Canada
