Training, Enhancing, Evaluating and Using MT Systems with Comparable Data

Part of the Theory and Applications of Natural Language Processing book series (NLP)


This chapter describes how semi-parallel and parallel data extracted from comparable corpora can be used to enhance machine translation (MT) systems. It covers the methods used for this task in statistical and rule-based MT systems and the showcases that illustrate the use of such enhanced systems. The impact of data extracted from comparable corpora on MT quality is evaluated for 17 language pairs, and detailed studies involving human evaluation are carried out for 11 language pairs. First, baseline statistical machine translation (SMT) systems were built using traditional SMT techniques. They were then improved by integrating additional data extracted from comparable corpora, and a comparative evaluation was performed to measure the improvements. Comparable corpora were also used to enrich the linguistic knowledge of rule-based machine translation (RBMT) systems by applying terminology extraction technology. Finally, SMT systems were adapted to a narrow domain by including domain-specific knowledge such as terminology, named entities (NEs), and domain-specific language models (LMs).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Leeds, Leeds, UK
  2. Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Saarbrücken, Germany
  3. Tilde, Riga, Latvia
  4. Linguatec, München, Germany
  5. Zemanta, Ljubljana, Slovenia
