Reciprocal Enrichment Between Basque Wikipedia and Machine Translation
In this chapter, we define a collaboration framework that enables Wikipedia editors to generate new articles while helping the development of Machine Translation (MT) systems by providing post-editing logs. This collaboration framework was tested with editors of Basque Wikipedia. Their post-editing of Computer Science articles has been used to improve the output of a Spanish-to-Basque MT system called Matxin. For the collaboration between editors and researchers, we selected a set of 100 articles from the Spanish Wikipedia. These articles were then used as the source texts to be translated into Basque using the MT engine. A group of volunteers from Basque Wikipedia reviewed and corrected the raw MT translations. This collaboration ultimately produced two main benefits: (i) change logs that can potentially help improve the MT engine by means of an automated statistical post-editing system, and (ii) the growth of Basque Wikipedia. The results show that this process can improve the accuracy of a Rule-Based MT (RBMT) system by nearly 10%, benefiting from the post-editing of 50,000 words in the Computer Science domain. We believe that our conclusions can be extended to MT engines involving other less-resourced languages lacking large parallel corpora or frequently updated lexical knowledge, as well as to other domains.
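The post-editing logs described above can be viewed as pairs of raw MT output and human-corrected text, which is the kind of data a statistical post-editing system is trained on. The following Python fragment is only a minimal sketch of that idea. The log format (a list of string pairs), the function names `word_edit_distance` and `build_post_editing_pairs`, and the whitespace tokenization are all assumptions made for illustration, not the format or tooling used in the project. The edit rate computed here is a plain word-level Levenshtein ratio, which is simpler than translation edit rate because it does not model shifts.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a word from hyp
                            curr[j - 1] + 1,      # insert a word from ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def build_post_editing_pairs(
    log: List[Tuple[str, str]],
) -> List[Tuple[str, str, float]]:
    """Turn (raw MT output, post-edited text) records into training pairs.

    Hypothetical log format: each record is a pair of plain strings.
    Each returned triple holds the raw MT sentence, the corrected sentence,
    and the edit rate (edits divided by the post-edited length in words).
    """
    pairs = []
    for mt_output, post_edited in log:
        mt_tokens = mt_output.split()
        pe_tokens = post_edited.split()
        if not mt_tokens or not pe_tokens:
            continue  # skip records with an empty side
        edits = word_edit_distance(mt_tokens, pe_tokens)
        pairs.append((mt_output, post_edited, edits / len(pe_tokens)))
    return pairs
```

In a setup of this kind, the first two elements of each triple would serve as the source and target sides of a monolingual parallel corpus for the post-editing model, while the edit rate gives a rough indication of how much correction each sentence needed.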
Keywords: Machine Translation; Object Constraint Language; Statistical Machine Translation; Sentence Pair; Parallel Corpus
This research was supported in part by the Spanish Ministry of Education and Science (OpenMT2, TIN2009-14675-C03-01) and by the Basque Government (Berbatek project, IE09–262). We are indebted to all the collaborators in the project and especially to the editors of the Basque Wikipedia. Elhuyar and Julen Ruiz helped us to collect resources for the customization of the RBMT engine to the domain of Computer Science.