Matxin, an open-source rule-based machine translation system for Basque

Machine Translation

Abstract

We present the first publicly available machine translation (MT) system for Basque. Because Basque is both morphologically rich and less-resourced, statistical approaches are difficult to apply, which motivates a rule-based architecture that can be combined with statistical techniques in the future. The proposed MT architecture reuses several open-source tools and relies on a single XML format for the flow of data between modules, which eases interaction among the developers of the different tools and resources. The result is Matxin, an open-source rule-based MT toolkit whose first implementation translates from Spanish to Basque. We have carried out innovative work on the following tasks: construction of a dependency analyser for Spanish; use of rich linguistic information to translate prepositions and syntactic functions (such as subject and object markers); construction of an efficient module for verbal chunk transfer; and design and implementation of modules for ordering words and phrases independently of the source language.

Author information

Corresponding author

Correspondence to Aingeru Mayor.

About this article

Cite this article

Mayor, A., Alegria, I., Díaz de Ilarraza, A. et al. Matxin, an open-source rule-based machine translation system for Basque. Machine Translation 25, 53–82 (2011). https://doi.org/10.1007/s10590-011-9092-y

  • DOI: https://doi.org/10.1007/s10590-011-9092-y
