Machine Translation, 25:53

Matxin, an open-source rule-based machine translation system for Basque

  • Aingeru Mayor (Email author)
  • Iñaki Alegria
  • Arantza Díaz de Ilarraza
  • Gorka Labaka
  • Mikel Lersundi
  • Kepa Sarasola
Abstract

We present the first publicly available machine translation (MT) system for Basque. The fact that Basque is both a morphologically rich and less-resourced language makes the use of statistical approaches difficult, and raises the need to develop a rule-based architecture which can be combined in the future with statistical techniques. The MT architecture proposed reuses several open-source tools and is based on a unique XML format to facilitate the flow between the different modules, which eases the interaction among different developers of tools and resources. The result is the rule-based Matxin MT system, an open-source toolkit, whose first implementation translates from Spanish to Basque. We have performed innovative work on the following tasks: construction of a dependency analyser for Spanish, use of rich linguistic information to translate prepositions and syntactic functions (such as subject and object markers), construction of an efficient module for verbal chunk transfer, and design and implementation of modules for ordering words and phrases, independently of the source language.

Keywords

Rule-based machine translation, Reusability, Open-source, Lesser-used languages

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  • Aingeru Mayor (1), Email author
  • Iñaki Alegria (1)
  • Arantza Díaz de Ilarraza (1)
  • Gorka Labaka (1)
  • Mikel Lersundi (1)
  • Kepa Sarasola (1)

  1. IXA Group, University of the Basque Country, Donostia, Basque Country