Machine Translation, Volume 20, Issue 4, pp 267–289

Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers

  • Oliver Streiter
  • Kevin P. Scannell
  • Mathias Stuflesser
Original Paper

Abstract

This research begins by distinguishing a small number of “central” languages from the “noncentral languages”, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted. We establish a number of important differences which have far-reaching consequences for NCLPs. In order to overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools and argue that NCLPs, in their own interests, should embrace an open-source approach for the resources they develop and pool these resources together with other similar open-source resources. The expected advantages of this approach are so important that we suggest that funding organizations make it a sine qua non condition of project contracts.

Keywords

Minority languages · Open-source · Free software · Software pools

Copyright information

© Springer Science+Business Media B.V. 2007

Authors and Affiliations

  • Oliver Streiter (1)
  • Kevin P. Scannell (2)
  • Mathias Stuflesser (3)

  1. Department of Western Languages and Literature, National University of Kaohsiung, Kaohsiung, Taiwan, ROC
  2. Department of Mathematics and Computer Science, Saint Louis University, Saint Louis, USA
  3. Institute for Specialised Communication and Multilingualism, European Academy Bozen/Bolzano, Bolzano, Italy
