Scaling an Irish FST Morphology Engine for Use on Unrestricted Text

  • Elaine Uí Dhonnchadha
  • Josef Van Genabith
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4002)


This paper details the steps involved in scaling up a lexicalised finite-state morphology transducer for use on unrestricted text. Our starting point was a baseline inflectional morphology engine [1], with 81% token coverage measured against a 15-million-word corpus of Irish texts [2]. Manually scaling the FST lexicon component of a morphology transducer is time-consuming, expensive and rarely, if ever, complete. To scale up the engine we used a combination of strategies: semi-automatic population of the finite-state lexicon from machine-readable dictionary resources and from printed resources using optical character recognition, the addition of derivational morphology, and the development of morphological guessers. We report the coverage increase contributed by each step. The full system achieves token coverage of 93%, which is extended to 100% through the use of morphological guessers.
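The token coverage figures above can be read as the share of running tokens in a corpus that receive at least one analysis. The following is a minimal sketch of that measurement, not the authors' implementation: the toy lexicon, the token list, and the function names `token_coverage`, `in_lexicon` and `with_guesser` are all hypothetical, and the "guesser" here is only a stand-in that accepts any alphabetic token, whereas the guessers in the paper are finite-state transducers.

```python
from collections import Counter


def token_coverage(tokens, recognise):
    """Return the fraction of running tokens for which recognise(token) is True."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Count every occurrence of each recognised token type.
    covered = sum(n for tok, n in counts.items() if recognise(tok))
    return covered / len(tokens)


# Hypothetical toy lexicon standing in for the finite-state lexicon.
LEXICON = {"tá", "an", "fear", "agus", "madra"}


def in_lexicon(token):
    """Lexicon-only recogniser: the token must be a known form."""
    return token.lower() in LEXICON


def with_guesser(token):
    """Lexicon plus a stand-in guesser that accepts any alphabetic token."""
    return in_lexicon(token) or token.isalpha()


# Illustrative token list only; not drawn from the corpus used in the paper.
tokens = ["Tá", "an", "fear", "agus", "an", "madra", "anseo", "inniu"]

lexicon_only = token_coverage(tokens, in_lexicon)
with_fallback = token_coverage(tokens, with_guesser)
print(f"lexicon only: {lexicon_only:.2f}, with guesser: {with_fallback:.2f}")
```

Counting over running tokens rather than distinct word types matches the "token coverage" wording of the abstract: frequent function words weigh more than rare forms.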


Recognition Rate · Lexical Item · Optical Character Recognition · Proper Noun · Test Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Uí Dhonnchadha, E.: An analyser and generator for Irish inflectional morphology using finite state transducers. Master’s thesis, School of Computing, Dublin City University, Dublin, Ireland (2002)
  2. ITÉ (accessed, November 2005),
  3. Beesley, K.R., Karttunen, L.: Finite State Morphology. CSLI Studies in Computational Linguistics. CSLI Publications (2003)
  4. Karttunen, L., Beesley, K.R.: Two-level rule compiler. Technical report, Xerox PARC (1992)
  5. Oideachais, A.R.: Foclóir Póca English-Irish/Irish-English Dictionary. An Gúm, Baile Átha Cliath (1986)
  6. Symbols (accessed, November 2005),
  7. Uí Dhonnchadha, E., Nic Pháidín, C., Van Genabith, J.: Design, implementation and evaluation of an inflectional morphology finite-state transducer for Irish. MT - Machine Translation: Special Issue on Finite State Language Resources and Language Processing (in press)
  8. Críostaí, B.: Graiméar Gaeilge na mBráithre Críostaí. An Gúm, Baile Átha Cliath (1999)
  9. Ó Dónaill, N.: Foclóir Gaeilge Béarla. Oifig an tSoláthair, Baile Átha Cliath (1977)
  10. Ó Droighneáin, M.: An Sloinnteoir Gaeilge agus an tAinmneoir. Coiscéim, Baile Átha Cliath (1991)
  11. Ó Siochfhrada, N.: Foclóir Gaeilge/Béarla - Béarla/Gaeilge. An Comhlacht Oideachais, Baile Átha Cliath (1998)
  12. Grefenstette, G., Schiller, A., Ait-Mokhtar, S.: Recognizing lexical patterns in text. In: van Eynde, F., Gibbon, D. (eds.) Lexicon Development for Speech and Language Processing. Kluwer Academic Publishers, Dordrecht (2000)
  13. Kilgarriff, A., Rundell, M., Uí Dhonnchadha, E.: Efficient corpus creation for lexicography. Language Resources and Evaluation Journal (forthcoming)

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Elaine Uí Dhonnchadha (1, 2)
  • Josef Van Genabith (1)
  1. National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland
  2. Centre for Language and Communication Studies, Trinity College Dublin, Ireland
