Abstract

Language software applications encounter new words, e.g., acronyms, technical terminology, names, or compounds of such words. To add a new word to a lexicon, we need to specify its inflectional paradigm. We present a new, generally applicable method for creating an entry generator, i.e., a paradigm guesser, for finite-state transducer lexicons. As a guesser tends to produce numerous suggestions, it is important that the correct suggestions be among the first few candidates. We prove some formal properties of the method and evaluate it on Finnish, English and Swedish full-scale transducer lexicons. We use the open-source Helsinki Finite-State Technology [1] to create finite-state transducer lexicons from existing lexical resources and automatically derive guessers for unknown words. The method has a recall of 82–87% and a precision of 71–76% for the three test languages. The model needs no external corpus and can therefore serve as a baseline.
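
To make the idea of a ranked paradigm guesser concrete, the following is a minimal sketch and not the finite-state method of the paper: it swaps in a plain longest-suffix lookup over a toy lexicon. The function names, the paradigm labels (`N-s`, `N-es`, `V-reg`), the toy entries, and the `max_suffix` cutoff are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_suffix_guesser(lexicon, max_suffix=4):
    """Count how often each word-final suffix co-occurs with each paradigm."""
    counts = defaultdict(Counter)
    for word, paradigm in lexicon:
        for n in range(1, min(max_suffix, len(word)) + 1):
            counts[word[-n:]][paradigm] += 1
    return counts


def guess_paradigms(word, counts, max_suffix=4):
    """Return candidate paradigms for an unknown word, best first.

    Longer matching suffixes outrank shorter ones; within one suffix
    length, more frequent paradigms come first.
    """
    ranked = []
    for n in range(min(max_suffix, len(word)), 0, -1):
        for paradigm, _ in counts.get(word[-n:], Counter()).most_common():
            if paradigm not in ranked:
                ranked.append(paradigm)
    return ranked


# Hypothetical toy lexicon of (base form, paradigm label) pairs.
TOY_LEXICON = [
    ("church", "N-es"),
    ("match", "N-es"),
    ("cat", "N-s"),
    ("hat", "N-s"),
    ("walk", "V-reg"),
    ("talk", "V-reg"),
    ("chalk", "N-s"),
]

counts = train_suffix_guesser(TOY_LEXICON)
print(guess_paradigms("stalk", counts))  # ['V-reg', 'N-s']
print(guess_paradigms("batch", counts))  # ['N-es']
```

The ordering of the returned list is what matters for the claim that correct suggestions should be among the first few candidates: a user confirming a new lexicon entry would only need to inspect the head of the list.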

References

  1.
  2. Mikheev, A.: Unsupervised Learning of Word-Category Guessing Rules. In: Proc. of the 34th Annual Meeting of the Association for Computational Linguistics (ACL 1996), pp. 327–334 (1996)
  3. Oflazer, K., Nirenburg, S., McShane, M.: Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning. Comp. Ling. 27(1), 59–85 (2001)
  4. Wicentowski, R.: Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD Thesis, Baltimore, USA (2002)
  5. Goldsmith, J.A.: Morphological Analogy: Only a Beginning (2007), http://hum.uchicago.edu/~jagoldsm/Papers/analogy.pdf
  6. Creutz, M., Hirsimäki, T., Kurimo, M., Puurula, A., Pylkkönen, J., Siivola, V., Varjokallio, M., Arisoy, E., Saraçlar, M., Stolcke, A.: Morph-based speech recognition and modeling of out-of-vocabulary words across languages. ACM Trans. on Speech and Lang. Proc. 5(1), art. 3 (2007)
  7. Kurimo, M., Creutz, M., Turunen, V.: Overview of Morpho Challenge in CLEF 2007. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 19–21. Springer, Heidelberg (2008)
  8. Lindén, K.: A probabilistic model for guessing base forms of new words by analogy. In: Gelbukh, A. (ed.) CICLing 2008. LNCS, vol. 4919, pp. 106–116. Springer, Heidelberg (2008)
  9. Kuenning, G.: Dictionaries for International Ispell (2007), http://www.lasr.cs.ucla.edu/geoff/ispell-dictionaries.html
  10.
  11. Koskenniemi, K.: Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Department of General Linguistics, University of Helsinki, Publication No. 11 (1983)
  12. Karlsson, F.: SWETWOL: A Comprehensive Morphological Analyser for Swedish. Nordic Journal of Linguistics 15(1), 1–45 (1992)
  13.
  14. FreeLing 2.1–An Open Source Suite of Language Analyzers, http://garraf.epsevg.upc.es/freeling/
  15. Westerberg, T.: Den stora svenska ordlistan (2008), http://www.dsso.se/
  16. Mikheev, A.: Automatic Rule Induction for Unknown-Word Guessing. Comp. Ling. 23(3), 405–423 (1997)
  17. Stroppa, N., Yvon, F.: An Analogical Learner for Morphological Analysis. In: Proc. of the 9th Conference on Computational Natural Language Learning (CoNLL), pp. 120–127 (2005)
  18. Wicentowski, R.: Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model. In: Proc. of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology, ACL, pp. 70–77 (2004)
  19. Claveau, V., L’Homme, M.C.: Structuring Terminology using Analogy-Based Machine Learning. In: Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, pp. 17–18 (2005)
  20. Baldwin, T.: Bootstrapping Deep Lexical Resources: Resources for Courses. In: Proc. of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, ACL, pp. 67–76 (2005)
  21. Daelemans, W., Zavrel, J., Sloot, K., Bosch, A.: TiMBL: Tilburg Memory-Based Learner, version 6.0, Reference Guide, Technical Report–ILK07-03, Department of Communication and Information Sciences, Tilburg University (2003)
  22. Pirinen, T.: Open Source Morphology for Finnish using Finite-State Methods (in Finnish). Technical Report. Department of Linguistics, University of Helsinki (2008)
  23. Sakarovitch, J.: Éléments de théorie des automates. Vuibert (2003)

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Krister Lindén (1)
  1. Department of General Linguistics, University of Helsinki, Finland
