Corpus-Based Lexeme Ranking for Morphological Guessers

  • Krister Lindén
  • Jussi Tuovila
Part of the Communications in Computer and Information Science book series (CCIS, volume 41)

Abstract

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a morphological lexicon, we need to determine its base form and indicate its inflectional paradigm; together, a base form and a paradigm define a lexeme. In this article, we evaluate a lexicon-based method, augmented with data from a corpus or the internet, for generating and ranking lexeme suggestions for new words. Because an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may become more efficient to create the entries by hand. By generating lexeme suggestions with an entry generator and then generating some key word forms for each suggested lexeme, we can look for support for the lexemes in a corpus. Our ranking methods have 56–79% average precision and 78–89% recall among the top 6 candidates, i.e., an F-score of 65–84%, indicating that the first correct entry suggestion is on average found as the second or third candidate. The corpus-based ranking methods were found to be significant in practice, as they save the lexicographer time by increasing recall by 7–8% among the top candidates.
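
The ranking idea in the abstract (generate key word forms for each suggested lexeme, then look for those forms in a corpus) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy paradigms, the function names, and the scoring rule (number of attested key forms, with total frequency as a tie-breaker) are all assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical toy paradigms (not from the paper):
# name -> (ending of the base form, endings of a few key word forms).
PARADIGMS = {
    "verb_e": ("e", ["e", "es", "ed", "ing"]),
    "noun_s": ("", ["", "s"]),
}


def key_forms(base_form, paradigm):
    """Generate the key word forms of a lexeme (base form + paradigm)."""
    base_ending, endings = PARADIGMS[paradigm]
    stem = base_form[: len(base_form) - len(base_ending)]
    return [stem + ending for ending in endings]


def corpus_support(lexeme, counts):
    """Assumed score: (number of attested key forms, their total frequency)."""
    forms = key_forms(*lexeme)
    attested = sum(1 for form in forms if counts[form] > 0)
    total = sum(counts[form] for form in forms)
    return attested, total


def rank_lexemes(candidates, counts):
    """Order lexeme suggestions so the best-supported ones come first.

    sorted() is stable, so ties keep the entry generator's original order.
    """
    return sorted(candidates, key=lambda lex: corpus_support(lex, counts), reverse=True)


# Toy corpus and two competing suggestions for the new word "googled".
corpus = "they google it , she googles it , we googled it , googling is easy".split()
counts = Counter(corpus)
candidates = [("googled", "noun_s"), ("google", "verb_e")]
ranked = rank_lexemes(candidates, counts)
print(ranked)
```

In this toy setting the verb reading is ranked first because all four of its key forms occur in the corpus, whereas the noun reading is supported only by the observed form itself. A real system would draw candidates from an entry generator and counts from a large corpus or web queries.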

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Krister Lindén (1)
  • Jussi Tuovila (1)
  1. University of Helsinki, Helsinki, Finland
