Skip to main content

A Probabilistic Model for Guessing Base Forms of New Words by Analogy

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2008)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 4919))

Abstract

Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. Looking at English, one might assume that they appear in base form, i.e., the lexical look-up form. However, in more highly inflecting languages like Finnish or Swahili only 40-50 % of new words appear in base form. In order to index documents or discover translations for these languages, it would be useful to reduce new words to their base forms as well. We often have access to analyzes for more frequent words which shape our intuition for how new words will inflect. We formalize this into a probabilistic model for lemmatization of new words using analogy, i.e., guessing base forms, and test the model on English, Finnish, Swedish and Swahili demonstrating that we get a recall of 89-99 % with an average precision of 76-94 % depending on language and the amount of training material.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Creutz, M., Lagus, K., Lindén, K., Virpioja, S.: Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-Inflecting and Compounding Languages. In: Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, Estonia, April 4–8 (2005)

    Google Scholar 

  2. Goldsmith, J.: Morphological Analogy: Only a Beginning (2007), http://hum.uchicago.edu/~jagoldsm/Papers/analogy.pdf

  3. Kuenning, G.: Dictionaries for International Ispell (2007), http://www.lasr.cs.ucla.edu/geoff/ispell-dictionaries.html

  4. Kurimo, M., Creutz, M., Turunen, V.: Overview of Morpho Challenge in CLEF 200. In: Nardi, A., Peters, C. (eds.) Working Notes of the CLEF 2007 Workshop, udapest, Hungary, September 19–21 (2007)

    Google Scholar 

  5. Lindén, K.: Multilingual Modeling of Cross-lingual Spelling Variants. Journal of Information Retrieval 9, 295–310 (2006)

    Article  Google Scholar 

  6. Allauzen, C., Riley, M., Schalkwyk, J., Skut, W., Mohri, M.: OpenFst: A General and Efficient Weighted Finite-State Transducer Library. LNCS (to appear)

    Google Scholar 

  7. Lombardy, S., Régis-Gianas, Y., Sakarovitch, J.: Introducing Vaucanson. Theoretical Computer Science 328, 77–96 (2004)

    Article  MATH  MathSciNet  Google Scholar 

  8. Wicentowski, R.: Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD Thesis. Baltimore, Maryland (2002)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Alexander Gelbukh

Rights and permissions

Reprints and permissions

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lindén, K. (2008). A Probabilistic Model for Guessing Base Forms of New Words by Analogy. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2008. Lecture Notes in Computer Science, vol 4919. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78135-6_10

Download citation

  • DOI: https://doi.org/10.1007/978-3-540-78135-6_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-78134-9

  • Online ISBN: 978-3-540-78135-6

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics