A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 5221)

Abstract

We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger [1] for tagging and The Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we make use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
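The abstract does not spell out how HOLI is implemented, so the following is only a minimal sketch of the general idea of combining data-driven rules with linguistically ordered features and a lexicon add-on. It assumes suffix-rewrite rules learned from (word form, tag, lemma) triples, a three-level back-off from full tag to word class to no tag information, and a small lookup table standing in for an inflection database. The class name, the tag strings, and the toy training data are all hypothetical and are not taken from Lemmald.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Strip the longest common prefix and return (form_ending, lemma_ending)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


class ToyLemmatizer:
    """Hypothetical sketch: learned suffix rules with back-off plus a lexicon override."""

    def __init__(self, lexicon=None, max_suffix=3):
        self.rules = defaultdict(Counter)   # key -> counts of (strip, add) rules
        self.lexicon = dict(lexicon or {})  # (form, tag) -> lemma, stands in for a database
        self.max_suffix = max_suffix

    @staticmethod
    def keys(form, tag, n):
        """Keys for the last n characters, ordered from most to least specific."""
        ending = form[-n:]
        return [("tag", tag, ending), ("class", tag[:1], ending), ("any", "", ending)]

    def train(self, triples):
        for form, tag, lemma in triples:
            rule = suffix_rule(form, lemma)
            for n in range(1, min(self.max_suffix, len(form)) + 1):
                for key in self.keys(form, tag, n):
                    self.rules[key][rule] += 1

    def lemmatize(self, form, tag):
        # The lexicon add-on takes priority over learned rules.
        if (form, tag) in self.lexicon:
            return self.lexicon[(form, tag)]
        # Try the most specific level first, and longer endings before shorter ones.
        for level in range(3):
            for n in range(min(self.max_suffix, len(form)), 0, -1):
                key = self.keys(form, tag, n)[level]
                if key in self.rules:
                    strip, add = self.rules[key].most_common(1)[0][0]
                    if form.endswith(strip):
                        return form[: len(form) - len(strip)] + add
        return form  # no rule applies: fall back to the form itself


# Toy data with made-up tag strings; the first character stands for the word class.
TOY_TRAINING = [
    ("hesti", "n.sg.dat", "hestur"),
    ("fiski", "n.sg.dat", "fiskur"),
    ("hestar", "n.pl.nom", "hestur"),
]

lemmatizer = ToyLemmatizer(lexicon={("menn", "n.pl.nom"): "maður"})
lemmatizer.train(TOY_TRAINING)

print(lemmatizer.lemmatize("gesti", "n.sg.dat"))
print(lemmatizer.lemmatize("menn", "n.pl.nom"))
```

In this sketch the ordering of the back-off levels is where linguistic knowledge would enter: a rule seen with the exact tag is trusted before one seen only with the same word class, and the lexicon handles irregular forms that suffix rules cannot capture.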

Keywords

  • lemma
  • lemmatization
  • normalization
  • machine learning
  • BLARK
  • Icelandic
  • Lemmald
  • IceTagger

Chapter length: 12 pages

References

  1. Loftsson, H.: Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31(1), 47–72 (2008)

  2. Pind, J., Magnússon, F., Briem, S.: Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik (1991)

  3. Bjarnadóttir, K.: Modern Icelandic Inflections. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2005. Museum Tusculanums Forlag, Copenhagen (2005)

  4. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of Finnish text documents. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 625–633. ACM, New York (2004)

  5. Braschler, B., Ripplinger, B.: How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7(3-4), 291–316 (2004)

  6. Airio, E.: Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9(3), 249–271 (2006)

  7. Krauwer, S.: The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. In: SPECOM-2003, Moscow, Russia (2003), http://www.elsnet.org/dox/krauwer-specom2003.pdf (accessed 01.04.2008)

  8. Cassata, F.: Automatic thesaurus extraction for Icelandic. BSc Final Project, Department of Computer Science, Reykjavik University (2007)

  9. Loftsson, H., Rögnvaldsson, E.: IceNLP: A Natural Language Processing Toolkit for Icelandic. In: Proceedings of Interspeech 2007, Special Session: Speech and language technology for less-resourced languages, Antwerp, Belgium (2007)

  10. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)

  11. Jongejan, B., Haltrup, D.: The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen, version 2.9 (2005)

  12. Carlberger, J., Dalianis, H., Hassel, M., Knutsson, O.: Improving precision in information retrieval for Swedish using stemming. In: Proceedings of NODALIDA 2001 – 13th Nordic conference on computational linguistics (2001)

  13. Dalianis, H., Jongejan, B.: Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser. In: LREC 2006: Proceedings of the International Conference on Language Resources and Evaluation (2006)

  14. Helgadóttir, S.: Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2004. Museum Tusculanums Forlag, Copenhagen (2005)

  15. Manning, C.: Focusing on Linguistic Representations [abstract]. In: The Natural Language and Speech Processing Colloquium, Stanford, January 19 (2005)

  16. Kenstowicz, M.: Phonology in Generative Grammar (Blackwell Textbooks in Linguistics). Blackwell Publishers, Malden (1993)

  17. Prince, A., Smolensky, P.: Optimality Theory: Constraint Interaction in Generative Grammar. Manuscript, Rutgers University and University of Colorado at Boulder. ROA #537 (1993/2002), http://roa.rutgers.edu/

  18. Lezius, W., Rapp, R., Wettler, M.: A freely available Morphological Analyzer, Disambiguator, and Context Sensitive Lemmatizer for German. In: Proceedings of the COLING-ACL, pp. 743–747 (1998)

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ingason, A.K., Helgadóttir, S., Loftsson, H., Rögnvaldsson, E. (2008). A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_20

  • DOI: https://doi.org/10.1007/978-3-540-85287-2_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85286-5

  • Online ISBN: 978-3-540-85287-2

  • eBook Packages: Computer Science, Computer Science (R0)