A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI)

  • Anton Karl Ingason
  • Sigrún Helgadóttir
  • Hrafn Loftsson
  • Eiríkur Rögnvaldsson
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5221)

Abstract

We present a new mixed method lemmatizer for Icelandic, Lemmald, which achieves good performance by relying on IceTagger [1] for tagging and The Icelandic Frequency Dictionary [2] corpus for training. We combine the advantages of data-driven machine learning with linguistic insights to maximize performance. To achieve this, we make use of a novel approach: Hierarchy of Linguistic Identities (HOLI), which involves organizing features and feature structures for the machine learning based on linguistic knowledge. Accuracy of the lemmatization is further improved using an add-on which connects to the Database of Modern Icelandic Inflections [3]. Given correct tagging, our system lemmatizes Icelandic text with an accuracy of 99.55%. We believe our method can be fruitfully adapted to other morphologically rich languages.
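
The abstract does not say which features or feature structures HOLI uses, so the following is only a toy sketch of the general idea of preferring specific linguistic identities over general ones. It is not Lemmald's actual algorithm. The three-level hierarchy (suffix plus full tag, suffix plus word class, suffix only), the suffix-substitution rules, the tag strings, and all function names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def suffix_rule(form, lemma):
    """Return (strip, add): drop `strip` from the end of form, then append `add`."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def identities(form, tag, n=2):
    """Hypothetical hierarchy of keys, ordered from most to least specific."""
    suffix = form[-n:]
    return [
        ("suffix+tag", suffix, tag),        # word ending plus full tag
        ("suffix+class", suffix, tag[:1]),  # word ending plus word class only
        ("suffix", suffix, ""),             # word ending alone
    ]


def train(examples):
    """Count suffix rules under every identity level for (form, tag, lemma) triples."""
    table = defaultdict(Counter)
    for form, tag, lemma in examples:
        rule = suffix_rule(form, lemma)
        for key in identities(form, tag):
            table[key][rule] += 1
    return table


def lemmatize(form, tag, table):
    """Apply the most frequent rule found at the most specific matching level."""
    for key in identities(form, tag):
        if key in table:
            for (strip, add), _count in table[key].most_common():
                if form.endswith(strip):
                    return form[: len(form) - len(strip)] + add
    return form  # no evidence at any level: leave the word form unchanged


# Invented toy training data; the tags are illustrative, not IceTagger's tagset.
TRAINING = [
    ("hesti", "n-sg-dat", "hestur"),
    ("hestar", "n-pl-nom", "hestur"),
    ("fiskar", "n-pl-nom", "fiskur"),
    ("talar", "v-3sg-pres", "tala"),
]

table = train(TRAINING)
print(lemmatize("bátar", "n-pl-nom", table))   # exact suffix+tag match
print(lemmatize("kallar", "v-3sg-pres", table))  # same ending, different tag
print(lemmatize("bátar", "n-pl-unseen", table))  # backs off to suffix+class
```

In this sketch the tag is what separates the noun ending in "hestar" from the verb ending in "talar", which is one reason a lemmatizer of this kind depends on the quality of the tagging it is given.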

Keywords

lemma; lemmatization; normalization; machine learning; BLARK; Icelandic; Lemmald; IceTagger

References

  1. Loftsson, H.: Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics 31(1), 47–72 (2008)
  2. Pind, J., Magnússon, F., Briem, S.: Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik (1991)
  3. Bjarnadóttir, K.: Modern Icelandic Inflections. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2005. Museum Tusculanums Forlag, Copenhagen (2005)
  4. Korenius, T., Laurikkala, J., Järvelin, K., Juhola, M.: Stemming and lemmatization in the clustering of finnish text documents. In: CIKM 2004: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pp. 625–633. ACM, New York (2004)
  5. Braschler, B., Ripplinger, B.: How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval 7(3-4), 291–316 (2004)
  6. Airio, E.: Word normalization and decompounding in mono- and bilingual IR. Information Retrieval 9(3), 249–271 (2006)
  7. Krauwer, S.: The Basic Language Resource Kit (BLARK) as the First Milestone for the Language Resources Roadmap. SPECOM-2003, Moscow, Russia, Accessed 01.04.2008 (2003), http://www.elsnet.org/dox/krauwer-specom2003.pdf
  8. Cassata, F.: Automatic thesaurus extraction for Icelandic. BSc Final Project, Department of Computer Science, Reykjavik University (2007)
  9. Loftsson, H., Rögnvaldsson, E.: IceNLP: A Natural Language Processing Toolkit for Icelandic. In: Proceedings of Interspeech 2007, Special Session: Speech and language technology for less-resourced languages, Antwerp, Belgium (2007)
  10. Porter, M.: An algorithm for suffix stripping. Program 14(3), 130–137 (1980)
  11. Jongejan, B., Haltrup, D.: The CST Lemmatiser. Center for Sprogteknologi, University of Copenhagen version 2.9 (2005)
  12. Carlberger, J., Dalianis, H., Hassel, M., Knutsson, O.: Improving precision in information retrieval for Swedish using stemming. In: Proceedings of NODALIDA 2001 – 13th Nordic conference on computational linguistics (2001)
  13. Dalianis, H., Jongejan, B.: Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST’s Lemmatiser. In: LREC 2006: Proceeding of the International Conference on Language Resources and Evaluation (2006)
  14. Helgadóttir, S.: Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic. In: Holmboe, H. (ed.) Nordisk Sprogteknologi 2004. Museum Tusculanums Forlag, Copenhagen (2005)
  15. Manning, C.: Focusing on Linguistic Representations [abstract]. In: The Natural Language and Speech Processing Colloquium, Stanford, January 19 (2005)
  16. Kenstowicz, M.: Phonology in Generative Grammar (Blackwell Textbooks in Linguistics). Blackwell Publishers, Malden (1993)
  17. Prince, A., Smolensky, P.: Optimality Theory: Constraint Interaction in Generative Grammar. Manuscript, Rutgers University and University of Colorado at Boulder. ROA [ROA #537] (1993/2002), http://roa.rutgers.edu/
  18. Lezius, W., Rapp, R., Wettler, M.: A freely available Morphological Analyzer, Disambiguator, and Context Sensitive Lemmatizer for German. In: Proceedings of the COLING-ACL, pp. 743–747 (1998)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Anton Karl Ingason (1)
  • Sigrún Helgadóttir (2)
  • Hrafn Loftsson (3)
  • Eiríkur Rögnvaldsson (1)

  1. Department of Icelandic, University of Iceland, Reykjavik, Iceland
  2. The Árni Magnusson Institute for Icelandic Studies, Reykjavik, Iceland
  3. School of Computer Science, Reykjavik University, Reykjavik, Iceland
