Four Stemmers and a Funeral: Stemming in Hungarian at CLEF 2005

  • Anna Tordai
  • Maarten de Rijke
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4022)

Abstract

We developed algorithmic stemmers for Hungarian and used them for the ad-hoc monolingual task for CLEF 2005. Our goal was to determine what degree of stemming is the most effective. Although on average the stemmers did not perform as well as the best n-gram, we found that stemming over a broad range of suffixes, especially on nouns, is highly useful.
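
To make the contrast between suffix stemming and character n-grams concrete, here is a minimal Python sketch. It is not the authors' stemmers: the suffix list is a small illustrative set of common Hungarian noun endings, and the names `NOUN_SUFFIXES`, `MIN_STEM_LENGTH`, `light_stem`, and `char_ngrams` are hypothetical, chosen only for this example.

```python
# Illustrative suffix list only: a few common Hungarian noun endings,
# NOT the suffix inventory of the stemmers described in the paper.
NOUN_SUFFIXES = ["ban", "ben", "nak", "nek", "val", "vel", "ok", "ek", "ak", "k"]

# Assumed guard against over-stemming very short words.
MIN_STEM_LENGTH = 3


def light_stem(word: str) -> str:
    """Strip the longest matching suffix, keeping at least MIN_STEM_LENGTH characters."""
    for suffix in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return overlapping character n-grams; words no longer than n are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i : i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    for token in ["házban", "házak", "kertek"]:
        print(token, light_stem(token), char_ngrams(token))
```

With this toy list, "házban" (in the house) and "házak" (houses) both reduce to "ház", whereas the 4-gram view of the same words shares no index term. A stemmer covering a broader range of suffixes would conflate more inflected forms, at the cost of a higher risk of over-stemming.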

Keywords

Word List, Compound Word, Truncation Line, Stopword List, Subjunctive Mood
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Anna Tordai (1)
  • Maarten de Rijke (1)
  1. Informatics Institute, University of Amsterdam, Kruislaan 403, Amsterdam
