Skip to main content

Comparison of Different Lemmatization Approaches through the Means of Information Retrieval Performance

  • Conference paper
Text, Speech and Dialogue (TSD 2010)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 6231))

Included in the following conference series:

Abstract

This paper presents a quantitative performance analysis of two different approaches to the lemmatization of the Czech text data. The first one is based on manually prepared dictionary of lemmas and set of derivation rules while the second one is based on automatic inference of the dictionary and the rules from training data. The comparison is done by evaluating the mean Generalized Average Precision (mGAP) measure of the lemmatized documents and search queries in the set of information retrieval (IR) experiments. Such method is suitable for efficient and rather reliable comparison of the lemmatization performance since a correct lemmatization has proven to be crucial for IR effectiveness in highly inflected languages. Moreover, the proposed indirect comparison of the lemmatizers circumvents the need for manually lemmatized test data which are hard to obtain and also face the problem of incompatible sets of lemmas across different systems.

This research was supported by the Min. of Education of the Czech Republic, project. No. MŠMT LC536, by the grant of the University of West Bohemia, project No. SGS-2010-054 and by the Grant Agency of Academy of Sciences of the Czech Republic., project. No. 1ET101470416.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., Mikulová, M.: Prague Dependency Treebank 2.0. Linguistic Data Consortium, Philadelphia, USA (2006)

    Google Scholar 

  2. Hajič, J., Hladká, B.: Tagging Inflective Languages: Prediction of Morphological Categories for a Rich, Structured Tagset. In: Proceedings of COLING-ACL Conference, Montreal, Canada, pp. 483–490 (1998)

    Google Scholar 

  3. Kanis, J., Müller, L.: Automatic Lemmatizer Construction with Focus on OOV Words Lemmatization. In: Matoušek, V., Mautner, P., Pavelka, T. (eds.) TSD 2005. LNCS (LNAI), vol. 3658, pp. 132–139. Springer, Heidelberg (2005)

    Google Scholar 

  4. Ispell dictionaries and rules files, http://fmg-www.cs.ucla.edu/geoff/ispell-dictionaries.html

  5. Ircing, P., Müller, L.: Benefit of Proper Language Processing for Czech Speech Retrieval in the CL-SR Task at CLEF 2006. In: Peters, C., Clough, P., Gey, F.C., Karlgren, J., Magnini, B., Oard, D.W., de Rijke, M., Stempfhuber, M. (eds.) CLEF 2006. LNCS, vol. 4730, pp. 759–765. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  6. Ircing, P., Psutka, J., Vavruška, J.: What Can and Cannot Be Found in Czech Spontaneous Speech Using Document-Oriented IR Methods UWB at CLEF 2007 CL-SR Track. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 712–718. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

  7. Ircing, P., Pecina, P., Oard, D.W., Wang, J., White, R.W., Hoidekr, J.: Information Retrieval Test Collection for Searching Spontaneous Czech Speech. In: Matoušek, V., Mautner, P. (eds.) TSD 2007. LNCS (LNAI), vol. 4629, pp. 439–446. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  8. Ponte, J.M., Croft, W.B.: A language Modeling Approach to Information Retrieval. In: SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 275–281. ACM, New York (1998)

    Chapter  Google Scholar 

  9. The Wilcoxon matched-pairs signed-ranks test, http://www.fon.hum.uva.nl/service/statistics/signed_rank_test.html

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kanis, J., Skorkovská, L. (2010). Comparison of Different Lemmatization Approaches through the Means of Information Retrieval Performance . In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2010. Lecture Notes in Computer Science(), vol 6231. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15760-8_13

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-15760-8_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15759-2

  • Online ISBN: 978-3-642-15760-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics