Stemming and Decompounding for German Text Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2003)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2633)

Abstract

The stemming problem, i.e. finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
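
To make the two operations named in the abstract concrete, here is a minimal Python sketch. It is not the authors' method: the suffix list, the tiny lexicon, and the function names `simple_stem` and `decompound` are all hypothetical choices made for illustration. It shows a naive suffix-stripping stemmer and a dictionary-based splitter that tolerates a linking "s" between compound parts.

```python
from typing import List, Optional, Set

# Toy suffix list, ordered longest first (an assumption, not taken from the paper).
SUFFIXES = ("ern", "em", "en", "er", "es", "e", "n", "s")


def simple_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def decompound(word: str, lexicon: Set[str], min_part: int = 3) -> Optional[List[str]]:
    """Split word into lexicon entries, or return None if no full split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest possible head first, then shorter ones.
    for i in range(len(word) - min_part, min_part - 1, -1):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        # Try the tail as is, then with a linking "s" removed.
        candidates = [tail]
        if tail.startswith("s"):
            candidates.append(tail[1:])
        for rest in candidates:
            if len(rest) >= min_part:
                parts = decompound(rest, lexicon, min_part)
                if parts is not None:
                    return [head] + parts
    return None


if __name__ == "__main__":
    lexicon = {"welt", "meisterschaft", "arbeit", "amt"}  # hypothetical lexicon
    print(simple_stem("Kindern"))
    print(decompound("Weltmeisterschaft", lexicon))
    print(decompound("Arbeitsamt", lexicon))
```

A real system would need a far larger lexicon and rules for ambiguous splits; the sketch only illustrates why decompounding lets a query for "Meisterschaft" match a document containing "Weltmeisterschaft".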

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Braschler, M., Ripplinger, B. (2003). Stemming and Decompounding for German Text Retrieval. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_13

  • DOI: https://doi.org/10.1007/3-540-36618-0_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-01274-0

  • Online ISBN: 978-3-540-36618-8

  • eBook Packages: Springer Book Archive
