
Information Retrieval, Volume 7, Issue 3–4, pp 291–316

How Effective is Stemming and Decompounding for German Text Retrieval?

  • Martin Braschler
  • Bärbel Ripplinger
Article

Abstract

Information retrieval systems operating on free text face difficulties when word forms used in the query and documents do not match. The usual solution is the use of a “stemming component” that reduces related word forms to a common stem. Extensive studies of such components exist for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. The major contribution of our work that goes beyond its focus on German lies in the investigation of a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding, the splitting of compound words, remarkably boosts performance. All findings are based on a thorough analysis using a large reliable test collection.
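To make the two operations named in the abstract concrete, the following is a minimal sketch and not any of the stemmers or decompounders evaluated in the article. The suffix list, the tiny lexicon, and the function names `toy_stem` and `toy_decompound` are assumptions made purely for illustration.

```python
from typing import List, Optional, Set

# Hypothetical suffix list for a crude German suffix stripper (illustration only).
# Longer suffixes come first so that "ern" is tried before "n".
SUFFIXES = ("ern", "en", "er", "es", "em", "e", "n", "s")


def toy_stem(word: str, min_stem: int = 4) -> str:
    """Strip at most one suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


def toy_decompound(word: str, lexicon: Set[str], min_part: int = 3) -> List[str]:
    """Split a compound into lexicon words, allowing a linking "s" between parts.

    Returns the lowercased word as a single part if no complete split exists.
    """

    def split(rest: str) -> Optional[List[str]]:
        if rest in lexicon:
            return [rest]
        # Try the longest candidate head first; head and tail each keep
        # at least min_part characters.
        for i in range(len(rest) - min_part, min_part - 1, -1):
            head, tail = rest[:i], rest[i:]
            if head not in lexicon:
                continue
            parts = split(tail)
            if parts is None and tail.startswith("s"):
                # Retry after dropping a linking "s" (as in "Arbeit-s-markt").
                parts = split(tail[1:])
            if parts is not None:
                return [head] + parts
        return None

    w = word.lower()
    return split(w) or [w]


if __name__ == "__main__":
    lexicon = {"arbeit", "markt", "fußball", "welt", "meisterschaft"}
    print([toy_stem(w) for w in ("Kind", "Kinder", "Kindern", "Kindes")])
    print(toy_decompound("Arbeitsmarkt", lexicon))
    print(toy_decompound("Fußballweltmeisterschaft", lexicon))
```

With this sketch, the inflected forms of "Kind" conflate to one stem, so a query term and a document term in different forms would match. The rule set is deliberately naive: it ignores umlaut alternation (for example "Haus" and "Häuser" are not conflated), and the splitter depends entirely on what the lexicon happens to contain.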

Keywords: stemming, decompounding, German, evaluation, morphological analysis



Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Martin Braschler (1, 2)
  • Bärbel Ripplinger (3)

  1. Eurospider Information Technology AG, Zürich
  2. Institut Interfacultaire d'Informatique, Université de Neuchâtel, Neuchâtel, Switzerland
  3. Eurospider Information Technology AG, Zürich, Switzerland
