Abstract
The stemming problem, i.e., finding a common stem for different forms of a term, has been extensively studied for English, but considerably less is known for other languages. Previously, it has been claimed that stemming is essential for highly declensional languages. We report on our experiments on stemming for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Rarely do studies on stemming for any language cover more than one or two different approaches. This paper makes a major contribution that transcends its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when using a simple approach, and that carefully designed decompounding (the splitting of compound words) remarkably boosts performance. All findings are based on a thorough analysis using a large, reliable test collection.
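To make the notion of decompounding concrete, the following is a minimal sketch of a dictionary-based splitter. It is not the method evaluated in the paper: the toy lexicon `LEXICON`, the list of linking elements `LINKERS`, the function name `decompound`, and the `min_len` threshold are all assumptions chosen for illustration.

```python
from typing import List, Optional

# Hypothetical toy lexicon; a real system would rely on a large word list.
LEXICON = {"haus", "tür", "schlüssel", "arbeit", "markt", "buch", "laden"}

# Assumed linking elements that may join the parts of a German compound.
LINKERS = ("", "s", "es", "n", "en")


def decompound(word: str, min_len: int = 3) -> Optional[List[str]]:
    """Split `word` into lexicon entries, or return None if no split is found."""
    word = word.lower()
    if word in LEXICON:
        return [word]
    # Try the longest candidate head first, leaving at least `min_len`
    # characters for the remainder of the word.
    for i in range(len(word) - min_len, min_len - 1, -1):
        head = word[:i]
        if head not in LEXICON:
            continue
        rest = word[i:]
        # Allow an optional linking element between head and remainder.
        for linker in LINKERS:
            if rest.startswith(linker):
                tail = decompound(rest[len(linker):], min_len)
                if tail is not None:
                    return [head] + tail
    return None


if __name__ == "__main__":
    for w in ("Haustürschlüssel", "Arbeitsmarkt", "Buchladen", "Computer"):
        print(w, "->", decompound(w))
```

With this toy lexicon, "Arbeitsmarkt" is split into "arbeit" and "markt" by skipping the linking "s", while a word with no lexicon parts is left unsplit. An unconstrained splitter of this kind can produce spurious splits on a realistic lexicon, which is one reason careful design matters.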
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Braschler, M., Ripplinger, B. (2003). Stemming and Decompounding for German Text Retrieval. In: Sebastiani, F. (eds) Advances in Information Retrieval. ECIR 2003. Lecture Notes in Computer Science, vol 2633. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36618-0_13
DOI: https://doi.org/10.1007/3-540-36618-0_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-01274-0
Online ISBN: 978-3-540-36618-8
eBook Packages: Springer Book Archive