Putting Successor Variety Stemming to Work

  • Benno Stein
  • Martin Potthast
Part of the Studies in Classification, Data Analysis, and Knowledge Organization book series (STUDIES CLASS)

Abstract

Stemming algorithms find canonical forms for inflected words, e.g. for declined nouns or conjugated verbs. Since such a unification of words with respect to gender, number, tense, and case is a language-specific issue, stemming algorithms operationalize a set of linguistically motivated rules for the language in question. The best-known rule-based algorithm for the English language is that of [Porter (1980)].

The paper presents a statistical stemming approach which is based on the analysis of the distribution of word prefixes in a document collection, and which is thus largely language-independent. In particular, our approach tackles the problem of index construction for multi-lingual documents. Related work on statistical stemming either focuses on stemming quality (such as [Bacchin et al. (2002)] or [Bordag (2005)]) or investigates runtime performance ([Mayfield and McNamee (2003)], for example), but neither line of work provides a reasonable tradeoff between the two. For selected retrieval tasks under vector-based document models we report new results on stemming quality and its dependency on collection size.

Interestingly, successor variety stemming has neither been investigated with respect to similarity in index construction nor been applied as a technology in current retrieval applications. The results show that this neglect is not justified.
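
The prefix-distribution idea behind successor variety stemming can be illustrated with a small sketch. It is not the implementation evaluated in the paper: the function names, the `min_stem` parameter, and the choice of a simple local-peak cut-off are assumptions made for illustration, and a plain dictionary of prefixes stands in for whatever index structure a real system would use. The successor variety of a prefix is taken here to be the number of distinct letters that follow it in the vocabulary.

```python
from collections import defaultdict


def build_successor_table(vocabulary):
    """Map every proper prefix to the set of letters that follow it."""
    table = defaultdict(set)
    for word in vocabulary:
        for i in range(1, len(word)):
            table[word[:i]].add(word[i])
    return table


def successor_varieties(word, table):
    """Successor variety of each proper prefix of `word`, shortest first."""
    return [len(table.get(word[:i], ())) for i in range(1, len(word))]


def stem(word, table, min_stem=3):
    """Cut `word` after the first prefix whose variety is a local peak.

    Assumption: the peak criterion is only one of several possible
    segmentation heuristics; `min_stem` (hypothetical) avoids very
    short stems.
    """
    sv = successor_varieties(word, table)
    for i in range(min_stem, len(word)):
        cur = sv[i - 1]                      # variety of prefix of length i
        prev = sv[i - 2] if i >= 2 else 0    # variety of the shorter prefix
        nxt = sv[i] if i < len(sv) else 0    # variety of the longer prefix
        if cur > prev and cur > nxt:
            return word[:i]
    return word  # no peak found: leave the word unchanged


vocabulary = ["able", "ape", "beatable", "fixable", "read", "readable",
              "reading", "reads", "red", "rope", "ripe"]
table = build_successor_table(vocabulary)
print(successor_varieties("readable", table))  # [3, 2, 1, 3, 1, 1, 1]
print(stem("readable", table))                 # read
```

In this toy vocabulary the prefix "read" is followed by three different letters while its neighbours "rea" and "reada" are each followed by one, so the peak marks a plausible stem boundary. Because the table is derived purely from the collection, the same code applies to any language, at the cost of depending on how large and representative the collection is.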

References

  1. ABDOU, S., RUCH, P. and SAVOY, J. (2005): Evaluation of Stemming, Query Expansion and Manual Indexing Approaches for the Genomic Task. In: Proc. of TREC’05.
  2. BACCHIN, M., FERRO, N. and MELUCCI, M. (2002): Experiments to Evaluate a Statistical Stemming Algorithm. In: Proc. of CLEF’02.
  3. BORDAG, S. (2005): Unsupervised Knowledge-free Morpheme Boundary Detection. In: Proc. of RANLP’05.
  4. BRASCHLER, M. and RIPPLINGER, B. (2004): How Effective is Stemming and Decompounding for German Text Retrieval? Information Retrieval, 7,3–4, 291–316.
  5. FRAKES, W.B. (1984): Term Conflation for Information Retrieval. In: Proc. of SIGIR’84.
  6. FRAKES, W.B. and BAEZA-YATES, R. (1992): Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Upper Saddle River.
  7. GUSFIELD, D. (1997): Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
  8. HARMAN, D. (1991): How Effective is Suffixing? J. of the ASIS&T, 42,1, 7–15.
  9. KROVETZ, R. (1993): Viewing Morphology as an Inference Process. In: Proc. of SIGIR’93.
  10. LOVINS, J.B. (1968): Development of a Stemming Algorithm. Mechanical Translation and Computational Linguistics, 11,1, 23–31.
  11. MAYFIELD, J. and MCNAMEE, P. (2003): Single n-gram Stemming. In: Proc. of SIGIR’03.
  12. MORRISON, D.R. (1968): PATRICIA - Practical Algorithm to Retrieve Information Coded in Alphanumeric. Journal of the ACM, 15,4, 514–534.
  13. PAICE, C.D. (1990): Another Stemmer. In: SIGIR Forum, 24,3, 56–61.
  14. PORTER, M.F. (1980): An Algorithm for Suffix Stripping. Program, 14,3, 130–137.
  15. PORTER, M. (2001): Snowball. http://snowball.tartarus.org/
  16. ROSE, T.G., STEVENSON, M. and WHITEHEAD, M. (2002): The Reuters Corpus Volume 1 - From Yesterday’s News to Tomorrow’s Language Resources. In: Proc. of LREC’02.
  17. STEIN, B., MEYER ZU EISSEN, S. and WISSBROCK, F. (2003): On Cluster Validity and the Information Need of Users. In: Proc. of AIA’03.
  18. WEINER, P. (1973): Linear Pattern Matching Algorithm. In: Proc. of the 14th IEEE Symp. on Switching and Automata Theory.

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Benno Stein (1)
  • Martin Potthast (1)
  1. Faculty of Media, Media Systems, Bauhaus University Weimar, Weimar, Germany
