Distribution Based Stemmer Refinement

  • B. L. Narayan
  • Sankar K. Pal
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)


Stemming is a common preprocessing task applied to text corpora. Errors introduced by a stemmer can be corrected either manually or with the help of a corpus. We describe a novel corpus-based stemming technique which models the given words as being generated from a multinomial distribution over the topics available in the corpus. A procedure resembling sequential hypothesis testing then groups together distributionally similar words. The method can refine any given stemmer, and its strength can be controlled through two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.
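
The abstract does not spell out the test statistic or the exact role of the two thresholds, so the following is only a rough sketch of the general idea, not the authors' procedure. It assumes each word is represented by a vector of per-topic occurrence counts, compares a word with an existing group using a likelihood-ratio (G) statistic for "both are drawn from the same multinomial", merges when the statistic is at most a hypothetical `lower` threshold, separates when it is at least a hypothetical `upper` threshold, and defers the decision in between. The function names, the thresholds, and the toy counts are all illustrative assumptions.

```python
import math


def log_likelihood_ratio(a, b):
    """G statistic for H0: count vectors a and b share one multinomial."""
    n_a, n_b = sum(a), sum(b)
    total = n_a + n_b
    if total == 0:
        return 0.0
    g = 0.0
    for x, y in zip(a, b):
        pooled = (x + y) / total  # pooled estimate of the topic probability
        if x > 0:
            g += x * math.log(x / (n_a * pooled))
        if y > 0:
            g += y * math.log(y / (n_b * pooled))
    return 2.0 * g


def refine_class(counts, lower, upper):
    """Split one stemmer equivalence class into distributionally similar groups.

    counts maps each word of the class to its per-topic count vector.
    lower and upper are assumed thresholds: merge at or below lower,
    separate at or above upper, defer the decision in between.
    """
    groups = []    # each group: {"words": [...], "counts": [...]}
    deferred = []  # words whose statistic fell between the thresholds

    def closest(vec):
        best, best_g = None, math.inf
        for group in groups:
            g = log_likelihood_ratio(vec, group["counts"])
            if g < best_g:
                best, best_g = group, g
        return best, best_g

    def place(word, vec, final):
        best, g = closest(vec)
        if best is not None and g <= lower:
            # Similar enough: join the group and pool the evidence.
            best["words"].append(word)
            best["counts"] = [x + y for x, y in zip(best["counts"], vec)]
        elif best is None or g >= upper or final:
            # Clearly different (or no more evidence to wait for): new group.
            groups.append({"words": [word], "counts": list(vec)})
        else:
            deferred.append((word, vec))

    for word, vec in counts.items():
        place(word, vec, final=False)
    # Re-test undecided words once the groups have accumulated more counts.
    for word, vec in list(deferred):
        place(word, vec, final=True)
    return [group["words"] for group in groups]


# Made-up topic counts (three topics) for one class produced by a base stemmer.
toy_counts = {
    "organ": [40, 2, 1],
    "organs": [35, 3, 2],
    "organization": [2, 50, 30],
    "organizations": [3, 45, 28],
}
refined = refine_class(toy_counts, lower=3.0, upper=10.0)
print(refined)
```

In this sketch, raising `lower` merges more words and lowering `upper` forces more early splits, which is one plausible way two thresholds could control the strength of a refinement.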


Keywords (added by machine, not by the authors): Equivalence Class; Multinomial Distribution; Vector Space Model; Similar Word; Text Corpus



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • B. L. Narayan (1)
  • Sankar K. Pal (1)
  1. Machine Intelligence Unit, Indian Statistical Institute, Calcutta, India
