Abstract
Stemming is a common preprocessing step applied to text corpora. Errors made by a stemmer may be corrected either manually or with the help of a corpus. We describe a novel corpus-based stemming technique that models the given words as being generated from a multinomial distribution over the topics available in the corpus. A procedure resembling sequential hypothesis testing groups together distributionally similar words. The resulting stemmer refines any given stemmer, and its strength can be controlled with two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.
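The abstract does not give the test statistic or the exact sequential procedure, so the following is only a rough illustrative sketch of the general idea, not the authors' algorithm. It assumes each word in one equivalence class of an existing stemmer is represented by a vector of per-topic occurrence counts, compares words with a standard likelihood-ratio (G) statistic for two multinomial samples, and uses two hypothetical thresholds, `lower` and `upper`, to decide whether to merge, split, or leave a word undecided. The function names, the threshold values, and the counts are all made up for illustration.

```python
import math


def g_statistic(a, b):
    """Likelihood-ratio (G) statistic for the hypothesis that two
    topic-count vectors were drawn from the same multinomial."""
    total_a, total_b = sum(a), sum(b)
    total = total_a + total_b
    g = 0.0
    for x, y in zip(a, b):
        pooled = x + y
        if pooled == 0:
            continue  # a topic unused by both samples contributes nothing
        for observed, row_total in ((x, total_a), (y, total_b)):
            if observed > 0:
                expected = row_total * pooled / total
                g += 2.0 * observed * math.log(observed / expected)
    return g


def refine_class(topic_counts, lower, upper):
    """Greedily split one equivalence class into groups of words with
    similar topic distributions (an assumed procedure, for illustration).

    A word joins its closest group if the statistic is at most `lower`,
    starts a new group if it is at least `upper`, and is otherwise
    reported as undecided.
    """
    groups = []      # each group keeps its words and pooled topic counts
    undecided = []
    for word, counts in topic_counts.items():
        best, best_g = None, None
        for group in groups:
            g = g_statistic(counts, group["counts"])
            if best_g is None or g < best_g:
                best, best_g = group, g
        if best is not None and best_g <= lower:
            best["words"].append(word)
            best["counts"] = [p + c for p, c in zip(best["counts"], counts)]
        elif best is None or best_g >= upper:
            groups.append({"words": [word], "counts": list(counts)})
        else:
            undecided.append(word)
    return [group["words"] for group in groups], undecided


# Made-up counts over three topics for a hypothetical equivalence class.
example_class = {
    "general":    [40, 35, 30],
    "generally":  [20, 18, 14],
    "generator":  [2, 30, 1],
    "generators": [1, 16, 1],
}
groups, undecided = refine_class(example_class, lower=2.0, upper=9.0)
print(groups)
print(undecided)
```

In this sketch, moving the two thresholds closer together or further apart changes how readily a class is split, which is one plausible reading of how two thresholds could control the strength of a refined stemmer.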
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Narayan, B.L., Pal, S.K. (2005). Distribution Based Stemmer Refinement. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2005. Lecture Notes in Computer Science, vol 3776. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11590316_108
DOI: https://doi.org/10.1007/11590316_108
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30506-4
Online ISBN: 978-3-540-32420-1
eBook Packages: Computer Science, Computer Science (R0)