Distribution Based Stemmer Refinement

Narayan, B. L.; Pal, Sankar K.

doi:10.1007/11590316_108

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 3776)

Included in the following conference series:

International Conference on Pattern Recognition and Machine Intelligence

Abstract

Stemming is a common preprocessing task applied to text corpora. Errors in this process may be corrected either manually or with the help of a corpus. We describe a novel corpus-based stemming technique which models the given words as being generated from a multinomial distribution over the topics available in the corpus. A procedure resembling sequential hypothesis testing groups together distributionally similar words. This stemmer refines any given stemmer, and its strength can be controlled with two thresholds. A refinement based on the 20 Newsgroups data set shows that the proposed method splits equivalence classes appropriately.
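The abstract gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes each word is represented by a vector of per-topic occurrence counts, uses a likelihood-ratio (G) statistic to test whether two count vectors could come from the same multinomial, and applies two thresholds in the spirit of a sequential test: merge below `lower`, split above `upper`, defer otherwise. The names `g_statistic`, `refine_class`, `lower`, `upper`, and the toy counts are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def g_statistic(a: Sequence[int], b: Sequence[int]) -> float:
    """Likelihood-ratio (G) statistic for the hypothesis that two
    topic-count vectors were drawn from the same multinomial."""
    na, nb = sum(a), sum(b)
    if na == 0 or nb == 0:
        return 0.0  # no evidence either way
    g = 0.0
    for x, y in zip(a, b):
        pooled = (x + y) / (na + nb)  # shared topic probability under H0
        if x:
            g += 2.0 * x * math.log(x / (na * pooled))
        if y:
            g += 2.0 * y * math.log(y / (nb * pooled))
    return g


def refine_class(counts: Dict[str, List[int]],
                 lower: float, upper: float) -> List[List[str]]:
    """Split one equivalence class of a base stemmer into groups of
    distributionally similar words (illustrative assumption only)."""
    groups: List[List[str]] = []   # refined classes
    pooled: List[List[int]] = []   # summed topic counts per group
    undecided: List[str] = []

    def add(i: int, word: str) -> None:
        groups[i].append(word)
        pooled[i] = [u + v for u, v in zip(pooled[i], counts[word])]

    for word, vec in counts.items():
        scores = [g_statistic(vec, p) for p in pooled]
        best = min(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] <= lower:
            add(best, word)                 # confidently similar: merge
        elif best is None or scores[best] >= upper:
            groups.append([word])           # confidently different: split
            pooled.append(list(vec))
        else:
            undecided.append(word)          # between thresholds: defer

    # Deferred words join their nearest group once all evidence is pooled.
    for word in undecided:
        vec = counts[word]
        add(min(range(len(pooled)), key=lambda i: g_statistic(vec, pooled[i])),
            word)
    return groups


# Made-up topic counts for words that a base stemmer is assumed to conflate.
toy_counts = {
    "generate":   [40, 2, 1],
    "generation": [35, 3, 2],
    "generous":   [1, 30, 2],
    "generosity": [0, 25, 3],
}
refined = refine_class(toy_counts, lower=5.0, upper=20.0)
```

In this sketch, raising `upper` makes splits rarer and raising `lower` makes merges easier, which is one way two thresholds could control the strength of the refined stemmer. The result depends on the order in which words are visited; the paper may handle ordering and the undecided region differently.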

Author information

Authors and Affiliations

Machine Intelligence Unit, Indian Statistical Institute, 203, B. T. Road, Calcutta, 700108, India
B. L. Narayan & Sankar K. Pal

Editor information

Editors and Affiliations

Center for Soft Computing Research, Machine Intelligence Unit, Indian Statistical Institute, India
Sankar K. Pal
Machine Intelligence Unit, Indian Statistical Institute, 203 B. T. Road, 700108, Kolkata
Sanghamitra Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, 700 108, Kolkata, India
Sambhunath Biswas

About this paper

Cite this paper

Narayan, B.L., Pal, S.K. (2005). Distribution Based Stemmer Refinement. In: Pal, S.K., Bandyopadhyay, S., Biswas, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2005. Lecture Notes in Computer Science, vol 3776. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11590316_108

DOI: https://doi.org/10.1007/11590316_108
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30506-4
Online ISBN: 978-3-540-32420-1
eBook Packages: Computer Science, Computer Science (R0)
