Analysis and Algorithms for Stemming Inversion

  • Ingo Feinerer
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6458)

Abstract

Stemming is a fundamental technique for processing large amounts of data in information retrieval and text mining. After processing, however, reversing the stemming is often desirable, e.g., for human interpretation or for methods that operate on sequences of characters. We present a formal analysis of the stemming inversion problem and show that the underlying optimization problem, which captures conceptual groups as known from under- and overstemming, is of high computational complexity. We present efficient heuristic algorithms for practical application in information retrieval and test our approach on real data.
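
To make the notion of stemming inversion concrete, the following is a minimal sketch and not the algorithms from the paper. It assumes a toy suffix-stripping stemmer (`toy_stem`, hypothetical, not Porter's algorithm) and a simple frequency heuristic (map each stem back to the most frequent original form seen in the corpus). It does not model the optimization problem over conceptual groups that the abstract refers to.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stemmer: strip one common English suffix (assumption, not Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverse(tokens):
    """Map each stem to the original word forms observed for it, with counts."""
    inverse = defaultdict(Counter)
    for token in tokens:
        inverse[toy_stem(token)][token] += 1
    return inverse


def invert(stem, inverse):
    """Heuristic inversion: most frequent original form; ties go to the
    shorter word, then to alphabetical order. Unknown stems are returned as-is."""
    candidates = inverse.get(stem)
    if not candidates:
        return stem
    return min(candidates, key=lambda w: (-candidates[w], len(w), w))


tokens = "mining mined mines mining texts text text walking".split()
inverse = build_inverse(tokens)
print(invert("min", inverse))
print(invert("text", inverse))
```

In this toy corpus the stem `min` has three candidate completions (`mining`, `mined`, `mines`), which illustrates why inversion is ambiguous: a stem generally corresponds to several original word forms, and some selection rule is needed.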

Keywords

Stemming inversion 

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Ingo Feinerer, Vienna University of Technology, Austria
