The Effectiveness of a Graph-Based Algorithm for Stemming

  • Michela Bacchin
  • Nicola Ferro
  • Massimo Melucci
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2555)

Abstract

In Information Retrieval (IR), stemming enables the matching of query and document terms that share the same meaning but appear in different morphological variants. In this paper we propose and evaluate a statistical graph-based algorithm for stemming. Considering that a word is formed by a stem (prefix) and a derivation (suffix), the key idea is that strongly interlinked prefixes and suffixes form a community of sub-strings. Discovering these communities means searching for the word splits that give the best word stems. We conducted experiments on the CLEF 2001 test subcollections for the Italian language. The results show that stemming improves IR effectiveness. They also show that the effectiveness level of our algorithm is comparable to that of an algorithm based on a-priori linguistic knowledge. This is an encouraging result, particularly in a multi-lingual context.
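The key idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say how communities are scored, so the hub/authority-style mutual reinforcement used here is an assumption, as are the function names (`candidate_splits`, `mutual_reinforcement`, `best_split`), the iteration count, and the rule of multiplying the two scores to rank splits. Each word contributes one prefix-suffix edge per possible split; prefixes and suffixes that are strongly interlinked reinforce each other, and the highest-scoring split is taken as the stem.

```python
from math import sqrt


def candidate_splits(word):
    """Every split of `word` into a non-empty prefix and a non-empty suffix."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an empty or zero map)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: value / norm for key, value in scores.items()}


def mutual_reinforcement(words, iterations=30):
    """Score prefixes and suffixes on the bipartite split graph.

    Assumption: a hub/authority-style iteration stands in for the
    community discovery described in the abstract.
    """
    edges = [split for w in sorted(set(words)) for split in candidate_splits(w)]
    prefix_score = {p: 1.0 for p, _ in edges}
    suffix_score = {s: 1.0 for _, s in edges}
    for _ in range(iterations):
        # A suffix is strong when it is attached to strong prefixes ...
        new_suffix = dict.fromkeys(suffix_score, 0.0)
        for p, s in edges:
            new_suffix[s] += prefix_score[p]
        # ... and a prefix is strong when it takes strong suffixes.
        new_prefix = dict.fromkeys(prefix_score, 0.0)
        for p, s in edges:
            new_prefix[p] += new_suffix[s]
        prefix_score = normalize(new_prefix)
        suffix_score = normalize(new_suffix)
    return prefix_score, suffix_score


def best_split(word, prefix_score, suffix_score):
    """Pick the split whose prefix and suffix scores have the largest product."""
    splits = candidate_splits(word)
    if not splits:
        return word, ""  # one-letter words cannot be split
    return max(
        splits,
        key=lambda ps: prefix_score.get(ps[0], 0.0) * suffix_score.get(ps[1], 0.0),
    )


if __name__ == "__main__":
    # Toy Italian vocabulary, chosen only for illustration.
    vocabulary = ["parlare", "parlato", "parlava", "parlo",
                  "cantare", "cantato", "cantava", "canto"]
    prefixes, suffixes = mutual_reinforcement(vocabulary)
    for word in vocabulary:
        print(word, best_split(word, prefixes, suffixes))
```

On a toy vocabulary like this one, prefixes shared by several words and suffixes shared by several prefixes dominate the scores, which is the intuition behind the "community of sub-strings". A real collection would need frequency weighting and a minimum stem length, neither of which is modelled here.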

Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Michela Bacchin (1)
  • Nicola Ferro (1)
  • Massimo Melucci (1)
  1. Department of Information Engineering, University of Padua, Padova, Italy
