Improving the Automatic Retrieval of Text Documents

  • Maristella Agosti
  • Michela Bacchin
  • Nicola Ferro
  • Massimo Melucci
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2785)

Abstract

This paper reports on a statistical stemming algorithm based on link analysis. Considering that a word is formed by a prefix (stem) and a suffix, the key idea is that the interlinked prefixes and suffixes form a community of sub-strings. Thus, discovering these communities means searching for the best word splits that give the best word stems. The algorithm has been used in our participation in the CLEF 2002 Italian monolingual task. The experimental results show that stemming improves text retrieval effectiveness. They also show that the effectiveness level of our algorithm is comparable to that of an algorithm based on a-priori linguistic knowledge.
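The idea of interlinked prefixes and suffixes can be illustrated with a minimal sketch. The abstract does not spell out the scoring procedure, so everything below is an assumption: each word is split at every position, each split is treated as a link from a prefix to a suffix, and the scores come from a HITS-style mutual-reinforcement iteration. The function names (`all_splits`, `score_substrings`, `best_split`) and the product-of-scores selection rule are hypothetical and are not taken from the paper.

```python
from math import sqrt


def all_splits(word):
    """Return every (prefix, suffix) split with both parts non-empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def normalize(scores):
    """Scale the scores to unit Euclidean length (no-op guard for all zeros)."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {key: value / norm for key, value in scores.items()}


def score_substrings(words, iterations=20):
    """Assumed HITS-style scoring: a prefix is good if it links to good
    suffixes, and a suffix is good if it is linked from good prefixes."""
    links = sorted({split for w in set(words) for split in all_splits(w)})
    prefix_score = {p: 1.0 for p, _ in links}
    suffix_score = {s: 1.0 for _, s in links}
    for _ in range(iterations):
        new_prefix = dict.fromkeys(prefix_score, 0.0)
        for p, s in links:
            new_prefix[p] += suffix_score[s]
        new_suffix = dict.fromkeys(suffix_score, 0.0)
        for p, s in links:
            new_suffix[s] += new_prefix[p]
        prefix_score = normalize(new_prefix)
        suffix_score = normalize(new_suffix)
    return prefix_score, suffix_score


def best_split(word, prefix_score, suffix_score):
    """Pick the split whose prefix and suffix scores have the largest product
    (a hypothetical selection rule chosen for illustration)."""
    if len(word) < 2:
        return word, ""
    return max(
        all_splits(word),
        key=lambda ps: prefix_score.get(ps[0], 0.0) * suffix_score.get(ps[1], 0.0),
    )


if __name__ == "__main__":
    vocabulary = ["comput", "computer", "computers", "computing", "computed"]
    prefixes, suffixes = score_substrings(vocabulary)
    for w in vocabulary:
        print(w, best_split(w, prefixes, suffixes))
```

Plain mutual reinforcement of this kind may favour very short, very frequent sub-strings, so a practical stemmer would likely need extra constraints such as a minimum stem length. The sketch only shows how the prefix-suffix graph is built and scored.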

Keywords

Italian Text Retrieval; Information Retrieval; Web Information Gathering; Stemming; Link-based Analysis

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Maristella Agosti (1)
  • Michela Bacchin (1)
  • Nicola Ferro (1)
  • Massimo Melucci (1)
  1. Department of Information Engineering, University of Padua, Padova, Italy
