Exploring the Stability of IDF Term Weighting

  • Xin Fu
  • Miao Chen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4993)

Abstract

TF∙IDF has been widely used as a term weighting scheme in today’s information retrieval systems. However, computation time and cost have become major concerns for its application. This study investigated the similarities and differences between IDF distributions based on the global collection and on different samples, and tested the stability of the IDF measure across collections. A more efficient algorithm based on random samples generated a good approximation to the IDF computed over the entire collection, but with less computational overhead. This practice may be particularly informative and helpful for analysis of large databases or dynamic environments like the Web.
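The idea of approximating collection-wide IDF from a random sample can be illustrated with a minimal sketch. The abstract does not specify the authors' algorithm, so everything here is an assumption made for illustration: the IDF form log(N / df), uniform sampling without replacement, Pearson correlation as the agreement measure, and the helper names `idf_table`, `sample_idf`, and `pearson`.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Return {term: log(N / df)} for a list of tokenized documents.

    Assumed IDF form; the paper may use a different variant.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return {term: math.log(n_docs / count) for term, count in df.items()}


def sample_idf(docs, sample_size, seed=0):
    """Estimate IDF from a uniform random sample of the documents."""
    rng = random.Random(seed)
    return idf_table(rng.sample(docs, sample_size))


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Synthetic corpus (hypothetical data): 2000 documents of 20 tokens,
    # drawn from a 500-term vocabulary with Zipf-like weights.
    rng = random.Random(42)
    vocab = [f"t{i}" for i in range(500)]
    weights = [1.0 / rank for rank in range(1, 501)]
    corpus = [rng.choices(vocab, weights=weights, k=20) for _ in range(2000)]

    global_idf = idf_table(corpus)
    approx_idf = sample_idf(corpus, sample_size=200)

    # Compare only terms that occur in both the sample and the full corpus.
    shared = sorted(set(global_idf) & set(approx_idf))
    r = pearson([global_idf[t] for t in shared],
                [approx_idf[t] for t in shared])
    print(f"terms compared: {len(shared)}, Pearson r: {r:.3f}")
```

The sample pass touches only `sample_size` documents, which is where the saving in computation would come from. Terms absent from the sample have no estimate at all, so a real system would need a fallback for them.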

Keywords

term weighting; term frequency; inverse document frequency; stability; feature oriented samples; random samples

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Xin Fu
    • 1
  • Miao Chen
    • 1
  1. University of North Carolina, Chapel Hill, USA
