AIRS 2008: Information Retrieval Technology, pp 10-21
Exploring the Stability of IDF Term Weighting
Abstract
TF∙IDF has been widely used as a term weighting scheme in today’s information retrieval systems. However, computation time and cost have become major concerns for its application. This study investigated the similarities and differences between IDF distributions computed over the global collection and over different samples, and tested the stability of the IDF measure across collections. An algorithm based on random samples produced a good approximation to the IDF computed over the entire collection, with less computational overhead. This practice may be particularly informative and helpful for the analysis of large databases or dynamic environments such as the Web.
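As a rough illustration of the idea in the abstract, the sketch below computes IDF over a full collection and over a simple random sample, then compares the two. It assumes the classic definition idf(t) = log(N / df(t)); the abstract does not state the exact variant, sampling scheme, or comparison statistic used in the study, so the function names `idf`, `sample_idf`, and `mean_abs_diff`, the `fraction` and `seed` parameters, and the mean-absolute-difference measure are all hypothetical choices made for this example.

```python
import math
import random
from collections import Counter


def idf(docs):
    """Return {term: log(N / df)} for a list of tokenized documents.

    Assumes the classic IDF form; the paper may use a different variant.
    """
    n = len(docs)
    # Count each term once per document to get document frequency.
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def sample_idf(docs, fraction, seed=0):
    """Estimate IDF from a simple random sample of the collection.

    `fraction` and `seed` are illustrative parameters, not from the paper.
    """
    rng = random.Random(seed)
    k = max(1, int(len(docs) * fraction))
    return idf(rng.sample(docs, k))


def mean_abs_diff(full, approx):
    """Mean absolute IDF difference over terms present in both mappings.

    A hypothetical comparison measure chosen for simplicity.
    """
    shared = full.keys() & approx.keys()
    if not shared:
        return float("nan")
    return sum(abs(full[t] - approx[t]) for t in shared) / len(shared)
```

With a tokenized collection `docs`, calling `mean_abs_diff(idf(docs), sample_idf(docs, 0.1))` gives one crude indication of how closely a 10% sample tracks the global IDF values. Terms that never appear in the sample have no estimate and are left out of the comparison.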
Keywords
term weighting, term frequency, inverse document frequency, stability, feature oriented samples, random samples