Information Retrieval, Volume 17, Issue 2, pp 153–176

A nonparametric term weighting method for information retrieval based on measuring the divergence from independence

  • İlker Kocabaş
  • Bekir Taner Dinçer
  • Bahar Karaoğlan
Abstract

In this article, we introduce an out-of-the-box automatic term weighting method for information retrieval. The method is based on measuring the degree to which terms diverge from independence of documents in terms of their frequency of occurrence. Divergence from independence has a well-established underlying statistical theory. It provides a plain, mathematically tractable, and nonparametric way of term weighting and, moreover, requires no term frequency normalization. Besides its sound theoretical background, experiments performed on TREC test collections show that its performance is generally comparable to that of state-of-the-art term weighting methods. In both its theoretical and practical aspects, it is a simple but powerful baseline alternative to the state-of-the-art methods.
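
The idea of weighting a term by its divergence from independence can be illustrated with a minimal sketch. The abstract does not state the formula, so the sketch assumes the textbook Pearson standardized residual: the expected frequency of a term in a document under independence is taken as (collection frequency of the term × document length) / collection length, and the weight is (observed − expected) / sqrt(expected), clipped to zero when the term occurs no more often than expected. The function and parameter names are hypothetical, and this is not necessarily the exact weighting function defined in the article.

```python
import math


def expected_tf(term_collection_freq, doc_length, collection_length):
    """Expected term frequency in a document if terms and documents
    were independent (assumed: row total * column total / grand total)."""
    return term_collection_freq * doc_length / collection_length


def dfi_weight(tf, term_collection_freq, doc_length, collection_length):
    """Illustrative divergence-from-independence weight.

    Assumption: the Pearson standardized residual is used as the weight,
    and terms that occur no more often than expected contribute nothing.
    """
    e = expected_tf(term_collection_freq, doc_length, collection_length)
    if e <= 0 or tf <= e:
        return 0.0
    return (tf - e) / math.sqrt(e)


# A term seen 100 times in a 10,000-token collection is expected once
# in a 100-token document; observing it 4 times is a positive divergence.
weight = dfi_weight(tf=4, term_collection_freq=100,
                    doc_length=100, collection_length=10_000)
```

Note that the sketch has no tunable parameter and applies no separate length normalization: document length enters only through the expected frequency, which is consistent with the nonparametric, normalization-free character described in the abstract.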

Keywords

Information retrieval; Nonparametric index term weighting; Statistical dependence; Pearson’s Chi-Square statistics

Notes

Acknowledgments

The authors are thankful to the anonymous reviewers for their valuable comments and advice, which made this a better paper, and to Craig Macdonald, Giambattista Amati, and Iadh Ounis for their kind help. Index term weighting by DFI was developed under the project titled “Design of A Statistical Information Retrieval System”, supported by TUBITAK, The Scientific and Technological Research Council of Turkey, under Project No: 107E192. Any opinions, findings, conclusions, or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  • İlker Kocabaş (1, corresponding author)
  • Bekir Taner Dinçer (2, 3)
  • Bahar Karaoğlan (1)

  1. International Computer Institute, Ege University, Bornova, Izmir, Turkey
  2. Department of Statistics, Muğla University, Mugla, Turkey
  3. Department of Computer Engineering, Muğla University, Mugla, Turkey