Information Retrieval, Volume 9, Issue 3, pp 331–342

A goodness of fit test approach in information retrieval

  • Kostas Fragos
  • Yannis Maistros


Abstract

In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model “fits” the user’s information need (the query model). In statistics, on the other hand, goodness-of-fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the occurrences of the query terms in a particular document could have arisen merely by chance. This question can be quantified by so-called goodness-of-fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness-of-fit tests. Given the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness-of-fit test to assess this hypothesis and calculate the corresponding Chi-square value. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire TREC test collection on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, steadily outperforming the classical OKAPI term-frequency weighting formula, though it falls below the KL-divergence method from the language modeling approach. Despite this, we believe the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution to the terms.
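The ranking idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' actual formula: the abstract does not specify the expected-count model, so this sketch takes expected counts proportional to collection-wide term frequency, and it lets only over-represented terms contribute to the score (ranking by the raw two-sided chi-square statistic would also reward documents that lack the query terms). All function names are our own.

```python
from collections import Counter


def chi_square_score(doc_tokens, query_terms, collection_tf, collection_len):
    """Chi-square goodness-of-fit statistic for the query terms in one document.

    Null hypothesis: query-term occurrences in the document follow the
    collection-wide term distribution, i.e. any association is by chance.
    Expected counts proportional to collection frequency are an assumption;
    the paper's exact model is not given in the abstract.
    """
    doc_tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    chi2 = 0.0
    for t in query_terms:
        observed = doc_tf.get(t, 0)
        expected = doc_len * collection_tf.get(t, 0) / collection_len
        # Assumption: only over-representation raises the score, so a
        # document missing the query terms is not rewarded for deviating.
        if expected > 0 and observed > expected:
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def rank(docs, query_terms):
    """Rank tokenized documents by descending chi-square value."""
    collection_tf = Counter(t for d in docs for t in d)
    collection_len = sum(collection_tf.values())
    scores = [
        (chi_square_score(d, query_terms, collection_tf, collection_len), i)
        for i, d in enumerate(docs)
    ]
    return sorted(scores, reverse=True)
```

For example, with three toy documents and the query terms `["chi", "test"]`, the document consisting only of query terms scores highest, while a document containing none of them scores zero.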


Keywords: goodness of fit tests · Information Retrieval





Copyright information

© Springer Science + Business Media, LLC 2006

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, National Technical University of Athens, Iroon Polytechniou, Zografou, Greece
