The use of Exploratory Data Analysis in Information Retrieval Research

Advances in Information Retrieval

Part of the book series: The Information Retrieval Series (INRE, volume 7)

Abstract

We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.
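The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes the standard textbook definitions: Good's weight of evidence, woe(h : e) = log[p(e | h) / p(e | not h)], and the classical idf, log(N / df). The function and parameter names are hypothetical, and the modified idf formula proposed in the chapter is not given in this abstract, so it is not reproduced here.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classical inverse document frequency, log(N / df).

    Assumption: the plain log(N / df) form; the chapter's modified
    formula is not stated in the abstract and is not shown here.
    """
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must be in the range 1..num_docs")
    return math.log(num_docs / doc_freq)


def weight_of_evidence(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Good's weight of evidence for hypothesis h provided by evidence e.

    woe(h : e) = log( p(e | h) / p(e | not h) )
    Positive values favor h, negative values favor not-h.
    """
    if p_e_given_h <= 0 or p_e_given_not_h <= 0:
        raise ValueError("both probabilities must be positive")
    return math.log(p_e_given_h / p_e_given_not_h)


def log_odds(p: float) -> float:
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))
```

Under these definitions, the weight of evidence is the change in log odds of the hypothesis once the evidence is observed: `log_odds(p(h | e)) - log_odds(p(h))`. In a retrieval setting, h would typically be "the document is relevant" and e "the query term occurs in the document", which is one way to read the abstract's link between WOE and term weighting.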

Copyright information

© 2002 Kluwer Academic Publishers

About this chapter

Cite this chapter

Greiff, W.R. (2002). The use of Exploratory Data Analysis in Information Retrieval Research. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_2

  • DOI: https://doi.org/10.1007/0-306-47019-5_2

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-7923-7812-9

  • Online ISBN: 978-0-306-47019-6

  • eBook Packages: Springer Book Archive
