Abstract
We report on a line of work in which techniques of Exploratory Data Analysis (EDA) have been used as a vehicle for better understanding of the issues confronting the researcher in information retrieval (IR). EDA is used for visualizing and studying data for the purpose of uncovering statistical regularities that might not be apparent otherwise. The analysis is carried out in terms of the formal notion of Weight of Evidence (WOE). As a result of this analysis, a novel theory in support of the use of inverse document frequency (idf) for document ranking is presented, and experimental evidence is given in favor of a modification of the classical idf formula motivated by the analysis. This approach is then extended to other sources of evidence commonly used for ranking in information retrieval systems.
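The two quantities named in the abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions, the classical idf of Sparck Jones (log of collection size over document frequency) and Good's weight of evidence (log of a likelihood ratio), not the chapter's own notation or its modified idf formula, neither of which is given here. The function names and the example numbers are assumptions made purely for illustration.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classical idf: log(N / n_t), with N documents and n_t containing the term."""
    if not 0 < doc_freq <= num_docs:
        raise ValueError("doc_freq must satisfy 0 < doc_freq <= num_docs")
    return math.log(num_docs / doc_freq)


def weight_of_evidence(p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Good's weight of evidence: woe(h : e) = log[p(e | h) / p(e | not h)]."""
    return math.log(p_e_given_h / p_e_given_not_h)


def log_odds(p: float) -> float:
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def posterior_log_odds(prior: float, woe: float) -> float:
    """Weight of evidence is additive on the log-odds scale."""
    return log_odds(prior) + woe


if __name__ == "__main__":
    # Hypothetical numbers: a term occurring in 10 of 1000 documents.
    print(idf(1000, 10))
    # Hypothetical likelihoods of observing the evidence under h and not-h.
    woe = weight_of_evidence(0.8, 0.2)
    print(posterior_log_odds(0.1, woe))
```

Working on the log-odds scale makes separate sources of evidence additive, which is one reason a weight-of-evidence view is convenient when several ranking signals are combined.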
Copyright information
© 2002 Kluwer Academic Publishers
About this chapter
Cite this chapter
Greiff, W.R. (2002). The Use of Exploratory Data Analysis in Information Retrieval Research. In: Croft, W.B. (ed.) Advances in Information Retrieval. The Information Retrieval Series, vol 7. Springer, Boston, MA. https://doi.org/10.1007/0-306-47019-5_2
DOI: https://doi.org/10.1007/0-306-47019-5_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-7923-7812-9
Online ISBN: 978-0-306-47019-6
eBook Packages: Springer Book Archive