Inverse Document Frequency (IDF): A Measure of Deviations from Poisson

  • K. Church
  • W. Gale
Part of the Text, Speech and Language Technology book series (TLTB, volume 11)

Abstract

Low frequency words tend to be rich in content, and vice versa. But not all equally frequent words are equally meaningful. We will use inverse document frequency (IDF), a quantity borrowed from Information Retrieval, to distinguish words like "somewhat" and "boycott". Both "somewhat" and "boycott" appeared approximately 1000 times in a corpus of 1989 Associated Press articles, but "boycott" is a better keyword because its IDF is farther from what would be expected by chance (Poisson).
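The comparison described in the abstract can be sketched in a few lines. The sketch assumes the usual definition IDF = -log2(df/D), where df is the number of documents containing the word and D is the number of documents in the corpus. It also assumes that "expected by chance" means a Poisson model with rate cf/D, where cf is the word's total count, so that the predicted fraction of documents containing the word is 1 - exp(-cf/D). The function names and all counts below are hypothetical and chosen only for illustration; they are not figures from the Associated Press corpus.

```python
import math


def observed_idf(df, num_docs):
    """IDF computed from the observed document frequency: -log2(df / D)."""
    return -math.log2(df / num_docs)


def poisson_idf(cf, num_docs):
    """IDF predicted if the cf occurrences fell into documents by chance.

    Assumes a Poisson model with rate theta = cf / D, under which a document
    contains the word at least once with probability 1 - exp(-theta).
    """
    theta = cf / num_docs
    # -expm1(-theta) equals 1 - exp(-theta), computed accurately for small theta.
    return -math.log2(-math.expm1(-theta))


def idf_deviation(cf, df, num_docs):
    """Observed IDF minus Poisson-predicted IDF; larger means more clumped."""
    return observed_idf(df, num_docs) - poisson_idf(cf, num_docs)


# Hypothetical counts (not taken from the chapter): two words with the same
# total frequency, one spread thinly over many documents and one concentrated
# in fewer documents.
NUM_DOCS = 100_000
spread_word = idf_deviation(cf=1000, df=980, num_docs=NUM_DOCS)
bursty_word = idf_deviation(cf=1000, df=600, num_docs=NUM_DOCS)

print(f"spread word deviation: {spread_word:.3f} bits")
print(f"bursty word deviation: {bursty_word:.3f} bits")
```

With these made-up counts, the word concentrated in fewer documents shows the larger deviation, which is the property the abstract associates with better keywords.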

Keywords

Information Retrieval; Word Frequency; Hide Variable; Inverse Document Frequency; Poisson Mixture
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media Dordrecht 1999
