The BNB Distribution for Text Modeling

  • Stéphane Clinchant
  • Eric Gaussier
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4956)

Abstract

In this paper, we first review the phenomena of burstiness and the aftereffect of future sampling, and propose a formal, operational criterion for characterizing distributions with respect to these phenomena. We then introduce the Beta negative binomial (BNB) distribution for text modeling and show how it relates to several models, in particular the Laplace law of succession and the tf-itf model used in the Divergence from Randomness framework of [2]. Finally, we illustrate the behavior of this distribution in text categorization and information retrieval experiments.
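
As a rough illustration of the kind of behavior the abstract refers to, the sketch below evaluates a Beta negative binomial probability mass function and the probability of seeing one more occurrence given that n have already been seen. It assumes the common textbook parameterization (r, alpha, beta), which may differ from the one used in the paper; the helper names and the parameter values are hypothetical, and the check shown is only an informal stand-in for the paper's formal criterion.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def bnb_pmf(k, r, alpha, beta):
    """P(X = k) for a Beta negative binomial variable.

    Assumed parameterization (not taken from the paper):
    P(X = k) = Gamma(r + k) / (k! Gamma(r)) * B(alpha + r, beta + k) / B(alpha, beta)
    """
    log_p = (lgamma(r + k) - lgamma(k + 1) - lgamma(r)
             + log_beta(alpha + r, beta + k) - log_beta(alpha, beta))
    return exp(log_p)


def continuation_probs(pmf, n_max):
    """Return [P(X >= n + 1 | X >= n) for n = 0 .. n_max - 1]."""
    probs = []
    survival = 1.0  # P(X >= 0)
    for n in range(n_max):
        next_survival = survival - pmf(n)  # P(X >= n + 1)
        probs.append(next_survival / survival)
        survival = next_survival
    return probs


# Hypothetical parameter values chosen only for illustration.
bnb = continuation_probs(lambda k: bnb_pmf(k, 1.0, 2.0, 1.0), 5)

# A geometric distribution with success probability 0.5, for comparison.
geometric = continuation_probs(lambda k: 0.5 * 0.5 ** k, 5)

print("BNB:      ", [round(p, 3) for p in bnb])
print("Geometric:", [round(p, 3) for p in geometric])
```

With these parameters the BNB continuation probabilities increase with n, so each observed occurrence makes a further one more likely, whereas the memoryless geometric distribution keeps them constant.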

References

  1. Airoldi, E., Cohen, W., Fienberg, S.: Statistical models for frequent terms in text. CMU-CALD Technical Report (2004), http://reports-archive.adm.cs.cmu.edu/cald2005.html
  2. Amati, G., van Rijsbergen, C.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Transactions on Information Systems 20(4) (2002)
  3. Church, K., Gale, W.: Poisson mixtures. Natural Language Engineering 1(2) (1995)
  4. Elkan, C.: Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In: ICML 2006: Proceedings of the 23rd International Conference on Machine Learning. ACM Press, New York (2006)
  5. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. I. Wiley, New York (1968)
  6. Johnson, N., Kemp, A., Kotz, S.: Univariate Discrete Distributions. John Wiley, Chichester (1993)
  7. Katz, S.: Distribution of content words and phrases in text and language modeling. Natural Language Engineering 2(1) (1996)
  8. Madsen, R.E., Kauchak, D., Elkan, C.: Modeling word burstiness using the Dirichlet distribution. In: ICML 2005: Proceedings of the 22nd International Conference on Machine Learning. ACM Press, New York (2005)
  9. McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: Proceedings of AAAI-98 Workshop on Learning for Text Categorization (1998)
  10. Minka, T.: Estimating a Dirichlet Distribution. PhD thesis (2003). Unpublished paper available at: www.research.microsoft.com/~minka
  11. Nallapati, R., Minka, T., Robertson, S.: The smoothed-Dirichlet distribution: a new building block for generative models. CIIR Technical Report (2006), http://www.cs.cmu.edu/~nmramesh/sd_tc.pdf
  12. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Research and Development in Information Retrieval, SIGIR 1998 (1998)
  13. Rennie, J., Shih, L., Teevan, J., Karger, D.: Tackling the poor assumptions of naive Bayes classifiers. In: ICML 2003 (2003)
  14. Rigouste, L.: Méthodes probabilistes pour l’analyse exploratoire de données textuelles. PhD thesis, Thèse de l’ENST, Télécom Paris (2006)

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Stéphane Clinchant (1)
  • Eric Gaussier (2)
  1. Xerox Research Centre Europe, Meylan, France
  2. University Joseph Fourier (LIG), BP 53, 38041 Grenoble cedex 9, France