Abstract
Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that the method significantly outperforms both standard topic models, applied at either the whole-document or segment level, and previous segmented models, as measured by held-out perplexity.
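The two-parameter Poisson-Dirichlet process at the heart of the model has a well-known Chinese-restaurant sampling construction, in which customer assignments to tables follow counts discounted by a parameter. The sketch below illustrates that construction only; it is not the authors' collapsed Gibbs sampler, and the function name and parameters are illustrative:

```python
import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Sample a random partition from the two-parameter Poisson-Dirichlet
    (Pitman-Yor) process via its Chinese restaurant construction.
    Requires 0 <= discount < 1 and concentration > -discount."""
    rng = random.Random(seed)
    counts = []  # number of customers seated at each table
    n = 0        # customers seated so far
    for _ in range(n_customers):
        # Existing table k is chosen with probability
        #   (counts[k] - discount) / (n + concentration);
        # a new table is opened with probability
        #   (concentration + discount * len(counts)) / (n + concentration).
        u = rng.random() * (n + concentration)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c - discount
            if u < acc:
                counts[k] += 1
                break
        else:
            counts.append(1)  # open a new table
        n += 1
    return counts
```

With `discount = 0` this reduces to the ordinary Dirichlet-process restaurant; a positive discount produces the heavier-tailed, power-law table sizes that make the Pitman-Yor process attractive for text.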
Editors: José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag.
Cite this article
Du, L., Buntine, W. & Jin, H. A segmented topic model based on the two-parameter Poisson-Dirichlet process. Mach Learn 81, 5–19 (2010). https://doi.org/10.1007/s10994-010-5197-4