Abstract
Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that the method significantly outperforms both standard topic models, applied at either the whole-document or segment level, and previous segmented models, as measured by held-out perplexity.
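The two-parameter Poisson-Dirichlet process at the heart of the model has a well-known Chinese-restaurant sampling construction, in which customer assignments to tables follow counts discounted by a parameter. The sketch below illustrates that construction only; it is not the authors' collapsed Gibbs sampler, and the function name and parameters are illustrative:

```python
import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Sample a random partition from the two-parameter Poisson-Dirichlet
    (Pitman-Yor) process via its Chinese restaurant construction.
    Requires 0 <= discount < 1 and concentration > -discount."""
    rng = random.Random(seed)
    counts = []  # number of customers seated at each table
    n = 0        # customers seated so far
    for _ in range(n_customers):
        # Existing table k is chosen with probability
        #   (counts[k] - discount) / (n + concentration);
        # a new table is opened with probability
        #   (concentration + discount * len(counts)) / (n + concentration).
        u = rng.random() * (n + concentration)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c - discount
            if u < acc:
                counts[k] += 1
                break
        else:
            counts.append(1)  # open a new table
        n += 1
    return counts
```

With `discount = 0` this reduces to the ordinary Dirichlet-process restaurant; a positive discount produces the heavier-tailed, power-law table sizes that make the Pitman-Yor process attractive for text.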
Editors: José L. Balcázar, Francesco Bonchi, Aristides Gionis, and Michèle Sebag.
Cite this article
Du, L., Buntine, W. & Jin, H. A segmented topic model based on the two-parameter Poisson-Dirichlet process. Mach Learn 81, 5–19 (2010). https://doi.org/10.1007/s10994-010-5197-4