Sequential latent Dirichlet allocation

Du, Lan; Buntine, Wray; Jin, Huidong; Chen, Changyou

doi:10.1007/s10115-011-0425-1

Regular Paper
Published: 10 June 2011

Volume 31, pages 475–503 (2012)

Knowledge and Information Systems

Lan Du^1,2, Wray Buntine^1,2, Huidong Jin^1,3 & Changyou Chen^1,2

Abstract

Understanding how topics within a document evolve over the structure of the document is an interesting and potentially important problem in exploratory and predictive text analytics. In this article, we address this problem by presenting a novel variant of latent Dirichlet allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e. a document consists of multiple segments (e.g. chapters, paragraphs), each of which is correlated to its antecedent and subsequent segments. Such progressive sequential dependency is captured by using the hierarchical two-parameter Poisson–Dirichlet process (HPDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the SeqLDA based on the HPDP. Our experimental results on patent documents show that, by considering the sequential structure within a document, our SeqLDA model achieves higher fidelity than LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a clearer sequential topic structure than LDA, as we show in experiments on several books such as Melville’s ‘Moby Dick’.
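
The perplexity measure mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: it assumes the common definition of perplexity as the exponential of the negative average per-token log-likelihood on held-out text, and the function name `perplexity` and the toy probabilities are hypothetical. Lower values mean the model assigns higher probability to the held-out words.

```python
import math


def perplexity(doc_token_probs):
    """Perplexity of held-out text (hypothetical helper, not from the paper).

    doc_token_probs: list of documents, each a list of the predictive
    probabilities p(w | model) that a fitted topic model assigns to the
    document's held-out tokens.
    """
    total_log_prob = 0.0
    total_tokens = 0
    for token_probs in doc_token_probs:
        for p in token_probs:
            total_log_prob += math.log(p)
            total_tokens += 1
    # exp of the negative average log-likelihood per token
    return math.exp(-total_log_prob / total_tokens)


# Toy check: a model that spreads probability uniformly over a
# four-word vocabulary has a perplexity of about 4.
uniform_model = [[0.25] * 4, [0.25] * 6]
print(perplexity(uniform_model))
```

Under this definition, comparing two models on the same held-out tokens reduces to comparing their average log-likelihoods, which is how a claim of better perplexity for one model over another would typically be read.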

Author information

Authors and Affiliations

CECS, The Australian National University, Canberra, ACT, Australia
Lan Du, Wray Buntine, Huidong Jin & Changyou Chen
National ICT Australia, Building A, 7 London Circuit, Canberra, ACT, 2601, Australia
Lan Du, Wray Buntine & Changyou Chen
CSIRO Mathematics, Informatics and Statistics, Canberra, ACT, Australia
Huidong Jin

Corresponding author

Correspondence to Lan Du.

About this article

Cite this article

Du, L., Buntine, W., Jin, H. et al. Sequential latent Dirichlet allocation. Knowl Inf Syst 31, 475–503 (2012). https://doi.org/10.1007/s10115-011-0425-1

Received: 08 January 2011
Revised: 14 April 2011
Accepted: 24 May 2011
Published: 10 June 2011
Issue Date: June 2012
DOI: https://doi.org/10.1007/s10115-011-0425-1
