Hierarchical topic modeling with nested hierarchical Dirichlet process

Ding, Yi-qun; Li, Shan-ping; Zhang, Zhen; Shen, Bin

doi:10.1631/jzus.A0820796

Published: 01 June 2009

Journal of Zhejiang University-SCIENCE A, Volume 10, pages 858–867 (2009)

Abstract

This paper deals with the statistical modeling of latent topic hierarchies in text corpora. The height of the topic tree is assumed to be fixed, while the number of topics on each level is unknown a priori and is inferred from the data. Taking a nonparametric Bayesian approach to this problem, we propose a new probabilistic generative model based on the nested hierarchical Dirichlet process (nHDP) and present a Markov chain Monte Carlo sampling algorithm for inferring the topic tree structure, the word distribution of each topic, and the topic distribution of each document. Our theoretical analysis and experimental results show that this model produces a more compact hierarchical topic structure and captures more fine-grained topic relationships than the hierarchical latent Dirichlet allocation model.
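
The abstract does not spell out the model, so the sketch below is not the paper's nHDP or its sampler. It only illustrates, as background, how a nonparametric Bayesian prior can leave the number of groups open and let it grow with the data. It uses the Chinese restaurant process, a standard sequential view of the Dirichlet process. The function name `crp_assignments`, the concentration parameter `alpha` (assumed positive), and the `seed` argument are illustrative assumptions, not names from the paper.

```python
import random


def crp_assignments(n_items, alpha, seed=0):
    """Assign n_items to groups with a Chinese restaurant process.

    Item i joins existing group k with probability proportional to the
    group's current size, or opens a new group with probability
    proportional to alpha (assumed > 0). The number of groups is
    therefore not fixed in advance.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of items currently in group k
    labels = []  # labels[i] = group index chosen for item i
    for _ in range(n_items):
        # Weights: one entry per existing group, plus alpha for a new group.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new group
        else:
            counts[k] += 1  # join an existing group
        labels.append(k)
    return labels, counts


if __name__ == "__main__":
    labels, counts = crp_assignments(n_items=200, alpha=1.0)
    print("number of groups:", len(counts))
    print("group sizes:", counts)
```

Larger values of `alpha` tend to open more groups. In a hierarchical topic model, a prior of this kind would be placed at each node of the fixed-height tree, but how the nHDP combines such priors is described in the full article, not in this abstract.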

Author information

Authors and Affiliations

School of Computer Science and Technology, Zhejiang University, Hangzhou, 310027, China
Yi-qun Ding, Shan-ping Li & Zhen Zhang
State Street Hangzhou, Hangzhou, 310000, China
Bin Shen

Corresponding author

Correspondence to Shan-ping Li.

Additional information

Project (No. 60773180) supported by the National Natural Science Foundation of China

About this article

Cite this article

Ding, Yq., Li, Sp., Zhang, Z. et al. Hierarchical topic modeling with nested hierarchical Dirichlet process. J. Zhejiang Univ. Sci. A 10, 858–867 (2009). https://doi.org/10.1631/jzus.A0820796

Received: 15 November 2008
Accepted: 10 April 2009
Published: 01 June 2009
Issue Date: June 2009
DOI: https://doi.org/10.1631/jzus.A0820796
