Abstract
The Hierarchical Dirichlet Process (HDP) is a Bayesian nonparametric prior for grouped data, such as collections of documents. Each group is modeled as a mixture over a shared set of mixture densities, or topics, and the number of topics is not fixed but grows with the size of the data. The Nested Dirichlet Process (NDP) builds on the HDP to cluster the documents, while allowing each document to choose only from a specific set of topic mixtures. In many applications, this set of topic mixtures can be identified with the set of entities for the collection. However, documents are often associated with multiple entities, and the set of entities may not be completely known in advance. In this paper, we address this problem using a nested HDP (nHDP), in which the base distribution of an outer HDP is itself an HDP. The inner HDP creates a countably infinite set of topic mixtures and associates them with entities, while the outer HDP associates documents with these entities, or topic mixtures. Using a nested Chinese Restaurant Franchise (nCRF) representation of the nested HDP, we propose a collapsed Gibbs sampling inference algorithm for the model. Because the two HDP levels are coupled, scaling up inference is a challenge; we address it by extending the direct sampling scheme of the HDP to two levels. In experiments on two real-world research corpora, we show that, even when large fractions of author entities are hidden, the nHDP generalizes significantly better than existing models. More importantly, we are able to detect missing authors with reasonable accuracy.
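The two-level structure described in the abstract can be illustrated with a rough generative sketch. The code below is not the paper's nCRF-based Gibbs sampler, and it is not claimed to match the authors' exact model specification. It assumes a finite truncation of both HDP levels, and every name and hyperparameter in it (`stick_breaking`, `dirichlet`, `sample_corpus`, the concentration values, the vocabulary size) is hypothetical. An inner level draws one topic mixture per entity from shared topic weights, an outer level draws one entity mixture per document from shared entity weights, and each word picks an entity, then a topic, then a word index.

```python
import random


def stick_breaking(concentration, truncation, rng):
    """Truncated stick-breaking weights; the last stick absorbs the remainder."""
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        frac = rng.betavariate(1.0, concentration)
        weights.append(remaining * frac)
        remaining *= 1.0 - frac
    weights.append(remaining)
    return weights


def dirichlet(alphas, rng):
    """Dirichlet draw via normalized gamma variates (alphas floored for safety)."""
    draws = [rng.gammavariate(max(a, 1e-9), 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_corpus(n_docs=3, doc_len=20, n_topics=6, n_entities=5, vocab=30, seed=0):
    """Sample (entity, topic, word) triples from a truncated two-level sketch."""
    rng = random.Random(seed)
    # Topic-word distributions (assumed symmetric Dirichlet prior).
    topic_words = [dirichlet([0.1] * vocab, rng) for _ in range(n_topics)]
    # Inner level: shared topic weights, then one topic mixture per entity.
    beta = stick_breaking(1.0, n_topics, rng)
    entity_topics = [dirichlet([2.0 * b for b in beta], rng) for _ in range(n_entities)]
    # Outer level: shared entity weights, then one entity mixture per document.
    eta = stick_breaking(1.0, n_entities, rng)
    docs = []
    for _ in range(n_docs):
        doc_entities = dirichlet([2.0 * e for e in eta], rng)
        words = []
        for _ in range(doc_len):
            a = rng.choices(range(n_entities), weights=doc_entities)[0]
            z = rng.choices(range(n_topics), weights=entity_topics[a])[0]
            w = rng.choices(range(vocab), weights=topic_words[z])[0]
            words.append((a, z, w))
        docs.append(words)
    return docs
```

Calling `sample_corpus(seed=1)` returns a list of documents, each a list of `(entity, topic, word)` index triples. The truncation levels only stand in for the countably infinite sets of topics and entities that the nonparametric model allows.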
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Agrawal, P., Tekumalla, L.S., Bhattacharya, I. (2013). Nested Hierarchical Dirichlet Process for Nonparametric Entity-Topic Analysis. In: Blockeel, H., Kersting, K., Nijssen, S., Železný, F. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2013. Lecture Notes in Computer Science, vol 8189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40991-2_36
DOI: https://doi.org/10.1007/978-3-642-40991-2_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40990-5
Online ISBN: 978-3-642-40991-2
eBook Packages: Computer Science (R0)