Abstract
Information hierarchies are organizational structures that are often used to organize and present large and complex information, as well as to provide a mechanism for effective human navigation. Fortunately, many statistical and computational models exist that automatically generate hierarchies; however, the existing approaches do not consider the linkages in information networks that are increasingly common in real-world scenarios. Current approaches also tend to present topics as an abstract probability distribution over words, etc., rather than as tangible nodes from the original network. Furthermore, the statistical techniques present in many previous works are not yet capable of processing data at Web scale. In this paper, we present the hierarchical document-topic model (HDTM), which uses a distributed vertex-programming process to calculate a nonparametric Bayesian generative model. Experiments on three medium-size data sets and the entire Wikipedia data set show that HDTM can infer accurate hierarchies even over large information networks.
Notes
Most related works denote the jumping probability as \(\alpha \); however, this would be ambiguous with the Dirichlet hyperparameter \(\alpha \).
Acknowledgments
This work is sponsored by AFOSR Grant FA9550-15-1-0003 and John Templeton Foundation Grant FP053369-M.
Cite this article
Shi, B., Weninger, T. Scalable models for computing hierarchies in information networks. Knowl Inf Syst 49, 687–717 (2016). https://doi.org/10.1007/s10115-016-0917-0