Scalable models for computing hierarchies in information networks

Shi, Baoxu; Weninger, Tim

doi:10.1007/s10115-016-0917-0

Regular Paper
Published: 22 January 2016

Volume 49, pages 687–717, (2016)
Knowledge and Information Systems

Baoxu Shi¹ &
Tim Weninger¹

Abstract

Information hierarchies are organizational structures that are often used to organize and present large and complex information as well as to provide a mechanism for effective human navigation. Fortunately, many statistical and computational models exist that automatically generate hierarchies; however, the existing approaches do not consider linkages in information networks that are increasingly common in real-world scenarios. Current approaches also tend to present topics as an abstract probability distribution over words, etc., rather than as tangible nodes from the original network. Furthermore, the statistical techniques present in many previous works are not yet capable of processing data at Web-scale. In this paper, we present the hierarchical document-topic model (HDTM), which uses a distributed vertex-programming process to calculate a nonparametric Bayesian generative model. Experiments on three medium-size data sets and the entire Wikipedia data set show that HDTM can infer accurate hierarchies even over large information networks.

Notes

http://cse.nd.edu.
http://www.dmoz.org.
Most related works denote the jumping probability as \(\alpha \); however, this would be ambiguous with the Dirichlet hyperparameter \(\alpha \).

Acknowledgments

This work is sponsored by AFOSR Grant FA9550-15-1-0003 and John Templeton Foundation Grant FP053369-M.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, 46556, USA
Baoxu Shi & Tim Weninger

Corresponding author

Correspondence to Tim Weninger.

About this article

Cite this article

Shi, B., Weninger, T. Scalable models for computing hierarchies in information networks. Knowl Inf Syst 49, 687–717 (2016). https://doi.org/10.1007/s10115-016-0917-0

Received: 29 December 2014
Revised: 02 September 2015
Accepted: 11 January 2016
Published: 22 January 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10115-016-0917-0
