
Scalable models for computing hierarchies in information networks

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Information hierarchies are organizational structures that are often used to organize and present large and complex information, as well as to provide a mechanism for effective human navigation. Fortunately, many statistical and computational models exist that automatically generate hierarchies; however, the existing approaches do not consider the linkages in information networks that are increasingly common in real-world scenarios. Current approaches also tend to present topics as an abstract probability distribution over words, etc., rather than as tangible nodes from the original network. Furthermore, the statistical techniques in many previous works are not yet capable of processing data at Web scale. In this paper, we present the hierarchical document-topic model (HDTM), which uses a distributed vertex-programming process to calculate a nonparametric Bayesian generative model. Experiments on three medium-size data sets and the entire Wikipedia data set show that HDTM can infer accurate hierarchies even over large information networks.

Notes

  1. http://cse.nd.edu.

  2. http://www.dmoz.org.

  3. Most related works denote the jumping probability as \(\alpha \); however, this would be ambiguous with the Dirichlet hyperparameter \(\alpha \).

Acknowledgments

This work is sponsored by AFOSR Grant FA9550-15-1-0003 and John Templeton Foundation Grant FP053369-M.

Author information

Corresponding author

Correspondence to Tim Weninger.

Cite this article

Shi, B., Weninger, T. Scalable models for computing hierarchies in information networks. Knowl Inf Syst 49, 687–717 (2016). https://doi.org/10.1007/s10115-016-0917-0

  • DOI: https://doi.org/10.1007/s10115-016-0917-0
