Constructing topical hierarchies in heterogeneous information networks

Abstract

Many digital document collections (e.g., scientific publications, enterprise reports, news articles, and social media) can be modeled as heterogeneous information networks that link text with multiple types of entities. Constructing high-quality hierarchies that represent topics at multiple granularities benefits tasks such as search, information browsing, and pattern mining. In this work, we present an algorithm for recursively constructing multi-typed topical hierarchies. Unlike traditional text-based topic modeling, our approach handles both textual phrases and multiple types of entities through a newly designed clustering and ranking algorithm for heterogeneous network data, together with the mining and ranking of topical patterns of different types. Our experiments on datasets from two different domains demonstrate that our algorithm yields high-quality, multi-typed topical hierarchies.

Notes

  1. We chose papers published in 20 conferences related to the areas of Artificial Intelligence, Databases, Data Mining, Information Retrieval, Machine Learning, and Natural Language Processing from http://www.dblp.org/.

  2. As a paper is always published in exactly one venue, there can naturally be no venue–venue links.

  3. The 16 topics chosen were: Bill Clinton, Boston Marathon, Earthquake, Egypt, Gaza, Iran, Israel, Joe Biden, Microsoft, Mitt Romney, Nuclear power, Steve Jobs, Sudan, Syria, Unemployment, US Crime.

  4. The one exception is venues, as there are only 20 venues in the DBLP dataset, so we set \(K=3\) in this case.
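
The observation in Note 2 can be illustrated with a small sketch. It assumes that links in the network are co-occurrence counts between typed nodes appearing in the same document; the notes suggest this reading, but this page does not confirm it. The toy records, the field names (`term`, `author`, `venue`), and the helper `build_typed_links` are hypothetical and are not taken from the article.

```python
from collections import Counter
from itertools import combinations, product

# Hypothetical toy records; field names and values are illustrative assumptions.
papers = [
    {"term": ["topic", "hierarchy"], "author": ["A", "B"], "venue": ["KDD"]},
    {"term": ["topic", "network"], "author": ["A"], "venue": ["ICDM"]},
]

def build_typed_links(docs):
    """Count co-occurrence links between typed nodes within each document.

    Assumption: two nodes are linked once for every document in which
    they appear together.
    """
    links = Counter()
    for doc in docs:
        types = sorted(doc)
        for i, tx in enumerate(types):
            # Links between two distinct nodes of the same type.
            for a, b in combinations(sorted(set(doc[tx])), 2):
                links[(tx, a), (tx, b)] += 1
            # Links between nodes of two different types.
            for ty in types[i + 1:]:
                for a, b in product(set(doc[tx]), set(doc[ty])):
                    links[(tx, a), (ty, b)] += 1
    return links

links = build_typed_links(papers)
# The set of link types that actually occur in the toy network.
link_types = {(u[0], v[0]) for (u, v) in links}
```

Because each toy paper lists exactly one venue, no pair of distinct venues ever shares a document, so `("venue", "venue")` does not appear in `link_types`, while types such as `("author", "author")` and `("author", "term")` do.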

Acknowledgments

Research was sponsored in part by the Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), the Army Research Office under Cooperative Agreement No. W911NF-13-1-0193, National Science Foundation IIS-1017362, IIS-1320617, and IIS-1354329, DTRA, and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC. Chi Wang was supported by a Microsoft Research PhD Fellowship. Marina Danilevsky was supported by a National Science Foundation Graduate Research Fellowship Grant NSF DGE 07-15088.

Author information

Corresponding author

Correspondence to Chi Wang.

About this article

Cite this article

Wang, C., Liu, J., Desai, N. et al. Constructing topical hierarchies in heterogeneous information networks. Knowl Inf Syst 44, 529–558 (2015). https://doi.org/10.1007/s10115-014-0777-4

Keywords

  • Topic hierarchy
  • Information network
  • Link mining
  • Text mining
  • Topic modeling