Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

  • Hosein Azarbonyad
  • Mostafa Dehghani
  • Tom Kenter
  • Maarten Marx
  • Jaap Kamps
  • Maarten de Rijke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10193)


A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring the diversity of documents is suboptimal, however, because of two problems: generality and impurity. General topics include only common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents’ topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.


Keywords: Language Model, Topic Model, Latent Dirichlet Allocation, Text Document, General Topic



This research was supported by Ahold Delhaize, Amsterdam Data Science, Blendle, the Bloomberg Research Grant program, the Dutch national program COMMIT, Elsevier, the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreements nr 283465 (ENVRI) and 312827 (VOX-Pol), the Microsoft Research Ph.D. program, the Netherlands eScience Center under project number 027.012.105, the Netherlands Institute for Sound and Vision, the Netherlands Organisation for Scientific Research (NWO) under project nrs 314.99.108, 600.006.014, HOR-11-10, CI-14-25, 652.002.001, 612.001.551, 652.001.003, 314-98-071, and Yandex. All content represents the opinion of the authors, which is not necessarily shared or endorsed by their respective employers and/or sponsors.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Hosein Azarbonyad (1)
  • Mostafa Dehghani (1)
  • Tom Kenter (1)
  • Maarten Marx (1)
  • Jaap Kamps (1)
  • Maarten de Rijke (1)

  1. University of Amsterdam, Amsterdam, The Netherlands
