Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

  • Hosein Azarbonyad
  • Mostafa Dehghani
  • Tom Kenter
  • Maarten Marx
  • Jaap Kamps
  • Maarten de Rijke
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10193)

Abstract

A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring the diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models, and impure topics are likely to be assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents’ topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments.
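The re-estimation the abstract describes builds on parsimonious language models [9]: an EM procedure that discounts words already well explained by a corpus-wide background model, so that general terms are pushed out of a document's (or topic's) word distribution. The sketch below is illustrative only; the function name, mixing weight `lam`, and pruning threshold `eps` are assumptions for exposition, not the paper's exact settings.

```python
def parsimonize(doc_tf, background_p, lam=0.1, iters=50, eps=1e-6):
    """EM re-estimation of a parsimonious language model (cf. Hiemstra et al., 2004).

    doc_tf       -- term -> raw count in the document
    background_p -- term -> probability under the corpus-wide background model
    lam          -- weight of the document-specific model (1 - lam goes to background)
    Returns a term -> probability dict in which general terms are pruned away.
    """
    total = sum(doc_tf.values())
    p_doc = {t: c / total for t, c in doc_tf.items()}  # MLE initialisation
    for _ in range(iters):
        # E-step: expected counts attributable to the document-specific model
        e = {}
        for t, c in doc_tf.items():
            num = lam * p_doc.get(t, 0.0)
            denom = num + (1 - lam) * background_p.get(t, 0.0)
            e[t] = c * (num / denom) if denom > 0 else 0.0
        # M-step: renormalise, dropping terms whose mass fell below eps
        norm = sum(e.values())
        p_doc = {t: v / norm for t, v in e.items() if v / norm > eps}
    return p_doc
```

For example, a stop word that is frequent in a document but equally frequent in the background corpus loses nearly all of its probability mass, while topically specific words are boosted. The same machinery can in principle be applied at the word, topic, and document levels mentioned in the abstract.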

References

  1. U.S. National Library of Medicine: PubMed Central Open Access Initiative (2010)
  2. Azarbonyad, H., Saan, F., Dehghani, M., Marx, M., Kamps, J.: Are topically diverse documents also interesting? In: Mothe, J., Savoy, J., Kamps, J., Pinel-Sauvagnat, K., Jones, G.J.F., SanJuan, E., Cappellato, L., Ferro, N. (eds.) CLEF 2015. LNCS, vol. 9283, pp. 215–221. Springer, Cham (2015). doi:10.1007/978-3-319-24027-5_19
  3. Bache, K., Newman, D., Smyth, P.: Text-based measures of document diversity. In: KDD (2013)
  4. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(4–5), 993–1022 (2003)
  5. Boyd-Graber, J., Mimno, D., Newman, D.: Care and feeding of topic models. In: Mixed Membership Models and Their Applications. CRC Press (2014)
  6. Dehghani, M., Azarbonyad, H., Kamps, J., Marx, M.: Two-way parsimonious classification models for evolving hierarchies. In: Fuhr, N., Quaresma, P., Gonçalves, T., Larsen, B., Balog, K., Macdonald, C., Cappellato, L., Ferro, N. (eds.) CLEF 2016. LNCS, vol. 9822, pp. 69–82. Springer, Heidelberg (2016). doi:10.1007/978-3-319-44564-9_6
  7. Dehghani, M., Azarbonyad, H., Kamps, J., Marx, M.: On horizontal and vertical separation in hierarchical text classification. In: ICTIR (2016)
  8. Derzinski, M., Rohanimanesh, K.: An information theoretic approach to quantifying text interestingness. In: NIPS MLNLP Workshop (2014)
  9. Hiemstra, D., Robertson, S., Zaragoza, H.: Parsimonious language models for information retrieval. In: SIGIR (2004)
  10. Lacoste-Julien, S., Sha, F., Jordan, M.I.: DiscLDA: discriminative learning for dimensionality reduction and classification. In: NIPS (2009)
  11. Lau, J.H., Newman, D., Baldwin, T.: Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: EACL (2014)
  12. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004)
  13. Lin, T., Tian, W., Mei, Q., Cheng, H.: The dual-sparse topic model: mining focused topics and focused terms in short text. In: WWW (2014)
  14. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)
  15. Mehrotra, R., Sanner, S., Buntine, W., Xie, L.: Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In: SIGIR (2013)
  16. Nguyen, D.Q., Billingsley, R., Du, L., Johnson, M.: Improving topic models with latent feature word representations. Trans. Assoc. Comput. Linguist. 3, 299–313 (2015)
  17. Rao, C.: Diversity and dissimilarity coefficients: a unified approach. Theoret. Popul. Biol. 21(1), 24–43 (1982)
  18. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: WSDM (2015)
  19. Soleimani, H., Miller, D.: Parsimonious topic models with salient word discovery. IEEE Trans. Knowl. Data Eng. 27(3), 824–837 (2015)
  20. Solow, A., Polasky, S., Broadus, J.: On the measurement of biological diversity. J. Environ. Econ. Manag. 24(1), 60–68 (1993)
  21. Wallach, H.M., Mimno, D.M., McCallum, A.: Rethinking LDA: why priors matter. In: NIPS (2009)
  22. Wang, C., Blei, D.M.: Decoupling sparsity and smoothness in the discrete hierarchical Dirichlet process. In: NIPS (2009)
  23. Williamson, S., Wang, C., Heller, K.A., Blei, D.M.: The IBP compound Dirichlet process and its application to focused topic modeling. In: ICML (2010)
  24. Xie, P., Xing, E.P.: Integrating document clustering and topic modeling. In: UAI (2013)
  25. Yan, X., Guo, J., Lan, Y., Cheng, X.: A biterm topic model for short texts. In: WWW (2013)
  26. Zhai, C., Lafferty, J.: Model-based feedback in the language modeling approach to information retrieval. In: CIKM (2001)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Hosein Azarbonyad (1)
  • Mostafa Dehghani (1)
  • Tom Kenter (1)
  • Maarten Marx (1)
  • Jaap Kamps (1)
  • Maarten de Rijke (1)

  1. University of Amsterdam, Amsterdam, The Netherlands