Prioritized Named Entity Driven LDA for Document Clustering

  • Durgesh Kumar
  • Sanasam Ranbir Singh
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11942)


Topic modeling methods such as LSI, pLSI, and LDA have been widely studied in the text mining domain for applications such as document representation, document clustering and classification, and information retrieval. However, such unsupervised methods are effective over corpora with well-separable topics. In real-world applications, topics may overlap heavily. For example, in a news corpus covering different terror attacks, reports of different events share many of the same keywords. In this paper, we propose a variant of LDA, named Prioritized Named Entity driven LDA (PNE-LDA), which addresses the issue of overlapping topics by prioritizing named entities related to the topics. Across various experimental setups, the proposed method is observed to outperform its counterparts on entity-driven overlapping topics.
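To make the idea of prioritizing named entities concrete, here is a minimal sketch that up-weights named-entity tokens in a bag-of-words representation before any topic model sees it. This is only an illustration of the general intuition, not the PNE-LDA model itself, whose formulation is not given in the abstract. The function name `weighted_bag_of_words`, the `entity_weight` parameter, the toy documents, and the entity set are all hypothetical.

```python
from collections import Counter


def weighted_bag_of_words(tokens, named_entities, entity_weight=3):
    """Count tokens, scaling the count of each named-entity token.

    Illustrative assumption: entities are given priority simply by
    multiplying their term counts by entity_weight.
    """
    raw_counts = Counter(tokens)
    return Counter({
        term: count * (entity_weight if term in named_entities else 1)
        for term, count in raw_counts.items()
    })


# Hypothetical toy documents: two events with heavily overlapping keywords.
docs = [
    ["attack", "blast", "police", "mumbai", "attack"],
    ["attack", "blast", "police", "paris"],
]
# Assumed output of some named entity recognition step.
entities = {"mumbai", "paris"}

weighted = [weighted_bag_of_words(doc, entities) for doc in docs]
```

In this toy setup the shared keywords keep their raw counts while the distinguishing entities carry more weight, so the two documents look less alike to any downstream topic model that consumes the weighted counts.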


Keywords: Topic modeling, LDA, Entity-driven topics, PNE-LDA



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, India
