Topic discovery and evolution in scientific literature based on content and citations

  • Hou-kui Zhou
  • Hui-min Yu
  • Roland Hu


Researchers are increasingly interested in how important research topics evolve over time within the corpus of scientific literature. In a dataset of scientific articles, each document can be regarded as comprising both its own words and its citations of other documents. In this paper, we propose a citation-content-latent Dirichlet allocation (LDA) topic discovery method that accounts for both the citation relations among documents and the content of each document via a probabilistic generative model. The citation-content-LDA topic model is a two-level topic model that uses citation information for ‘father’ topics and text information for sub-topics. The model parameters are estimated by a collapsed Gibbs sampling algorithm. We also propose a topic evolution algorithm that runs in two steps: topic segmentation and topic dependency relation calculation. We have tested the proposed citation-content-LDA model and topic evolution algorithm on two online datasets, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and IEEE Computer Society (CS), to demonstrate that our algorithm effectively discovers important topics and reflects the evolution of important research themes. According to our evaluation metrics, citation-content-LDA outperforms both content-LDA and citation-LDA.
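As an illustration of the estimation step named above, the following is a minimal sketch of collapsed Gibbs sampling for plain, single-level LDA over word tokens. It is not the citation-content-LDA sampler: the two-level conditional distributions over citations and words are not given in this abstract, so the sketch covers only the standard content-only case. The function name `lda_collapsed_gibbs`, its parameters, and the default hyperparameters `alpha` and `beta` are assumptions chosen for illustration.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative sketch only).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic, topic_word, assignments), where doc_topic[d][k] counts
    tokens of document d assigned to topic k, topic_word[k][w] counts tokens of
    word w assigned to topic k, and assignments[d][i] is the topic of token i.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a uniformly random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k_old = assignments[d][i]
                doc_topic[d][k_old] -= 1
                topic_word[k_old][w] -= 1
                topic_total[k_old] -= 1

                # Unnormalised full conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                k_new = rng.choices(range(num_topics), weights=weights)[0]

                # Add the token back under its newly sampled topic.
                assignments[d][i] = k_new
                doc_topic[d][k_new] += 1
                topic_word[k_new][w] += 1
                topic_total[k_new] += 1

    return doc_topic, topic_word, assignments


if __name__ == "__main__":
    # Hypothetical toy corpus: three documents over a five-word vocabulary.
    toy_docs = [[0, 1, 2, 1], [3, 4, 3], [0, 4]]
    counts, _, _ = lda_collapsed_gibbs(toy_docs, num_topics=2, vocab_size=5)
    print(counts)
```

The topic and word distributions are integrated out, so the sampler only tracks count tables and resamples one token's topic at a time. A two-level model such as the one proposed here would presumably need additional count tables linking citations to 'father' topics, which this sketch does not attempt.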


Topic extraction; Topic evolution; Evaluation method

CLC number




Supplementary material

11714_2017_1154_MOESM1_ESM.pdf (283 kb)
Supplementary material, approximately 284 KB.



Copyright information

© Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2017

Authors and Affiliations

  1. College of Information Science & Electronic Engineering, Zhejiang University, Hangzhou, China
  2. State Key Lab of CAD & CG, Zhejiang University, Hangzhou, China
  3. School of Information Engineering, Zhejiang A&F University, Lin’an, China
  4. Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Lin’an, China
