Semantic Similarity-Enhanced Topic Models for Document Analysis

  • Yan Gao
  • Dunwei Wen
Part of the Lecture Notes in Educational Technology book series (LNET)


In e-learning environments, ever larger-scale text resources are generated by teaching–learning interactions. Finding the latent topics in these resources can help us understand the teaching content as well as learners' interests and focuses. Latent Dirichlet allocation (LDA) has been widely used in many areas to extract latent topics from a text corpus. However, the extracted topics are often difficult for end users to interpret. Adding auxiliary information to LDA to guide topic extraction is an effective way to improve the interpretability of topic modeling. Word co-occurrence information in the corpus is one such source, but by itself it is not sufficient to measure the similarity between word pairs, especially in a sparse document space. To deal with this problem, we propose a new semantic similarity-enhanced topic model. In this model, we use not only co-occurrence information but also WordNet-based semantic similarity as auxiliary information. The two kinds of information are combined in the topic–word component through a generative Pólya urn model. The distribution of documents over the extracted topics obtained by the new model can then be fed into a classifier, so more accurate topic extraction improves classification performance. Our experiments on a newsgroup corpus show that the semantic similarity-enhanced topic model outperforms topic models that use either kind of information alone.
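The core mechanism described above can be illustrated with a minimal sketch of a generalized Pólya urn count update: when a word is assigned to a topic during Gibbs sampling, semantically similar words also receive a fractional count for that topic. The similarity values, vocabulary, threshold, and function names below are illustrative assumptions, not taken from the paper (where similarity combines co-occurrence statistics with WordNet measures).

```python
from collections import defaultdict

# Toy word-pair similarity table (assumption: the paper derives these
# values from corpus co-occurrence and WordNet-based similarity).
SIM = {
    ("learning", "teaching"): 0.8,
    ("teaching", "learning"): 0.8,
}

def urn_increment(counts, topic, word, vocab, threshold=0.5):
    """Polya urn update: assigning `word` to `topic` also promotes
    words that are sufficiently similar to it (fractional counts)."""
    counts[(topic, word)] += 1.0
    for other in vocab:
        sim = SIM.get((word, other), 0.0)
        if other != word and sim >= threshold:
            counts[(topic, other)] += sim  # fractional promotion

counts = defaultdict(float)
vocab = ["learning", "teaching", "urn"]
urn_increment(counts, 0, "learning", vocab)
# "teaching" gains a fractional count for topic 0; "urn" is unaffected
```

In a full sampler these promoted counts would feed back into the topic–word distribution, pulling semantically related words into the same topic and thereby improving topic coherence.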


Keywords: Topic modeling · LDA · Gibbs sampling · Generative Pólya urn model · Semantic similarity · WordNet



The authors acknowledge the support of a Research Incentive Grant (RIG) of Athabasca University. This work was also partly supported by the National Natural Science Foundation of China under Grants 61273314 and 61175064.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. School of Information Science and Engineering, Central South University, Changsha, China
  2. School of Computing and Information Systems, Athabasca University, Athabasca, Canada
  3. Hunan Engineering Laboratory for Advanced Control and Intelligent Automation, Changsha, China
