Semantic Similarity-Enhanced Topic Models for Document Analysis

Chapter in: Smart Learning Environments

Part of the book series: Lecture Notes in Educational Technology (LNET)

Abstract

In e-learning environments, teaching–learning interactions generate ever larger collections of text. Finding the latent topics in these resources helps us understand the teaching content and the learners' interests and focuses. Latent Dirichlet allocation (LDA) has been widely used in many areas to extract latent topics from a text corpus. However, the extracted topics are often difficult for end users to understand. Adding auxiliary information to LDA to guide topic extraction is an effective way to improve the interpretability of topic modeling. Co-occurrence information in the corpus is one such source, but by itself it is not sufficient to measure the similarity between word pairs, especially in a sparse document space. To address this problem, we propose a new semantic similarity-enhanced topic model. In this model, we use not only co-occurrence information but also WordNet-based semantic similarity as auxiliary information. The two kinds of information are combined into the topic-word component through a generalized Pólya urn model. The distributions of documents over the extracted topics produced by the new model can then be fed to a classifier, so more accurate topic extraction can improve classification performance. Our experiments on a newsgroup corpus show that the semantic similarity-enhanced topic model outperforms topic models that use either kind of information alone.
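To make the approach concrete, the sketch below shows one way the two kinds of auxiliary information described above (document co-occurrence and WordNet-based semantic similarity) could be combined into a single word-pair weight, the kind of quantity a generalized Pólya urn topic-word component uses to promote semantically related words. This is a minimal illustration, not the chapter's implementation: the function names, the positive-PMI co-occurrence measure, the Wu–Palmer WordNet measure, and the mixing weight `lam` are all assumptions made here for clarity.

```python
# Sketch: combine corpus co-occurrence and WordNet similarity into one
# word-pair weight. Illustrative only; not the chapter's actual code.
from collections import Counter
from itertools import combinations
import math

from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')


def cooccurrence_counts(docs):
    """Count document-level occurrences of words and unordered word pairs."""
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        words = set(doc)
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    return word_counts, pair_counts


def pmi(w1, w2, word_counts, pair_counts, n_docs):
    """Positive pointwise mutual information over documents (0 if the pair never co-occurs)."""
    c12 = pair_counts[frozenset((w1, w2))]
    if c12 == 0:
        return 0.0
    p12 = c12 / n_docs
    p1 = word_counts[w1] / n_docs
    p2 = word_counts[w2] / n_docs
    return max(math.log(p12 / (p1 * p2)), 0.0)


def wordnet_similarity(w1, w2):
    """Best Wu-Palmer similarity over all synset pairs of the two words (0 if none)."""
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.wup_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best


def combined_similarity(w1, w2, word_counts, pair_counts, n_docs, lam=0.5):
    """Convex combination of co-occurrence evidence and WordNet similarity.

    `lam` is an illustrative mixing weight, not a value from the chapter.
    """
    return (lam * pmi(w1, w2, word_counts, pair_counts, n_docs)
            + (1.0 - lam) * wordnet_similarity(w1, w2))


if __name__ == "__main__":
    docs = [["student", "teacher", "course"],
            ["student", "course", "forum"],
            ["topic", "model", "corpus"]]
    wc, pc = cooccurrence_counts(docs)
    print(combined_similarity("student", "teacher", wc, pc, len(docs)))
```

In a generalized Pólya urn Gibbs sampler, assigning a word to a topic would also add fractional counts for words with a high combined weight under that topic, which is how the auxiliary information guides the topic-word distributions toward more interpretable topics.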



Acknowledgments

The authors acknowledge the support of the Research Incentive Grant (RIG) of Athabasca University. This work was also partly supported by the National Natural Science Foundation of China under Grants 61273314 and 61175064.

Author information

Corresponding author

Correspondence to Dunwei Wen.


Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Gao, Y., Wen, D. (2015). Semantic Similarity-Enhanced Topic Models for Document Analysis. In: Chang, M., Li, Y. (eds) Smart Learning Environments. Lecture Notes in Educational Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44447-4_3

  • DOI: https://doi.org/10.1007/978-3-662-44447-4_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-44446-7

  • Online ISBN: 978-3-662-44447-4

  • eBook Packages: Humanities, Social Sciences and Law; Education (R0)
