Abstract
In e-learning environments, ever larger text resources are generated by teaching–learning interactions. Finding the latent topics in these resources can help us understand the teaching content and the learners’ interests and focuses. Latent Dirichlet allocation (LDA) has been widely used in many areas to extract latent topics from a text corpus. However, the extracted topics are often hard for end users to understand. Adding auxiliary information to LDA to guide the topic extraction process is an effective way to improve the interpretability of topic modeling. Co-occurrence information in the corpus is one such source, but it is not sufficient by itself to measure the similarity between word pairs, especially in a sparse document space. To address this problem, we propose a new semantic similarity-enhanced topic model in this paper. In this model, we use not only co-occurrence information but also WordNet-based semantic similarity as auxiliary information. The two kinds of information are combined into the topic–word component through a generalized Pólya urn model. The distribution of documents over the extracted topics obtained by the new model can be fed to a classifier, so more accurate topic extraction improves classification performance. Our experiments on a newsgroup corpus show that the semantic similarity-enhanced topic model performs better than topic models that use either kind of information alone.
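The generalized Pólya urn mechanism mentioned above can be sketched as follows. This is a toy illustration only, not the chapter’s implementation: the function name, the three-word vocabulary, and the promotion (similarity) matrix are all hypothetical. The idea is that drawing a word for a topic puts back extra copies not only of that word but also of semantically similar words, so related words come to share topic mass.

```python
import random

def gpu_draw(counts, promotion, rng):
    """Draw one colour (word) from a generalized Polya urn, then put back
    extra balls of every colour according to the promotion matrix."""
    total = sum(counts)
    r = rng.random() * total
    acc = 0.0
    drawn = len(counts) - 1  # fallback for floating-point edge cases
    for i, c in enumerate(counts):
        acc += c
        if r < acc:
            drawn = i
            break
    # Drawing colour i returns promotion[i][j] extra balls of colour j,
    # so a sampled word also boosts the words judged similar to it.
    for j in range(len(counts)):
        counts[j] += promotion[drawn][j]
    return drawn

rng = random.Random(0)
counts = [1.0, 1.0, 1.0]          # toy topic-word urn over three words
promotion = [[1.0, 0.5, 0.0],     # word 0 also promotes the similar word 1
             [0.5, 1.0, 0.0],     # word 1 likewise promotes word 0
             [0.0, 0.0, 1.0]]     # word 2 is unrelated to the others
for _ in range(100):
    gpu_draw(counts, promotion, rng)
```

With an identity promotion matrix this reduces to the classic Pólya urn (pure rich-get-richer reinforcement); off-diagonal entries, which in the chapter’s setting would come from co-occurrence statistics and WordNet similarity, let similar words reinforce each other.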
Acknowledgments
The authors acknowledge the support of the Research Incentive Grant (RIG) of Athabasca University. This work was also partly supported by the National Natural Science Foundation of China under Grants 61273314 and 61175064.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gao, Y., Wen, D. (2015). Semantic Similarity-Enhanced Topic Models for Document Analysis. In: Chang, M., Li, Y. (eds) Smart Learning Environments. Lecture Notes in Educational Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44447-4_3
DOI: https://doi.org/10.1007/978-3-662-44447-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44446-7
Online ISBN: 978-3-662-44447-4
eBook Packages: Humanities, Social Sciences and Law; Education (R0)