Abstract
In e-learning environments, ever larger text resources are generated by teaching–learning interactions. Finding the latent topics in these resources can help us understand the teaching content and the learners’ interests and focuses. Latent Dirichlet allocation (LDA) has been widely used in many areas to extract latent topics from a text corpus. However, the extracted topics are often hard for end users to understand. Adding auxiliary information to LDA to guide the topic extraction process is an effective way to improve the interpretability of topic modeling. Co-occurrence information in the corpus is one such source, but it is not sufficient by itself to measure the similarity between word pairs, especially in a sparse document space. To address this problem, we propose a new semantic similarity-enhanced topic model in this paper. In this model, we use not only co-occurrence information but also WordNet-based semantic similarity as auxiliary information. The two kinds of information are combined into the topic–word component through a generalized Pólya urn model. The distribution of documents over the extracted topics obtained by the new model can be fed to a classifier, so more accurate topic extraction improves classification performance. Our experiments on a newsgroup corpus show that the semantic similarity-enhanced topic model performs better than topic models that use either kind of information alone.
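The generalized Pólya urn mechanism mentioned above can be sketched as follows. This is a toy illustration only, not the chapter’s implementation: the function name, the three-word vocabulary, and the promotion (similarity) matrix are all hypothetical. The idea is that drawing a word for a topic puts back extra copies not only of that word but also of semantically similar words, so related words come to share topic mass.

```python
import random

def gpu_draw(counts, promotion, rng):
    """Draw one colour (word) from a generalized Polya urn, then put back
    extra balls of every colour according to the promotion matrix."""
    total = sum(counts)
    r = rng.random() * total
    acc = 0.0
    drawn = len(counts) - 1  # fallback for floating-point edge cases
    for i, c in enumerate(counts):
        acc += c
        if r < acc:
            drawn = i
            break
    # Drawing colour i returns promotion[i][j] extra balls of colour j,
    # so a sampled word also boosts the words judged similar to it.
    for j in range(len(counts)):
        counts[j] += promotion[drawn][j]
    return drawn

rng = random.Random(0)
counts = [1.0, 1.0, 1.0]          # toy topic-word urn over three words
promotion = [[1.0, 0.5, 0.0],     # word 0 also promotes the similar word 1
             [0.5, 1.0, 0.0],     # word 1 likewise promotes word 0
             [0.0, 0.0, 1.0]]     # word 2 is unrelated to the others
for _ in range(100):
    gpu_draw(counts, promotion, rng)
```

With an identity promotion matrix this reduces to the classic Pólya urn (pure rich-get-richer reinforcement); off-diagonal entries, which in the chapter’s setting would come from co-occurrence statistics and WordNet similarity, let similar words reinforce each other.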
Acknowledgments
The authors acknowledge the support of the Research Incentive Grant (RIG) of Athabasca University. This work was also partly supported by the National Natural Science Foundation of China under Grants 61273314 and 61175064.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Gao, Y., Wen, D. (2015). Semantic Similarity-Enhanced Topic Models for Document Analysis. In: Chang, M., Li, Y. (eds) Smart Learning Environments. Lecture Notes in Educational Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44447-4_3
DOI: https://doi.org/10.1007/978-3-662-44447-4_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44446-7
Online ISBN: 978-3-662-44447-4
eBook Packages: Humanities, Social Sciences and Law; Education (R0)