Experiments with Semantic Similarity Measures Based on LDA and LSA

  • Nobal Niraula
  • Rajendra Banjade
  • Dan Ştefănescu
  • Vasile Rus
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7978)

Abstract

We present experiments with several semantic similarity measures based on the unsupervised method Latent Dirichlet Allocation (LDA). For comparison, we also report results using an algebraic method, Latent Semantic Analysis (LSA). The proposed measures were evaluated on two datasets: one containing student answers from conversational intelligent tutoring systems, and a standard paraphrase dataset, the Microsoft Research Paraphrase corpus. Results indicate that the method based on word representations as topic vectors outperforms methods based on distributions over topics and words. The proposed evaluation can also be regarded as an extrinsic, task-based method for evaluating topic coherence or selecting the number of topics in LDA models.
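
To make the two families of measures named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a word is represented by its probabilities under each topic and compared with cosine similarity, and that two texts are compared through their topic distributions using a Hellinger-based similarity. The toy numbers, the three-topic model, and the names `word_topic`, `theta_a`, and `theta_b` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def hellinger_similarity(p, q):
    """One minus the Hellinger distance between two discrete distributions."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return 1.0 - math.sqrt(squared) / math.sqrt(2.0)


# Hypothetical toy LDA output with 3 topics (assumption, not from the paper):
# each word maps to its probability under topic 0, 1, and 2.
word_topic = {
    "force": [0.30, 0.01, 0.02],
    "acceleration": [0.25, 0.02, 0.01],
    "bank": [0.01, 0.40, 0.03],
}

# Hypothetical topic distributions inferred for two short texts.
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]

# Word-level measure: compare words through their topic vectors.
related = cosine(word_topic["force"], word_topic["acceleration"])
unrelated = cosine(word_topic["force"], word_topic["bank"])

# Text-level measure: compare texts through their distributions over topics.
text_sim = hellinger_similarity(theta_a, theta_b)

print(related > unrelated, round(text_sim, 3))
```

In a task-based setting like the one the abstract describes, scores of this kind would be computed for models trained with different numbers of topics, and the setting that performs best on the downstream similarity task would be preferred.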

Keywords

semantic similarity, statistical methods, Latent Dirichlet Allocation

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Nobal Niraula (1)
  • Rajendra Banjade (1)
  • Dan Ştefănescu (1)
  • Vasile Rus (1)

  1. Department of Computer Science, The University of Memphis, USA
