Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods

Banjade, Rajendra; Maharjan, Nabin; Niraula, Nobal B.; Rus, Vasile; Gautam, Dipesh

doi:10.1007/978-3-319-18111-0_25


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9041)

Abstract

A substantial amount of work has been done on measuring word-to-word relatedness, which is also commonly referred to as similarity. Though relatedness and similarity are closely related, they are not the same, as illustrated by the words lemon and tea, which are related but not similar. Relatedness takes into account a broader range of relations, while similarity considers only subsumption relations to assess how alike two objects are. In this paper we present a method for measuring the semantic similarity of words as a combination of various techniques, including knowledge-based and corpus-based methods that capture different aspects of similarity. Our corpus-based method exploits state-of-the-art word representations. We performed experiments on Simlex-999, a recently published and considerably large dataset, and achieved a significantly better correlation with human judgment (ρ = 0.642, P < 0.001) than the individual methods did on their own.
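The evaluation idea in the abstract (score word pairs with several methods, combine the scores, and correlate the result with human ratings) can be sketched as follows. This is a minimal illustration, not the authors' method. The abstract does not say how the methods are combined or which correlation coefficient ρ denotes, so the sketch assumes a plain average of min-max scaled scores and Spearman's rank correlation. The function names and the four toy scores are hypothetical and are not taken from Simlex-999.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def combine(score_lists):
    """Average several per-pair score lists after scaling each to [0, 1].

    Assumption: simple averaging stands in for whatever combination
    the paper actually uses.
    """
    scaled = []
    for scores in score_lists:
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for constant lists
        scaled.append([(s - lo) / span for s in scores])
    return [mean(column) for column in zip(*scaled)]


# Hypothetical scores for four word pairs (illustrative values only).
human = [1.0, 9.0, 5.0, 3.0]            # made-up human similarity ratings
knowledge_based = [0.1, 0.9, 0.4, 0.5]  # e.g. a taxonomy-based measure
corpus_based = [0.3, 0.8, 0.7, 0.2]     # e.g. cosine of word vectors

combined = combine([knowledge_based, corpus_based])
print("knowledge-based:", spearman(knowledge_based, human))
print("corpus-based:   ", spearman(corpus_based, human))
print("combined:       ", spearman(combined, human))
```

On real data, each list would hold one score per word pair in the evaluation set, and a learned combination (for example a regression over the individual scores) could replace the plain average.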



Author information

Authors and Affiliations

Department of Computer Science, The University of Memphis, Memphis, TN, 38152, USA
Rajendra Banjade, Nabin Maharjan, Nobal B. Niraula, Vasile Rus & Dipesh Gautam


Corresponding author

Correspondence to Rajendra Banjade.

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Mexico DF, Mexico
Alexander Gelbukh


About this paper

Cite this paper

Banjade, R., Maharjan, N., Niraula, N.B., Rus, V., Gautam, D. (2015). Lemon and Tea Are Not Similar: Measuring Word-to-Word Similarity by Combining Different Methods. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2015. Lecture Notes in Computer Science, vol 9041. Springer, Cham. https://doi.org/10.1007/978-3-319-18111-0_25


DOI: https://doi.org/10.1007/978-3-319-18111-0_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18110-3
Online ISBN: 978-3-319-18111-0
eBook Packages: Computer Science, Computer Science (R0)
