Language Resources and Evaluation, Volume 50, Issue 1, pp 125–161

Robust semantic text similarity using LSA, machine learning, and linguistic resources

  • Abhay Kashyap
  • Lushan Han
  • Roberto Yus
  • Jennifer Sleeman
  • Taneeya Satyapanich
  • Sunil Gandhi
  • Tim Finin
Original Paper


Semantic textual similarity is a measure of the degree of semantic equivalence between two pieces of text. We describe the SemSim system and its performance in the *SEM 2013 and SemEval-2014 tasks on semantic textual similarity. At the core of our system lies a robust distributional word similarity component that combines latent semantic analysis and machine learning, augmented with data from several linguistic resources. We used a simple term alignment algorithm to handle longer pieces of text. Additional wrappers and resources were used to handle task-specific challenges, including processing Spanish text, comparing text sequences of different lengths, handling informal words and phrases, and matching words with sense definitions. In the *SEM 2013 task on Semantic Textual Similarity, our best-performing system ranked first among the 89 submitted runs. In the SemEval-2014 task on Multilingual Semantic Textual Similarity, we ranked a close second in both the English and Spanish subtasks. In the SemEval-2014 task on Cross-Level Semantic Similarity, we ranked first in the Sentence–Phrase, Phrase–Word, and Word–Sense subtasks and second in the Paragraph–Sentence subtask.
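As a rough illustration of how a term alignment step can lift a word-level similarity measure to longer texts, the sketch below aligns each term with its most similar counterpart in the other text and averages the scores in both directions. This is not the SemSim implementation: the function names, the toy similarity table, and the symmetric averaging are all assumptions made for illustration, and a real system would plug in an LSA-based word similarity model in place of `toy_word_sim`.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical similarity values, for illustration only; a real system
# would obtain these from a distributional model such as LSA.
TOY_SIM: Dict[FrozenSet[str], float] = {
    frozenset({"car", "automobile"}): 0.9,
    frozenset({"drives", "operates"}): 0.6,
}


def toy_word_sim(a: str, b: str) -> float:
    """Return 1.0 for identical words, a table value if known, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset({a, b}), 0.0)


def directed_score(src: List[str], dst: List[str],
                   sim: Callable[[str, str], float]) -> float:
    """Align each source term to its best match in dst and average."""
    if not src or not dst:
        return 0.0
    return sum(max(sim(s, d) for d in dst) for s in src) / len(src)


def text_similarity(text_a: str, text_b: str,
                    sim: Callable[[str, str], float] = toy_word_sim) -> float:
    """Symmetric similarity: mean of the two directed alignment scores."""
    a = text_a.lower().split()
    b = text_b.lower().split()
    return 0.5 * (directed_score(a, b, sim) + directed_score(b, a, sim))


if __name__ == "__main__":
    print(text_similarity("The car", "The automobile"))
```

Averaging the two directions keeps the score symmetric, so a short text that is fully covered by a much longer one does not receive a perfect score.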


  • Latent semantic analysis
  • WordNet
  • Term alignment
  • Semantic similarity



This research was supported by awards 1228198, 1250627 and 0910838 from the US National Science Foundation. We would like to thank the anonymous reviewers for their valuable comments on an earlier version of this paper.



Copyright information

© Springer Science+Business Media Dordrecht 2015

Authors and Affiliations

  • Abhay Kashyap
    • 1
  • Lushan Han
    • 1
  • Roberto Yus
    • 2
  • Jennifer Sleeman
    • 1
  • Taneeya Satyapanich
    • 1
  • Sunil Gandhi
    • 1
  • Tim Finin
    • 1
  1. University of Maryland, Baltimore County, USA
  2. University of Zaragoza, Zaragoza, Spain
