A Sentence Similarity Method Based on Chunking and Information Content

  • Dan Ştefănescu
  • Rajendra Banjade
  • Vasile Rus
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8403)


This paper introduces a method for assessing the semantic similarity between sentences. It relies on the assumption that the meaning of a sentence is captured by its syntactic constituents and the dependencies between them. We obtain both the constituents and their dependencies from a syntactic parser. Our algorithm considers two sentences to have the same meaning if it can find a good mapping between their chunks and if the chunk dependencies in one text are preserved in the other. Moreover, the algorithm takes into account that each chunk has a different importance with respect to the overall meaning of a sentence; this importance is computed from the information content of the words in the chunk. Experiments conducted on a well-known paraphrase data set show that the performance of our method is comparable to the state of the art.
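
The chunk-mapping and information-content ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The word counts, the overlap-based chunk similarity, the greedy mapping, and all function names are assumptions made for illustration. Chunks are supplied by hand rather than by a parser, and the dependency-preservation check is left out.

```python
import math

# Hypothetical unigram counts; a real system would estimate them from a large corpus.
WORD_COUNTS = {"the": 1000, "a": 800, "on": 500, "dog": 25, "cat": 20,
               "sat": 10, "slept": 8, "mat": 5, "rug": 4}
TOTAL = sum(WORD_COUNTS.values())


def information_content(word):
    """IC(w) = -log p(w); unseen words are given a count of 1 (assumption)."""
    return -math.log(WORD_COUNTS.get(word, 1) / TOTAL)


def chunk_weight(chunk):
    """Importance of a chunk: the summed information content of its words."""
    return sum(information_content(w) for w in chunk)


def chunk_similarity(a, b):
    """IC-weighted word overlap between two chunks (illustrative measure only)."""
    shared, union = set(a) & set(b), set(a) | set(b)
    denom = sum(information_content(w) for w in union)
    return sum(information_content(w) for w in shared) / denom if denom else 0.0


def sentence_similarity(chunks_a, chunks_b):
    """Greedy one-to-one chunk mapping, weighted by chunk importance."""
    pairs = sorted(
        ((chunk_similarity(a, b), i, j)
         for i, a in enumerate(chunks_a)
         for j, b in enumerate(chunks_b)),
        reverse=True,
    )
    used_a, used_b, matched = set(), set(), 0.0
    for sim, i, j in pairs:
        if sim == 0 or i in used_a or j in used_b:
            continue  # skip unrelated chunks and chunks that are already mapped
        used_a.add(i)
        used_b.add(j)
        matched += sim * (chunk_weight(chunks_a[i]) + chunk_weight(chunks_b[j]))
    total = sum(map(chunk_weight, chunks_a)) + sum(map(chunk_weight, chunks_b))
    return matched / total if total else 0.0


s1 = [("the", "cat"), ("sat",), ("on", "the", "mat")]
s2 = [("a", "cat"), ("slept",), ("on", "the", "rug")]
print(sentence_similarity(s1, s2))
```

In this sketch, rare words such as "mat" carry more information content than frequent ones such as "the", so chunks built from rare words contribute more to the final score. An optimal assignment could replace the greedy loop.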


Similarity, Semantic Similarity, Sentence Similarity, Paraphrase Identification, Short Text Similarity, Information Content





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Dan Ştefănescu (1)
  • Rajendra Banjade (1)
  • Vasile Rus (1)
  1. Department of Computer Science, The University of Memphis, Memphis, USA
