Journal of Intelligent Information Systems, Volume 48, Issue 3, pp 675–689

Sentence similarity based on semantic kernels for intelligent text retrieval

  • Samir Amir
  • Adrian Tanasescu
  • Djamel A. Zighed


We propose a new approach to computing semantic similarity between sentences. It is based on the semantic kernel, composed of the subject, verb, and object, which we assume summarizes the general meaning of each sentence. Using available linguistic resources such as the Stanford Parser, many features are extracted from the semantic kernels and aggregated by means of weights. The weights are learned by a supervised machine learning technique from a training data set annotated by human experts as ground truth. Cross-validation shows good performance. With this sentence similarity measure, one can build an intelligent text retrieval engine that is more sensitive to semantic content and better suited to short texts than classical bag-of-words methods. An application is being developed for highlighting parts of speech in scientific articles.
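The idea of comparing sentences slot by slot and aggregating the results with weights can be illustrated with a minimal sketch. It is not the authors' implementation: the `Kernel` class, the `word_sim` stand-in (plain character overlap rather than a lexical-semantic measure), the hand-written kernels, and the weight values are all assumptions made for illustration. In the approach described above, kernels would come from a parser and weights would be learned from human-annotated sentence pairs.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Kernel:
    """Hypothetical container for a sentence's (subject, verb, object) kernel."""
    subject: str
    verb: str
    obj: str


def word_sim(a: str, b: str) -> float:
    """Stand-in word similarity in [0, 1] based on character overlap.

    A real system would use a lexical-semantic measure instead.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def kernel_features(k1: Kernel, k2: Kernel) -> list:
    """One similarity feature per kernel slot: subject, verb, object."""
    return [
        word_sim(k1.subject, k2.subject),
        word_sim(k1.verb, k2.verb),
        word_sim(k1.obj, k2.obj),
    ]


def kernel_similarity(k1: Kernel, k2: Kernel, weights: list) -> float:
    """Weighted average of the slot features (weights assumed positive)."""
    feats = kernel_features(k1, k2)
    return sum(w * f for w, f in zip(weights, feats)) / sum(weights)


# Illustrative weights only; they are not values from the paper.
WEIGHTS = [0.3, 0.4, 0.3]

# Hand-written kernels standing in for parser output.
a = Kernel("cat", "chases", "mouse")
b = Kernel("cat", "chased", "mouse")
c = Kernel("engineer", "designs", "bridge")

print(kernel_similarity(a, b, WEIGHTS))  # near-identical kernels score high
print(kernel_similarity(a, c, WEIGHTS))  # unrelated kernels score lower
```

Swapping `word_sim` for a thesaurus-based or corpus-based word measure would change the feature values but not the aggregation step.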


Sentence similarity · Text retrieval · Semantic kernels



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Samir Amir (1)
  • Adrian Tanasescu (1)
  • Djamel A. Zighed (1)

  1. Institut des Sciences de l’Homme, Lyon, France
