Abstract
We propose a new approach to computing semantic similarity between sentences. It is based on the semantic kernel, composed of the subject, verb, and object, which we assume summarizes the general meaning of each sentence. Using available linguistic resources such as the Stanford Parser, many features are extracted from the semantic kernels and aggregated by means of weights. The weights are learned by a supervised machine learning technique from a training data set provided by human experts as ground truth. Cross-validation shows good performance. With this sentence similarity measure, one can build an intelligent text retrieval engine that is more sensitive to semantic content and better suited to short texts than classical methods based on bag of words. An application is being developed for highlighting parts of speech in scientific articles.
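As a rough illustration of the aggregation step described in the abstract, the sketch below compares two semantic kernels component by component and combines the results with a weighted sum. It is a minimal sketch under stated assumptions, not the method of the article: the `SemanticKernel` container, the Jaccard token overlap used as a stand-in feature, and the weight values are all hypothetical. In the approach described above, the kernels would come from a dependency parser and the weights would be learned from expert-annotated data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticKernel:
    """Hypothetical container for the subject, verb, and object of a sentence."""
    subject: str
    verb: str
    obj: str


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for a real lexical feature."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty components are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def kernel_similarity(k1: SemanticKernel, k2: SemanticKernel,
                      weights=(0.3, 0.4, 0.3)) -> float:
    """Weighted sum of per-component similarities.

    The weights are illustrative placeholders; a supervised model
    would estimate them from annotated sentence pairs.
    """
    features = (
        jaccard(k1.subject, k2.subject),
        jaccard(k1.verb, k2.verb),
        jaccard(k1.obj, k2.obj),
    )
    return sum(w * f for w, f in zip(weights, features))


if __name__ == "__main__":
    a = SemanticKernel("the parser", "extracts", "typed dependencies")
    b = SemanticKernel("the tool", "extracts", "dependencies")
    print(kernel_similarity(a, b))
```

Keeping the per-component features separate from the weights mirrors the description above: new features can be added to the tuple without changing how the final score is aggregated.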
Cite this article
Amir, S., Tanasescu, A. & Zighed, D.A. Sentence similarity based on semantic kernels for intelligent text retrieval. J Intell Inf Syst 48, 675–689 (2017). https://doi.org/10.1007/s10844-016-0434-3
DOI: https://doi.org/10.1007/s10844-016-0434-3