Abstract
Short text similarity (STS) measurement methods play an important role in many natural language processing applications. This paper reviews the research literature on STS measurement methods with the aim to (i) classify and give a broad overview of existing techniques; (ii) identify their strengths and weaknesses in terms of domain independence, language independence, requirements for semantic knowledge, corpora and training data, and ability to capture semantic meaning, word order similarity and polysemy; and (iii) identify the semantic knowledge and corpus resources that can be used by STS measurement methods. Our study also considers related issues, such as the differences between the various text similarity methods and the differences between semantic knowledge sources and corpora for text similarity. Although there are a few review papers in this area, they mostly focus on only one or two existing techniques, and they do not cover recent research. To the best of our knowledge, this is a comprehensive systematic literature review on this topic. The findings are as follows. The review identifies four semantic knowledge resources and eight corpus resources as external resources, which can be classified as general-purpose or domain-specific. The existing techniques can be classified into string-based, corpus-based, knowledge-based and hybrid-based methods. Expert researchers can use this review as a benchmark and as a reference on the limitations of current techniques. The paper also identifies open issues that offer feasible directions for future research.
Ethics declarations
Conflict of interest
On behalf of all co-authors, I declare that all authors agreed to submit the article exclusively to this journal and that there is no conflict of interest regarding the publication of this article.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by V. Loia.
Cite this article
Prakoso, D.W., Abdi, A. & Amrit, C. Short text similarity measurement methods: a review. Soft Comput 25, 4699–4723 (2021). https://doi.org/10.1007/s00500-020-05479-2
DOI: https://doi.org/10.1007/s00500-020-05479-2