Short text similarity measurement methods: a review

  • Methodologies and Application
Soft Computing

Abstract

Short text similarity measurement methods play an important role in many applications within natural language processing. This paper reviews the research literature on short text similarity (STS) measurement methods with the aim to (i) classify and give a broad overview of existing techniques; (ii) identify their strengths and weaknesses in terms of domain independence, language independence, requirement of semantic knowledge, corpus and training data, and ability to identify semantic meaning, word order similarity and polysemy; and (iii) identify the semantic knowledge and corpus resources that can be utilized by STS measurement methods. Our study also considers the differences between the various text similarity methods and the differences between semantic knowledge sources and corpora for text similarity. Although there are a few review papers in this area, most focus on only one or two existing techniques, and they do not cover recent research. To the best of our knowledge, this is a comprehensive systematic literature review on this topic. The findings of this research can be summarized as follows. The review identifies four semantic knowledge resources and eight corpus resources as external resources, which can be classified as general-purpose or domain-specific. The existing techniques can be classified into string-based, corpus-based, knowledge-based and hybrid-based. Expert researchers can use this review as a benchmark and as a reference to the limitations of current techniques. The paper also identifies open issues that can be considered feasible opportunities for future research.
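As a concrete illustration of the string-based category named above, the following minimal sketch computes two classic set-overlap coefficients (Jaccard and Dice) over word tokens. It is only an assumed example of what a string-based measure can look like, not the specific formulation reviewed in the paper; the function names and the whitespace tokenizer are hypothetical choices made for illustration.

```python
def tokens(text: str) -> set:
    """Lower-case and split on whitespace (an illustrative tokenizer, not the paper's)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: size of the token intersection divided by size of the union."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(ta & tb) / len(ta | tb)


def dice(a: str, b: str) -> float:
    """Dice coefficient: twice the shared tokens divided by the total token count."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # same convention as above
    return 2 * len(ta & tb) / (len(ta) + len(tb))
```

Measures of this kind need no external corpus or knowledge base, but they see only surface overlap: two sentences that share no words score zero even when they are paraphrases, which is the sort of limitation that corpus-based and knowledge-based methods try to address.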

Author information

Corresponding author

Correspondence to Asad Abdi.

Ethics declarations

Conflict of interest

On behalf of all co-authors, I declare that all authors agreed to submit the article exclusively to this journal and that there is no conflict of interest regarding the publication of this article.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Prakoso, D.W., Abdi, A. & Amrit, C. Short text similarity measurement methods: a review. Soft Comput 25, 4699–4723 (2021). https://doi.org/10.1007/s00500-020-05479-2

  • DOI: https://doi.org/10.1007/s00500-020-05479-2
