Short text similarity measurement methods: a review

Prakoso, Dimas Wibisono; Abdi, Asad; Amrit, Chintan

doi:10.1007/s00500-020-05479-2

Methodologies and Application
Published: 03 January 2021

Volume 25, pages 4699–4723, (2021)

Soft Computing

Abstract

Short text similarity measurement methods play an important role in many natural language processing applications. This paper reviews the research literature on short text similarity (STS) measurement methods with the aim to (i) classify and give a broad overview of existing techniques; (ii) identify their strengths and weaknesses in terms of domain independence, language independence, requirement of semantic knowledge, corpus and training data, ability to identify semantic meaning, word order similarity and polysemy; and (iii) identify semantic knowledge and corpus resources that can be utilized by STS measurement methods. Our study also considers issues such as the differences between the various text similarity methods and the differences between semantic knowledge sources and corpora for text similarity. Although there are a few review papers in this area, they mostly focus on only one or two existing techniques, and they do not cover recent research. To the best of our knowledge, this is a comprehensive systematic literature review on this topic. The findings are as follows. The review identified four semantic knowledge resources and eight corpus resources as external resources, which can be classified into general-purpose and domain-specific. The existing techniques can be classified into string-based, corpus-based, knowledge-based and hybrid-based. Expert researchers can use this review as a benchmark and as a reference to the limitations of current techniques. The paper also identifies open issues that can be considered feasible opportunities for future research.

Author information

Authors and Affiliations

Department of Industrial Engineering and Business Information Systems, University of Twente, Enschede, The Netherlands
Dimas Wibisono Prakoso & Asad Abdi
Department of Operations Management, Amsterdam Business School, University of Amsterdam, Amsterdam, The Netherlands
Chintan Amrit

Corresponding author

Correspondence to Asad Abdi.

Ethics declarations

Conflict of interest

I hereby and on behalf of the co-authors declare all the authors agreed to submit the article exclusively to this journal and also declare that there is no conflict of interests regarding the publication of this article.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Prakoso, D.W., Abdi, A. & Amrit, C. Short text similarity measurement methods: a review. Soft Comput 25, 4699–4723 (2021). https://doi.org/10.1007/s00500-020-05479-2

Issue Date: March 2021
DOI: https://doi.org/10.1007/s00500-020-05479-2
