Artificial Intelligence Review

Volume 42, Issue 4, pp 935–943

An overview of textual semantic similarity measures based on web intelligence

  • Jorge Martinez-Gil


Computing the semantic similarity between terms (or short text expressions) that have the same meaning but are not lexicographically similar is a key challenge in many computer-related fields. Traditional approaches to semantic similarity measurement are not suitable for all situations: many of them fail to deal with terms not covered by synonym dictionaries, or cannot cope with acronyms, abbreviations, buzzwords, brand names, proper nouns, and so on. In this paper, we present and evaluate a collection of emerging techniques developed to avoid this problem. These techniques use some form of web intelligence to determine the degree of similarity between text expressions, and they implement a variety of paradigms, including the study of co-occurrence, text snippet comparison, frequent pattern finding, and search log analysis. The goal is to replace the traditional techniques where necessary.
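As a rough illustration of the co-occurrence paradigm mentioned above, the following sketch computes the Normalized Google Distance of Cilibrasi and Vitányi from search-engine hit counts. It is only one example of a co-occurrence measure and is not claimed to be the exact formulation evaluated in this paper. The function name and its parameters are assumptions made for illustration; the hit counts would have to come from a search engine, which is not queried here.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Normalized Google Distance computed from hit counts.

    fx, fy -- number of pages containing term x and term y respectively
    fxy    -- number of pages containing both terms
    n      -- assumed total number of indexed pages (must exceed fx and fy)

    Lower values suggest more closely related terms; 0 means the two
    terms always occur together.
    """
    # Without any co-occurrence the distance is treated as infinite.
    if fx <= 0 or fy <= 0 or fxy <= 0:
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))
```

A call such as `normalized_google_distance(1000, 100, 10, 100000)` uses made-up placeholder counts, not real search results. A distance of this kind can be turned into a similarity score in several ways, and the choice of transformation is left open here.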


Keywords: Similarity measures, Web intelligence, Web search engines, Information integration





Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  1. Department of Computer Science, University of Extremadura, Caceres, Spain
