Information Systems Frontiers, Volume 15, Issue 3, pp 399–410

Semantic similarity measurement using historical Google search patterns

  • Jorge Martinez-Gil
  • José F. Aldana-Montes


Computing the semantic similarity between terms (or short text expressions) that have the same meaning but are not lexicographically similar is an important challenge in the information integration field. Techniques for textual semantic similarity measurement often fail to deal with words not covered by synonym dictionaries. In this paper, we address this problem by determining the semantic similarity of terms using the knowledge inherent in the search history logs of the Google search engine. To this end, we have designed and evaluated four algorithmic methods for measuring the semantic similarity between terms using their associated historical search patterns: a) frequent co-occurrence of terms in search patterns, b) computation of the relationship between search patterns, c) outlier coincidence on search patterns, and d) forecasting comparisons. We show experimentally that some of these methods correlate well with human judgment on general-purpose benchmark datasets, and significantly outperform existing methods on datasets containing terms that do not usually appear in dictionaries.
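As a rough illustration of method b) only, the relationship between two search patterns could be scored with a correlation coefficient over their search-volume time series. The abstract does not specify the formula used in the paper, so the choice of the Pearson coefficient, the clamping of negative values to zero, the function names, and the sample data below are all assumptions made for this sketch.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no pattern to compare
    return cov / sqrt(var_x * var_y)


def pattern_similarity(xs, ys):
    """Map the correlation to [0, 1].

    Treating negative correlation as zero similarity is an assumption
    of this sketch, not something stated in the abstract.
    """
    return max(0.0, pearson(xs, ys))


# Hypothetical weekly search volumes, invented for illustration only.
term_a = [10, 12, 30, 11, 9, 28, 10]
term_b = [20, 25, 61, 22, 19, 55, 21]  # peaks in the same weeks as term_a
term_c = [30, 11, 9, 28, 10, 12, 31]   # peaks in different weeks

print(pattern_similarity(term_a, term_b))
print(pattern_similarity(term_a, term_c))
```

Under these assumptions, two terms whose search interest rises and falls in the same weeks score close to 1, while terms with unrelated patterns score close to 0.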


Information integration, Web Intelligence, Semantic similarity



We would like to thank the reviewers for their time and consideration. We thank Lisa Huckfield for proofreading this manuscript. This work has been funded by the Spanish Ministry of Innovation and Science through the project REALIDAD: Efficient Analysis, Management and Exploitation of Linked Data (Project Code: TIN2011-25840), and by the Department of Innovation, Enterprise and Science of the Regional Government of Andalucia through the project Towards a platform for exploiting and analyzing biological linked data (Project Code: P11-TIC-7529).



Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Department of Computer Science, University of Malaga, Malaga, Spain
