Journal of Intelligent Information Systems, Volume 35, Issue 3, pp 383–413

Ontology-driven web-based semantic similarity

  • David Sánchez
  • Montserrat Batet
  • Aida Valls
  • Karina Gibert


Estimating the degree of semantic similarity (or distance) between concepts is a common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. Many similarity measures have been proposed in the past, exploiting either explicit knowledge, such as the structure of a taxonomy, or implicit knowledge, such as information distribution. In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, the frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, to compute accurate concept appearance probabilities they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities. This limits the applicability of those measures to other ontologies (such as specific domain ontologies) and to massive corpora (such as the Web). In this paper, several of these issues are analyzed, and modifications of classical similarity measures are proposed. They are based on a contextualized and scalable version of IC computation on the Web that exploits taxonomical knowledge. The goal is to remove the measures’ dependency on corpus pre-processing while still achieving reliable results and minimizing language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
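To make the notion of IC-based similarity concrete, the following is a minimal sketch of the classical measures the abstract refers to (Resnik, Lin, and Jiang and Conrath, all listed in the references), not the modified measures proposed in the paper. It assumes that the probability of a concept is estimated as its occurrence count divided by a corpus size. The taxonomy, the counts, and every function name are hypothetical and chosen only for illustration; a real system would obtain counts from a tagged corpus or from a web search engine.

```python
import math

# Hypothetical occurrence counts, invented purely for illustration.
TOTAL_DOCS = 1_000_000
COUNTS = {
    "entity": 900_000,
    "animal": 120_000,
    "dog": 30_000,
    "cat": 25_000,
}

# Toy is-a taxonomy (child -> parent); also an assumption.
PARENT = {"dog": "animal", "cat": "animal", "animal": "entity"}


def information_content(concept):
    """IC(c) = -log2 p(c), with p(c) estimated as count(c) / total."""
    return -math.log2(COUNTS[concept] / TOTAL_DOCS)


def ancestors(concept):
    """Return the concept followed by all of its subsumers, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def least_common_subsumer(a, b):
    """Most specific concept that subsumes both a and b in the toy taxonomy."""
    b_ancestors = set(ancestors(b))
    for candidate in ancestors(a):
        if candidate in b_ancestors:
            return candidate
    raise ValueError("no common subsumer")


def resnik(a, b):
    """Resnik similarity: IC of the least common subsumer."""
    return information_content(least_common_subsumer(a, b))


def lin(a, b):
    """Lin similarity: shared IC relative to the IC of both concepts."""
    return 2 * resnik(a, b) / (information_content(a) + information_content(b))


def jiang_conrath_distance(a, b):
    """Jiang and Conrath distance: IC not shared by the two concepts."""
    return information_content(a) + information_content(b) - 2 * resnik(a, b)
```

In this sketch the counts are attached directly to concepts. With raw term counts from an untagged corpus, a polysemous term would inflate or distort p(c), which is the ambiguity problem the abstract describes.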


Semantic similarity, Ontologies, Information content, Web, Knowledge discovery



This research has been partially supported by the Spanish Government within projects ARES (CONSOLIDER-INGENIO 2010 CSD2007-00004) and E-AEGIS (TSI2007-65406-C03-02). The work is partially supported by the Universitat Rovira i Virgili (2009AIRE-04) and the DAMASK project (Data mining algorithms with semantic knowledge, TIN2009-11005). Montserrat Batet is also supported by a research grant provided by the Universitat Rovira i Virgili.



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • David Sánchez (1)
  • Montserrat Batet (1)
  • Aida Valls (1)
  • Karina Gibert (2)

  1. Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Tarragona, Spain
  2. Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Spain
